
Dat – Distributed Dataset Synchronization and Versioning - ColinWright
https://github.com/datproject/docs/blob/master/papers/dat-paper.md
======
pfraze
We use Dat in Beaker[1] to host sites and files from the user's device. It's a
pretty interesting protocol. It's developed by Code for Science[2], a
501(c)(3) led by Max Ogden[3] and with protocol dev led by Mafintosh[4]; their
mission is to help with archival of science and civic data.

Some interesting properties:

1\. It uses a BitTorrent-style swarm, but primarily to sync signed
append-only logs, which are in fact flattened Merkle trees (similar to
Certificate Transparency). The Dat archives are addressed by public keys. The
tree is used to enforce the append-only constraint by making it easy to detect
if the history has been changed by the author.

2\. The "Secret Sharing" feature. The public key of a Dat archive is hashed
before querying or announcing on the discovery network, and then the traffic
is encrypted using the public key as a symmetric key. This has the effect of
hiding the content from the network, and thus making the public key of a Dat a
"read capability": you have to know the key to access its files.
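
To make the two properties above concrete, here is a minimal Python sketch. It is not Dat's actual wire format: the binary tree in `merkle_root` is a simplification of the flattened tree the paper describes, signatures are left out, and the `discovery_key` derivation (a keyed BLAKE2b hash over the constant `hypercore`) is an assumption based on the reference implementation. All function names are hypothetical.

```python
import hashlib
from typing import List


def leaf_hash(data: bytes) -> bytes:
    # Prefix leaves and parents differently so one can't be passed off as the other.
    return hashlib.blake2b(b"\x00" + data, digest_size=32).digest()


def parent_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def merkle_root(entries: List[bytes]) -> bytes:
    """Root hash over all log entries (simplified binary Merkle tree)."""
    if not entries:
        return hashlib.blake2b(b"", digest_size=32).digest()
    level = [leaf_hash(e) for e in entries]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(parent_hash(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd node is carried up unchanged
        level = nxt
    return level[0]


def discovery_key(public_key: bytes) -> bytes:
    """Hash announced on the discovery network instead of the key itself.

    Assumption: keyed BLAKE2b over the constant b"hypercore".
    """
    return hashlib.blake2b(b"hypercore", key=public_key, digest_size=32).digest()


if __name__ == "__main__":
    log = [b"v1: add data.csv", b"v2: fix header"]
    root_at_2 = merkle_root(log)  # the author would sign this root

    log.append(b"v3: add readme")
    # Appending is fine: the first two entries still hash to the old root.
    print("append keeps history:", merkle_root(log[:2]) == root_at_2)

    # Rewriting an old entry changes the root a peer already holds.
    tampered = [b"v1: add other.csv"] + log[1:]
    print("rewrite detected:", merkle_root(tampered[:2]) != root_at_2)

    fake_public_key = bytes(32)  # placeholder, not a real Dat key
    print("announce:", discovery_key(fake_public_key).hex())
```

A peer that has seen a signed root for the first N entries can recompute it from any later copy of the log; a mismatch means history was rewritten. Only the hash from `discovery_key` is announced, so someone watching the network can't derive the public key from it.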

There's a reference implementation in JS available at
[https://github.com/datproject/dat-node](https://github.com/datproject/dat-node),
and a fair number of tools being built around it.

1 [https://beakerbrowser.com/](https://beakerbrowser.com/)

2 [https://datproject.org/](https://datproject.org/)

3 [https://twitter.com/denormalize](https://twitter.com/denormalize)

4 [https://twitter.com/mafintosh](https://twitter.com/mafintosh)

~~~
nerdponx
I wish Beaker had picked a different name. It collides with the Beaker
Notebook, a Jupyter alternative that unfortunately never seemed to gain
traction but had some really killer features that Jupyter has yet to pick up,
especially the ability to mix Python, R, and Julia cells in the same notebook.

~~~
pfraze
Yeah I regret that. We'll consider a rename at some point. Here's their
project: [http://beakernotebook.com/](http://beakernotebook.com/)

~~~
alexvoda
Will you keep the laboratory glassware theme?

Flask is already used ( [http://flask.pocoo.org/](http://flask.pocoo.org/) )

Maybe Retort? Funnel? Or maybe a proper name? Berzelius? Erlenmeyer would be
difficult for people to pronounce.

------
rspeer
As someone who creates open, medium-sized, reusable datasets, is Dat something
I should try? Is it too early? The linked page is very much about technical
details of the implementation and not about how one would typically use it.

I maintain ConceptNet [1], a multilingual knowledge graph. I do everything I
can to make its published results reproducible. The biggest hurdle for people
reproducing it has always been getting the data -- building it requires about
100 GB of raw data or 15 GB of computed data that can be imported into
PostgreSQL.

I once tried git-annex. It turned out not to be a good choice -- its tools
were flaky, its usage patterns confusing, it leaves a permanent record of your
mistakes in configuring data sources, and it was very hard to convince it to use
ordinary HTTP downloads instead of trying to get read-write access to S3
(which wouldn't work for anyone but me). Now I have weird branches and remotes
in my repositories, and weird data in my S3 buckets, that I can't get rid of
in case someone tries to use git-annex in a way I told them would work.

After that I just went with distributing the data with plain HTTP downloads
from S3. I wish I could do better than this. The only semblance of versioning
is putting the date in the URL, and also people in Asia tell me that the build
fails because their downloads from us-east-1 get interrupted. Oh, and if I
ever stop paying for S3, everything will break.

If I tried making data reproducible with Dat, would it be safe to promise
people that they could use Dat to get the data? Even if in the future I don't
like Dat anymore?

For instance, do I have to commit to hosting the data somewhere? If not, who
does? Does it disappear when people lose interest, like BitTorrent?

[1] [http://conceptnet.io](http://conceptnet.io)

~~~
nl
Thank you for your work on ConceptNet. It's the best public knowledge graph in
existence.

Just today I was using the multilingual ConceptNet Numberbatch word
vectors[1], which would not be possible without your work.

To your point though - you can use Amazon S3 as a seed for BitTorrent
downloads, which might help some and reduce what you pay. See [2]

[1] [https://github.com/commonsense/conceptnet-numberbatch](https://github.com/commonsense/conceptnet-numberbatch)

[2]
[http://docs.aws.amazon.com/AmazonS3/latest/dev/S3Torrent.htm...](http://docs.aws.amazon.com/AmazonS3/latest/dev/S3Torrent.html)

~~~
rspeer
Yep, ConceptNet Numberbatch is my work too, and it's been the most effective
way to show that knowledge graphs matter -- that there is more to know about
word relationships than you can get from distributional semantics ("word2vec")
alone.

~~~
nl
Oh really? Very nice... although I'm only using the aligned distributional
semantic nature of them.

I have some background in question answering over knowledge graphs, though, so
I'm familiar with their strengths.

~~~
rspeer
I'd be interested to hear about what you're doing with it.

In my company Luminoso's work, it's important in building domain-specific
models that can be used for topic detection, search, and classification.
Beyond that, I mostly use it for the basic demos -- word similarity, text
similarity, analogies, et cetera.

I believe based on its performance there that it should be a pure upgrade to
the kind of applications that use word2vec, but I'd like to know what
particular applications it's being used in besides my own.

------
nwmcsween
Wasn't dat originally going to be a part of ipfs or was it the browser? What
are the reasons for dat vs ipfs?

~~~
hobofan
In contrast to IPFS, Dat is more focused on what can be achieved right now, at
the cost of true decentralisation, and is more opinionated about its use case,
providing things like versioning.

~~~
tbv
What do you think it sacrifices? IPFS is more focused on static blob
addressing while Dat focuses on data sources, and in that sense, there’s a
single authority over a dataset. But I see that as a positive, since
mutability is pretty valuable.

TBH, I'm not sure how a site can work without mutability.

~~~
GhotiFish
A dynamic site can't work without mutability, and IPFS can't deal with
mutability, so IPFS comes with IPNS, which gives you a static reference to
content that might change.

~~~
vertex-four
Of course, the design of IPNS makes it impossible to prove that you've got the
latest version of a name's value, and makes it relatively easy to attack. I
don't know if Dat has the same issue, I haven't looked at it.

~~~
lgierth
It's not impossible by design, it's simply a feature that hasn't been
implemented so far. IPFS is by design pluggable on all layers and thus
theoretically capable of a ton of stuff.

~~~
vertex-four
So IPFS is, according to you, not defined as its protocol, but instead as its
API? So I could build an IPFS implementation that just grabs stuff from my
(centralised) web server and say I'm using IPFS? Either nonsense, or useless -
the goal here, I thought, was to build a global decentralised filesystem that
looks the same from everyone's perspective.

All DHTs suffer from this issue. It's just particularly likely that IPNS's use
of a DHT will lead to attacks. Making it not susceptible to this would require
a redesign of IPNS's protocol.

------
draw_down
Max Ogden and Mafintosh are incredibly talented and productive. Awe-inspiring
to see the things they make.

------
skybrian
It seems like Dat has some usability quirks that might take some getting used
to:

\- You can publish new versions to a URL until you somehow forget the private
key, and then it's fixed forever, so long as people hang onto copies.

\- There's nothing to prevent people from passing around a URL with a version
in it. So, although it looks like the author has some control, this is an
illusion; publishing is irrevocable and anything published could go viral.
(This is generally true of making copies, but it's the opposite of Snapchat.)

\- Suppose someone chooses to publish a private key? Is it a world-writable
URL? Hmm.

~~~
maxogden
Great analysis. We anticipate that in order to fix these three usability
issues around trust we will need to provide a centralized identity provider in
the future. This would also address privacy issues especially regarding
leaking what dats you are accessing. The design philosophy around Dat is to
start from the completely decentralized end of the spectrum but be flexible in
letting the application choose the tradeoffs as they move more towards
centralized components.

~~~
skybrian
Good to know.

It would be good to figure out key rotation for Dat URL providers since this
probably has to be built into the protocol.

Any thoughts on integrating with Keybase? I like Keybase's model where you
have device-specific keys. But this would probably make moving a Dat URL
provider to a different machine trickier.

This all assumes that well-known Dat URLs become an important thing to
preserve (they are published in papers, etc.) even though they are very
user-unfriendly, even more so than IP addresses.

A naming system on top of them would make key rotation a non-issue (rotate Dat
URLs instead) and you could completely replace or remove the history, sort of
like a git rebase. But would that lose other nice properties of the system?

I suppose irrevocability is something we deal with in git repos all the time.
Although you can do a rebase locally, once a commit is accepted by a popular
project, they're unlikely to let you remove it from history. The review
process makes it unlikely that any really embarrassing mistake would be
accepted, so this seems ok in practice.

------
fiatjaf
Nice. Are you sure they aren't going to change the protocol to something
totally new and with radically different features in maybe two months?

