Dat – Distributed Dataset Synchronization and Versioning (github.com/datproject)
229 points by ColinWright on May 25, 2017 | 39 comments

We use Dat in Beaker[1] to host sites and files from the user's device. It's a pretty interesting protocol. It's developed by Code for Science[2], a 501(c)(3) led by Max Ogden[3] and with protocol dev led by Mafintosh[4]; their mission is to help with archival of science and civic data.

Some interesting properties:

1. It uses a BitTorrent-style swarm, but primarily to sync signed append-only logs, which are in fact flattened Merkle trees (similar to Certificate Transparency). Dat archives are addressed by public keys. The tree is used to enforce the append-only constraint by making it easy to detect if the history has been changed by the author.

2. The "Secret Sharing" feature. The public key of a Dat archive is hashed before querying or announcing on the discovery network, and then the traffic is encrypted using the public key as a symmetric key. This has the effect of hiding the content from the network, and thus making the public key of a Dat a "read capability": you have to know the key to access its files.
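
As a rough illustration of the two properties above, here is a minimal Python sketch. It is not Dat's actual wire format: the tree layout is a simplified binary Merkle tree rather than Dat's flattened one, the leaf/parent prefixes are borrowed from Certificate Transparency, signing of the root is omitted (the standard library has no Ed25519), and the BLAKE2b keyed hash over the constant `hypercore` is an assumption about how the discovery key is derived. All function names are hypothetical.

```python
import hashlib
from typing import List


def leaf_hash(data: bytes) -> bytes:
    # Prefix 0x00 marks a leaf, so a leaf can never be confused with a parent.
    return hashlib.blake2b(b"\x00" + data, digest_size=32).digest()


def parent_hash(left: bytes, right: bytes) -> bytes:
    # Prefix 0x01 marks an interior node.
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def merkle_root(entries: List[bytes]) -> bytes:
    """Root hash over every entry of an append-only log (simplified layout)."""
    if not entries:
        return hashlib.blake2b(b"", digest_size=32).digest()
    level = [leaf_hash(e) for e in entries]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(parent_hash(level[i], level[i + 1]))
        if len(level) % 2:
            nxt.append(level[-1])  # odd node is promoted unchanged
        level = nxt
    return level[0]


def discovery_key(public_key: bytes) -> bytes:
    # Assumption: a keyed hash, so the public key itself is never announced.
    # Peers who already know the public key can recompute this value;
    # observers of the discovery network cannot invert it.
    return hashlib.blake2b(b"hypercore", key=public_key, digest_size=32).digest()
```

In this sketch the author would sign `merkle_root(log)` after each append. Any change to an earlier entry produces a different root for the same log length, which is what makes a rewritten history detectable, and only `discovery_key(public_key)` is ever announced on the network.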

There's a reference implementation in JS available at https://github.com/datproject/dat-node, and a fair number of tools being built around it.

[1] https://beakerbrowser.com/

[2] https://datproject.org/

[3] https://twitter.com/denormalize

[4] https://twitter.com/mafintosh

I wish Beaker had picked a different name. It collides with the Beaker Notebook, a Jupyter alternative that unfortunately never seemed to gain traction but had some really killer features that Jupyter has yet to pick up, especially the ability to mix Python, R, and Julia cells in the same notebook.

Yeah I regret that. We'll consider a rename at some point. Here's their project: http://beakernotebook.com/

Will you keep the laboratory glassware theme?

Flask is already used ( http://flask.pocoo.org/ )

Maybe Retort? Funnel? Or maybe a proper name? Berzelius? Erlenmeyer would be difficult for people to pronounce.

It also collides with the beaker framework

As someone who creates open, medium-sized, reusable datasets, is Dat something I should try? Is it too early? The linked page is very much about technical details of the implementation and not about how one would typically use it.

I maintain ConceptNet [1], a multilingual knowledge graph. I do everything I can to make its published results reproducible. The biggest hurdle for people reproducing it has always been getting the data -- building it requires about 100 GB of raw data or 15 GB of computed data that can be imported into PostgreSQL.

I once tried git-annex. It turned out not to be a good choice -- its tools were flaky, its usage patterns confusing, it leaves a permanent record of your mistakes in configuring data sources, and it was very hard to convince it to use ordinary HTTP downloads instead of trying to get read-write access to S3 (which wouldn't work for anyone but me). Now I have weird branches and remotes in my repositories, and weird data in my S3 buckets, that I can't get rid of in case someone tries to use git-annex in a way I told them would work.

After that I just went with distributing the data with plain HTTP downloads from S3. I wish I could do better than this. The only semblance of versioning is putting the date in the URL, and also people in Asia tell me that the build fails because their downloads from us-east-1 get interrupted. Oh, and if I ever stop paying for S3, everything will break.

If I tried making data reproducible with Dat, would it be safe to promise people that they could use Dat to get the data? Even if in the future I don't like Dat anymore?

For instance, do I have to commit to hosting the data somewhere? If not, who does? Does it disappear when people lose interest, like BitTorrent?

[1] http://conceptnet.io

I see you’re using JSON-LD in ConceptNet. If you start passing that data around using distributed systems, you'll inevitably want to start incorporating content-addressed links into the data. I recommend looking into IPLD as a data model for handling that: https://ipld.io. The spec is still open -- this would be a good time to give feedback and/or spell out your use cases in this space.

Thank you for your work on ConceptNet. It's the best public knowledge graph in existence.

Just today I was using the multilingual Conceptnet-numberbatch word vectors[1], which would not be possible without your work.

To your point though - you can use Amazon S3 as a seed for BitTorrent downloads, which might help some and reduce what you pay. See [2].

[1] https://github.com/commonsense/conceptnet-numberbatch

[2] http://docs.aws.amazon.com/AmazonS3/latest/dev/S3Torrent.htm...

Yep, ConceptNet Numberbatch is my work too, and it's been the most effective way to show that knowledge graphs matter -- that there is more to know about word relationships than you can get from distributional semantics ("word2vec") alone.

Oh really? Very nice... although I'm only using the aligned distributional semantic nature of them.

I have some background in question answering over knowledge graphs, though, so I'm familiar with their strengths.

I'd be interested to hear about what you're doing with it.

In my company Luminoso's work, it's important in building domain-specific models that can be used for topic detection, search, and classification. Beyond that, I use it for mostly the basic demos -- word similarity, text similarity, analogies, et cetera.

I believe based on its performance there that it should be a pure upgrade to the kind of applications that use word2vec, but I'd like to know what particular applications it's being used in besides my own.

ConceptNet looks really cool; I'm going to dig into that later tonight.

Dat's really similar to BitTorrent when it comes to availability; it doesn't do anything automatically to guarantee it. If you choose to use Dat, you'll need to ensure a peer exists, though public peer services will be available soon.

Dat's still young and you'll probably have to endure some hiccups, but if you do want to give it a try, PM me and I can help you get started.

Should I just try using the tools at https://datproject.org/ ?

It sounds like Dat is a ways off from being something I could use as an authoritative source of data, but I could include it as one way to get the ConceptNet data. If it succeeds, that could save on the S3 bill and maybe even distribute the data across continents better. (Heck, I'm sure a lot of the downloads are scripts I'm running, and I could be file-sharing the files to myself.)

And I guess a Dat URL shouldn't really be an authoritative entry point referred to in a paper, as it would just point to one version of the data with no context. Maybe Dat plus Zenodo could do the trick eventually.

We are releasing updated desktop & command line app on Tuesday with support for the protocol, as defined in the paper. I'd recommend checking back then (though some, such as the CLI, are already released on npm).

Your points are spot on and things we've been thinking a lot about. I also wouldn't feel comfortable putting a Dat link in a paper yet, but that is an eventual goal because of the persistence properties of dat compared to http urls.

> Maybe Dat plus Zenodo could do the trick eventually.

Yes, exactly!! Dat supports http publishing right now, so you can run `dat sync --http` on a server and it'll live publish files from the source to an http site.

We are also working on supporting http downloads. The idea is that you publish to Zenodo, including the SLEEP metadata; dat can then clone the files over http (including content verification) or via the peer network, i.e. `dat clone zenodo.org/record/439922`.

We are super excited for the http downloading because it'll allow dat to store files on any data repository with a good API and an http file backend. We've been talking with the Dataverse folks on how to accomplish this there and have an eye on others such as Zenodo.

> We are releasing updated desktop & command line app on Tuesday with support for the protocol, as defined in the paper.

Awesome, congratulations! :):)

For distributed/p2p software these are just amazing times to be alive :)

Hmmm, is the data set something which would fairly naturally fit in a series of SQLite databases?

100GB is way too large for the project I'm working on at the moment (dbhub.io), as even a bunch of people downloading something that large would nuke our sponsorship budget since we're just starting out (still pre-launch).

However, if we gain traction and become cash positive, data sets this size would be good to cater to. :)

I used to keep it in SQLite (much easier to distribute than PostgreSQL). It worked a lot better than many other options I tried. However, rebuilding the database from updated data would take more than a day, and some queries were too slow.

Switching to PostgreSQL sped things up, at the cost of requiring a separate database process, dealing with psql's weird access control, and adding an inconvenient step of loading the data using COPY commands.

Hmmm, sounds like the data itself would be feasible then. SQLite could be considered just a data transport container for this purpose. :)

What underlying ontology does it use?

Its own, I guess? ConceptNet is really not so much about having an upper ontology, it's about relations between natural language words and phrases. Its set of relations is effectively a superset of WordNet's.

Hmm, but for example, if I search for "harry potter", one of the things I get is "harry potter is defined as... " "boy who lives under the stair". What can I do with this? "boy", "lives", "under", and "stair", while present in ConceptNet, have no logical relation to anything else...

Wasn't Dat originally going to be a part of IPFS, or was that the browser? What are the reasons for Dat vs IPFS?

In contrast to IPFS, Dat is more focused on what can be achieved right now, at the cost of true decentralisation, and is more opinionated about its use case, providing things like versioning.

What do you think it sacrifices? IPFS is more focused on static blob addressing while Dat focuses on data sources, and in that sense, there’s a single authority over a dataset. But I see that as a positive, since mutability is pretty valuable.

TBH, I'm not sure how a site can work without mutability.

A dynamic site can't work without mutability. IPFS itself can't deal with mutability, so it comes with IPNS, which allows you to statically reference content that might change.

Yeah, IPFS + IPNS is effectively on par with Dat archive mutability, but having mutability built into Dat, together with its verifiable history log, makes it particularly well-suited to building peer-to-peer websites.

Of course, the design of IPNS makes it impossible to prove that you've got the latest version of a name's value, and makes it relatively easy to attack. I don't know if Dat has the same issue; I haven't looked at it.

It's not impossible by design, it's simply a feature that hasn't been implemented so far. IPFS is by design pluggable on all layers and thus theoretically capable of a ton of stuff.

So IPFS is, according to you, not defined as its protocol, but instead as its API? So I could build an IPFS implementation that just grabs stuff from my (centralised) web server and say I'm using IPFS? Either nonsense, or useless - the goal here, I thought, was to build a global decentralised filesystem that looks the same from everyone's perspective.

All DHTs suffer from this issue. It's just particularly likely that IPNS's use of a DHT will lead to attacks. Making it not susceptible to this would require a redesign of IPNS's protocol.

They're pretty similar. Beaker supported both once, which may be what you're thinking of. This is what guided our decision: https://beakerbrowser.com/docs/inside-beaker/other-technolog...

The two features I list above, the append-only histories and secret-sharing, are unique to Dat. And, for us, the URL spec was a big deal.

Actually there's been some movement regarding URLs :) You can check out the reasoning and plan here: [1]

IPFS still has the long-term goal of path addressing (NURI), we just hadn't yet completely figured out what the upgrade path should look like. The discussion linked above is turning into a spec and into actions, i.e. IPFS will be trying to get as much as possible of Electron's protocol API [2] into WebExtensions, [3] and make use of that in the browser addons. [4]

[1] https://github.com/ipfs/specs/pull/152#issuecomment-28462886...

[2] https://electron.atom.io/docs/api/protocol/

[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1271553

[4] https://github.com/ipfs/in-web-browsers

So from what I can tell, the reason y'all keep pushing for NURIs has to do with the focus on hash-addressing, right? You're trying to get away from the concept of domains, which does make sense: a content-addressed folder or blob can live independently of a domain, because it's immutable and self-verifying, so why have domains at all.

Here's why I think you're shooting yourself in the foot with the NURIs, though.

1) You're focusing on syntax. The "there is no domain" premise works just as well with your stage two of `ipfs://{hash}`, so dismantling domains doesn't really justify the NURI change. The path syntax, of `/ipfs/{hash}`, contains functionally the same information.

2) If you still need IPNS, then you still need a concept of domains, so the "there is no domain" premise isn't really accurate.
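
To make point 1 concrete, here is a small Python sketch showing that the two spellings carry the same information and can be converted mechanically in both directions. The function names and the placeholder hash `QmExampleHash` are hypothetical, and the sketch does no validation of the hash itself.

```python
def path_to_url(path: str) -> str:
    """Convert '/ipfs/<hash>/rest' into 'ipfs://<hash>/rest'."""
    parts = path.split("/", 3)  # ['', scheme, hash, optional rest]
    if len(parts) < 3 or parts[0] != "" or not parts[1] or not parts[2]:
        raise ValueError(f"not a /scheme/hash path: {path!r}")
    scheme, root = parts[1], parts[2]
    rest = "/" + parts[3] if len(parts) > 3 else ""
    return f"{scheme}://{root}{rest}"


def url_to_path(url: str) -> str:
    """Convert 'ipfs://<hash>/rest' into '/ipfs/<hash>/rest'."""
    scheme, sep, remainder = url.partition("://")
    if not sep or not scheme or not remainder:
        raise ValueError(f"not a scheme://hash URL: {url!r}")
    return f"/{scheme}/{remainder}"
```

For example, `path_to_url("/ipfs/QmExampleHash/docs/a.txt")` gives `ipfs://QmExampleHash/docs/a.txt`, and `url_to_path` maps it back unchanged. The same functions work for an `ipns` prefix, since nothing in them depends on the scheme name.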

There's the concept of nestability in NURIs that is supposed to increase protocol composition, but I think you're overgeneralizing, at the cost of breaking backwards compatibility. When we looked at using IPFS in Beaker, the NURI was a major problem for us, because we're limited to what Electron/Chrome provides. It's not just API surface either: there's a lot of code in Chromium that makes assumptions around the standard URL syntax. Are you really so sure that nestable references are worth the headache? Because you're gambling the entire IPFS project on it.

> Because you're gambling the entire IPFS project on it.

One of the reasons I like the IPFS project so much is that they produce a lot of good side products (multiformats, libp2p) that can help anyone build an IPFS-like system.

So I wouldn't see it as that bleak. Even if that choice of NURIs is a fatal flaw as you claim, it is still a very surface-level problem that could be fixed in a fork of it.

Max Ogden and Mafintosh are incredibly talented and productive. Awe-inspiring to see the things they make.

It seems like Dat has some usability quirks that might take some getting used to:

- You can publish new versions to a URL until you somehow forget the private key, and then it's fixed forever, so long as people hang onto copies.

- There's nothing to prevent people from passing around a URL with a version in it. So, although it looks like the author has some control, this is an illusion; publishing is irrevocable and anything published could go viral. (This is generally true of making copies, but it's the opposite of Snapchat.)

- Suppose someone chooses to publish a private key? Is it a world-writable URL? Hmm.

Great analysis. We anticipate that in order to fix these three usability issues around trust we will need to provide a centralized identity provider in the future. This would also address privacy issues especially regarding leaking what dats you are accessing. The design philosophy around Dat is to start from the end of the completely decentralized spectrum but be flexible in letting the application choose the tradeoffs as they move more towards centralized components.

Good to know.

It would be good to figure out key rotation for Dat URL providers since this probably has to be built into the protocol.

Any thoughts on integrating with keybase? I like keybase's model where you have device-specific keys. But this would probably make moving a Dat URL provider to a different machine trickier.

This all assumes that well-known Dat URLs become an important thing to preserve (they are published in papers, etc) even though they are very user-unfriendly, even more than IP addresses.

A naming system on top of them would make key rotation a non-issue (rotate Dat URLs instead) and you could completely replace or remove the history, sort of like a git rebase. But that loses other nice properties of the system?

I suppose irrevocability is something we deal with in git repos all the time. Although you can do a rebase locally, once a commit is accepted by a popular project, they're unlikely to let you remove it from history. The review process makes it unlikely that any really embarrassing mistake would be accepted, so this seems ok in practice.

All true, though if you leak the private key, what will happen is that (due to lack of strict consensus between the leaked-key users) conflicting updates will be published, causing a detectable split history. That's a corruption event.

At the moment that would result in each leaked-key user maintaining a different history, with different peers only downloading the updates from the leak-author they happen to receive data from first. But in the future what will happen, once we get to writing the software for it, is the corruption event will be detected and recorded by all possible peers, freezing the dat from receiving future updates.
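
A minimal Python sketch of that planned behaviour, under loud assumptions: the software is described above as not yet written, so the `Replica` class and `find_fork` function are hypothetical, signature checks are assumed to have already passed for every entry, and whole logs are compared directly where a real peer would compare signed tree hashes.

```python
from typing import List, Optional


def find_fork(local: List[bytes], remote: List[bytes]) -> Optional[int]:
    """Return the first index where two histories disagree, else None."""
    for i, (a, b) in enumerate(zip(local, remote)):
        if a != b:
            return i
    return None


class Replica:
    """One peer's copy of an append-only log (hypothetical sketch)."""

    def __init__(self) -> None:
        self.log: List[bytes] = []
        self.frozen = False

    def merge(self, remote: List[bytes]) -> None:
        if self.frozen:
            return  # a detected corruption event blocks all future updates
        if find_fork(self.log, remote) is not None:
            # Two validly signed but different entries at the same index:
            # record the split history and stop accepting updates.
            self.frozen = True
            return
        # No conflict: the remote log is a prefix or an extension of ours.
        self.log.extend(remote[len(self.log):])
```

A replica that has synced `[b"v1", b"v2"]` accepts `[b"v1", b"v2", b"v3"]` as a plain extension, but freezes on `[b"v1", b"other"]` because index 1 now has two competing values.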

Nice, but are you sure they aren't going to change the protocol to something totally new and with radically different features in maybe two months?
