
Dat: non-profit, secure, and distributed package manager for data - tambourine_man
https://datproject.org/
======
pfraze
Dat is both a protocol and a toolset. The protocol is basically an improved
version of BitTorrent that supports changes and versioning (here's a short
page on its technical merits [1]). The toolset is a command-line tool and some
desktop apps, and we use it in the Beaker Browser.

Dat couldn't be made by nicer people [2]. It's a non-profit led by Max Ogden
[3] and the protocol work is led by Mafintosh [4], both of whom are pretty
well known in the nodejs community. The project started as a way for academics
to work on tabular datasets, but they pretty quickly found out that academics
more often want to work on unstructured (or custom-structured) files. So Max
recruited Mafintosh and they started working on the p2p protocol, which
focused on improving archival and data-sharing flows within labs.

There's a pretty simple CLI you can install from NPM (npm i -g dat). Give it a
try. Also, they need help securing grants, so if you have a talent or a
connection in that area, definitely get in touch with them. They're doing good
work.

1\. [https://beakerbrowser.com/docs/inside-beaker/](https://beakerbrowser.com/docs/inside-beaker/)

2\. [https://datproject.org/team](https://datproject.org/team)

3\. [https://twitter.com/denormalize](https://twitter.com/denormalize)

4\. [https://twitter.com/mafintosh](https://twitter.com/mafintosh)

~~~
SOLAR_FIELDS
I’ve contributed to a couple of Mafintosh’s projects. The dude is impressively
prolific and writes good code. Nice to see him get some recognition here.

------
filiwickers
Hi everyone, I'm one of the core contributors to Dat, @joeahand. Happy to
answer any questions. It's an interesting time to see this posted because I've
been working on a new datproject.org site recently =).

Dat Project started with a focus on increasing access to research & public
data. To support the data tools we built a peer-to-peer protocol. People are
doing some really cool stuff on top of Dat (such as Beaker Browser), we're
really excited about it and want to make sure to support all the neat use
cases.

We'll be launching an updated site soon to highlight more of the work around
the protocol and what the community is building. Our main use case will still
be data management but most of what you see on the current site will shift to
a new domain.

~~~
baldfat
Care to explain how this is different than Resilio?
[https://www.resilio.com/](https://www.resilio.com/)

I use this with encryption for my data folders on my projects.

~~~
filiwickers
Ya, there are a few other related questions below. Resilio is BitTorrent-
based. But I'm not 100% familiar with how Resilio differs from BitTorrent.

The core difference is in our approach. We're all open source and a non-
profit. We're also really focused on the research data use case where
BitTorrent is less easily deployed.

We hope that making an open and easy to use p2p protocol will enable other
developers to build applications on top, and something like Resilio could be
one.

------
zitterbewegung
So how does this compare with quilt? From what I see

1\. Quilt is for profit while Dat is non-profit.

2\. Dat has ~20 datasets that are public; Quilt has 50+ that are public

3\. Dat is on a shared network while Quilt is hosted on a centralized server

4\. Both of them offer version control and hosting. Quilt has private hosting
for a fee. Dat seems to have only public hosting

5\. Quilt is funded by YC. Dat is funded by non-profits.

6\. Quilt has a Python interface while Dat has one in Javascript

I understand who Quilt is targeting but I'm having trouble understanding who
Dat is targeting?

~~~
tbv
I'm one of the creators of the Beaker browser[1] and the reason we use Dat is
that as a p2p protocol, it offers a lot of neat properties, including making
datasets more resilient. As long as one peer on the network is hosting a
dataset, it will be reachable, even if the original author has stopped hosting
it.

I won't speak authoritatively on behalf of the Dat team, but I believe one of
their goals is to make it difficult for public scientific datasets to be lost,
and data living on a centralized server is particularly vulnerable to that.

1\.
[https://github.com/beakerbrowser/beaker](https://github.com/beakerbrowser/beaker)

~~~
rspeer
The use case really speaks to me, but I'm not convinced that decentralization
is going to help datasets not to get lost.

I spent a while trying to download recent updates to the Reddit comment corpus
[1], which is hosted on BitTorrent. The downloads never seem to finish.

It seems to me that decentralization means that, when a dataset stops being
new and exciting, it will disappear. How will Dat counter this?

[1]
[https://www.reddit.com/r/datasets/comments/65o7py/updated_re...](https://www.reddit.com/r/datasets/comments/65o7py/updated_reddit_comment_dataset_as_torrents/)

~~~
yoshuaw
Because Dat is just a protocol, decentralization is a choice. For quick,
ephemeral exchanges direct P2P works brilliantly. For longer lived data sets,
sharing it with a (commercial) mirror might make sense. Or perhaps you host it
yourself. The beauty is that you, as a user of the protocol, get to decide
what works best for you.

------
e12e
The p2p and security aspects look very interesting[1] - apart from the obvious
network effect/existing user base and tooling - are there any reasons to not
prefer dat to bittorrent for all the things?

It looks like an interesting way to store/share backups and server images -
easily scaling bandwidth and availability with the need to spin up new
instances, or shifting across data centers?

Maybe also as an apt back-end similar to:

[https://wiki.debian.org/DebTorrent](https://wiki.debian.org/DebTorrent)

[1]
[https://docs.datproject.org/security](https://docs.datproject.org/security)

~~~
tbv
To draw a quick contrast between Dat and BitTorrent, BitTorrent magnet links
are static, meaning if you change the content, you get an entirely new magnet
link. Dat archives (a networked directory, essentially) are mutable, so you
can publish modifications at a consistent address.
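
A rough way to picture that contrast, as a toy sketch only: a static link is derived from the content, so it changes with every edit, while a mutable archive's address is derived from a key and stays put as versions are appended. The names `content_address` and `MutableArchive` are made up for illustration, SHA-256 is an arbitrary choice, and the signing that a real protocol would need is left out entirely.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Static, content-derived identifier: any edit produces a new address."""
    return hashlib.sha256(data).hexdigest()


class MutableArchive:
    """Toy mutable archive: the address comes from a key, not from the content.

    Hypothetical illustration only. No signatures, no networking.
    """

    def __init__(self, public_key: bytes):
        self.address = hashlib.sha256(public_key).hexdigest()
        self.versions = []  # append-only history of snapshots

    def publish(self, data: bytes) -> int:
        """Append a new snapshot and return its version number (starting at 1)."""
        self.versions.append(data)
        return len(self.versions)

    def read(self, version=None) -> bytes:
        """Return the latest snapshot, or a specific earlier version."""
        if version is None:
            version = len(self.versions)
        return self.versions[version - 1]


# Content-derived links diverge as soon as the data changes.
static_v1 = content_address(b"results, first draft")
static_v2 = content_address(b"results, corrected")

# The key-derived address is the same before and after an update.
archive = MutableArchive(b"hypothetical-public-key")
address_before = archive.address
archive.publish(b"results, first draft")
archive.publish(b"results, corrected")
address_after = archive.address
```

Readers holding `archive.address` can keep asking for the latest version, and older versions remain readable by number.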

I’m not familiar with DebTorrent, but if you’re interested to learn more about
the innards of Dat, this post by pfraze is a good place to start:

[https://beakerbrowser.com/2017/06/19/cryptographically-secur...](https://beakerbrowser.com/2017/06/19/cryptographically-secure-change-feeds.html)

------
noobermin
I can think of one use for this. I run computational plasma physics
simulations for a living, and while some of us have gotten better about
sharing code, sharing simulation results for published papers would be
beneficial. Will have to think about this.

~~~
pfraze
You should jump on #dat or get in touch with someone on the team via twitter.
They'd be happy to help you get set up.

------
sandGorgon
This is very cool, but I think you need to go down to fine-grained permissions
to be truly effective. Otherwise, it is pretty much BitTorrent with a
private tracker.

For example, I should be able to give unique URLs (for the same data) to
different users and expire one but continue for the other, etc.

~~~
watson
One other crucial difference between dat and bittorrent is that dat allows you
to update datasets. In bittorrent, you can't change/add files once you've
shared your torrent.

~~~
sandGorgon
The DHT mutable data proposal, BEP 46, aims to take care of that.

[http://www.libtorrent.org/dht_store.html](http://www.libtorrent.org/dht_store.html)

previous HN discussion -
[https://news.ycombinator.com/item?id=12257065](https://news.ycombinator.com/item?id=12257065)

~~~
rakoo
No, BEP46 only allows you to mutate DHT items, not torrents. By construction,
torrents are immutable: if you change one bit, or the name of a file, then
the torrent identity changes.

What you need to mimic dat is a more integrated way to tell other peers that
the torrent changed and to check the new one... which is not there yet.

~~~
sandGorgon
I'm not an expert at this, but this is what the RFC says:

[http://www.bittorrent.org/beps/bep_0046.html](http://www.bittorrent.org/beps/bep_0046.html)

> _The intention is to allow publishers to serve content that might change
> over time in a more decentralized fashion. Consumers interested in the
> publisher's content only need to know their public key + optional salt. For
> instance, entities like Archive.org could publish their database dumps, and
> benefit from not having to maintain a central HTTP feed server to notify
> consumers about updates._

You are technically right that the torrent file is immutable, but basically
this lets clients know that the torrent is updated using the DHT data. The
outcome is the same.
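
To make the disagreement concrete, here is a minimal in-memory sketch of the pointer idea, not an implementation of the BEP: each torrent keeps its own immutable identity, and a mutable item stored under a key derived from the publisher's public key (plus optional salt) points at whichever torrent is current. `ToyDHT`, `toy_infohash` and the key bytes are hypothetical names, signatures and networking are omitted, and hashing the raw content stands in for the real infohash computation.

```python
import hashlib


def toy_infohash(content: bytes) -> str:
    """Stand-in for a torrent identity: changes whenever the content changes."""
    return hashlib.sha1(content).hexdigest()


class ToyDHT:
    """In-memory stand-in for a DHT holding mutable items (illustration only)."""

    def __init__(self):
        self.items = {}  # target -> (seq, value)

    @staticmethod
    def target(public_key: bytes, salt: bytes = b"") -> str:
        # The lookup key depends on the publisher's key and salt, never on content.
        return hashlib.sha1(public_key + salt).hexdigest()

    def put(self, public_key: bytes, seq: int, infohash: str, salt: bytes = b"") -> bool:
        key = self.target(public_key, salt)
        current = self.items.get(key)
        if current is not None and seq <= current[0]:
            return False  # reject stale or replayed updates
        self.items[key] = (seq, {"ih": infohash})
        return True

    def get(self, public_key: bytes, salt: bytes = b""):
        return self.items.get(self.target(public_key, salt))


dht = ToyDHT()
publisher = b"hypothetical-public-key"

first = toy_infohash(b"database dump, July")
second = toy_infohash(b"database dump, August")

dht.put(publisher, 1, first)
dht.put(publisher, 2, second)

# A consumer who only knows the publisher's key finds the newest torrent.
seq, value = dht.get(publisher)
stale_accepted = dht.put(publisher, 1, first)
```

Both comments hold in this picture: the two infohashes differ, so the torrents themselves never mutate, yet a consumer polling the same target learns about the update.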

------
rambojazz
It looks cool, but I don't completely understand what this is about. Is it a
way to share files over a P2P network? Isn't this basically the same as
Torrent or IPFS?

~~~
watson
It's very similar to bittorrent but with a few key differences. For one,
bittorrent doesn't allow you to update/add files in a dataset once you've
shared the torrent. Dat does. Dat also has versioning built in.

IPFS seems to me to be a bit over-engineered whereas dat is a lot more
simple/low level - something that suits my way of working really well.

~~~
rambojazz
> IPFS seems to me to be a bit over-engineered whereas dat is a lot more
> simple/low level

I also love simpler things. I'd be curious to understand where IPFS is
over-engineered and where Dat is better instead.

~~~
watson
I probably didn't give IPFS enough credit. What I should have said was that
IPFS does way more than I need. Dat seems to hit the sweet spot for me.

------
baldfat
This seems like my setup but I use Resilio (Bittorrent Sync) with encryption.
I don't need version control for my data but you get archives if you want one.

------
noobermin
I recall at least two other "data package managers" out there being shared on
HN a couple of months back.

~~~
FabioFleitas
You happen to remember which ones? Would be interested in checking them out.

~~~
noobermin
I was thinking it would take a while to look back, but the new "upvoted
submissions" tool is more useful than I thought. The project was Quilt[0] and
here[1] are the comments on it. The other project I saw referenced in the
comments was Pachyderm[2]. It looks like Dat indeed was referenced in the
comments, and the main difference is Dat is distributed while Quilt is
centralized (like github)...and Pachyderm is more like git.

[0] [https://quiltdata.com/](https://quiltdata.com/)

[1]
[https://news.ycombinator.com/item?id=14771406](https://news.ycombinator.com/item?id=14771406)

[2] [http://www.pachyderm.io/](http://www.pachyderm.io/)

EDIT: Annnd...looking up, I see others have already referenced quilt in other
comments here.

------
erikb
Naive me's first thought would be: Why not use git? Can someone with more
experience in this area explain?

~~~
jhoechtl
The git toolset is, out of the box, weak on tabular data. Also out of the box
large file support requires one central repository.

None of that is true for Dat.

~~~
icebraining
_out of the box large file support requires one central repository._

How so? As far as I know, git doesn't treat large files any differently than
small ones.

~~~
watson
I'll try to answer your question, but it's been a while since I last looked at
it, so I might get some/most of this wrong - so don't shoot me:

As far as I know git isn't good at storing binary data. Git depends on line
breaks to be able to diff and make change sets. If you store a binary file in
git and make an update to it - even though that update only changed 1 byte,
the entire new version of the file is stored again. Dat uses Rabin
fingerprinting to intelligently slice binary files into chunks that are less
likely to change. That makes dat a lot more efficient at storing, versioning,
and syncing videos, images, and other large binary files.
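
For anyone curious what content-defined chunking looks like, here is a minimal sketch. Big assumptions: it uses a plain polynomial rolling hash rather than a true Rabin fingerprint, the window and mask sizes are arbitrary, and none of it is taken from Dat's code. The point is only that chunk boundaries depend on nearby bytes, so a small edit disturbs a few chunks instead of the whole file.

```python
import hashlib
import random

WINDOW = 16                        # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)   # weight of the oldest byte in the window
MASK = (1 << 6) - 1                # cut when the low 6 bits are zero


def split_chunks(data: bytes) -> list[bytes]:
    """Cut data wherever the hash of the last WINDOW bytes matches the mask."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


def chunk_ids(data: bytes) -> set[str]:
    """Identify each chunk by its SHA-256 digest."""
    return {hashlib.sha256(c).hexdigest() for c in split_chunks(data)}


# A pseudo-random "binary file" and a copy with one byte inserted in the middle.
rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20000))
edited = original[:5000] + b"X" + original[5000:]

old_ids = chunk_ids(original)
new_ids = chunk_ids(edited)
changed = new_ids - old_ids
print(f"{len(new_ids)} chunks after the edit, {len(changed)} of them new")
```

With fixed-size blocks, that one inserted byte would shift every later block and change all of their hashes. Here only the chunks touching the edit need to be stored or sent again.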

~~~
icebraining
Conceptually, git doesn't use change-sets, each commit is a snapshot of all
the files in the current version. For storage and transmission efficiency,
though, it can store and send them in packfiles, which use delta compression
based on LibXDiff - which uses Rabin's fingerprint algorithm (as well as another
algorithm by Joshua P. MacDonald) for binary files.

------
subcosmos
Can the fetch command be "get-dat"?...

Please?!

~~~
Fnoord
You could easily make an alias.

------
spraak
There was a startup I came across recently doing about the same thing... I
don't recall the name.

~~~
Fnoord
Do you mean Quilt Data, Inc with the product Quilt? It's being mentioned all
over this thread.

------
colordrops
Dat being short for data, would a package for the state assemblies be "Dat
Ass"?

~~~
squaredpants
Astute.

