
Academic Torrents: A distributed system for sharing enormous datasets - iamjeff
http://academictorrents.com/
======
WhitneyLand
API suggestions:

1). Don't use hard-coded values for types

>GET /apiv2/entries?cat=6 -- List entries that are datasets

>GET /apiv2/entries?cat=5 -- List entries that are papers

These could be written as: /apiv2/entries/datasets and /apiv2/entries/papers

2). You may not need path elements like entries, entry, collection, and
collection name. For example further simplification would leave

    
    
      /datasets
    
      /papers
    

3). Don't use capitals like this, switch to lowercase: /apiv2/entry/INFOHASH

4). Use HTTP verbs in a standard, semantic way. For example this

    
    
      POST /apiv2/collection -- create a collection
    
      POST /apiv2/collection/collection-name/update
    
      POST /apiv2/collection/collection-name/delete
    
      POST /apiv2/collection/collection-name/add
    
      POST /apiv2/collection/collection-name/remove
    

All of this could be collapsed into standard verb usage on a single resource,
which would in turn be more familiar to developers.
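The collapsed, verb-driven form might look like this minimal sketch (no framework; the route table, handler names, and in-memory store are illustrative assumptions, not part of the actual Academic Torrents API):

```python
import re

collections = {}  # collection name -> set of entry infohashes

def create_collection(body):
    collections[body["name"]] = set()
    return 201  # Created

def delete_collection(name):
    collections.pop(name, None)
    return 204  # No Content

def add_entry(name, infohash):
    collections[name].add(infohash)
    return 204

def remove_entry(name, infohash):
    collections[name].discard(infohash)
    return 204

# One route per resource; the HTTP verb carries the action,
# so no /update, /delete, /add, /remove path suffixes are needed.
ROUTES = [
    ("POST",   r"^/apiv2/collections$",                         create_collection),
    ("DELETE", r"^/apiv2/collections/([^/]+)$",                 delete_collection),
    ("PUT",    r"^/apiv2/collections/([^/]+)/entries/([^/]+)$", add_entry),
    ("DELETE", r"^/apiv2/collections/([^/]+)/entries/([^/]+)$", remove_entry),
]

def dispatch(method, path, body=None):
    """Tiny router standing in for a real web framework."""
    for verb, pattern, handler in ROUTES:
        match = re.match(pattern, path)
        if verb == method and match:
            args = list(match.groups())
            if body is not None:
                args.append(body)
            return handler(*args)
    return 404  # no matching route
```

A client then creates with POST, modifies membership with PUT/DELETE on the entry sub-resource, and deletes the collection itself with DELETE, all without action words in the URL.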

~~~
NeutronBoy
Question about your first point - doesn't it depend on the use case?

What would happen if I wanted to get a list of datasets and papers? (Maybe in
this case it's nonsensical, but it's a problem with some other APIs I've used,
and I've never figured out a good way to work around it.)

GET /apiv2/entries?cat=5&cat=6 vs 2 separate requests and client logic to
combine results?

~~~
WhitneyLand
1) Does all this depend on use case? Yes, heavily. Best practices are needed,
but final design choices are tailored.

2) I don't like duplicate query params; their behavior isn't well defined:
[http://stackoverflow.com/questions/1746507/authoritative-
pos...](http://stackoverflow.com/questions/1746507/authoritative-position-of-
duplicate-http-get-query-keys)

For returning multiple types at once there are a few common strategies.

For related types some APIs offer an "include" or "embed" approach like this:
[http://www.vinaysahni.com/best-practices-for-a-pragmatic-
res...](http://www.vinaysahni.com/best-practices-for-a-pragmatic-restful-
api#autoloading)

Another approach is to support a query syntax for items, where items is a
container record for multiple possible types.

If a certain multi-type scenario is especially common, you may want to build
the concept into the API itself as basic functionality.

Again, your final choice should be as simple as possible but no simpler,
taking into account ease of use, performance, etc.
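For instance, one well-defined alternative to duplicate query keys is a single comma-separated parameter, e.g. GET /apiv2/entries?types=datasets,papers. The parameter name and the set of valid types below are made-up assumptions for illustration:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical type names; the real API uses numeric cat= codes.
VALID_TYPES = {"datasets", "papers", "courses"}

def requested_types(url):
    """Parse a comma-separated ?types= filter; default to all types."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("types", [""])[0]
    types = {t for t in raw.split(",") if t}
    if not types:
        return VALID_TYPES  # no filter -> return everything
    unknown = types - VALID_TYPES
    if unknown:
        raise ValueError(f"unknown types: {sorted(unknown)}")
    return types
```

This keeps multi-type requests to a single round trip while avoiding the undefined semantics of repeated keys.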

------
tombert
I'm pretty glad that torrents have started to break out of the "it's only for
warez" stereotype. It's a useful technology, regardless of what made it
popular.

~~~
NelsonMinar
That happened around 2008. Or even earlier; Linux distros were being
distributed via Bittorrent back in 2006. [https://torrentfreak.com/popular-
linux-distro-torrents/](https://torrentfreak.com/popular-linux-distro-
torrents/)

~~~
gens
I remember a game called "GunZ" (the 2005 one, I think) internally using
torrents to propagate updates. But it was a long time ago.

~~~
dansze
A huge number of games used torrents for patching since around that time,
notably in the MMO scene. WoW and every Nexon game come to mind. AFAIK the
Battle.net launcher still downloads updates via torrent.

~~~
dsl
The Blizzard updater is actually a very cool download utility, worth
hacking/poking at.

AFAIK it pioneered the concept of "web seeds", using HTTP GETs with a Range:
header to download specific chunks from a CDN that were not healthy/available
in the swarm.
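The web-seed idea described above can be sketched roughly like this: map a piece index to a byte range, then GET just that range from an HTTP mirror. The piece size and function names are illustrative assumptions, not Blizzard's actual implementation:

```python
import urllib.request

PIECE_SIZE = 256 * 1024  # 256 KiB pieces, a common torrent piece size

def range_header(piece_index, piece_size=PIECE_SIZE):
    """Build an HTTP Range header covering one piece (inclusive bounds)."""
    start = piece_index * piece_size
    end = start + piece_size - 1
    return {"Range": f"bytes={start}-{end}"}

def fetch_piece(webseed_url, piece_index):
    """Download a single piece from an HTTP web seed.

    A compliant server answers 206 Partial Content with exactly
    the requested bytes, which can then be hash-checked like any
    piece received from the swarm.
    """
    req = urllib.request.Request(webseed_url, headers=range_header(piece_index))
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

A client would fall back to fetch_piece only for pieces that no peer in the swarm currently has.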

------
rakoo
If only WebTorrent
([https://github.com/feross/webtorrent](https://github.com/feross/webtorrent))
worked with standard bittorrent protocol instead of a custom one on top of
WebRTC, we would have live access to all the papers and "displayable" data
directly instead of firing up a torrent downloader just for a small file.

~~~
sktrdie
I don't think you understand how WebTorrent works. WebTorrent in fact works
with the regular BitTorrent network if you run it from node, and falls back to
use WebRTC when used in the browser.

So you can seed those torrents directly in the browser with something like
instant.io.

~~~
rakoo
I do know how WebTorrent works. The problem is that it creates a completely
parallel network of nodes, which on the surface happen to exchange the same
framing of messages, so:

\- When WebTorrent runs on the standard bittorrent network from node, that
doesn't change anything: it's still not available from the browser.

\- When WebTorrent runs on the WebRTC network through instant.io or anything
else, it will only work if somebody else is also seeding the same torrent _in
their browser_. Which they can only have in the browser if they first got it
somewhere else. Oh and I'm willing to bet that _none_ of the nodes who
currently have the content (ie on the bittorrent network) also share it on the
WebTorrent network.

I don't expect classic bittorrent peers to ever implement the mess that is
WebRTC just to accommodate browsers, unfortunately.

~~~
mtgx
I think the problem lies with the browser vendors, who aren't implementing the
bittorrent protocol in the browser.

If WebTorrent were to do that itself, it would have to become a "plugin"
rather than just an extension.

So start asking Mozilla/Google to implement the bittorrent protocol in the
browser (or even better, implement IPFS directly, as that's a more wholesome
technology specifically made for the browser).

~~~
the8472
> I think the problem lies with the browser vendors, who aren't implementing
> the bittorrent protocol in the browser.

Browser vendors shouldn't have to implement it. They should expose posix-like
APIs (BSD sockets, file IO) or process management + IPC via plain pipes (to talk
to a native bittorrent client) so it could be provided through an extension.

The problem with browsers is that they create a backwards-incompatible API
stack. This is understandable for web content. Not so for extensions.

------
danso
So, what are the rights and licenses for this data? I see that one of them is
Yelp photo data from a Kaggle contest [0]. Yelp distributes another Academic
data set, but you have to fill out a form and agree to their TOS [1]. So
they're OK with the data being available like this?

Another random datapoint: When EdX/Harvard released a dataset showing how
students performed/dropped out, I uploaded a copy to my S3 to mirror and
linked to it from HN. I got a polite email the next day asking for it to be
taken down. Academics are (rightfully, IMO) protective of their data and its
distribution (particularly its _attribution_ ).

One thing I would love to see on here is stuff from ICPSR, such as its mirror
of the FBI's National Incident-Based Reporting System [2]. As far as I can
tell, it's free for anyone to download after you fill out a form, and it
arguably should be free to distribute in the public domain; but for all I know,
ICPSR has an agreement with the FBI to only distribute that data under an
academic license.

(The FBI website has the data in aggregate form, but not the gigabytes that
ICPSR does)

[0]
[http://academictorrents.com/details/19c3aa2166d7bfceaf3d76c0...](http://academictorrents.com/details/19c3aa2166d7bfceaf3d76c0d36f812e0f1b87bc)

[1]
[https://www.yelp.com/dataset_challenge/dataset](https://www.yelp.com/dataset_challenge/dataset)

[2]
[https://www.icpsr.umich.edu/icpsrweb/NACJD/NIBRS/](https://www.icpsr.umich.edu/icpsrweb/NACJD/NIBRS/)

------
edraferi
How does this compare to IPFS ([https://ipfs.io/](https://ipfs.io/))?

That project maintains a number of archival datasets, including arXiv:
[https://ipfs.io/ipfs/QmZBuTfLH1LLi4JqgutzBdwSYS5ybrkztnyWAfR...](https://ipfs.io/ipfs/QmZBuTfLH1LLi4JqgutzBdwSYS5ybrkztnyWAfRBP729WB/archives/)

Seems like an opportunity to combine efforts.

------
wodenokoto
A lot of these "data sets" appear to be Coursera courses. I'm not sure those
are legal to redistribute. It also clutters the browse function, since a lot
of results aren't data sets.

~~~
Inthenameofmine
Archive.org is now legally providing the Coursera videos, so it should be
legal to distribute them this way as well.

------
babak_ap
Anyone looking for Wireless Data, I suggest taking a look at:
[http://crawdad.org/](http://crawdad.org/) and
[http://www.cise.ufl.edu/~helmy/MobiLib.htm#traces](http://www.cise.ufl.edu/~helmy/MobiLib.htm#traces)

------
_lpa_
I like this idea! In my research we deal with relatively large amounts of
sequence data, all of which needs to be associated with GEO
([https://www.ncbi.nlm.nih.gov/gds/](https://www.ncbi.nlm.nih.gov/gds/)).
While GEO is in many ways a good thing, it is not the most pleasant to use; I
would love it if we could use something like torrents instead.

I feel like there is a danger, however, that using torrents would facilitate
the thousands of nonstandard (often redundant) formats bioinformaticians seem
to create.

------
jakub_h
I'm wondering if torrents as such are actually useful for this. I'd figure
some kind of virtual file system (perhaps _based_ on BitTorrent) would be very
useful. You'd simply pass a file path to an open() routine in your scientific
code and data would get opened transparently. You currently have this with
URLs and HTTP but there's no useful caching or data distribution.
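As a rough illustration of that idea, here is a minimal sketch of a transparent open() that fetches a remote file once and serves later reads from a local on-disk cache. The cache location and naming scheme are assumptions; a BitTorrent-backed version would replace the HTTP fetch with a swarm download:

```python
import hashlib
import os
import tempfile
import urllib.request

CACHE_DIR = os.path.join(tempfile.gettempdir(), "dataset-cache")

def open_dataset(path_or_url, mode="rb"):
    """open() that transparently fetches and caches remote data."""
    if not path_or_url.startswith(("http://", "https://")):
        return open(path_or_url, mode)  # plain local file, pass through
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Key the cache entry on a hash of the URL.
    key = hashlib.sha1(path_or_url.encode()).hexdigest()
    cached = os.path.join(CACHE_DIR, key)
    if not os.path.exists(cached):
        urllib.request.urlretrieve(path_or_url, cached)  # first hit: download
    return open(cached, mode)  # later hits: served from disk
```

Scientific code would then call open_dataset() wherever it currently calls open(), and the same path works whether the data is local or remote.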

~~~
seanp2k2
You might like [https://tahoe-lafs.org/trac/tahoe-
lafs/browser/trunk/docs/ab...](https://tahoe-lafs.org/trac/tahoe-
lafs/browser/trunk/docs/about.rst) ,
[https://urbit.org/docs/using/filesystem/](https://urbit.org/docs/using/filesystem/)
, or [https://ipfs.io](https://ipfs.io)

BTSync and SyncThing are also tools to do this, and I'm sure there are FUSE
things to work with BT and block chains ("bittorrent fuse" google results look
promising).

~~~
espadrine
With the same principles, [http://infinit.sh/](http://infinit.sh/) adds
security and editability to the mix.

------
rkda
You might also be interested in dat. They're trying to solve this problem as
well.

[http://dat-data.com/](http://dat-data.com/)

"Dat is a decentralized data tool for distributing data small and large."

------
degenerate
I already knew about Academic Torrents, but didn't know about Kaggle, which
was linked[1] from one of the 'popular' data sets on AT:
[https://www.kaggle.com/c/yelp-restaurant-photo-
classificatio...](https://www.kaggle.com/c/yelp-restaurant-photo-
classification)

It looks like one of those logo-design-competition-sites, but for big data.
Anyone compete in one of these?

[1]:
[http://academictorrents.com/details/19c3aa2166d7bfceaf3d76c0...](http://academictorrents.com/details/19c3aa2166d7bfceaf3d76c0d36f812e0f1b87bc)

------
WhitneyLand
It's a great initiative. One more step to help bring science collaboration
into the modern Internet world.

How much data do you have? How much storage do you project is needed? I'm
wondering how practical it would have been to use centralized storage, which
has its own advantages.

------
xchaotic
Is there anything preventing the abuse of this system to share copyrighted
content? Just call it "Population.Data.torrent" where the actual data is a
movie?

------
arxpoetica
How does this compare with, say, noms? [https://github.com/attic-
labs/noms](https://github.com/attic-labs/noms)

~~~
aboodman
Several immediate things come to mind:

\- this is a mechanism for sharing files and directories (e.g., zipped csv
files), whereas noms defines its own structured data model that is much more
granular

\- noms has versioning built-in, you can track the history of a particular
dataset

\- this is firmly based on bittorrent. You could maybe run noms on top of
bittorrent, but it's more intended to be run like git, where you talk directly
to a server that you want to collaborate with

------
amai
See also [https://www.globus.org/](https://www.globus.org/).

------
jakeogh
Similar project: [http://911datasets.org](http://911datasets.org)

------
danielmorozoff
This is fantastic. I am so glad someone built this.

------
Philipp__
This is amazing! Thank you for sharing this!

------
mtgx
.com? Is that wise? I imagine the domain will be seized as soon as the site
becomes popular enough. They should at least have some contingency plan to
deal with that.

~~~
ORioN63
Torrents don't mean illegal content.

~~~
sandworm101
In his defence, it doesn't really matter whether the content is legal or not.
There are people out there who see the torrents as evil. These people have
political power and have been known to disrupt and seize .com domains on the
scantiest of evidence. A more resilient tld would be a good idea.

~~~
hueving
Do you have an example of a domain being seized for hosting legitimate
torrents?

