
Academic Torrents – Making 27TB of research data available - jacquesm
http://academictorrents.com/
======
peterlk
I find it fascinating how difficult it is to find geological data. The
combined datasets of oil and mining companies, plus government data, have a
huge amount of Earth mapped. And yet this data is extremely hard to find in a
computer-consumable way. Most of it is locked up in PDFs or image scans of
maps, or locked in proprietary MapInfo/Autodesk formats. It seems to me that a
large dataset of all human knowledge of Earth would be massively valuable to
humanity. Unfortunately, oil/mineral maps are a cornerstone of a lot of very
powerful companies, so I don't think we'll see them any time soon.

Organizing this data would also be a hell of an effort because the maps use
different projections, are from a huge variety of times, and are often
inconsistent (overlapping areas with different mineral deposit analyses).

I suppose I can dream, though.

~~~
colek42
Most US government maps are available in a single clearinghouse.
[https://nationalmap.gov/](https://nationalmap.gov/). State governments and
counties also have websites with geological (need it for a septic permit) data
and land plots. The data is just getting more open, which is awesome. It may
be in different formats, but nothing a few lines of python and a PostGIS
database can't handle.
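
Roughly, a sketch along those lines (assuming GeoPandas on top of a local
PostGIS instance; the file, table, and connection names are made up):

    # Load a geological layer (any OGR-readable format, including MapInfo TAB)
    # and push it into PostGIS. to_postgis() also needs the geoalchemy2 package.
    import geopandas as gpd
    from sqlalchemy import create_engine

    gdf = gpd.read_file("state_geology.shp")  # or .tab, .gpkg, .geojson, ...
    gdf = gdf.to_crs(epsg=4326)               # normalize to one common projection
    engine = create_engine("postgresql://user:pass@localhost/gis")
    gdf.to_postgis("geology", engine, if_exists="replace")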

~~~
peterlk
This is definitely the right direction. But the thing I really want is a map
of minerals, faults, and topography. I am finding topography, but not the
other two. Perhaps I'm just not seeing them? Imagine how much better our
earthquake-prediction abilities might become if we could couple global seismic
data with a full map of Earth's fault lines. I know this is pretty unlikely,
but humanity would be better off, and most importantly, it would be really
cool (we could learn a lot of stuff).

~~~
close04
The reason some geological maps are almost impossible to find is that they
are treated as a national security item. In some countries the law forbids
companies from making these available, so the oil company that created a map
may be the only one allowed to use it, and only in a pretty controlled
fashion.

More than that: even if the map can be used more freely within the country
(by multiple parties), it may be illegal for the map to cross the border in
any form. That includes sending it over the internet, if the signals cross
the border.

I guess the reason some of them are getting out in the open is because they
are no longer considered "critical".

Disclaimer: My work intersected the oil and gas domain at some point.

~~~
samstave
> _The reason some geological maps are almost impossible to find is that they
> are treated as a national security item. In some countries the law forbids
> companies from making these available, so the oil company that created a
> map may be the only one allowed to use it, and only in a pretty controlled
> fashion._

_The_ reason.

Also, this is why some believe the war in Afghanistan was even waged: to get
"boots on the ground" on top of the ~$1T in mineral deposits.

While we kvetch constantly about stupid political distractions, there are, in
fact, many government-corporate entities that are playing the long game.

[https://www.livescience.com/47682-rare-earth-minerals-
found-...](https://www.livescience.com/47682-rare-earth-minerals-found-under-
afghanistan.html)

Look at even the completely obvious:

[https://www.amazon.com/Hundred-Year-Marathon-Strategy-
Replac...](https://www.amazon.com/Hundred-Year-Marathon-Strategy-Replace-
Superpower/dp/1250081343#customerReviews)

and even as openly published ten years ago:

[http://www.dailymail.co.uk/news/article-1036105/How-
Chinas-t...](http://www.dailymail.co.uk/news/article-1036105/How-Chinas-
taking-Africa-West-VERY-worried.html)

---

The point is that we are already in the long war for resource control. And
sadly, I would say that the US does not have the upper hand at this time,
aside from building up its military...

~~~
sizzle
Who has the upper hand, China?

~~~
baq
Yeah, Chinese Africa is going to be an interesting war zone.

------
xd
Weird to think this was what the internet/www was designed for from day one...

~~~
betterunix2
I think that is an urban legend, along the lines of "the Internet was created
to survive nuclear attacks." AFAIUI the Internet was created to unify various
predecessor networks (NSFNet, ARPANET, etc.), and as an experiment in
computerized communications. Sharing research data may have been an early
application of the Internet or its predecessors, but I am not so sure it was
the driving purpose of the project.

~~~
antod
In fairness, they did also mention the www as well as the internet. Tim
Berners-Lee definitely had sharing academic data in mind with the web. So
they were half right.

~~~
antod
I shoulda said 'research data' instead of 'academic data'. Fits better with
the discussion topic and TBL's purpose.

------
StavrosK
Is there a list of at-risk torrents? Basically, if I wanted to donate X GB to
help seed, what is the single most important torrent I could seed?

I imagine the relevant metric would be "importance / current number of
seeders".

~~~
ieee8023
You can browse collections and then order by seeders. Here are some
collections I think should be mirrored:

[http://academictorrents.com/collection/deep-
learning?sort_fi...](http://academictorrents.com/collection/deep-
learning?sort_field=seeders&sort_dir=ASC)

[http://academictorrents.com/collection/medical?sort_field=se...](http://academictorrents.com/collection/medical?sort_field=seeders&sort_dir=ASC)

[http://academictorrents.com/collection/joes-recommended-
mirr...](http://academictorrents.com/collection/joes-recommended-mirror-
list?sort_field=seeders&sort_dir=ASC)

~~~
voltagex_
I'm assuming those globe icons mean there's a web seed (an HTTP server)
involved? If so, I'd prioritise the ones that don't have that backing.

~~~
ieee8023
Yes. But webseeds also go away over time when a department shuts down a server
or a student account expires.

------
IanCal
> Distribute your public data globally for free to ensure it is available
> forever!

What steps are in place to ensure this over reasonable timescales (20-50
years)?

~~~
stephengillie
What would the Millennium Clock[0] version of a data storage device look like?

[0][http://longnow.org/essays/millennium-
clock/](http://longnow.org/essays/millennium-clock/)

~~~
mabbo
It wouldn't be a single device. That's far too subject to attack. Instead,
let's picture a decentralized system like BitTorrent, but with the goal of
data resiliency.

I picture a system where a person adds themselves as a node for a certain
group of files - like BitTorrent - but instead of downloading everything,
they choose how much space and bandwidth they're offering and the system
grabs the "best" pieces. Best in this case is about spreading the data out to
improve resiliency, like any data storage service would. It's not a human
choice but an optimization problem.

If 250 people each offered 100MB of space, a 5GB file could be maintained
with quadruple redundancy. Nodes come and go, and the system would minimize
data transfers while aiming to maximize redundancy: spread the files far and
wide geographically, and put popular pieces of the data onto nodes offering
more bandwidth.

Hmm. This sounds like a fun project to try...
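
Something like this greedy placement, maybe (a toy sketch; the names and the
replica count are made up, and a real system would also weigh bandwidth and
geography):

    # Greedy replica placement: put each copy of a chunk on the node with the
    # most free space that doesn't already hold that chunk.
    def place(chunks, nodes, replicas=4):
        # chunks: {chunk_id: size_mb}, nodes: {node_id: free_mb} (mutated)
        placement = {c: set() for c in chunks}
        for chunk, size in sorted(chunks.items(), key=lambda kv: -kv[1]):
            for _ in range(replicas):
                candidates = [n for n, free in nodes.items()
                              if free >= size and n not in placement[chunk]]
                if not candidates:
                    break  # under-replicated for now; wait for more nodes
                best = max(candidates, key=lambda n: nodes[n])
                placement[chunk].add(best)
                nodes[best] -= size
        return placement

    # e.g. a 5GB file in 100MB chunks across 250 nodes of 100MB each:
    # place({f"chunk{i}": 100 for i in range(50)},
    #       {f"node{i}": 100 for i in range(250)})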

~~~
zrm
If you're going to build something like this, especially when the data sets
are typically large, this is a _very_ strong case for erasure coding.

If you split a 5GB file into (20) 250MB chunks and then mirror each of them
four times, there is a nontrivial probability that one of the chunks loses all
four mirrors. Especially when you're using notoriously unreliable volunteer
hosting.

If you split the same 5GB file into (20) 250MB data chunks and (60) 250MB
erasure chunks, you consume the same amount of total storage but have to lose
61 chunks (>75% of _all_ the hosts) before you lose any data.
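
To put rough numbers on it (a back-of-the-envelope sketch; the per-host
failure probability is an assumption):

    # Compare data-loss odds: 20 chunks mirrored 4x vs. 20-of-80 erasure coding.
    from math import comb

    p = 0.5  # assumed probability that any given volunteer host disappears

    # Mirroring: a chunk dies if all 4 of its mirrors fail, and the file is
    # lost if any of the 20 chunks dies.
    p_loss_mirror = 1 - (1 - p**4) ** 20

    # Erasure coding: any 20 of the 80 chunks reconstruct the file, so data
    # is lost only if 61 or more of the 80 hosts fail.
    p_loss_erasure = sum(comb(80, k) * p**k * (1 - p) ** (80 - k)
                         for k in range(61, 81))

    # With p=0.5: mirroring loses data ~72% of the time, while erasure coding
    # loses it with probability on the order of 1e-6.
    print(p_loss_mirror, p_loss_erasure)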

~~~
voltagex_
Offtopic: Have you looked at PAR2? I seem to recall it being really good but
having some critical bug.

------
sbr464
Curious whether a potential solution would be having open, read-only
databases that you could query directly, versus everyone copying the same
data over and over. Kind of like how you don't download Wikipedia, you just
access what you need. I realize there are a lot of things to consider. But
not even a REST API/etc., an actual database.

I realize it wouldn't scale, would cost money, etc., but it could be
interesting.

~~~
qixxiq
Google BigQuery does this, though: it hosts huge public datasets and charges
only for the queries.
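
For instance, with the Python client against one of the public datasets (a
sketch; assumes GCP credentials are configured, and you pay per bytes
scanned):

    # Query a BigQuery public dataset; hosting is free, the query is billed.
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT word, SUM(word_count) AS n
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY word ORDER BY n DESC LIMIT 10
    """
    for row in client.query(sql).result():
        print(row.word, row.n)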

~~~
gwern
Also 'requester pays' on Amazon S3 buckets.
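
In boto3 terms the requester just sets a flag and eats the transfer cost (a
sketch; the bucket and key are placeholders):

    # Fetch from a requester-pays S3 bucket: the downloader, not the host,
    # pays the data-transfer charges.
    import boto3

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="some-public-dataset",
                        Key="data/part-0000.gz",
                        RequestPayer="requester")
    body = obj["Body"].read()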

------
jxub
Perhaps I'm going off on a tangent, but the social dynamics associated with
torrenting are pretty darn interesting.

On one hand, they seem to converge toward a consensus, with the most-seeded
and most-downloaded files and popularity acting as a trust factor. On the
other, they also promote the dissemination of ideas whose very knowledge
poses a threat to the status quo, that is, the state toward which a society
has been coerced.

On one hand, torrents are about rejecting the publishers and Big Media; on
the other, they are about arriving at a democratic consensus on which
films/books/... are the best or most useful.

And don't even get me started on the constant ethical dilemmas associated
with sharing and who should control or own the data.

To tie all these threads into a broader topic, we could associate the torrent
subculture with the Dionysian archetype Nietzsche wrote about.

~~~
Symbiote
I work with a slowly changing dataset that's about 100GB to download in full.
A few people a week download it.

I've considered adding a torrent download, because it includes built-in
verification of the download. A common problem is users reporting that their
download over HTTP is corrupt, but I'm not sure whether they'd be able or
willing to use BitTorrent.

(Also, for many users the download is probably fine, but they can't open it
in Excel. BitTorrent won't help with that.)
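
For the curious, that built-in verification is just per-piece hashing baked
into the protocol. In miniature (BitTorrent v1 uses SHA-1 over fixed-size
pieces; the piece size here is arbitrary):

    # How BitTorrent's verification works, roughly: hash each fixed-size
    # piece so a client can detect a corrupt piece and re-fetch just that one.
    import hashlib

    def piece_hashes(path, piece_size=4 * 1024 * 1024):
        hashes = []
        with open(path, "rb") as f:
            while piece := f.read(piece_size):
                hashes.append(hashlib.sha1(piece).hexdigest())
        return hashes  # stored in the .torrent; mismatches are re-downloaded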

~~~
solarkraft
For 100GB of important data I'd be willing to buy an extra hard drive.

I'd expect your users to be willing to set up a torrent client. It's not even
difficult.

~~~
duckerude
It's not that difficult, but it can be scary.

The few times I've published files over bittorrent I've had to reassure people
that torrenting itself isn't any more illegal than other download methods.

It's also not clear ahead of time how difficult it's going to be.

If torrenting is the only way, some people won't bother.

~~~
Symbiote
That's what I'm concerned about: tarnishing a good reputation with a "scary"
word.

Bandwidth is no problem for us; we have a faster connection than all the
users. Users in Africa and Latin America would probably benefit most, but I'd
need to research whether they'd be prepared to use BitTorrent before
implementing this.

------
natch
All the .torrent files are served over HTTP, so with a simple MITM attack a
bad actor could swap in their own custom-tweaked version of any dataset here,
in order to achieve whatever goals that might serve.
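
If the site at least published infohashes over a trusted channel, users could
check for themselves. Something like this (a sketch; assumes the third-party
bencodepy package, and the trusted fingerprint is a placeholder):

    # Recompute a .torrent's infohash locally; a MITM-tampered metadata file
    # will not match a fingerprint obtained out of band.
    import hashlib
    import bencodepy

    def infohash(torrent_path):
        meta = bencodepy.decode(open(torrent_path, "rb").read())
        return hashlib.sha1(bencodepy.encode(meta[b"info"])).hexdigest()

    # assert infohash("dataset.torrent") == fingerprint_from_trusted_channel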

I really wish we could get basic security concepts added to the default
curriculum for grade schoolers. You shouldn't need a PhD in computer security
to know this stuff. These site creators have PhDs in other fields, but
obviously no concept of security. This stuff should be basic literacy for
everyone.

~~~
westurner
> _This stuff should be basic literacy for everyone._

Arguably, one compromised x.509 CA in the PKI jeopardizes all SSL/TLS channel
security if there's no certificate pinning and no alternate channel for
distributing signed cert fingerprints (cryptographically signed hashes).

We could teach blockchain and cryptocurrency principles: private/secret keys,
public keys, hash verification; there, there's money on the table.

GPG presumes secure key distribution (`gpg --verify .asc`).

TUF is designed to survive certain role key compromises.
[https://theupdateframework.github.io](https://theupdateframework.github.io)

------
the_greyd
Perfect use case for [http://datproject.org/](http://datproject.org/). It has
git-like versioning on top of a BitTorrent-like swarm, so if something in the
dataset gets updated you only download the diff (unlike a torrent).

------
hamiltont
Does anyone understand the reasoning behind this statement:

> We would like to avoid the blind mirroring of all data.

Found at
[http://academictorrents.com/about.php#mirroring](http://academictorrents.com/about.php#mirroring)

~~~
ieee8023
Sometimes people upload 1TB files which are not intended to be mirrored, or
which are not of interest to many people. We don't want people who donate
hosting to mirror this content unless they really want to. But we also want
to make it easy and automatic to mirror content. Using collections, each of
which has an RSS feed, content can be curated by someone you trust to decide
what should be mirrored. I curate many collections, including video lectures,
deep learning, and medical datasets.
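
So a donor can point a small script at a trusted collection's feed (a sketch;
feedparser is a third-party package, and the feed URL and the assumption that
entries link straight to .torrent files are mine):

    # Poll a curated collection's RSS feed and drop new torrents into a
    # client's watch directory, so donated storage mirrors only vetted content.
    import feedparser, os, urllib.request

    FEED = "http://academictorrents.com/collection/medical.rss"  # assumed URL
    WATCH_DIR = "/var/lib/transmission/watch"

    for entry in feedparser.parse(FEED).entries:
        dest = os.path.join(WATCH_DIR, entry.link.rsplit("/", 1)[-1])
        if not os.path.exists(dest):
            urllib.request.urlretrieve(entry.link, dest)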

~~~
hamiltont
Got it. I'm not trying to knock the hard work put into this; I'm actually
thrilled to see this initiative and only intend to be constructive.
Personally, I would rather be asked to trust that site admins were auditing
each torrent to ensure it at least looks legitimate, before final legal
responsibility is passed on to me as a seeder. Leaving users to identify
contributors they can trust never to include "Pirated Movie 2018" in their
donated seedbox sounds like quite a hurdle to attracting new seeders willing
to participate in a "legitimate bittorrent use case" project.

~~~
ieee8023
We perform our own audits of all the data, and if something appears to be
"Movie 2018" we request a justification for why it is academic data. We think
the collections model is the best balance between a walled garden and zero
censorship. You can be assured that no collection curated by me will have a
bad torrent in it! Here are a few:

[http://academictorrents.com/collection/medical](http://academictorrents.com/collection/medical)

[http://academictorrents.com/collection/video-
lectures](http://academictorrents.com/collection/video-lectures)

[http://academictorrents.com/collection/joes-recommended-
mirr...](http://academictorrents.com/collection/joes-recommended-mirror-list)

[http://academictorrents.com/collection/computer-
vision](http://academictorrents.com/collection/computer-vision)

[http://academictorrents.com/collection/deep-
learning](http://academictorrents.com/collection/deep-learning)

------
ddebernardy
Are the datasets all legit? For instance, this looks like a quarterly scrape
of Reddit in full:

[http://academictorrents.com/details/85a5bd50e4c365f8df70240f...](http://academictorrents.com/details/85a5bd50e4c365f8df70240ffd4ecc7dec59912b)

~~~
matt4711
I'm pretty sure all the Twitter datasets violate the Twitter ToS.

~~~
JamesMcMinn
On a quick pass of the Twitter datasets, they all seem to conform to Twitter's
developer Terms.

~~~
matt4711
Like the requirement that you delete tweets from datasets once they have been
deleted on Twitter?

~~~
JamesMcMinn
As far as I could tell, none of them actually contain tweets (e.g. any JSON),
just IDs, and mostly user IDs at that.

------
natch
My browser reports the "create an account" page is not secure, so maybe best
not to use this as an uploader at least until they fix that. For the creator
of the site: pages that collect passwords should be served over https.

~~~
Klathmon
All pages should be served over HTTPS. It's not only about keeping secrets.

~~~
natch
Good point.

------
patall
It's good that more than one way exists for something like this, though I
personally prefer something like Zenodo, where every record automatically
gets a DOI attached.

(Zenodo is limited to 50GB, though.)

------
htor
"gta_full_dist.tar" seems to be one of the biggest "datasets" featured on
here. funny this data business.

~~~
dunpeal
Did you actually check it? It has nothing to do with piracy or video games, as
you seem to imply.

------
danielmorozoff
I have been wondering why this hasn't existed for years! Thank you guys for
making this. Long awaited.

~~~
p1esk
This has existed for years.

~~~
danielmorozoff
I have wondered for years... apologies for the poor grammar

------
leemailll
I can't find any info on how the data is hosted from the website, so I am
wondering whether it works like pirate bay or it also hosts data itself? If
the former is the case, it will be hard for researcher to use and share. One
reason is that academic institutions nowadays has tightened control on net
access, which definitely hinders hosting large amount of data shared with BT
protocol; second, because researches are often fragmented which by itself will
limit interests of possible users, then the sharing falls on a few people's
goodwill.

~~~
epilogue
It's in the name - Academic 'Torrents'. They just host the torrent files which
are only a couple hundred kilobytes. I feel like your sorting of missing the
point of a service like this, it's not to provide potentially the fastest
download available, but it's to ensure data is accessible, even if the
original download source is unavailable or is inaccessible for certain people
or locations.

As long as you're not downloading copyrighted data there should be no issue
with using the BT protocol on a company or academic network, providing their
is no outright ban on the protocol in your network usage policy. The BT
protocol itself actually lends itself quite well to large datasets such as
what is hosted here due to its inbuilt error checking (so no more spending
hours downloading a huge dataset only to find your connection did something
silly for a second and corrupted the whole file) and can provide much faster
download speeds on popular files due to the number of peers available, instead
of a normal hosting arrangement which would likely provide slower speeds on
popular files due to network congestion and file access speeds.

~~~
user5994461
P2P is banned on the French academic network. I expect the same in
neighboring countries, although I didn't get to review their terms of
service.

~~~
dewey
I'd guess that if a researcher has to use P2P, there's probably a way for
them to get the data or get whitelisted. I'm pretty sure the "P2P is banned"
policy is mostly aimed at downloading copyright-infringing content.

~~~
user5994461
It's about more than copyright infringement. Once you have a few students or
researchers deploying P2P, it's going to saturate the dedicated 10 Gb links
very quickly.

P2P will consume any amount of upload bandwidth available. It's horrendous to
have inside your network as a university or research center.

~~~
dewey
It's not much different from any other service: if you don't limit it or use
it in moderation, it'll saturate the network. If someone hosted Linux ISOs
over HTTP, as some universities do, it would have the same effect.

------
evilzardoz
I have some very big concerns about this.

1. It appears to be sponsored by seedbox hosting companies, plus a Google ad.
This is misleading (no, it is _not_ directly sponsored nor endorsed by
Salesforce, which is the Google ad I see).

2. Many higher education institutions block BitTorrent on their firewalls to
prevent/reduce copyright infringement.

3. How legitimate is the data? Is there any vetting of the content to ensure
that it doesn't violate copyright and that the data was legally obtained
(e.g. via site scrapes)? A DMCA takedown is too late if we've already
accidentally seeded infringing information, and it could harm our reputation.

4. The site claims to be "used by" a group of very big names (Stanford, MIT,
UT Austin, etc.). Did they ask for/give permission to be cited? Do they
endorse the use of this service?

5. HTTPS. Please?

It's a great idea, but it needs a bit more polish before I could even suggest
this to my management.

------
anderspitman
Does anyone know why they opted not to use WebTorrent for this? Obviously
plain BitTorrent is more battle-tested, but the extra friction of having to
know how to work a BT client is non-trivial.

~~~
wincy
I’d call it trivial. My 60 year old uncle who shouts at his computer uses
BitTorrent. Anyone who wants these files will be able to figure it out.

------
loblollyboy
The search on this would be better if you could browse by subject. Also, on
my Firefox, the checkboxes (for type of resource) don't highlight.

------
partycoder
I can see some overlap with Internet2.

------
sweetp
great resource, thanks for the link

------
Lapsa
The papers section is full of Trivedi Effect material. It gets tiresome to
filter out.

------
sweetp
great resource!

------
shawn
I see this is a centralized service. Is there protection from DMCA takedowns?

For example, if someone uploads a bundle of 50 N64 ROMs, and someone tries to
DMCA that link so that this service no longer provides an index to it, is
there censorship resistance?

This is almost exactly what I need: A distributed index of massive datasets.
I've been building a web browser that can source from such content. The idea
is that you write a website which refers to some resources by SHA256, and
anyone else running the browser will transmit the resource to you if they have
it.
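
The integrity side is the easy part: whoever supplies the bytes, the browser
just checks them against the address (a minimal sketch; peer discovery is the
hard part and isn't shown):

    # Content addressing in miniature: accept bytes from any untrusted peer,
    # but keep them only if they hash to the requested address.
    import hashlib

    def accept(requested_sha256: str, blob: bytes) -> bytes:
        if hashlib.sha256(blob).hexdigest() != requested_sha256:
            raise ValueError("peer sent bogus data; try another peer")
        return blob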

That would let you build an emulator which can play any ROM in history,
without having to explicitly download the ROMs. It's equivalent to clicking a
link to "Super Mario Brothers" and seeing it play immediately in your
browser. No explicit downloading.

From a long term standpoint, the vision is that you can build whatever games
you want, using whatever assets you want, and nobody can tell you that you
can't.

So it was funny to see this service pop up, because it's nearly an identical
use case: "I have some data (ROMs), I want to make it available to everyone
else (people running the emulator), and it's decentralized so no one can say
no (bittorrent)." But that raises the question of scope, or whether such use
cases would be welcome.

~~~
fwip
Rather than building your own browser, take a look at Beaker Browser.

~~~
shawn
Thanks! Unfortunately Dat doesn't provide any anonymity.

~~~
fwip
That's true, and a worthwhile consideration if you're doing illegal things. If
you route your connection through Tor you'd be good to go, right?

~~~
shawn
Yes, but users can't be expected to know how to do that. And my users are the
ones who want to play ROMs.

Maybe it's possible to bundle Tor with the executable, though, and have the
network connections automatically route through Tor.
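
It mostly comes down to pointing connections at Tor's SOCKS port (a sketch
with the requests library; needs `pip install requests[socks]` and a local
tor daemon on its default port):

    # Route HTTP fetches through a local Tor SOCKS proxy.
    import requests

    TOR_PROXY = {"http": "socks5h://127.0.0.1:9050",
                 "https": "socks5h://127.0.0.1:9050"}  # socks5h: DNS via Tor

    r = requests.get("https://check.torproject.org/", proxies=TOR_PROXY)
    print("Congratulations" in r.text)  # True if traffic is exiting via Tor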

------
h4b4n3r0
I was hoping for the whole 18TB Google Open Images in one torrent, but alas,
they only offer metadata for download, same as Google itself. What a bear of a
dataset it is to obtain and work with.

------
gammateam
FINALLY. The inability to browse was my biggest issue with Sci-Hub.

All these sheltered academics were heralding it as the best thing ever,
without even the basic user experience expected this decade.

Yes, I know about the Chrome extensions.

~~~
viraptor
Because browsing is provided by other services already. The usual use case
for Sci-Hub is: find an article you're interested in, find that it's super
expensive, copy the DOI into Sci-Hub.

It would be cool if Sci-Hub came with a catalogue, but it's very usable
without one. And it's not just sheltered academics (or even just academics)
who are using it.

