Academic Torrents – Making 27TB of research data available (academictorrents.com)
1082 points by jacquesm on Aug 12, 2018 | 140 comments

I find it fascinating how difficult it is to find geological data. The combined datasets of oil and mining companies plus government data have a huge amount of Earth mapped. And yet, this data is extremely hard to find in a computer-consumable way. Most of it is locked up in PDFs or image scans of maps, or locked in proprietary MapInfo/Autodesk formats. It seems to me that a large dataset of all human knowledge of Earth would be massively valuable to humanity. Unfortunately, oil/mineral maps are a cornerstone of a lot of very powerful companies. So I don't think we'll see them any time soon.

Organizing this data would also be a hell of an effort because the maps use different projections, are from a huge variety of times, and are often inconsistent (overlapping areas with different mineral deposit analyses).

I suppose I can dream, though.

Most US government maps are available in a single clearinghouse. https://nationalmap.gov/. State governments and counties also have websites with geological (need it for a septic permit) data and land plots. The data is just getting more open, which is awesome. It may be in different formats, but nothing a few lines of python and a PostGIS database can't handle.

This is definitely the right direction. But the thing I really want is a map of minerals, faults, and topography. I am finding topography, but not the other two. Perhaps I'm just not seeing them? Imagine how much better our earthquake prediction abilities might become if we could couple global seismic data with a full map of Earth's fault lines. I know this is pretty unlikely, but humanity would be better off, and most importantly, it would be really cool (we could learn a lot of stuff).

The reason some geological maps are almost impossible to find is because they are treated as a national safety item. In some countries the law forbids companies from making these available. So an oil company that created the map can only use it and in a pretty controlled fashion.

More than that, even if the map can be used more freely in the country (by multiple parties), it may be illegal for the map to cross the border in any form. This also covers electronic signals over the internet if they cross the border.

I guess the reason some of them are getting out in the open is because they are no longer considered "critical".

Disclaimer: My work intersected the oil and gas domain at some point.

>The reason some geological maps are almost impossible to find is because they are treated as a national safety item. In some countries the law forbids companies from making these available. So an oil company that created the map can only use it and in a pretty controlled fashion.

/THE/ reason.

Also, this is why some believe the war in Afghanistan was even waged: to get "boots on the ground" on top of the $1T in mineral deposits.

While we kvetch constantly about stupid political distractions, there are, in fact, many government-corporate entities that are in the long-game.


Look at even the completely obvious:


and even as openly published ten years ago:



The point is that we are already in the long war for resource control. And sadly, I would say that the US does not have the upper hand at this time, aside from building up the military...

I received the following (candid) comment from a geologist I asked. He said that I could share it:

> I don't have time to write the essay on this, but we definitely DON'T wage war to get "boots on the ground". That's absurd, and if true, why go to that effort when there is the rest of the world to put your boots on without war? That's one of the biggest problems. The Afghan Geo Survey was at PDAC pleading for people to come work in their country, but I for one won't be going AND NO ONE is going to put money into that huge risk. Why would they, with the rest of the world available? To make a real map you have to have boots on the ground (like I've done for 40 years), not a computer with .tab and .shp files. That is a derivative product that comes after collection of real data that costs money and is proprietary intellectual capital. Nobody sees my map without a non-disclosure agreement or confidentiality agreement. And that happens across the world for any private expenditure. Exxon, Total, Shell, etc. possibly have the most vast earth science databases created by mankind. They represent $trillions of investment and are not about to make that publicly available. Publicly available earth science information is created by, curated by, and made available through governments, = tax $. Thus, it is meager, everywhere.

I'm sorry - but I feel that this post agrees with my comment.

When I said "get boots on the ground" I was also saying that the point of having a military presence is a long-term play ensuring that we secure some avenue to access of minerals at-some-point in the future - and this aligns with the corporate goals as well.

Maybe I worded it poorly, but for the most part I feel I am saying the same thing as your friend, though I am more cynical about the motives of our government/military and more suspicious that they are much more cunning than they would want the general populace to believe.

Who has the upper hand, China?

Yeah, the Chinese Africa is going to be an interesting war zone.

By a long shot....

That data is the secret sauce for mining and oil and gas companies. Them giving it away would be like Facebook giving away their datasets.

The Australian government has a huge amount of data available in computer consumable ways.

To search the catalog from GA go to:


There is an emphasis on open formats and using open source software.

GA data is also extensively used by the Australian National Map at:


Also GA is available from data.gov.au


Wow, didn't know this exists! Thanks for the links!

> I find it fascinating how difficult it is to find geological data.

Discoverability for openly released scientific datasets is a huge problem in general. While some enterprising folks have worked on adding parsers for scientific data formats such as NetCDF and HDF5 to Apache Tika (which can then be indexed by Solr/Elasticsearch/whatever) [0], the vast majority of scientific file formats don't have parsers available. Even worse, in the climate of publish-or-perish, most scientists are unaware of metadata extraction / indexing tools or unlikely to prioritize incorporating them, even though these would make their data more readily searchable based on relevant metadata (such as equipment settings, etc).

I have some personal experience in this area- when I was working as a research assistant, I basically did helpdesk support for an open access dataset, answering questions from researchers at other institutions. I'd estimate that of the questions I received in my inbox, close to 70% could have been resolved with a good implementation of faceted search. A related issue I encountered is that rather than relevant metadata existing alongside a dataset, sometimes I'd have to dive into an article's methods section to find it, often in a weird place that wasn't obvious at first glance due to the obtuse writing style that is encouraged for scientific publications.
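To make "faceted search" concrete, here is a minimal sketch in Python. The records and field names (`instrument`, `format`, `year`) are made up for illustration; a real deployment would more likely sit on top of Solr or Elasticsearch than on a list of dicts.

```python
from collections import Counter

# Hypothetical metadata records; field names are illustrative only.
DATASETS = [
    {"name": "scan-001", "instrument": "MRI", "format": "HDF5", "year": 2016},
    {"name": "scan-002", "instrument": "MRI", "format": "NetCDF", "year": 2017},
    {"name": "core-117", "instrument": "XRF", "format": "HDF5", "year": 2017},
]

def filter_records(records, **selected):
    """Keep records whose metadata matches every selected facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in selected.items())]

def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(r[facet] for r in records if facet in r)

# Narrow by one facet, then show the remaining choices for another.
hits = filter_records(DATASETS, format="HDF5")
print([r["name"] for r in hits])
print(facet_counts(hits, "instrument"))
```

The point is only that once metadata lives next to the dataset in structured form, most "which files were recorded with setting X?" questions reduce to a filter plus a count.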

The bigger problem, however, is that the culture of science in academia right now puts way too much emphasis on flashiness over sustainability and admittedly non-sexy tasks like properly versioning and packaging scientific software, documenting analyses, and producing well-characterized datasets.

[0] https://www.slideshare.net/chrismattmann/scientific-data-cur...

It's not up currently, but make a note about a domain for a project I'm working on and releasing soon. It's a general platform for accumulating any type of geo/metadata/media possible about a point in space.


I mentioned a few details about it in this comment https://news.ycombinator.com/item?id=17532101

Sort of like Wikimapia?

In concept, but not execution. More scientific and the platform doing a lot of work for you. You upload tons of specific raw data. Think lidar, drone footage, data from core samples, raw precision gps data, material analysis, etc.

The platform understands a lot of things already and is able to organize/chart/map/visualize automatically, and extensible where not.

A comment from a geologist who I asked about this (he said I could share on his behalf):

> Faults of the world that affect mankind are mostly known and well mapped at the surface and below the surface through assorted geophysical means. We can't see there otherwise. Google Earth has made a tremendous contribution to mankind and there are many efforts to make Google geo-earth, at least in small chunks. Universities are the best homes for this type of effort. Quality of data is a huge factor. The location of the San Andreas Fault is very well understood. Yet 20 years ago, there were NO geologic maps of even rudimentary quality for the San Francisco Bay area. Maps, yes. Useful, printed or digital? No. I doubt that has changed. Reason? Politics, funding, right people to do the job, etc. BART and bridge routes are well studied for engineering purposes and should be public information, but try to find a geology map derived from that suitable for public use. When I work in foreign countries and even in the US, there are regional maps, say 1:1,000,000. But that offers very little value. Now I am writing the book on the geology where I am working. I am the first one there to figure out what I am looking at and the potential value for any mineral commodity. You can't do what I do from a computer or satellite. You HAVE to be boots on the ground. The way forward for advancing geologic understanding is quality mappers on the ground knowing what they are looking at, translating that to useful information that conveys both facts and interpretations, and then that to digital. We are a looooong loooong way from having geology Google Earth, but if we did, it would be great for mankind. Raise taxes?

In Australia, I'm pretty sure mineral maps are what's funding Geoscience Australia, which is pretty much the reason we have digital maps at all.

As for data being "locked away" in formats, well, no one's really come up with the One True Geospatial format, so you've got TAB and SHP. I prefer TAB, because I hate column name limitations. Use QGIS [1] or GDAL [2] to get the data into a format you want, or for really heavy lifting, apply for a home use licence for FME Desktop [3] (which uses GDAL among other things).

TL;DR: Mineral maps are worth cash money, and that's why you can't get at them.

1: https://qgis.org/en/site/ - can a benevolent billionaire please throw a couple of million dollars at this thing?

2: https://www.gdal.org/ - GDAL makes the world go round (or ellipsoid, depending on projection)

3: https://www.safe.com/free-fme-licenses/home-use/

Weird to think this was what the internet/www was designed for from day one..

I think that is an urban legend, along the lines of "the Internet was created to survive nuclear attacks." AFAIUI the Internet was created to unify various predecessor networks (NSFNet, ARPANET, etc.), and as an experiment in computerized communications. Sharing research data may have been an early application of the Internet or its predecessors, but I am not so sure it was the driving purpose of the project.

In fairness, they mentioned the www as well as the internet. Tim Berners-Lee definitely had sharing academic data in mind with the web. So they were half right.

I shoulda said 'research data' instead of 'academic data'. Fits better with the discussion topic and TBL's purpose.

Is there a list of at-risk torrents? Basically, if I wanted to donate X GB to help seed, what is the single most important torrent I could seed?

I imagine the relevant metric would be "importance / current number of seeders".
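A rough sketch of that metric in Python, with a greedy fill of the donated space. Everything here is an assumption for illustration: the `importance` score, the field names, and the "+1" smoothing are invented, and nothing suggests the site exposes data in this shape.

```python
from dataclasses import dataclass

@dataclass
class Torrent:
    name: str
    importance: float   # hypothetical score, e.g. assigned by curators
    seeders: int
    size_gb: float
    has_webseed: bool = False

def risk_score(t: Torrent) -> float:
    # A web seed counts as one extra source; +1 avoids division by zero
    # for torrents that currently have no source at all.
    sources = t.seeders + (1 if t.has_webseed else 0)
    return t.importance / (sources + 1)

def pick_torrents(torrents, budget_gb):
    """Greedy choice: highest score first, skipping anything that no longer fits."""
    chosen, remaining = [], budget_gb
    for t in sorted(torrents, key=risk_score, reverse=True):
        if t.size_gb <= remaining:
            chosen.append(t)
            remaining -= t.size_gb
    return chosen

catalog = [
    Torrent("popular-lectures", importance=5.0, seeders=40, size_gb=20),
    Torrent("rare-survey", importance=4.0, seeders=0, size_gb=60),
    Torrent("old-corpus", importance=2.0, seeders=1, size_gb=30, has_webseed=True),
]
print([t.name for t in pick_torrents(catalog, budget_gb=100)])
```

The hard part is the `importance` number, not the ranking; seeder counts are at least observable.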

I'm assuming those globe icons mean there's a web seed (HTTP server) involved? If so, I'd prioritise the ones that don't have that backing.

Yes. But webseeds also go away over time when a department shuts down a server or a student account expires.

It would be great to get a personalized RSS feed of torrents. This way, knowledge of at-risk torrents can be distributed efficiently. It even allows prioritization.

I wonder if you can set up a distributed-hash-table sort of thing that lets you reliably query for less-popular torrents. Like a magnet-link system that supports top-k queries.

Sometimes if a torrent doesn't have seeds it is a signal that people do not value that data. Not all data should be stored forever. The community should vote with their servers what we should store forever and at what level of redundancy.

> Distribute your public data globally for free to ensure it is available forever!

What steps are in place to ensure this over reasonable timescales (20-50 years)?

The project is run by the U.S. 501(c)3 Non-profit called Institute for Reproducible Research (http://reproducibilityinstitute.org) and this site has an overhead cost of ~$500/year. We plan to fund this project for at least the next 30 years. The community hosts the data and we also coordinate donations of hosting from our sponsors (listed on the home page).

We also run the project ShortScience.org! Check it out!

> The community hosts the data

This is the key thing for me, so there are no guarantees of the data being available? Or is all the data backed by an owner hosted box (and backups)?

The main goal is to facilitate distribution. We work with research groups and provide "best effort" hosting with the resources from our sponsors. We expect the uploader has a backup of the data in all cases.

What would the Millennium Clock[0] version of a data storage device look like?


It wouldn't be a single device. That's far too subject to attacks. Instead, let's picture a decentralized system like BitTorrent, but with the goal of data resiliency.

I picture a system where a person adds themselves as a node to a certain group of files, like BitTorrent, but instead of downloading everything, they choose how much space and bandwidth they're offering and the system grabs the "best" pieces. Best in this case is about spreading the data out to improve resiliency, like any data storage service would. It's not a human choice but an optimization problem.

If 250 people each offered 100MB of space, a 5GB file could be maintained with quadruple redundancy. Nodes come and go, and the system would minimize data transfers while aiming to maximize redundancy. Try to spread the files far and wide geographically, to put popular pieces of the data onto nodes offering more bandwidth.
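For a feel of the optimization, here is a toy placement routine in Python. It is a sketch under simplifying assumptions: decimal units (5 GB = 5,000 MB), fixed 25 MB pieces, no bandwidth or geography, and a simple "emptiest node first" heuristic rather than anything optimal. By that arithmetic 250 x 100 MB is 25,000 MB, so four copies (20,000 MB) fit with headroom for churn.

```python
import heapq

def place_pieces(num_pieces, piece_mb, node_capacity_mb, copies):
    """Assign `copies` replicas of each piece to distinct nodes, emptiest node first.

    node_capacity_mb maps node id -> offered space in MB.
    Returns {piece index: [node ids]}; raises ValueError if space runs out.
    """
    # Max-heap on free space (negated, because heapq is a min-heap).
    heap = [(-cap, node) for node, cap in node_capacity_mb.items()]
    heapq.heapify(heap)
    placement = {}
    for piece in range(num_pieces):
        picked = []
        for _ in range(copies):
            if not heap or -heap[0][0] < piece_mb:
                raise ValueError("not enough space for requested redundancy")
            picked.append(heapq.heappop(heap))
        placement[piece] = [node for _, node in picked]
        # Put the nodes back with their reduced free space.
        for neg_free, node in picked:
            heapq.heappush(heap, (neg_free + piece_mb, node))
    return placement

nodes = {n: 100 for n in range(250)}            # 250 volunteers, 100 MB each
placement = place_pieces(200, 25, nodes, 4)     # 5,000 MB file in 25 MB pieces
print(len(placement), "pieces placed,", len(placement[0]), "copies each")
```

A real system would also have to re-run this as nodes come and go, which is where minimizing data transfers gets interesting.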

Hmm. This sounds like a fun project to try...

If you're going to build something like this, especially when the data sets are typically large, this is a very strong case for erasure coding.

If you split a 5GB file into (20) 250MB chunks and then mirror each of them four times, there is a nontrivial probability that one of the chunks loses all four mirrors. Especially when you're using notoriously unreliable volunteer hosting.

If you split the same 5GB file into (20) 250MB data chunks and (60) 250MB erasure chunks, you consume the same amount of total storage but have to lose 61 chunks (>75% of all the hosts) before you lose any data.
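To put rough numbers on that comparison, here is a small Python calculation. It assumes every chunk sits on its own host, hosts fail independently with probability `p`, and the erasure code is MDS (Reed-Solomon style), so any 20 of the 80 chunks are enough to rebuild the file. Real volunteer hosts are not independent, so treat this as an illustration only.

```python
from math import comb

def mirror_loss(p, chunks=20, copies=4):
    """P(data loss) when each chunk has `copies` mirrors: some chunk loses all of them."""
    return 1 - (1 - p ** copies) ** chunks

def erasure_loss(p, data=20, parity=60):
    """P(data loss) for an MDS code: more than `parity` of the data+parity chunks are gone."""
    n = data + parity
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(parity + 1, n + 1))

for p in (0.1, 0.3, 0.5):
    print(f"p={p}: mirroring {mirror_loss(p):.3g}, erasure {erasure_loss(p):.3g}")
```

Both layouts use 80 chunks of 250MB; the difference is that mirroring fails as soon as one particular group of four hosts disappears, while the erasure-coded layout only fails when 61 or more of the 80 hosts are gone.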

Offtopic: Have you looked at PAR2? I seem to recall it being really good but having some critical bug.

And here I go on a wiki spiral. Thanks!

That sounds a lot like IA.BAK [1], a git-annex backed attempt at making a backup copy of the Internet Archive.

[1] https://www.archiveteam.org/index.php?title=INTERNETARCHIVE....

Edit: https://git-annex.branchable.com/design/iabackup/ has more details.

It would be a collection of lithographically etched plates. These are stable across a wide range of conditions. The coding scheme would be Reed-Solomon or a variant of it, or possibly even just massive duplication with checksums. Multiple duplicate backups of the entire data would be stored in geographically disparate locations. This reduces the "asteroid hits <city>" problem. A wide range of different technologies would be used, randomly, across the copies. This reduces the likelihood of systematic catastrophic failure due to a specific technology not working as expected.

This is not that different from how current millennia-scale data storage has worked for... well, millennia. The primary challenge is in finding a universal key (if somebody comes along a million years later, will they share enough context to decode the archive?)

From the same people: http://rosettaproject.org/disk/concept/

It doesn't really scale, but then neither does the Clock.

Curious if a potential solution would be having open, read only databases, that you could query directly, vs everyone copying the same data over and over. Kind of how you don’t download Wikipedia but access what you need. I realize there are a lot of things to consider. But not even a rest api/etc, an actual database.

Realize it wouldn’t scale, would cost money etc, but could be interesting

I've been thinking about the same for quite a while now. In fact, look at the overlap between GraphQL and SQL conceptually. I absolutely think there is something to this.

In the past, I have used certain wide open read only genomics databases (not going to name it so it doesn't get hammered by HN).

Other posters are right about services such as BigQuery but I think there's a place for an open source project here that interfaces SQL to databases through a layer that adds caching, throttling and more services on top of that. That's how you make it scale.
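A toy version of that layer in Python, using the standard-library `sqlite3` module as a stand-in backend. The class name, the SELECT-only check, the dictionary cache, and the fixed-interval throttle are all my own simplifications, not how BigQuery or any existing project works; the sample table is illustrative.

```python
import sqlite3
import time

class ReadOnlyGateway:
    """Toy gateway: SELECT-only, cached, rate-limited access to one SQLite database."""

    def __init__(self, conn, min_interval_s=0.0):
        self.conn = conn
        self.conn.execute("PRAGMA query_only = ON")  # ask SQLite itself to refuse writes
        self.min_interval_s = min_interval_s
        self._last_call = float("-inf")
        self._cache = {}
        self.cache_hits = 0

    def query(self, sql, params=()):
        if not sql.lstrip().lower().startswith("select"):
            raise ValueError("only SELECT statements are accepted")
        key = (sql, tuple(params))
        if key in self._cache:
            self.cache_hits += 1
            return self._cache[key]
        # Crude throttle: space out queries that actually reach the backend.
        wait = self.min_interval_s - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)
        rows = self.conn.execute(sql, params).fetchall()
        self._last_call = time.monotonic()
        self._cache[key] = rows
        return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (symbol TEXT, chromosome TEXT)")
conn.executemany("INSERT INTO genes VALUES (?, ?)",
                 [("BRCA1", "17"), ("TP53", "17"), ("CFTR", "7")])
conn.commit()

gateway = ReadOnlyGateway(conn)
print(gateway.query("SELECT symbol FROM genes WHERE chromosome = ? ORDER BY symbol", ("17",)))
```

A prefix check is not a real SQL firewall, and an unbounded cache is not a real cache; a serious version would need per-client quotas, query cost limits, and eviction.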

The Dremio project (open source by the backers of Apache Arrow) has a SQL REST API that converts a standard SQL dialect/datatypes to the underlying systems. I think that's a good start and Dremio has a ton of other awesome functionality like Apache Arrow caching.

Simple model is expose an expression language (even could be not SQL, like jsoniq, or other expression languages), mapper from that to SQL, web service API on top with a pluggable connector model.

I say that I'm going to start an open source project around this all the time but haven't gotten the inertia to do it. Argh!

Google BigQuery does this though. They host huge public data files and then only charge for the queries.

Also 'requester pays' on Amazon S3 buckets.

You're right, I've tried it before and forgot about it.

This is the kind of application that dat, swarm, ipfs/filecoin are aiming to support.

More specifically orbitdb [1] which is built on top of ipfs is a potential starting point. It implements a decentralized and distributed append only log data structure as well as object store and key/value lookup db. Not quite a SQL database though.

[1] https://github.com/orbitdb/orbit-db/blob/master/README.md

How about using a database engine with built in horizontal sharding to distribute a big database to multiple users in smaller pieces, just like bittorrent but with query-capabilities.

Thanks to the recent big-data NoSQL craze this should be easy to find off the shelf. As an example, I know MongoDB has all this built in. I'm not sure it is designed with untrusted nodes in mind, but otherwise it fits all the requirements.

I think cockroachDB has the automatic node registration/balancing? I guess you’d have to deal with the shard or data you would choose to host

There’s one feature of Faunadb that stood out, the ability to create a seemingly unlimited number of nested databases, where the db assigned to you appeared to be the root/full database, with full query/write privileges.

Imagine a public dataset that you could query but also update with your own findings and metadata, and choose to have it shared upstream or keep private. The engine could diff your work to avoid duplicating a copy of the data. You could sync it locally for speed if needed

Thanks for the resource though! Looks interesting

That’s what the current academic publication infrastructure was built to be. Then capitalism happened, and we have abusive monopolies instead.

Historically, as empires collapse, they burn their libraries, but they usually don’t get to the copies that are kept on the outskirts.

For sure, I was thinking more simply how most of us developers etc, immediately create an abstraction (rest/graphql/sdk) vs allowing direct queries to a db. Would be interesting if libraries/knowledge/datasets could be a query away, actual free info, not a search engine/promoted/abstraction in any way.

Perhaps I'm going off on a tangent, but the social dynamics associated with torrenting are pretty darn interesting.

On one hand, they seem to converge towards a consensus with most seeded and downloaded files and popularity as a trust factor. On the other, they also promote the dissemination of ideas the knowledge of which poses a threat to the status quo that is, the state towards which a society was coerced to.

On one hand, Torrents are about rejecting the Publisher and Big Media status but on the other they are about arriving to a democratic status about which films/books/... are the best or most useful.

And don't even get me started about the constant ethical dilemmas associated with sharing and who should control or own the data.

To link all those threads into a broader topic, we could associate the torrent subculture with the Dionysian archetype which Nietzsche wrote about.

>On the other, they also promote the dissemination of ideas the knowledge of which poses a threat to the status quo that is, the state towards which a society was coerced to.

Actually, one of the biggest uses of torrents is to disseminate pop culture materials that fall right in the middle of US culture. Probably dwarfing "radical" stuff by many orders of magnitude.

Does popular hip hop music threaten or promote the status quo?

It does the latter.

Only when it was unpopular did it do the former.

Popularity was brought about by its being templated, packaged, and sold, thus bringing it into the status quo's portfolio.

That is ignoring that a sufficiently motivated actor can ensure that doesn't happen.

In one of my private trackers there is a person with a seedbox that downloads every single torrent as soon as it is uploaded, and they have been doing so for quite a few years now.

This ensures that while some things will indeed, be seeded more, nothing quite vanishes.

Then again, the media on that specific tracker is fairly small, so it is not prohibitively expensive to archive everything. One raw Blu-ray ISO of a movie elsewhere could equal thousands upon thousands of torrents on that specific tracker.

Maybe something of utility would be creating a distributed torrent system that is a bit more closely tied to the tracker. Where membership would require you to integrate to the swarm by automatically downloading a percentage of the entire corpus, ensuring the health of the tracker.

So a new peer would be bearing part of the load of having everything be accessible.

I think this would require decently heavy curation, but I could see how it could be useful for something like the OP specifically, where having scientific papers lost for good would be a shame.

There were a lot of trackers where you had rating, which would depend on the range of titles available from you and your prowess at disseminating them. But really, torrents work just fine even without those measures, even for niche content.

The absence of rules is anarchic, but not democratic. In anarchy, the powerful coerce, manipulate, and otherwise dominate the masses, creating a status quo that the powerful desire, and abusing the weak without restraint. Historically, the outcome is despots, warlords, feudalism, and brutality. In democracy everyone has an equal vote and equal rights, and it requires a system of rules.

Many had the same hopes for the Internet and social media, for example. But when these things became valuable - influential - powerful interests acted to control and manipulate them, to obtain money, political power and social outcomes. It's hard to claim that the results are that people are choosing information that is "the best or most useful".

I think politics and social outcomes, such as status quos, are unavoidable results of human interaction. Eliminating rules eliminates the protection against arbitrary power and returns us to the world of despots. The politics is unavoidable; the question is, how do we want to manage it?

EDIT: Some major edits; sorry if you read an earlier version.

Fwiw, modern anarchism tends to emphasize consensus decision making and building strong process; the Greek root after all is "without leaders," not "without rules."

Modern democracy, with its focus on leaders and representation, still gives us extreme power and economic imbalances and is arguably a barely disguised oligarchy.

I think we agree that getting rid of all the rules is a bad thing, though.

For me torrenting is mostly just about its stigma for being illegal on the one side and its very competitive performance on the other side.

So as soon as someone distributes some data via a torrent, everybody starts asking if it is legal to use that data. When the data is offered via a download link on some website, most people assume that they got the data through a legal channel.

> For me torrenting is mostly just about its stigma for being illegal on the one side

There are still legitimate reasons to use torrents apart from piracy. I particularly like the physibles section of TPB https://thepiratebay.org/browse/605 which are not inherently illegal (well apart from the 3D printable guns which are a bit iffy and in a legal gray area).

Though they do tend to be largely useless. Like some banned books, people download them just to have them.

I work with a slowly changing dataset that's about 100GB to download in full. A few people a week download it.

I've considered adding a torrent download, because it includes built-in verification of the download. A common problem is users reporting that their download over HTTP is corrupt, but I'm not sure if they'd be able or want to use Bittorrent.
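For anyone wondering what that built-in verification amounts to: BitTorrent (v1) metadata carries a SHA-1 digest for each fixed-size piece, so a client can tell which pieces are bad and re-fetch only those. A minimal Python sketch of the idea follows; the 4 KiB piece size and the in-memory "dataset" are stand-ins, and real torrents typically use much larger pieces.

```python
import hashlib

def piece_hashes(data: bytes, piece_len: int):
    """SHA-1 digest of each fixed-size piece (the last piece may be shorter)."""
    return [hashlib.sha1(data[i:i + piece_len]).digest()
            for i in range(0, len(data), piece_len)]

def corrupt_pieces(data: bytes, expected, piece_len: int):
    """Indices of pieces whose digest does not match the published one."""
    actual = piece_hashes(data, piece_len)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

original = bytes(range(256)) * 64          # 16 KiB stand-in for the dataset
published = piece_hashes(original, 4096)   # what the publisher would ship

damaged = bytearray(original)
damaged[5000] ^= 0xFF                      # flip one byte, which lands in piece 1
print(corrupt_pieces(bytes(damaged), published, 4096))
```

Publishing a plain list of per-chunk checksums next to the HTTP download would give users the same diagnosis without asking them to install a torrent client.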

(Also, for many users the download is probably fine, but they can't open it in Excel. Bittorrent won't help that. )

You might want to look into something like syncthing. It uses a distributed protocol like torrents but also supports updating files to new versions while only resyncing the diff and sharing folders. Open source, donation-funded.


The torrent download sounds like a great application for this -- hopefully it can take some strain off your servers as well. Linux distributions mostly do this. Only thing is how you want to deal with updates to the dataset.

For 100GB of important data I'd be willing to buy an extra hard drive.

I'd expect your users to be willing to set up a torrent client. It's not even difficult.

It's not that difficult, but it can be scary.

The few times I've published files over bittorrent I've had to reassure people that torrenting itself isn't any more illegal than other download methods.

It's also not clear ahead of time how difficult it's going to be.

If torrenting is the only way, some people won't bother.

That's what I'm concerned with; tarnishing a good reputation with a "scary" word.

Bandwidth is no problem for us, we have a faster connection than all the users. Users in Africa and Latin America would probably benefit most, but I'd need to research whether they'd be prepared to use Bittorrent before implementing this.

What would really help torrenting would be to build it into browsers as an alternative http(s) transport for large files (instead of/to augment CDNs). If Warner and Sony found themselves automatically torrenting their content and their bandwidth bills falling would they change their attitude?

I'm quite sure it would not at all change their position on torrenting copyrighted works without permission.

I definitely agree with that, but they have pushed a narrative that "torrenting is inherently bad and should be blocked by ISPs/firewalls/etc" which means it's hard to use as a transport for software updates, video delivery and the like.

Why does torrenting have to be all these things? Why can't it just be a distributed file sharing scheme?

It is a file sharing scheme, these are just the human dynamics that stem from it.

All the .torrent files are served over http so with a simple MITM attack a bad actor could swap in their own custom tweaked version of any data set here in order to achieve whatever goals that might serve for the bad actor's interests.
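Until the site serves everything over HTTPS, one stopgap is to compare a downloaded .torrent against a digest obtained through some other channel you trust. A minimal Python sketch; the function names and the sample bytes are illustrative, and this only helps if the published digest itself cannot be tampered with in transit.

```python
import hashlib
import hmac

def sha256_hex(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

def matches_published_digest(blob: bytes, published_hex: str) -> bool:
    """True if blob hashes to the digest obtained over a separate, trusted channel."""
    return hmac.compare_digest(sha256_hex(blob), published_hex.strip().lower())

# Stand-in for a downloaded .torrent file and the digest its publisher announced.
torrent_bytes = b"d4:infod4:name8:data.bin12:piece lengthi16384eee"
trusted_digest = sha256_hex(torrent_bytes)

tampered = torrent_bytes.replace(b"data.bin", b"evil.bin")
print(matches_published_digest(torrent_bytes, trusted_digest))
print(matches_published_digest(tampered, trusted_digest))
```

Swapping the .torrent swaps the piece hashes too, so the client would happily "verify" the attacker's data; that is why the metadata file is the thing to protect.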

I really wish we could get basic security concepts added to the default curriculum for grade schoolers. You shouldn't need a PhD in computer security to know this stuff. These site creators have PhDs in other fields, but obviously no concept of security. This stuff should be basic literacy for everyone.

> This stuff should be basic literacy for everyone.

Arguably, one compromised PKI x.509 CA jeopardizes all SSL/TLS channel sec if there's no certificate pinning and an alternate channel for distributing signed cert fingerprints (cryptographically signed hashes).

We could teach blockchain and cryptocurrency principles: private/secret key, public key, hash verification; there's money on the table.

GPG presumes secure key distribution (`gpg --verify .asc`).

TUF is designed to survive certain role key compromises. https://theupdateframework.github.io

Perfect use case for http://datproject.org/. It has git versioning on top of bittorrent, so if something gets updated in the dataset you only download the diff (unlike torrent).

Does anyone understand the reasoning behind this statement:

> We would like to avoid the blind mirroring of all data.

Found at http://academictorrents.com/about.php#mirroring

Sometimes people upload 1TB files which are not intended to be mirrored or are not of interest to many people. We don't want people who donate hosting to mirror this content unless they really want to. But we also want to make it easy and automatic to mirror content. Using collections, which each have an RSS feed, content can be curated by someone you trust to decide what should be mirrored. I curate many collections including video lectures, deep learning, and medical datasets.
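As a sketch of how a seedbox could follow a collection feed, here is a small Python example using only the standard library. The feed below is a made-up RSS 2.0 sample; the real collection feeds may use different fields, and actually fetching the feed and handing links to a torrent client is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed contents; a real collection feed may look different.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>example-collection</title>
    <item><title>dataset-a</title><link>https://example.org/a.torrent</link></item>
    <item><title>dataset-b</title><link>https://example.org/b.torrent</link></item>
  </channel>
</rss>"""

def new_entries(feed_xml, already_mirrored):
    """Return (title, link) pairs from the feed that are not mirrored yet."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if link and link not in already_mirrored:
            entries.append((title, link))
    return entries

print(new_entries(SAMPLE_FEED, {"https://example.org/a.torrent"}))
```

Many torrent clients can subscribe to RSS feeds directly, so a script like this is mainly useful if you want to apply your own filter (size cap, keyword, curator) before mirroring.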

Got it. Not trying to knock the hard work put into this, I'm actually thrilled to see this initiative and only intend to be constructive. Personally, I would rather be asked to trust that site admins were auditing each torrent to ensure it at least looks legitimate, before passing final legal responsibility on to me as a seeder. Leaving users to identify contributors they can trust to never include "Pirated Movie 2018" into their donated seedbox sounds like quite a hurdle to attracting new seeders willing to participate in a "legitimate bittorrent use case" project.

We perform our own audits of all data and request a justification why it is academic data if it appears to be "Movie 2018". We think the collections model is the best balance between a walled garden and zero censorship. You can be assured that no collections curated by me will have a bad torrent in it! Here are a few:






Isn't this the purpose of IPFS? You get to host what you want.

Are the datasets all legit? For instance, this looks like a quarterly scrape of Reddit in full:


I'm pretty sure all the Twitter datasets violate the Twitter ToS.

On a quick pass of the Twitter datasets, they all seem to conform to Twitter's developer Terms.

Like the requirement that you have to delete tweets in datasets that have been deleted on twitter?

As far as I could tell, none of them actually contain tweets (e.g. any JSON), just IDs, and mostly user IDs at that.

>Are the datasets all legit?

I mean Academia has destroyed the scientific method, turning it into:

Who needs a PhD and what does your Professor want to prove true?

I've started to ONLY trust industry.

Because industry is dedicated to finding truth?

Not all, some are non-free and commercially licensed.

What makes them "non-legit"?

Seems like you can’t upload unless you have an account registered with an academic email address.

The form I saw didn't ask for an academic email address. Just an email address.

Legit how?

Legit as in not subject to firms potentially coming after them because they're distributing their data. (I've no idea what Reddit's terms are but I wouldn't be surprised if they had an issue or two with a dump of their historical data being available for download free of charge on a 3rd party website.)

Mostly - This repository is for data hoarders and archivists. They don't necessarily care whether it is legit or legal. The goal is to harvest the most quality data.

My browser reports the "create an account" page is not secure, so maybe best not to use this as an uploader at least until they fix that. For the creator of the site: pages that collect passwords should be served over https.

All pages should be served over HTTPS. It's not only about keeping secrets.

Good point.

It's good that more than one way exists for something like this, though I personally prefer something like Zenodo, where every record automatically gets a DOI attached.

(Zenodo is limited to 50GB though)

"gta_full_dist.tar" seems to be one of the biggest "datasets" featured on here. funny this data business.

Did you actually check it? It has nothing to do with piracy or video games, as you seem to imply.

I have been wondering why this hasn't existed for years! Thank you guys for making this. Long awaited.

This has existed for years.

I have wondered for years... apologies for the poor grammar

I can't find any info on the website about how the data is hosted, so I am wondering whether it works like The Pirate Bay or whether it also hosts data itself. If the former is the case, it will be hard for researchers to use and share. One reason is that academic institutions nowadays have tightened control on net access, which definitely hinders hosting large amounts of data shared over the BT protocol; second, because research is often fragmented, which by itself limits the number of interested users, the sharing falls on a few people's goodwill.

It's in the name - Academic 'Torrents'. They just host the torrent files, which are only a couple hundred kilobytes. I feel like you're sort of missing the point of a service like this: it's not to provide potentially the fastest download available, but to ensure data is accessible, even if the original download source is unavailable or is inaccessible for certain people or locations.

As long as you're not downloading copyrighted data there should be no issue with using the BT protocol on a company or academic network, provided there is no outright ban on the protocol in your network usage policy. The BT protocol actually lends itself quite well to large datasets such as those hosted here, thanks to its inbuilt error checking (so no more spending hours downloading a huge dataset only to find your connection did something silly for a second and corrupted the whole file). It can also provide much faster download speeds on popular files due to the number of peers available, whereas a normal hosting arrangement would likely provide slower speeds on popular files due to network congestion and file access speeds.
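The built-in error checking mentioned above comes from per-piece hashing: in the original BitTorrent protocol, the torrent file carries a SHA-1 hash for each fixed-size piece, so a client only needs to re-download the pieces that fail the check. A minimal Python sketch of the idea, with hypothetical function names and working on in-memory bytes rather than a real torrent:

```python
import hashlib


def piece_hashes(data, piece_length):
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def corrupt_pieces(data, piece_length, expected):
    """Return the indexes of pieces whose SHA-1 does not match the expected list."""
    actual = piece_hashes(data, piece_length)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Illustration only: a 1024-byte "file" split into four 256-byte pieces.
original = bytes(range(256)) * 4
expected = piece_hashes(original, 256)

damaged = bytearray(original)
damaged[600] ^= 0xFF  # flip one byte, which falls inside piece index 2

print(corrupt_pieces(bytes(damaged), 256, expected))
```

Only the piece containing the flipped byte is reported, which is why a glitch mid-transfer costs one piece rather than the whole download.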

P2P is banned in the French academic network. I expect the same in neighboring countries, although I didn't get to review their terms of service.

I'd guess if a researcher has to use P2P there's probably a way for them to get the data / get whitelisted. I'm pretty sure the "P2P is banned" rule is mostly aimed at downloading copyright-infringing content.

It's more than copyright infringement. Once you start having a few students or researchers deploying P2P, it's going to saturate the dedicated 10 Gb links very quickly.

P2P will consume any amount of upload bandwidth available. It's horrendous to have inside your network, as a university or research center.

It's not much different from any other service: if you don't limit it or use it in moderation, it'll saturate the network. If someone hosted some Linux ISOs over HTTP, like some universities do for Linux distributions, it would have the same effect.

You could limit upload to be inside the network. P2P is not horrendous to have inside your network; it can save you a lot of download-link bandwidth if people inside the network share files with each other.
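A sketch of what "limit upload to be inside the network" could mean as a decision rule, in Python. The address ranges and function names are hypothetical, and in practice this kind of restriction would more likely live in firewall rules or client configuration than in a script:

```python
import ipaddress

# Hypothetical campus address ranges; substitute the real ones.
CAMPUS_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def is_internal(peer_ip, networks=CAMPUS_NETWORKS):
    """Return True if the peer address falls inside any campus range."""
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in net for net in networks)


def upload_allowed_peers(peers):
    """Keep only the peers we would be willing to upload to."""
    return [p for p in peers if is_internal(p)]


print(upload_allowed_peers(["10.1.2.3", "8.8.8.8", "192.0.2.77"]))
```

With a rule like this, downloads can still come from anywhere while upload traffic stays on the local network.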

Torrents are the most effective, reliable and convenient way to distribute large files; their adoption shouldn't be blocked by bad configuration and policies.

Universities are often seeding Linux distro torrents themselves, I'm sure there is a way around that.

Linux packages are distributed by HTTP or FTP. There can be public mirrors handled by the network operator or some universities.

It benefits everyone because the automatic mirror selection in Linux distributions picks the lowest latency mirror automatically. That means all OS running in the academic network will pick the public academic mirror, since it's the closest.

I'm talking about distro ISOs, not about package manager repositories.

> P2P is banned in the French academic network. I expect the same in neighboring countries

There's no Janet policy (UK academic network) against P2P, although there are institutional policies in place (typically for student residential networks).

To be honest it would be difficult to have a policy that wouldn't impinge on some of the more unusual protocols used in research.

Has never been a problem at my German university (don't know if it is theoretically forbidden or not). Sure, if you misbehaved you'd get in trouble with the sysops, but as long as you don't break the network or do something outsiders loudly complain to the university about, nobody cares.

"P2P" is a broad term which could include anything federated. Does Tribler [1] work?

[1] https://www.tribler.org

They are all banned by the terms of usage. Some of the popular ones are also blocked by technical means.

Based on the sponsors I'd say a lot of the content is hosted by some seedbox companies so you wouldn't have to worry about people seeding at the beginning or on slow connections that much.

The data is hosted by the community, and we also coordinate hosting from our sponsors (listed on the home page).

We work with academic institutions to ensure they allow this service. Please report universities which block the service using the feedback button shown in the lower right of the webpage.

We also encourage uploaders to specify HTTP seeds (aka url-lists), which offer a backup URL that can be contacted automatically if BT is blocked. We also offer a Python API, written in pure Python and designed for clusters and university computers, which supports HTTP seeds: https://github.com/AcademicTorrents/python-r-api
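For readers unfamiliar with HTTP seeds: the `url-list` key sits at the top level of the torrent metainfo, next to `announce` and `info`. The Python sketch below builds a toy metainfo dictionary and bencodes it. The tracker URL, mirror URL, file name, and piece hashes are placeholders, and `bencode` is a minimal encoder written for this example, not the API linked above.

```python
def bencode(value):
    """Minimal bencode encoder for ints, strings, bytes, lists and dicts."""
    if isinstance(value, int):
        return b"i" + str(value).encode() + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode() + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must be emitted in sorted order.
        items = sorted(
            (k.encode("utf-8") if isinstance(k, str) else k, v)
            for k, v in value.items()
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


# Placeholder values throughout; only the key names follow the torrent format.
metainfo = {
    "announce": "http://tracker.example.org/announce",
    "url-list": ["https://mirror.example.org/data/example.bin"],
    "info": {
        "name": "example.bin",
        "length": 1024,
        "piece length": 256,
        "pieces": b"\x00" * 80,  # stand-in for 4 pieces x 20-byte SHA-1 hashes
    },
}

encoded = bencode(metainfo)
print(b"8:url-list" in encoded)
```

A client that understands web seeds can fetch pieces from the listed URL over plain HTTP and still verify them against the piece hashes.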

The name is academic torrents, I can assure you that this is P2P.

I have some very big concerns about this.

1. It appears to be sponsored by seedbox hosting companies -plus- a google ad. This is misleading (no, it is -not- directly sponsored nor endorsed by Salesforce, which is the Google Ad I see).

2. Many higher education institutions will block BitTorrent on their firewalls to prevent/reduce copyright infringement.

3. How legitimate is the data? Is there any vetting of the content to ensure that it doesn't violate copyright or that the data was legally obtained (e.g. the site scrapes)? A DMCA takedown is too late if we've already accidentally seeded infringing information, and it could harm our reputation.

4. The site claims to be "used by" a group of very big names (Stanford, MIT, UT Austin etc). Did they ask/give permission to be cited? Do they endorse the use of this service?

5. HTTPS. Please?

It's a great idea but it needs a bit more polish before I could even suggest this to my management.

Anyone know why they opted not to use WebTorrent for this? Obviously straight BitTorrent is more battle-tested, but the extra friction of having to know how to work a BT client is non-trivial.

I’d call it trivial. My 60 year old uncle who shouts at his computer uses BitTorrent. Anyone who wants these files will be able to figure it out.

The search on this would be better if you could browse by subject. On my Firefox, the checkboxes (for type of resource) don't highlight.

I can see some overlap with Internet 2.

great resource, thanks for the link

The papers section is full of Trivedi Effect papers. It gets tiresome to filter them out.

great resource!

I see this is a centralized service. Is there protection from DMCA takedowns?

For example, if someone uploads a bundle of 50 N64 ROMs, and someone tries to DMCA that link so that this service no longer provides an index to it, is there censorship resistance?

This is almost exactly what I need: A distributed index of massive datasets. I've been building a web browser that can source from such content. The idea is that you write a website which refers to some resources by SHA256, and anyone else running the browser will transmit the resource to you if they have it.

That would let you build an emulator which can play any ROM in history, without having to explicitly download the ROMs. It's equivalent to clicking on a link to "Super mario brothers" and then seeing it play immediately in your browser. No explicit downloading.

From a long term standpoint, the vision is that you can build whatever games you want, using whatever assets you want, and nobody can tell you that you can't.

So it was funny to see this service pop up, because it's nearly an identical use case: "I have some data (ROMs), I want to make it available to everyone else (people running the emulator), and it's decentralized so no one can say no (bittorrent)." But that raises the question of scope, or whether such use cases would be welcome.
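The "refer to resources by SHA256" idea described above can be sketched in a few lines of Python. Everything here is hypothetical (the `ContentStore` class, `fetch_verified`, and the in-memory peers); it only shows why content addressing lets you accept data from untrusted peers: the requester recomputes the hash and rejects anything that does not match.

```python
import hashlib


class ContentStore:
    """A toy peer that stores blobs keyed by their SHA-256 hex digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest):
        return self._blobs.get(digest)


def fetch_verified(digest, peers):
    """Ask each peer in turn; accept the first blob whose hash matches."""
    for peer in peers:
        data = peer.get(digest)
        if data is not None and hashlib.sha256(data).hexdigest() == digest:
            return data
    return None  # nobody had a valid copy


honest = ContentStore()
key = honest.put(b"example resource bytes")

# A misbehaving peer that answers with the wrong bytes for the same key.
tampering = ContentStore()
tampering._blobs[key] = b"something else"

print(fetch_verified(key, [tampering, honest]) == b"example resource bytes")
```

Integrity comes for free with this scheme, but it does nothing for anonymity or for the scope question raised above.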

This project isn't really what you're looking for. ROMs are more program than data, and Academic Torrents in particular is scoped to scientific datasets. What you're describing is more akin to Olive, bwFLA, or the Internet Archive's JavaScript emulators [0]. For any of these, I don't know in particular if anyone has attempted to create a decentralized store of ROMs/disk images, although the general rule in digital preservation is to have at least 3 copies stored in different locations. The Internet Archive might have some redundancy for ROMs via the IA.bak project, which uses git-annex to distribute Internet Archive data across multiple machines [1].

[0] https://mellon.org/media/filer_public/0c/3e/0c3eee7d-4166-4b...

[1] http://iabak.archiveteam.org/

Rather than building your own browser, take a look at Beaker Browser.

Thanks! Unfortunately Dat doesn't provide any anonymity.

That's true, and a worthwhile consideration if you're doing illegal things. If you route your connection through Tor you'd be good to go, right?

Yes, but users can't be expected to know how to do that. And my users are the ones who want to play ROMs.

Maybe it's possible to bundle Tor with the executable though, and have the network connections automatically route through Tor.

I was hoping for the whole 18TB Google Open Images in one torrent, but alas, they only offer metadata for download, same as Google itself. What a bear of a dataset it is to obtain and work with.

FINALLY, the inability to browse was my biggest issue with scihub.

All these sheltered academics were heralding it as the best thing ever, without even the basic user experience expected this decade.

Yes, I know about the Chrome extensions.

Because browsing is provided by other services already. The usual use case for sci-hub is: find an article you're interested in, find it's super expensive, copy the DOI into sci-hub.

It would be cool if sci-hub came with a catalogue, but it's very usable without it. And it's not just sheltered academics (or even just academics) that are using it.
