Hacker News
Academic Torrents (academictorrents.com)
448 points by yinghang 1361 days ago | 63 comments

One problem with offering a dataset as a torrent is that it's impossible to edit it after it's released. However, it seems like that doesn't matter at all in this case, because any scenario I can think of which could be solved by editing the dataset (like redacting private info that was accidentally included) wouldn't avoid the original problem: that they accidentally released private info in the first place. Perhaps it'd be useful to edit the original dataset in order to add to it / enhance it with more info, but in that case they could just release a second dataset as an addendum.

So the core idea seems solid. Thank you for this!

There are attempts to feel out a process for "updating torrents". However, this is far from becoming a standard practice in the BitTorrent ecosystem. Check this[0] out for more info.

[0] http://www.bittorrent.org/beps/bep_0039.html

> One problem with offering a dataset as a torrent is that it's impossible to edit it after it's released.

For scientific endeavours, this should be considered a feature, not a bug.

> but in that case they could just release a second dataset as an addendum

Or for some data it would make sense to partition the data into smaller chunks instead of one huge archive. That way adding a chunk (the new year's data for a multi-year dataset perhaps) just means releasing a new torrent containing the extra archive, with a name meaningful enough to indicate the difference. Anyone with the last set could then just download the new partition (and any modified ones).
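
A minimal sketch of the "download the new partition (and any modified ones)" step, assuming each release ships a manifest that maps partition names to content hashes. The manifest format, the file names, and the function name are all hypothetical; nothing in this thread says Academic Torrents publishes such a manifest.

```python
def partitions_to_fetch(have: dict[str, str], latest: dict[str, str]) -> list[str]:
    """Return partitions that are new or whose hash changed since the last release.

    `have` and `latest` map a partition name to a content hash (hypothetical manifest).
    """
    return sorted(
        name for name, digest in latest.items()
        if have.get(name) != digest  # missing locally, or modified upstream
    )


# Illustrative manifests: 2012 was added and 2011 was corrected.
previous = {"weather-2010.tar": "aa11", "weather-2011.tar": "bb22"}
current = {"weather-2010.tar": "aa11", "weather-2011.tar": "bb99", "weather-2012.tar": "cc33"}
needed = partitions_to_fetch(previous, current)
```

Here `needed` lists only the 2011 and 2012 archives; the unchanged 2010 archive is skipped.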

BitTorrent Sync might be useful for that:


I've used BT Sync for a couple of weeks to sync data between my own machines. It works neatly. One question, though: when you modify part of a big file, does the program send only the difference to the other authorized machines, or the entire file? Let's say a researcher exports her data to a 1 GB CSV file that interests me, and I download it. The following week she updates her CSV with more data, so it is now 1.01 GB. How big will my next download be?

Seems as though it supports patching so only the parts that are changed would be synced. Of course, the download size is completely dependent on what parts of the file were changed.
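
To make "dependent on what parts of the file were changed" concrete, here is a rough sketch of block-level change detection. This is not BT Sync's actual algorithm (which isn't described in this thread); the fixed block size and the function names are assumptions for illustration only.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # assumed block size; real sync tools choose their own


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """SHA-1 hex digest of each fixed-size block of `data`."""
    return [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def bytes_to_transfer(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> int:
    """Count the bytes of `new` that sit in blocks whose hash differs from `old`."""
    old_h = block_hashes(old, block_size)
    new_h = block_hashes(new, block_size)
    total = 0
    for idx, digest in enumerate(new_h):
        if idx >= len(old_h) or old_h[idx] != digest:
            start = idx * block_size
            total += len(new[start:start + block_size])
    return total


# Tiny demonstration with 4-byte blocks instead of 4 MiB ones.
old = bytes(range(10))
appended = bytes_to_transfer(old, old + b"\xff\xff\xff", block_size=4)
prepended = bytes_to_transfer(old, b"\xff" + old, block_size=4)
```

With this fixed-block scheme, appending rows to the end of the CSV only dirties the tail blocks, while inserting a single byte at the start shifts every block and forces a full re-download. Tools that want to survive insertions need something smarter, such as rolling checksums.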


Hopefully sharefest.me will be another alternative pretty soon.

If it's stored on ZFS, copy-on-write will let you edit a copy that stores only the changed blocks, and deduplication could give back even more space (if necessary and RAM permits).

The team should learn from the ghost town that is BioTorrents[1] and offer more than just a tracker.

[1] http://www.biotorrents.net/browse.php?incldead=1

That's one reason I'd prefer that academics just put data into some kind of local university archive, where possible. Many universities provide resources to host scientific data (and have done so for decades, since the days of ftp.dept.university.edu servers), and putting it there makes it more likely that it'll still be there in 10 years. Torrents by comparison tend to be: 1) slow, as you rely on random seeders rather than a university that's peered onto Internet2 or the LambdaRail; and 2) unreliably seeded, as people drop off. Plus the workflow of "curl -O URL" is nicer than torrenting.

Universities typically have great bandwidth and good peering, and already host much larger data repositories than this seems to be targeting (e.g. here's a 30-terabyte repository, http://gis.iu.edu/), so they should be able to provide space for your local scientific data. Complain if not!

Another alternative is something like the Dryad Digital Repository:


It's meant to include companion datasets for published papers, and gives out DOIs so datasets can be cited in other works. And it's mirrored at various universities to prevent loss.

Perhaps if universities would robustly seed their staff's and students' torrents?

Kind of solves a problem that doesn't exist, though, doesn't it? It isn't like these universities are crying about bandwidth costs, and it isn't like demand is maxing out their upstream.

It's not just about their bandwidth; some countries / zones / networks have much better local connectivity than external, particularly international.

For example, until a few years ago, some of our ISPs had different caps for national vs international traffic, and there were popular forks of P2P clients that allowed you to filter based on that.

We have since moved to unlimited everything, but I wouldn't be surprised if some countries still had different caps or speeds for international traffic.

I agree about university data.

But there is a need for a way to distribute large datasets that come out of nonacademic projects.

For example, the DBPedia data dumps are very slow to download at the moment.

You can have both and use a web seed with most clients.
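
For anyone wondering what that looks like in practice: as I understand BEP 19, a web seed is just a `url-list` key in the torrent's metainfo pointing at an ordinary HTTP copy of the data. A rough sketch follows, with a minimal hand-rolled bencoder; the tracker URL, the university URL, and the file name are made up for illustration.

```python
import hashlib


def bencode(value) -> bytes:
    """Minimal bencoder for ints, str/bytes, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must be emitted in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(name, data, announce, web_seeds, piece_length=262144):
    """Single-file metainfo dict with HTTP web seeds listed under 'url-list'."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    return {
        "announce": announce,
        "url-list": web_seeds,  # web seeds: plain HTTP mirrors of the same file
        "info": {
            "name": name,
            "length": len(data),
            "piece length": piece_length,
            "pieces": pieces,
        },
    }


# Hypothetical tracker and mirror URLs, tiny piece size to keep the demo small.
meta = build_metainfo(
    "weather-2011.csv",
    b"station,temp\n" * 100,
    "http://tracker.example.org/announce",
    ["http://data.example.edu/datasets/weather-2011.csv"],
    piece_length=512,
)
torrent_bytes = bencode(meta)
```

Clients that support web seeds fetch pieces from the HTTP URL alongside regular peers, so the university server acts as a seeder that never drops off.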

So, it's quite cheap to get a seeding box from LeaseWeb, in ascending levels of sophistication:

* 100 Mbps unmetered, 2x2 TB, 39 EUR/mo

* 1 Gbps unmetered, 24x2 TB, 349 EUR/mo

* 10 Gbps unmetered, 24x2 TB, 1089 EUR/mo

I'm tempted to grab the first, and open a Gittip account in case anyone wants to chip in towards the second (4 TB isn't a lot of space as far as this stuff goes). The third is unlikely to be useful; this stuff is long tail by its nature, so storage is probably more important.
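
Back-of-the-envelope arithmetic for why storage matters more than the port speed, assuming the link could be saturated around the clock for a 30-day month (an optimistic upper bound, not a measurement):

```python
def max_monthly_transfer_tb(link_mbps: float, days: int = 30) -> float:
    """Upper bound on data moved per month over a fully saturated link, in TB."""
    bytes_per_second = link_mbps * 1_000_000 / 8  # megabits/s -> bytes/s
    return bytes_per_second * 86_400 * days / 1e12


# The three link speeds quoted above.
tiers = {mbps: max_monthly_transfer_tb(mbps) for mbps in (100, 1_000, 10_000)}
```

Even the cheapest box could in theory push about 32 TB a month, roughly eight times its 4 TB of disk, so for rarely requested datasets the disk fills up long before the pipe does.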

Though in a world containing Google Fiber, would it still be a valuable service?

There's a university box seeding the torrent I'm grabbing (2011 weather patterns), but it still seems to be going quite slowly.

That does not sound cheap to me as a graduate student.

$50/month for a side project that I want to grow and nurture isn't a lot by any stretch, even on a GTA stipend. Given that you could subsidize it through your program, it becomes even cheaper.

This is really cool.

I just wish the messaging were clearer and told a story that I could pass on to my friends, who ultimately are "too busy" to think about the value of this product.

Unfortunately, "We've designed a distributed system for sharing enormous datasets - for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds" just isn't a story that I can tell my buddies to get them excited.

Thanks for the comment. We've created a shorter "pitch" style presentation for the non-technical / too-busy, which summarizes the benefits etc. in a description that takes only a few minutes.


Respectfully - that's an EVEN LONGER message. You need a sentence. Ten words, tops.

What do you use for the tracker?

Wow, this is pretty cool -- one of the most direct approaches to open-data that I've seen so far (and the research world is of course in dire need of this kind of open data/connect-the-dots enabling effort)!

I think it would be pretty cool to have trending datasets on the front page (I'm sure you could do a small cron job that would find the most-downloaded per-week/per-day/etc).
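
Something like this could be the core of that job. It's only a sketch: I'm assuming the tracker can hand over download events as (torrent id, timestamp) pairs, which may not match how the site actually logs things, and the ids below are invented.

```python
from collections import Counter
from datetime import datetime, timedelta


def trending(events, now, window=timedelta(days=7), top_n=10):
    """Most-downloaded torrent ids within `window` before `now`.

    `events` is an iterable of (torrent_id, timestamp) pairs (hypothetical log format).
    """
    cutoff = now - window
    counts = Counter(tid for tid, ts in events if cutoff <= ts <= now)
    return counts.most_common(top_n)


# Illustrative data: one download falls outside the one-week window.
now = datetime(2014, 1, 31)
events = [
    ("weather-2011", now - timedelta(days=1)),
    ("weather-2011", now - timedelta(days=2)),
    ("ngrams", now - timedelta(days=3)),
    ("ngrams", now - timedelta(days=30)),
]
top = trending(events, now)
```

Run it from cron once a day or once a week, changing `window` to match, and cache the result for the front page.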

Also, while not a dire necessity, I think a cooler name would help this project fly farther -- You should be able to make a play on "data torrents", maybe something like datastorm/samplerain/datawave/dataswell/Acadata?

Anyway, trivial stuff aside, nice implementation -- bookmarked for when I get the urge to do a data-analysis project!

Thanks! But this is not my project. It is something created by a grad student I met just a couple hours ago at a hack night discussion.

So what do I do if I want to seed them all? Also, are all the data sets (and other things) freely licensed, i.e. no “non-commercial use only” clauses or things of that nature? Can I count on this going forward?

A few TB of FOIA information related to the September 11th attacks is available via BT.

Direct link: http://911datasets.org/images/911datasets.org_all_torrents_J...

Excellent! It's far too early to tell, but I'd like to be hopeful that this distribution network could be another nail in the coffin of the old, expensive, dead-tree journals.

I guess you mean papers. Go back to reddit, libtard.

Yes, I do mean papers. I'm not on reddit, you mouth-breathing neanderthal.

Any reason passwords for user accounts are limited to 40 chars?

Projects like this confirm my suspicion that traditional academic publishing is going to take a nosedive in the next few years. Working in this industry as I do, I don't see commercial publishers moving quickly enough to change. Really love the idea of this and can't help but support the general ethos of it, even if it / its descendants will put a lot of us out of a job.

Brilliant idea if I understand it correctly. Just want to check that my use case would fit. I just submitted my first and main paper for my PhD to Icarus. I'm planning on soon uploading it to ArXiv as well. My paper is theoretical in nature and through a suite of Monte Carlo simulations I generated a few hundred MBs of data. Can I make use of this system as a way to deposit that data so that it's available to anyone that wants to verify the conclusions I reach in my paper and possibly extend the research?

Wow! That's a snappy site. Major props to the frontend dev(s).

Looks like just stock Bootstrap (not that there is anything wrong with that).

True. I guess ppl of academia are used to "different" quality/snappiness levels ;)

I'm surprised they don't have the Google Books n-gram dataset [1]. Then again, maybe they're more focused on data that doesn't have a good home already than on mirroring.

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....

Many of the datasets that I've seen in academia are stored in static SQL databases that tend to be about 10-20 terabytes. Where does this leave individuals with limited resources who would like to query large databases without having to juggle the data management side of research? Is there software that makes database querying accessible over P2P?

The problem is the word "torrent". Too many negative connotations for many in the traditional academic world.

And you’re writing this on “Hacker News”…

Maybe this can help in turning around those connotations.

I have an idle server with 500 Mb/s upload. Now I can finally put it to good use! :)

I remember the old days of DC++ whenever I hear blazing fast speeds.

Great! Looking forward to Coursera, edX and OCW videos too.

That'd be great actually. Especially for those who're not able to reach Coursera because of the stupid laws.

If you can get a cheap VPS in the US, you can use coursera downloader to grab all the content and then rsync it to your home country.

I'm able to reach Coursera just fine. I don't live in any of those countries (nor am I from any of them). I just thought it'd be nice to make them available to everyone, because that's the way it should be.

I use coursera downloader because it's hard to keep up with Coursera's own schedule. I already have a ton of materials from different courses on my computer and I would be happy to make them available to everyone, but my upload speed sucks.

We would need a significant number of seeders for this to become a widely used product. Perhaps universities can seed data?

This seems to be very focused on US academics; at least, that is the impression I get from the labeling of ".edu" addresses. It gives a feeling that these torrents/datasets are of better quality. I'm also missing a catalog on this tracker; some basic taxonomy would be most welcome...

I didn't get that impression. Are you referring to the ".edu" address of the creators of the site? Do you mean people with a ".edu" address, and therefore at an American institution, give you a sense of their work being higher quality?

I think he's referring to the "[edu]" label on the browsing pages (like [0]) which indicates that the uploader has a .edu email address. I'm not too sure about other countries, but at least in Germany, not many academic institutions actually have those, just normal .de ones.

[0] http://academictorrents.com/browse.php?cat=5

To clarify: torrents are marked "edu" if the user has a .edu address, and this makes those torrents stand out. The majority of non-US universities do not offer *.edu addresses to their staff and students.

Yes, you're right. Which then brings up the question of how to determine whether the data comes from an "academic" address. As was pointed out, only US institutions or institutions which are accredited by the US Dept. of Education can apply for a .edu domain, meaning nobody else has one.
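
One possible approach, sketched below, is to combine well-known academic suffixes with a hand-maintained allowlist for countries (like Germany) where universities use ordinary country-code domains. The suffix list and the allowlist here are short illustrative samples that I picked, not what the site actually uses, and a real list would need curation.

```python
# Illustrative samples only; a real deployment would need a curated list.
ACADEMIC_SUFFIXES = (".edu", ".ac.uk", ".ac.jp", ".edu.au")
ACADEMIC_DOMAINS = {"tum.de", "uni-heidelberg.de"}  # universities on plain ccTLDs


def is_academic(email: str) -> bool:
    """Guess whether an email address belongs to an academic institution."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    if domain.endswith(ACADEMIC_SUFFIXES):
        return True
    # Match the listed domain itself or any subdomain of it.
    return any(domain == d or domain.endswith("." + d) for d in ACADEMIC_DOMAINS)
```

For example, `is_academic("alice@cs.example.edu")` and `is_academic("bob@in.tum.de")` both return `True`, while `is_academic("carol@gmail.com")` returns `False`. The label would then read "academic" rather than "edu".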

I am no expert on torrents, but I like this conceptually. Publicly funded academic research should be free.

What a wonderful idea. This fits so well with the torrent protocol (maybe even philosophically speaking).

This is awesome! Thanks for sharing!

Awesome invention! Could this be connected with Google Scholar to add keyword searching?

Aaron Swartz's dream come true?


What exactly was his dream?

Thanks for sharing! I shall store it in the vault of hard drives I keep here in the desert.
