Hacker News
Academic Torrents (academictorrents.com)
297 points by julianj 8 months ago | 32 comments

Yeah, it could really benefit from some organizational work, like on more mature music torrent trackers: categories, mandatory tags, unified naming, review by community-chosen per-category moderators. In its current state it's basically a file dump: either you have the direct link, or you can only hope to stumble on something interesting. Not much better than sharing magnet links via public pastebin records...

One very interesting thing I wish were studied in depth is the virtual economies of mature trackers. Limiting access to resources and granting increased access in return for contributions and quality corrections has in places been extremely successful. It is interesting to see the varying quality and the associated economic mechanics.

Some environments based just on prestige have big problems with toxicity (StackOverflow, Wikipedia), which I didn't see at all in some music trackers.

Wikipedia does cover that issue. Competing views are difficult to reconcile.


(using a version of the article from ten years ago because everything is unnecessarily verbose on wikipedia now)

I'm not sure what the point of quoting that is, really. I guess if you subscribe to the idea that reality is somehow modified by your age, sex, race, education or whatever the heck, then it has some relevance. But then the whole idea behind an encyclopedia seems pointless, and we should each just maintain our own unique knowledge bases, as they will have no relevance to anyone other than us.

That an article like that exists is patently absurd in my view and kind of makes me a bit ill. Things like that are what led to this: https://www.youtube.com/watch?v=C9SiRNibD14

I really firmly believe that if you think there is a European (?) science and an African science, and that they are distinct and equally valid, then either you or I do not belong on Wikipedia, and I would actually like Wikipedia to clarify its mission in this light.

I don't see the point of linking that either, but your "reality is neutral" argument is severely flawed. Wikipedia doesn't cover merely technical topics. Obviously there's not going to be a problem with systemic bias in an article on merge sort, but you don't think there's a potential issue with mainly wealthier, whiter, younger people writing articles on topics, for example, related to the history of colonialism? Think about how drastically perspectives on figures like Christopher Columbus have changed over just the last generation from bringing more diverse viewpoints into the conversation. Hell, we demonstrably see this today on the Japanese language Wikipedia with topics like the Nanking Massacre.

> Think about how drastically perspectives on figures like Christopher Columbus have changed over just the last generation from bringing more diverse viewpoints into the conversation.

In my view Wikipedia should not be a repository of value judgements or of specific values one should adopt. Perspectives on Christopher Columbus are important and should be included, but in no manner should those perspectives be presented as incontrovertible, or as anything other than value judgements from specific points of view. I think it is valuable to understand the European perspective and the Native American perspectives on events, both at the time and throughout the following centuries.

But I don't think Wikipedia should be telling me I must think what Columbus did was good or bad - Wikipedia should not be trying to teach me morality - and as long as it does not do that I don't see how there is any problem with what topics Wikipedia covers and who writes it.

I think the only problem comes in when you attempt to do something which is impossible - like incorporate something which is fundamentally specific to specific people (morality) into something which purports to be valid for everyone.

Those judgments appear organically through mechanisms as simple as how much coverage a topic gets. The worst case is that a bunch of circa 1900 Europeans write this article https://en.wikipedia.org/wiki/Population_history_of_indigeno... and the impact of colonialism is mentioned in half a footnote rather than taking up the bulk of the article. If systemic bias were completely unchecked, entire articles might not exist.

You might also be interested in reading https://en.wikipedia.org/wiki/Chinese_Wikipedia#Self-censors...

I honestly struggle to reconcile what I read in the linked wiki article with what your comment mentions. "Systemic Bias"[1] doesn't seem to match with "reality is modified by your age, your...".

One can understand a possible path that goes "xyz information source is biased", "xyz info source isn't suitable for abc group", and "xyz info source is specific to xyz people, we need our own abc source". However, that seems to require a few assumptions? And still isn't as negative as that youtube video linked.

Would appreciate if you could elucidate on your views.

[1] (please forgive the scare quotes)

Quote from the article:

> The average Wikipedian on the English Wikipedia is ... (some characteristics)

This builds to conclusion:

> The systemic bias of the English Wikipedia is permanent. As long as the demographic of English speaking Wikipedians is not identical to the world's demographic composition, the version of the world presented in the English Wikipedia will always be the Anglophone Wikipedian's version of the world.

I don't see how you get to that conclusion from the premise other than by thinking that reality is modified by personal characteristics.

If there is an Anglophone Wikipedian's version of the world which includes things like gravity and science, then it is not valid for Africa (as the woman in the video is expressing), since Africa is not part of the Anglophone world... not sure what about this is not clear.

And it absolutely is as bad as that youtube video I linked - you think that poor unfortunate woman came up with that drivel on her own? She is not nearly dumb enough - no single person can be that stupid.

You need years of academic circle jerking and hand picking of the dumbest arguments from the dumbest people to come up with something that stupid.

That definitely is an interesting issue that could be studied. From a practical perspective, speaking of this particular torrent tracker, I wouldn't speculate much and would just (more or less) copy the organizational structure of some tracker I know and see if it works. (I assume some adjustments would be needed: the people are different, the content is different, and whatever else I'm not thinking of will turn out to be different.)

But if I were to speculate, I guess it always propagates from the top. The point is that the visible community you can speak of is not randomly drawn from the user base, and the user base is people who just want to use the product, not to play corporate mechanics. If in the end the goals of the general public are somewhat aligned with those of the internal community of ladder-climbers, it works out fine. Otherwise it doesn't.

(And, by the way, ladder-climbers in most of these communities tend not to be the nicest people by default... Let's just say, they are Dwight. So if you let them do stuff that is not desirable for the general community, they will.)

I think the StackOverflow philosophy is flawed by design; the main point of user frustration has always been that questions they very much need answered get closed as "too broad", "opinion-based" or something of the sort. Dwights love to exercise their power by noticing that something can be closed as "not a good fit for this site", and users who want that stuff discussed obviously hate it. That is something that could have been fixed from the top, but the top specifically wanted it this way.

Wikipedia is similar, but users and Dwights stand even further apart, since the general user doesn't even make an account to edit, doesn't look at who makes the edits, and doesn't know the internal playground. The main point of frustration here is a user who knows his stuff well and wants to share the knowledge, but is shut down by a Dwight because the subject is "of low importance" to him. This infuriates the user even more, considering that there are thousands of articles about some fucking Harry Potter-universe pokemon or whatever, which, naturally, raises no issue with Dwights, because they are Dwights and they love this stuff. This, too, is something to be solved organizationally from the very top.

Music trackers are way more meritocratic. The people who eventually get to be moderators can be formalistic or not (it varies), but they generally just want a lot of music on the tracker in a well-organised manner, and this is exactly what the general public wants! It's another question how the platform motivates them to contribute so much (involvement sometimes seems to be much harder work than on Wikipedia), but the point is that they really do contribute useful stuff.

Also, music trackers tend to be way more liberal (in the sense of allowing freedom, not of being left-wing politically; ironically, quite the opposite is true nowadays). Nobody cares if somebody is rude, racist or whatever; if an off-topic flamewar goes over the top, the whole thread goes down. Otherwise, you can post whatever you want, nobody gives a shit, and nobody is pressured by the media to do something about it. After all, unlike Twitter, Reddit or StackOverflow, they aren't traded on the stock market.

We have collections which I guess should be featured on the front page. https://academictorrents.com/collections.php


Using RSS to allow mirrors to host different subjects is really clever, although some of the categories seem quite large (>5TB). It may be worth breaking up each category (sharding) to keep each to 100GB or less so a volunteer can pick a couple and not worry about running out of disk when a category grows.

Then it would be good to track how many seeds each category-shard has so volunteers can help where it's most needed.
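The packing step itself is simple to sketch. A greedy first-fit pass in Python, sorting torrents largest-first so big items don't strand half-empty shards (torrent names and sizes below are invented for illustration):

```python
SHARD_LIMIT = 100 * 10**9  # 100 GB in bytes

def shard_category(torrents, limit=SHARD_LIMIT):
    """torrents: list of (name, size_bytes) tuples.
    Returns a list of shards, each a list of (name, size_bytes).
    An item larger than the limit ends up in a shard of its own,
    since sorting puts it first and the next item forces a flush."""
    shards = []
    current, current_size = [], 0
    for name, size in sorted(torrents, key=lambda t: -t[1]):
        if current and current_size + size > limit:
            shards.append(current)
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        shards.append(current)
    return shards

# Hypothetical category contents:
category = [("dataset-a", 60 * 10**9), ("dataset-b", 55 * 10**9),
            ("dataset-c", 30 * 10**9), ("dataset-d", 250 * 10**9)]
for i, shard in enumerate(shard_category(category)):
    print(i, [n for n, _ in shard], sum(s for _, s in shard))
```

Seed counts per shard could then be tracked against this same partition, so volunteers know which shards need help.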

Some individual items are multiple TB, which would make 100GB shards a little difficult.

I wish I could add Gwern's Danbooru dataset [0] here - 2.7TB of labeled anime images. But they only support torrent files up to 10MB, and that's over 20MB for the full dataset or 12MB for the SFW low-rez set...

Incidentally, when the torrent file for your anime image collection passes 20MB, something has obviously gone very w̵r̵o̵n̵g̵ right.

0: https://www.gwern.net/Danbooru2019

I should probably point out that this dataset has been used for some machine learning tech demos in the past, for example This Waifu Does Not Exist [0], a StyleGAN-based automatic anime portrait generation tool. So it's not completely outside of what the site already hosts...

0: https://www.thiswaifudoesnotexist.net/

Cool idea, it is impressive that it is still around - alas it is flawed the same way all scientific data is flawed.

There is no metadata - all you have is an awkward imprecise textual search of the abstract that comes with the data. Good luck hosting the world's data that way.

One nice thing about digital data, as opposed to physical artefacts, is that you don’t need to keep digital data’s metadata attached to the data “at the hip.”

Through the magic of cryptographic hash algorithms, you can just keep your data sets floating around “raw” (like in these torrents), and then, elsewhere, ascribe metadata to the hash of the content it is meant to annotate.

Then, later, you can reassemble them in either order—either by first finding a data set, hashing it, and then looking up metadata in some metadata-hosting service; or by first browsing a catalogue of indexed metadata, finding out about a dataset that meets your needs, and then retrieving the data set by its hash.

Which is to say: with digital data, library science (creating metadata and chains-of-custody and indexing them for search) and archiving (ensuring access to pristine artifacts over time) don’t need to happen at the same time, in the same place. There can be separate “artifact hosting” and “metadata library” services. (Which is especially helpful in contexts where private IP is involved—you can still keep in your metadata library, the metadata for a data-set you don’t have the rights to; and those with the rights can go get the data-set themselves.)
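A minimal sketch of that separation, with the "artifact host" and "metadata library" as two independent stores keyed by the same content hash (the function names and fields here are illustrative, not any real service's API):

```python
import hashlib

artifact_store = {}    # sha256 hex digest -> raw bytes
metadata_library = {}  # sha256 hex digest -> metadata dict

def archive(data: bytes) -> str:
    """Store raw bytes under their content hash."""
    digest = hashlib.sha256(data).hexdigest()
    artifact_store[digest] = data
    return digest

def annotate(digest: str, **fields):
    """Ascribe metadata to a hash; works even if we never held the bytes."""
    metadata_library.setdefault(digest, {}).update(fields)

# Direction 1: have the data set, hash it, look up its metadata.
h = archive(b"some dataset contents")
annotate(h, title="Example dataset", year=2020)
print(metadata_library[h]["title"])

# Direction 2: browse the metadata catalogue first, then fetch by hash.
wanted = next(d for d, m in metadata_library.items() if m.get("year") == 2020)
print(artifact_store[wanted])
```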

> is that you don’t need to keep digital data’s metadata attached to the data “at the hip.”

You don't have to, but it's still mostly a good idea. But this stuff isn't either-or. We can have both.

This is especially true for research oriented files, where consumers are often unable or unwilling to maintain a functional metadata store, and do a lot of manual file handling. Saying "well, somebody could have set up a super-awesome metadata system that track this" doesn't magically make those resources exist.

This flexibility in time, specialization, and order of operation is surely one of the joys of modern digital collections.

Library scientists might say archiving and structuring and curation are all facets of that science. And you'll also want a hash search engine that finds related hashes, as there can be many revisions + versions, only some of which have some metadata.

Aaaand someone has to do the work for computing the index and annotating the hashes.

I think it's worth recognizing that this is a good first step in a hard problem. Hosting many TB of data for free isn't easy. Building an index on top of that data isn't easy either, and it looks like no such index exists today, but if someone decided to build that index they wouldn't need to worry about the hosting portion of the problem. That's a great starting point.

There is metadata. It is stored in bibtex along with every torrent. This format allows it to be a freeform database where the user can add fields as they want. We (Academic Torrents) can then build new ways to display this metadata. Also the "abstract" part of the metadata is rendered as markdown on the details page of a torrent. Here is a good example: https://academictorrents.com/details/d52ccc21455c7a82fd6e589...
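For illustration, a freeform entry of that shape might look like the following; the entry key, author, and the extra fields are invented here, and only the idea of user-added fields beyond the standard BibTeX ones comes from the description above:

```bibtex
@misc{example2020mri,
  title    = {Example Brain MRI Dataset},
  author   = {Doe, Jane},
  year     = {2020},
  url      = {https://academictorrents.com/},
  modality = {MRI},
  license  = {CC-BY-4.0}
}
```

Standard BibTeX tooling ignores fields it doesn't recognize, which is what makes this usable as a freeform per-torrent database.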

Ok, I see that there is code provided there. Better than nothing, but geez, it is not really what metadata should look like:

  def get_labels(rightside):
    met = {}
    met['brain'] = (
        1. * (rightside != 0).sum() / (rightside == 0).sum())
    met['tumor'] = (
        1. * (rightside > 2).sum() / ((rightside != 0).sum() + 1e-10))
    met['has_enough_brain'] = met['brain'] > 0.30
    met['has_tumor'] = met['tumor'] > 0.01
    return met

I will say that it is very handy to know exactly how the labels were computed.

What I really meant is a way to search and select data based on metadata. For example has_tumor.

Also note how everything is still one single blob; to get one line of any of the files, one would need to download everything.

Bittorrent does support partial downloads that request only some files or byte ranges out of a torrent. Some of the torrents are just compressed zips, but for the others you could look at the code / documentation to see which files are relevant before downloading 10GB of data.
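To see why whole-file downloads aren't forced by the format: a torrent is split into fixed-size pieces, so selecting one file reduces to simple arithmetic over its byte offset and the piece length from the metainfo (the sizes below are invented):

```python
def pieces_for_range(offset, length, piece_length):
    """Return the (first, last) piece indices covering a byte range
    that starts at `offset` and spans `length` bytes."""
    first = offset // piece_length
    last = (offset + length - 1) // piece_length
    return first, last

# Hypothetical example: a 512 MiB file sitting 10 GiB into a torrent
# that uses 4 MiB pieces.
piece_length = 4 * 2**20
file_offset = 10 * 2**30
file_length = 512 * 2**20

first, last = pieces_for_range(file_offset, file_length, piece_length)
print(first, last, last - first + 1)
```

A client then requests only those pieces (clients commonly expose this as per-file priorities) instead of the whole multi-gigabyte torrent.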

I think the abstract is sufficient for searching data; expecting some kind of smart database that can handle all the weird formats science uses is a bit much.

There are even torrent clients that export a FUSE VFS so you can use your standard tools.

> one would need to download everything

Just download it then. We got mp3 albums off Napster on modems back in the day, surely getting that torrent is easier and faster today.

To err is human, to forgive divine, to fix immortal.

Downloading some of this might be illegal. I see some entries that say "No license specified, the work may be protected by copyright."

This is amazing, really a great source of data.

