
Academic Torrents - julianj
http://academictorrents.com/
======
krick
Yeah, it could really benefit from some organizational work, like on more
mature music torrent trackers or such. Categories, mandatory tags, unified
names, reviewed by community-chosen category-wise moderators. In its current
state it's basically a file dump: either you have the direct link, or you can
only _hope_ to find something interesting. Not that much better than sharing
magnet links via public pastebin records...

~~~
colechristensen
One very interesting thing I wish would be studied in depth is the virtual
economies of mature trackers. Limiting access to resources and granting
increasing access for contributing and correcting quality has in places been
extremely successful. It is interesting to see the varying quality and
associated economic mechanics.

Some environments, based just on prestige, have big problems with toxicity
(StackOverflow, Wikipedia) which I didn't see _at all_ in some music trackers.

~~~
ryacko
Wikipedia does cover that issue. Competing views are difficult to reconcile.

[https://en.wikipedia.org/w/index.php?title=Wikipedia:Systemi...](https://en.wikipedia.org/w/index.php?title=Wikipedia:Systemic_bias&oldid=378210455)

(using a version of the article from ten years ago because everything is
unnecessarily verbose on wikipedia now)

~~~
ailideex
I'm not sure what the point of quoting that is really. I guess if you
subscribe to the idea that reality is somehow modified by your age, sex, race,
education or whatever the heck then it has some relevance but then the whole
idea behind an encyclopedia seems pointless and we should just each maintain
our own unique knowledge bases as they will have no relevance to someone other
than us.

That an article like that exists is patently absurd in my view and kind of
makes me a bit ill. Things like that are what led to this:
[https://www.youtube.com/watch?v=C9SiRNibD14](https://www.youtube.com/watch?v=C9SiRNibD14)

I really firmly believe that if you think there is a European (?) science and
an African science and they are distinct and equally valid then either you or
I do not belong on Wikipedia and I would actually like Wikipedia to clarify
their mission in this light.

~~~
lordlic
I don't see the point of linking that either, but your "reality is neutral"
argument is severely flawed. Wikipedia doesn't cover merely technical topics.
Obviously there's not going to be a problem with systemic bias in an article
on merge sort, but you don't think there's a potential issue with mainly
wealthier, whiter, younger people writing articles on topics, for example,
related to the history of colonialism? Think about how drastically
perspectives on figures like Christopher Columbus have changed over just the
last generation from bringing more diverse viewpoints into the conversation.
Hell, we demonstrably see this today on the Japanese language Wikipedia with
topics like the Nanking Massacre.

~~~
ailideex
> Think about how drastically perspectives on figures like Christopher
> Columbus have changed over just the last generation from bringing more
> diverse viewpoints into the conversation.

In my view Wikipedia should not be a repository of value judgements or
specific values that one should adopt - perspectives on Christopher Columbus
are important and should be included, but in no manner should those perspectives
be made out to be incontrovertible or something other than value judgements
and perspectives from specific points of view. I think it is valuable to
understand the European perspective and native american perspectives at the
time and throughout the following centuries for events.

But I don't think Wikipedia should be telling me I must think what Columbus
did was good or bad - Wikipedia should not be trying to teach me morality -
and as long as it does not do that I don't see how there is any problem with
what topics Wikipedia covers and who writes it.

I think the only problem comes in when you attempt to do something which is
impossible - like incorporate something which is fundamentally specific to
specific people (morality) into something which purports to be valid for
everyone.

~~~
lordlic
Those judgments appear organically through mechanisms as simple as how much
coverage a topic gets. The worst case is that a bunch of circa 1900 Europeans
write this article
[https://en.wikipedia.org/wiki/Population_history_of_indigeno...](https://en.wikipedia.org/wiki/Population_history_of_indigenous_peoples_of_the_Americas)
and the impact of colonialism is mentioned in half a footnote rather than
taking up the bulk of the article. If systemic bias were completely unchecked,
entire articles might not exist.

You might also be interested in reading
[https://en.wikipedia.org/wiki/Chinese_Wikipedia#Self-censors...](https://en.wikipedia.org/wiki/Chinese_Wikipedia#Self-censorship_allegations)

------
yig
2016 HN discussion:
[https://news.ycombinator.com/item?id=12381791](https://news.ycombinator.com/item?id=12381791)

2014 HN discussion:
[https://news.ycombinator.com/item?id=7149006](https://news.ycombinator.com/item?id=7149006)

~~~
dang
2018 too:
[https://news.ycombinator.com/item?id=17744150](https://news.ycombinator.com/item?id=17744150)

------
robbya
[https://academictorrents.com/about.php#mirroring](https://academictorrents.com/about.php#mirroring)

Using RSS to allow mirrors to host different subjects is really clever,
although some of the categories seem quite large (>5TB). It may be worth
breaking up each category (sharding) to keep each to 100GB or less so a
volunteer can pick a couple and not worry about running out of disk when a
category grows.

Then it would be good to track how many seeds each category-shard has so
volunteers can help where it's most needed.
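
A rough sketch of what that sharding could look like, assuming each category
is just a list of (name, size) items. Only the 100GB cap comes from the
comment; the function names, the first-fit-decreasing packing, and the seed
counts are hypothetical, not how the site works. An item larger than the cap
cannot be split, so it ends up alone in an oversized shard.

```python
from typing import Dict, List, Tuple

GB = 10**9
SHARD_CAP = 100 * GB  # cap suggested in the comment; everything else is illustrative

Item = Tuple[str, int]  # (torrent name, size in bytes)


def shard_category(items: List[Item], cap: int = SHARD_CAP) -> List[List[Item]]:
    """First-fit decreasing: put each item in the first shard that has room."""
    shards: List[List[Item]] = []
    free: List[int] = []  # remaining capacity of each shard
    for name, size in sorted(items, key=lambda it: it[1], reverse=True):
        if size > cap:
            # Too big for any shard: give it a shard of its own.
            shards.append([(name, size)])
            free.append(0)
            continue
        for i, room in enumerate(free):
            if size <= room:
                shards[i].append((name, size))
                free[i] -= size
                break
        else:
            # No existing shard had room, so open a new one.
            shards.append([(name, size)])
            free.append(cap - size)
    return shards


def neediest_shards(seed_counts: Dict[str, int], n: int = 3) -> List[str]:
    """Return the n shard ids with the fewest seeds (ties broken by id)."""
    return sorted(seed_counts, key=lambda s: (seed_counts[s], s))[:n]
```

A volunteer-facing page could then list `neediest_shards(...)` first, so new
mirrors pick up whatever is least seeded.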

~~~
DuskStar
Some individual items are multiple TB, which would make 100GB shards a little
difficult.

------
DuskStar
I wish I could add Gwern's Danbooru dataset [0] here - 2.7TB of labeled anime
images. But they only support torrent files up to 10MB, and that's over 20MB
for the full dataset or 12MB for the SFW low-rez set...

Incidentally, when the _torrent file_ for your anime image collection passes
20MB, something has obviously gone very w̵r̵o̵n̵g̵ right.

0: [https://www.gwern.net/Danbooru2019](https://www.gwern.net/Danbooru2019)

~~~
DuskStar
I should probably point out that this dataset has been used for some machine
learning tech demos in the past, for example This Waifu Does Not Exist [0], a
StyleGAN-based automatic anime portrait generation tool. So it's not
completely outside of what the site already hosts...

0:
[https://www.thiswaifudoesnotexist.net/](https://www.thiswaifudoesnotexist.net/)

~~~
gwern
More than demos, papers too:
[https://www.gwern.net/Danbooru2019#applications](https://www.gwern.net/Danbooru2019#applications)

------
glofish
Cool idea, it is impressive that it is still around - alas it is flawed the
same way all scientific data is flawed.

There is no metadata - all you have is an awkward imprecise textual search of
the abstract that comes with the data. Good luck hosting the world's data that
way.

~~~
derefr
One nice thing about digital data, as opposed to physical artefacts, is that
you don’t need to keep digital data’s metadata attached to the data “at the
hip.”

Through the magic of cryptographic hash algorithms, you can just keep your
data sets floating around “raw” (like in these torrents), and then,
_elsewhere_ , ascribe metadata _to the hash_ of the content it is meant to
annotate.

Then, later, you can reassemble them in either order—either by first finding a
data set, hashing it, and then looking up metadata in some metadata-hosting
service; or by first browsing a catalogue of indexed metadata, finding out
about a dataset that meets your needs, and then retrieving the data set _by_
its hash.

Which is to say: with digital data, library science (creating metadata and
chains-of-custody and indexing them for search) and archiving (ensuring access
to pristine artifacts over time) don’t need to happen at the same time, in the
same place. There can be separate “artifact hosting” and “metadata library”
services. (Which is especially helpful in contexts where private IP is
involved—you can still keep in your metadata library, the metadata for a data-
set you don’t have the rights to; and those _with_ the rights can go get the
data-set themselves.)
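
To make the idea concrete, here is a minimal sketch of metadata kept apart
from the data and tied to it only by a content hash. The `MetadataLibrary`
class, its method names, and the choice of SHA-256 are assumptions for
illustration, not any existing service or the site's own scheme.

```python
import hashlib
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Identity of a data set: hex SHA-256 of its bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class MetadataLibrary:
    """Hypothetical metadata service; it never stores the data, only hash -> fields."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, str]] = {}

    def annotate(self, digest: str, **fields: str) -> None:
        """Attach or update metadata fields for the content with this hash."""
        self._records.setdefault(digest, {}).update(fields)

    def lookup(self, digest: str) -> Dict[str, str]:
        """Data first: given a hash, return a copy of whatever metadata is known."""
        return dict(self._records.get(digest, {}))

    def search(self, **criteria: str) -> List[str]:
        """Metadata first: return hashes whose fields match all criteria exactly."""
        return sorted(
            digest
            for digest, record in self._records.items()
            if all(record.get(k) == v for k, v in criteria.items())
        )


# Toy stand-in for a data set that lives somewhere else (e.g. in a torrent).
raw = b"id,label\n1,cat\n2,dog\n"

library = MetadataLibrary()
library.annotate(content_hash(raw), title="toy labels", subject="vision")

# Direction 1: you found the data, hash it, and look up its metadata.
found = library.lookup(content_hash(raw))

# Direction 2: you browse the catalogue and get back hashes to fetch by.
wanted = library.search(subject="vision")
```

Note that the library works the same whether or not its operator holds a copy
of `raw`, which is the point about data sets you don't have the rights to.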

~~~
bordercases
Aaaand someone has to do the work for computing the index and annotating the
hashes.

~~~
robbya
I think it's worth recognizing that this is a good first step in a hard
problem. Hosting many TB of data for free isn't easy. Building an index on top
of that data isn't easy either, and it looks like no such index exists today,
but if someone decided to build that index they wouldn't need to worry about
the hosting portion of the problem. That's a great starting point.

------
husainalshehhi
Downloading some of this might be illegal. I see some entries that say "No
license specified, the work may be protected by copyright."

------
aldoushuxley001
This is amazing, really a great source of data.

