
33GB of public domain JSTOR articles, and a manifesto - sp332
http://thepiratebay.org/torrent/6554331/Papers_from_Philosophical_Transactions_of_the_Royal_Society__fro
======
giberson
If I weren't too timid to risk doing so, I would do the following (read: I hope
someone else does this).

Process the PDFs with an OCR program to extract as much text from each
document as possible. The extraction should be done page by page, so the
extracted text can be traced back to a PDF page number.

Then, provide a searchable/browsable directory of the extracted content.
Each page of text has a link to the original PDF page so you can easily open up
the PDF to the page the text was extracted from.
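
A minimal sketch of that page-level directory, assuming the OCR step has already produced one text string per page. The file names and sample text are made up for illustration, and the `#page=N` suffix relies on the PDF open parameter that many viewers honour, so treat the links as a best-effort convention rather than a guarantee.

```python
import re
from collections import defaultdict


def build_page_index(pages):
    """Map each word to the set of (pdf_name, page_number) pairs it appears on.

    `pages` is an iterable of (pdf_name, page_number, text) tuples,
    with page numbers starting at 1.
    """
    index = defaultdict(set)
    for pdf_name, page_number, text in pages:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add((pdf_name, page_number))
    return index


def page_link(pdf_name, page_number):
    # Assumption: the viewer understands the "#page=N" open parameter.
    return f"{pdf_name}#page={page_number}"


def search(index, word):
    """Return links to every page containing `word`, in a stable order."""
    hits = sorted(index.get(word.lower(), ()))
    return [page_link(name, page) for name, page in hits]


# Hypothetical OCR output, one entry per page.
sample_pages = [
    ("babbage.pdf", 1, "Description of the brain"),
    ("babbage.pdf", 2, "The brain was weighed"),
    ("other.pdf", 1, "Observations on comets"),
]
index = build_page_index(sample_pages)
print(search(index, "brain"))
```

Keeping the page number next to every word is what lets a search hit open the scan at the right place instead of at the start of the file.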

I'd also make all text user-editable, wiki style. Combined with the inline PDF
page references, it would be super easy for any user to fix up transcription
errors from the OCR process. Tie a karma system into each user's profile so
that good edits can be thanked with kudos. That karma could also help automate
moderation of user edits: depending on the editor's current karma, an edit is
either accepted automatically or offered as an alternate version that other
users can check and rate up if they think it should replace the current
version.
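
The karma rule above can be sketched roughly as follows. The threshold values, the `Page` class, and the function names are all hypothetical, picked only to make the accept-or-queue decision concrete.

```python
from dataclasses import dataclass, field

AUTO_ACCEPT_KARMA = 50  # hypothetical: karma needed for an edit to go live at once
PROMOTE_VOTES = 3       # hypothetical: upvotes needed to promote an alternate


@dataclass
class Page:
    text: str
    alternates: list = field(default_factory=list)  # pending alternate versions


def submit_edit(page, new_text, editor_karma):
    """Accept the edit directly for trusted editors, otherwise queue it."""
    if editor_karma >= AUTO_ACCEPT_KARMA:
        page.text = new_text
        return "accepted"
    page.alternates.append({"text": new_text, "votes": 0})
    return "pending"


def upvote_alternate(page, position):
    """Record one upvote; promote the alternate once it has enough votes."""
    alternate = page.alternates[position]
    alternate["votes"] += 1
    if alternate["votes"] >= PROMOTE_VOTES:
        page.text = alternate["text"]
        del page.alternates[position]
        return "promoted"
    return "pending"


page = Page("ftrange fpelling")
print(submit_edit(page, "strange spelling", editor_karma=10))
```

A low-karma edit never overwrites the current text by itself; it only replaces it after other users have rated it up.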

Maybe mash in an image cropping service so diagrams can be cropped from the
PDF and inserted inline with the extracted text. Provide simple wiki
formatting markup to allow users to format the articles.

Use ad revenue/donations to alleviate/cover hosting costs.

1, 2, 3, go.

~~~
anigbrowl
Well, most of this already exists in Google Books, where I'm reading some of
these particular scientific journals right now, can switch to OCRed text at
any time (albeit with ftrange contemporarie fpelling), or download a facsimile
of any public domain work in PDF format. The only thing they don't have in
place (and should add) is the wiki-style crowd editing.

So the guy has basically built a 33GB torrent of stuff that was already freely
available to the public, just from a different source.

------
sp332
Jason Scott of textfiles.com has the whole archive downloaded here, if you
want to browse the metadata before downloading the files.
<http://cdmirror.textfiles.com/JSTOR_01_PhilTrans/>

~~~
sp332
Apologies to Jason for posting a link to his temp/staging folder. The files
will be available at the Internet Archive soon.

------
eykanal
For those of you who aren't familiar, there is an institution that was set up
not long ago called the Public Library of Science (PLoS):

<http://www.plos.org/>

They have free journals in numerous fields, and gradually more big-name
authors (in my field, neuroscience, at least) have been publishing in it. It's
worth checking out.

~~~
juberpatel
Actually, the operation of PLoS journals raises an interesting question. These
journals charge the researchers for publishing papers so that these papers are
freely accessible to anyone. But the charges are in the range of $2200 - $3000
per paper! Doesn't that mean the traditional journals are charging reasonable
fees to the readers?

------
tylerneylon
[[TL;DR for this comment - Publishers are taking advantage of a prisoner's
dilemma / competitive closed market to monetize the near-zero value they
supply.]]

Professors don't get much money from direct publication -- in fact, many
conferences charge the professors who provide the content. They get paid by
grants and their schools. Professors don't want their research to reach a
limited audience. The universities don't want this either.

The only people in the chain who want limited access are the publishers, since
this is how they make money. But between researchers, universities, grants,
and publishers, the publishers contribute the least value _by far_. They
generally rely entirely on other professors to edit their journals and
conference proceedings, and for all the content. They charge ridiculous rates
- often thousands of dollars for a single annual journal subscription - and
get away with it because the system is not prone to change. Researchers are
rewarded for publishing in "the best" journals, so no one wants to take the
leap to publishing in a free space where there is currently much less
prestige.

That's basically why academic publishing is messed up: there's money
to be made in keeping it messed up, and money to be lost in fixing it. But the
ones who generate the real value _do_ want things to be as freely available as
possible. If a critical mass of top-tier researchers agreed to stop publishing
in non-free journals and conferences, it would probably start a revolution in
this area -- but that's a lot to ask.

It's a prisoner's dilemma, in that the "traitor" researchers who keep
publishing in the old journals will be rewarded.

(This is all about academia -- I guess motivations may be different in
industry-backed research.)

------
rb2k_
And for people interested in how many bitcoin donations flow his way:

[http://blockexplorer.com/address/14csFEJHk3SYbkBmajyJ3ktpsd2...](http://blockexplorer.com/address/14csFEJHk3SYbkBmajyJ3ktpsd2TmwDEBb)

~~~
VMG
Nothing?

~~~
r00fus
I imagine it's a legal defense fund of sorts. It's not like he owns these
documents.

~~~
pyre
No one does, as they are pre-1923.

------
w1ntermute
Can anyone explain why all these documents are not available for free? Why
does the only place you can download them charge for the privilege?

~~~
pnathan
Well, there is a non-zero cost to hosting and delivering documents, as well as
creating the infrastructure to allow it.

Those costs must be made up somehow in the business model.

(before I get flamed, I'm not arguing that all documents should be $19 per
article. I'm just saying that there's a non-zero cost that needs to be passed
on to the customers somehow).

~~~
crocowhile
Also, JSTOR does NOT get those documents for free. They may have to pay the
publisher to host them on their server.

Academic publishing is a bitch and there is a lot going on lately towards a
common, world wide reform. However, JSTOR is not really the bad guy here.
Other publishers (e.g. Elsevier <http://en.wikipedia.org/wiki/Elsevier>) make
billions by publishing mainly work paid for with taxpayer money.

~~~
Vivtek
_They may have to pay the publisher to host them on their server._

Not in this case, apparently; they're public domain. Although JSTOR did foot
the bill for the scanning.

I'll take your other point, though: if there's real evil in the academic
publishing world, it's Elsevier.

~~~
crocowhile
The point is that this action and aaronsw's action make JSTOR look like the
villain while JSTOR are rather sitting on the good side of the battle.

(On the other hand, these actions do contribute to create public awareness so
I still haven't decided if they are good or not)

~~~
SDADASDA
It seems JSTOR is not pressing charges.

------
zeratul
Medical doctors were very unhappy to pay for research papers that were funded
by tax paying Americans. That's why since April 2008 all articles funded by
NIH have to be freely available: <http://publicaccess.nih.gov/> . Remember,
it's impossible for law to work backwards.

~~~
alanh
I am also quibbling with your claim “it's impossible for law to work
backwards.” (IANAL)

1) It _is_ unconstitutional to prosecute an individual for a crime that was
legal at time of perpetration.

2) However, the opposite — decriminalizing previously illegal behavior — can
be retroactive.

3) Changing future interpretation of copyright, etc., isn’t the same as case
#1. If I’m not mistaken, Congress has passed e.g. the Mickey Mouse copyright
law and the DMCA, which both extended “protection” & duration of copyright on
previously created works.

~~~
nullc
Maybe.

<http://www.scotusblog.com/case-files/cases/golan-v-holder/>

------
aidenn0
FYI it's potentially NSFW if you don't have adblock

~~~
sp332
D'oh, I always forget about those ads because I have adblock. _blushes_

------
equark
I don't fully understand the logic. Even if the underlying content is free,
how are JSTOR scans public domain? If Google spends millions of dollars
scanning in public domain books, I don't see how that gives Bing the right to
download them all from Google unless given permission.

It also seems we all benefit from allowing companies to invest in scanning
public domain works since for whatever reason nobody is doing this by hand
now.

~~~
meow
I think the title is kind of wrong. The torrent author says: "I've had these
files for a long time, but I've been afraid that if I published them I would
be subject to unjust legal harassment by those who profit from controlling
access to these works."

So he clearly says it's not in the public domain but from a moral point of view
they should be available to all the masses. The description of the torrent is
very interesting. Definitely worth taking a look even if not willing to
download the actual torrent.

~~~
microarchitect
_The portion of the collection included in this archive, ones published prior
to 1923 and therefore obviously in the public domain, total some 18,592 papers
and 33 gigabytes of data._

I think this statement means that the documents are indeed in the public
domain.

I also think equark makes a valid point about the scans. Further, I don't
understand why these guys are going after JSTOR, a non-profit organization.
I'd be more understanding of their methods if they went after somebody like
Elsevier.

~~~
meow
You are correct. This portion appears to be in the public domain (though not
publicly available). Non-profit or not, the fact remains that most of these
documents remain behind paywalls while still being considered to be in the
public domain.

You have a valid point though. Access to these documents seems to be
governed by agreements between various publishers, with the aim of sharing the
published content among various institutions
(<http://en.wikipedia.org/wiki/JSTOR>). So there is no point blaming JSTOR for
the lack of access.

------
showerst
Just to be clear, these are apparently unrelated to the Aaronsw case, right?

~~~
iterationx
I had considered releasing this collection anonymously, but others pointed out
that the obviously overzealous prosecutors of Aaron Swartz would probably
accuse him of it and add it to their growing list of ridiculous charges. This
didn't sit well with my conscience, and I generally believe that anything
worth doing is worth attaching your name to.

I'm interested in hearing about any enjoyable discoveries or even useful
applications which come of this archive.

Greg Maxwell - July 20th 2011 gmaxwell@gmail.com Bitcoin:
14csFEJHk3SYbkBmajyJ3ktpsd2TmwDEBb

~~~
slowpoke
I truly think you are a brave person for doing this under your real name. I
most likely couldn't do that.

Godspeed, good sir.

~~~
huhtenberg
That was a quote from the manifesto.

------
emilis_info
Facebook won't let me post a link to this torrent. Anyone know a way around?

URL shorteners don't help.

~~~
davorak
I was unaware of that level of censorship at Facebook. Where are you trying to
post it?

~~~
mathew1988
Facebook even censors pirate bay links within private messages between you and
friends iirc. They don't block select other torrent sites though. They're a
bit odd like that.

Some more info:
[http://torrentfreak.com/facebook-blocks-all-pirate-bay-links...](http://torrentfreak.com/facebook-blocks-all-pirate-bay-links-090408/)

------
aridiculous
I wonder what would happen if this happened to Westlaw, one of the 2-3
industry standards for law firms. Incredibly expensive.

The irony alone in the law profession would be tremendous. It'd be interesting
to see if law firms would illegally access it: It would be obvious they were
if they previously only subscribed to Westlaw, but the reality is most law
firms subscribe to more than one database for emergency backup.

~~~
rdp
State and federal court opinions are freely available from the courts
themselves. Westlaw, LexisNexis, etc. gain much of their value from tools they
provide for analyzing the opinions. These companies manually produce
headnotes, case histories, and citation history (i.e., "shepardization").
Collating this value-added data would be vastly more difficult than
automatically pulling opinions out of the PACER database (or even JSTOR). Not
that I wouldn't like to see somebody try it . . .

------
Atropos
How many different paywalls are there and how many articles are trapped? I'm
able to access 5 different databases + one of the biggest research libraries
in my state and there are still often articles that are simply inaccessible. Or
even more ridiculous: single chapters of a book sometimes cost up to $20
online, even if the complete book could be bought for $30...

In my mind an easier way to disrupt this system would be to create a p2p site
for article sharing - this often takes place informally anyway. Just a place
where you could ask "Does anyone have article..." and then a friendly person
would upload it to some filehoster and share the link.

------
mestudent
Is this[1] the "manifesto"?

If not can someone post it so I don't have to get around the block.

[1]:
[http://cdmirror.textfiles.com/JSTOR_01_PhilTrans/1st_READ.tx...](http://cdmirror.textfiles.com/JSTOR_01_PhilTrans/1st_READ.txt)

~~~
pimeys
The same one that is in the pirate bay page.

------
raldi
That'll teach 'em. Maybe next time JSTOR will think twice before protecting
their network from an apparent DoS attack.

~~~
wnight
Yeah, because someone downloading documents, at the rate supplied by your
server(!), is obviously an attacker.

For what it's worth, it would have been obvious they weren't facing a DoS.

------
flocial
This illustrates the sad state of affairs. The technology is there to
distribute this equitably (using torrents). Scanning these documents is a non-
trivial task and most people would only need a handful of the papers in this
collection for anything but intellectual curiosity. However, the pricing and
legal restrictions put in place for the distribution go against the history
of scholarship. The only reason we have lots of ancient works of prose and
scholarship is because monasteries of various creeds institutionally copied
these works (by hand).

------
dbingham
Someone I know is suggesting that these documents were already free and
available on the web. I don't really know, since I haven't (and don't have the
bandwidth to, really) downloaded the torrent and cross referenced. Here are
the links he's provided:

"Unavailable anywhere else? Here's the ones from the 1600s:
[http://www.bodley.ox.ac.uk/cgi-bin/ilej/pbrowse.pl?item=ti...](http://www.bodley.ox.ac.uk/cgi-bin/ilej/pbrowse.pl?item=title&id=ilej.4.&title=Philosophical+Transactions+of+the+Royal+Society)
. Here's the ones from 1832-1938:
[http://catalogue.bnf.fr/servlet/RechercheEquation;jsession...](http://catalogue.bnf.fr/servlet/RechercheEquation;jsessionid=AA627D58AC31E3CE30D65A8FB2587CB6?TexteCollection=HGARSTUVWXYZ1DIECBMJNQLOKP&TexteTypeDoc=DESNFPIBTMCJOV&Equation=IDP%3Dcb37572031d&FormatAffichage=0&host=catalogue)
. They're pretty widely available, for free."

"Looks like a bunch are on archive.org, too:
[http://www.archive.org/search.php?query=creator%3A%22Royal...](http://www.archive.org/search.php?query=creator%3A%22Royal+Society+of+London%22)"

Can anyone confirm that these are the same articles in the torrent?

~~~
nullc
It's perhaps easier to go the other way.

Can you find this online? "Description of the Brain of Mr. Charles Babbage"

T1 - Description of the Brain of Mr. Charles Babbage, F.R.S
JF - Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character (1896-1934)
VL - 200
SP - 117
EP - 131
PY - 1909/01/01/
UR - <http://dx.doi.org/10.1098/rstb.1909.0003>
M3 - doi:10.1098/rstb.1909.0003
AU - Horsley, V.

------
y0ghur7_xxx
Link to manifesto <http://pastebin.com/KudE4bWr> for people who don't have
access to the pirate bay.

------
ck2
So what percentage is that? Is a "page" still the 2k standard these days?

Per Wikipedia: _as of November 2, 2010, the database contained 1,289 journal
titles in 20 collections representing 53 disciplines, and 303,294 individual
journal issues, totaling over 38 million pages of text_

~~~
sp332
Very little. The indictment against aaronsw claims that he got over 4 million
documents, and this archive only has 18,500. And I'm not even sure what
percentage of the total aaronsw got.
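
As a rough worked figure, assuming the 18,592 papers quoted from the torrent description and treating the indictment's "over 4 million" as a lower bound of exactly 4,000,000 (the variable names are just for illustration):

```python
archive_papers = 18_592           # paper count quoted from the torrent description
indictment_documents = 4_000_000  # assumption: lower bound for "over 4 million"

# Share of the indictment figure that this archive represents.
share = archive_papers / indictment_documents
print(f"{share:.2%}")  # 0.46%
```

Since the real document count is above 4 million, the true share is somewhat below this figure.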

------
danbmil99
Can someone explain who exactly is so hellbent on prosecuting Aaronsw? Whose
eye did he poke? JSTOR isn't even pressing charges, so I assume it's some
other party.

------
cpeterso
When is someone going to do this for the ACM's journals?

------
blinkingled
So is it legal to (re-)distribute those files? Or are we going to see more
prosecutions __AA style involving distributors as well as downloaders?

------
sitkack
Please seed. If you are on a Mac you can use Transmission.

<http://www.transmissionbt.com/>

Configure the bandwidth management to acceptable always-on background levels
and minimize to the Dock or put it on a different desktop with Spaces.

------
apas
That's why I love the Pirate Bay and its community.

Free art, technology and culture. Yep, this is the internet.

------
nextparadigms
This looks like another case of the "Streisand effect".

------
nl
That is one gutsy move by Greg Maxwell.

------
m0wfo
Only just realised Eircom has blocked access to TPB in Ireland at the request
of the 4 major record labels. The fact that this HN post is tangential to the
issue of music piracy annoys me. Another step closer to censorship.

------
slowcpu
Not terribly interesting. Although it is one of the oldest (if not the oldest)
journals around, it is simply not widely read or terribly important. In
addition, from the journal's web site: "All issues back to 2001 are free to
access two years after publication."

Now, a collection of all of Nature's issues would have been fascinating.

------
username3
What's so good about these articles?

~~~
slowpoke
What's bad about more knowledge?

~~~
username3
I'm not saying it's bad. I'm curious what's special or interesting about these
documents.

------
ivankirigin
Scribd should index and host these. I doubt that would greatly add to their
current huge scale. They could make a dedicated site for it. They could also
probably get the help of the FOSS community to help make the search faster.

~~~
gwern
So, the solution to one publisher locking up and charging for access to public
domain material is... copying the material to another publisher that likes
locking up and charging for access?

~~~
ivankirigin
It is a solution to needing to download every file here just to access one
quickly. Scribd has charged document creators IIRC, not readers.

Scribd makes money off ads, so the incentives are actually really well aligned.

