65 out of the 100 most cited papers are paywalled (authorea.com)
273 points by jmnicholson on Nov 13, 2017 | 77 comments

This is by far one of the most frustrating things to me about the current structure of the scientific community. Making all published research free to access for everyone would be a massive benefit to the general education of society and would allow anyone regardless of institutional affiliation to be involved in the process of science. Imagine how much better science reporting would be if every popular science article was expected to link directly to the full papers they were referencing.

As someone who hasn't been involved with a university or a laboratory for many years, I find myself continually extremely frustrated by how difficult it can be for me to keep up with new developments in the fields I studied in college.

I agree, and I'll explicitly add "history" as a part of "science" in the research sense you meant. I'm having a heck of a time trying to access historical records of medieval England from here in Silicon Valley because, although the information I need (land deeds, court records, etc.) was uncovered & translated by Victorian historians, and much of it exists in both text and image form and is ALREADY ONLINE in Google Books, the various English universities & their presses, English historical associations, libraries, etc., all jealously guard the information instead of releasing it, even to the point of preventing Google from showing it.

Example: Cambridge Univ Press takes books that were Victorian databases of medieval records, and in the name of "protecting our precious history", they just photocopy the Victorian pages, reprint them on newer paper, and put them under new copyright. If you want to look something up in the 10-volume database, then either blindly buy the $300 set of books and hope to find some items of interest scattered therein or go to the nearest library that has the full set, which turns out to be on the other side of the planet at Leeds Uni in Yorkshire, which will allow you to see these ancient texts printed in 2013 if you pay for library use and make an appointment at least 2 days in advance....

Meanwhile, the full text of the books is on Google Books at Google's expense and available in any browser, but Google is forced by the "protectors of history" at Cambridge to cloak many of the pages of 1000-yr-old data, because history is too precious to allow the unworthy to see their photocopied Victorian texts without first going on an old-fashioned quest, bribing the boatman, answering the troll's three questions, etc.

My hope is that at some point there will be a cultural change among "historical preservation" organizations where they decide that the greatest thing they can do to promote their field is to find every original document in every collection, carefully photograph it in hi-rez using whatever optical frequencies bring out the most faded detail, and contribute it to a free online host. Next step is then to create and freely post transcriptions, translations, and indexes, so that ANYONE can use the data for research, not just those who have enough gold in their purse, time for the quest, and can "answer me these questions three".

I’m with you, but the reason these documents aren’t free and online is money, of course. These archives, libraries and institutions are chronically underfunded. Scanning is expensive. They use copyright as revenue generation to plug the gap. I’m certain the academics that work there would love to publish all of history online for free, as long as you can tell them where their salary is going to come from.

I agree with everything you said, so I'm just adding some thoughts.

I'm trying to do historical research at my own expense and, like many others, will continue to do so as long as I can get access to raw data. Google will provide much of the data for free, and I (and MANY others) will do the analysis for free. It's another "open source" vs "commercial" situation. No one is required to give me expensive stuff for free, BUT if OTHERS are willing to pay the costs to scan, host, and provide it to everyone for free, but the "protectors of our precious heritage" won't allow it, it's no longer about the love of history but more like a repetition of history: we've got the power, so pay us for our, uh, "services".

As for scanning costs, though, Google is already scanning and hosting many of these materials for free and are being actively prevented from doing more (and even showing what they've already scanned), so I suspect we aren't far from having all we need for a 21st Century equivalent of a Carnegie Foundation. If Andrew Carnegie could build public libraries around the world, the robber barons of our century could fund an online Library of Alexandria to serve as a central repository. They could then offer to museums, private collectors, etc., to do the scanning and hosting and maybe even some ad sponsorship or donation mechanism to provide a small income stream back to the owners of the originals.

So basically the universities should buy the books, store them, repair them, scan them, edit the PDFs, make books, pay for all of that and then give you the PDFs for free.

Not sure why they should.

Should scientists and everyone else get free access to papers paid for by taxpayers? Yes.

Universities should support research, they shouldn't only be capitalist machines. But even while chasing profits they could at least sell the damn books online and give you a table of contents.

Agree! It's really ridiculous and quite sad. I am no longer at a university, so basically everything I read is stolen via Sci-Hub.

Hopefully, more and more researchers will preprint their work. That's something we're trying to help with at https://www.authorea.com (disclosure: I work there).

It's a great site, which documents a horrible situation.

It'd be cool if it also showed which papers have preprints online. And which (I presume virtually all) are available via SciHub. Even links, if possible :)

And a funny thing is that, even though I work in a university, I still find myself using Scihub sometimes. Either we do not have the access (old papers are a problem) or I'm not present at work.

> Imagine how much better science reporting would be if every popular science article was expected to link directly to the full papers they were referencing.

I agree, but for the most part newspapers aren't even getting the "referencing" part right. That is, they just lazily write "a new study" instead of providing an actual, useful reference to the primary source. I don't think this would magically change with open access: even for non-open works it would usually already be possible to link to a public abstract, but most science reporting simply does not care enough.

I think Computer Science is slowly headed in this direction.

I am usually able to find an Arxiv listing for most of the papers that I read.

Don't worry. Most of the most innovative developments aren't coming out of universities.

And there is always Sci Hub for the ones that do :)

Fund SciHub!

What I'd like to see is an IPFS feature that showed the "least shared" files in a set, so you could say "I want to help host the rarest 10 GB", for example.
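The "host the rarest 10 GB" idea could be sketched as a simple greedy selection over per-file replica counts. To be clear, IPFS doesn't expose a "least shared" metric today (that is exactly the wished-for feature), so the replica counts below are invented for illustration:

```python
# Sketch of "help host the rarest N GB": greedily pick the files with the
# fewest known replicas until the hosting budget is used up. The replica
# counts are hypothetical; nothing here calls a real IPFS API.
def rarest_within_budget(files, budget_gb):
    """files: iterable of (name, size_gb, replica_count) tuples."""
    chosen, used = [], 0.0
    # Least-replicated first, so scarce files get hosted before common ones.
    for name, size_gb, replicas in sorted(files, key=lambda f: f[2]):
        if used + size_gb <= budget_gb:
            chosen.append(name)
            used += size_gb
    return chosen

files = [("a.pdf", 4.0, 12), ("b.pdf", 3.0, 1), ("c.pdf", 5.0, 2)]
print(rarest_within_budget(files, 10))  # picks the two least-replicated files
```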

You can do this, as all of SciHub is available as torrents.

Edit: you should be able to find the URLs with some googling, not going to post them here

Do you have a complete copy of scihub? The seeders seem to be very slow. Please let me know -- email is in my HN profile -- I'll even pay for complete copies.

If you get a hold of it, please seed! :)

The problem with that is that you can't serve content directly over HTTP from a torrent, whereas with IPFS you can.

More importantly, webtorrent is a client side hack when clients should be using webseeding [1], which is baked into torrent clients that use libtorrent.

[1] https://en.wikipedia.org/wiki/BitTorrent#Web_seeding

Now I'm curious, how much data would that be with the latest dump?

Last time I checked it was between 40 TB and 60 TB for just the articles, not including the libgen books.

At that time I calculated that it would cost about $1000 in refurbished 4TB external hard drives. Which is probably not the best way to store it.

Or about 2,000 empty Blu-rays, if you want to make searching for an article a nightmare.

...or about 10 LTO-7 tapes (6 TB each --- "15 TB" compressed, but I assume the article collection is already compressed), at slightly less cost for the media and a lot more for the drive; but with better reliability and even worse access times...
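As a sanity check on the figures in this subthread (taking the upper 60 TB estimate and nominal media capacities as the assumptions), the media counts work out roughly as quoted:

```python
# Back-of-the-envelope media counts for a hypothetical 60 TB collection.
# All capacities are nominal; real formatted capacity would be lower.
import math

collection_gb = 60_000  # upper estimate from the thread: 60 TB of articles

media_gb = {
    "refurbished 4 TB external HDD": 4_000,
    "25 GB single-layer Blu-ray": 25,
    "LTO-7 tape (6 TB native)": 6_000,
}

for name, capacity in media_gb.items():
    units = math.ceil(collection_gb / capacity)
    print(f"{name}: {units} units")
# 15 HDDs (~$1000 total at ~$67 per refurbished drive), 2400 single-layer
# Blu-rays (dual-layer 50 GB discs would halve that), and 10 LTO-7 tapes.
```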

Oh damn, this is way bigger than I imagined.

Search indices != primary storage.

Ask Google about that.

Do you mind expanding on that? What are you suggesting I search for on Google?

I genuinely am trying to understand.

Forgive my brevity, typos, and poor writing, as I'm writing this on my phone. He is saying that you would never actually search on your primary storage media. You build the search index off the primary storage medium, so that when looking up an article you search only the index, and then once the article is found you read it from the primary medium.

Well sure, but then you have to crawl through thousands of pieces of physical media to load the file you want...

Do search indexes typically not include an identifier for a given resource? Seems to me that an index would be almost useless without them, as the match itself is more often than not far less useful than the document that contains it. Unless the information you need happens to be in the surrounding blurb.

So yeah, I'm not sure why you'd have to "crawl" through physical media when a search index ought to tell you right where your match is located. Is that not the entire point of indexing and searching?

An index maps a query to a specific item.

Say, a file reference within a storage hierarchy or record within a database.

(The concepts are fundamentally identical.)
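A minimal sketch of the index-vs-primary-storage split being described, with an in-memory dict standing in for the slow primary medium and an inverted index as the only structure you actually search (the DOIs and texts are invented placeholders):

```python
# Toy illustration of "search indices != primary storage": the inverted
# index is the small, fast structure you query; the primary medium (here
# just a dict, in reality tapes or discs) is touched only to fetch a hit.
from collections import defaultdict

primary_storage = {  # article ID -> full text; slow to read in reality
    "doi:10.0001/aaa": "adsorption of gases in multimolecular layers",
    "doi:10.0001/bbb": "a new look at the statistical model identification",
}

index = defaultdict(set)  # term -> article IDs; small enough for fast storage
for doc_id, text in primary_storage.items():
    for term in text.split():
        index[term].add(doc_id)

def lookup(term):
    """Probe only the index, then do one read per hit from primary storage."""
    return {doc_id: primary_storage[doc_id] for doc_id in index.get(term, ())}
```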

Thank you for explaining :)

This, yes.

If I were to, say, build a document-management-system-as-a-filesystem,[1] I might approach it something like this:

1. Provide an independent storage tier for the actual content. Say, "stacks". This would store various formats of documents, uniquely identified (say, by a corresponding hash), and maintained independently of access.

2. I'd have an interface tier that would, simply put, be a search interface. The concept of "filesystem as document management" could present this in a familiar, apparently hierarchical fashion, but the hierarchy would actually be searches against various metadata. Say: author, publisher, title, publication date (as range), subject, or keywords.

3. These metadata would be composed of one or more indices of the actual contents. The indices are what you want on fast storage, in memory if at all possible. Various forms of caching and distribution might provide for this.

4. In order to associate metadata with works and their attributes, you'd need an intake process. This means that as works are onboarded, they would be, effectively, catalogued and classified, with metadata fields supplied. It turns out that there are available metadata repositories for a large class of works (the US Library of Congress and OCLC are two organisations providing same), and it's possible that various distributed, crowdsourced, automated, AI, or similar mechanisms could be used to fill in metadata for works not otherwise categorised.

5. Further capabilities, access, reports, summaries, groupings, workflows, workgroups, projects, security layers, publication, editing, revision, reference, exploration, etc., might be provided by filesystem analogues where possible, other means where not.

If the overall system sounds a lot like a library and cataloguing system, that might be because it rather much is.

Google's web index operates in much the same way. Primary storage is the origin server of a given URL. The index is maintained and accessed by Google within its own systems. Google typically returns a response in a few hundredths of a second. Retrieval of the referenced source typically takes a number of seconds, or roughly 2-3 orders of magnitude slower.



1. See: https://redd.it/6bgowu
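Steps 1-3 of the scheme above might be sketched like this; the `Stacks` and `Catalog` names and their methods are invented for illustration, not any real library API:

```python
# Sketch of a content-addressed storage tier ("stacks") fronted by a
# metadata index, per the document-management-as-filesystem outline above.
import hashlib

class Stacks:
    """Storage tier: documents keyed by content hash, independent of access."""
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

class Catalog:
    """Interface tier: metadata fields resolving to content hashes."""
    def __init__(self, stacks: Stacks):
        self.stacks = stacks
        self._by_field = {}  # (field, value) -> set of content hashes

    def intake(self, data: bytes, **metadata) -> str:
        """Onboard a work: store the bytes, index the supplied metadata."""
        digest = self.stacks.put(data)
        for field, value in metadata.items():
            self._by_field.setdefault((field, value), set()).add(digest)
        return digest

    def search(self, field, value):
        """Search only the metadata index, then fetch hits from the stacks."""
        return [self.stacks.get(h) for h in self._by_field.get((field, value), ())]
```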

IPFS could be such an amazing thing for the academic world, imagine every new paper including an IPFS hash alongside the citations, so you could instantly jump to them while reading the pdf. It would guarantee that you'd be seeing the same version that was cited and the link would never break as long as at least one person kept sharing it.

It would achieve the original goal of the world wide web as envisioned by Tim Berners-Lee, except even better.
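The version-pinning guarantee can be approximated with a plain content hash. A real IPFS CID is a multihash over chunked, DAG-structured data rather than a bare SHA-256 of the file, so treat this purely as an illustration of the idea:

```python
# If the hash recorded alongside the citation matches the hash of what a
# reader fetched, the reader is seeing the exact bytes the author cited.
import hashlib

def citation_hash(document: bytes) -> str:
    return hashlib.sha256(document).hexdigest()

recorded = citation_hash(b"%PDF-1.4 example cited paper")   # stored with the citation
retrieved = citation_hash(b"%PDF-1.4 example cited paper")  # recomputed by a reader
assert recorded == retrieved  # identical bytes -> same hash -> same version as cited
```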

Yes! I had exactly the same idea. I also love how anyone can reshare the file transparently, whereas with HTTP, if the original server doesn't have it any more, you have to manually hunt around for alternate links.

If you can't read it, it doesn't exist. Research results are meant to be available and visible to all, or they are someone's private science diary. I also believe that Nobel prizes should not be given out to people whose research is not available to the general public.

The top 10 paywalled articles are all from the 20th century. The Open Access movement is great but it doesn't do anything to free up papers from the past.

A large part of the problem is the ridiculous duration of copyright. "Adsorption of Gases in Multimolecular Layers" is from 1938 and still paywalled.

In practice, almost all papers this popular will be available on random .edu sites and Google Scholar will find those technically-forbidden copies for you. But it is a significant problem if you don't have an institutional affiliation and you want to read articles that aren't among the top 5% cited. (Or at least it was a problem for me before sci-hub; I retained academic contacts who could email me any papers I wanted, but I had to cross a pretty high interest threshold before I'd bug someone to request that favor.)

Without SciHub and my access at MIT, most of the research I depend on would be out of bounds.

So, I tried to see if I could read some of the articles marked "paywall" and I had no trouble. My methodology: Google Scholar search the article title, and click the direct "PDF" link on the right side.

Eg: https://scholar.google.com/scholar?q=Tissue+sulfhydryl+group...

EDIT: My point here is that the statement in the article "the world’s most important research is inaccessible from the majority of the world" isn't exactly true. This isn't supposed to be an endorsement of academic publishing practices: if anything the fact that these publishers are effectively trying to scam readers out of money is all the more evident.

The fact that we have to low-key pirate research papers is just silly, though. The academic publishing system is just goofy, and I hope one of the projects that are currently trying to establish something better ends up taking hold soon.

It isn't piracy, all the publishers (in my experience & field of study) allow for free third-party hosting. Still agree with your second sentence though.

> My point here is that the statement in the article "the world’s most important research is inaccessible from the majority of the world" isn't exactly true.

This statement is still true even with your trick. The vast majority of people don't know about this and also even if they did know about this it would be technically difficult for many of them who aren't tech savvy. This is a huge barrier that shouldn't be discounted.

I don't see what tech savvyness has to do with it, many people use Google search. In fact, Scholar isn't even necessary, doing a regular Google search has the direct PDF links at the top of the search results.

I took a look at all 65 papers listed as paywalls.

Of them, 5 are actually available directly from the publisher so they shouldn't be listed as paywalls, and all of the remainder are available from at least one of Google Scholar/Google/Libgen; of the 60 actually-paywall papers, 54 are available from GS/G and only 6 force you to go all the way to Libgen. (I am taking the liberty of rehosting 10 of them myself, though, to get them into GS.)

Of the 65, notes on the ones not immediately available in GS:

> Density-functional thermochemistry. III. The role of exact exchange

Citation-only in Google Scholar but easily found in Google or SH/LG.

> Detection of specific sequences among DNA fragments separated by gel-electrophoresis

Paywall-only in Google Scholar, not immediately available in Google but easily gotten from SH/LG.

> Processing of X-ray diffraction data collected in oscillation mode

GS paywall-only, not in Google, but SH/LG.

> Isolation of biologically active ribonucleic acid from sources enriched in ribonuclease


> the attractions of proteins for small molecules and ions


> Helical microtubules of graphitic carbon

GS links to paywall but findable in G.

> A technique for radiolabeling DNA restriction endonuclease fragments to high specific activity

GS/G paywall but SH/LG.

> Phase annealing in SHELX-90: direct methods for larger structures

I am not sure why this one was listed as 'paywall' when it appears to be available directly from the publisher: http://journals.iucr.org/a/issues/1990/06/00/an0278/an0278.p...

> A study of the conditions and mechanism of the diphenylamine reaction for the colorimetric estimation of deoxyribonucleic acid

Also directly available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1215910/pdf/bio...

> Multiple range and multiple F tests

Also directly available (possibly with a free JSTOR account but if not, SH/LG): https://www.jstor.org/stable/pdf/3001478.pdf

> A new look at the statistical model identification

GS paywall but G & SH/LG.

> Improved M13 phage cloning vectors and host strains: nucleotide sequences of the M13mp18 and pUC19 vectors


> Nitric oxide: physiology, pathophysiology, and pharmacology

G/GS paywall but SH/LG.

> An algorithm for least-squares estimation of nonlinear parameters

GS paywall but G & SH/LG

> A low-viscosity epoxy resin embedding medium for electron microscopy

GS/G paywall but SH/LG.

> Continuous cultures of fused cells secreting antibody of predefined specificity

G, and directly available: http://www.jimmunol.org/content/jimmunol/174/5/2453.full.pdf

> Homeostasis model assessment: insulin resistance and β-cell function from fasting plasma glucose and insulin concentrations in man

Directly available: https://link.springer.com/content/pdf/10.1007/BF00280883.pdf

Thanks for the work done here, gwern. Appreciate it.

> > Phase annealing in SHELX-90: direct methods for larger structures

> I am not sure why this one was listed as 'paywall' when it appears to be available directly from the publisher: > http://journals.iucr.org/a/issues/1990/06/00/an0278/an0278.p...

It pops up a http authentication box for me.

> > Multiple range and multiple F tests

> Also directly available (possibly with a free JSTOR account but if not, SH/LG):

> https://www.jstor.org/stable/pdf/3001478.pdf

You can only view 3 free items every 14 days, wouldn't call that exactly freely available.

> It pops up a http authentication box for me.

Might be referral-based. Try searching the title and going from the abstract.

> You can only view 3 free items every 14 days, wouldn't call that exactly freely available.

There's no verification of .edu addresses or anything, so you can make as many as you need. I wouldn't call that exactly paywalled either.

gwern you are amazing, thank you.

We should all be supporting Sci-Hub. There is no reason for these papers to be locked away from the public.

Archived copy, which can be read with JS disabled:


As a former professor I don't see much of an issue with this. Every research institution on the planet will have access to the articles.

Even if you don't have an affiliation (or your school doesn't subscribe to a particular journal), if you use Google Scholar to search for an article you can easily find pre-prints which are essentially the same thing. Additionally, if that still fails then the next option is to just email the author - they actually want you to read their work and will just send it out.

The real issue is that the societies are essentially extorting universities for hundreds of thousands of dollars per year, when writers have to pay to submit and readers have to pay to read. Many of the newer journals are becoming open access, but few of them have been able to make enough inroads to be considered "good journals." This is a completely separate topic from the one in the above article.

> Even if you don't have an affiliation

A very tiny number of people have affiliations; this isn't a realistic option.

> if you use Google Scholar to search for an article you can easily find pre-prints which are essentially the same thing

That hasn't been my experience. Lots of things I can't read.

I personally do see an issue if we make knowledge only accessible to a minority of people.

In my field (microbiology/bioinformatics), it's almost impossible to find pre-prints of paywalled papers, and emailing the authors in the hope that they will respond at all doesn't seem an extremely efficient process for literature research either.

Maybe I am biased by my field - economics - in which even having "access" doesn't mean they are "accessible." There are very very few people without a PhD who are able to understand most of the good papers. There is an even smaller number who would be able to contribute to scientific discourse, which is the main reason these papers are made available.

The other thing that may only be localised to my field is that every author recognizes this problem and makes a copy available on their website or university's working paper site. I would only rarely resort to going to an actual journal because Google Scholar was way easier to find a copy of the article.

This is, of course, going to differ for a number of fields and there is a growing trend for authors and journals to open up their articles.

As an independent researcher in machine vision, I take strong issue with your first statement - gateways to science controlled by a beauty contest/club membership are fundamentally horrible for science. However, with your third paragraph you completely redeemed yourself. I'd argue that if you've published, you'd be quite happy with anyone reading the paper; it's just the emails from cranks that are a pain :-) Practically, I find that any paper I want can be found by tracking down the authors' institutional home pages - they always seem to find some way to make it available :-) Yay them!

Copyright desperately needs reform. It's silly that we treat completely different areas, from code to movies to math papers, with the same set of rules. Fine, let the Mouse be protected indefinitely. But nonfiction works have objective value to society. It's insane that 100-year-old scientific works are still copyrighted and paywalled. It's wrong that you can be sued and even go to prison for spreading and preserving humanity's knowledge.

The internet makes a lot of the concepts behind copyright fundamentally ridiculous. We honestly need to rewrite many of the rules from the ground up to take into account modern technology and what would best benefit society given the accessibility of information.

Yeah and how many are still inaccessible with scihub or libgen? I have access through my university to most journals but I always use scihub because it’s the easiest and fastest way to access any paper.

Answering the question "Why is Sci-Hub so popular?":

Because it works. It delivers information and knowledge to those who need it.

Because information and knowledge are public goods. As CUNY/GC says, an "increasingly unpopular idea" [1][2][3], but an absolutely correct one.

Because it democratises information.

Because much of the world cannot afford to pay US/EU/JP/AU prices for content. Including many of those in the US/EU/JP/AU. And most certainly virtually all of those outside. Billions and billions of people.

Because the research is (often) publicly funded, conducted in public institutions, and meant for the public.

Because markets in information simply don't work. https://redd.it/2vm2da

Deadweight losses from restricted access and perverse incentives for publication both taint the system.

Because much of the content would be public domain under the copyright law in force at the time: EVERYTHING published before 1962, and much of what was published up through 1976, remains locked up only because of the retroactive extensions of copyright that the 1976 Act and multiple subsequent copyright acts have created.

Because 30% profit margins are excessive by any measure. Greed, in this case, is not good.

Because the interfaces to existing systems, a patchwork of poorly administered, poorly designed, limited-access, partial systems, are frankly far more tedious to navigate than Sci-Hub: submit DOI or URL, get paper.

Because unaffiliated independent research is a thing.

Because the old regime is absolutely unsustainable. It will die. It is dying as we write this.

Because the roles of financing research and publication need not parallel the activity of accessing content. Ronald Coase's "The Nature of the Firm" (1937), a paper which should be public domain today under the law under which it was created and published, and should have been by 1991 at the latest, but isn't, tells us why: transactions themselves have costs. http://sci-hub.ac/http://papers.ssrn.com/sol3/papers.cfm?abs...

Because journals no longer serve a primary role as publishers of academic material, but as gatekeepers over academic professional advancement. This perpetuates multiple pathologies: papers don't advance knowledge, academics are blackmailed into the system, and access to knowledge is curtailed.

Because what the academic publishing industry calls "theft" the world calls "research".


[1] See GC Presents, "At the Graduate Center, we believe knowledge is a public good. This idea inspires our research, teaching, and public events. We invite you to join us for timely discussions, diverse cultural perspectives, and thought-provoking ideas." https://www.gc.cuny.edu/Public-Programming/GC-Presents

[2] See GC President Chase F. Robinson, introducing a conversation between Paul Krugman and Olivier Blanchard. A rare moment where the introduction itself contains some provocative thoughts, at about 50s into the video. (The remaining 72 minutes and 20 seconds aren't bad either if you're interested in discussions of global economics.) https://www.youtube.com/watch?v=zndOEQnMC44

[3] Joseph Stiglitz, "Knowledge as a Global Public Good," in Global Public Goods: International Cooperation in the 21st Century, Inge Kaul, Isabelle Grunberg, Marc A. Stern (eds.), United Nations Development Programme, New York: Oxford University Press, 1999, pp. 308-325. http://s1.downloadmienphi.net/file/downloadfile6/151/1384343...


(This has proved to be among my more popular articles, including being picked up by the Open Access community.)

The best thing I learned in university was figuring out how to get paywalled articles for free. This involved looking through arXiv and looking for the author's website.



[SciHub](https://scihub.org/) is a project to "provide free access to research articles and latest research information without any barrier". It can also be used via Telegram at @scihubot.

Huh, I didn't recognize the URL for Sci-Hub, but then realized this is not THE Sci-Hub but another site that stole its name. The correct URL would be https://sci-hub.cc/

My apologies, I googled "scihub" and picked the first result in English.

Those numbers seem a little disingenuous. If you work at a decent university you're not paying $20 to access every article.

Plus, I'd be stunned if all the people making the citations actually read the full paper. Some papers are cited because everyone knows you need to cite them.

Daily reminder that universities are funded by society. As are a lot of those papers...

So yes you ARE paying. Just not directly.

And if you're part of the 99% of us who AREN'T working at a university but still funding their research with taxes, we have to pay both the taxes AND the direct fees to access the papers.

I completely agree. I wasn't commenting on the thrust of the article, which I agree with, but rather the distorting figures that were used to promote it.

Only a very small segment of society can work at a university. Even a lot of PhDs are getting forced into the private sector due to the lack of available space on faculties. Anyway, just because I'm not doing basic research doesn't mean I can't benefit from reading the methods section of a research paper, particularly in this day and age where biotech, data science, and machine learning are exploding and it's actually very practical for people in the private sector to take advantage of the same technology used in cutting-edge research labs.

I agree, and I'm encouraged by the rise of open access. When it comes to data science and machine learning in particular, there is a large amount of open access material, which is awesome.

Basic biology and medical research is another story. Until we see fundamental changes in how researchers are assessed, I don't see how the situation is going to change though. But change it must, and will. Eventually.

But it makes them inaccessible to those of us who are no longer at university.

I used to read the occasional paper, usually related to something outside my degree. Those papers are all legally inaccessible to me now.

Feel free to add up the subscriptions paid by the libraries of the 100 top universities if you feel that figure would be less disingenuous. I'm thinking it's going to be a vastly greater number of dollars for researchers to access research paid for by the public purse, but I haven't done the research to back that.

Just curious if anyone know where those dollars are actually going. How much to the publisher? The University? The researchers?

All of it goes to the publisher. You also forgot to ask if the peer-reviewer gets anything, and the answer to that question is also no.

Good question. Sorry to hear that's the case.

You rarely need to pay if you work at a university. When you're on the school network or VPN there's a way to bypass the paywall because the uni usually buys subscriptions for most journals. That does not work if you are not working in a university.

The vast majority of universities outside of developed countries can't afford to pay. I work at a national university in Cambodia. I've been trying to get the ministry of education to pay for a heavily discounted account (because we're a developing country) from JSTOR for the last two years with no luck and that's just a few hundred a year. Without libgen and sci-hub we'd be really screwed.

