Sci-Hub’s cache of pirated papers is so big, subscription journals are doomed (sciencemag.org)
632 points by happy-go-lucky on July 27, 2017 | 241 comments



I'm very happy to see SciHub going strong - for all the obvious reasons. Now let's just hope they back up to IPFS (if they do, I'll happily pin some of it).

I want to go off on a tangent here, though. Now that open access (whether arXiv or SciHub style) is becoming the norm, I wonder what can be done to improve the format of scientific papers? E.g., making them more like this:

http://worrydream.com/ScientificCommunicationAsSequentialArt...

instead of regular PDFs?


I don't know what that link is, but after 10 seconds of loading over LTE I gave up in favor of my data bundle. I hope it doesn't go towards whatever that is.

Personally I'm a big fan of blog posts: they're written for a broader audience in understandable language. I don't have any trouble reading English, but papers are almost universally written in a way that is more complicated than necessary. And in blog posts, if someone made a "novel new algorithm", at least you can be sure there's code, raw test data, screenshots, or whatever else is needed to DIY. And someone on the web is talking about it, perhaps even with comments below the post itself.


The link is to Bret Victor's rewrite of a scientific paper, in which he introduces illustrations and interactivity for the important (and difficult) points of the paper.

I edited the link to the more "lightweight" version (i.e. just the page, without loading his entire homepage as well, which is kind of heavy), so please consider trying again.

I agree with you RE blog posts, and the point of me linking to Bret Victor's work is to give an example of a form that's better-suited to the media we operate with today. Papers still live in the paper age (pun not intended).


I love Bret's work, but I consider this rewrite pretty much entirely unreadable. I'm sure it's great for skimming - what with the bolding, and the flood of pictures - but that's not the point of scientific papers.


It would be better with a layout that flowed only from top to bottom. Reading text mixed with diagrams left-to-right, and then top-to-bottom, is hard, as you point out.

Since Bret has liberated the scientific article from paper, we may as well take advantage of the more flexible layout possible with digital media. Interactive diagrams are so nice for learning that I think it's worth pursuing Victor's idea further. Fred Akalin's blog has a couple great examples! https://www.akalin.com/quintic-unsolvability


Scientific papers and general blog posts have completely different audiences and intentions.

The purpose of a paper is to be rigorous and exact and to move the field of study forward. That tends to make papers dense and full of the field's jargon, which is shorthand for massive amounts of information.


While this is partially true, it is also DEFINITELY true that there is a large amount of academic posturing going on, at least in the field of Molecular Biology. I've read MANY papers that could've been written both more exactly and more simply if they'd avoided needless jargon. Yet my own experience indicates that they probably had to put in the jargon in order to get published - so much so that they may have been required to do a rewrite or two to insert more 'academic language' to get accepted. Quite annoying.


Academic posturing is a superficial criticism, at least in the life sciences. I've spent far more time trying to replicate intermediate protocols in peer-reviewed papers than I've spent trying to understand their data and conclusions. If it weren't for ambitious undergrads and PhD candidates, I'd spend most of my time identifying why one manufacturer's reagents failed but another's didn't, rather than actually doing research or writing grants.

The problem isn't the jargon itself, it's the fact that most labs have built up decades of institutional knowledge that they hoard for fear of their research being "scooped." I'm just thankful I don't work in downstream medical research anymore, where the number of replicable papers goes from one in two to one in ten, at best.


I agree that the posturing is not the essence of the infection, and didn't mean to communicate that. It does make wading through papers substantially more difficult, however, and that is a sincere and severe annoyance, though the tendency to hoard knowledge for fear of being scooped is a greater one.

Your comment about reagents from one manufacturer failing while another's succeeded brings back fond memories of trying to make our own botulinum neurotoxin to study neurons - and the rest, about hoarding institutional knowledge, reminds me why I tried to get out of that. I was one of those ambitious undergrads, but I permitted myself to wilt under the insanity of a system in which I KNEW that a partnered laboratory knew exactly what to do, but they were forbidden to share the knowledge because of politics. By the time I had cut through the wasteful uselessness, I discovered that the only sane person there, the one who had been willing to share knowledge, had actually DIED.

And so I merrily skipped over to the bioinformatics department, which has issues as well but in which I was able to manage them fairly effectively. :D


> The problem isn't the jargon itself, it's the fact that most labs have built up decades of institutional knowledge that they hoard for fear of their research being "scooped."

Unfortunately it's the same for software and algorithms (at least in bioinformatics, which is my field). Everything is treated like closely guarded secrets, at least in the smaller environments.

Thankfully the tendency is slowly being reversed and there are even things developed out in the open: the "problem" with those is that being in the open, it's harder to get publications out.


Yep, even well-written papers will be dense and jargonistic for laymen -- and it is childish to expect everything to be well written.

But imagine if the incentives of scholars somehow shift so that they look less to funding committees and more to the audiences of blogs and magazine articles. I think such a shift would do more good than harm.

Scholars would still write serious papers to establish their reputation amongst each other, but insofar as they rush to publish, I'd rather they pump out accessible blog-style articles than the kind of gibberish that is produced today in order to inflate citation metrics.


You might be referring to a different field than I usually read, but "the kind of gibberish" that is in papers is much clearer and more exact than blog-style hand-wavy explanations.


When you say blog posts I assume you don't mean something like the OP?

you mean something like https://terrytao.wordpress.com/

or

https://blogs.princeton.edu/imabandit/

Then yes I think these are a very valuable and increasingly important part of the scientific process. I think you might benefit from labelling the blog as 'math blog' which is quite distinct from the original link.


> I don't have any trouble reading English, but papers are almost universally written in a way that is more complicated than necessary.

This is a function of two things. The first is that venues often have strict page limits for their submissions. Authors typically have a lot to say and not a lot of space to say it, so they pick language that is precise, dense, and typically colorless. The second is that every domain has a lot of style and nomenclature that the layman won't know. It can be learned, but you must remember scientists write for each other, not for the layman.


The first point: fair enough. I'd still see that as a shortcoming of the medium and something to be solved (rather than "just the way it is"), but that's reasonable.

The second: I was talking about reading papers from my own field. That I don't understand a Biology paper (which indeed I don't -- I've tried) is perfectly understandable, but infosec papers shouldn't (and don't) contain any lingo that's unknown to me. It's just a very convoluted way of writing.


Regarding short dense papers: have you seen the infamous SELU paper? It's got an epic 93 page appendix.

https://arxiv.org/abs/1706.02515


> I don't know what that link is, but after 10 seconds of loading over LTE I gave up in favor of my data bundle.

The whole thing is 283 kB and half of that is due to three small images that are already compressed as much as is possible (but that are actually part of the comment about the paper formatting, not the paper itself).

That seems very reasonable for a scientific paper which would have been about the same size as a PDF.


It's my fault. The link originally was this:

http://worrydream.com/#!/ScientificCommunicationAsSequential...

which loads the article... plus the entire homepage of Bret Victor, which is kind of heavy.


We don't need fancy graphics - we need all the data, always.

The primary shortcoming of current scientific papers is that they're too short. They're written to physically fit into a paper journal, but their brevity massively impairs efforts to validate and replicate research. We're taught in high school science class that a paper should include all the information needed to replicate a study, but in practice that simply isn't true. Our methods of publication haven't kept pace with the complexity of modern research and analysis methods. Failures to replicate and research misconduct are eroding the foundations of science.

In 2017, the appendices of a paper should include everything - the full dataset, complete lab notes, full source code, the works.


Can I push back on the notion that "open access" can constitute pirated material (any more than the leaked Windows 2000 code makes Windows "open source")? Conflating the two things has been an anti-OA tactic in the past.

(Disclaimer: Posting in a personal capacity.)


Makes no difference if you aren't planning to redistribute the code or document (possibly as the basis of a derived work).

If you could get the Windows source code with some paid subscription program, so that it's not pirated, it still wouldn't be open source. It would be more like "open access".

Open source means we can modify it and share our modified version, and even charge for it.

Most users of scientific papers aren't looking to extend those papers and redistribute modified copies; they just read them privately and make citations (which nobody can tell whether they were from a licensed or pirated copy).


I was drawing analogy between two unconnected situations not saying they were equivalent. Sorry if that wasn't clear.


No, it would be open source. There's the source code. It's open; anyone can go look at it. Whether or not you can modify it without violating a license has nothing to do with whether or not you can read it.

And this is exactly why Richard Stallman doesn't like the term. https://www.gnu.org/philosophy/open-source-misses-the-point....


That article describes this as a misunderstanding:

> The official definition of “open source software” (which is published by the Open Source Initiative and is too long to include here) was derived indirectly from our criteria for free software. [...]

> However, the obvious meaning for the expression “open source software”—and the one most people seem to think it means—is “You can look at the source code.” That criterion is much weaker than the free software definition, much weaker also than the official definition of open source. It includes many programs that are neither free nor open source.

> Since the obvious meaning for “open source” is not the meaning that its advocates intend, the result is that most people misunderstand the term.

That is, Stallman is criticizing "open source" for being, among other things, easy to misunderstand as not requiring free software licensing — even though it does, in fact, require free software licensing.


That's not the OSI definition, and your Richard Stallman quote isn't saying what you think it's saying.

It's saying that one of the problems with "open source" is that you can make the same mistake you just made (thinking that being able to see the source code makes it "open source").

Richard Stallman's primary issue with "open source" is that they intentionally avoid talking about issues of software freedom, and have consistently muddied the waters by co-opting free software as "open source".


Just because the term is confusing doesn't mean you are obligated to be confused by it.


I think more precisely you mean that unauthorized copies aren't "open access" in the usual sense of the OA movement (and many people want OA to be achieved through official licensing, not unofficial distribution). That is, it's the unauthorized copies that aren't OA.


>Now let's just hope they back up to IPFS (if they do, I'll happily pin some of it).

Looks like they've gone in the opposite direction. Access to the torrent repository is no longer available. The only way to download articles now is by clicking through ads.


The torrent repository is available on libgen.



The older dump is still available via https://thepiratebay.org/torrent/11674459/The_Library_Genesi... at least


By any chance, do you know what the rationale for this change is?

If the torrent repository were gone permanently, it would be extremely sad, since it's (AFAIK) the only way of having automated and/or bulk download of papers, for example for data mining — sci-hub.io is understandably not really suited for this.


>By any chance, do you know what the rationale for this change is?

No, I tried to pose the same question recently on the forum, but it's locked down for new users. At best it has something to do with recent litigation. At worst they're being hypocritical and seeking that sweet ad dollar. I agree that the current situation is only marginally better than the status quo.


What about the link you posted? It's... a paper with uncomfortably wide lines that hurt readability, and grey text (why is it always grey?), which does the same. The only interesting thing is the animations, but those are only really useful in a small percentage of works (even here they come off as gimmicky), and you can just link them from the paper anyway.


> I wonder what can be done to improve the format of scientific papers?

See the Distill initiative, for one direction:

https://distill.pub

HN discussion:

https://news.ycombinator.com/item?id=13915808


I don't know, but _please_ let it be something I can print on real paper and sketch on.


I prefer PDFs. Easy to read, easy to print, no weird layout.


They're pretty bad for meta-scientific research. That is, they are very much a for-human-eyeballs-only format.


Agreed. I've downloaded parts of the sci-hub archive and want to analyze the articles, but I can never even get a decent solution for extracting multilingual text. I've tried the well-known open-source tools for this and none have really been satisfactory.
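For concreteness, the kind of extraction pass I mean (a minimal sketch using pdfminer.six, one of the usual open-source options; how well it handles non-Latin scripts depends entirely on each PDF's font-to-Unicode maps, and the directory name is a placeholder):

    # Minimal sketch: bulk text extraction with pdfminer.six.
    # Output quality depends on each PDF's ToUnicode maps, so non-Latin
    # scripts frequently come out garbled or empty.
    from pathlib import Path
    from pdfminer.high_level import extract_text

    for pdf in Path("scihub_dump").rglob("*.pdf"):   # placeholder directory
        try:
            text = extract_text(str(pdf))
        except Exception:
            continue  # bulk dumps contain plenty of damaged files
        pdf.with_suffix(".txt").write_text(text, encoding="utf-8")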


Try narrowing down to papers in English only? Why try to deal with a problem which has no feasible solution for now?


I hate PDFs. Hard to read, don't own a printer, weird layout.


I hate hating PDFs. Hard to not read, don't not own a printer, unweird layout. Disagreeing is fun. Did I do it right?


I actually meant what I said, though.

HTML (in some reasonable form) is my favourite format for reading text.


Improving and advocating for standard methods of data publishing accompanying scientific papers would be great. That and promoting open source (including open-sourcing the papers themselves, as in the source text and algorithms used to generate the accompanying resources etc).


An interactive/collaborative standard for viewing, annotating, and sharing notes is really needed in this space. The technology exists to sit on top of the PDF files that are already created but a new standard to write and publish content is needed.


Funnily enough, I've been thinking lately that it might be nice for journals to publish annotated Jupyter notebooks or similar instead of conventional prose-based methods sections.


In the short term, .tex might be the most realistic option. It's not exactly the most accessible format, but a lot of people already use it so it's less additional work.

It's plain text so you can throw parsers at it to extract text, data, and formulae. It can be automatically converted into HTML for online display, PDF for downloading, etc. It's not pretty, of course, but we already have plenty of experience parsing and extracting useful information from another widely used format that mixes semantics and presentation all over the place, so why not?
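As a rough sketch of what that pipeline could look like (this assumes pandoc via the pypandoc wrapper, and the filenames are placeholders; real submissions with heavy custom macros or class files will need extra care):

    # Rough sketch: turn a .tex source into HTML for online display and
    # plain text for mining, using pandoc through the pypandoc wrapper.
    import pypandoc

    html = pypandoc.convert_file("paper.tex", "html", format="latex")
    plain = pypandoc.convert_file("paper.tex", "plain", format="latex")

    with open("paper.html", "w", encoding="utf-8") as f:
        f.write(html)
    with open("paper.txt", "w", encoding="utf-8") as f:
        f.write(plain)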


Poster sessions at conferences have traditionally filled this role. I view any enhancement along this form as an evolution of that form of communication.

PDFs are a danged sight easier to organize for research as well, IMO.


I came here to say the exact same thing. A friend and I made a pinning service two weeks ago (www.eternum.io) and we'd be happy to donate some space/bandwidth by pinning a subset of the papers.


Isn't this a problem that Xanadu[1] is trying to solve?

[1] http://www.xanadu.net/


Writing papers as they are now is already a labor-intensive and arduous process. That link is great [1], but producing such a thing would add quite a bit of time to the paper cycle. Besides, many people still like to print papers out...

[1] because it's Bret Victor, of course


Have you looked at the eLife article format? https://elifesciences.org/articles/29763 Their tools are open source. It's very impressive.


whither the university?

(1) MOOCs - student does not need to pay exorbitant tuition fees

(2) Sci-Hub - student does not need access to a university library

(3) ? universal accreditation - this piece is missing

===

idea for start-up: aggregate MOOCs, package instructions on how to retrieve information online, connect instructors and tutors to students, arrange accreditation, offer BA, MA, PhD programs, we have start-ups in every industry but higher level education, could it be possible? …


>arrange accreditation

This will turn out to be harder than it appears. The accreditation process is complex, expensive, and bureaucratic - possibly by design.

PhD accreditation would be particularly complicated. PhDs are partly a form of academic hazing and lineage development, and very much not just about the research/content.

Unfortunately education is more of a political problem than an information redistribution problem.


MOOCs are very spotty. You may be able to do this for some of CS, but it would be tough to put together a complete high quality degree in anything else.


MOOCs are not very useful for any kind of science that requires lab access, and Sci-Hub is not very useful for any kind of scholarly material that's not scientific, so I'd say universities are still far from obsolete.


One example of an attempt to improve the formatting and sharing of scientific research beyond paywalled PDFs: https://andrewgyork.github.io/rescan_line_sted/ https://andrewgyork.github.io/publication_template/


Good riddance, limiting access to scientific articles is a detriment to the advancement of humanity.


The sooner all these assholes go out of business, the better

I hope scihub has a well thought out exit plan, in case of seizure.

~ 40 M files, at 1 MB each (my own estimate). I hope they have backups / insurance files with all the papers collected this far. Given the data volume it's not trivial but surely not impossible.


They back up to Library Genesis as far as I understand it, and Library Genesis is available via torrents & Usenet - http://gen.lib.rus.ec/repository_torrent/


The IPFS network is ideal for this kind of situation.

I should say I'm biased as I run ipfsstore.it, but to host all 26GB there would not be expensive.


> The IPFS network is ideal for this kind of situation.

I like ipfs, but in this situation specifically it's got problems. It's not private and it's not secret. It means that, by hosting a mirror of the data you're announcing: I'm publicly offering these documents, likely illegally as far as copyright is involved.

(And without a country/region limitation, which may be another issue)


> It's not private and it's not secret

Neither is BitTorrent.


How about Freenet?


IPFS overhead is still way too big. Also, it's 26TB, not 25GB.


Overhead in what aspect?


Adding files to IPFS creates a copy of the file in their blockstore implementation. The new filestore should have solved this problem but I haven't used it yet.
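For reference, the filestore route looks roughly like this (a sketch based on the experimental feature as documented; since it's experimental, the flags may change between releases, and the directory path is a placeholder):

    # Sketch: add files without duplicating them into the blockstore,
    # using the experimental filestore and the --nocopy flag.
    import subprocess

    # One-time: enable the filestore experiment, then restart the daemon.
    subprocess.run(["ipfs", "config", "--json",
                    "Experimental.FilestoreEnabled", "true"], check=True)

    # Add a directory of papers by reference; blocks point at the original
    # files on disk instead of copying the whole archive into ~/.ipfs.
    subprocess.run(["ipfs", "add", "-r", "--nocopy", "papers/"], check=True)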


Besides the already-mentioned duplication of all files, the DHT network chatter to announce a large repository produces, by default, many TB of traffic a month.


So for your service: 26TB * $0.044/GB = $1144/month to host all the Sci-Hub data

Quite a big fee :p


That seems ridiculously cheap and easy to come up with funds for. Storing the entire progress of science for ~$1k/mo? That's a good deal.


True - a low-grade ICO for SciHub, backing a multi-country IPFS store with a decent web interface on top, could raise that money easily. You could create a mini forum/social network around it for commenting on papers. Buying a coin to make an account, or freemium 'VIP' accounts a la Dribbble... maybe even a DocumentCloud-style annotation system, so it has some self-sustaining business model other than raising initial capital.

Although I don't know much about creating 'DApps' or how Ethereum or IPFS works or what SciHub contains beyond a primitive grasp [1], just sayin :p

[1] I watched a video of a talk "Distributed Apps with IPFS (Juan Benet) - Full Stack Fest 2016" https://www.youtube.com/watch?v=jONZtXMu03w


Why does everything have to be an ICO? What's the point of a token here? Just give the appropriate organization some charitable funds.


That's one way to look at it.

Another is 13 2TB drives at $70 a pop. May run into some network congestion sharing however :)


In which way is IPFS ideal?


Wouldn't IPFS make it easier to access individual files compared to a Torrent file?

26TB is a massive amount of data.

Having downloaded (very) large dumps in the past with torrents, I've found it always a bit annoying, especially when attempting to access a subset; torrent client interfaces are less than ideal here. Not that someone couldn't build a good document browser on top of torrents - a la those movie torrent streaming apps.

I'm not familiar with how IPFS works in detail but being able to access it via a file system sounds much better UX wise. Hopefully it can support that scale.

But to your point the primary end-goal of "censorship-free" persistent access is largely the same AFAIK.


You don't have to download an entire torrent to access a single file. Most torrent clients can prioritize files so they get downloaded first, or avoid downloading everything in favor of just one or two. That is, unless the torrent is just a single zip or something.
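It also scripts fairly easily; a rough sketch with the python-libtorrent bindings (the torrent filename and the file index below are placeholders - list the files first to find the one you want):

    # Sketch: fetch a single file out of a huge torrent by zeroing every
    # file priority except the one we care about.
    import time
    import libtorrent as lt

    ses = lt.session()
    info = lt.torrent_info("chunk.torrent")      # placeholder filename
    handle = ses.add_torrent({"ti": info, "save_path": "."})

    priorities = [0] * info.num_files()          # 0 = don't download
    priorities[42] = 7                           # hypothetical file index
    handle.prioritize_files(priorities)

    while not handle.status().is_finished:
        time.sleep(10)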


Yes I mentioned that in my comment but it's not an ideal (or even practical) interface when you have 26TB of files. You'd have to build a special torrent interface to deal with that in a useful way. Existing torrent clients are far from capable of handling this meaningfully.


And if everyone does that, no one will offer the particular file except hosts, so the availability can be spotty - and the torrent interfaces have no way to tell upfront.


The DB dump is public too; off the top of my head, the libgen forums quoted something like 100TB altogether. (This includes more than the Sci-Hub cache.)


Yes, if you have a seedbox or something please host a few random parts! I have three or four hosted pieces and there aren't too many peers.


Is there any way to tell which torrents have fewer seeds, without having to add them all to my client?


I wasn't able to figure out a way so I just picked a few at random. Is there an easy way to roll up like a heroku app or something that would check the health of only those torrents? That could be really useful to the libgen guys.


I couldn't find anything, so I ended up making it myself using TorrentSniff.

http://genesis.andreparames.com/

Most of the trackers seem to be down, so it's just getting data from tracker2.wasabii.com, and even that is pretty slow, so it'll take a while to fetch for all 1693 torrents.

I wrote a script that pings the tracker and keeps regenerating a static HTML file, so it'll keep updating itself, though not very fast.
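If anyone wants to replicate it: the same check can be scripted directly against a tracker's scrape endpoint. A rough sketch (assuming an HTTP tracker whose announce URL can be rewritten to /scrape, plus the requests and bencodepy packages; the directory of .torrent files is a placeholder):

    # Sketch: report how many seeders the tracker knows about for each
    # .torrent file in the downloaded repository_torrent dump.
    import hashlib
    import pathlib
    import bencodepy
    import requests

    def seed_count(torrent_path):
        meta = bencodepy.decode(torrent_path.read_bytes())
        info_hash = hashlib.sha1(bencodepy.encode(meta[b"info"])).digest()
        scrape_url = meta[b"announce"].decode().replace("announce", "scrape")
        resp = requests.get(scrape_url, params={"info_hash": info_hash},
                            timeout=30)
        stats = bencodepy.decode(resp.content)[b"files"][info_hash]
        return stats[b"complete"]   # seeders reported by the tracker

    for t in sorted(pathlib.Path("repository_torrent").glob("*.torrent")):
        try:
            print(t.name, seed_count(t))
        except Exception as e:
            print(t.name, "error:", e)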


Just saw this, thanks! Super cool.


I would like to know as well.



That's brilliant; nice work! Will try to seed some soon.


Basically, the publishers asked for this. Denying open access to old papers is, from humanity's point of view, wasteful. The planet is full of hungry minds. Who knows where the next Ramanujan comes from and which discipline he or she chooses, but given the non-existent transaction cost of reading an old paper, it would be beyond silly if they could not do it for free.


Publishers can provide value, and should not be forced to do so at a loss. Peer review, curating for quality, editing, typesetting, and creating physical copies could be expensive.

True, current publishers don't compensate peer reviewers, limit curation to the set of people who will pay their submission fee, force editing and typesetting back on the authors, and charge egregious fees disproportionate to the cost of distribution. But the concept of the business isn't invalid.


Nobody cares if publishers could theoretically add value in some alternate universe. In practice, none of them do.

This isn't throwing the baby out with the bath water, this is people trying to stop this nasty ass bath water from being thrown out because there used to be a baby in it.


Working at one, I have to say I don't think this is true for all publishers.

Prices are a different thing, but I'm definitely spending my day here adding value.


Some think anything can be open-sourced, but that does not work in the current world we live in. For an obvious example, SpaceX could not have been an open-source project.

I suppose the general notion is that the only problem publishers solve is distribution, which is kind of non-value-adding nowadays, but people don't think about the editorial process, which needs actual work.

Sure, I think you guys add value. My critique was towards the cost of old papers and the ownership which journals take over published work. You guys need to get paid, but the world also needs its open access to the scientific corpus.

The papers which have very little value-adding editorial input do have a harder time legitimizing their slice of the pie, though.


Maybe not open source, but an open hardware project. The particular problem for SpaceX is that a lot of the IP is covered by ITAR: https://en.wikipedia.org/wiki/International_Traffic_in_Arms_... See this FAQ entry: https://www.reddit.com/r/spacex/wiki/faq/company#wiki_what_p...

But in theory it's possible, e.g. Facebook's Open Compute Project, it just needs the right kind of funding model.


To be honest, there's a lot of things that could be done differently.

It's just a matter of finding a way to do it that doesn't jeopardize the entire business.


Honest question: what is it you do that adds value?


I personally work on the Nature Index


Ok, fair enough - I can see how that's useful.


You could still add value without a publisher. Many papers are funded by grants, and a large chunk of the grant goes to the publisher. Remove the publisher and there you have it - free access, peer review, editing.


Wonderfully said.


Yeah, just to spell out how things work in CS for the things you mentioned:

> Peer review

Organized and performed by volunteers.

> curating for quality

Ditto.

> editing, typesetting

Done by authors (LaTeX isn't perfect, but it gets you "good enough" typesetting. And the last time a journal edited a paper of mine they mangled it because the person editing wasn't familiar with technical topics & jargon).

> creating physical copies

No one needs these any more; they just print the PDF if necessary.


>Publishers can provide value, and should not be forced to do so at a loss.

Yes, they can. But they don't.

>Peer review,

Free. Volunteers.

>curating for quality, editing, typesetting,

They don't do this. You are expected to do it all. Your papers are sent back for revisions and/or rejected by editors and peer reviewers, who, again, work for free.

>and creating physical copies could be expensive.

Physical copies cost additional money to acquire anyway.


> Free. Volunteers.

I don't think this is a good solution, because in reality reviewing these papers is complicated work for people who have spent years, often times decades, in their field. They do deserve to be paid.

I do think publishers provide tangible, real value, but often the value they provide is significantly outweighed by the problems that arise from the way they run their businesses.


>I don't think this is a good solution, because in reality reviewing these papers is complicated work for people who have spent years, often times decades, in their field. They do deserve to be paid.

I absolutely agree. But this is not how it works, and suggestions to pay reviewers are usually met with haughty academic nonsense.

Publishers in academia (and in the normal book industry, to boot) add near zero value while extracting huge fees.


Nobody is talking about forcing publishers to do anything, though. If demanding free access to electronic copies of papers actually makes their business untenable as they claim, they should put their money where their mouth is and shut it down; at that point, based on what will replace it, we will see how much value they actually added to society.


> Publishers can provide value [...]

That's like saying that dictatorships can be a superior form of government, because they provide centralized planning and decision making at a much lower overall cost. True, current dictatorships don't do so... but the concept isn't invalid.


Well, there have been successful dictatorships in the past...


Genghis Khan was successful.


I disagree. Journal publishers provide value by crafting in-house editorials and industry news pieces, soliciting review papers on pertinent topics, suggesting editorials for controversial works, and by filtering out obviously crappy manuscripts. High-quality, higher-impact journals (Science, Nature, NEJM, JAMA, etc.) do these things, and I believe they add a lot of value.

Lower impact journals are basically glorified FTP front ends. They don't create content, validate that content, editorialize that content, or even select that content. They just host and charge an enormous premium for it.


Easy. Make raw articles public, then let publishers do the value-add and sell the result.


They do none of that now; they provide nothing of value; they only ride on the coattails of prestigious venues and on the backs of reviewers' hard work.


Sure, I think organizing the peer review process has some merit. I don't have an opinion on whether that's the optimal way to do it. Thus I wrote that older papers should be available for free. If not hosted on the journal servers, then available free of copyright issues elsewhere.


Sure, they can. If they don't extract monopoly-level rents.


Yes, they priced themselves out of their market.


I had quite a bit of exposure to pirate journal archives before sci-hub arrived. A couple of easy improvements that I saw with past pirate libraries, and that it'd be nice to have on sci-hub:

- Strip download watermarks ("Downloaded by Wisconsin State University xxx.xxx.xxx.xxx on January 12, 2017 13:45:12"). Many times, journals published by the same publisher do the watermarking similarly, so you need to write just one pdftk (or other PDF manipulation software) script for every journal under their banner. At worst, it's a one-script-per-journal effort (see the sketch below).

- PDF optimization. A lot of publishers produce un-optimized PDFs that could be 25% (or more) smaller with a completely lossless space-optimization pass. This should save storage/network costs for access to individual papers and, more importantly, reduce the burden for bulk mirrors.

(I'd contribute the scripted passes myself if I had contacts within sci-hub.)
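To show roughly what the watermark pass would look like: the bullet above mentions pdftk, but here is a sketch using qpdf's QDF mode instead, since QDF files are designed to be edited as text and then repaired with fix-qdf. The watermark string, and the assumption that it sits on its own line in the content streams, are publisher-specific and need checking per journal:

    # Sketch of a watermark-stripping pass using qpdf's QDF mode.
    import subprocess

    WATERMARK = b"Downloaded by"      # publisher-specific marker

    def strip_watermark(src, dst):
        # 1. Rewrite the PDF in QDF form: uncompressed, text-editable streams.
        subprocess.run(["qpdf", "--qdf", "--object-streams=disable",
                        src, "_edit.pdf"], check=True)
        # 2. Drop every content-stream line that draws the watermark text.
        with open("_edit.pdf", "rb") as f:
            kept = [l for l in f.read().split(b"\n") if WATERMARK not in l]
        with open("_edit.pdf", "wb") as f:
            f.write(b"\n".join(kept))
        # 3. fix-qdf repairs the stream lengths/offsets we just broke;
        #    a final qpdf pass recompresses the result.
        with open("_fixed.pdf", "wb") as out:
            subprocess.run(["fix-qdf", "_edit.pdf"], check=True, stdout=out)
        subprocess.run(["qpdf", "_fixed.pdf", dst], check=True)

    strip_watermark("downloaded.pdf", "clean.pdf")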


I deal with making PDFs a lot, and some of my processing steps (using pdfjam) make the file size explode. Currently I have no efficient way to make them smaller again. I would love to see how you do it.


The Multivalent PDF tool is still my go-to. It hasn't been updated in ages, but the jar still works fine:

http://multivalent.sourceforge.net/download.html

If you run

java tool.pdf.Compress file.pdf

it will generate file-o.pdf and often trim quite a bit of weight.

That's probably what I would suggest sci-hub use too because it's easy to use and automatic.

EDIT: this looks quite promising too: https://github.com/pts/pdfsizeopt

But it has bugs with handling some images... converts them to black boxes. I'd only use this one if you manually verify the result first (so not a fully scripted process). (Or this could be due to me not installing all the bleeding edge dependencies from source as instructed.)
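For bulk use, the Multivalent pass scripts easily enough. A throwaway sketch (the jar name and directory are placeholders; Compress writes <name>-o.pdf next to each input):

    # Batch-run Multivalent's lossless Compress tool over a directory tree.
    import pathlib
    import subprocess

    for pdf in pathlib.Path("papers").rglob("*.pdf"):
        subprocess.run(
            ["java", "-cp", "Multivalent.jar", "tool.pdf.Compress", str(pdf)],
            check=True)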


Thanks. I didn't try Multivalent as I didn't have a Java VM. Maybe I should take another look. I tried pdfsizeopt, and IIRC it didn't actually reduce my particular PDFs.


You've found watermarks in sci-hub PDFs? I hadn't noticed that but maybe I was too excited about being able to get the damn paper.

You could tweet at them with a link to a Github repo of your scripts, they are active on Twitter


Yes, the example article I linked to in another comment has the name of the university and the download date watermarked. And that's on an article from a large, long-established journal; it's not like they're just missing watermarks on niche journals.

Thanks for the tip about tweeting. I need to set up another github account disconnected from my professional identity but that shouldn't be too hard.

EDIT: and maybe a separate Twitter account. And maybe both accounts used only from public wifi, too. I wouldn't put much past the publishers when it comes to getting revenge on people who help sci-hub. Unlike Elbakyan, I am not out of their reach.


You can contact Alexandra Elbakyan in VK: https://vk.com/alexandra.elbakyan



The change won't be immediate, though. I don't think universities, which are journals' bread and butter, are going to stop their subscriptions anytime soon. Stopping a journal subscription because everyone is using sci-hub anyway (even if the researchers really are, on an individual basis) might open the door for copyright suits against the universities, which would undoubtedly be more expensive than just keeping the subs going, especially since it's just a line item in an accountant's book. I'm sure it will happen eventually, but journals might have enough time for some to pivot to a more nuanced business model before they go bust.


> I don't think universities, which are journals' bread and butter, are going to stop their subscriptions anytime soon.

Around 60 German universities have cancelled their contracts with Elsevier: http://www.the-scientist.com/?articles.view/articleNo/49906/...

Obviously, the stated reason for cancellation would be that costs are too high, not that everyone downloads papers from sci-hub.


The European Union requires publicly funded papers to be publicly available. Elsevier did not provide open access, and was therefore deemed illegal for publicly funded research.

They have since introduced open access options, and universities have renewed their subscriptions.


As far as academic publication goes, Elsevier is literally the devil. Go 60 German universities.


That's awesome, honestly I hope I'm wrong :)


That being said, think about the other lines in the accountants books that could be better funded if universities didn't have to pay for journal subscriptions. That funding could go towards hiring more professors, paying professors better, hiring more research assistants, outfitting research labs better, etc.

If we aren't paying researchers for peer review, the journal pay wall is just a net loss in productivity. It's just one more useless middleman.


> if universities didn't have to pay for journal subscriptions

You've misunderstood something there. With the "open access" model the costs are merely shifted around. In the traditional model the library would pay for a subscription; now it's the researcher paying page charges for publication. In the days of society publishers, the society had a reputation to lose, but Hindawi doesn't have such restraints on it.


In many fields, this change is a reflection of the shift in costs and responsibility for publication. I'll speak of conference proceedings in CS, since that's what I know best (but these are roughly the equivalent of journals in other fields - peer reviewed, 10-20 pages per article, etc.)

20 years ago, the process involved the creation of a physical artifact -- bound paper, handed out to participants, mailed to subscribers, etc. This is a fairly non-trivial undertaking, certainly costing quite a few thousand dollars.

As of a few years ago, the last of the major conferences in my area stopped creating printed proceedings. Along with that goes away the need (desire?) to ensure precise formatting rules, etc. -- the job of the typesetter is gone. We just submit PDFs, and everyone does their own.

The other "costs" of a journal or conference were already borne by the research community -- most of our conferences and journals are all-volunteer. It's a decentralized cost spread across the salaries of everyone in the community.

In our model, at least, the role of the publisher is greatly diminished. They provide some hosting, some indexing... and not much else.

Not all fields are like that. Science and Nature, for example, have paid editors. Fine. Happy to pay for that.

The shift in costs happened a while ago - the costs are probably in the range of $10-$100 per paper now, compared to quite a bit more before.


> In the traditional model the library would pay for subscription, now it's the researcher paying page charges for publication.

Not necessarily. In my own field (a branch of linguistics), many of our journals are open access, but they are still free to publish in. Some of the publishing costs are covered by the fact that these journals are put out by learned societies that are sitting on large endowments or receive state funding, and even for journals that are open access online, libraries often still want to pay for hard copies.


In my field (chemical biology) the page charges for open access are substantial. Traditional journals are still going strong; this may be changing, though. We don't do conferences, they seem to be a CS thing. The PLOS family of journals is reputable and charges north of 2000 dollars: http://journals.plos.org/plosone/s/publication-fees

Hindawi is half that, but they will publish anything.


> ...if universities didn't have to pay for journal subscriptions. That funding could go towards hiring more professors...

And more lawyers! If the journals did sue the universities it would be a one time legal cost to defeat them (thereby driving them out of business) versus the ongoing and growing cost of journal subscriptions.


Um, I know our legal system isn't perfect, but the outcome of a court case is still largely decided by, you know, actual laws and facts. You can't just throw money at some lawyers and assume you get to "defeat" whoever you want.

I totally agree with the position that for-profit journals are bad for the research community, but is there any doubt that legally they have the right to charge universities for the content that they own the rights to?


It is the researchers who are downloading papers, not the universities.


Perhaps the universities should really look over the submitter contracts and see if they can directly grant open access to the university’s own past submissions. Some sort of collective group could help them share the cost of that legal analysis.


The funding would not go towards those things. If you understand anything about universities, it would go towards expanding the bureaucratic workforce.


>might open the door for copyright suits against the universities

How so? The university will just say that they terminated the subscription to save costs (which is true), and they are not responsible for the legal transgressions of each researcher. They just can't tell them "get your papers from sci-hub". Even the researchers themselves are only downloading, not distributing, which in many jurisdictions is looked at quite favorably, with comparatively small punishments.


There's a long tail with these subscriptions.

So no one will drop their top-10 subscription, while expensive tail journals are probably doomed (tail journals are either very cheap and run by schools, or VERY expensive and managed by Elsevier).


I doubt they would have a claim to make unless the university officially endorsed SciHub. Otherwise, how would the institution be expected to know that researchers didn't buy access themselves?


My guess would be that they would be motivated enough to check the references of individual researchers. If researchers reference a paper from their journal, and neither the researcher nor the university has a subscription, they might harass them with legal threats at the very least. It's horrible and I hope that doesn't happen, but it's something I would not put past them if they felt their bottom line was affected. The RIAA put a lot more effort into protecting their copyright than the scenario I just described.


Alexandra Elbakyan's work is one of the most positive and important things to happen in the last 3 decades in the field of science, which has been gradually losing its luster due to the bastardization and devaluation of the field by politicians and salespeople using it like hucksters.

Elbakyan's work has inspired me to only publish my work in journals that embrace open access and open data. I'll be damned if I am a slave to impact factor and other haughty metrics.


The origin of the web was to disseminate scientific knowledge. The guardians of that knowledge- the journal publishers- have absolutely failed to make a viable business model out of this, while many companies who adopted the web made billions.

While I do not use Sci-Hub, I think that users who use it are doing so morally and ethically (in the sense of conscientious objection). I hope they are also willing to pay penalties if they are found to be violating copyright (this is generally considered a requirement for intentional protest).


SciHub shows the way papers should work on the Internet.

The other day in an HN discussion, someone cited a paper in response to my comment. I was able, in 30 seconds, to get to the full text of that paper, which allowed me to reevaluate my opinion in the context of what I read.

This is how science can, and should be useful for individuals. And beyond arXiv and SciHub, it generally isn't.


I'm an academic and fully support making research open access. But when I read discussions like this I always wonder about the people who work at journals. The process of peer-review, copy-editing, and online publication is not something that can be done for free.

Granted, the current journal publishers spend too much money on overhead, and I certainly don't support the for-profit ones like Elsevier that rake in huge profits. But I also don't see much allowance made for the fact that publishing research in a peer-reviewed format involves labor which should continue to be compensated in some way (incidentally, it's not entirely true that researchers themselves aren't compensated - publishing offers an indirect benefit but a real one in the sense that publications are directly linked to salary increases down the line).


> But when I read discussions like this I always wonder about the people who work at journals. The process of peer-review, copy-editing, and online publication is not something that can be done for free.

Peer review is done by others in the field (peers), typically for free. Online publication is "where do I stick this PDF", and does not require per-paper work (or if it does that can be done by the author). That leaves us with copyediting, and in many cases, copyediting issues are caught by peer review rather than any paid editor.


I do peer review myself and am familiar with the model. IMO, when I'm contacted to review someone's work, the journal employee contacting me is performing a legitimate service. They did research to find my name and email and determine that I'm competent to assess the paper. They or another editor have also reviewed the paper in the first place to determine if it's worthy of being peer-reviewed at all, which requires some domain-specific knowledge. And if I pass on peer reviewing, they have to find another person to ask, and so on. Multiplied by dozens of papers, that can be a substantial amount of work.

I also think that you underestimate the importance of copy-editors. It's not the job of a peer reviewer to make sure that the author uses an apostrophe properly, etc. There needs to be a specialist who is dedicated to catching those errors, particularly since, as you note, peer-reviewers aren't directly compensated so it's unfair to burden them with a whole other set of responsibilities.


> I also think that you underestimate the importance of copy-editors

Many Elsevier and Springer journals no longer perform copy-editing. Authors are expected to have their paper checked by a native English speaker and copy-edited at their own expense, and then provide the journal with a camera-ready PDF. In my own field, it is the open-access journals published by non-profits that actually have the best language quality and typesetting.

This is a problem that goes beyond journals into for-profit scholarly publishing more generally. In my field, Brill is an infamous publisher for this: it demands camera-ready PDFs for most of the monographs it puts out. So, your library ends up having to spend 400€ on a book where just about all the publisher contributed – besides unpaid peer review – is printing, binding, and mailing it out.


> There needs to be a specialist who is dedicated to catching those errors,

I think "needs" might be too strong here. We (the field of CS) get by without it just fine -- the standard practice is to include copy-editing "nits" at the end of one's review. A shepherd, assigned during the final phase, does a final pass on the paper before approving it.

Yup. Typos slip through. There are papers with poor English. Oh well. For the most part, it works out pretty well.

(And as a reviewer, I have little objection to also noting writing fixes while I'm reading your paper. I'm going to spend anywhere from 30 minutes to 5 hours reading the thing -- the writing fixes are a small additional cost. If you've done a decent job on the writing in the first place. If it's totally botched, I'll reject your paper and tell you to fix it before submitting it again. :-)

Now, would I prefer that the authors of submitted papers had to pay $50 for someone to do a copy-editing pass before I reviewed it? Heck yes. But perhaps we'll get DNNs to fix this for us one of these days. :)


That's a fair point. One thing I'm getting from this discussion is that it's difficult to generalize when it comes to academic publishing, because norms vary substantially between fields. I've seen the copy editing "nits" you mention at the end of reviewers' letters in my field (history) but I suspect that historians would rebel if the copy editor's job was entirely foisted off on us.

Likewise, it's normal for editors in my field to make meta-level suggestions about writing style. I'd wager that history journals place a greater emphasis on prose style than CS journals do, since making an historical argument often depends on telling a compelling narrative. Hence a publishing model that works for a CS journal might not work for humanities journals and vice versa.


Absolutely. And our papers are a fair bit shorter than yours, for the most part. :). 14 pages, 2 column, 10 or 11pt type is the norm for us, with figures and references included.


Even if we don't put all the copy-editing burden on the author (it's their reputation that is affected by the mistakes), the costs for organizing peer review and copy-editing could be covered by a fairly small submission fee, or alternatively donations or the sale of hard-copies. You could even offer freemium models where anyone can read the papers for free, but value-add options are sold to libraries and universities (integration into library catalog, better search, etc).


I think something like the freemium model you mention is the way to go. As others have mentioned in this discussion, the alternatives simply rely on pushing the costs to different sectors (like funding agencies or even, in the worst case scenarios, researchers themselves). I keep waiting for this sort of thing to happen. Perhaps it could even be bundled into some kind of social network that would offer an alternative to academia.edu. It seems that there is a lot of inertia when it comes to scholarly publishing though.


Let me rephrase that a little. In your first paragraph you argue that the work journals actually do is finding others to do the real work. And in the second paragraph you say that the copy editor is there to catch missing apostrophes. It's true that that is not "zero" work, but it's not a very strong case. And it certainly does not justify the absurd cost model we are currently stuck with.


We agree, it absolutely doesn't justify the current cost model. I hate having to donate my time to Elsevier as much as the next academic.

Maybe I gave the wrong impression by mentioning apostrophes, however. Professional copy editing is, in my view, super important in terms of differentiating a good journal from a bad or mediocre journal. As is the vetting procedure of finding appropriate peer reviewers and coordination between them. Whether or not you think of that as "real" work compared to producing actual research, it is still work that should be compensated in some way, in my view.

An analogy might be the professor teaching a class vs the maintenance guy who makes sure the projectors are working and the lights turn on. One might be more important than the other, but I don't think either should be working for free.


@benbreen Even if some journals vet peer reviewers, I've heard many just use the reviewer suggestions provided by the author...


> The process of peer-review, copy-editing, and online publication

I'm confused that you aren't aware of this, but peer review and copy-editing are by and large done by other academics, not journal publishers, and, at most, for what amounts to an honorarium.

And online publication is not a job, per se. We automated it a long time ago. Teenagers have tumblr blogs now. There is no reason why the practical aspects of online journal curation could not be handled by a very small team of people working in the IT departments of a few universities.

Journal article publication may contribute to academic careers, but that has nothing to do with the publishers' involvement: it has to do with the quality of the journal, which is driven by the academics in writing and reviewing for it.

Simply put, journal publishers are >entirely< overhead.


See my comment below. Copy-editing isn't performed by other academics (or at least it shouldn't be given their other responsibilities, in my view).

And although peer review relies on specialists volunteering their time, finding and vetting those specialists still requires work.

I'm not trying to say that the current publishing model is defensible, but I am pointing out that running a high quality peer reviewed journal still requires at least one or two dedicated workers. Whether you think they should be paid for their time or expected to volunteer is a different issue, but we shouldn't pretend that the whole apparatus is an illusion created by greedy publishers.


I always - always - sent my paper to at least one non-coauthor for a final copy-edit before being submitted.

Also, no editor at any journal (I've published ~20 papers) made any edits to my papers.


This must be highly field-specific then. I published a peer-reviewed paper a couple months ago that came back with around 50 suggested copy edits, as well as a few paragraphs of suggested changes from the journal editor. And then it went through the process again in a lesser form at the page proofs stage. Granted, a lot of those suggested edits were to make it conform to a somewhat arbitrary and tedious house style, but it also caught things that I or the other people I shared the paper with didn't see.


Oh, I didn't include edits to stay within house style, but then, I wouldn't submit a paper that wasn't already in house style.

If the editor sent some copy edits to the text, I'd just send it back to them unedited and ask them to publish it. In fact, I love pushing back against unreasonable editorial requests.


The companies do not recompense the reviewers or editors meaningfully. They ride only on the coattails of prestigious venues, with all the work done by volunteers.


Would it make sense for peer review to be uncoupled from distribution? I don't know how it would work if this were physical, but digitally, if e.g. ACMEcorp serves the paper, and review team 1, team 2, etc. can all attach their signatures of peer review in e.g. PeerDB, doesn't that make the review more transparent and detangle incentives?
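The attestation primitive itself is tiny - a toy sketch just to make the idea concrete (a "review" here is a detached signature over the paper's hash, so it can live anywhere, independent of whoever serves the PDF; "PeerDB" above is hypothetical, the filename is a placeholder, and this uses the Python cryptography package):

    # Toy sketch: a peer-review attestation as a detached Ed25519 signature
    # over the paper's hash. Publish (digest, reviewer public key, signature)
    # anywhere; verification does not depend on who hosts the PDF.
    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    with open("paper.pdf", "rb") as f:
        paper_digest = hashlib.sha256(f.read()).digest()

    reviewer_key = Ed25519PrivateKey.generate()
    signature = reviewer_key.sign(paper_digest)

    # Anyone holding the reviewer's public key can check the attestation:
    reviewer_key.public_key().verify(signature, paper_digest)  # raises if invalid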


Napster apparently forged a reasonable business model for music publishers, let's hope this forges one for journals.


MSP does all of this for quite reasonable rates.


Yes, I agree completely. I'd hope that the arXiv model would win - making SciHub unnecessary in the process. It hasn't yet.


> While I do not use Sci-Hub, I think that users who use it are doing so morally and ethically (in the sense of conscientious objection). I hope they are also willing to pay penalties if they are found to be violating copyright (this is generally considered a requirement for intentional protest).

I'm having trouble expanding this into a sensible-sounding general moral principle. Does this hold regardless of how large the penalties are? (Relevant considering the often very high punitive damages for copyright infringement at least in the case of entertainment media.) Is it specific to some kinds of laws or government system or fully universal in the sense that somebody protesting the policies of a stereotypical dictatorship is also morally obliged to be willing to be shot/flogged/whatever the corresponding punishment under that system is?


I agree with your comments. It's something a social studies teacher mentioned when I was in high school, and it never completely made sense to me, but it is the established principle. You can see more about the reasoning here: https://plato.stanford.edu/entries/civil-disobedience/


Also, if you're motivated and dig well enough, you'll find plenty of copies on college class lecture websites. Albeit nowhere near sci-hub's coverage, but still...


So, fundamental question here - if scientific articles (or anything that can be copy protected, etc.) can be released online in this manner to "free the knowledge", and yet, given such free access, there are still people that will pay for a subscription to access the same scientific articles, wouldn't that be the best solution?

I see people commenting that just because of this release, universities won't cancel their subscriptions to the journals. Well, that would be great - let them keep paying, while the content also gets out for free.

This is like the trend where you can pay what you want for stuff, or nothing. I wonder if that model would apply to scientific research - pay what you want for the paper, or nothing - but if you want to support that research.. hopefully people would still pay.

Just thinking out loud... probably already been thought of or wouldn't work (or I'm just self-defeatist). :)


Let me give you an unpopular answer -

- outside the top 500 scientists across all fields combined, no one produces quality papers.

- the manuscripts received by EICs are garbage. EICs themselves rarely touch anything, or edit anything.

- "peer review" process in all non-the-most-upper-tier publication is a joke.

- copy-editing and proofreading is non-existent.

About 15 years ago a lot of journals that are now owned by the major publishers were society publications. This means that societies owned the title, owned the copyright and owned the process. Societies could not find people capable and willing to run a journal for $5k or so a paper including distribution and printing. So the societies went to the likes of Springer, Informa, T&F, Elsevier etc. Those said "Ok, sure. $2k per paper and we produce and distribute it or $5k per paper and we do everything". That was too much for societies. Instead the societies said said "Hey, what if we sold you the rights to the journal? Would you in that case do all of it for free, and let our EIC suck his or her thumb while still being the man/woman on the title page?" "Of course" said the publishers.

And so now we are here. Production costs did not go down. Distribution costs did not go down. In fact, distribution costs increased, because before, the journals were printed by small printers in the US, as the volume of most journals is microscopic (later, printing went to China in the same microscopic quantities). Now, however, journals also need to be distributed in all kinds of wacky XML formats at all stages of production, which needs to be coded, in most cases by hand.

So unless every Joe, Jack, and Jill the scientist wants to go back into the publishing world, nothing is going to change.

[Source: pillow talk]


> outside top 500 scientists across all the fields combined no one produces quality papers

Just to make sure I read this correctly: Are you saying that, when judging by paper quality, there are only 500 good scientists on Earth?


Only about 500 produce readable manuscripts to start with. Publishers know them by name, because a clean manuscript changes the production schedule that much.


That sounds absurd. What are your sources?


Pillow talk with someone who ran production for hundreds of STEM journals for two of the publishers


A root cause there would be the opportunity cost the scientist/society calculates for publishing themselves vs. having a publisher do it.

The equation today isn't the same - with open source / community effort, I suspect the cost/effort of publication today doesn't even come close.

I was just randomly looking at what it would cost (rough estimate) for 25TB of S3 storage - 2,000,000 put requests a month, 2,000,000 get requests, 25TB bandwidth OUT and 25TB bandwidth IN - and it's less than $3,500 per month. So the cost of hosting the published item seems actually quite reasonable... reducing the cost of the rest of the "publication" plumbing is a problem that technology also helps address.
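
Just to sanity-check that number, here's a back-of-the-envelope sketch; the prices are assumptions based on 2017-era S3 list prices for us-east-1 and will drift, so treat it as an illustration rather than a quote:

    # Rough S3 monthly cost estimate for hosting a 25TB paper archive.
    # All prices below are assumed 2017-era us-east-1 list prices.
    STORAGE_GB = 25_000          # ~25 TB stored
    EGRESS_GB = 25_000           # ~25 TB out per month (ingress is free)
    PUTS, GETS = 2_000_000, 2_000_000

    storage = STORAGE_GB * 0.023              # $/GB-month, standard storage
    puts = PUTS / 1_000 * 0.005               # $ per 1,000 PUT requests
    gets = GETS / 10_000 * 0.004              # $ per 10,000 GET requests
    # Simplified tiered egress: first 10 TB at $0.09/GB, the rest at $0.085/GB
    egress = 10_000 * 0.09 + (EGRESS_GB - 10_000) * 0.085

    print(f"~${storage + puts + gets + egress:,.0f}/month")  # ~$2,761

which comes in under the $3,500 figure above.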


That's not the cost. The cost is to take a garbage paper with broken figures and broken text, in barely parseable English, written by someone who does not believe in citations or attributions and does not follow Chicago, and turn it into a paper that can actually be distributed.

Late addition: The cost of distribution of a scientific journal in electronic format is minuscule - put it on a blog. Medium.com is free.

Another addition: I have seen a few of the author contracts. All of the ones that I saw said the author could publish his original manuscript on the web for free. The author just could not publish the result of the publisher's processing of the manuscript. Someone should ask the authors why they aren't publishing their original submissions.


> Late addition: The cost of distribution of a scientific journal in electronic format is minuscule

"The annual budget for arXiv is approximately $826,000 for 2013 to 2017, funded jointly by Cornell University Library, the Simons Foundation (in both gift and challenge grant forms) and annual fee income from member institutions."

https://en.wikipedia.org/wiki/ArXiv

> Medium.com is free.

Until it's not, or until, as a commercial enterprise, Medium decides to completely pivot, or decides a controversial paper is too much bad publicity. If I publish to Medium, will future researchers be able to find my article in 10 years? In 50? Also, I don't see how you've taken peer review into account here.


Agreed that the best publishers act as both a filter and a way to improve the production quality of papers, and that this is a true cost. I'd hope they could continue to do what they do and still get paid by their subscribers. Ideally, the works could also be published for free, and those who value the publishers' efforts would reward them by subscribing to the paid version of their publications.

That said, Wikipedia mostly works. Between professional editors and the community, decent knowledge is categorized, edited/peer-reviewed and interlinked. Why the same cannot be done with academic papers is beyond me, other than to assume people simply don't want to change the status quo.

I also agree with your last point - so long as authors can contractually publish their own work, they should, and the Internet should make it as easy as humanly possible. I've published a paper to an academic journal that had that clause, and then put it online on my own site as well (until I took my site down).

Random thing - I do think the standards of what constitutes knowledge matter, but boy do we hold them high. Knowledge comes in many forms, and expecting all of them to be polished perfection is perhaps too lofty a goal. Knowledge is always a work in progress. Discerning between fact and fiction is the same task whether you have a polished paper or a garbage one; it's just that one is much easier to consume and carries more inherent trust from the reputation of the publisher, author, citations, etc.


>outside top 500 scientists across all the fields combined no one produces quality papers.

This is an exceptionally incorrect statement to make. The other points are fine, which only makes it more baffling you would make this point.


I'd rather see that money put into actual research than into paying for access to free things. And approximately none of those funds go to pay for research.


Agreed.. is that like saying that researchers should do a "pay what you want for my paper" online and self publish/host the knowledge? Maybe somebody just needs to provide a standard "pay what you want" feature that academics can embed easily into their own site, and then self-publish / host.


Do you think pay to cite would be a good model?


No, not at all. Nor would models like "pay into a pool and get paid based on citations", either. That'd create perverse incentives, not least of which more incentive to push for citations to your own works when doing review.


Sure, but if you're paid based on the number of people citing your work doesn't that create the right incentives? People making more money are precisely those providing valuable information/methods to other researchers.

Sure, there's an ethics problem during reviews to push additional citations for people in your university or something but that's not too different from the current situation with grant applications, just more direct.


Hmm, that's interesting - I think maybe that's just like licensing, right? Except in this case, the licensor is the researcher/author and the licensee is the author citing the source?

Combining the two thoughts, "suggested payment to cite" with a "pay what you want" model could be interesting too.

I suppose this would be a direct transfer of value from those who want to build knowledge on other people's work, to those who did the work. As long as it was not unduly expensive to do so, maybe - after all, the saying goes "paying my dues". I would think you wouldn't want to inhibit or slow down people from building on knowledge either (which is what things like Sci-Hub seem to be trying to avoid).

I wrote a few published papers some time ago, and I was told that without citations, a paper would be rejected as academic research should always be based on enhancements or references to prior work. So it seems kind of built in to the mechanism that citations are the requirement to build knowledge on top of when creating such works.

It would be interesting to see how a model like this could work. Since the value created by the researcher is essentially part of a web of value, it might be hard to determine who should get credit.

For example, we have 3 existing papers, A, B and C, and one new paper, D. Paper A cites paper B, which cites paper C - and all of them are additive to our knowledge base. Therefore, as the author of paper D, if we were to pay to cite paper A, should that also include some payment to the authors of paper B and C too?
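
To make that concrete, here's a toy sketch of the question, assuming (purely for illustration) that each cited author keeps a fixed share of every dollar and passes the remainder evenly down the citation chain:

    # Toy pay-to-cite model: D pays to cite A; A cites B; B cites C.
    # The 'keep' ratio is an arbitrary assumption for illustration.
    citations = {"D": ["A"], "A": ["B"], "B": ["C"], "C": []}

    def distribute(paper, amount, keep=0.7, payouts=None):
        payouts = payouts if payouts is not None else {}
        cited = citations[paper]
        if not cited or amount < 0.01:       # leaf paper, or share too small
            payouts[paper] = payouts.get(paper, 0) + amount
            return payouts
        payouts[paper] = payouts.get(paper, 0) + amount * keep
        for c in cited:                      # pass the rest downstream
            distribute(c, amount * (1 - keep) / len(cited), keep, payouts)
        return payouts

    result = distribute("A", 10.0)           # D pays $10 to cite A
    print({k: round(v, 2) for k, v in result.items()})  # {'A': 7.0, 'B': 2.1, 'C': 0.9}

The attribution question is then just a policy choice about the "keep" ratio and how deep the payments propagate.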

I suppose one addition you could add to this - payment to only LIVING researcher(s)/author(s). If the researcher/author is deceased, then the payment stops. In this regard, it sounds like royalty/licensing fees.


In a way, researchers paying to read research is an absurd inflation.


Therefore, should the results of all research be made free and public? I was hoping we could allow people to pay if they wanted but also get it for free, and have both be legitimate.


Big and famous universities have deals, but the rest don't, which is why Sci-Hub exists in the first place. Maybe publishers should up their game and provide near-free access to any PhD student or higher, plus better uptime (people laughed at the fact that the Sci-Hub + LibGen hack was more reliable than SpringerLink).

Students would pay a small fee, if that doesn't impede their research too much.

Also, I was thinking of time-based value: papers/journals older than X years, or that have already earned a fair amount in fees, could fall into free access. That could make students looking for alternative or forgotten ideas more motivated.


I'm all for Sci-Hub disrupting the dominance of RELX Group (a.k.a. Elsevier) and other for-profit publishers that make such a big profit off the backs of researchers (who write and edit for free) and grant-making organizations (who fund those researchers).

But it's unfortunate that Sci-Hub is also disrupting non-profit scholarly associations that cover their own budgets through journal subscriptions. In these cases, the fact that libraries and readers have to pay for access to an article is somewhat balanced out by the fact that those fees are going to pay for staff, conferences, and the other worthwhile activities of the non-profit associations.


So is Sci-Hub like Oink[1] for scientific papers?

EDIT: For those not familiar, Oink was a torrenting site but what distinguished it from the tons of other sites was how highly curated it was. High quality audio, proper grouping and genres, and best of all you could request anything that was missing and the community would magically add it.

[1]: https://en.wikipedia.org/wiki/Oink%27s_Pink_Palace


Kind of the opposite, sci-hub isn't really curated at all. Its goal is to have a PDF for everything that has a DOI, more or less. And the primary way of retrieving papers is to already know the exact paper you're looking for, found via some other search method like Google Scholar. It's not at all focused on categorizing or browsing papers, just slurping all of them into a big repository.


> Kind of the opposite, sci-hub isn't really curated at all. Its goal is to have a PDF for everything that has a DOI, more or less.

Side-stepping the politics and legality for a moment, I find it interesting how this is a very natural example of the fault-tolerant design of networks.

A 'DOI' is a resource locator. There's a 'fault' in the network in that some people cannot locate the resource when providing the DOI. Sci-hub has sprung up to route around this fault, and provide people with the resource.

Practically, there are real people making decisions who are involved in this process, but on an abstract level it feels like this is somehow inevitable? It feels like an emergent behavior from any sufficiently advanced network, and any attempt to stop it seems futile.


I guess another important part is that it's the universities' connection that scihub 'uploaders' tunnel through that is key - universities have special network access that lets on-site users retrieve anything behind a DOI for free, and the 'uploaders' then relay it to outsiders. That's a difference between papers and something like 'scihub for any pdf ebook via ISBN' - there's no special network that can get all ebooks for free just via ISBN, except maybe the Library of Congress ;)


> I guess another important part is that it's the universities' connection that scihub 'uploaders' tunnel through that are key

Does this happen in realtime? I read that sci-hub sometimes uses credentials from legitimate users, but in that case they would be easily identified.

So, perhaps someone with access submits papers on request? And if so, what about watermarking?


> And if so, what about watermarking?

Years ago I recommended to Sci-Hub that they should use https://github.com/kanzure/pdfparanoia but they weren't too concerned.


Plenty of the papers I've gotten from scihub are watermarked. The "downloaded by [IP] at [timestamp]" footer could probably be stripped, but they don't appear to do so, or at least not consistently.
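
For what it's worth, those footers are easy to spot mechanically. A minimal detection sketch (assuming pypdf is installed; this is just an illustration, not what Sci-Hub or pdfparanoia actually do):

    import re
    from pypdf import PdfReader   # assumed dependency: pip install pypdf

    # Matches footers like "Downloaded by 203.0.113.7 on 2017-07-27 ..."
    WATERMARK = re.compile(r"downloaded (by|from).*\d{1,3}(\.\d{1,3}){3}", re.IGNORECASE)

    def watermarked_pages(path):
        """Return indices of pages whose text looks like an access watermark."""
        reader = PdfReader(path)
        return [i for i, page in enumerate(reader.pages)
                if WATERMARK.search(page.extract_text() or "")]

    print(watermarked_pages("paper.pdf"))   # e.g. [0, 1, 2]

Actually stripping them is harder, since the text has to be removed from the page content itself, which is roughly what pdfparanoia tries to do.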


A lot of papers get ingested into sci-hub while they're still in the preprint/early-access stage. Sci-hub doesn't appear to reingest papers after the final version is published. It's one of my very few gripes about the system. A more quality-focused system would probably handle this edge case.


This seems more like a versioning issue. Can't papers that pass review get a new DOI? I'm thinking something like semver for scientific papers or scientific data, more generally. DOIs even look like semver versions, so why not add more space for the different stages of publishing?


I'm talking about papers like this:

http://sci-hub.cc/10.1021/acs.jpcc.7b06335

It's already passed peer review, but the final version will be tidied up with additional editing and will have e.g. the line-by-line numbering stripped out. I see sci-hub versions of papers from months ago, or older, that never update to the final edition. The content doesn't differ much but the final version may have better-placed figures, corrected typos, and other minor improvements.


If different versions of a paper have different DOIs, then the DOI is broken and should be replaced by something better.

I consider it a heresy that some sites (fuck ResearchGate, Academia.edu) consider it fine to alter the PDFs. Document integrity should be the holy grail. Files should be final and never get touched. If they are touched, they are different files.


Wait, aren't those two statements opposed to each other?

Different versions of a paper should have different DOIs. A DOI is fundamentally quite abstract, and there is no reason that an object-appropriate versioning scheme can't be laid on top. Figshare, for example, uses a versioning system where there is a single DOI referring to the overall artefact (e.g. 10.6084/m9.figshare.2066037) and a versioned-by-suffix DOI referring to a specific version (e.g. 10.6084/m9.figshare.2066037.v16). That works really well IMO.
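
A minimal sketch of that suffix convention, using the Figshare-style ".vN" pattern above (the helper name is made up for illustration):

    import re

    # "10.6084/m9.figshare.2066037.v16" -> ("10.6084/m9.figshare.2066037", 16)
    VERSION_SUFFIX = re.compile(r"^(?P<base>.+?)\.v(?P<version>\d+)$")

    def split_versioned_doi(doi):
        m = VERSION_SUFFIX.match(doi)
        if m:
            return m.group("base"), int(m.group("version"))
        return doi, None   # unversioned DOI refers to the artefact as a whole

    print(split_versioned_doi("10.6084/m9.figshare.2066037.v16"))
    print(split_versioned_doi("10.6084/m9.figshare.2066037"))

The nice property is that the unversioned DOI stays stable while each revision remains individually citable.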


ArXiv itself does this. See the 'cite as' section of this one: https://arxiv.org/abs/1707.08475, which has the note

arXiv:1707.08475 [stat.ML] (or arXiv:1707.08475v1 [stat.ML] for this version)


Oops, yes of course. That's what I meant. Typo. :)


The publisher is using DOIs incorrectly.


We all miss oink.

Sci-Hub is not like it, but it's more curated than TPB. You can enter a paper's URL or DOI and it'll take you right to it.

I think there is curation that could be built on top of it, especially if the concept of "review" could be moved to an online system.


The death of for-profit scientific journal companies will be a beautiful thing for the world. It's really rare to see something that is so purely valueless. This industry is sort of unicornlike - they've managed to extract rents in an area where they add literally zero value. It's truly an amazing thing, and it will be even more amazing to watch it die.


How does this doom subscription journals? I mean, it would be nice, but realistically it just means they move to exploit the university subscriptions, since professors can't admit to using illegally obtained copies. They can further exploit the authors, since many journals require payment from the author for submission, and some charge in the hundreds. One might say "just publish to a different journal", but it's not that easy. Because of the heavy reliance on Impact Factor in scientific publishing, it is the journals with high impact factors that authors will try to publish in, regardless of whether or not they are being pirated.

This is sad to say, but in reality I think this isn't going to massively impact things for the publishers. Academia at its core is where the problem lies. Sure paid subscriptions are a big part of things, but it's the stuff most don't realize (the authorship fees and institution sub fees) that give the publishers power.


A similar case is the continued use of TI calculators in schools.

The hardware is ancient. It's really overpriced. https://xkcd.com/768/

But they're still selling lots! Why? They're approved by the examination boards, and have an effective monopoly. (oligopoly if you include HP and Casio, but most schools choose one brand and require all students to fall in line).

The academic market is not remotely capitalist, so even though there are cheaper & better options available that people will use in everyday life, the "official" results will still use the officially-sanctioned products.


It dooms subscription journals because it breaks their leverage.

Right now, publishers offer take it or leave it bundles of hundreds of journals at prices the publisher set. If a library doesn't want to pay, the library/university system loses access to everything. And many people at the university can't do their work without such access.

SciHub existing allows negotiators for the universities and libraries to shrug and say whatever when Elsevier threatens to cut access to every Elsevier journal.


Even if this were the case, it's still very easy to control the market because of the authorship fees coupled with the drive toward publishing in subscription-model journals (thanks again to Impact Factor). Maybe your advisor took care of some of those costs, but most journals do have a significant fee for submission alone, not to mention re-submission fees when a review requires significant changes. Sci-Hub can't change that, so it still doesn't doom them.

Not to mention that Elsevier, for example, WILL be cracking down on piracy, especially after the judgment the US courts granted them against Sci-Hub. Sci-Hub hasn't doomed anything, and the publishers still control academia. You should be mad about this, and you should blame academia itself. Push for open journals like PLoS One when possible, and try to convince people to move off that god-awful Impact Factor metric, and you might inflict damage on the big publishers.


I've met the author of the study, Daniel Himmelstein, who is quite passionate about making information free. Projects in his github account (https://github.com/dhimmel) tend to use a CC0 license. Some of his work involves aggregation of data (e.g., https://github.com/dhimmel/hetionet) that is encumbered and he has put a lot of effort into making it as free as possible. His project carefully documents the license for each data point and he took the time to ask copyright holders that do not provide an explicit license to do so.


The Noah's Pirate Ark that will save all of humanity's knowledge from unreliable publishers.


Looks like Aaron Swartz's vision for the free, collective ownership of mankind's scientific knowledge is well on its way. I wish he were still alive to see Sci-Hub in action.


I don't think that sci-hub is going to kill off institutional journal subscriptions in the developed world. It's similar to how developed-country universities didn't stop buying licensed software and start passing around cracked versions to their faculty and students. Journal revenue isn't going to plummet like CD sales after Napster, because it's not individuals doing most of the purchasing in the first place.

Individuals and institutions in poor countries may well turn to sci-hub. I certainly have. But I would venture that not much of the journals' revenue came from individuals or poor institutions in the first place. I didn't pay to read paywalled papers before sci-hub either; I got them via authors' sites or personal contacts, or just didn't get to read them at all.


Tend to agree. There might even be an argument that SciHub provides something of a pressure release valve that will allow paywalls to continue. Similar to how pirated Windows copies can be said to have helped solidify Windows' market share. Or even, in my own mind, how pirated Adobe applications helped people learn the tools and go on to work at companies with legitimate licenses.

Elsevier, et al, can now make a legitimate claim that they are not restricting access to those who cannot afford a sub. But their tax on academia will continue.


The academic world has missed some decades of advancement of communication. In a world where all published science is open for meta-processing, the burden of validating science would shift to search engines. There would be search engines competing with scientific-SEO of course, but in the medium-run this would improve scientific writing, and possibly speed up science in general. In the end there will always be some private actors doing the work of "ranking" scientists. Academics are hanging on to the current peer-review journals precisely because they don't want to give that power to other actors.


Does Sci-Hub actually have all the papers or are they just retrieving them on-demand?

Publishers are tracking mass downloads (see the Aaron Swartz case) so given some of the very obscure papers I've retrieved from Sci-Hub I assume it's unlikely they downloaded them beforehand. My go-to assumption for how it works is that a bunch of people have donated access to their university network access and Sci-Hub is just a load-balancing / cache layer.


Yes, it lazy-loads papers using credentials from institutional subscribers and then stores them for fast retrieval. When you're the first person to request a paper you'll see an animation of a biphenyl molecule rotating while the site fetches the paper from an institution with access. Sometimes, there is no corresponding institution with access currently hooked up to sci-hub and the retrieval process fails. But I've only seen that happen a few times.
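
Conceptually that's just a cache keyed by DOI with fetch-on-miss. A minimal sketch of the idea (not Sci-Hub's actual code; the fetcher and storage layout here are invented for illustration):

    import os

    CACHE_DIR = "papers"   # hypothetical local store, keyed by DOI

    def get_paper(doi, fetch_from_institution):
        """Return cached PDF bytes for a DOI, fetching and storing on a miss."""
        path = os.path.join(CACHE_DIR, doi.replace("/", "_") + ".pdf")
        if os.path.exists(path):              # cache hit: serve immediately
            with open(path, "rb") as f:
                return f.read()
        pdf = fetch_from_institution(doi)     # miss: proxy through donated access
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:           # store so the next request is instant
            f.write(pdf)
        return pdf

The first requester eats the slow institutional fetch (the spinning-molecule animation), and everyone after gets the cached copy.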


I'll be that guy who will gladly eat some downvotes for this apparently unpopular opinion:

"Science" and "subscription" (or any monetary incentive) don't compute in a single sentence. Aren't scientists funded by governments and/or corporations? Why should anyone pay them a royalty above that?

It's a legit question, not trolling; please don't mistake my slightly angry tone for disparagement.


Can someone who's familiar with this research paper subscription model that is threatened a la Elsevier explain to me how we got here?

I am curious: did universities at one time publish these independently, and were they more accessible to the public? When did the practice of restricting access to papers via subscriptions begin?


This long article about Robert Maxwell does a great job answering your question: https://www.theguardian.com/science/2017/jun/27/profitable-b...


This is a really fascinating read. Thanks! Cheers.


I work for a scholarly publisher and I'd be very interested in hearing about what—aside from cost—would cause you to go to Sci-Hub for a paper?

Is it reading experience? Site performance? Difficulty in navigating publishers' sites?

Are there any good experiences you can point to? I'm really interested in making this better.


They should expand into engineering: I don't see any IEEE or ISO standards in there, for instance.


Now if we could get the government version of this...


Thanks! That reminds me I should donate to Sci-Hub!


This will be a catalyst for open-access


Yeah, but open access is just "stuff the cost of publishing into the grant application", and because journals make money for each work published, it encourages rubber-stamp reviewing and publishing noise.

Open access is probably a net good, but it just shifts the costs around.


There's stuff like the fair open access manifesto: https://fairoa.org/ .

I suppose that's partly a backlash against predatory open-access journals that publish any rubbish as long as (significant) fees are paid.


If only. I'm convinced they will find a way to shut her down.


I was surprised that it's still considered rude to link to sci-hub: https://news.ycombinator.com/item?id=14714577#14715252

Anyone know if this is a typical sentiment? I'm just curious if it's true that many researchers are offended by this movement, and what the reasons are.

I firmly believe that there are always two sides to any topic, so we should explore the flipside. What are some arguments against blatantly opening up access to paywalled articles?


If someone were to post a link to a pirated version of one of my academic publications in a Hacker News discussion with me, I would be relieved that I didn't have to do it myself to let people see it.


I was a small-time academic once, and I shared all my stuff by linking to Sci-Hub.

My university paid 500-2000€ per article in extortion errrrr I mean publishing fees. That should be more than enough.


Further down, the author in question says he supports Sci-Hub and is happy to see it linked so that more people can read his work. He was just surprised that someone would post a link so blatantly to a pirate site. On the other hand, he says he knows researchers who would be upset by this, but doesn't give their rationale.


Yeah, I should've clarified that I was curious about the people he mentioned. Researchers contribute some of the most valuable comments to HN, so mainly I wanted to avoid making them feel unwelcome. If linking to sci-hub ends up shooing them off HN, it's not worth the cost.

It seems like the sentiment is in the other direction, though. Maybe most researchers don't care.


Well, this is linking to IEEE. I have mixed feelings about this, since IEEE actually does a good job with publishing. They have far better reviews than most other journals, and members get free access to most journals anyway.


Plus they tend to make old articles free pretty quickly.


It's not just the publishing industry that is the problem. It is merely a symptom of the greater malaise in higher education as a whole.

The focus is on degrees, not on true learning. So much of what occurs in universities is total waste. But people put up with it to get the paper. As long as people keep blindly giving absurd sums of money to get the paper, these expensive publications will last. The answer is for people to wake up and value learning over a diploma. When that happens, issues like this will finally go away. Heck, as a bunch of people have pointed out, many of these papers aren't even for real learning. They are worded in such a way as to sound smart to their peers, but unintelligible to the public.


Yes, and it's no good if the person you're referring to is smart enough to read between the lines.


and it's better than all of their websites!


Has anybody mirrored (or attempted to mirror) libgen?



