(Yes I read the paper and I’ve seen all the other factors they considered, but those factors simply aren’t as good predictors/indicators of interest as an organic download number from any channel, whether it’s ScienceDirect or SciHub.)
Edit: To be clear, papers in their supposedly "limited access" (if we want to draw the same conclusion) control group appear to be available on Sci-Hub, i.e. as open access as their Sci-Hub group. Details in this comment: https://news.ycombinator.com/item?id=23710992
As to why my lab might prefer papers from big-name journals: I think 1) there's admittedly our own bias of wanting to cite big papers vs. small papers, 2) if we're doing a basic literature search, Google will also surface results that are already more popular. Really, at the moment, I struggle to see how our behavior about which papers we cite changes with or without Sci-Hub.
One thing I would do a lot with Google Scholar is find a highly-cited, somewhat recent (past decade or so) paper and then click into its citations to find the very most recent related work. IIRC this interface was chronological so there was no big-journal bias and I found a lot of stuff in random journals. Low-friction quality-filtering through SciHub was very helpful for me in this process.
I worked in a theoretical field, so it's actually possible to assess the quality of a paper just by reading it. If you're doing something experimental you may have to lean on the big-journal filter a little bit more.
Under what conditions do you see that changing? You yourself say how useful it is, but discount the effect it has on the citation graph. Could we say the same thing about arXiv?
This is exactly why research needs to be open and freely available. The esoteric, rarely cited papers are locked behind paywalls and away from search engines. And in doing so, their impact fades over time as the flock centers around a single vein of focus. The future should look like a net not a river.
I am not an academic and I read papers for their metalessons in areas that interest me. But without Sci-Hub, lots of research would not be accessible outside of its niche, esp if it was unlucky enough to not be in the right journal at the right time.
One hilarious pattern I see is when a bigshot in a field publishes a new paper and there is a race by some nobody to cite it. Basically the academic publishing version of "first post".
> we found that articles downloaded from Sci-hub were cited 1.72 times more than papers not downloaded from Sci-hub and that the number of downloads from Sci-hub was a robust predictor of future citations
> The results suggest that limited access to publications may limit some scientific research from achieving its full impact.
My uncharitable read is they slapped an interesting yet unsupported conclusion onto a rather boring observation.
Edit: In particular, it seems to me that papers in their control group are available on SciHub (correct me if I’m wrong), just nobody bothered to download them in the time window they analyzed. Which directly contradicts the “limited access” part of their claim.
I was not sure about this, but if this is the case then their conclusion seems to be unrelated to their findings.
I just looked into the actual data set, and analyzed a few papers from it, e.g. the first paper in scopus_nature.csv. They're definitely available on Sci-Hub, at least at the moment, and there's no reason to believe they weren't available back then.
Also, here's what they say about the dataset, emphasis mine:
> As a quality control, we performed a random sampling of all the articles retrieved, excluding those already present in the first data set. As in this second data set, the number of Sci-hub downloads is precisely equal to zero, we regard it as a control group (nC = 4,015) from which we are going to estimate comparisons for our experimental group (nE = 4,646).
Downloads equal to zero != not available.
Of course, that is hardly a surprising result. I think most authors have read most papers they cite (although things like reviewer/editor shenanigans could make this less than 100%).
Of course, such a pedestrian result does not grab attention.
I'm wondering about this too. What could be the possible other reason for the papers not being downloaded? Boring titles? Preprints easily available? Open access?
> and there's no reason to believe they weren't available back then.
Is it possible that the articles really were not available, given that they used data from September 2015 to February 2016, an early era of Sci-Hub?
That's within the realm of possibilities but sure doesn't sound like the case the way they described their methodology. That would have required a catalog of available papers on Sci-Hub too, but the dataset they used only contained download logs, so they couldn't have known whether a zero-download paper was available or not (unless I misunderstood what the first dataset is about).
This was still a garbage attempt to 'prove' this conclusion. And I don't have any faith the authors measured the true magnitude of these effects.
"Papers with more citations are downloaded more frequently on Sci-Hub" makes more sense to me than "Papers downloaded frequently from Sci-Hub get more citations".
The first data set contained all articles (nE = 4,646) that were downloaded from Sci-hub between September 2015 and February 2016 [24, 25]. The second data set was extracted from the Scopus database and contained all articles published in the selected journals within the same time period. From this data set, we excluded articles already present in the first data set.
Their methods lend to circular reasoning since the conclusion could also be "papers of high interest (cited more often) are more likely to be downloaded from scihub"
In this case, there's something that actually can help establish the direction of the causality arrow: time. If the citations precede the downloads, then sure, it's that more well-known papers get downloaded more. This paper shows that the downloads precede citations though:
"number of downloads from Sci-hub was a robust predictor of future citations"
While it's not a watertight argument that downloads lead to citations, the theory is at least plausible.
A randomized controlled trial would be ideal. Second-best would be something that, say, suddenly caused some papers to be available or not available on Sci-Hub, or a "plausibly exogenous" (to y) cutoff around which papers would be either available on Sci-Hub or not.
This paper offers nothing. I think we can say the relationship is "not identified." Doesn't mean the conclusion is wrong, just that this paper doesn't produce meaningful evidence on the question.
For more see Angrist and Krueger (2001) https://economics.mit.edu/files/18
(In Pearl/DAG language, the underlying quality or appeal of the paper is, as others have noted, an obvious confound/backdoor path).
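The backdoor-path concern is easy to demonstrate with a toy simulation (a sketch, not a model of the actual data): give every paper a hypothetical latent "quality", let both downloads and citations depend on it, and include no causal link between downloads and citations at all. The two still come out strongly correlated.

```python
import random

random.seed(0)

# Toy model: a latent "quality" drives both downloads and citations.
# There is deliberately NO edge from downloads to citations.
N = 10_000
quality = [random.gauss(0, 1) for _ in range(N)]
downloads = [q + random.gauss(0, 1) for q in quality]
citations = [q + random.gauss(0, 1) for q in quality]

def corr(xs, ys):
    """Pearson correlation, computed by hand to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

# Correlation comes out around 0.5 despite a causal effect of exactly
# zero -- which is what a backdoor path through "quality" produces.
print(corr(downloads, citations))
```

A regression of citations on downloads in this toy world would likewise "find" a robust positive coefficient, which is why identification, not model fit, is the issue.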
I feel like we need a new way to do peer review, that is more real time - so that papers can be upvoted/downvoted, flaws can be pointed out - and we have some way to assess the truthiness of what the paper is claiming. Your comment is a step in this direction (but we're not capturing the wisdom of the crowds quantitatively around papers today - arxiv is great, but 1990's era web design).
I'm working with a team that is trying to build better tools for science at https://www.researchhub.com/about
Rethinking peer review is one item on the roadmap.
This one managed to get 700 upvotes on HN and still be meaningless.
> This one managed to get 700 upvotes on HN and still be meaningless.
Agreed. This is why I don't think voting with online parameters, like downvoting, will ever be a viable choice; because aside from things like Sybil attacks, the reality is that social media has normalized and acutely optimized groupthink and the tribalism that often follows. For those of us that don't engage in it, it's incredibly alarming and disconcerting how many succumb to its practices in real life.
What I will propose as an alternative is something that was first tried with Andreas Antonopoulos's first book 'Mastering Bitcoin', wherein each chapter is individually uploaded to GitHub and follows the same process that OSS does: commits and corrections are submitted to alter the book's maintained version, which is never really 'finished' and can be amended and annotated as needed to suit the updates that follow (e.g. SegWit or the Lightning Network), or in this case perhaps a replicated experiment that provides a larger sample size. Replication in academic papers is nearly non-existent, despite the notion that peer review (especially in STEM) was to be the critical component that made them invaluable.
This could also effectively undo the walled-garden extortionist model that afflicts academia's peer-review system and foster more interactive and international work without requiring presence at a university. Samples or specimens may need to be more tightly controlled and transported, but this could effectively be done overnight, if the will exists, for little to no money in anything other than training.
If the International University model is undergoing mass disruption and shifting toward a mainly online learning platform, this too could help mitigate these glaring problems and have a Global Repository for ongoing research.
It's almost stupid not to do it at this point given the MANY pitfalls that the current model forces down our throats.
I'm currently doing research, and one factor that always makes me go nuts is the actual time it takes to find an article and then download it. I can say that I have used, or rather abused, Sci-Hub to find related papers.
Having something like a "github" style, plus the option to visually view changes and updates, see what the connections are, who did what, and where something is coming from, roughly the way the site connectedpapers does, would also be very powerful.
Lastly, my only concern here is that if it's an ever-living document, how can you assess whether the document as a whole is of good quality at any given time?
So my proposal falls under the "feature request" type: even as changes are made, having the ability to read the "metadata", i.e. a quality check done by, let's say, AI (plagiarism detection, etc.), plus feedback from online reviewers, would add up to a good quality check.
I think this is a good idea for books, and I wish more scholars would contribute to Wikipedia and other peer-editable overviews.
I'm not sure quite how it would work for new results though, for which it is still not quite clear how they fit into the current knowledge, or if they are worthwhile at all.
People also tend to feel really strongly about their own original ideas, and may try to push them forward when they should rather be forgotten.
I don't know how to prevent something like that, other than having a BDFL or individual publishing (as in the current system)
On the other hand, the top/most salient comments on it are all pretty critical of its causal claims.
How to distinguish between real flaws or just a good theory that is not yet taken seriously ?
Thanks for writing. I've taken a quick look at your paper and its OSF page, and you all clearly know what you're doing and have some great work here. I am not your target audience (I am not an academic), but if I were peer reviewing this, I'd suggest one of two avenues.
1) As Angrist and Krueger put it, "In our view, good instruments often come from detailed knowledge of the economic mechanism and institutions determining the regressor of interest...progress comes from detailed institutional knowledge and the careful investigation and quantification of the forces at work in a particular setting." With this in mind, is it possible to contact the maintainer of Sci-Hub, email her your paper, and ask if she can think of any plausibly exogenous sources of variation in papers' availability on Sci-Hub? Some possibilities that come to mind for me:
* In my own experience, papers without DOIs are often harder to find on Sci-Hub. Are there closed-access journals that don't mint DOIs on which you could use propensity score matching to create something of an apples-to-apples comparison?
* Is there an identification strategy somewhere in journals with optional APCs for which only some of the papers are open access?
* Were there ever any sci-hub outages that lasted periods of many months, and, however long down the line, did the citations of paywalled articles decline relative to their open access peers?
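For the no-DOI idea in the first bullet, here is a bare-bones sketch of the matching step: greedy 1:1 nearest-neighbour matching on a single covariate. All numbers are hypothetical, and real propensity score matching would first model the treatment probability from several covariates rather than match on one raw value.

```python
# Treated = papers available on Sci-Hub; control = papers that are not
# (e.g. from journals that don't mint DOIs). Tuples are hypothetical
# (journal_impact_factor, citation_count) pairs.
papers_treated = [(2.1, 30), (5.0, 80), (9.5, 150)]
papers_control = [(2.0, 25), (4.8, 60), (9.9, 120), (1.0, 5)]

def match(treated, controls):
    """Greedy 1:1 nearest-neighbour match on the covariate."""
    pairs = []
    pool = list(controls)
    for cov_t, outcome_t in treated:
        best = min(pool, key=lambda c: abs(c[0] - cov_t))
        pool.remove(best)  # match without replacement
        pairs.append((outcome_t, best[1]))
    return pairs

# Mean within-pair citation difference, interpretable as the effect of
# availability only under the strong assumption that the covariate
# captures all confounding.
pairs = match(papers_treated, papers_control)
att = sum(t - c for t, c in pairs) / len(pairs)
print(att)
```

The whole exercise stands or falls on whether no-DOI journals are otherwise comparable to DOI-minting ones, which is exactly the kind of institutional knowledge Angrist and Krueger argue for.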
2) If none of this is possible, I suggest removing all causal language from your article and sticking to a predictive model. Observational research is appropriate for exploring and identifying causal hypotheses rather than confirming them; no statistical techniques that attempt to control for unobserved population heterogeneity will persuade me otherwise (I am admittedly an extremist on this position, but my own views hew closely to those of Gerber, Green and Kaplan (2003): http://www.donaldgreen.com/wp-content/uploads/2015/09/Gerber...). The Lewbel (2012) paper you cite looks interesting but, to paraphrase Gerber et al., statistical techniques can't account for nonstatistical sources of uncertainty (the inherent unknowability of whether you've specified the 'correct' model, in the absence of randomization or exogenous variation, means we can't say anything about the bias of your estimation procedure).
Observational research is great! Generating hypotheses is as important as confirming them. It's just that some research designs license causal inference, and some do not.
Best of luck.
> a way for assessing the impact of x (downloads on sci-hub) on y (citations) which _identifies_ a source of variation in x, that is independent of y, except through its impact on x.
There are a bunch of different ways that people express this. One is that there are no backdoor paths between x and y. Another is that X should be uncorrelated with the error term of the model.
I couldn't tell from the paper whether they considered this, or what they think the causal graph looks like.
They think they have shown causality. If only it were so easy...
Without the proper statistical analysis it just becomes a paper on how highly cited papers get downloaded from Sci-Hub which isn't very interesting.
As a scientist, the papers that I download from Sci-Hub are mostly the ones I am interested in with regards to my current research projects, and as such are quite likely to be cited. Or in other words, the work I do predicts which papers I download and which papers I cite.
There may be some causation of availability to citations, but I believe it would have been more interesting _before_ Sci-Hub. That is, I would guess that papers freely available somewhere on the internet (mostly home-pages for scientists) would have more citations than papers that are just available behind a pay-wall. For me, Sci-Hub has changed my citations from "just those I can get my hands on", to "any paper I'm interested in".
On the matter of Sci-Hub. I already have access to all the journals I commonly use via my university. Those journals almost certainly track downloads (Elsevier even has a recommender system when you download a paper). All Sci-Hub does is reduce friction for researchers who can't be bothered to use their institutional login while off campus. Admittedly that includes a lot of people, but for the people who measurably contribute to citation metrics, does Sci-Hub actually improve availability? I doubt it, at least in universities with budgets for subscriptions. Before Sci-Hub there were always routes to get papers: ask the author or ask a colleague in a neighbouring university if they have access. Or you can ask your library who may be able to get it.
My point is that generally it was a bit more work sometimes, but in the grand scheme of publishing a paper, it wasn't so bad to network to get access to things.
It's certainly enabled the general public to access research and no doubt tons of small companies are using it to avoid paying for subscription fees, but those people aren't significantly publishing papers in places where we can measure it.
All this study shows is that paper downloads from anywhere is a proxy for popularity/hype which can be a proxy for citation count. I'm sure Elsevier or Taylor or Nature could provide a similar correlation between downloads and citations. There is also a bias here - if I'm looking for papers, I'm probably going to download the most highly cited stuff first because that's a weak signal that it's a useful paper.
I'd be more interested to see, as you said, temporal analysis of before/after Sci-Hub.
Or alternatively geographical studies - does this affect lower income countries more? I imagine that this is a boon to researchers who didn't have access because their institution couldn't afford it.
So far so good.
> , without necessarily being a cause.
The logic has fallen apart. In order for the downloads not to be a cause, the same citations would have to occur in their counterfactual absence. You're claiming that anyone who downloads a paper from Sci-Hub would, in the absence of Sci-Hub, obtain that paper by other means.
Compare the currently higher-positioned thread ( https://news.ycombinator.com/item?id=23710896 ), where this paper is getting slammed for concluding that limiting the availability of a paper may prevent the research described in that paper from "achieving its full impact". On your analysis, where Sci-Hub downloads are driven by interest from scientists and that explains the entire more-downloads-means-more-citations effect, that conclusion is 100% correct.
Note the words used in the argument: "could have a common cause", "without necessarily being a cause".
The logic that fails is the one that depends on a common cause being impossible and concludes that one thing is necessarily causing the other.
(It plausibly may do so to some extent, but the paper doesn't prove that, nor does it give a good estimate of the effect.)
But they cannot be coincidentally related by a common cause! This is clearly a causality chain:
1. interest in the paper -> reading the paper
2. reading the paper -> citing the paper
as opposed to the common-cause model, which would look like:
1. interest in the paper -> reading the paper
2. interest in the paper -> citing the paper
And that intermediary position of Sci-Hub means that, in the model described, it cannot possibly fail to be a cause of citations. That would be the same argument as "one game downloaded through piracy equals one lost sale". If you don't believe that one illegal download = one lost sale, you've immediately admitted that piracy is a cause of additional gameplay.
> And that intermediary position of Sci-Hub means that, in the model described, it cannot possibly fail to be a cause of citations.
No, downloads don't automatically lead to citations. The possible scenario where every single citation came from another channel (arXiv, ScienceDirect, SpringerLink, JSTOR, what have you) is entirely consistent with the paper's findings. I for one happen to be a scientist, I don't think I ever cited a paper I downloaded from Sci-Hub, but I've certainly grabbed a few papers from other fields on Sci-Hub; those downloads did not contribute to citation stats at all.
In addition, the paper is trying to conclude "limited access to publications may limit some scientific research from achieving its full impact" by comparing papers available on Sci-Hub to, guess what, papers available on Sci-Hub (https://news.ycombinator.com/item?id=23710992). I really need some solid convincing to accept this leap in logic.
I don't necessarily disagree with the conclusion, in fact I think it's in a way kinda obvious (there should be an effect, however small it is), but the paper's analysis adds nothing to support it. I'm just not a fan of shoddy analysis, whether I like the conclusion or not (otherwise I'd better be a politician or lawyer).
> but the paper's analysis adds nothing to support it.
I don't quite agree with this. The paper demonstrates that more realized access is related to more citations. I think you can make a solid argument that more potential access is related to more realized access. That connection is enough to count the analysis as "supporting" the conclusion ("more potential access ~ more citations") at a level which is greater than zero. Support isn't proof.
There's no reason to believe the correlation between realized access and future citations isn't exactly the same or stronger for subscription platforms or paper copies in university libraries, and nothing in the paper itself suggests that researchers publishing in top-tier journals might discover and access papers exclusively through those methods rather than Sci-Hub.
I'm sure the frequency with which papers are referenced in undergraduate essays also correlates with future citations in top tier academic journals, but any paper arguing for a causal Undergraduate Essay Effect on citations in top journals would be laughed off, especially if they made claims like 'revealed the importance of undergraduate scholarship in almost doubling the number of [top tier journal] citations of articles mentioned by undergrads'
[FWIW I don't think the direct and indirect impact of papers being accessible Scihub on professional research work is literally zero, but this paper does nothing to indicate otherwise.]
Edit: your diagrams are too simplistic anyway. Papers can be read elsewhere. Citations have more causes than readership and its incremental effect will be heterogeneous.
So if your reviewer tells you that you need more citations, you use the stuff that's easiest to get.
> FUTON bias (acronym for "full text on the Net") is a tendency of scholars to cite academic journals with open access—that is, journals that make their full text available on the Internet without charge—in preference to toll-access publications.
A similar story played out in the last decade (2000 - 2010) when affordable streaming platforms did not exist. Many folks got familiar with global movies and shows through piracy, as even legal DVDs were prohibitively expensive.
Those people who are interested in me reading their work will send an arxiv link or put the PDF on researchgate or publish open access. Especially for mathematics, I need a printable PDF so that I can take notes. That makes DRMed publications impractical to use.
So if someone only links to the paid version of their article, I usually just assume that they're an arrogant prick and skip to the next paper.
There's now more good new research being published than I could ever read. Researchers need to adapt to that by reducing friction.
1. Searching for the title often gives you a PDF in the first few results already. Authors usually have the right to upload a version on their personal website or arXiv.
2. There is Google Scholar, which has links to PDFs on the right of the result (if it knows one). It is better at this than regular Google search, in my experience.
3. Manually searching on the author's website (or former website, if they moved) sometimes proves successful, although this is relatively more effort.
4. As mentioned, writing an e-mail to the authors works as well. (If you buy a paper, the authors get nothing, so there is no incentive not to share it.)
That's encouraging to hear from at least someone, since it doesn't match my experience. What I rather see in the few fields I still care about is that we're flooded with a mass of unoriginal and uninspired papers, many using an ML approach, where the purpose is clearly graduation or tenure rather than advancing the state of the art. It's happening to such a degree that even assessing the major contributions in a field and separating me-too publications from the few original and foundational works has become impossible, similar to how general web search has become pointless. I'm all for free access, but 1. major works have always been published as author's copies with free public access, and 2. I really don't see any advancement in scientific quality at all, as academic achievements are becoming just stepping stones and academic institutions career networks more than anything else.
Edit: also want to mention citeseer as my search engine of choice which seems to have improved a lot after their rewrite ten years ago (which made it useless for me)
I added some additional info.
It mandates that within 1 year of journal publication the authors are to make the article available for download on pubmed...
"Before you sign a publication agreement or similar copyright transfer agreement, make sure that the agreement allows the paper to be posted to PubMed Central (PMC) in accordance with the NIH Public Access Policy. Final, peer-reviewed manuscripts must be submitted to the NIH Manuscript Submission System (NIHMS) upon acceptance for publication, and be made publicly available on PMC no later than 12 months after the official date of publication."
I've not read it in a lot of detail but it looks like there's a positive correlation between releasing papers and having them accepted. Not sure how they've controlled for confounders (you only release papers you're confident in the quality of on arXiv?) https://arxiv.org/pdf/2007.00177.pdf
Nothing gets rid of a potentially interested reader faster than a paywall. I find it surprising that scientists aren't getting smarter about publicizing themselves. All that effort, and you can't be bothered to blog about your findings, tweet a bit, engage with your peers online, etc.? There's this notion of spending months or years on something and then expecting people to actually find it, pay for it, and then read it, only to then consider referring to it. It doesn't work that way if you are just starting out.
The search functionality is "temporarily unavailable" and Google seems to have not indexed the site.
Find the abstract elsewhere and then use the DOI to find it on Sci-Hub?
That's pretty much it. Suppose you want to read a paper on memes, so you search scholar.google.com for "memes". Then let's say you find https://www.jbe-platform.com/content/journals/10.1075/etc.10... (I just chose one at random).
Going to the link gives you just the abstract. This one happens to have a full text link, but suppose it didn't. Then you'd go to your favorite sci-hub site (e.g. sci-hub.tw) and paste the URL into the "enter URL..." field, and the paper shows up.
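The lookup step can be scripted too. Here's a minimal helper, assuming the mirror accepts a URL of the form `https://<mirror>/<doi>` (the hostname rotates frequently, and `sci-hub.tw` is just the one mentioned above; the DOI below is a made-up placeholder):

```python
def scihub_url(doi: str, mirror: str = "sci-hub.tw") -> str:
    """Build a Sci-Hub lookup URL from a DOI.

    The URL scheme and mirror hostname are assumptions; substitute
    whichever mirror currently resolves for you.
    """
    return f"https://{mirror}/{doi}"

# Hypothetical DOI, for illustration only.
print(scihub_url("10.1000/xyz123"))
```

Pasting a DOI or the publisher's abstract-page URL into the site's search field accomplishes the same thing without scripting.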
More downloads lead to more citations (the paper is seen by more people).
More interesting papers get more downloads (people download papers that are more interesting).
Both directions make sense. Not sure which one contributes more to the correlation?
But according to this data, you could publish in a 4th-quartile journal and, if your paper is interesting, free to download, and has some figures, it will be read and cited.
Even if you have institutional or individual access to the likes of Wiley or Elsevier it is usually far easier to just feed the DOI to Sci-Hub and read the paper instead of jumping through all the hoops to get 'official' access. This goes doubly for those who, like me, use whitelists for cookies and block third-party content (including cookies) since it generally takes a few attempts to convince the paywall that you just logged in for the umpteenth time and can I now please read that paper please? Nope, thou shalt not pass!
aw shucks, I'll just get the thing off Sci-Hub again.
The only problem is that here they are giving away "paid newspapers" for free. So everyone who wanted a paid newspaper but didn't have the money to pay for it read it more, because they were able to steal it.
By this analogy, the conclusion adds that people being able to steal newspapers helps keep everyone more informed.
Now, please do the same analogy for food.
Or the problem is that there are "paid newspapers" in the first place.
> Now, please do the same analogy for food.
No, because information is not a finite resource in the same way that food is: food can't be cloned at negligible cost after it is produced. If food were replicable like in Star Trek, then at that point we could make the same argument for food. (We do waste a lot of food when people can't pay for it or its transportation and distribution, though.)
I don’t think researchers care that much about rules that hinder them when it’s publish or perish.