Unpaywall: An open database of 31,903,705 free scholarly articles (unpaywall.org)
457 points by memorable 52 days ago | hide | past | favorite | 52 comments



https://oa.mg/ is another initiative that indexes over 200 million papers and lets you know whether a paper is open or not. They tend to have download links for a rather large number of papers.


It would be great to see Zotero integrate these databases into a PDF search feature or a plug-in.


Zotero has had Unpaywall integration since 2018:

https://www.zotero.org/blog/improved-pdf-retrieval-with-unpa...

(Disclosure: Zotero dev)


I… wow. Thank you.

Also: thank you for building a tool that actually works. I used it first in 2010, then switched to EndNote for better web integration, but I’m back to Zotero now that there’s a Safari add-on. The PDF functionality in 5.0 has made apps like Notability and Goodnotes completely redundant for the seemingly trivial task of managing and marking up highlights.

It’s one of those OSS applications that genuinely works well and can be approachable to everyday people without feeling like a half-baked product.


All scholarly papers funded by public money should be made freely available. These paid publications need to stop being the Intuit of their fields and stop charging people for something taxpayers already paid for.


To make things worse, they even outsource reviewing to unpaid, uncredited scientists. Many just put up with it because the older staff or the university management are eager for popularity points.



I'm gonna stick with scihub to be honest


Scihub solves an entirely different problem.


Yeah, it seems much better known, and I'd rather give my attention to something that isn't a copycat of hard work.


I wouldn’t call Unpaywall a copycat of Sci-Hub. Unpaywall helps you find links to the same article hosted elsewhere, e.g. on a preprint server or on the author’s page. It is more similar to specific features of Google Scholar or Semantic Scholar, with the small additional convenience of appearing directly on the article web page instead of requiring a search.
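For anyone curious, Unpaywall also exposes this lookup programmatically. A minimal sketch in Python, assuming the v2 REST endpoint and response fields described in Unpaywall's public API docs (the DOI and URLs below are illustrative):

```python
import json
import urllib.request

UNPAYWALL_API = "https://api.unpaywall.org/v2/"

def oa_lookup_url(doi, email):
    # The API requires a contact email as a query parameter.
    return f"{UNPAYWALL_API}{doi}?email={email}"

def best_oa_link(record):
    # Prefer a direct PDF link, fall back to the landing-page URL.
    loc = record.get("best_oa_location")
    if not loc:
        return None
    return loc.get("url_for_pdf") or loc.get("url")

# Abridged response shape (illustrative values, not real data):
sample = {
    "doi": "10.1038/nature12373",
    "is_oa": True,
    "best_oa_location": {"url": "https://example.org/paper", "url_for_pdf": None},
}
print(best_oa_link(sample))  # https://example.org/paper

# A live lookup would look like this (needs network access and a real email):
# with urllib.request.urlopen(oa_lookup_url("10.1038/nature12373", "you@example.com")) as r:
#     print(best_oa_link(json.load(r)))
```

The browser extension is essentially this lookup wired to the DOI of the page you're on.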


I'd probably use Scihub for a quick download for a paper I'm interested in reading. And then if I wanted to share it publicly or with coworkers I might use Unpaywall.


Just yesterday Unpaywall found a publicly available copy of a 1980s article I was looking for and which Sci-hub didn't have.


I don’t know why it should be a question of which one to use. Use both because both do different things.


Unpaywall has very little to do with Sci-Hub: it isn't about giving access to works that aren't publicly available, but about linking to existing free uploads of the material by the author (e.g. on arXiv or their personal website).


Scihub and unpaywall are entirely different endeavours.


Hot take: Google Scholar is content collected by means of donated human and robot indexers. It is not a database or service until they offer an API that allows us to harvest it. Yes, I am salty af about it. It is taking energy away from curating ORCID records and the like.



> See also: https://core.ac.uk/

Came here to say that! Also see OpenAlex, soon to be launching:

https://docs.openalex.org/


Good news, OpenAlex has launched already in API and snapshot form! Includes most Unpaywall data.

OpenAlex was featured in an HN thread a few days ago: https://news.ycombinator.com/item?id=31271477

Web UI in the works.

Heather COI: cofounder and dev for Unpaywall and OpenAlex


If a university paywalls a single paper or article they produce, they should immediately be stripped of all public funding. I can't think of a single reason why the public should be forced to pay taxes to them if the only way to see a benefit is by proxy of someone paying to attend or someone spending their private money to access the information.


In the US, federally funded research is already required to be made available without paywalls within one year of publication. Open-access advocates online generally seem to be unaware of this; the current goalposts are therefore (a) immediate open access, and (b) open-access publication of all scientific research regardless of funding source.

https://obamawhitehouse.archives.gov/blog/2013/02/22/expandi...

https://www.nih.gov/health-information/nih-clinical-research...


I didn't know this; thanks for the information. There is probably an opportunity for some projects to make exploration and discovery more intuitive. I think your point (b) is something where the distinction is not always obvious and leads to misunderstanding (at least for me).


I'm a proponent of affordable open access. I think that work produced using even a penny of federal grant money should only be published in open-access venues where publication fees (including e.g. mandatory conference attendance) are capped at $500. (In particular, this means a permanent remote-attendance option at all CS conferences... no more "good work doesn't get published because the author is at a community college and can't afford a one-week European beach resort vacation".)

The opinion expressed in the parent comment is (perhaps unwittingly) extreme and doesn't even solve the problem.

First, the "no paywalls" solution does not even solve the problem! What happens in practice is that publishers shift the cost from the reader to the author by charging MASSIVE open-access fees to publish. The taxpayer gets screwed by the publisher on the front end instead of the back end. Any solution needs hard upper limits on the cost of publishing.

Second, open access requirements should be scoped to work paid for with public funds. Professors should be allowed to write books and publish them through the university press as long as they do the work on their own time (or the university's time). Graduate students should be allowed to do off-hours consulting work to make ends meet during their PhD studies.


For some reason, I expected a big search box for articles on the front page. It exists at http://unpaywall.org/articles but the link's hidden away at the very bottom of the page.


Me too. It's like they don't want people to use their website. Strange decision.

Put the search box on top, and all the secondary fluff at the bottom.


The main way most people use it is the Chrome or Firefox extension.


Same here.

Compare that to sci-hub, which is basically only a big search box to drop a DOI.


The Internet Archive has a similar project in beta:

https://scholar.archive.org

I have about ten academic papers that are accessible online for free, so I tried a vanity search. The Internet Archive has indexed one of those papers; Unpaywall has none.


Hi, I'm the maintainer of scholar.archive.org.

It looks like we are missing a bunch of your public papers, such as those published here: http://park.itc.u-tokyo.ac.jp/eigo/publication_en.html

Both Unpaywall and scholar.archive.org work best with papers that have persistent identifiers like DOIs, PMIDs, DOAJ article IDs, or dblp records. Unpaywall currently works with Crossref DOIs exclusively.

With scholar.archive.org (and fatcat.wiki, the backing catalog), it is possible to submit metadata records directly, but it can be laborious and would be better for everybody if this process was automated.

Processing OAI-PMH feeds or extracting bibliographic metadata from HTML metadata would probably improve our coverage, and we are hoping to roll out that kind of scraping eventually. But it has been a challenge to clean and de-duplicate metadata at that scale.
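On the persistent-identifier point: a lot of matching comes down to spotting DOIs in text in the first place. A rough sketch, using a regex adapted from Crossref's published guidance for matching modern DOIs (it won't catch every legacy DOI, but covers the vast majority):

```python
import re

# Pattern adapted from Crossref's guidance for modern DOIs:
# "10." + a 4-9 digit registrant prefix + "/" + a suffix.
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)")

def find_dois(text):
    """Return all DOI-shaped strings found in free text."""
    return DOI_RE.findall(text)

print(find_dois("See https://doi.org/10.1038/nature12373 for details."))
# ['10.1038/nature12373']
```

Records without any such identifier are exactly the ones that need the manual submission path described above.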


Many thanks for the reply. Internet Archive Scholar—like everything else the Internet Archive does—is fantastic, and I am very grateful for all the efforts you and your colleagues make.

Just for reference for anyone else reading this, here is an excerpt from an e-mail I sent you in March 2021, after IA Scholar was first mentioned on HN:

“I contacted the people at [a large Japanese academic library]. ... I showed them your HN post [1] about the data you've already collected through J-STAGE, and, contrary to my own impression, they said you have probably already captured most of the metadata for Japanese academic journals that would be easily available. They also pointed out that J-STAGE includes a fair amount of publications from the humanities side of things, also contrary to my own impression.

“The main sticking point, they said, is journals that are published by universities or academic societies and have not been listed on J-STAGE. Many of those journals have never been digitized, they said, and those that are available in digital form are likely to be available only on those universities’ or societies’ individual websites. The library people didn’t know of any aggregators or indexes for such sites. The only way to find them, they suggested, would be for someone to hunt for the sites by hand.

“Over the years, I myself have been involved with the publication of several such journals and have set up websites for a couple, too. The ones published by departments at [a particular Japanese university] are included in [the university’s online repository] but not yet, it seems, on J-STAGE. A couple published by small academic societies are available only on those societies' websites. [Addendum: The Japanese academic societies I have been involved with—mostly in the humanities—would have difficulty getting DOIs or other persistent identifiers for the papers they publish; it would take some effort even to convince them of the necessity. They are volunteer-run organizations, and just maintaining their websites is often a challenge for them.]

“Yet another impression of mine (also perhaps wrong) is that a higher percentage of academic research in Japan is published through such journals than in the U.S. It would be very valuable to have that research findable through IA Scholar, but the barriers to collecting it seem high.”

[1] https://news.ycombinator.com/item?id=26408897


It is true that some of this might need to be done manually, but Google Scholar shows that it can be done, with some level of accuracy, via HTML and PDF scraping. PIDs and more formalized metadata make things much easier. But Google Scholar did put pressure on platforms/publishers/repositories to include at least minimal metadata in HTML meta tags, and this can be machine-extracted. And there is a ton of content and metadata available via OAI-PMH. Neither of these technologies costs publishers anything on the margin once implemented, and many have implemented them to reap the discovery benefits of large search indices.
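For a sense of how little machinery that meta-tag extraction needs: a sketch with Python's stdlib HTML parser, pulling the Highwire-style citation_* tags that Google Scholar popularized (the sample page below is made up):

```python
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect Highwire-style citation_* <meta> tags, the minimal
    machine-readable metadata many publishers now emit."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = a.get("name", "")
        if name.startswith("citation_"):
            self.meta[name] = a.get("content", "")

# Hypothetical article landing page:
page = """<html><head>
<meta name="citation_title" content="An Example Paper">
<meta name="citation_doi" content="10.1234/example">
</head></html>"""

p = CitationMetaParser()
p.feed(page)
print(p.meta)
```

The hard part, as noted above, isn't the extraction; it's cleaning and de-duplicating what comes out at scale.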


- "To complete the download of the full dataset, please fill out this form, which helps us report usage back to our funders, and we'll immediately provide you with a download link."

https://unpaywall.org/products/snapshot

Is that dataset different from this un-gated one? They're both indexes of Crossref DOIs, and they're both 120 million records long.

https://www.crossref.org/blog/new-public-data-file-120-milli...


Unpaywall is based on Crossref DOIs (one-to-one records), and adds information about publicly accessible versions of each work. In theory publishers can register metadata about whether articles are OA or not with Crossref, but the quality and coverage of this metadata is poor in general.

Unpaywall will both check if articles are actually available from the publisher (by following the DOI and parsing the landing page), and by looking for other versions elsewhere on the web (eg, a pre-print). It is simple in theory, but doing this reliably for millions of DOIs from thousands of publishers is a lot of work!
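As a sketch of what consuming that output looks like: the snapshot is JSON Lines, one record per DOI, so filtering for open-access records is a one-liner (field names follow the documented Unpaywall data format; the records below are made up):

```python
import json

# Hypothetical three-line slice of a snapshot file:
lines = [
    '{"doi": "10.1/a", "is_oa": true, "oa_status": "gold"}',
    '{"doi": "10.1/b", "is_oa": false, "oa_status": "closed"}',
    '{"doi": "10.1/c", "is_oa": true, "oa_status": "green"}',
]

# Keep only the DOIs Unpaywall found a free copy of:
oa_dois = [r["doi"] for r in map(json.loads, lines) if r["is_oa"]]
print(oa_dois)  # ['10.1/a', '10.1/c']
```

That extra per-DOI OA information is exactly what the plain Crossref file doesn't have.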


Unpaywall is different from Crossref. The sizes will be similar because a lot of scientific DOIs are from Crossref, and it's a common dataset to base initial data on.

https://unpaywall.org/data-format is the format for unpaywall

https://api.crossref.org/swagger-ui/index.html for crossref info


Would not be surprised if the Crossref DB was their starting point.


It is indeed :) (COI: Unpaywall cofounder and dev)


Very useful extension; I use it frequently. Lots of tangential commentary here, but this is helpful and complementary to other tools like Sci-Hub and Google Scholar.


I remember reading that reproducibility is hard in these scholarly articles... When will an AI attempt to reproduce the results from all CS and math articles?


I am not quite sure if that's your point, but math articles, and CS articles to some extent, don't rely on experiments to the same extent as other fields. A mathematical proof doesn't have to be reproduced but verified. It will be a huge step forward when an AI can automatically verify proofs beyond the already-used proof systems, but that should be unrelated to reproducing experiments.

What I am missing are open source repositories, docker images and virtual machines for all CS papers. Why is it acceptable to publish papers without enabling everybody else to instantly reproduce it when it is technically possible?


That's an interesting expectation. The models we have today reach average human level, but few can surpass the top human in a task (like AlphaGo). And who would pay for the massive compute necessary for such a large-scale task?


Presumably an AI will reflect less than its creators' intellect, made worse by the collaboration losses necessary to accomplish such a project now.



This is wonderful. Thank you.


Isn’t this what Google Scholar was supposed to be about? Hope they figure out how to monetize this before the cash runs out, because a properly curated, non-fire-hose kind of source would be great. Particularly if scientific publishing continues to shift to open access vs. paywalled.


I use Google Scholar frequently because it is far better and easier to log in to than proprietary databases like Scopus. I think it is miles ahead.

I really do fear it is next on Google's discontinuation list. They cannot be making any money from it.


I've noticed recently that the trend of Google not returning links older than a certain age has started to extend to articles in open access archives, and maybe to published articles as well.


Losing Google Scholar would be devastating to research productivity in all fields. Academic databases are critical infrastructure for science.


Google scanned books borrowed from libraries to enrich Google Books; maybe they will do this again.


Yes, Unpaywall is self-sustainable as a result of service-level agreements with companies who use the data in their products (specifically Web of Science, Scopus, many others). We also recently got a grant to help with additional development and integration into OpenAlex: https://blog.ourresearch.org/arcadia-2021-grant/


They've at least figured out a way to make this sustainable, as they've been going at Unpaywall for quite a few years now, and also have a bunch of other cool projects: https://ourresearch.org/


btw, if there is a paper you want to read behind a paywall and it's not available elsewhere, many times if you email the author(s) they will send it to you.

It's not instant gratification and you lose some anonymity, but if you really want to read it, it's good to know that option exists.



