(Disclosure: Zotero dev)
Also: thank you for building a tool that actually works. I used it first in 2010, then switched to EndNote for better web integration, but I’m back to Zotero now that there’s a Safari add-on. The PDF functionality in 5.0 has made apps like Notability and Goodnotes completely redundant for the seemingly trivial task of managing and marking up highlights.
It’s one of those OSS applications that genuinely works well and can be approachable to everyday people without feeling like a half-baked product.
Came here to say that! Also see OpenAlex, soon to be launching:
OpenAlex featured in an HN thread a few days ago: https://news.ycombinator.com/item?id=31271477
Web UI in the works.
COI: cofounder and dev for Unpaywall and OpenAlex
The opinion expressed in the parent comment is (perhaps unwittingly) extreme and doesn't even solve the problem.
First, the "no paywalls" solution does not even solve the problem! What happens in practice is that publishers shift the cost from the reader to the author by charging MASSIVE open access fees to publish. The taxpayer gets screwed by the publisher on the front-end instead of the back-end. Any solution needs hard upper limits on the cost of publishing.
Second, open access requirements should be scoped to work paid for with public funds. Professors should be allowed to write books and publish them through the university press as long as they do the work on their own time (or the university's time). Graduate students should be allowed to do off-hours consulting work to make ends meet during their PhD studies.
Put the search box on top, and all the secondary fluff at the bottom.
Compare that to sci-hub, which is basically only a big search box to drop a DOI.
I have about ten academic papers that are accessible online for free, so I tried a vanity search. The Internet Archive has indexed one of those papers; Unpaywall has none.
It looks like we are missing a bunch of your public papers, such as those published here:
Both Unpaywall and scholar.archive.org work best with papers that have persistent identifiers like DOIs, PMIDs, DOAJ article ids, or dblp records. Unpaywall currently works with Crossref DOIs exclusively.
With scholar.archive.org (and fatcat.wiki, the backing catalog), it is possible to submit metadata records directly, but it can be laborious, and it would be better for everybody if this process were automated.
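To illustrate why persistent identifiers matter for matching: a rough sketch of how one might normalize DOI strings before comparing records across catalogs. This is not how Unpaywall or fatcat actually do it; the `normalize_doi` name, the prefix list, and the regex are all my own assumptions, and the DOIs in the usage note are made up.

```python
import re

# Prefixes commonly seen in front of a DOI. This list is an assumption, not exhaustive.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Return a lowercased bare DOI, or None if the string does not look like one."""
    # DOIs are case-insensitive, so lowercasing is safe for comparison purposes.
    s = raw.strip().lower()
    for prefix in _PREFIXES:
        if s.startswith(prefix):
            s = s[len(prefix):]
            break
    # Loose shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
    if re.fullmatch(r"10\.\d{4,9}/\S+", s):
        return s
    return None
```

With this, `normalize_doi("https://doi.org/10.1234/ABC.5")` and `normalize_doi("doi:10.1234/abc.5")` both come out as the same key, while a paper with no DOI at all has nothing comparable to match on.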
Processing OAI-PMH feeds or extracting bibliographic metadata from HTML metadata would probably improve our coverage, and we are hoping to roll out that kind of scraping eventually. But it has been a challenge to clean and de-duplicate metadata at that scale.
Just for reference for anyone else reading this, here is an excerpt from an e-mail I sent you in March 2021, after IA Scholar was first mentioned on HN:
“I contacted the people at [a large Japanese academic library]. ... I showed them your HN post about the data you've already collected through J-STAGE, and, contrary to my own impression, they said you have probably already captured most of the metadata for Japanese academic journals that would be easily available. They also pointed out that J-STAGE includes a fair amount of publications from the humanities side of things, also contrary to my own impression.
“The main sticking point, they said, is journals that are published by universities or academic societies and have not been listed on J-STAGE. Many of those journals have never been digitized, they said, and those that are available in digital form are likely to be available only on those universities’ or societies’ individual websites. The library people didn’t know of any aggregators or indexes for such sites. The only way to find them, they suggested, would be for someone to hunt for the sites by hand.
“Over the years, I myself have been involved with the publication of several such journals and have set up websites for a couple, too. The ones published by departments at [a particular Japanese university] are included in [the university’s online repository] but not yet, it seems, on J-STAGE. A couple published by small academic societies are available only on those societies' websites. [Addendum: The Japanese academic societies I have been involved with, mostly in the humanities, would have difficulty getting DOIs or other persistent identifiers for the papers they publish; it would take some effort even to convince them of the necessity. They are volunteer-run organizations, and just maintaining their websites is often a challenge for them.]
“Yet another impression of mine (also perhaps wrong) is that a higher percentage of academic research in Japan is published through such journals than in the U.S. It would be very valuable to have that research findable through IA Scholar, but the barriers to collecting it seem high.”
Is that dataset different from this un-gated one? They're both indexes of Crossref DOIs, and they're both 120 million records long.
Unpaywall both checks whether articles are actually available from the publisher (by following the DOI and parsing the landing page) and looks for other versions elsewhere on the web (e.g., a preprint). It is simple in theory, but doing this reliably for millions of DOIs from thousands of publishers is a lot of work!
https://unpaywall.org/data-format is the format for Unpaywall
https://api.crossref.org/swagger-ui/index.html for crossref info
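For anyone who wants to poke at the Unpaywall format linked above, here is a minimal sketch, assuming the v2 REST endpoint (which asks for an `email` query parameter) and the `is_oa` / `best_oa_location` fields described on the data-format page. The helper names are mine, and no network call is made here; fetching the URL and decoding the JSON is left to whatever HTTP client you prefer.

```python
from urllib.parse import quote

API_ROOT = "https://api.unpaywall.org/v2/"


def lookup_url(doi, email):
    """Build the lookup URL for one DOI (hypothetical helper, not part of any client library)."""
    return f"{API_ROOT}{quote(doi, safe='/')}?email={quote(email)}"


def best_free_link(record):
    """Pick a readable link out of a decoded Unpaywall record, or None if it is not open access."""
    if not record.get("is_oa"):
        return None
    # best_oa_location can be null in the JSON, hence the "or {}".
    location = record.get("best_oa_location") or {}
    # Prefer a direct PDF link, fall back to the landing page.
    return location.get("url_for_pdf") or location.get("url")
```

A hand-made record such as `{"is_oa": True, "best_oa_location": {"url": "https://example.org/paper", "url_for_pdf": None}}` yields the landing-page URL, and a record with `"is_oa": False` yields `None`.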
What I am missing are open source repositories, Docker images and virtual machines for all CS papers. Why is it acceptable to publish papers without enabling everybody else to instantly reproduce the results when it is technically possible?
I really do fear it is next on Google's discontinue list. They cannot be making any money from it.
It's not instant gratification and you lose some anonymity, but if you really want to read it, it's good to know that option exists.