More than 2M research papers have disappeared from the Internet (nature.com)
182 points by Brajeshwar 7 months ago | 66 comments



- "It is also true that much content is “preserved,” often illegally, in shadow libraries/archives such as Library Genesis and Sci-Hub (for more, see Bodó, 2018a, 2018b; Eve, 2022). These archives are at great legal threat of shutdown but have also proved surprisingly resilient. We do not, in this article, count material stored in such archives, even though it constitutes an additional storage location."

A major limitation.


I don't know, this article is explicitly about the problem of trying to link to something you find valuable but its supposedly durable url has been taken offline.

The DOI is supposed to ensure a copy can be found. If it fails to do that in 30% of cases, it's not very useful, regardless of whether a copy exists somewhere online.


I will say that DOI.org isn't the only DOI resolver and that sci-hub in many ways behaves like one.
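To make the "more than one resolver" point concrete, here is a minimal sketch of turning a DOI into a lookup URL for whichever resolver you choose. `https://doi.org/` is the standard resolver prefix; `resolver.example`, the function names, and the sample DOI `10.1234/example.5678` are hypothetical placeholders, and the code only builds URLs without fetching anything.

```python
from urllib.parse import quote

DOI_ORG = "https://doi.org/"  # the standard resolver prefix


def normalize_doi(raw: str) -> str:
    """Strip common prefixes so only the bare '10.xxxx/suffix' form remains."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Very loose sanity check: DOIs start with "10." and contain a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {raw!r}")
    return doi


def resolver_url(raw_doi: str, resolver_base: str = DOI_ORG) -> str:
    """Build a lookup URL for the same DOI against any resolver base URL."""
    doi = normalize_doi(raw_doi)
    return resolver_base.rstrip("/") + "/" + quote(doi, safe="/")
```

The identifier stays the same across resolvers; only the base URL changes, which is why a dead resolver does not by itself invalidate the DOI.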


One of the few useful cases for IPFS tied to DOI.


IPFS can be inaccessible due to IPFS bridges blocking the request.


You can programmatically cycle through bridges, or connect directly to IPFS nodes. Importantly, it’s content addressable; as long as you have a hash, you can find it somewhere (assuming sufficient object coverage).
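A rough sketch of what cycling through bridges (gateways) could look like, assuming path-style gateway URLs of the form `<gateway>/ipfs/<cid>`. The gateway list is illustrative and may be out of date, the function names are made up for this example, and the sketch does not verify the returned bytes against the hash.

```python
import urllib.request
from typing import Callable, List, Optional, Sequence

# Illustrative public gateways; availability is an assumption, not a guarantee.
GATEWAYS = ["https://ipfs.io", "https://dweb.link"]


def gateway_urls(cid: str, gateways: Sequence[str] = GATEWAYS) -> List[str]:
    """Build one path-style URL per gateway for the same content hash."""
    return [f"{g.rstrip('/')}/ipfs/{cid}" for g in gateways]


def http_fetch(url: str, timeout: float = 10.0) -> bytes:
    """Plain HTTP GET; raises an OSError subclass on network or HTTP failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_from_any(
    cid: str,
    gateways: Sequence[str] = GATEWAYS,
    fetch: Callable[[str], bytes] = http_fetch,
) -> Optional[bytes]:
    """Try each gateway in order and return the first successful response."""
    for url in gateway_urls(cid, gateways):
        try:
            return fetch(url)
        except OSError:
            continue  # blocked or unreachable gateway: move on to the next one
    return None  # no gateway had (or would serve) the object
```

Because the address is the hash rather than a hostname, any gateway or node that has the object can serve it, which is what makes the fallback loop meaningful.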


while of course you should only do this if you are in a country where it is legal, you can look up a doi in libstc.cc, sci-hub.ru, or annas-archive.org


Are there jurisdictions where merely accessing this content is punishable by law?

EDIT: genuinely asking, I know very little about IP law.


there are jurisdictions where making a copy of it is, even for personal use, but others where it is not


I assume viewing a PDF in browser is generally safe everywhere as long as you aren't downloading to HD or cloud? Or is even this activity realistically dangerous without a VPN in some places?


"Viewing something in the browser" is technically downloading. The legal system is too archaic to understand how the internet works.


this can vary by jurisdiction


i don't think it's dangerous (even torrenting it isn't dangerous) but i don't want to recommend doing something illegal even though it's totally safe


You don't think this alarm should be raised because there exist illegal copies?

Otherwise what is the limitation? What is the failing that makes it not that useful of a study?


Indeed, "we can't find many books if we ignore the libraries!"


More like "we can't find many publicly-available books if we ignore home libraries". It seems prudent to wonder where so many previously-publicly-available books have gone.


both of the mentioned libraries are publicly available, seems prudent to count them


The libraries that can be shut down by legal fiat? Hmm.


so any library?


Think of how trivial it would be, from a storage perspective, to have a central repo of every (publicly-funded, at a minimum) research paper ever written if copyright issues were not present.

It could be run by an international non-profit or as a collaboration between governments. It wouldn't be particularly expensive to run when weighed against the huge long-term benefits.

It's really a failure of the U.S. and EU governments to not clamp down on the Elseviers of the world who bring very few benefits at a high cost to open science and society at large.


The Library of Congress has the digital infrastructure and expertise to be able to handle something like this issue.

Unfortunately, scientific journals are a sacred cash cow and infringing on any of their territory, real or imagined, prevents any meaningful top down change within the system. They've got money to pay lawyers to prevent any reform or mandates or flexing of existing regulatory powers.

Pirate everything.

Publicly funded research gets lost because it negatively affects profits to maintain unpopular, unread material in any sort of diligent and effective way.

Journals have close ties to universities and academia, and big commercial research outfits, and all of the social ties being involved with those circles can bring. They've got lobbying perfected to an art, and pay good, ruthless lawyers to protect their interests.

The average voter won't ever care enough to make a popular revolt, bottom up change possible; scientific publishing is too dry and anemic when you contrast against the million other, more outrageous, imminently threatening issues people care about. When you look at the conflicted interests that benefit from the status quo, such as companies that can pay Journals so their studies and papers will appear alongside distinguished, credentialed works, there doesn't appear to be any place where effective leverage can be applied.

Pirating it all is the only ethical solution. Outlaws with rogue copies is the only feasible way much of this data will carry on into the future. None of the people who can change things actually seem to want to, and the public has more pressing matters capturing their attention.

Aaron Swartz had it right. Apathy, greed, and politics aren't a problem with a solution that will come from this space. The only winning move is not to play their game.

The same largely applies to any media content being gatekept by entities requiring repeated, endless rent on their "property" despite bringing no value to the market, existing simply to expand, endlessly, mindlessly shoveling other people's money into their shareholders' pockets.

We live in a world that is awfully stupid sometimes. So stupid that important scientific literature is being faded into oblivion simply because it can't be monetized under perverse adtech incentive schemes. This has crucial implications, because if the habit of lazy citation takes root, it creates a kind of secular system of faith in scientism, with authors being given the benefit of the doubt when their citations can't be verified. That could be catastrophic if it affects medicines, engineering, environmental management, urban planning, forensics, or a myriad other categories where a flawed scientific paper might pose a threat.

Knowing that flawed papers get through, what could an actual malicious actor get away with?

Don't let stupid win.

Pirate everything.


> Aaron Swartz had it right.

Man, I still miss that guy. I didn't know him, but he was indispensable.


This is one of the main reasons I created Linkwarden - to combat Link-Rot.

Linkwarden is an open-source collaborative bookmark manager to collect, organize and preserve webpages:

https://linkwarden.app


I really like the demo videos on your landing page. What did you use to make them?


Thanks, I used https://www.screen.studio for the videos, totally worth it!


Thanks


Suggestion: Find your own name that isn't coat-tailing on bitwarden.

Or don't, I'm not your mom, but the name tells me not to even investigate, and I care about such things enough to run a YaCy node for example.


I personally think the name is perfectly fine and stands on its own. Maybe I’m just not familiar with it, but I would posit Bitwarden isn’t enough of a household name to trigger an association in most people.


Have you investigated Linux, Vim, C++, xBSD, ...


By the time linux came to be, Unix was already genericized and there were many similar *x products. Bitwarden is not a generic term for all things that store even just passwords let alone random other stuff.

It's a particular recognizable name that a specific single someone else paid a bunch of advertising money to build.

c++ is actually c + stuff, and neither are individual brands or projects but specs. Everyone is free to make a something-c or c-something. msvcc and gcc are not any sort of gotcha. The various bsds are forks of bsd. Neither of these map to this case. (is linkwarden a fork of bitwarden? a superset or subset of bitwarden?)

vim is literally a development of vi, which is again not anyone's brand anyway. Though I would actually say the same thing in that case. I don't think neovim should call itself anything-vim either because it's not, it's just a new editor that has essentially no connection to vi or vim.

It's not c++ or *bsd, it's calling your new cookie Moreos. It's not illegal just lame.


> it's calling your new cookie Moreos. It's not illegal just lame.

It's a grand tradition of FOSS and it's awesome. We need a rule: No denigrating someone else's wit without putting something better - of your own - out there.


So, it has to be called Yet Another Something?

C'mon, let's not gatekeep or trademarkify normal descriptive or allusive words. Linkwarden is fine!


By the way, what's on the Internet is NOT forever. I expect that Facebook and Twitter datasets, for example, will eventually go offline, sold to AI companies, slowly reaching irrelevancy and finally resting in some cold data storage or used for training/educational purposes.


It’s the worst of both worlds

You have to assume an unappealing photo of yourself you posted in your youth is stored somewhere forever

And, things you actually want to be stored forever and be retrievable when useful, probably won’t.


The problem is that you lack end-user control of your personal data [0]. You don't control whether it's preserved or deleted; you don't control access. It's a fundamental requirement of all data.

[0] Unless you live in the EU?


We shall henceforth refer to this as Wepple's paradox.


Good point :)


Scihub seems more and more beneficial every day.


Sci-hub is dead, hasn't been updated since 2021. Sad state of affairs, hope Anna's Archive can take up the mantle.


Alexandra is probably hoping too much that the Delhi court will hand sci-hub a win that makes it legal in India. That would be a precedent she aims to see imitated in other places. She is still complying with the court's request to stop uploading new papers.


It's moving onto overlay networks. As everything of human value on the internet will eventually have to do.


A lawsuit is underway with a named defendant for Anna's Archive https://torrentfreak.com/lawsuit-accuses-annas-archive-of-ha...


Ah, major bummer, thought the lawsuit was just a useless one against all John Does. Not good if they've found her :/

Another case to add to the follow case list, I guess. https://www.courtlistener.com/docket/68157923/oclc-online-co...

The other tech case I follow is the Yuzu vs Nintendo case. Wish the best for both!


Looks like you can stop following the latter:

https://news.ycombinator.com/item?id=39593647

Yuzu loses.


Ah shit


Until they manage to get the servers shut down, there are still 88,000,000 articles for download.


Arxiv feels like the most legitimate "journal" these days. It'd be sci-hub except they seem to have managed to kill it finally

The business of IP has so hollowed out the credibility of its brokers that the business part (like the stock price) has become its only remaining value, and scientific journals are absolutely included here if their publishing method doesn't preserve the goddamn body of evidence we claim science relies on


Do not forget that you are charged when you publish your work in scientific journals. Would you not expect your published work to be preserved?


One might think so, but taking for granted that a for-profit corporation guarantees you anything (even if it is explicitly promised) is foolhardy and sometimes even dangerous.


> Arxiv feels like the most legitimate "journal" these days.

Fun fact: it was created initially as a central repository mailbox for physics preprints at the Los Alamos National Laboratory. Yup, the same Los Alamos that we know from the Manhattan project.


Use DOIs, they said. URLs are just transient, they said. But DOIs are permanent, they said.

Could it be that the whole DOI system was just a ruse by the commercial academic publishers to re-intermediate themselves between academic authors and readers in the age of the WWW?


I really don't see DOIs as the problem here. They are just catalogue numbers, not an archive or repository.

The DOIs are still permanent, but their resolvers might not be. Copyright problems definitely makes that harder to fix, but again, I don't think the DOI is the problem here.


This is the equivalent of complaining that file hashes are useless because they don't contain the data originally stored in the file. That is not the purpose.
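The analogy can be made concrete with a small sketch using Python's standard `hashlib`. The function names are made up for illustration: the digest is a short, fixed-size label that can confirm a copy is the right one, but it cannot reproduce the document if every copy is gone.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest as hex: 64 characters regardless of input size."""
    return hashlib.sha256(data).hexdigest()


def is_same_document(candidate: bytes, expected_digest: str) -> bool:
    """Check whether a copy found somewhere matches the identifier we hold."""
    return digest(candidate) == expected_digest
```

A DOI plays the role of `expected_digest` here only loosely: it names the paper, while an archive or repository still has to hold the bytes.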


That's why we need to push a desktop-centric internet, not a thin-client/endpoints one. People MUST learn to store their data, and distributed storage must become the norm for public stuff, so anything someone values will be preserved.

Just start with Zotero to have your papers LOCALLY (not on Zotero cloud). It's a small, simple step, and anyone has enough storage to make it.


Prior: most published research is false [0].

If a paper is being forgotten by the internet, that's a weak signal that it was not valued by its field. If you combine that weak signal with the strong prior, you'd conclude that the paper wasn't a meaningful contribution and can be safely forgotten.

I can see why this mass forgetting is a problem for preservationists and historians of science.

[0] https://journals.plos.org/plosmedicine/article?id=10.1371/jo...

[1] e.g. Gordon Allport's "The Nature of Prejudice" was published 1954, has been cited >50,000 times, and is still in print: https://www.amazon.com/Nature-Prejudice-25th-Anniversary/dp/...


> If a paper is being forgotten by the internet, that's a weak signal that it was not valued by its field.

That would be valid if the decision was made by popular vote.

It's completely invalid if a single entity is allowed to decide whether it lives or dies.


Some research is forgotten/dismissed by some generation only to be revived by a later generation (see e.g. Geometric Algebra)


Science doesn't work like that; we can't look at data/research at our current point in time and deduce its value forever. Having this information years down the line can inspire some more research ideas or contribute some groundwork to future research


to tell whether a piece of research is false or not, you need to be able to find out what it was. then you can conclusively prove it false. lost research like tesla's weather control notes, the alchemists' transmutation of lead into gold, or water-powered cars can instead give rise to conspiracy theories about how it was suppressed by the powerful and wasted lifetimes that could have been dedicated to something productive


It also forces the next generation of scientists to repeat the same mistakes instead of learning from the past.

Of course it would be easier still to learn from mistakes if academic research allowed for more honesty about negative or undecipherable results, but that's another issue.

It's still orders of magnitude easier to do some in-depth analysis on a paper in your chosen field to see if it looks like a load of BS than to get funding and conduct a study yourself to see if a hypothesis pans out with some kind of meaningful result.


i don't think you can reject a hypothesis because someone conducted a shitty study; results from shitty studies are sort of by definition inconclusive. usually there are things you can learn from their shitty study tho


The probability assessment is fine, but meaningless even outside history. Baby and the bathwater


I wonder if this is mostly just the DOIs associated with weak or predatory journals


Arweave fixes this. It’s pay once permanent decentralized storage. One of the few web3 projects with fast-growing real world usage.


I imagine this would be of great benefit to plagiarists.


So, blockchain?

/me ducks.



