After [the US lobby started to push for copyright enforcement in the EU under the threat of sanctions](https://falkvinge.net/2011/09/05/cable-reveals-extent-of-lap...), things have changed. The copyright and IP laws that US lobbies push on the world are killing the idea of a free Internet and only benefit large corporations that are untouchable.
Elsevier is Dutch-based, and they are fierce about suing anyone who tries to get free access to papers that were paid for by taxpayers.
The "free" Internet doesn't exist anymore, but it's good to have places like Russia where the IP law is not strictly enforced, because everything is broken so no one cares.
What do taxpayers have to do with it? I thought it was just a for-profit company?
Hyperlinks for all references would be a good start. Finding a way to generate an automatic glossary of definitions for technical terms would make scientific papers substantially more accessible too.
So let's say paper A cites paper B. If you look at paper B, we show you:
- how many times it was cited
- the direct paragraphs from paper A where it was cited
- the sections from paper A where paper B was referenced
- ... and a lot more
You can also now search these citation statements directly to find evidence-based information pretty quickly.
- Short video to showcase that search: https://www.youtube.com/watch?v=JYjCn-4uMJk
- Website with a bit more details: https://citation.to/
You can also visualize citation networks similar to ConnectedPapers, set notifications for new citations on groups of papers you're interested in, and much more.
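To make that concrete, here is a minimal sketch (my own illustration, not scite's actual schema or API) of the kind of record such a citation-statement index might store:

```python
# Illustrative only: a guess at the shape of a "citation statement" record,
# not scite's real data model.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CitationStatement:
    citing_doi: str   # paper A, which contains the citation
    cited_doi: str    # paper B, which is being cited
    section: str      # section of paper A where B is referenced
    snippet: str      # the paragraph around the citation, searchable as text

@dataclass
class CitationReport:
    doi: str                                  # paper B
    statements: List[CitationStatement] = field(default_factory=list)

    @property
    def times_cited(self) -> int:
        return len(self.statements)
```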
(Disclaimer -- I work here!)
Well, I definitely agree with your sentiment in a normative sense that scientific papers should be free and readily accessible to all -- in part because a lot of it is funded through tax dollars!
But given the current state of affairs, we're looking at making that information accessible to people without having to pay exorbitant fees to access individual research. We also offer steep discounts for students or anyone in academia.
With that in mind I would push back a little that we're just a scientific paper search engine -- our system does a lot of work in extracting and classifying those citation statements, which makes it more powerful than traditional scientific search engines.
And besides just using our search, a huge time-saving value of our service is the report pages, which help you quickly build a qualitative understanding of how something was cited.
Even if all scientific papers were freely accessible, our report pages allow you to see the direct, relevant snippets from citing papers without having to manually read each and every single one. I think that is quite valuable!
I know I've gone on a little tangent from the original discussion about scihub, and having free and open access to papers, but I did just want to throw that in because I think it's an important distinction. And as much as we all want that free and open world to exist, I think it's also interesting to think about how we can open up that information for people in the interim.
It builds a graph of referenced papers and makes it easier to narrow down which ones are important/foundational for further research
I don't think a glossary would help. What you need to find is a "review paper". These act as a primer to the field for new researchers. They're usually well written, with less jargon, and have tons of references for you to dig into. That said, I don't have a good method for finding them; I just stumble across them haphazardly.
The references and citations are tagged and filterable, making it easy to see which are the most cited, or review papers, etc.
If you are finding many papers without hyperlinked references, it's probably just because they're published in journals whose templates don't support it. In my particular research field, most publication venues' templates started supporting those links around 3-4 years ago, so my papers from, say, 2015 have no hyperlinks in references, while those from 2019 do. This didn't require any significant extra effort on my part; in fact, it generally requires less, because well-curated BibTeX entries are easier to come by now than they were some years ago.
Can you show an example paper with links?
Part of this involves turning text data about authors into linked data that can be used to navigate between texts: https://author-disambiguator.toolforge.org/
Edit for direct link to Octopus: https://science-octopus.org
Pay $15 a month for access to a rolling catalogue of science info.
Someone needs to rent and service the servers, update the software, bandwidth costs money, etc.
I'm absolutely convinced that if copyrights weren't an issue, there would be enough governments, foundations, universities, corporations, and individuals willing to pay the costs of making scientific publications available to everyone. It wouldn't have to be a paid service.
Scientific papers use approximately zero bandwidth.
As a point of reference, if I hosted such a website on my home internet connection I would be paying approximately 0.00013 cents per upload. Even a user downloading millions of papers would cost me less than a coffee.
Setting up and managing a proportionate payment system would cost more than just eating the bandwidth costs would.
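A rough back-of-envelope supports this; the numbers below are my own illustrative assumptions (about 0.5 MB per recent PDF, roughly $1 per TB of bulk transit), not the parent's figures:

```python
# Back-of-envelope bandwidth cost for serving papers.
# Assumptions are mine and purely illustrative: ~0.5 MB per PDF (recent papers
# are a few hundred kB each) and ~$1 per TB of bulk transit.
MB_PER_PAPER = 0.5
USD_PER_TB = 1.0

cost_per_paper = (MB_PER_PAPER / 1_000_000) * USD_PER_TB
print(f"cost per download:          ${cost_per_paper:.7f}")              # $0.0000005
print(f"cost per million downloads: ${cost_per_paper * 1_000_000:.2f}")  # $0.50
```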
Quick video of our citation statement search to give you a glimpse: https://www.youtube.com/watch?v=JYjCn-4uMJk
(Disclaimer -- I work at scite!)
Both support live streams, and are (being) federated and ever more integrated with other Fediverse apps.
All papers on sci-hub are available as torrents from library genesis. The full collection contains 85 million articles (before this announcement), and is about 80TB. If anything ever happens to sci-hub or library genesis, there's enough people out there with backups that a replacement can be set up fairly quickly, albeit without the proxy functionality to obtain new papers.
However, the more the merrier, so if you've got some spare hardware and bandwidth to share, I'd encourage you to contribute to the seeding effort if you're able. At current market prices of ~$30/TB, it costs ~$2400 to have a copy of the full collection sitting on your desk.
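For anyone pricing this out, the arithmetic is easy to check; a quick sketch using the figures quoted above:

```python
# Sanity-check of the storage figures quoted above.
articles = 85_000_000       # papers in the collection
collection_tb = 80          # total size in TB
usd_per_tb = 30             # approximate market price of HDD storage

avg_mb = collection_tb * 1e12 / articles / 1e6
print(f"average article size: {avg_mb:.2f} MB")                      # ~0.94 MB
print(f"cost of a full local copy: ${collection_tb * usd_per_tb}")   # $2400
```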
Yes they're copyrighted - albeit not by the authors who actually wrote them, but by the publishers who require copyright assignment for the privilege of having your work hosted on their website.
For example, see https://www.elsevier.com/about/policies/copyright (under "Author Rights") for what Elsevier permits you to do with your own work.
On the other hand, publishers have sometimes filed lawsuits against sites where authors share their papers, e.g. ResearchGate: https://www.nature.com/articles/d41586-018-06945-6
Elsevier have also sent takedown notices to universities where academics have made the final version available on their institutional websites:
Use https://v2.sherpa.ac.uk/romeo/ to check. Also, EU projects in most cases now require open publishing and publishers make exceptions even when OA is forbidden (“self archiving is allowed if mandated by the funding agency”).
Always seemed like a grey area to me. We didn’t really distribute the copy of the paper with the journal/conference’s name + copyright - though perhaps a line under the title: “To be published in…”
For the final published versions of non-open-access papers, they certainly do.
2. seed from your seedbox, not your personal device
Let's see how they will sue millions upon millions of people.
Make them face a fait accompli. They already lost.
Every dystopian regime does that.
These companies will pick examples and ruin their lives. It's hard to say how big the risk is here but this is reckless advice.
Vote for copyright reform.
There's really a need though for more developers to get involved with building tools for more easily searching and working with the collection, ideally with a nice UI and integration with things like crossref. This is a massively valuable data set and it would be great to see what people can come up with. Lots of awesome potential for data mining too.
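As an example of the kind of Crossref integration mentioned above, here's a minimal sketch that pulls bibliographic metadata for a DOI from Crossref's public REST API (the DOI below is a placeholder; error handling kept to a minimum):

```python
# Minimal sketch: resolve a DOI to bibliographic metadata via Crossref's
# public REST API, the sort of lookup a collection-browsing UI would need.
import requests

def crossref_metadata(doi: str) -> dict:
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    return resp.json()["message"]

if __name__ == "__main__":
    meta = crossref_metadata("10.1000/example-doi")   # placeholder DOI
    print(meta.get("title"), meta.get("container-title"), meta.get("issued"))
```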
If it weren't for the legal issues (publishers using copyright law to restrict access to literature they got for free since they never pay authors for their work), there's no shortage of projects that could utilize this data and be enormously beneficial for the scientific community and humanity in general. Unfortunately such work can only be done in the shadows right now, which greatly limits the number of people/institutions likely to do so.
7 × 14 TB HDDs ≈ 7 × $300 (Toshiba MG07ACA14TE)
You could write a custom compressor that decompiles journal PDFs to valid TeX, then compresses that.
Or at the simpler end of what's technologically possible, you could at least extract shared assets such as fonts that appear in multiple files. Keep files from the same journal together to find more overlaps.
I suspect there's quite a large gain to be had from further compression, at least theoretically. Even more if you could accept some level of non-semantic loss.
However, I guess they use .zip STORE because it's fairly robust against minor corruption.
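Not the decompile-to-TeX idea, but as a cheap illustration of exploiting redundancy across files from the same journal: train a shared zstd dictionary per journal and compress each PDF against it. This is a sketch assuming the `zstandard` Python bindings, with illustrative paths; since PDF streams are already internally compressed, real-world gains may be modest.

```python
# Sketch: exploit redundancy across PDFs from the same journal by training a
# shared compression dictionary instead of compressing each file in isolation.
# Requires `pip install zstandard`; the directory layout here is hypothetical.
import pathlib
import zstandard as zstd

journal_dir = pathlib.Path("papers/some-journal")
samples = [p.read_bytes() for p in journal_dir.glob("*.pdf")]  # needs a decent number of files

# Shared fonts, headers and other boilerplate end up in the dictionary,
# which only has to be stored once per journal.
dictionary = zstd.train_dictionary(112_640, samples)          # ~110 kB dictionary
compressor = zstd.ZstdCompressor(level=19, dict_data=dictionary)

original = sum(len(s) for s in samples)
compressed = sum(len(compressor.compress(s)) for s in samples)
print(f"{original} -> {compressed} bytes "
      f"({100 * compressed / original:.1f}% of original, excluding the dictionary itself)")
```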
This project is in its early stages and the documentation has quite some way to go, but the index that's part of the release contains all the necessary information. This tool also contains the code necessary to produce the index files if you have a local copy of the zips.
Each torrent contains 100,000 files, comprised of 100 zip files with 1,000 PDFs each. They are named by DOI. There's a database dump at http://libgen.rs/dbdumps/ (scimag.sql.gz) which has the id -> DOI mapping and other information. The specific torrent and zip file can be determined from the id by integer division: torrent = id / 100,000 and zip = id / 1,000.
Database documentation is available here: https://gitlab.com/lucidhack/knowl/-/wikis/References/Libgen...
Also see this introduction to Sci-Hub for developers: https://www.reddit.com/r/scihub/comments/nh5dbu/a_brief_intr...
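Based on the layout described above (100,000 articles per torrent, 100 zips of 1,000 PDFs, located by integer division on the scimag id), finding the right files takes only a few lines. The file-name patterns below are my guess for illustration and may not match the actual release:

```python
# Locate which torrent and which zip inside it hold a given libgen scimag id,
# per the layout described above (100,000 articles per torrent, 1,000 per zip).
# The name patterns are guesses for illustration; check the actual dumps.
def locate(article_id: int) -> dict:
    torrent_start = (article_id // 100_000) * 100_000
    zip_start = (article_id // 1_000) * 1_000
    return {
        "torrent": f"sm_{torrent_start:08d}-{torrent_start + 99_999:08d}.torrent",
        "zip": f"{zip_start}-{zip_start + 999}.zip",
    }

print(locate(85_123_456))
# {'torrent': 'sm_85100000-85199999.torrent', 'zip': '85123000-85123999.zip'}
```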
Recent publications are virtually always based on direct PDF renders, and tend to be a few hundred kB per article.
Older publications are often scanned from paper-based copies, and can be about 10-20x larger, depending on the source. These may or may not have OCRed text, and OCR itself may be of variable quality. For documents with images or diagrams, those also add to both size and difficulty in vectorising copies.
It's possible to go through larger scans and regenerate them as rendered PDFs. That's labour-intensive and error-prone. There's also a range of viewpoints among archivists as to whether it's preferable to retain the full expression of the original published version (and often the accumulated marginalia and other marks of a specific instance), or to optimise for both storage and automated processing through reprocessed renders. The costs are high (typically you'll need one or more humans to proof each work), though the storage and line-transmission savings are considerable.
I lean toward the latter myself. The attitude of other archivists (notably the Internet Archive) is to capture as faithful a replication of the originally-published format as possible, at considerable cost in both storage and accessibility. (This applies to the Archive's work in print, online/Web, and other document formats.)
Pressed, I'd strongly recommend a "capture what you can, reprocess according to need and demand as possible" approach.
Go make a backup, if you can afford it, and let's make sure that this one sticks around.
> We do not lose texts because of catastrophic events that wipe out all copies of them. We lose texts because they stop being copied.
They'll still try to maintain control but it will no longer be a crime to resist. They will lose.
However, when it comes to law I don’t think there really is a “what science says it should be”. We can use the scientific method and evidence-based reasoning to assess the likely outcome of any law or policy change, but figuring out which outcomes we as a society are willing to accept, given all the trade-offs, is not a scientific question.
Unfortunately I’m not sure just having open and transparent science will be enough when so many seem uninterested in having a good-faith conversation about the evidence and its implications.
But in the realm of governance, science is frequently used (or perhaps abused) as an unassailable authority to justify a wide variety of policy positions. I generally consider this to be a governance anti-pattern, but so long as science is being used to justify technocratic policy, it should be available for all of us to make our own judgements about.
First we'd need to solve the friction between how stable we want our laws and the progress of science over time.
We've seen in the past two years that going back and forth with recommendations makes the general public just give up.
Given how often that prize is given out as a bully pulpit to advance a cause, and given the global debt that science owes to her, I think she really does deserve one.
The risk that Malala takes in advocating for women's rights in Islamic countries is admirable, there is no denying that. However, her impact is minuscule compared to Alexandra's in the big picture of our progress as a human race. Malala's activism has not changed much about the course of women's rights in countries where religion governs life from the family up to governance.
>In December 2020, Elsevier, Wiley and the American Chemical Society filed a copyright infringement lawsuit against Sci-Hub and Library Genesis in the Delhi High Court...
>...The high court restricted the sites from uploading, publishing or making any article available until 6 January 2021.
But it's very strange to me that that could go by unnoticed.
- Globalisation and the rule of law. Mostly that's a good thing. But SciHub would be unlikely to survive if Russia had not been able to give US courts the middle finger regularly over the last decade. I am not convinced that the benefits of a totalitarian regime outweigh the downsides, but it is a thing
- copyright law is not patent law, science is not patents
So patents do seem to provide a way for inventors to protect a revenue stream. But the model of science is not one person or org does all the research and then exploits it for profit. So patents don't really seem to support science. And copyright has nothing to do with either.
Edit: one way of looking at it is that Science has socialism built in. Patents are a means of encouraging innovation by arranging that revenue flows back to the innovator, as long as the whole market obeys the patent law and licensing conditions.
But apart from "bad" actors, the amount of licensing is vast and probably impossible to track back (you would need point of sale, bill of materials, supply chain data etc)
Science has a simpler answer - publish the innovation openly and assume that the growth in wealth will feed back into general wealth growth. Which is kinda looking like "everyone shares".
So it suggests a singularity style step function - when / if something like UBI works, science will have a massive boost as the feedback loop is not mediated through university grants etc.
Two years, non-renewable, for any invention or work that had absolutely no public funding. Anything with direct public funding goes immediately into the public domain.
I can't wait for 2030, when China overtakes the US as the world's largest economy. We can all then be banned from the internet for having seen the doctored photo of nothing happening on May 35th.
China isn't a critical part of the global Internet today and they'll be even less a part of it in another decade. They operate their own separate network that only poorly connects to the Internet, by design. That separation will increase considerably over this decade.
Xi is currently putting new restraints into place to pull Chinese tech companies back even further from the Internet and into their own isolated network.
When China becomes the largest economy by GDP, it'll be meaningless to the operation of the Internet, which they'll only kinda-sorta be a part of.
Further, China is now widely regarded as the top adversary to the US and the West. That context will get increasingly confrontational and war-like in the coming years. Nearly all members of Congress are on board the anti-China bandwagon now, they've all gotten the message from above (the military industrial complex, which dictates nearly all foreign policy). The cultural atmosphere will increasingly become like it was when the USSR was the primary adversary for decades. As that confrontation increases, China's influence over the Internet will be intentionally reduced by the powers that actually do control the Internet today. China sees that coming as well and is taking steps ahead of time to reduce its exposure, points of influence and risk. At this point China views a military confrontation with the West as close to inevitable (which recent Xi speeches have elaborated on).
This increasing separation effort by China is in part designed to make it possible for China to attempt to destroy/damage the Internet - if it comes to that - without posing much terminal risk to their network and economy in the process. If they take down the Internet, it'll butcher the economies of their adversaries, while their own network remains highly functional. This is something the West is almost entirely unprepared for, and China is aggressively preparing for it; an epic mistake by the West.
I am happy to be corrected, but I am interested in the counterpoint to such doom and gloom.
1. The vast majority of patents are owned by large companies, not the individual inventors.
2. Patents are very often used purely as "spoilers", i.e. preventing other companies, and even more so individuals, from working in a close enough field to the holding company's, so as not to risk patent litigation.
Makes me realise I never donated. They have the pudic courtesy to never even prompt for support.
Thank you, I learned a new word today.
As far as I can see, the most robust fallback has to be some kind of distributed data store that can mirror humanity's vital information on the widest possible array of computer/storage systems, and which would literally take an apocalypse to wipe out. Depending on one brave person to fight what should be our common battles (and we do that everywhere, the heroes are always lonely at the top while their actions benefit all of us) is disappointing.
Data has to be duplicated massively or it is always extremely vulnerable. DNA figured this out billions of years ago.
There is also some ongoing work on moving to IPFS that could use help; see https://freeread.org/ipfs/ and https://github.com/sci-hub-p2p/sci-hub-p2p/, which seems to have IPFS support: https://sci-hub-p2p.readthedocs.io/_en/ipfs.html
In those 10 years, of all the students you see or saw on your campus coming from second- or third-world countries, most have used Sci-Hub to research and publish papers that enabled them to pursue higher studies.
Taking this right (not a privilege) away would mean a second burning of the Library of Alexandria. Anyone who really cared about education or knowledge in general would advocate for Sci-Hub and LibGen to survive.
I bet a "SciHub" devoted to political or religious articles will find backers despite being more controversial. People can be worked into passion for lots of things but not science and abstract philosophical ideas. That is largely why the FSF continues to barely cling to life.
I don’t know that I’m 100% convinced by this. Elsevier makes like 40% profit margins, and their primary contribution to researchers is prestige lock-in, basically.
Scientific publishing is in many ways stuck in the previous century. There are plenty of interesting opportunities to build technology to make that entire ecosystem more awesome, especially as funding agencies increasingly require open access publication.
Maybe there’s still not enough to build a stable company, but I took your reaction as a bit fatalistic
I feel the same way... It's always some new surveillance capitalism nightmare, endless advertising for garbage nobody really needs, addictive games with the win button hooked up to the player's credit card.
And whenever someone makes something truly world-changing like Sci-Hub all these people start coming after them because they're hurting their "interests". Who cares about their interests?
How would a startup reach unicorn valuation based on publishing scientific papers? By making people pay. So we’re back to square one
No, for some problems private enterprise is not necessarily the best solution
Open Source is important to software companies only because it lets them build new products cheaply. Most companies don't care, unless they can use it for cheap publicity, or to hurt their competition by making an open-source version of their proprietary product.
OSS definitely did shift the landscape - it almost killed desktop software! There's a reason why technology businesses love the SaaS business model - it's not just the recurring revenue, it's also that SaaS is immune to being killed by open source. You can create a free alternative to any software running on an end user's machine, but you can't do that to proprietary code running on servers the service provider owns.
Open access doesn't bring such immediate, direct benefits to research companies - so investors will be less keen to sponsor it. There's just no business model here.
Occasionally the editors -- inevitably based in Chennai -- spot a typo that the reviewers missed. Sometimes they cock things up massively, especially equations -- I had a big argument about a TikZ-based scheme once. The formatting is done automatically, and as a LaTeX author I really think we could do that very easily ourselves. Colleagues who use Word see the transfer of their text into something better as a major value-add. The publicity the journal adds is a strong function of how "good" it is. Science and Nature effectively exist because they are Science and Nature.
What if we could look at authors, peer reviewers and papers as a graph of weighted edges to come up with a score that was independent of journals as a concept? And where there is an ontology for the semantics of edges (not only the number of citations)?
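One cheap way to prototype that idea: treat papers, authors and reviewers as nodes in a weighted directed graph with typed edges, and run a PageRank-style centrality over it. The edge types and weights below are invented purely for illustration; the sketch assumes networkx is available:

```python
# Sketch of journal-independent scoring over a weighted citation/review graph.
# Edge types and weights are invented for illustration; a real ontology would
# distinguish e.g. "cites", "reviewed", "extends" with carefully tuned weights.
import networkx as nx

EDGE_WEIGHTS = {"cites": 1.0, "reviewed": 2.0, "extends": 1.5}

G = nx.DiGraph()
G.add_edge("paper_A", "paper_B", kind="cites", weight=EDGE_WEIGHTS["cites"])
G.add_edge("paper_C", "paper_B", kind="extends", weight=EDGE_WEIGHTS["extends"])
G.add_edge("reviewer_X", "paper_B", kind="reviewed", weight=EDGE_WEIGHTS["reviewed"])

# PageRank respecting edge weights gives a per-node score with no notion of journals.
scores = nx.pagerank(G, weight="weight")
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```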
Right, the perception of being associated with authoritative and knowledgeable publishers which follow certain formalities (in other words "the proof you as a scientist belong to the Club") has been an important part of academic career progression, even if in the last few years the reputation of peer-reviewed papers in general has taken a dive.
Frankly, they profit from: academics writing articles (funded by someone else); academics editing a journal deciding who should review articles (usually, but not always, for free); academics reviewing articles (always for free); and academics citing articles in subsequent ones.
The whole system is a house of cards, and a relatively immutable one.
However, I got the (personal) impression that this is only well established in fundamental research (which typically comes with little economic interest). As soon as the research is paid for not by the state but by private companies (such as in medicine, robotics or any other "applied science"), scientists have a hard time choosing an OA journal (i.e. either it does not exist or they are not allowed to publish there). Changing this scheme is of course quite difficult, since too many commercial parties still benefit from it (which likely can only be changed by law)...
It's acting as an impartial rating/filtering service for the non-scientist administrators.
In essence, the funding agencies want a way to evaluate scientists and institutions without asking their scientists and institutions to do so, probably because they don't trust them, and also because they have very strong incentives to avoid making any subjective judgment themselves but instead defer to some "objective" outside source - so they use paper counts published in "proper places" or the existence of papers published in "very good places" as the evaluation metric to circumvent the (genuinely very hard!) problem of evaluating the quality/quantity of the actual research done.
And so this incentive, attached to much of the money flowing within academia, trickles down to evaluation of people when hiring and promoting (the committees also often look at paper counts and publication venue rankings instead of trying to evaluate the actual papers - which is time-consuming, and if the papers are not in "your field", then very hard to make an informed judgement) and so to the individual motivations of almost all the people in the system, who have to take into account the "proper publishing rituals" or severely limit their career.
So, having worked a bit with administration and the evaluation of funding proposals, I kind of see why something like that is valuable in general. The main problem is that the costs are enormous and not really commensurate with the provided value - however, all the costs and barriers are suffered by "someone else" (i.e. the scientific community), while all this value is provided exactly to the funding decision makers, who have the power to prevent replacing the current system of Elsevier & co. but don't really suffer from its problems.

And when I say "funding decision makers" I don't mean people who have the money and personally care about whether it's spent efficiently; I mean all the administrators and bureaucrats running the process of allocating someone else's money or hiring scientists in e.g. some public institution, and whatever "sticks and carrots" these administrators have in their careers.

This means that to outcompete the current process, any solution would have to benefit them (not the scientists) or it won't be accepted and used, and it's hard to imagine what that would be, since there's a huge barrier to entry (e.g. it must work for evaluating/ranking all scientists/institutions, across all disciplines and across all the decades of previous work, or it's not useful) and huge inertia, as the criteria are also included in very many hard-to-change legal documents, e.g. contracts for long-term funding projects, bylaws and processes of many organizations and committees, actual laws regulating funding institutions, etc.; even if we had a clear winning replacement launched today, it would still take at least 5-10 years to switch to it.
IMHO the way to change is not a competing solution - it requires institutional change, with the major decision makers simply choosing to make decisions on different factors that do not include the metrics of Elsevier et al. Here's a talk by Stonebraker which gives a strong related argument - https://www.youtube.com/watch?v=DJFKl_5JTnA&t=1220s . However, this institutional change is not that likely because, as I said, for the people who can change this there's little incentive to change, and the people who would benefit from this change are not in a position to make it.
Given how important it is, and how at risk it is, I think it's very important to find a technological solution to keep it up. We have the technology to distribute the papers (torrents) as well as a search index. I really hope that either Alexandra starts using these technologies more, or the technologies mature enough to be usable.
Then Sci-Hub would be unkillable.
Here's a quick recap and reaction on the blog.
It was everything we read about as hackers in the '90s: the Phrack stories, for instance, about missions to steal (liberate) hardware PABXs from offices.
Swartz literally went into a library to break out info, fucking Gibson.
Alexandra Elbakyan and crew beat that. And bent history. If you think history is a line, Sci-Hub (which is a little different to LibGen) changed that.