How to circumvent Sci-Hub ISP block (fragile-credences.github.io)
484 points by tmkadamcz 14 days ago | 189 comments

By the way, Sci-Hub has stopped adding new articles to the database for a few months now (background: https://www.reddit.com/r/scihub/comments/mk46x4/scihub_v_els...).

It would be great to develop a truly decentralised solution. Having a database of individual torrent links for each paper might be a start.

Millions of individual torrents is not a great solution. Keeping them all seeded is basically impossible unless they run a seed for each one, at which point they might as well just host the files. Plus you'll never get the economy of scale that makes BitTorrent really shine.

When you have a whole lot of tiny files that people will generally only want one or two of there isn't much better than a plain old website.

A torrent that hosts all of the papers could be useful for people who want to make sure the data can't be lost by a single police raid.

What documents (books, scientific articles) specifically benefit from is a number of highly consistent, highly accurate identifiers: DOI (scientific articles), ISBN (published books), and others (OCLC identifier, Library of Congress Catalogue Number, etc.).

With the addition of hashsums (even MD5 and SHA1, though longer and more robust hashsums are preferred), a pretty reliable archive of content can be made. It's a curious case where increased legibility seems to be breaking rather than creating a gatekeeper monopoly.
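As a rough illustration of pairing identifiers with hashsums, here is a minimal sketch of a manifest that maps an identifier (a DOI, say) to a SHA-256 digest and checks a downloaded copy against it. The function names and the DOI `10.1000/example.1` are made up for the example, not taken from any existing archive's tooling.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each identifier (e.g. a DOI) to the digest of its file content."""
    return {identifier: sha256_hex(content) for identifier, content in files.items()}


def verify(manifest: dict[str, str], identifier: str, content: bytes) -> bool:
    """True only if the identifier is known and the content matches its digest."""
    expected = manifest.get(identifier)
    return expected is not None and sha256_hex(content) == expected


# Hypothetical identifier and content, for illustration only.
manifest = build_manifest({"10.1000/example.1": b"paper bytes"})
```

Anyone holding the manifest can then check a copy fetched from any mirror, torrent, or peer without trusting the source it came from.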

I've been interested in the notion of more reliable content-based identifiers or fingerprints themselves, though I've found little reliable reference on this. Ngram tuples of 4-5 words are often sufficient to identify a work, particularly if a selection of several are made. Agreeing on which tuples to use, how many, and how to account for potential noise / variations (special characters, whitespace variance, OCR inaccuracy) is also a stumbling point.
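To make the ngram idea concrete, here is a minimal sketch under some loud assumptions: normalisation just lowercases and keeps runs of letters and digits, and the "which tuples to use" question is answered by taking the k smallest hashes of all 5-word tuples (a bottom-k selection, which is my substitution and not an agreed standard). It copes with case, punctuation and whitespace differences, but does nothing clever about OCR errors beyond letting the overlap score degrade.

```python
import hashlib
import re


def normalize(text: str) -> list[str]:
    """Lowercase and split into runs of letters/digits, dropping punctuation and spacing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """All overlapping n-word tuples of the normalised text."""
    words = normalize(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def fingerprint(text: str, n: int = 5, k: int = 8) -> set[str]:
    """Pick the k smallest tuple hashes, so two parties select the same tuples independently."""
    hashes = sorted(hashlib.sha1(" ".join(s).encode()).hexdigest() for s in shingles(text, n))
    return set(hashes[:k])


def similarity(a: str, b: str, n: int = 5) -> float:
    """Jaccard overlap of the two texts' tuple sets (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Two renderings of the same text that differ only in case, punctuation or spacing get identical fingerprints; a partially corrupted copy gets a similarity somewhere between 0 and 1, and where to put the threshold is exactly the unsolved part.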

Why map anything to words for strict identification? Words and language are very error prone vs an id number or hash.

It's a bit of an itch I've been scratching for a few years.

Most especially, given two or more instances of what you suspect to be the same or a substantively similar work, how can you assess this in a robust and format-independent manner, programmatically?

For works with well-formed metadata, this isn't an issue.

For identical duplicate copies of the same file, a hash is effective.

But for the circumstance most often encountered in reality --- different forms and formats derived from different sources but containing substantially the same work --- there is no simple solution of which I'm aware. As examples, say you have a reference document The Reference Document.

How do I determine that:

- An ASCII-only textfile

- Markdown, HTML, DocBook, and LaTeX sources

- PDF, MS Word (which version), PS, DJVU, ePub, or .mobi files (sling any other formats you care to mention).

- Hardbound and paperback physical copies

- Scans made from the same or different physical books or instances, versions, and/or translations.

- Audiobooks based on a work. By the same or different readers.

- Dramatic performances, films, video series, comic-book adaptations, etc., of a work. (Say: Hamlet or Romeo and Juliet. What is the relationship to "West Side Story" (and which version), or Pyramus and Thisbe?)

- Re-typed or OCRed text

... all refer to the same work?

How do you define "work"?

How do you define "differences between works"?

How do you distinguish intentional, accidental, and incidental differences between instances? (Say: translations, errata, corrections, additions for the one, transcription errors for the second, and scanning or rendering artefacts for the third.)

If you're working in an environment in which instances of works come from different sources with different provenances, these questions arise. At least some of these questions are prominent in library science itself. It's the technical mapping of digitised formats I'm focusing on most closely, so the physical instantiations aren't as critical here, though the presumption is that these could be converted to some machine-readable form.

In bibliographic / library science, the term is "work, expression, manifestation"


The general problem here is not solvable with technology if there is no universally agreed definition for “a work” - and there isn’t (this touches on some profound issues of ontology).

And so I suspect the way forward is to maintain human-curated mappings of file hashes to “works”, where “a work” is a matter of the curator’s opinion, and different curations will be valued differently by different consumers of that information. For example, maybe a literary expert would have a great and respected method of organizing the works and derived works of Shakespeare, but that same person might not be sought out for their views on pop songs.

You could probably start with an ML-based curation that gets it 80% right, and fill out the last 20% with gamified crowdsourcing (with multiple competing interpretations of the last 20%).

Yes, it's complicated.

All analogies melt if they're pushed loudly enough. And all models are wrong, though some are useful.

The notion of a work has utility: it respects the notion of different forms, variations, and evolution with time. If you're looking at, say, multiple editions of a book, or even of something much more dynamic, say, source code or a Wiki entry, yes the specific content may change at any point, and stands through many versions, but those are connected through edit events. A good revision control system will capture much of that, if the history interests you.

Ultimately, I'd argue that "work" is defined in relationships and behaviours. A record intermediates between author(s) and reader(s) (or listeners, viewers, etc.), concerning some informational phenomenon, perhaps fictional, perhaps factual, perhaps itself an action (as in a marriage record, divorce decree, or court decision). The work in its total context matters. (At which point we discover most works have very little context...).

The file-hashes-to-work mapping is all but certain to play a large role, but even that is only a means of indicating a relationship that is established by some other means.

The notion of selecting an arbitrary set of ngram tuples to establish a highly probable relationship is likely to remain at least one of those means.

And yes, the incremental / tuned approach is also likely a useful notion.

Paul Otlet had a lot to say about "documents", though I think "records" is a better term for what he had in mind, as any persistent symbolic artefact: book, painting, music, photograph, film, etc.

I have been dealing with the same problem for curating resources at https://learnawesome.org. Projects like Openlibrary do collect unique identifiers for _books_, but for everything else, it mostly takes manual effort. For example, I collect talks/podcasts by the author where they discuss ideas from their books. Then there are summaries written by others.

There's a lot of work toward this in library space, though it takes some adaptation to new media formats. Paul Otlet worked in a paper-only medium in the early 20th century but also has some excellent thinking. His books are now seeing translation from French. The Internet Archive and Library of Congress are also doing a lot of relevant work, see the WARC format as an example.

What's particularly relevant now are ephemeral and/or continuously updated online content --- and not just the WWW (http/https), but other protocols (ftp, gemini, ipfs, torrents, ...), as well as apps.

A working truism I developed was that "identity is search that produces a single result". So if you can come up with something that uniquely identifies a work, then that can be a working identifier. I typically focus on what can be reasonably assessed of author, title, publication date, publisher (traditional, website/domain), and failing that, descriptive text. Remember that originally titles were simply the introductory lines of works (a practice that remains used in some cases, e.g., the names of church masses or prayers, e.g., "Kyrie Eleison").

The Superintendent of Documents (SuDoc) Classification Scheme (used by the US government and GAO) operates by agency, type of publication, and further divisions, as well as date/year. https://www.fdlp.gov/about-fdlp/22-services/929-sudoc-classi...

Probably because for written text the words identify the content while the hash relates more to the digital carrier format (pdf vs epub) and id number can change between publications, countries, etc.


And to drag in metadata, it may:

- Not be present.

- Be inaccurately applied to the correct work (metadata say the work is different, work is in fact related/same).

- Be inaccurately applied to the wrong work (metadata say the works are the same/related, they are not).

text to speech the doc then an acoustic fingerprint on the audio :)

You'd all but certainly be better going in the other direction.

Text is a more constrained state space than speech/audio.

There was that project some guy posted a while back that used a combination of sqlite and partial downloads to enable searches on a database before it was downloaded all the way. If you can fit PDFs somewhere into that you'd be golden.

Or just use IPFS I suppose.

IPFS would face a similar challenge as the “keep torrents seeded” problem mentioned by GP. Wouldn’t there be risk to peers who host the PDFs?

I sort of feel like there should be some way to use some kind of construct to get people to seed things so that others seed things for them, but I haven't seen that invented yet.

Been a while since I've looked at them, but IPFS with FileCoin and Ethereum Swarm had that kind of goal.

It might be beneficial to create something like what you describe without any cryptocurrency association though, and I've been mulling over possibilities for distributed systems that are inherently currency-less to avoid all of the scams that cryptocurrency attracts.

The leader in that space is Skynet, which basically is like IPFS + Filecoin but also has dynamic elements to it, and a lot better performance + reliability.

Cryptocurrency is helpful because it allows you to incentivize people to hold the data. If you don't have cryptocurrency, you're basically dependent on altruism to keep data alive (like bittorrent, or ipfs without filecoin). Popular files do okay for a while, but even popular files struggle to provide good uptime and reliability after a few years.

On an incentivized network like Sia or Filecoin, you can have high performance and reliability without ever needing any users to keep their machines online and seeding.

Does it scale well? SciHub is at least 100TB.

Skynet already hosts nearly 1,000 TB of data. At 100 TB, SciHub would be material but also comfortable.

I think seed ratios and seed time (mostly used by private trackers) attempt to solve this problem.

What kind of risk?

IPFS is not anonymous and like other p2p protocols shares your ip address. People seeding articles would get legal notices just like torrents now.

There's been a bit of effort to get it working over tor for years now but the fundamental design makes this difficult. Also despite all the money that has poured into filecoin this doesn't seem to be a priority.

This issue is nearly 6 years old:


I was thinking legal risk. In this case the publishers are going after Sci-Hub, but in the past they have gone after individuals.

"There was that project some guy posted a while back that used a combination of sqlite and partial downloads to enable searches on a database before it was downloaded all the way."


This is the original. Then came https://github.com/lmatteis/net-torrent and later one written in Javascript, inspired by net-torrent.

Isn't that essentially mapreduce? Either way, interesting and I'd love to see the link.

I believe this is the project mentioned


That one looks familiar. Though apparently the same thing has been tried in several different ways going by the replies I got.

This looks like it could be a good approach.

a plain old website or a publishing house with distribution services and syndication attached, but for a sane price.

"a whole lot of tiny files" severely underestimates the scale at work. Libgen's coverage is relatively shallow, and pdf books tend to be huge, at least for older material. Scihub piggy backs on the publishers, so that's your reference.

syndication, syndicate, quite apt don't you think? Libraries that colluded with the publishers and accepted the pricing must have been a huge part of the problem, at least historically. Now you know there's only one way out of a mafia.

In Internet scale it's not a lot of data. Most people who think they have big data don't.

Estimates I've seen put the total Scihub cache at 85 million articles totaling 77TB. That's a single 2U server with room to spare. The hardest part is indexing and search, but it's a pretty small search space by Internet standards.

The entire archive actually fits in a small desktop NAS (e.g. QNAP or Synology) with a few 14-18TB drives, you don't even need a server rack.

There is an existing index in SQL format distributed by libgen: https://www.reddit.com/r/scihub/comments/nh5dbu/a_brief_intr..., it is around 30GB uncompressed.

Those 851 torrents uncompressed would probably take half a petabyte of storage, but I guess for serving pdfs you could extract individual files on demand from zip archive and (optionally) cache them. So the scihub "mirror" could run on a workstation or even laptop with 32-64GB memory connected to 100TB NAS over 1GBE, serving pdfs over VPN and using unlimited traffic plan. The whole setup including workstation, NAS and drives would cost $5-7K.
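A minimal sketch of the "extract on demand and cache" part, using only Python's standard `zipfile` and `functools.lru_cache`. The archive here is a tiny in-memory stand-in, and the member names are hypothetical; I'm not claiming the real torrents are laid out this way.

```python
import io
import zipfile
from functools import lru_cache


def make_pdf_reader(archive, cache_size: int = 1024):
    """Return a function that reads one archive member on demand.

    `archive` may be a filesystem path or a seekable file-like object.
    Only the requested member is decompressed; recent results are cached.
    """
    @lru_cache(maxsize=cache_size)
    def read_member(name: str) -> bytes:
        with zipfile.ZipFile(archive) as zf:
            return zf.read(name)  # raises KeyError if the member is missing

    return read_member


# Stand-in archive built in memory; member names are made up for the demo.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("10.1000/example.1.pdf", b"%PDF-1.4 first")
    zf.writestr("10.1000/example.2.pdf", b"%PDF-1.4 second")

read_pdf = make_pdf_reader(buf)
first = read_pdf("10.1000/example.1.pdf")
```

Because zip keeps a central directory, pulling one member doesn't require unpacking the rest, which is why the archives could stay compressed on the NAS.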

It's not a very difficult project and can be done DIY style, if you exclude the proxy part (which downloads papers using donated credentials). Of course it would still be as risky as running Scihub itself, which has a $15M lawsuit pending against it.

The entire Library of Congress books collection is on the order of 40 million items.

At 5 MB per book, this works out to about 200 TB of disk storage.

At about $12/TB, hosting the entire LoC collection would cost roughly $2,400 presently, with prices halving about every three years.

Note that $2,400 is disks alone. You'd obviously need chassis, power supplies, and racks. Though that's only 17 12 TB drives.
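For anyone who wants to check the arithmetic or plug in their own numbers, here it is spelled out. The inputs are the figures above; decimal units (1 TB = 1,000,000 MB) and a smooth three-year halving curve are my assumptions.

```python
import math

ITEMS = 40_000_000      # items in the collection
MB_PER_ITEM = 5         # assumed average size per book
USD_PER_TB = 12         # assumed disk price
DRIVE_TB = 12           # capacity of one drive

total_tb = ITEMS * MB_PER_ITEM / 1_000_000   # decimal units: 1 TB = 1,000,000 MB
disk_cost = total_tb * USD_PER_TB            # disks only, no redundancy
drives = math.ceil(total_tb / DRIVE_TB)      # whole drives needed


def projected_cost(years: float, halving_years: float = 3.0) -> float:
    """Disk cost after `years`, assuming prices halve every `halving_years`."""
    return disk_cost * 0.5 ** (years / halving_years)


print(f"{total_tb:.0f} TB, ${disk_cost:,.0f} in disks, {drives} drives")
```

Multiply `disk_cost` by three for the triple redundancy mentioned next and it is still a small number.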

Factor in redundancy (I'd like to see a triple-redundant storage on any given site, though since sites are redundant across each other, this might be forgoable). Access time and high-demand are likely the big factor, though caching helps tremendously.

My point is that the budget is small and rapidly getting smaller. For one of the largest collections of written human knowledge.

There are some other considerations:

- If original typography and marginalia are significant, full-page scans are necessary. There's some presumption of that built into my 5 MB/book figure. I've yet to find a scanned book of > 200MB (the largest I've seen is a scan of Charles Lyell's geology text, from Archive.org, at north of 100 MB), and there are graphics-heavy documents which can run larger.

- Access bandwidth may be a concern.

- There's a larger set of books ever published, with Google's estimate circa 2014 being about 140 million books.

- There are ~300k "conventionally published" books in English annually, and about 1-2 million "nontraditional" (largely self-published), via Bowker, the US issuer of ISBNs.

- LoC have data on other media types, and their own complete collection is in the realm of 140 million catalogued items (coinciding with Google's alternate estimate of total books, but unrelated). That includes unpublished manuscripts, maps, audio recordings, video, and other materials. The LoC website has an overview of holdings.

Published document scarcity is entirely imposed.

It still amazes me that 77TB is considered "small". Isn't that still in the $500-$1,000 range of non-redundant storage? Or if hosted on AWS, isn't that almost $1,900 a month if no one accesses it?

I know it's not Big Data(tm) big data, but it is a lot of data for something that can generate no revenue.

> Isn't that still in the $500-$1,000 range of non-redundant storage?

Sure. Let's add redundancy and bump by an order of magnitude to give some headroom -- $5-10k is a totally reasonable amount to fundraise for this sort of application. If it were legal, I'm sure any number of universities would happily shoulder that cost. It's minuscule compared to what they're paying Elsevier each year.

Sorry. My point was it was a lot of money precisely because it cannot legally exist. If it could collect donations via a commercial payment processor, it could raise that much money from end users easily. Or grants from institutions. But in this case it seems like it has to be self-funded.

I'm prepared to accept "does generate no revenue" but "can generate no revenue" ...?

Perhaps some sort of MTurk or captcha-like tasks per access? Patr[e]ons? Donation drives? Micro-payments? Something else??

Oh, it could generate revenue if it was legal. But it is not, so it seems difficult.

For an institution, it's a rounding error.

AWS is not the cheapest bulk-storage hosting possible.

Google already does a pretty good job with search. Sci-Hub really just needs to handle content delivery, instead of kicking you to a scientific publisher's paywall.

If the sane price is an optional "Donate to keep this site going" link, then ok. But only free access, without authentication or payment, to scientific papers, is sane. IMHO.

Might this be a case where the best resolution would be to have the government (which is at least partially funding nearly all of these papers) step in and add a ledger of papers as a proof of investment?

The cost of maintaining a free and open DB of scientific advances and publications would be so incredibly insignificant compared to both the value and the continued investment in those advancements.

> Might this be a case where the best resolution would be to have the government (which is at least partially funding nearly all of these papers) step in and add a ledger of papers as a proof of investment?

I feel that we're halfway there already and are gaining ground. Does Pubmed Central [0] (a government-hosted open access repository for NIH-funded work) count as a "ledger" like you're referring to? The NSF's site does a good job of explaining current US open access policy [1]. There are occasional attempts to expand the open access mandate by legislation, such as FASTR [2]. A hypothetical expansion of the open access mandate to apply to all works from /institutions/ that receive indirect costs, not just individual projects that receive direct costs, would open things up even more.

[0] https://www.ncbi.nlm.nih.gov/pmc/

[1] https://www.nsf.gov/pubs/2016/nsf16009/nsf16009.jsp#q1

[2] https://sparcopen.org/our-work/fastr/

Well, some research venues (and publication venues) are not government-funded, and even if they are indirectly government funded, it's more of a sophistry than something which would make publishers hand over copies of the papers.

Also, a per-government ledger would not be super-practicable. But if, say, the US, the EU and China would agree on something like this, and implement it, and have a common ledger, then it would not be such a big leap to make it properly international. Maybe even UN-based.

That's a pretty big "if" though.

I share the sentiment insofar as free access would benefit my own sanity, except when it is about hoarding.

On the other hand, there is a slippery slope to decide what isn't scientific so much as to not be required open knowledge.

By the way, specialist knowledge and open knowledge is kind of a dichotomy. You would need to define the intersection of both. Suddenly you are looking at a patent system. Pay to Quote, citation fees, news websites already are demanding this from Google, here in Germany, including Springer Press

Libgen's coverage is definitely more shallow than scihub, but it is still pretty good.

There are already torrents of the archives. But supposing scihub was taken down it's pretty non trivial to get from the archive back to a working site with search functionality. For one thing, none of Sci-Hub's code is available.

Seems like what should be in each torrent is a virtual appliance preloaded with one shard of the data, where that virtual appliance has a pre-baked index for searching just that shard's data. Then one more torrent for a small search-coordinator appliance that fans your search query out to all N shard appliances.
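A toy sketch of that fan-out, with each "appliance" reduced to an in-process object holding a pre-built inverted index over titles. `Shard` and `fan_out_search` are names invented for this example, and a real setup would talk to the shards over the network instead of calling them directly.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One shard: its records plus a pre-built word index over their titles."""

    def __init__(self, records: dict[str, str]):
        self.records = records                    # identifier -> title
        self.index: dict[str, set[str]] = {}      # word -> identifiers
        for doc_id, title in records.items():
            for word in title.lower().split():
                self.index.setdefault(word, set()).add(doc_id)

    def search(self, query: str) -> list[tuple[str, str]]:
        """Return (identifier, title) pairs whose title contains every query word."""
        words = query.lower().split()
        if not words:
            return []
        hits = set.intersection(*(self.index.get(w, set()) for w in words))
        return sorted((doc_id, self.records[doc_id]) for doc_id in hits)


def fan_out_search(shards: list[Shard], query: str) -> list[tuple[str, str]]:
    """Coordinator: send the query to every shard in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: shard.search(query), shards)
    return sorted(hit for part in partials for hit in part)
```

The coordinator stays small because it holds no data of its own; anyone seeding only a few shards still gets correct (if partial) results for what they host.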

BitTorrent does allow you to download a single file from a torrent though. You could have torrent for each month, and a client which allows you to search inside these torrents and download only the files you need.

Maybe Usenet? It already supports massive copyright infringement yet it is still around.

Maybe we can create a freesite, on Freenet.

only if it was possible to use chia to store content.. it would be a game changer

IPFS seems like a perfect fit for this and some of the scihub torrents are already in IPFS, but it's not an anonymous network.

IPFS via the DHT tells the network of your whole network topology, including internal addresses you may have, and VPN endpoints too. It's all public by design as they don't want to associate IPFS with piracy per one of their developers.

this thread has some discussions on the alternatives https://www.reddit.com/r/DataHoarder/comments/nc27fv/rescue_...

Can files be taken down off ipfs? There was a fairly widely circulated link that had all the IEC and ANSI standards on there that has since been taken down.

You could use libp2p's DHT over Tor (I did a poc of this long ago, and the situation's only improved). Combined with other libp2p/IPFS components, you can essentially have a private IPFS over onion services (not to be confused with accessing the existing IPFS network via Tor exit nodes).

If two people independently add the same file to IPFS, will the IPFS hash be the same?
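As far as I understand it, the general answer with content addressing is "yes, if the bytes and the import settings match". The toy below is not IPFS's actual CID algorithm, just a chunk-then-hash sketch to show why identical bytes give identical addresses while a different chunk size changes the result.

```python
import hashlib


def toy_address(data: bytes, chunk_size: int = 262144) -> str:
    """Hash each fixed-size chunk, then hash the concatenated chunk digests.

    A simplified stand-in for content addressing; real IPFS identifiers are
    built differently and also encode the hash function and format used.
    """
    chunk_digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(chunk_digests)).hexdigest()


paper = b"x" * 1000
same_settings_match = toy_address(paper) == toy_address(bytes(paper))
different_chunking_differs = toy_address(paper, 256) != toy_address(paper, 512)
```

So two people adding the same PDF with the same defaults should end up at the same address, but it is worth treating the settings as part of the identifier.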


So what? bittorrent is all public too. It hasn't prevented the piracy scene from making releases.

what ipfs here provides over torrent is the ability to add more files instead of creating new torrents

Isn't scihub already on IPFS?

some torrent files are archived there, but i don't think scihub is serving the pdfs from IPFS, they likely use a private storage network.

I believe libgen.fun which is a new (official) libgen mirror is running fully on ipfs, and it serves some scientific papers, but I wasn't able to search by DOI or title there, looks like it redirects to scihub, also there is no index of the papers on IPFS.

Edit: this doc talks about scihub+ipfs (it was created by the leader of Scihub rescue effort on Reddit, /u/shrine): https://freeread.org/ipfs/

You can build something like this on Skynet. Links on Skynet (called skylinks) can point to hashes of data (similar to IPFS) or can point to a pubkey. If the data points to a pubkey, it can be tweaked repeatedly and people with the old links will see the new data.

You can also point to full webapps. So for example you could have a pubkey point to a webapp that then loads the database from a list of moderators who update the data under their pubkeys to point to the latest version of scihub. Then you can even have different moderators curate articles of different subjects, and the webapp can combine everything together.

All the data is stored on Sia, which is a decentralized cloud storage platform. Skynet allows anyone to upload or download data from Sia directly from the web, no need for extensions or custom software. The Sia network handles what IPFS calls "seeding", so contributors don't need to worry about leaving their machines on.

Torrents are great and all but it's dependent on people seeding them, sci-hub/libgen is great because you don't have to worry about a download suddenly breaking because no one is seeding

But they could just always be a seeder. Doesn’t that have the upsides of the existing solution plus resiliency?

Isn't scihub/libgen already backed by torrents? I'm confused.

Nope, it was mostly direct download for the end users with some ipfs in some places.

From what I understand, the authors are still free to send you their papers. So perhaps the simplest decentralized system is for each author to run an automatic email request system and have feeds / aggregators of paper titles/abstracts with author emails to make it easy to get the papers you want?

That would only work for a tiny tiny intersection of : 1) technical enough people to install and maintain that 2) with an email address still working (if there is one on the paper!) 3) still alive

that's probably a really small set of scientists.

Thanks for the background link. I did not know about that and it's a good incentive to donate them some money for the legal battle.

TL;DR of the link: No more uploads to support a court case in India which SciHub might win and thus establish a legal basis for operation in the biggest democracy.

when donating bitcoin make sure to get the address from the official scihub mirrors, which are currently sci-hub.do, sci-hub.st or sci-hub.se.

there are some unaffiliated "mirrors" that only redirect to scihub but list their own bitcoin address for donations, so beware.

/r/scihub on reddit keeps track of this https://www.reddit.com/r/scihub/wiki/index

Sci-hub is such a great example of a clear and compelling use-case for Bitcoin. Bitcoin is censorship-resistant money that doesn't rely on countries, laws, central bankers or politicians. The US dollar cannot be used for purposes not aligned with the US government. Sometimes ideas that the US Government doesn't agree with can be useful (e.g. Wikileaks, Sci-hub.)

When I hear complaints that Bitcoin has no use except for speculation, I think of Sci-hub, Wikileaks and other organizations that may be bad for the interests of the US government but may be good for mankind.

It's a censorship resistant technology that also indelibly records, publicly, every transaction you ever participated in. Talk about a double-edged sword...

Agree, it's an interesting trade-off: Completely private if you can use an anonymous address but completely traceable if the address is identified. I wonder if Satoshi intended it that way or if he would have preferred the greater anonymity of Monero.

Wouldn't IPFS(https://ipfs.io/) work well?..

This is where IPFS shines.

In fact, there is already an IPFS mirror of scihub.

This might be a good use case for Arweave.

I'm growing skeptical of the idea that every science text should be free. As a community that values privacy, it's hypocritical that this widespread leaking is so well accepted.

The fact that scihub needs to skip domain names every couple of months and that ISPs start blocking the website, looks to confirm my theory, rather than imply some worldwide conspiracy.

How is that a leak? Only the publishing companies are making money. Authors, reviewers and often editors all work for free.

I don't think they work for free, they have a salary, usually paid by universities I think.

What does this have to do with privacy?

Publishers want to keep the papers relatively private.

Except from literally anyone that pays them a monthly fee?

That's not privacy.

The way SciHub is being treated by governments is pretty infuriating. There's a tiny minority of people who have an interest in keeping SciHub off the internet, and they're generally neither the researchers who write the papers, nor those who want to read them. Despite this, the power of the state has been used repeatedly to keep SciHub inaccessible and limit their ability to get funding.

SciHub is just one instance of the broader problem, which is that governments - even ostensibly democratic ones - don't actually operate in the best interests of the governed.

Which, I think, shouldn't be surprising when our "representatives" ostensibly speak for hundreds of thousands (and sometimes, millions) of people each. True democracy requires a much shorter and more direct chain of responsibility.

One thing that I think is being missed in this discussion is that governments have a responsibility to enforce the laws that are currently on the books. Perhaps more importantly, they also need to be seen to be enforcing them.

SciHub was allowed to operate freely for several years when it was relatively unknown, but to turn a blind eye to it now that it's received mainstream attention would threaten the credibility of the legal system.

I wholeheartedly support SciHub, but I can't blame governments for enforcing their laws. Ultimately, the changes need to happen at the legislative level to update copyright laws, and this is somewhere where lobbying local representatives could have a meaningful impact.

Well, yes, there's people who actually enforce those laws, and there are people who write them. I was specifically referring to the latter. But both are a part of government as a system, and so ultimately the system as a whole carries the blame.

>SciHub is just one instance of the broader problem, which is that governments - even ostensibly democratic ones - don't actually operate in the best interests of the governed.

Agreed. This is true for virtually any human endeavour. The average person out there can be easily manipulated by those who generally seek power. So I say that the root problem is people's ability to discern appropriate people to hold power. To make matters worse, those who are most suited to govern often are not interested in taking on positions of power.

Yes, due to allocation of property rights.

Cease to supply this system with the fruits of your research labors.

> Cease to supply this system with the fruits of your research labors.

Historically academics have felt forced to support this system, because for-profit journals are the high-prestige ones they must publish in in order to get tenure. This has changed for certain fields, but it isn’t as simple as just suggesting that one publish elsewhere.

It’s up to not only academics who publish articles, but also organizations that issue grants and tenure. Public policies to adjust their definitions of “prestige” or “quality” would help.

And these are mostly run by people that have not even heard of scihub, openaccess etc

Case in point: The ARC, the biggest Australian Research funding body, recently explicitly banned the mention of preprints in grants. You can't include your own arxiv/biorxiv/etc. preprints in your grants to show your work! To the ARC, that's unpublished work. I know a few mathematicians who exclusively publish on arxiv who were bitten by this change, the whole grant got rejected.

That is... profoundly idiotic and short-sighted.

Sure, but it's a divide-and-conquer problem. Academics have to mutually support each other rather than trusting everything to a competitive marketplace (which most of them seem to hate anyway).

What are some of the fields where this is changing?

For some fields the for-profit problem never happened at all. For example, in some branches of linguistics, history and archaeology the main journals have always been published by the same non-profit learned societies for decades (since the 19th century, sometimes). Prices for the hardcopy were always reasonable, and with the digital era, these journals became open access.

In other branches of those disciplines, I have seen that recently some big-name editors have founded new open-access journals with the express aim of gradually taking prestige away from for-profit journals. See here [0] (PDF).

[0] https://scholarsarchive.library.albany.edu/cgi/viewcontent.c...

One can publish online a pre-print for free, which is very close to the final version of the paper.

Which fields has this changed for?

Stop supporting a system you won't inherit.

In my personal opinion the international community could do something like modify the Berne Convention/TRIPS (international copyright agreements signed by almost every country/WTO members respectively) to exclude copyright of academic papers.

The property rights in question are not natural rights, nor material rights. Sufficient political will seems like it could do it.

Finding politicians in power who will support human progress before profits might be hard! [/understatement]

Give me tenure and I'm on board.

And often the papers they host were paid for with public money.

A few years back, frustrated with increased DNS blocking of Sci-Hub, I wrote a quick DNSMasq hack (haq?) to return Sci-Hub IPs for any "sci-hub.<domain>" possible. The shins-n-grits factor of surfing "scihub.elsevier.com" was palpable.


As others have mentioned, Sci-Hub also maintains a Tor presence, and you can access the Onion link using the Tor browser (provided you can install that on your desktop or device).


scihub22266oqcxt.onion is a v2 onion address. v2 addresses are deprecated for security reasons and no longer supported in Tor. For more information see https://blog.torproject.org/v2-deprecation-timeline

Unfortunately the .onion is currently timing out.

Love the haq!

I’ve never had success with their .onion link. I suspect something may be wrong there.

Aside from Facebook and my own onion services, that has been my experience with all onion domains somehow. Since my own onion services work (from unrelated networks and no special config), I'm pretty sure it's on the hosters' end. Quite frustrating; I'd love to use this more but it doesn't seem like people can get this stuff together.

Just set up a VPN on some cheap cloud provider.

There are lots of sites UK ISPs block even though the sites themselves are not illegal and do not host illegal content. E.g. torrent indexing services (the content itself may be illegal, but purely providing a search across that content is basically doing what Google do).

The UK internet is heavily filtered/censored and so doing this is useful anyway.

Business ISP connections don't seem to be restricted. And neither do most cloud providers I've tried.

Might be better than using temporary proxies.

I travelled last week, and was horrified by how much is blocked by the mainstream ISPs in the UK.

Afaik, my (London) ISP does not block anything. No idea why, as all the others quote high court orders.

Many UK ISPs have "adult content" filters, which tend to be wide-reaching and block a lot more than just porn sites. But these are optional and can be turned off very easily.

There's a smaller set of non-optional blocks (pirate/torrent sites) which you need a VPN to get around.

This seems to be just another small step towards a future where only pre-approved websites are accessible via the method most people will use. It will not be called banning, this is just a measure to "ensure that the content we are serving to our users is held against our high quality standards" or the classic "it's to protect the children" or to "condemn terrorism".

Porn is not really illegal, just unwanted, which is apparently reason enough to block it. Does this mean any content which is "unwanted" can be blocked just like that?

I have a theory that the ISPs over-block with their adult content filters, so you've got plausible deniability and don't have to ring up and say "I want porn please." Because the alternative is that they lose a customer to an ISP who doesn't embarrass them.

My ISP does not block any of the non-optional blocks either.

While I do use a VPN at times, it is never to get around blocks.

Although I don't use my ISP's DNS, so if that is how they're attempting to comply I wouldn't notice.

>There's a smaller set of non-optional blocks (pirate/torrent sites) which you need a VPN to get around.

This https://unblockit.pw/ also does the job (scihub is at the bottom).

Open Rights Group run https://www.blocked.org.uk/ to make blocking more visible and prompt action to contact ISPs about cases of overblocking.

You can enter any domain to test it; the data is collected by volunteers running an automatic tool.

My UK ISP (one of the major UK mobile networks) does not block sci-hub. They do block torrent sites such as Pirate Bay, etc.

Are there good recommendations for privacy centric cloud providers?

Also worth adding: even if a site isn’t banned by the UK government, many sites now geo-restrict UK residents or show you several opt-in pop-ups because of GDPR.

If you can’t get to sci-hub and you need a (free) copy of a paper, there are several other ways to get it: https://lee-phillips.org/articleAccess/

GreenTunnel is another alternative to evade ISP blocking without using a VPN:


Splitting the request packet into two TCP fragments right down the middle of the offending pattern in the Host or SNI field, that's adorable
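For anyone curious what that split looks like, here is a minimal Python sketch. It is not GreenTunnel's actual code: the function names and the `blocked.example` hostname are made up for illustration, and whether the two writes really leave the machine as separate TCP segments depends on the OS and network stack.

```python
import socket


def split_inside_host(request: bytes, host: bytes) -> tuple[bytes, bytes]:
    """Split a request into two chunks, cutting through the middle of `host`."""
    start = request.find(host)
    if start == -1:
        raise ValueError("host not found in request")
    cut = start + len(host) // 2
    return request[:cut], request[cut:]


def send_split(sock: socket.socket, request: bytes, host: bytes) -> None:
    """Send the chunks with separate writes; TCP_NODELAY discourages coalescing."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    for chunk in split_inside_host(request, host):
        sock.sendall(chunk)


# Hypothetical request; neither chunk contains the full hostname on its own.
request = b"GET / HTTP/1.1\r\nHost: blocked.example\r\n\r\n"
first, second = split_inside_host(request, b"blocked.example")
```

A filter that pattern-matches each packet in isolation never sees the whole hostname, while the server reassembles the stream and sees a normal request.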

This is a useful note on using PACs to set up proxies for just one site:

> Incidentally, you do not need to be running a web server to use the .pac file. You can access it via a file:// type URL. For example (note the 3 slashes): file:///Users/username/Library/proxy.pac
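A rough sketch of generating such a single-site .pac file from Python and getting the file:// URL back. The domain `sci-hub.example` and proxy `proxy.example:8080` are placeholders, and `render_pac`/`write_pac` are names invented here; only `FindProxyForURL` and `dnsDomainIs` are standard PAC names.

```python
from pathlib import Path

# The PAC body itself is JavaScript; doubled braces survive str.format().
PAC_TEMPLATE = """\
function FindProxyForURL(url, host) {{
  // Send only the listed site through the proxy; everything else goes direct.
  if (host == "{domain}" || dnsDomainIs(host, ".{domain}")) {{
    return "PROXY {proxy}";
  }}
  return "DIRECT";
}}
"""


def render_pac(domain: str, proxy: str) -> str:
    """Return PAC source that proxies `domain` (and its subdomains) only."""
    return PAC_TEMPLATE.format(domain=domain, proxy=proxy)


def write_pac(path: Path, domain: str, proxy: str) -> str:
    """Write the PAC file and return its file:// URL (three slashes on POSIX)."""
    path.write_text(render_pac(domain, proxy))
    return path.resolve().as_uri()


pac = render_pac("sci-hub.example", "proxy.example:8080")
```

Calling `write_pac(Path("proxy.pac"), ...)` returns the URL to paste into the browser's or OS's automatic proxy configuration field.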


Note that if you are using Chromium, it will refuse to load a PAC file from the file:// scheme. Here's the bug tracker issue for the change: https://bugs.chromium.org/p/chromium/issues/detail?id=839566.

And you should... The proxy.pac file is in some cases reloaded for every single http request.

Another option is to simply configure your workstation to use DoH. Then your ISP can't fuck with your address resolution.

I recommend using NextDNS, and then setting up a provisioning profile at https://apple.nextdns.io to set it as your resolver on your Macs and iOS devices. The ad-blocking features are a nice bonus, too.

NextDNS also has a cool free software CLI local DoH proxy resolver which works a charm.

From the page's footnotes:

> Changing your DNS resolver to a public one like Google’s instead of your ISP’s is not sufficient as of 2021, for two ISPs I’ve tested, and I suspect for all UK ISPs that implement blocking.

I read the article in full.

Changing your non-DoH resolver (such as using Google Public DNS) means requests and responses can still be edited by your ISP. This is what the article is talking about.

I suggested DoH (encrypted DNS) because this is not subject to such tampering. DoH (DNS-over-HTTPS) is not the same as traditional unencrypted port 53 DNS.
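To make the difference concrete, here is a small sketch of how a DoH lookup is packaged, assuming the RFC 8484 GET form: the ordinary DNS wire-format query is base64url-encoded and sent as a parameter of an HTTPS request, so it travels inside TLS instead of as cleartext on port 53. The server URL `https://doh.example/dns-query` is a placeholder and the function names are made up; nothing here touches the network.

```python
import base64
import struct


def build_dns_query(name: str, qtype: int = 1) -> bytes:
    """Build a minimal DNS query in wire format (one question, recursion desired)."""
    # Header: ID=0, flags=0x0100 (RD set), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label
        for label in name.rstrip(".").encode("ascii").split(b".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE (1 = A), QCLASS 1 = IN
    return header + question


def doh_get_url(server: str, name: str) -> str:
    """Return the DoH GET URL: base64url of the query, padding stripped."""
    query = build_dns_query(name)
    encoded = base64.urlsafe_b64encode(query).rstrip(b"=").decode("ascii")
    return f"{server}?dns={encoded}"


url = doh_get_url("https://doh.example/dns-query", "www.example.com")
```

A client would then fetch that URL over HTTPS with an `Accept: application/dns-message` header and parse the binary DNS response from the body.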

Really, anyone who gives a shit about privacy should be using DoH exclusively, otherwise you are basically uploading your web history in real-time to your ISP for mining and resale.

I have been testing a large number of DoH servers. I have noticed that some names are not available across all (supposedly unfiltered) DoH servers. For example, there are some DoH servers that had no A record for webshare.io, the domain mentioned in the OP.

There's a telegram bot that sends you the papers you ask for. It's by far the most convenient way to use scihub.

Do you have a link or example for, uh, science? Lol

Googling "sci hub telegram bot" returns @scihubbot as the first result. You might want to try it out.

Actually, it's @scihubot. You can send it links to the paper from the journal webpage or the doi link, for example: https://doi.org/10.1063/1.432563

But it doesn't have any recent papers, does it?

Agreed, that's the one I use, so fast!

Seems like this is a really good service to add some legitimacy to Tor browsing.

Having SciHub as a hidden service would bring a lot of people to Tor.

EDIT: apparently I'm a bit stupid; it exists: https://scihub22266oqcxt.onion

But it would be cool to promote the hell out of the onion address and tor browser, rather than trying to bypass ISP restrictions.

Possibly also a path to greater decentralization. Relying on jurisdictional stuff isn't going to cut it forever (see the current pause on new uploads), but it's also hard to ask people to host data that'll get them sued without offering mitigations. Private torrent trackers do that through, well, being private, but I'm sure as hell not serving springer_catalogue_2020.tar.xz to the whole internet from an address linked to me in any way. Maybe an index of independently operated hidden services, each serving a (redundant) shard of the collection?

I use SciHub a lot, and just this week have been having problems accessing it on the internet (in the UK) - I don't use Tor, but I also never even thought of it. Totally agree it would be a good idea to promote the Onion address!

Dear authors: Just post your papers (before they have been formatted by the journal, but after they have been refereed at the latest) on the arXiv (maths, cs, physics) and similar repositories. Including the old ones you still have. Once a critical mass does this, the problem will be solved, whatever the Indian courts decide.

How do the alternative domains fare in the UK? With https://sci-hub.st/ I can circumvent the ISP block in Sweden successfully.

https://sci-hub.st does not work in the UK [edit: on Sky].

Works on Zen.

Works on BT broadband

Virgin says no.

Interestingly I get this:

Secure Connection Failed

An error occurred during a connection to sci-hub.st. SSL received a record that exceeded the maximum permissible length.


I know from memory that this is what I get when an HTTP server responds to an HTTPS/SSL request :)
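A quick sketch of why that error message shows up, assuming the block page comes back as plain HTTP on port 443: a TLS client reads the first five bytes of the reply as a record header (1-byte type, 2-byte version, 2-byte length), and the ASCII text "HTTP/" decodes to a length far beyond what a TLS record may carry. The function name is made up; the limit used is the TLS 1.2 ciphertext bound of 2^14 + 2048 bytes.

```python
import struct

MAX_TLS_RECORD_LENGTH = 2**14 + 2048  # upper bound for a TLS 1.2 ciphertext record


def parse_as_tls_record_header(data: bytes) -> tuple[int, int, int]:
    """Interpret the first five bytes as a TLS record header: type, version, length."""
    content_type, version, length = struct.unpack("!BHH", data[:5])
    return content_type, version, length


# What a TLS client sees if a plain HTTP response arrives instead of a ServerHello.
content_type, version, length = parse_as_tls_record_header(b"HTTP/1.1 200 OK\r\n")
print(content_type, hex(version), length)
print(length > MAX_TLS_RECORD_LENGTH)
```

The bytes "P/" land in the length field and decode to 20527, which is over the limit, hence "a record that exceeded the maximum permissible length".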

Yeah, that's how Virgin implement their blocking

Both .se and .st work for me on A&A (DSL) and on Three (Mobile)

Works fine here, on EE network in the UK.

I'm UK based and can't hit it on home wifi

Works for me on Plusnet in the UK, thanks!

Interestingly it works on HTTPS but with HTTP you get a block page referencing the court order (confirmed by a friend on Plusnet)

See also: https://www.blocked.org.uk/site/http://sci-hub.st

Same here, I hadn't tried HTTP. I wonder if this is Plusnet conforming to the letter of the law in quiet defiance? Of course, it could just be an error on their part...

Wikimedia should start publishing science articles (legally). It has the infrastructure, money and culture to become a non-profit world-scale publisher with decentralized curation.

That's cool.

I still cannot get my head around why the world is not embracing open sourcing of data. One way or another people get what they want, so you might as well give open access and reinvent the entire business model altogether. Harnessing the power of community could be the key for this reinvention.

I also could never understand why software companies like Microsoft invest in software activation. They constantly try to improve protection of their apps, run activation servers and have probably already spent huge amounts of money on this, yet I have never seen anybody willing to use pirated software face any real difficulty. Whoever wants it gets it anyway. Those who can easily afford a license and those (mostly businesses) who care about the legal aspect buy anyway. A simple (even easily crackable from the technical point of view) low-tech "enter a serial" dialog is enough to stop the rest (lazy/stupid people).

I understand how this works in games and movies (the publisher gets most of the profit in the first hours after the release, before it gets cracked) but can't understand what the point is with business/utility apps.

Interesting point. I guess it is just a marketing gimmick to show the customer that "security" is our priority.

No need to use a .pac file for configuring site-based HTTP(S) proxy selection in modern browsers. I've just installed the FoxyProxy extension (works in Chrome and Firefox), and I was able to specify regexps for the URL, and select the HTTP(S) proxy based on which regexp matches.
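The rule logic described there can be sketched in a few lines of Python. This is only an illustration of regexp-based proxy selection, not FoxyProxy's implementation: the first-match-wins order, the `sci-hub.example` pattern and the `proxy.example:8080` address are all assumptions.

```python
import re
from typing import Optional

# Hypothetical rule list: (URL pattern, proxy to use when it matches).
RULES = [
    (re.compile(r"^https?://([^/]+\.)?sci-hub\.example(/|$)"), "proxy.example:8080"),
]


def pick_proxy(url: str, rules=RULES) -> Optional[str]:
    """Return the proxy of the first matching rule, or None for a direct connection."""
    for pattern, proxy in rules:
        if pattern.search(url):
            return proxy
    return None


proxied = pick_proxy("https://sci-hub.example/10.1000/xyz")
direct = pick_proxy("https://news.ycombinator.com/")
```

Anchoring the pattern at the scheme and the host boundary keeps lookalike hosts from being sent through the proxy by accident.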

You're probably better off using a paid VPN provider than a paid proxy provider. A VPN can be used in more places, and some VPNs provide HTTP proxy access (the kind used in the tutorial) in addition to OpenVPN/WireGuard. If they have a browser extension, chances are they support HTTP proxy.

A very easy one: Firefox > Preferences > General > Network settings > Enable DNS over HTTPS

This is a good step, but afaik you can still identify and block a website using only the TLS traffic, even without the DNS.

Sure. I'm just saying it's been working for the last years. Let's hope it lasts.

The good thing is that, so far (in France), the blockings only affect the most mainstream providers.

This is the reason Tor exists.

Obviously another solution on Linux is to install a local recursive DNS resolver and be done with it... I'm quite happy with knot-resolver (kresd).

This only works if your ISP is using/abusing/hijacking DNS to censor your connections.

If they're doing that you'd be better off using D-o-T or D-o-H, to protect your DNS from interference.

ISPs rarely do anything other than DNS censoring (censoring by IP blackholing is for really grave stuff). Also, I don't understand why you'd be "better off" using an encrypted connection to a 3rd party DNS which can still lie to you. Just run a local resolver; it's so lightweight there's no real reason not to (and honestly, the hypothetical delay isn't noticeable).

Sorry, am I missing something? I'm pretty sure the whole point of the article is that ISPs do block more than just DNS.

Is it? I didn't understand that. It's just a random tutorial on using a proxy for a specific domain.

A 3rd party is better because it can be hosted in some other country not subject to local fascism du jour you have to deal with from your ISP.

What is forcing these UK ISPs to block these IP ranges?

Many years ago, Hollywood and the Music Industry argued in court in the UK that since most UK ISPs have a mechanism to block stuff like child pornography, they ought to be compelled by the court to use this same mechanism to protect the interests of these commercial entities.

The courts agreed. If your ISP has such a capability the court will cheerfully give rights holders authority to demand the ISP blocks stuff that they claim rights over.

The ISPs could choose to just not block stuff. I am looking at Sci-Hub right now, because my UK ISP (Andrews & Arnold) doesn't block anything. From time to time, parliamentarians get vexed about this, but there is a cultural memory in the place that passing laws intended to stop people from saying things doesn't work. Lady Chatterley's Lover was banned, because, the government argued, it was obscene but it turns out that a court didn't buy this argument, and instead Penguin made piles of money because everybody wanted to read the banned book.

But most UK ISPs have decided that it is in their best interests to block things. Their options for attempting to do this have narrowed over time. Once upon a time DNS blocking was pretty effective; I assume that by now they're mostly relying on IP blocking, which of course means they run the risk of exciting collateral attacks...

The UK Government has passed several pieces of legislation to this effect, and has been doing so since around 2015.

Also: all data is required to be logged, and those logs are searchable by civil servants without a warrant.

I don't know. Virgin Media quote High Court orders saying they have to block several I tried, but my home ISP does not block any of them. This seems weird to me, that the court orders would cover a few specific ISPs, but I haven't looked into it further.

Are those High Court orders secret? Is there any journalism happening?

Someone else linked this:


Which includes info on the orders:


Occasionally there is a small amount of reporting on it, but it tends to be fairly misdirected and naive.

The government have passed quite a few bills this year aimed at locking down on this kind of thing whilst people were pre-occupied with COVID.

I once attempted to make something like ProxySwitchy for DNS[1], but I didn't work on it long enough to get off the ground. This article made me think about it again. Is there actually a use case for that kind of thing?

[1]: https://github.com/emsal0/resolvplox

Can anyone in Europe with iOS 15 see if Private Relay is able to bypass this? We have a similar situation over here in the States with Verizon and some piracy sites, and it’s able to bypass them.

Too bad the Handshake domain donation to SciHub didn't pan out.


IPFS would be a perfect fit for this...

It sadly isn't censorship resistant, but it should do for a few years, and as soon as censorship on IPFS starts to become an issue, hopefully the IPFS developers will evolve the design.

Tor works well for me so far

Once upon a time I wrote a script that took a MARC file from the library catalog as input, and output a PAC file for the group that ran the campus proxy server.

The best way imo is the telegram bot

TIL that the UK has gone down a slippery slope

Do they have an official .onion, by chance?

I'm glad this document exists but I tend to favour using TorBrowser, both on Android and (in my case) Manjaro.
