It would be great to develop a truly decentralised solution. Having a database of individual torrent links for each paper might be a start.
When you have a whole lot of tiny files that people will generally only want one or two of, there isn't much better than a plain old website.
A torrent that hosts all of the papers could be useful for people who want to make sure the data can't be lost by a single police raid.
With the addition of hashsums (even MD5 and SHA1, though longer and more robust hashsums are preferred), a pretty reliable archive of content can be made. It's a curious case where increased legibility seems to be breaking rather than creating a gatekeeper monopoly.
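As a minimal sketch of that idea (the paths and the particular mix of algorithms here are just illustrative, not anything an actual archive mandates), computing several hashsums per file is a few lines with Python's standard library:

    import hashlib
    from pathlib import Path

    def hashsums(path, algorithms=("md5", "sha1", "sha256")):
        """Return hex digests of one file under several algorithms in a single pass."""
        hashers = {name: hashlib.new(name) for name in algorithms}
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
                for h in hashers.values():
                    h.update(chunk)
        return {name: h.hexdigest() for name, h in hashers.items()}

    # Example: print digests for every PDF in an archive directory (path is illustrative).
    for pdf in Path("archive/").glob("*.pdf"):
        print(pdf.name, hashsums(pdf))

Publishing all three digests alongside each file means a mirror can be verified even if one of the weaker algorithms is eventually broken.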
I've been interested in the notion of more reliable content-based identifiers or fingerprints themselves, though I've found little reliable reference on this. Ngram tuples of 4-5 words are often sufficient to identify a work, particularly if a selection of several are made. Agreeing on which tuples to use, how many, and how to account for potential noise / variations (special characters, whitespace variance, OCR inaccuracy) is also a stumbling point.
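A rough sketch of the ngram idea, assuming a simple normalisation pass (lowercase, strip punctuation, collapse whitespace) to absorb some of the OCR and formatting noise; the tuple length, sample count, and threshold are arbitrary choices, not an agreed standard:

    import re
    import random

    def normalize(text):
        """Lowercase, drop non-alphanumeric characters, collapse whitespace."""
        text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
        return re.sub(r"\s+", " ", text).strip()

    def ngram_fingerprint(text, n=5, samples=20, seed=42):
        """Take a reproducible sample of n-word tuples as a crude fingerprint."""
        words = normalize(text).split()
        grams = sorted({" ".join(words[i:i + n]) for i in range(len(words) - n + 1)})
        rng = random.Random(seed)
        return rng.sample(grams, min(samples, len(grams)))

    def likely_same_work(text_a, text_b, threshold=0.5):
        """Crude check: what fraction of one text's sampled tuples appear in the other?"""
        fingerprint = ngram_fingerprint(text_a)
        haystack = normalize(text_b)
        hits = sum(1 for gram in fingerprint if gram in haystack)
        return hits / max(1, len(fingerprint)) >= threshold

Because matching happens on the normalised text, whitespace and punctuation differences wash out, but word-level OCR errors or a different translation will still defeat it; that's exactly the stumbling point above.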
Most especially, given two or more instances of what you suspect to be the same or a substantively similar work, how can you assess this in a robust and format-independent manner, programmatically?
For works with well-formed metadata, this isn't an issue.
For identical duplicate copies of the same file, a hash is effective.
But for the circumstance most often encountered in reality --- different forms and formats derived from different sources but containing substantially the same work --- there is no simple solution of which I'm aware. As an example, say you have a reference document, The Reference Document.
How do I determine that:
- An ASCII-only textfile
- Markdown, HTML, DocBook, and LaTeX sources
- PDF, MS Word (which version?), PS, DJVU, ePub, or .mobi files (sling in any other formats you care to mention).
- Hardbound and paperback physical copies
- Scans made from the same or different physical books or instances, versions, and/or translations.
- Audiobooks based on a work, by the same or different readers.
- Dramatic performances, films, video series, comic-book adaptations, etc., of a work. (Say: Hamlet or Romeo and Juliet. What is the relationship to "West Side Story" (and which version), or Pyramus and Thisbe?)
- Re-typed or OCRed text
... all refer to the same work?
How do you define "work"?
How do you define "differences between works"?
How do you distinguish intentional, accidental, and incidental differences between instances? (Say: translations, errata, corrections, additions for the one, transcription errors for the second, and scanning or rendering artefacts for the third.)
If you're working in an environment in which instances of works come from different sources with different provenances, these questions arise. At least some of these questions are prominent in library science itself. It's the technical mapping of digitised formats I'm focusing on most closely, so the physical instantiations aren't as critical here, though the presumption is that these could be converted to some machine-readable form.
In bibliographic / library science, the term is "work, expression, manifestation"
And so I suspect the way forward is to maintain human-curated mappings of file hashes to “works”, where “a work” is a matter of the curator’s opinion, and different curations will be valued differently by different consumers of that information. For example, maybe a literary expert would have a great and respected method of organizing the works and derived works of Shakespeare, but that same person might not be sought out for their views on pop songs.
You could probably start with an ML-based curation that gets it 80% right, and fill out the last 20% with gamified crowdsourcing (with multiple competing interpretations of the last 20%).
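A minimal sketch of what one entry in such a curated hash-to-work mapping might look like (every field name and value here is made up for illustration, not a real schema), with a field recording whether the mapping came from the ML pass or from human review:

    # One hypothetical entry in a hash -> work mapping; all fields are illustrative.
    mapping_entry = {
        "work": "hamlet",                        # curator-assigned work identifier
        "curator": "example-shakespeare-guild",  # whose opinion this mapping reflects
        "source": "ml-suggested, human-confirmed",
        "confidence": 0.97,
        "instances": [
            {"sha256": "<hash of ePub file>", "format": "epub", "note": "modern edition"},
            {"sha256": "<hash of PDF scan>",  "format": "pdf",  "note": "scan of 1904 printing"},
            {"sha256": "<hash of text file>", "format": "txt",  "note": "uncorrected OCR"},
        ],
    }

Different curators could publish competing mappings of the same hashes, and consumers pick whose curation they trust.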
All analogies melt if they're pushed hard enough. And all models are wrong, though some are useful.
The notion of a work has utility: it respects different forms, variations, and evolution over time. If you're looking at, say, multiple editions of a book, or even something much more dynamic, say source code or a wiki entry, the specific content may change at any point, and the work stands through many versions, but those versions are connected through edit events. A good revision control system will capture much of that, if the history interests you.
Ultimately, I'd argue that "work" is defined in relationships and behaviours. A record intermediates between author(s) and reader(s) (or listeners, viewers, etc.), concerning some informational phenomenon, perhaps fictional, perhaps factual, perhaps itself an action (as in a marriage record, divorce decree, or court decision). The work in its total context matters. (At which point we discover most works have very little context...).
The file-hashes-to-work mapping is all but certain to play a large role, but even that is only a means of indicating a relationship that is established by some other means.
The notion of selecting an arbitrary set of ngram tuples to establish a highly probable relationship is likely to remain at least one of those means.
And yes, the incremental / tuned approach is also likely a useful notion.
Paul Otlet had a lot to say about "documents", though I think "records" is a better term for what he had in mind: any persistent symbolic artefact (book, painting, music, photograph, film, etc.).
What's particularly relevant now are ephemeral and/or continuously updated online content --- and not just the WWW (http/https), but other protocols (ftp, gemini, ipfs, torrents, ...), as well as apps.
A working truism I developed was that "identity is search that produces a single result". So if you can come up with something that uniquely identifies a work, then that can be a working identifier. I typically focus on what can be reasonably assessed of author, title, publication date, publisher (traditional, website/domain), and failing that, descriptive text. Remember that originally titles were simply the introductory lines of works (a practice still used in some cases, e.g., the names of church masses or prayers such as "Kyrie Eleison").
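As a sketch of that truism, assuming nothing about any particular catalogue: build a query from whatever fields are available and treat it as an identifier only once it narrows the catalogue to exactly one record. Here search_catalogue is a hypothetical stand-in for whatever index is actually being queried:

    def working_identifier(record, search_catalogue):
        """Return the smallest field combination that is "search with a single result",
        or None if no combination uniquely identifies the record.

        `search_catalogue(query) -> list of matches` is a hypothetical stand-in
        for whatever index is actually being queried."""
        preferred = ["author", "title", "date", "publisher", "description"]
        query = {}
        for field in preferred:
            if field not in record:
                continue
            query[field] = record[field]
            if len(search_catalogue(query)) == 1:
                return dict(query)
        return None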
The Superintendent of Documents (SuDoc) Classification Scheme (used by the US government and GAO) operates by agency, type of publication, and further divisions, as well as date/year. https://www.fdlp.gov/about-fdlp/22-services/929-sudoc-classi...
And to drag in metadata, it may:
- Not be present.
- Be inaccurately applied to the correct work (metadata say the work is different, work is in fact related/same).
- Be inaccurately applied to the wrong work (metadata say the works are the same/related, they are not).
Text is a more constrained state space than speech/audio.
Or just use IPFS I suppose.
It might be beneficial to create something like what you describe without any cryptocurrency association, though. I've been mulling over possibilities for distributed systems that are inherently currency-less, to avoid all of the scams that cryptocurrency attracts.
Cryptocurrency is helpful because it allows you to incentivize people to hold the data. If you don't have cryptocurrency, you're basically dependent on altruism to keep data alive (like bittorrent, or ipfs without filecoin). Popular files do okay for a while, but even popular files struggle to provide good uptime and reliability after a few years.
On an incentivized network like Sia or Filecoin, you can have high performance and reliability without ever needing any users to keep their machines online and seeding.
There's been a bit of effort to get it working over Tor for years now, but the fundamental design makes this difficult. Also, despite all the money that has poured into Filecoin, this doesn't seem to be a priority.
This issue is nearly 6 years old:
HN submission: https://news.ycombinator.com/item?id=27016630
"a whole lot of tiny files" severely underestimates the scale at work. Libgen's coverage is relatively shallow, and pdf books tend to be huge, at least for older material. Scihub piggy backs on the publishers, so that's your reference.
Syndication, syndicate: quite apt, don't you think? Libraries that colluded with the publishers and accepted the pricing must have been a huge part of the problem, at least historically. Now you know there's only one way out of a mafia.
Estimates I've seen put the total Scihub cache at 85 million articles totaling 77TB. That's a single 2U server with room to spare. The hardest part is indexing and search, but it's a pretty small search space by Internet standards.
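Back-of-the-envelope, using only those two figures (everything else here is my own assumption for illustration):

    articles = 85_000_000
    total_tb = 77
    avg_mb = total_tb * 1_000_000 / articles   # ~0.9 MB per article on average
    drives_18tb = -(-total_tb // 18)           # ceil(77 / 18) = 5 drives, before redundancy
    # A typical 2U chassis has 12 x 3.5" bays, so 5 drives really does leave room to spare.
    print(f"{avg_mb:.2f} MB/article, fits on {drives_18tb} x 18 TB drives")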
There is an existing index in SQL format distributed by Libgen: https://www.reddit.com/r/scihub/comments/nh5dbu/a_brief_intr...; it is around 30 GB uncompressed.
Those 851 torrents uncompressed would probably take half a petabyte of storage, but I guess for serving PDFs you could extract individual files on demand from the zip archives and (optionally) cache them. So the Scihub "mirror" could run on a workstation or even a laptop with 32-64 GB of memory connected to a 100 TB NAS over 1 GbE, serving PDFs over VPN and using an unlimited traffic plan. The whole setup, including workstation, NAS, and drives, would cost $5-7K.
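A minimal sketch of the extract-on-demand idea with Python's standard library, assuming the collection sits on the NAS as zip archives and that something else maps a DOI to (archive, member); every path and name here is illustrative:

    import zipfile
    from pathlib import Path

    ARCHIVE_ROOT = Path("/mnt/nas/scihub")   # illustrative NAS mount point
    CACHE_DIR = Path("/var/cache/papers")    # optional local cache for hot papers

    def fetch_pdf(archive_name, member_name):
        """Pull a single PDF out of a zip archive without unpacking the whole thing,
        caching it locally so repeat requests skip the NAS round-trip."""
        cached = CACHE_DIR / member_name
        if cached.exists():
            return cached.read_bytes()
        with zipfile.ZipFile(ARCHIVE_ROOT / archive_name) as zf:
            data = zf.read(member_name)      # reads only this member, not the full archive
        cached.parent.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(data)
        return data

Zip's central directory makes single-member reads cheap, which is why the archives can stay compressed on disk.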
It's not a very difficult project and can be done DIY-style, if you exclude the proxy part (which downloads papers using donated credentials). Of course, it would still be as risky as running Scihub itself, which has a $15M lawsuit pending against it.
At 5 MB per book, this works out to about 200 TB of disk storage.
At about $12/TB, hosting the entire LoC collection would cost roughly $2,400 presently, with prices halving about every three years.
Factor in redundancy (I'd like to see triple-redundant storage on any given site, though since sites are redundant across each other, this might be forgoable). Access times and high demand are likely the big factors, though caching helps tremendously.
My point is that the budget is small and rapidly getting smaller. For one of the largest collections of written human knowledge.
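For concreteness, the arithmetic behind those figures (the book count is what 200 TB at 5 MB/book implies; the redundancy factor is the triple-copy assumption above):

    books = 40_000_000        # rough count implied by 200 TB at 5 MB/book
    mb_per_book = 5
    usd_per_tb = 12
    redundancy = 3            # triple-redundant copies per site, as suggested above

    raw_tb = books * mb_per_book / 1_000_000            # 200 TB
    single_copy = raw_tb * usd_per_tb                   # ~$2,400
    with_redundancy = single_copy * redundancy          # ~$7,200
    print(f"{raw_tb:.0f} TB raw, ~${single_copy:,.0f} single copy, ~${with_redundancy:,.0f} at 3x")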
There are some other considerations:
- If original typography and marginalia are significant, full-page scans are necessary. There's some presumption of that built into my 5 MB/book figure. I've yet to find a scanned book of > 200MB (the largest I've seen is a scan of Charles Lyell's geology text, from Archive.org, at north of 100 MB), and there are graphics-heavy documents which can run larger.
- Access bandwidth may be a concern.
- There's a larger set of books ever published, with Google's estimate circa 2014 being about 140 million books.
- There are ~300k "conventionally published" books in English annually, and about 1-2 million "nontraditional" (largely self-published), via Bowker, the US issuer of ISBNs.
- LoC have data on other media types, and their own complete collection is in the realm of 140 million catalogued items (coinciding with Google's alternate estimate of total books, but unrelated). That includes unpublished manuscripts, maps, audio recordings, video, and other materials. The LoC website has an overview of holdings.
Published document scarcity is entirely imposed.
I know it's not Big Data(tm) big data, but it is a lot of data for something that can generate no revenue.
Sure. Let's add redundancy and bump by an order of magnitude to give some headroom -- $5-10k is a totally reasonable amount to fundraise for this sort of application. If it were legal, I'm sure any number of universities would happily shoulder that cost. It's minuscule compared to what they're paying Elsevier each year.
Perhaps some sort of MTurk or captcha-like tasks per access? Patr[e]ons? Donation drives? Micro-payments? Something else??
AWS is not the cheapest bulk-storage hosting possible.
The cost of maintaining a free and open DB of scientific advances and publications would be so incredibly insignificant compared to both the value and the continued investment in those advancements.
I feel that we're halfway there already and are gaining ground. Does Pubmed Central (a government-hosted open access repository for NIH-funded work) count as a "ledger" like you're referring to? The NSF's site does a good job of explaining current US open access policy. There are occasional attempts to expand the open access mandate by legislation, such as FASTR. A hypothetical expansion of the open access mandate to apply to all works from /institutions/ that receive indirect costs, not just individual projects that receive direct costs, would open things up even more.
Also, a per-government ledger would not be super practicable. But if, say, the US, the EU, and China would agree on something like this, implement it, and share a common ledger, then it would not be such a big leap to make it properly international. Maybe even UN-based.
That's a pretty big "if" though.
On the other hand, there is a slippery slope in deciding what isn't scientific enough to be required to be open knowledge.
By the way, specialist knowledge and open knowledge are kind of a dichotomy. You would need to define the intersection of both. Suddenly you are looking at a patent system. Pay-to-quote, citation fees: news websites are already demanding this from Google here in Germany, including Springer Press.
IPFS via the DHT tells the network your whole network topology, including internal addresses you may have, and VPN endpoints too. It's all public by design, as they don't want to associate IPFS with piracy, per one of their developers.
This thread has some discussion of the alternatives: https://www.reddit.com/r/DataHoarder/comments/nc27fv/rescue_...
What IPFS provides here over torrents is the ability to add more files instead of creating new torrents.
I believe libgen.fun, which is a new (official) Libgen mirror, is running fully on IPFS, and it serves some scientific papers, but I wasn't able to search by DOI or title there; it looks like it redirects to Scihub. Also, there is no index of the papers on IPFS.
Edit: this doc talks about scihub+ipfs (it was created by the leader of Scihub rescue effort on Reddit, /u/shrine): https://freeread.org/ipfs/
You can also point to full webapps. So for example you could have a pubkey point to a webapp that then loads the database from a list of moderators who update the data under their pubkeys to point to the latest version of scihub. Then you can even have different moderators curate articles of different subjects, and the webapp can combine everything together.
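A rough sketch of how the client-side combining step might look. Here resolve_registry and fetch are hypothetical stand-ins for whatever mutable-pointer lookup and content fetch the underlying network actually provides (Skynet's registry, IPNS, or similar), not real API calls:

    def combined_index(moderator_pubkeys, resolve_registry, fetch):
        """Resolve each moderator's mutable pointer to their latest curated index,
        fetch it, and merge the per-subject article lists into one view.

        `resolve_registry(pubkey) -> content_id` and `fetch(content_id) -> dict`
        are hypothetical stand-ins for the network's actual primitives."""
        merged = {}
        for pubkey in moderator_pubkeys:
            content_id = resolve_registry(pubkey)  # pointer the moderator updates over time
            index = fetch(content_id)              # e.g. {"physics": [...], "biology": [...]}
            for subject, articles in index.items():
                merged.setdefault(subject, []).extend(articles)
        return merged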
All the data is stored on Sia, which is a decentralized cloud storage platform. Skynet allows anyone to upload or download data from Sia directly from the web, no need for extensions or custom software. The Sia network handles what IPFS calls "seeding", so contributors don't need to worry about leaving their machines on.
that's probably a really small set of scientists.
TL;DR of the link: No more uploads to support a court case in India which SciHub might win and thus establish a legal basis for operation in the biggest democracy.
there are some unaffiliated "mirrors" that only redirect to scihub but list their own bitcoin address for donations, so beware.
/r/scihub on reddit keeps track of this https://www.reddit.com/r/scihub/wiki/index
When I hear complaints that Bitcoin has no use except for speculation, I think of Sci-hub, Wikileaks and other organizations that may be bad for the interests of the US government but may be good for mankind.
In fact, there is already an IPFS mirror of Scihub.
The fact that Scihub needs to switch domain names every couple of months, and that ISPs are starting to block the website, seems to confirm my theory rather than imply some worldwide conspiracy.
That's not privacy.
Which, I think, shouldn't be surprising when our "representatives" ostensibly speak for hundreds of thousands (and sometimes, millions) of people each. True democracy requires a much shorter and more direct chain of responsibility.
SciHub was allowed to operate freely for several years when it was relatively unknown, but to turn a blind eye to it now that it's received mainstream attention would threaten the credibility of the legal system.
I wholeheartedly support SciHub, but I can't blame governments for enforcing their laws. Ultimately, the changes need to happen at the legislative level to update copyright laws, and this is somewhere where lobbying local representatives could have a meaningful impact.
Agreed. This is true for virtually any human endeavour. The average person out there can be easily manipulated by those who generally seek power. So I say the root problem is people's ability to discern appropriate people to hold power. To make matters worse, those who are most suited to govern are often not interested in taking on positions of power.
Cease to supply this system with the fruits of your research labors.
Historically, academics have felt forced to support this system, because for-profit journals are the high-prestige ones they must publish in, in order to get tenure. This has changed for certain fields, but it isn't as simple as just suggesting that one publish elsewhere.
In other branches of those disciplines, I have seen that recently some big-name editors have founded new open-access journals with the express aim of gradually taking prestige away from for-profit journals. See here (PDF).
The property rights in question are not natural rights, nor material rights. Sufficient political will seems like it could do it.
Finding politicians in power who will support human progress before profits might be hard! [/understatement]
As others have mentioned, Sci-Hub also maintains a Tor presence, and you can access the Onion link using the Tor browser (provided you can install that on your desktop or device).
Love the haq!
There are lots of sites UK ISPs block even though the sites themselves are not illegal and don't host illegal content, e.g. torrent indexing services (the content itself may be illegal, but purely providing a search across that content is basically what Google does).
The UK internet is heavily filtered/censored and so doing this is useful anyway.
Business ISP connections don't seem to be restricted. And neither do most cloud providers I've tried.
Might be better than using temporary proxies.
Afaik, my (London) ISP does not block anything. No idea why, as all the others quote high court orders.
There's a smaller set of non-optional blocks (pirate/torrent sites) which you need a VPN to get around.
Porn is not really illegal, just unwanted, which is apparently reason enough to block it. Does this mean any content which is "unwanted" can be blocked just like that?
While I do use a VPN at times, it is never to get around blocks.
Although I don't use my ISP's DNS, so if that is how they're attempting to comply, I wouldn't notice.
This https://unblockit.pw/ also does the job (Scihub is at the bottom).
You can enter any domain to test it, the data is collected by volunteers running an automatic tool.
> Incidentally, you do not need to be running a web server to use the .pac file. You can access it via a file:// type URL. For example (note the 3 slashes): file:///Users/username/Library/proxy.pac
I recommend using NextDNS, and then setting up a provisioning profile at https://apple.nextdns.io to set it as your resolver on your Macs and iOS devices. The ad-blocking features are a nice bonus, too.
NextDNS also has a cool free software CLI local DoH proxy resolver which works a charm.
> Changing your DNS resolver to a public one like Google’s instead of your ISP’s is not sufficient as of 2021, for two ISPs I’ve tested, and I suspect for all UK ISPs that implement blocking.
Changing your non-DoH resolver (such as using Google Public DNS) means requests and responses can still be edited by your ISP. This is what the article is talking about.
I suggested DoH (encrypted DNS) because this is not subject to such tampering. DoH (DNS-over-HTTPS) is not the same as traditional unencrypted port 53 DNS.
Really, anyone who gives a shit about privacy should be using DoH exclusively; otherwise you are basically uploading your web history in real time to your ISP for mining and resale.
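For anyone who wants to see the difference in practice, Google's public DoH resolver also exposes a JSON API you can hit over plain HTTPS; a quick check from Python (the queried name is just an example):

    import json
    import urllib.request

    # Query Google's DNS-over-HTTPS JSON endpoint; the whole lookup rides inside TLS,
    # so an on-path ISP sees only a connection to dns.google, not the name being resolved.
    url = "https://dns.google/resolve?name=sci-hub.st&type=A"
    with urllib.request.urlopen(url, timeout=5) as resp:
        reply = json.load(resp)
    for answer in reply.get("Answer", []):
        print(answer["name"], answer["data"])

Browsers and OS-level DoH use the binary RFC 8484 wire format rather than this JSON API, but the privacy property is the same: the resolver traffic is encrypted end to end.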
Having SciHub as a hidden service would bring a lot of people to Tor.
EDIT: apparently I'm a bit stupid; it exists: https://scihub22266oqcxt.onion
But it would be cool to promote the hell out of the onion address and tor browser, rather than trying to bypass ISP restrictions.
Interestingly I get this:
Secure Connection Failed
An error occurred during a connection to sci-hub.st. SSL received a record that exceeded the maximum permissible length.
Error code: SSL_ERROR_RX_RECORD_TOO_LONG
See also: https://www.blocked.org.uk/site/http://sci-hub.st
I understand how this works in games and movies (the publisher gets most of the profit in the first hours after the release, before it gets cracked), but I can't understand what the point is with business/utility apps.
The good thing is that, so far (in France), the blocking only affects the most mainstream providers.
If they're doing that, you'd be better off using DoT or DoH to protect your DNS from interference.
The courts agreed. If your ISP has such a capability the court will cheerfully give rights holders authority to demand the ISP blocks stuff that they claim rights over.
The ISPs could choose to just not block stuff. I am looking at Sci-Hub right now, because my UK ISP (Andrews & Arnold) doesn't block anything. From time to time, parliamentarians get vexed about this, but there is a cultural memory in the place that passing laws intended to stop people from saying things doesn't work. Lady Chatterley's Lover was banned, because, the government argued, it was obscene but it turns out that a court didn't buy this argument, and instead Penguin made piles of money because everybody wanted to read the banned book.
But most UK ISPs have decided that it is in their better interests to block things. Their options for attempting to do this have narrowed over time, once upon a time DNS blocking was pretty effective, I assume that by now they're mostly relying on IP blocking, which of course means they run the risk of exciting collateral attacks...
Also: all data is required to be logged, and those logs are searchable by civil servants without a warrant.
Which includes info on the orders.
Occasionally there is a small amount of reporting on it, but it tends to be fairly misdirected and naive.
It sadly isn't censorship-resistant, but it should do for a few years, and as soon as censorship on IPFS starts to become an issue, hopefully the IPFS developers will evolve the design.