
By the way, Sci-Hub has stopped adding new articles to the database for a few months now (background: https://www.reddit.com/r/scihub/comments/mk46x4/scihub_v_els...).

It would be great to develop a truly decentralised solution. Having a database of individual torrent links for each paper might be a start.




Millions of individual torrents is not a great solution. Keeping them all seeded is basically impossible unless they run a seed for each one, at which point they might as well just host the files. Plus you'll never get the economy of scale that makes BitTorrent really shine.

When you have a whole lot of tiny files that people will generally only want one or two of there isn't much better than a plain old website.

A torrent that hosts all of the papers could be useful for people who want to make sure the data can't be lost by a single police raid.


What documents (books, scientific articles) specifically benefit from is a set of highly consistent, highly accurate identifiers: DOI (scientific articles), ISBN (published books), and others (OCLC identifier, Library of Congress Catalogue Number, etc.).

With the addition of hashsums (even MD5 and SHA1, though longer and more robust hashsums are preferred), a pretty reliable archive of content can be made. It's a curious case where increased legibility seems to be breaking rather than creating a gatekeeper monopoly.

I've been interested in the notion of more reliable content-based identifiers or fingerprints themselves, though I've found little reliable reference on this. Ngram tuples of 4-5 words are often sufficient to identify a work, particularly if a selection of several are made. Agreeing on which tuples to use, how many, and how to account for potential noise / variations (special characters, whitespace variance, OCR inaccuracy) is also a stumbling point.
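For a concrete (if naive) sketch of the kind of fingerprinting I mean: the normalization rules, tuple length, and number of tuples kept below are arbitrary choices of mine, not any standard.

    import hashlib
    import re

    def normalize(text: str) -> list[str]:
        # Collapse case, punctuation, and whitespace variance (incl. common OCR noise).
        return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

    def ngram_fingerprint(text: str, n: int = 5, keep: int = 20) -> set[str]:
        # Hash every n-word tuple, then keep the lexicographically smallest `keep`
        # hashes, so two instances of the same work tend to select the same tuples.
        words = normalize(text)
        grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
        hashes = {hashlib.sha1(g.encode()).hexdigest() for g in grams}
        return set(sorted(hashes)[:keep])

    def overlap(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

Two instances with an overlap near 1.0 are very probably the same work; how low the threshold can go before false positives creep in is exactly the open question.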


Why map anything to words for strict identification? Words and language are very error prone vs an id number or hash.


It's a bit of an itch I've been scratching for a few years.

Most especially, given two or more instances of what you suspect to be the same or a substantively similar work, how can you assess this in a robust and format-independent manner, programmatically?

For works with well-formed metadata, this isn't an issue.

For identical duplicate copies of the same file, a hash is effective.

But for the circumstance most often encountered in reality --- different forms and formats derived from different sources but containing substantially the same work --- there is no simple solution of which I'm aware. As examples, say you have a reference document The Reference Document.

How do I determine that:

- An ASCII-only textfile

- Markdown, HTML, DocBook, and LaTeX sources

- PDF, MS Word (which version), PS, DJVU, ePub, or .mobi files (sling any other formats you care to mention).

- Hardbound and paperback physical copies

- Scans made from the same or different physical books or instances, versions, and/or translations.

- Audiobooks based on a work. By the same or different readers.

- Dramatic performances, films, video series, comic-book adaptations, etc., of a work. (Say: Hamlet or Romeo and Juliet. What is the relationship to "West Side Story" (and which version), or Pyramus and Thisbe?)

- Re-typed or OCRed text

... all refer to the same work?

How do you define "work"?

How do you define "differences between works"?

How do you distinguish intentional, accidental, and incidental differences between instances? (Say: translations, errata, corrections, additions for the one, transcription errors for the second, and scanning or rendering artefacts for the third.)

If you're working in an environment in which instances of works come from different sources with different provenances, these questions arise. At least some of these questions are prominent in library science itself. It's the technical mapping of digitised formats I'm focusing on most closely, so the physical instantiations aren't as critical here, though the presumption is that these could be converted to some machine-readable form.
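To make that concrete, the crudest format-independent check I can imagine, assuming each instance has first been reduced to plain text by whatever fits the format (pdftotext, pandoc, OCR output, a retyping...), is something like:

    import difflib
    import re

    def canonical(raw: str) -> str:
        # Drop everything but letters/digits and collapse whitespace, so markup,
        # hyphenation, and minor OCR noise matter less.
        return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9\s]", " ", raw.lower())).strip()

    def likely_same_work(a: str, b: str, threshold: float = 0.9) -> bool:
        # Fine for article-sized texts; for whole books you'd want shingle/ngram
        # hashing rather than a full sequence comparison.
        ratio = difflib.SequenceMatcher(None, canonical(a), canonical(b)).ratio()
        return ratio >= threshold

Which immediately runs into all of the questions above: the threshold is arbitrary, translations and adaptations score near zero, and errata vs. OCR errors are indistinguishable.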

In bibliographic / library science, the term is "work, expression, manifestation"

https://www.loc.gov/marc/marbi/2011/2011-dp03.html


The general problem here is not solvable with technology if there is no universally agreed definition for “a work” - and there isn’t (this touches on some profound issues of ontology).

And so I suspect the way forward is to maintain human-curated mappings of file hashes to “works”, where “a work” is a matter of the curator’s opinion, and different curations will be valued differently by different consumers of that information. For example, maybe a literary expert would have a great and respected method of organizing the works and derived works of Shakespeare, but that same person might not be sought out for their views on pop songs.

You could probably start with an ML-based curation that gets it 80% right, and fill out the last 20% with gamified crowdsourcing (with multiple competing interpretations of the last 20%).


Yes, it's complicated.

All analogies melt if they're pushed hard enough. And all models are wrong, though some are useful.

The notion of a work has utility; it respects the reality of different forms, variations, and evolution over time. If you're looking at, say, multiple editions of a book, or even something much more dynamic, say, source code or a Wiki entry, yes, the specific content may change at any point and stand through many versions, but those versions are connected through edit events. A good revision control system will capture much of that, if the history interests you.

Ultimately, I'd argue that "work" is defined in relationships and behaviours. A record intermediates between author(s) and reader(s) (or listeners, viewers, etc.), concerning some informational phenomenon, perhaps fictional, perhaps factual, perhaps itself an action (as in a marriage record, divorce decree, or court decision). The work in its total context matters. (At which point we discover most works have very little context...).

The file-hashes-to-work mapping is all but certain to play a large role, but even that is only a means of indicating a relationship that is established by some other means.

The notion of selecting an arbitrary set of ngram tuples to establish a highly probable relationship is likely to remain at least one of those means.

And yes, the incremental / tuned approach is also likely a useful notion.

Paul Otlet had a lot to say about "documents", though I think "records" is a better term for what he had in mind, as any persistent symbolic artefact: book, painting, music, photograph, film, etc.


I have been dealing with the same problem for curating resources at https://learnawesome.org. Projects like Openlibrary do collect unique identifiers for _books_, but for everything else, it mostly takes manual effort. For example, I collect talks/podcasts by the author where they discuss ideas from their books. Then there are summaries written by others.


There's a lot of work toward this in library space, though it takes some adaptation to new media formats. Paul Otlet worked in a paper-only medium in the early 20th century but also has some excellent thinking. His books are now seeing translation from French. The Internet Archive and Library of Congress are also doing a lot of relevant work, see the WARC format as an example.

What's particularly relevant now are ephemeral and/or continuously updated online content --- and not just the WWW (http/https), but other protocols (ftp, gemini, ipfs, torrents, ...), as well as apps.

A working truism I developed was that "identity is search that produces a single result". So if you can come up with something that uniquely identifies a work, then that can serve as a working identifier. I typically focus on what can reasonably be assessed of author, title, publication date, publisher (traditional, website/domain), and failing that, descriptive text. Remember that originally titles were simply the introductory lines of works (a practice still in use in some cases, e.g., the names of church masses or prayers, such as "Kyrie Eleison").

The Superintendent of Documents (SuDoc) Classification Scheme (used by the US government and GPO) operates by agency, type of publication, and further divisions, as well as date/year. https://www.fdlp.gov/about-fdlp/22-services/929-sudoc-classi...


Probably because for written text the words identify the content, while the hash relates more to the digital carrier format (PDF vs ePub) and an ID number can change between publications, countries, etc.


Bingo.

And to drag in metadata, it may:

- Not be present.

- Be inaccurately applied to the correct work (metadata say the work is different, work is in fact related/same).

- Be inaccurately applied to the wrong work (metadata say the works are the same/related, they are not).


text to speech the doc then an acoustic fingerprint on the audio :)


You'd all but certainly be better off going in the other direction.

Text is a more constrained state space than speech/audio.


There was that project some guy posted a while back that used a combination of sqlite and partial downloads to enable searches on a database before it was downloaded all the way. If you can fit PDFs somewhere into that you'd be golden.

Or just use IPFS I suppose.


IPFS would face a similar challenge as the “keep torrents seeded” problem mentioned by GP. Wouldn’t there be risk to peers who host the PDFs?


I sort of feel like there should be some way to use some kind of construct to get people to seed things so that others seed things for them, but I haven't seen that invented yet.


Been a while since I've looked at them, but IPFS with FileCoin and Ethereum Swarm had that kind of goal.

It might be beneficial to create something like what you describe without any cryptocurrency association though, and I've been mulling over possibilities for distributed systems that are inherently currency-less to avoid all of the scams that cryptocurrency attracts.


The leader in that space is Skynet, which is basically like IPFS + Filecoin but also has dynamic elements to it, and much better performance and reliability.

Cryptocurrency is helpful because it allows you to incentivize people to hold the data. If you don't have cryptocurrency, you're basically dependent on altruism to keep data alive (like bittorrent, or ipfs without filecoin). Popular files do okay for a while, but even popular files struggle to provide good uptime and reliability after a few years.

On an incentivized network like Sia or Filecoin, you can have high performance and reliability without ever needing any users to keep their machines online and seeding.


Does it scale well? SciHub is at least 100TB.


Skynet already hosts nearly 1,000 TB of data. At 100 TB, SciHub would be material but also comfortable.


I think seed ratios and seed time (mostly used by private trackers) attempt to solve this problem.


What kind of risk?


IPFS is not anonymous and like other p2p protocols shares your ip address. People seeding articles would get legal notices just like torrents now.

There's been a bit of effort to get it working over Tor for years now, but the fundamental design makes this difficult. Also, despite all the money that has poured into Filecoin, this doesn't seem to be a priority.

This issue is nearly 6 years old:

https://github.com/ipfs/notes/issues/37


I was thinking legal risk. In this case the publishers are going after Sci-Hub, but in the past they have gone after individuals.


"There was that project some guy posted a while back that used a combination of sqlite and partial downloads to enable searches on a database before it was downloaded all the way."

https://github.com/bittorrent/sqltorrent



This is the original. Then came https://github.com/lmatteis/net-torrent and later another one, written in JavaScript and inspired by net-torrent.


Isn't that essentially mapreduce? Either way, interesting and I'd love to see the link.



I believe this is the project mentioned

https://github.com/lmatteis/torrent-net


That one looks familiar. Though apparently the same thing has been tried in several different ways going by the replies I got.


This looks like it could be a good approach.


A plain old website, or a publishing house with distribution services and syndication attached, but for a sane price.

"a whole lot of tiny files" severely underestimates the scale at work. Libgen's coverage is relatively shallow, and pdf books tend to be huge, at least for older material. Scihub piggy backs on the publishers, so that's your reference.

Syndication, syndicate: quite apt, don't you think? Libraries that colluded with the publishers and accepted the pricing must have been a huge part of the problem, at least historically. Now you know there's only one way out of a mafia.


At Internet scale it's not a lot of data. Most people who think they have big data don't.

Estimates I've seen put the total Sci-Hub cache at 85 million articles totaling 77 TB. That's a single 2U server with room to spare. The hardest part is indexing and search, but it's a pretty small search space by Internet standards.
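To put rough numbers on that (using the same estimates):

    articles = 85_000_000   # estimated Sci-Hub article count
    total_tb = 77           # estimated total size
    avg_mb = total_tb * 1_000_000 / articles   # ~0.9 MB per article
    drives = total_tb / 16                     # ~5 x 16 TB drives; a 2U chassis holds 12+
    print(f"{avg_mb:.1f} MB/article, {drives:.1f} x 16 TB drives")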


The entire archive actually fits in a small desktop NAS (e.g. QNAP or Synology) with a few 14-18 TB drives; you don't even need a server rack.

There is an existing index in SQL format distributed by Libgen (https://www.reddit.com/r/scihub/comments/nh5dbu/a_brief_intr...); it is around 30 GB uncompressed.

Those 851 torrents uncompressed would probably take half a petabyte of storage, but I guess for serving PDFs you could extract individual files on demand from the zip archives and (optionally) cache them (sketched at the end of this comment). So the Sci-Hub "mirror" could run on a workstation or even a laptop with 32-64 GB of memory connected to a 100 TB NAS over 1 GbE, serving PDFs over a VPN and using an unlimited traffic plan. The whole setup including workstation, NAS, and drives would cost $5-7K.

It's not a very difficult project and can be done DIY style, if you exclude the proxy part (which downloads papers using donated credentials). Of course, it would still be as risky as running Sci-Hub itself, which has a $15M lawsuit pending against it.
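The "extract on demand" part is only a few dozen lines. Something like the following sketch, where the INDEX dict is a made-up stand-in for the real Libgen/Sci-Hub SQL index and all paths are placeholders:

    import zipfile
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import unquote

    INDEX = {
        # doi -> (path to zip shard on the NAS, member filename inside the zip)
        "10.1000/example.doi": ("/mnt/nas/shard_0001.zip", "10.1000%2Fexample.doi.pdf"),
    }

    class PaperHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            doi = unquote(self.path.lstrip("/"))
            entry = INDEX.get(doi)
            if entry is None:
                self.send_error(404, "unknown DOI")
                return
            zip_path, member = entry
            # Pull just this member out of the zip shard; no need to unpack the archive.
            with zipfile.ZipFile(zip_path) as zf, zf.open(member) as pdf:
                data = pdf.read()
            self.send_response(200)
            self.send_header("Content-Type", "application/pdf")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), PaperHandler).serve_forever()

A real mirror would stream instead of reading whole files into memory and would put a cache in front, but the shape is the same.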


The entire Library of Congress books collection is on the order of 40 million items.

At 5 MB per book, this works out to about 200 TB of disk storage.

At about $12/TB, storing the entire LoC collection would cost roughly $2,400 in raw disk at present, with prices halving about every three years.
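Spelled out as a back-of-envelope calc (the 5 MB/book and $12/TB figures are the assumptions above):

    books = 40_000_000   # LoC book collection, order of magnitude
    mb_per_book = 5      # assumed average, including page scans
    usd_per_tb = 12      # rough bulk-disk pricing

    total_tb = books * mb_per_book / 1_000_000   # 200 TB
    disk_usd = total_tb * usd_per_tb             # ~$2,400 in raw disk
    print(f"{total_tb:.0f} TB, ~${disk_usd:,.0f}")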


Note that $2,400 is disks alone. You'd obviously need chassis, power supplies, and racks, though that's only seventeen 12 TB drives.

Factor in redundancy (I'd like to see triple-redundant storage at any given site, though since sites are redundant across each other, this might be forgoable). Access time and high demand are likely the bigger factors, though caching helps tremendously.

My point is that the budget is small and rapidly getting smaller. For one of the largest collections of written human knowledge.

There are some other considerations:

- If original typography and marginalia are significant, full-page scans are necessary. There's some presumption of that built into my 5 MB/book figure. I've yet to find a scanned book of > 200MB (the largest I've seen is a scan of Charles Lyell's geology text, from Archive.org, at north of 100 MB), and there are graphics-heavy documents which can run larger.

- Access bandwidth may be a concern.

- There's a larger set of books ever published, with Google's estimate circa 2014 being about 140 million books.

- There are ~300k "conventionally published" books in English annually, and about 1-2 million "nontraditional" (largely self-published), via Bowker, the US issuer of ISBNs.

- LoC have data on other media types, and their own complete collection is in the realm of 140 million catalogued items (coinciding with Google's alternate estimate of total books, but unrelated). That includes unpublished manuscripts, maps, audio recordings, video, and other materials. The LoC website has an overview of holdings.

Published document scarcity is entirely imposed.


It still amazes me that 77TB is considered "small". Isn't that still in the $500-$1,000 range of non-redundant storage? Or if hosted on AWS, isn't that almost $1,900 a month if no one accesses it?

I know it's not Big Data(tm) big data, but it is a lot of data for something that can generate no revenue.


> Isn't that still in the $500-$1,000 range of non-redundant storage?

Sure. Let's add redundancy and bump by an order of magnitude to give some headroom -- $5-10k is a totally reasonable amount to fundraise for this sort of application. If it were legal, I'm sure any number of universities would happily shoulder that cost. It's minuscule compared to what they're paying Elsevier each year.


Sorry. My point was it was a lot of money precisely because it cannot legally exist. If it could collect donations via a commercial payment processor, it could raise that much money from end users easily. Or grants from institutions. But in this case it seems like it has to be self-funded.


I'm prepared to accept "does generate no revenue" but "can generate no revenue" ...?

Perhaps some sort of MTurk or captcha-like tasks per access? Patr[e]ons? Donation drives? Micro-payments? Something else??


Oh, it could generate revenue if it was legal. But it is not, so it seems difficult.


For an institution, it's a rounding error.

AWS is not the cheapest bulk-storage hosting possible.


Google already does a pretty good job with search. Sci-Hub really just needs to handle content delivery, instead of kicking you to a scientific publisher's paywall.


If the sane price is an optional "Donate to keep this site going" link, then OK. But only free access to scientific papers, without authentication or payment, is sane. IMHO.


Might this be a case where the best resolution would be to have the government (which is at least partially funding nearly all of these papers) step in and add a ledger of papers as a proof of investment?

The cost of maintaining a free and open DB of scientific advances and publications would be so incredibly insignificant compared to both the value and the continued investment in those advancements.


> Might this be a case where the best resolution would be to have the government (which is at least partially funding nearly all of these papers) step in and add a ledger of papers as a proof of investment?

I feel that we're halfway there already and are gaining ground. Does Pubmed Central [0] (a government-hosted open access repository for NIH-funded work) count as a "ledger" like you're referring to? The NSF's site does a good job of explaining current US open access policy [1]. There are occasional attempts to expand the open access mandate by legislation, such as FASTR [2]. A hypothetical expansion of the open access mandate to apply to all works from /institutions/ that receive indirect costs, not just individual projects that receive direct costs, would open things up even more.

[0] https://www.ncbi.nlm.nih.gov/pmc/

[1] https://www.nsf.gov/pubs/2016/nsf16009/nsf16009.jsp#q1

[2] https://sparcopen.org/our-work/fastr/


Well, some research venues (and publication venues) are not government-funded, and even if they are indirectly government funded, it's more of a sophistry than something which would make publishers hand over copies of the papers.

Also, a per-government ledger would not be super practicable. But if, say, the US, the EU, and China agreed on something like this, implemented it, and had a common ledger, then it would not be a big leap to make it properly international. Maybe even UN-based.

That's a pretty big "if" though.


I share the sentiment insofar as free access would benefit my own sanity, except when it is about hoarding.

On the other hand, it's a slippery slope to decide what isn't scientific enough to be required to be open knowledge.

By the way, specialist knowledge and open knowledge are kind of a dichotomy; you would need to define the intersection of both. Suddenly you are looking at a patent system: pay-to-quote, citation fees. News websites here in Germany, including Springer Press, are already demanding this from Google.


Libgen's coverage is definitely more shallow than scihub, but it is still pretty good.


There are already torrents of the archives. But supposing Sci-Hub were taken down, it's pretty non-trivial to get from the archive back to a working site with search functionality. For one thing, none of Sci-Hub's code is available.


Seems like what should be in each torrent is a virtual appliance preloaded with one shard of the data, where that virtual appliance has a pre-baked index for searching just that shard's data. Then one more torrent for a small search-coordinator appliance that fans your search query out to all N shard appliances.
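The coordinator is the easy part. As a sketch, where the shard hostnames and the /search?q= endpoint are hypothetical stand-ins for whatever the per-shard index actually exposes:

    import json
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import quote
    from urllib.request import urlopen

    SHARDS = [f"http://shard-{i:03d}.local:8080" for i in range(32)]

    def query_shard(base_url: str, query: str) -> list:
        try:
            with urlopen(f"{base_url}/search?q={quote(query)}", timeout=5) as resp:
                return json.load(resp)   # each shard returns a JSON list of hits
        except OSError:
            return []                    # an unreachable shard just contributes nothing

    def search(query: str) -> list:
        # Fan the query out to every shard appliance in parallel and merge the hits.
        with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
            per_shard = pool.map(lambda s: query_shard(s, query), SHARDS)
        return [hit for hits in per_shard for hit in hits]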


BitTorrent does allow you to download a single file from a torrent, though. You could have a torrent for each month, and a client which allows you to search inside these torrents and download only the files you need.
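The download half of that already works with file priorities, e.g. via the python-libtorrent bindings; the torrent name and target filename below are just placeholders:

    import time
    import libtorrent as lt

    info = lt.torrent_info("scihub-2021-04.torrent")      # one month's archive
    ses = lt.session()
    handle = ses.add_torrent({"ti": info, "save_path": "./papers"})

    wanted = "10.1000/example.doi.pdf"                    # the one paper we need
    priorities = [0] * info.num_files()                   # 0 = don't download
    for i in range(info.num_files()):
        if info.files().file_path(i).endswith(wanted):
            priorities[i] = 4                             # default priority
    handle.prioritize_files(priorities)

    while handle.status().progress < 1.0:                 # progress only counts wanted files
        time.sleep(1)
    print("done")

The search half would still need a separate index (e.g. the Libgen SQL dump) mapping a DOI to torrent + filename.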


Maybe Usenet? It already supports massive copyright infringement, yet it is still around.


Maybe we can create a freesite, on Freenet.


If only it were possible to use Chia to store content... it would be a game changer.


IPFS seems like a perfect fit for this and some of the scihub torrents are already in IPFS, but it's not an anonymous network.

IPFS, via the DHT, tells the network your whole network topology, including any internal addresses you may have, and VPN endpoints too. It's all public by design, as they don't want to associate IPFS with piracy, per one of their developers.

This thread has some discussion of the alternatives: https://www.reddit.com/r/DataHoarder/comments/nc27fv/rescue_...


Can files be taken down off IPFS? There was a fairly widely circulated link that had all the IEC and ANSI standards on there that has since been taken down.


You could use libp2p's DHT over Tor (I did a poc of this long ago, and the situation's only improved). Combined with other libp2p/IPFS components, you can essentially have a private IPFS over onion services (not to be confused with accessing the existing IPFS network via Tor exit nodes).


If two people add the same file to IPFS independently on their side, will the IPFS hash be the same?


Yes


So what? BitTorrent is all public too. It hasn't prevented the piracy scene from making releases.

What IPFS provides over torrents here is the ability to add more files instead of creating new torrents.


Isn't scihub already on IPFS?


Some torrent files are archived there, but I don't think Sci-Hub is serving the PDFs from IPFS; they likely use a private storage network.

I believe libgen.fun, which is a new (official) Libgen mirror, is running fully on IPFS, and it serves some scientific papers, but I wasn't able to search by DOI or title there; it looks like it redirects to Sci-Hub, and there is no index of the papers on IPFS.

Edit: this doc talks about scihub+ipfs (it was created by the leader of Scihub rescue effort on Reddit, /u/shrine): https://freeread.org/ipfs/


You can build something like this on Skynet. Links on Skynet (called skylinks) can point to hashes of data (similar to IPFS) or can point to a pubkey. If the link points to a pubkey, the data can be tweaked repeatedly and people with the old links will see the new data.

You can also point to full webapps. So for example you could have a pubkey point to a webapp that then loads the database from a list of moderators who update the data under their pubkeys to point to the latest version of scihub. Then you can even have different moderators curate articles of different subjects, and the webapp can combine everything together.

All the data is stored on Sia, which is a decentralized cloud storage platform. Skynet allows anyone to upload or download data from Sia directly from the web, no need for extensions or custom software. The Sia network handles what IPFS calls "seeding", so contributors don't need to worry about leaving their machines on.


Torrents are great and all, but they're dependent on people seeding them. Sci-Hub/Libgen is great because you don't have to worry about a download suddenly breaking because no one is seeding.


But they could just always be a seeder. Doesn’t that have the upsides of the existing solution plus resiliency?


Isn't scihub/libgen already backed by torrents? I'm confused.


Nope, it was mostly direct download for the end users, with some IPFS in some places.


From what I understand, the authors are still free to send you their papers. So perhaps the simplest decentralized system is for each author to run an automatic email request system, plus feeds / aggregators of paper titles/abstracts with author emails, to make it easy to get the papers you want?


That would only work for a tiny, tiny intersection of authors: 1) technical enough to install and maintain that, 2) with a still-working email address (if there is one on the paper!), 3) still alive.

That's probably a really small set of scientists.


Thanks for the background link. I did not know about that and it's a good incentive to donate them some money for the legal battle.

TL;DR of the link: no more uploads, to support a court case in India which Sci-Hub might win and thus establish a legal basis for operation in the world's biggest democracy.


When donating Bitcoin, make sure to get the address from the official Sci-Hub mirrors, which are currently sci-hub.do, sci-hub.st, or sci-hub.se.

There are some unaffiliated "mirrors" that only redirect to Sci-Hub but list their own Bitcoin address for donations, so beware.

/r/scihub on reddit keeps track of this https://www.reddit.com/r/scihub/wiki/index


Sci-hub is such a great example of a clear and compelling use-case for Bitcoin. Bitcoin is censorship-resistant money that doesn't rely on countries, laws, central bankers or politicians. The US dollar cannot be used for purposes not aligned with the US government. Sometimes ideas that the US Government doesn't agree with can be useful (e.g. Wikileaks, Sci-hub.)

When I hear complaints that Bitcoin has no use except for speculation, I think of Sci-hub, Wikileaks and other organizations that may be bad for the interests of the US government but may be good for mankind.


It's a censorship resistant technology that also indelibly records, publicly, every transaction you ever participated in. Talk about a double-edged sword...


Agreed, it's an interesting trade-off: completely private if you can use an anonymous address, but completely traceable if the address is identified. I wonder if Satoshi intended it that way or if he would have preferred the greater anonymity of Monero.


Wouldn't IPFS (https://ipfs.io/) work well?


This is where IPFS shines.

In fact, there is already an IPFS mirror of Sci-Hub.


This might be a good use case for Arweave.


I'm growing skeptical of the idea that every science text should be free. For a community that values privacy, it's hypocritical that this widespread leaking is so well accepted.

The fact that Sci-Hub needs to hop domain names every couple of months and that ISPs are starting to block the website seems to confirm my theory, rather than imply some worldwide conspiracy.


How is that a leak? Only the publishing companies are making money. Authors, reviewers and often editors all work for free.


I don't think they work for free; they have a salary, usually paid by universities, I think.


What does this have to do with privacy?


Publishers want to keep the papers relatively private.


Except from literally anyone that pays them a monthly fee?

That's not privacy.



