Hacker News
Archivists Are Trying to Make Sure LibGen Never Goes Down (vice.com)
908 points by legatus on Dec 3, 2019 | 257 comments

This is an extremely important effort. The LibGen archive contains around 32 TB of books (by far the most common being scientific books and textbooks, with a healthy dose of non-STEM). The SciMag archive, backing up Sci-Hub, clocks in at around 67 TB [0]. This is invaluable data that should not be lost. If you want to contribute, here are a few ways to do so.

If you wish to donate bandwidth or storage, I personally know of at least a few mirroring efforts. Please get in touch with me over at legatusR(at)protonmail(dot)com and I can help direct you towards those behind this effort.

If you don't have storage or bandwidth available, you can still help. Bookwarrior has requested help [1] in developing an HTTP-based decentralizing mechanism for LibGen's various forks. Those with experience in software may help make sure those invaluable archives are never lost.

Another way of contributing is by donating bitcoin, as both LibGen [2] and The-Eye [3] accept donations.

Lastly, you can always contribute books. If you buy a textbook or book, consider uploading it (scanning it first, should it be a physical book) in case it isn't already present in the database.

In any case, this effort has a noble goal, and I believe people of this community can contribute.

P.S. The "Pirate Bay of Science" is actually LibGen, and I favor a title change (I posted it this way so as to comply with HN guidelines).


[1] https://imgur.com/a/gmLB5pm

[2] bitcoin:12hQANsSHXxyPPgkhoBMSyHpXmzgVbdDGd?label=libgen, as listed in https://it.wikipedia.org/wiki/Library_Genesis

[3] Bitcoin address 3Mem5B2o3Qd2zAWEthJxUH28f7itbRttxM, as found in https://the-eye.eu/donate/. You can also buy merchandise from them at https://56k.pizza/.

Sounds like anyone with a seed box could donate some bandwidth and storage by leeching then seeding part of it? It would be nice if there were a list of seeder/leecher counts (like TPB), or better yet a priority list of parts that need more seeders.

Edit: Found the other comment where you link to the seeding stats: https://docs.google.com/spreadsheets/d/1hqT7dVe8u09eatT93V2x...

Or better yet, an RSS feed that plays nicely with auto-retention and quota settings. It just delivers a bunch of parts that are in need of seeders, and you use your existing mechanism to help with them.

For important archives like this, maybe we need some sort of turn-key solution for the masses? Like a Raspberry Pi image that maintains a partial mirror. Imagine if one could buy an RPi and an external HD, burn the image, and connect it to some random wifi network (at home, at work, at the library, etc.).
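At its core, the turn-key idea reduces to "pick the least-seeded parts that fit your disk." A rough sketch of that selection logic; the stats format, part names, and sizes here are hypothetical, not LibGen's actual data:

```python
# Sketch of the "priority seeding" idea: given seeder counts per torrent
# part, seed the least-healthy parts that fit a local quota.
# All names and numbers below are made up for illustration.
def pick_parts(seeders, sizes_gb, quota_gb):
    """seeders/sizes_gb: dicts keyed by part name. Fewest seeders first."""
    chosen, used = [], 0.0
    for name in sorted(seeders, key=seeders.get):  # least-seeded first
        if used + sizes_gb[name] <= quota_gb:
            chosen.append(name)
            used += sizes_gb[name]
    return chosen

seeders = {"r_901": 2, "r_902": 14, "r_903": 1}
sizes = {"r_901": 40.0, "r_902": 40.0, "r_903": 40.0}
print(pick_parts(seeders, sizes, quota_gb=80.0))  # ['r_903', 'r_901']
```

A real image would re-run this periodically against published stats and hand the result to an existing torrent client.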

I'm not hosting a copy of this at work (where we easily have 32TB on old hardware) since distributing it is copyright infringement. The same goes for my home connection.

Most people don't care. The chance anything at all bad will happen is so incredibly low.

This isn't even movies, where some large studios can send notices. I don't think publishing houses have the funds to send that many legal notices.

Books are a safe bet to pirate

That's what people said about music and films too. You don't want to be the next Jammie Thomas.

This is an existential threat to the deep-pocketed likes of Elsevier et al. They will use the law to make an example of anyone too close to their sphere of influence. So if you are in the US or the EU: support the efforts of LibGen vocally and loudly, and contribute anonymously, but don't risk your neck to the extent that they can get a hold of you.

There are plenty of ways to support the effort safely, though. Make sure people who wish to access scientific papers and books know where to go, and make sure your elected officials know about the need for publicly funded science to be published free of charge and open access (retroactively, too).

I'm guessing a pretty significant minority of HN's users maintain offshore seed boxes to get other copyrighted content and for them it might be pretty trivial to add partial peering of libgen content.

I think a turn-key solution for people living outside the US/EU will still help the general health of the archive.

At least the large academic publishers are sitting on enormous stacks of cash, so that argument doesn't fly.

I just read the article and your comments here and I'm a bit unsure what's the difference to the Internet Archive. Is it that the IA can archive them but not make them public for legal reasons and The-Eye is more focused on keeping them online and accessible no matter what?

Yes. It is extremely likely IA has the LibGen corpus archived, but darked (inaccessible), to prevent litigation.

There are quite a few such copies, on the 'just in case' principle.

> Lastly, you can always contribute books. If you buy a textbook or book, consider uploading it (and scanning it, should it be a physical book) in case it isn't already present in the database.

There's no easy solution for scanning physical books, is there?

There are providers [1] that will destructively scan the book for you and return a PDF. If you want to preserve the book, you're stuck using a scanning rig [2]. The Internet Archive will also non-destructively scan as part of Open Library [3], but they only permit one checkout at a time of scanned works, and the latency can be high between sending them a book and it becoming available. FYI, 600 DPI is preferred for archival purposes.

[1] http://1dollarscan.com/ (no affiliation, just a satisfied customer; they can't scan certain textbooks due to publisher threats of litigation)

[2] https://www.diybookscanner.org/

[3] https://openlibrary.org/help/faq

A big +1 for 1dollarscan.com. They've scanned many hundreds of books for me. The quality of the resulting PDFs is uniformly excellent, their turnaround time is fast, and their prices are cheap ($1 per 100 pages).

I've visited their office -- located in an inexpensive industrial district of San Jose -- on multiple occasions. They have a convenient process for receiving books in person.

I believe the owners are Japanese and the operation reminds me of the businesses I visited in Tokyo: quiet, neat, and über-efficient.

> quiet, neat, and über-efficient

I wish the same could be said for the Tokyo office I work in!

I will add a vote for bookscan.us, which I have been using since 2013 or so. Very reasonable prices and great service.

There are DIY book scanners (http://diybookscanner.org) and products such as the Fujitsu ScanSnap SV600. The SV600 has decent features like page-detection and finger-removal (I recommend using a pencil's eraser tip). I have personally used it to scan dozens of books, with satisfactory results.

Just saw a father who had to do it fully manually for his blind daughter. I shall show your comment to him.

Scanning with your phone is getting easier. At a minimum you can take a pic of each of the pages. Software can clean up the images, sorta. It's not ideal but it's better than nothing.

I remember when "cammed" books were bottom-tier and basically limited to things like 0day releases, even when done with an expensive DSLR. It's amazing how much camera technology has progressed since then; in less than a second you can get a high-resolution, extremely readable image of each page.

I used to participate in the "bookz scene", well over a decade ago. Raiding the local public libraries --- borrowing as many books as we could --- and having "scanparties" to digitise and upload them was incredibly fun, and we did it for the thrill, never thinking that one day almost all of our releases would end up in LibGen.

I found vFlat to be magical in cleaning up book scan images you took with your phone.


>This app is incompatible with your device.

my disappointment is immeasurable and my day is ruined

I use bookscan.us for this purpose: I mail the physical book to them and they send me a file a few days later for a very reasonable price.

Unfortunately it’s a destructive process.

Your local physical library may make a book scanner available. Mine does, with a posted 60-pages-at-a-time limit (though I don't know how this is enforced).

Mind explaining the origin of your 32 TB figure? I must be missing something enormous, but as far as I can tell the SciMag database dump is 9.3 GB, the LibGen non-fiction dump is 3.2 GB, and the LibGen fiction dump is 757 MB. That's a pretty huge divergence.

Source: http://gen.lib.rus.ec/dbdumps/

Oh, wait. I'm dumb. I see that your first link is a citation.

Continuing to be dense, why is there a difference between their "database dump" and the total of all the files they have?

The databases contain the metadata (authors, edition, ISBN, etc.) for the books.

Thus, 32 TB of books (over 2 million titles), 3.2 GB database.

Ah, that makes sense.

To make sure I'm understanding this correctly:

The Libgen Desktop application (which requires only a copy of the database) would then use the DB metadata to make LibGen locally searchable, and would only retrieve the individual books/papers on request?

I guess it's stunningly obvious to everyone else, but how are you certain the replacement isn't worse than the original system? I already see comments about the curation problem, for example. What's the point in making bad information (duplicate information, etc.) highly available? Why put so much faith in this donation strategy, i.e. donating bandwidth or donating money?

The new architecture of pirate sites, what I call the Hydra architecture, seems pretty interesting to me. There isn't a single site hosting the content, but a group of mirrors freely exchanging data between one another. In case some of them go down, the other ones still remain and new ones can appear, copying data from the remaining mirrors. This is like a hydra that grows two heads every time you chop one off. It's absolutely unkillable, as there's no single group or server to sue.

A more advanced version of this architecture is used by pirate addons for the Kodi media center software. Basically, you have a bunch of completely legal and above-board services like IMDB that contain video metadata. They provide the search results, the artwork, the plot descriptions, episode lists for TV shows, etc. Impossible to sue and shut down, as they're legal. Then, you have a large number of illegal services that, essentially, map IDs from websites like IMDB to links. Those links lead to websites like Openload, which let you host videos. They're in a gray area: if they comply with DMCA requests and are in a reasonably safe jurisdiction, they're unlikely to be shut down.

On the Kodi side, you have a bunch of addons. There are the legitimate ones that access IMDB and give you the IDs, the not-so-legitimate ones that map IDs to URLs, and the half-legitimate ones that can actually play stuff from those URLs (not an easy task, as websites usually try to prevent you from playing something without seeing their ads). Those addons are distributed as libraries, and are used as dependencies by user-friendly frontends. Those frontends usually depend on several addons in each category, so, in case one goes down, all the other ones still remain. It's all so decentralized and ownerless that there's no single point of failure. The best you can do is kill the frontend addon, but it's easy to make a new one, and users are used to switching them every few months.
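The layered-resolver idea behind that architecture can be sketched abstractly; every provider below is a hypothetical stand-in, not a real service or addon:

```python
# Abstract sketch of the "hydra" resolver chain: each stage tries several
# independent providers, so losing any one of them doesn't break the chain.
def resolve(providers, query):
    """Try each provider in turn, surviving individual failures."""
    for provider in providers:
        try:
            return provider(query)
        except Exception:
            continue  # this head of the hydra is down; try the next one
    raise LookupError("all providers down")

# Stage 1: metadata services map a title to a stable ID (the legal layer).
def imdb_like(title):
    return {"some show": "id42"}[title]

# Stage 2: link services map the ID to a hosted URL (the churny layer).
def dead_mirror(vid):
    raise OSError("taken down")

def live_mirror(vid):
    return f"http://host.example/{vid}"

video_id = resolve([imdb_like], "some show")
url = resolve([dead_mirror, live_mirror], video_id)
print(url)  # http://host.example/id42
```

The design choice worth noting: redundancy lives in the dependency list, so replacing a dead provider means editing a list, not rebuilding the frontend.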

> It's absolutely unkillable

Just like any other distributed system, this is vulnerable to organized take downs and scare tactics. There was a whole bunch of mirrors of Pirate Bay, yet once most of Europe's legal systems adopted the "sharing is theft" mindset, it became pretty much impossible to find one.

But now the main site seems to be bulletproof. There was a time when there would be a new official link weekly. I'm not sure what changed structurally with hosting TPB.

They just stopped going after it, and focused resources on stopping streaming websites

I was just in Europe and used piratebay while I was there. The main site didn’t work, but searching “piratebay mirror” found one that did right away.

Here in Norway, ISPs are actually legally obligated to block access to The Pirate Bay. Mirrors work.

> there's no single point of failure. The best you can do is killing the frontend addon

Single decentralized service, providing access to all content, national and international, free of DRM, for all platforms, for a proper, fair, and non-monopolist price.

That will pull all the users who are willing to pay for content over to the paid service, and those who remain were not willing to pay regardless of what you did anyhow.


I worry that if this system becomes permanent, one in which it is practically impossible to stop piracy, then with the loss of traditional incentives we might find ourselves in a place where no motivated investor can break even producing quality content.

Most of the stuff on scihub was funded by tax dollars

In that case, we will just pass around the same old stuff until we get bored enough that we'll actually pay for new stuff.

Some people might decide to pay but the technology will be there to distribute it for free. At this point it would be sort of like a public good with the free rider problem.

"Motivated investors" don't produce quality content anyway; they produce mass-market swill like Twilight or Game of Thrones.

Most people use sites like these to pirate the "swill" you despise so much.

The Yongle Encyclopedia was a similar project in 15th-century China. It was the largest encyclopedia in the world for 600 years, until surpassed by Wikipedia.

Alas, the Yongle Encyclopedia is almost completely lost now. Archiving is harder than you think.


WP says that it was never printed for the general public. Hmmm. Had it been (parts duplicated, say, at hundreds of sites), most of it would probably have survived.

I read the Wikipedia article about it, and the sad thing is that the majority of the Yongle Encyclopedia seems to have been destroyed only in quite recent times.

> but 90 percent of the 1567 manuscript survived until the Second Opium War in the Qing dynasty. In 1860, the Anglo-French invasion of Beijing resulted in extensive burning and looting of the city,[16] with the British and French soldiers taking large portions of the manuscript as souvenirs.

Preservation is easy if you don't get invaded.

It's easy if you anticipate these things. Who put the Dead Sea Scrolls in that cave in the middle of nowhere? Not someone who went in and forgot their scroll one day, but someone who had the foresight that this would be a safe place in the face of who knows what future threat. And it paid off.

I doubt it was that intentional. I would wager that "someone forgot about it" is the more likely explanation.

Maybe we should print this out on acid-free, paper-thin, flexible wood-pulp sheets stitched together to form linear organized aggregations. Each aggregation would contain one or more works and be searchable using a SQL-like database. To make this plan really work, there would need to be a collection of geographically distributed long-term physical repositories that would receive periodic updates as new material became available.

All joking aside, I do wonder whether digital or analogue formats are better able to survive into the distant future.

* What impact will DRM have on the accessibility of our knowledge to future historians?

* Is anything recoverable from a harddrive or flash media after 500 years in a landfill?

* Will compressed files be more or less recoverable? What about git archives?

* Will the future know the shape of our G.I. Joe toys but not the content of the G.I. Joe cartoon?

> I do wonder whether digital or analogue formats are better able to survive into the distant future.

There are 5000 year old clay tablets we can still read.

There are centuries old documents on paper, vellum etc. that we can still read.

I personally have decades-old paper documents I can easily read, and a box of floppies I can't.

It's not just a problem of unreadable physical media, I have a database file on a perfectly readable HD that was generated by an application that is no longer available. I might be able to interrogate it somehow, but it won't be easy.

Digital formats and connectivity make LOCKSS easier, so that's a plus. There's less chance of a fire or flood or space-limited librarian destroying the last known copy. However, without archivists actively transforming content to new formats as required, it might only take a few decades before a lot of content starts to require a massive effort to read.

Clay is the plastic of the ancient world.

Let's say pB is the probability that a single copy of a physical book survives 1,000 years, is found, and is understood by an archaeologist, and pD is the same probability for a single copy of a book on an SSD. Even if pB is far larger than pD, there might be so many more copies of a single book held on SSDs that the book is more likely to survive via an SSD than via a physical book. On the other hand, the technology to recover data from SSDs might not exist in 1,000 years.

It could also be the case that each generation would copy these books onto new digital media, providing an unbroken chain of copies. The oldest copy of the Iliad is the Venetus A, which is from 1000 AD (1,000 years ago), despite the Iliad probably first being written down around 800 BC (2,800 years ago). It was copied from earlier copies of copies of copies.

I really don't know how this will play out and I've been unable to find research on how long SSD and flash memory based media survives especially if buried in a landfill.

* - If archaeologists exist in the future. The current push from the STEM boosters to defund and de-emphasize the humanities may result in a near-future without archaeologists or funded archaeological projects. Over 1,000 years the entire field could die.
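The copy-count argument above can be made concrete: with n independent copies, each surviving with probability p, the chance that at least one makes it is 1 - (1 - p)^n. A quick sketch, with all numbers made up purely for illustration:

```python
# P(at least one of n independent copies survives) = 1 - (1 - p)^n.
def survival(p, n):
    return 1 - (1 - p) ** n

# Illustrative numbers only -- both the per-copy probabilities and the
# copy counts are assumptions, not measurements.
pB, nB = 1e-3, 100          # physical books: better odds per copy, few copies
pD, nD = 1e-6, 1_000_000    # SSD copies: worse odds per copy, many copies

print(round(survival(pB, nB), 3))  # 0.095
print(round(survival(pD, nD), 3))  # 0.632
```

With these (made-up) numbers the SSDs win despite each copy being a thousand times more fragile, which is the whole point of the argument.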

> thus making it more likely the book will survive via an SSD than a physical book

Yes. That's what I mean by LOCKSS being easier.

> is found and is understood by an archaeologist,

There is a problem with merging these two probabilities.

The probability of finding a book is of course massively smaller than the probability of finding a digital copy.

The probability of understanding a book is so much greater than the probability of understanding a file on a disk.

This makes it more likely that the physical book will survive in a meaningful way.

> It could also be the case that each generation would copy these books onto new digital media

This is what I mean by archivists actively transforming the content. Regarding written content like the Iliad, copies and translations can be made centuries apart. Content in digital formats may need to be transformed whenever the application that reads it is discontinued.

Would an SSD even function after 1000 years? Unless sealed, I imagine ambient moisture would do a number inside the drive. The same is true for books of course, but we still have 1000 year old books that have lasted by sitting on a shelf in churches and temples, etc., without any specific care until recent history.

The nice part of a book in an apocalyptic scenario is that you can copy it even if you don't know the language. You don't need a special tool for this, only one capable of marking a surface. It wouldn't be fun or fast, but it's possible and it's what monks did for centuries. Would archeologists 1000 years from now be lucky enough to find a SATA cable too?

It doesn't really matter if the SSD as a whole still works, because after 1000 years you'll never recover the data via the normal interface. Modern MLC flash is often specified for less than 1 year data retention, and even SLC is unlikely to make it to 1000 years. Attempting to read it will only make things worse ("read disturb"). The best hope of saving the data is with some future nanotech that directly probes each floating gate transistor and counts the electrons, and reverse engineering all the error correction and wear leveling.

I would assume they would read the SSD not by powering it on and plugging it into a computer, but by disassembling it and imaging the physical structure directly. This would also bypass all the wear-leveling infrastructure, allowing them to recover deleted data. It reminds me of the current technique of using x-rays to read writing on the odd scraps of paper used to bind a book [0].

[0]: "X-rays reveal 1,300-year-old writings inside later bookbindings" https://www.theguardian.com/books/2016/jun/04/x-rays-reveal-...

No one is proposing we use floppy disks.

Redundant, shared servers ARE a forever solution. Making sure your data is on one of the ones that makes it seems like a vastly easier proposition to me than writing data to clay tablets and trying to keep those from ending up in a dump somewhere.

What is the likelihood that historians a century or two hence will have an application capable of turning an ISO 32000-1 file into a human-readable text?

If we are talking about archaeologists, rather than historians, even ASCII and Unicode could be a challenge to work out.

Because those hundreds of years don't transpire in an instant. At some point in the middle there will be deprecated formats and new ones, and transcoders you can batch-run. Sure, it relies on intervention, but the upside is that anyone and everyone else can copy the one person's work.

Yes we should learn from history, but we should also not assume that everything that happened before will happen the same way again, given how much of our world has changed.

> However, without archivists actively transforming content to new formats as required, it might only take a few decades before a lot of content starts to require a massive effort to read.

More effort than batch reading physical books and tablets in old languages?

You can reuse interfaces easier on data, and current ML could probably pull some of the weight of interpreting old data right now, not to mention what we have 50 years from now.

0.99999 at least.

Compare the capabilities of digital historians today to those 10- and 20-years ago respectively. It’s night and day.

This is not a solvable problem without technological continuity, or some technology smarter than anything we can imagine today.

If you found a mysterious archive object and had no idea what it was - CD-R, hard drive, SSD, whatever - not only would you have to reinvent an entire hardware reader around it, you would also have to work out the file structure, extract the data (some of which could be damaged), and reverse engineer the container file formats and the data structures inside them.

If you got all of that right, you'd eventually be able to start trying to translate the content of the text, audio, images, videos (how many compression formats are there?) into something you could understand.

A much more advanced civilisation would struggle with making a cold start on all of that. In our current state, we'd get nowhere if we didn't already have some records explaining where to begin.

Take a CD-R of some MP3 with English language file names stored on a FAT32 filesystem for example. Assume the reflective layer didn't rust since it was abandoned in a dry climate and our future archaeologist has access to roughly modern levels of technology.

1. Even if the CD-R has been crushed and shattered, you could use a modern, cheap microscope to read continuous pits and lands off the disk [0,1]. It would be clear to anyone familiar with information theory how to translate the pits and lands into a series of arbitrary symbols which encode data.

2. This data would at first be meaningless. However, the mathematical relationships of a simple error-correcting code would stand out. This would allow them to recover corrupted data. Once the error-correcting code was stripped out, they would have a transcript of the raw data.

3. They would notice a pattern in the data. There would be long high-entropy regions and then very short low-entropy regions. They would probably notice that some of the low-entropy regions had every 8th bit set to zero (ASCII) and, if taken in 8-bit chunks, these regions had roughly the same number of symbols as the Latin alphabet. If they were familiar with English, they might quickly decode these regions using letter-frequency correspondence with another English text.

4. The high-entropy regions would be far harder to decode. However, these future archaeologists would be faced with the obvious data patterns of the frames of an MP3. Decoding the first MP3 would be a serious project involving many institutions over many years, but once it was done it would allow the decoding of all artifacts that use the MP3 and related encoding formats. Possibly someone would find a "rosetta file" [2], a disk that contained both a .wav file and an encoded MP3 of the same song. More likely, someone would find an MP3 player and then reverse engineer the decoding algorithm.

[0]: "Being able to see the tracks and bits in a CD-ROM" https://superuser.com/questions/870776/being-able-to-see-the...

[1]: "CD-ROM Under the Microscope" https://www.youtube.com/watch?v=RZUxemOE07Q

[2]: https://en.wikipedia.org/wiki/Rosetta_Stone
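Steps 2 and 3 above lean on entropy and the ASCII high-bit pattern; here is a minimal sketch of how the two kinds of region separate statistically. The sample data is simulated (a repeated file name and random bytes standing in for MP3 frames), not a real CD-R dump:

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte: near 8 for compressed audio, much lower for text."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Simulated dump regions: ASCII file names vs. MP3-frame-like noise.
ascii_region = b"greatest_hits_track_01.mp3 " * 40
mp3_like_region = os.urandom(1024)

print(shannon_entropy(ascii_region))     # low: only a couple dozen symbols
print(shannon_entropy(mp3_like_region))  # close to 8 bits per byte

# The step-3 giveaway: every byte's high bit (every 8th bit) is zero.
assert all(b < 128 for b in ascii_region)
```

A scan over a real dump with a sliding window of this entropy measure would mark exactly the boundaries described in step 3.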

I mean, archaeology and linguistics have been figuring out ancient languages as an entire field, while determined individual hobbyists are able to reverse engineer unknown file formats.

By which I mean, many file formats are syntactically much simpler and more obviously structured than natural languages. It might take an entire field to reverse engineer weird formats like .DOC once all knowledge gets lost, but I doubt this will be the case for bitmaps or UTF-8 ...

Bitmaps are easy enough, but I wouldn't bet on UTF-8.

And any modern compression is probably right out without technological continuity.

I think if you gave a philologist living in 1880 AD a clay tablet with a binary inscription of a fragment of an English poem encoded in UTF-8, they would decode it very quickly.

This is what the philologist would see:


How it would probably go:

1. Hmmm, there are only two symbols, A and B; these symbols can't be words, since no language has only two words. Thus the words must be made of strings of these symbols.

2. Every 8th symbol* is an A. Let's try putting the symbols in groups of eight.

3. These groups of 8 can't be words because they repeat far too often and they would only allow 128 possible words. Thus these groups of 8 might be letters in an alphabet.

4. Does the frequency of these possible letters fit any known language? Yes: English.

5. Which group of 8 is "e"?

A few minutes later and the clay tablet is decoded.

* - This is not always true in UTF-8, but it is true in most encodings of Latin alphabets, including this example. Even with some variable-length characters thrown in, this fact would stand out.
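The steps above can be checked mechanically; here is a minimal sketch using a made-up line of English in place of the tablet's poem:

```python
from collections import Counter

# A made-up line of English standing in for the tablet's inscription.
text = "sing o goddess the anger of achilles son of peleus"
bits = "".join(f"{ord(c):08b}" for c in text)  # the philologist's A/B string

# Step 2: every 8th symbol (each byte's high bit) is the same.
assert all(bits[i] == "0" for i in range(0, len(bits), 8))

# Step 3: group into 8s; far fewer than 128 distinct groups -> an alphabet.
groups = [bits[i:i + 8] for i in range(0, len(bits), 8)]
print(len(set(groups)))  # 17 distinct "letters" in this sample

# Steps 4-5: frequency analysis; the most common group is the word divider.
most_common_group, _ = Counter(groups).most_common(1)[0]
assert most_common_group == f"{ord(' '):08b}"
```

The only real work left after this is matching group frequencies against known English letter frequencies, exactly as the philologist would with a substitution cipher.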

This is a very restricted subset of utf-8. I agree that the ASCII subset would not be tremendously difficult to decipher; the most interesting parts are laid out systematically and in order and case is even just a bit flip.

It's even fairly plausible that the utf-8 numerical encoding can be reverse-engineered from a few samples; enough languages' text generally only use characters from few enough blocks to identify. If you're really motivated, you can probably work your way through most of the languages with phonetic writing systems.

But then there's CJK Unified Ideographs, where the characters that get used are scattered essentially randomly because the ordering is only relevant if you already know how many and which characters were encoded at what point in the history of Unicode.

There are large swaths of Unicode which, if somehow totally lost, would essentially require finding font data or character reference tables to recover.
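For what it's worth, the "laid out systematically" part extends down to the byte level: UTF-8 lead bytes announce the sequence length in their high bits, and continuation bytes always start with 10, which is exactly the kind of regularity a reverse engineer would spot from a few samples. A small sketch:

```python
# UTF-8 framing is self-describing:
#   0xxxxxxx                     1 byte  (ASCII)
#   110xxxxx 10xxxxxx            2 bytes
#   1110xxxx 10xxxxxx 10xxxxxx   3 bytes
for ch in ["a", "é", "中"]:
    print(ch, [f"{b:08b}" for b in ch.encode("utf-8")])
# a ['01100001']
# é ['11000011', '10101001']
# 中 ['11100100', '10111000', '10101101']
```

Recovering this framing tells you where characters begin and end; it says nothing about which glyph a code point maps to, which is the lost-font-table problem the comment above describes.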

I agree recovering CJK Unified Ideographs encodings would be far harder than a phonetic alphabet; however, a few things could make it not as hard as it seems. The decoder has access to a text in both the future format and UTF-8. A text might mix phonetic words and ideographs, as Japanese sometimes does today. The phonetic words would provide clues as to the ideographic characters.

Code breakers have decoded ciphertexts which used a code such that each word was replaced with a number. To make it even harder, common words would be replaced by more than one number to defeat common frequency-analysis techniques. This was often done with pen and paper.

Yuri Knorozov managed to decipher the Mayan script. That was a significantly harder task than recovering UTF-8 mappings, because he had very little to work with on the source language (though he did have some things).

Exactly. You shouldn't underestimate the tremendous amount of work that has been put into deciphering actual ancient languages using advanced techniques and minor contextual clues. Compared to that, deciphering most common UTF-8 data would be relatively simple, meaning it could be done by a single person with some reverse-engineering skills.

An engraved metal or stone tablet could be left along with the CDs to bootstrap the process. It could range from explaining the MP3 spec, to as simple as pictograms showing human speech being converted to microscopic pits. Explaining ASCII would be even easier.

the storage part at least could be a solved problem: https://en.wikipedia.org/wiki/5D_optical_data_storage

Pretty much everyone in a tech job could afford to buy 40 TB of storage, at home or remotely, and mirror the entire repo. Given this low barrier to entry, if you can afford to help preserve the information, you probably should. Even if only a small number do it, that's more points of recovery.

Storage isn't hard, but downloading 40 TB can be a problem. Are there any arrangements for physical distribution (of the "truck loaded with USB drives" variety)?

I'd say, back in the day, anyone could afford to buy a single floppy disk and store files on it. But how many actually did, and how many of those are actually recoverable? Lots probably got thrown out in the intervening years.

In the GLAM sector, the LOCKSS project [1] is quite well known. It tries to deal with some of the resiliency problems inherent in digital preservation. However, I'd guess this system does not offer the needed anonymity.

[1] https://www.lockss.org/ ; https://en.wikipedia.org/wiki/LOCKSS

Forget DRM, even future English may be incomprehensible. There is an entire field of study, called nuclear semiotics, dedicated to finding a way to make our voice heard in the far future, so far without a good plausible solution (https://en.wikipedia.org/wiki/Nuclear_semiotics).

If we can't effectively warn a future (>10,000 years) generation to stay away from something that may harm or kill them, what chance do we have of making a universally understandable archive of data?

Libgen is one of the greatest contributors to scientific productivity worldwide, possibly beaten only by Sci-Hub. Just about everybody in academia knows about it. If it ever vanished, some of us could probably still get by trading files from person to person, but nothing could be as good as what we have now.

> Just about everybody in academia knows about it

Just about everybody in academia uses it, too, especially in the case of Scihub. I can't imagine taking the time to actually check whether I have access to some journal when I want to read a paper, let alone jump through all the hoops before you can get a PDF. The first thing we did when my partner's paper was recently published was check to see if it was on Scihub yet. (It was!)

I remember, in the early 2000s, going through all the trouble of logging into my university library's proxy portal to get access to certain scientific papers. I probably wouldn't have done that if SciHub was available, and it probably would have opened up my eyes sooner to the fact that most people don't even have access to such a portal. Although, frankly, it was a different web back then and if you were persistent you could actually find anything.

> possibly beaten only by Sci-Hub

Today I learned that Library Genesis is actually "powered by Sci-Hub" as its primary source.

So I guess they're sister projects by similarly minded people (who seem to be mostly/originally based in Slavic countries, which I find interesting culturally - perhaps it's due to a looser legal environment + activist academics?).

> Just about everybody in academia knows about it.

That really says something about the state of society, this tension between copyright laws (and the motivations behind them) and the intellectual ideal of free and open access to knowledge.

It's actually more the other way round, since Libgen stores all of sci-hub's papers for them.

I am not an expert on the topic, but I believe that in the former Soviet Union it was common among mathematicians to pass around preprints (a la arXiv). These then percolated through to the West. I think it had to do with the USSR and their restrictive (if we are being euphemistic) policies towards academics.

"the USSR and their restrictive (if we are being euphemistic) policies towards academics."

What do you mean?

Their policies were far more than "restrictive" is how I'm reading it.

See [1]

[1] https://en.wikipedia.org/wiki/Suppressed_research_in_the_Sov...

Yes. I'm saying restrictive to describe the effect on academic papers. The effect on (oppression of) the academics themselves was much worse.

No, that's BS.

There are well known cases of genetics and cybernetics being banned for ideological reasons during Stalin's time. Scientific books and articles of convicted 'enemies of the state' were dangerous to possess in that time too. Some scientists used ideological 'arguments' in scientific debates which were dangerous to argue against.

But all that, AFAIK, ended after Stalin's death in 1953.

Moreover, I've never heard anything about mathematics in this regard.

Not sure what you are saying. Mathematicians were not even allowed to travel abroad [1] and any "concessions" were essentially as it pleased the USSR state. Only from 1990 was movement free in the true sense of the word.

[1] An example was when Margulis won the Fields medal: https://en.wikipedia.org/wiki/Grigory_Margulis. There are many other examples too.

What does that have to do with sharing knowledge in the USSR and the countries in the Soviet block?

It was never in Soviet ideology to hide knowledge behind paywalls. See, for example, this [0] post about the Mir publishing house and the warm comments of Indians who grew up with their books. Sci-hub's ideology is just a continuation of this approach.

[0] https://news.ycombinator.com/item?id=21352277

> Sci-hub's ideology is just a continuation of this approach.

Actually, that was the point of what I was saying—the mathematicians had to be inventive and thus passed around preprints that they knew would also be read in the West.

Why did they have to be inventive? Please provide a source.

who seem to be mostly/originally based in Slavic countries, which I find interesting culturally - perhaps it's due to a looser legal environment + activist academics?

You see the same situation with Asia --- it's a collectivist culture, they have a very different perspective on IP in general.

That makes sense, thanks for pointing that out. I can see how it's related to the value system of collectivist (as opposed to individualist) cultures, and how they see intellectual property - as a common good, beyond personal/private ownership.

It's saved me probably 3 grand over the course of college. I would have had to take on debt otherwise to pass my courses.

I don't see anyone having mentioned the possibility of posting this data to Usenet at all, at minimum for archival purposes, which should be good for ~8-9 years. That way at least the data isn't lost. With so many of those torrents having 0 or 1 seeds, this is a serious risk, I think, despite the comments elsewhere about people rotating what they seed.

I realize that doesn't solve the access problem for most people, as most of the users who need this research might not know how to use Usenet or even be familiar with it at all, but I think the first major concern would be to secure the entire repository on a stable network. Usenet seems like a good place for that even if it doesn't serve as a means of distribution. Encrypting the uploads would make them immune to DMCA takedowns, provided that the decryption keys weren't made public and were only shared with individuals involved in the maintenance of the LibGen project.
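As a purely illustrative sketch of the "secure the repository" idea: split the archive into fixed-size chunks and record a SHA-256 manifest so each posted piece can later be verified. (This is a stand-in only; the real workflow would also encrypt with gpg and generate par2 recovery data before yEnc-posting, and all names and sizes below are made up.)

```python
import hashlib
import json
import os


def chunk_with_manifest(path, out_dir, chunk_size=50 * 1024 * 1024):
    """Split a file into fixed-size chunks and write a SHA-256 manifest.

    A hypothetical pre-posting step: each chunk would be encrypted and
    par2-protected before upload; the manifest lets a downloader verify
    every piece it reassembles.
    """
    manifest = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            name = f"{os.path.basename(path)}.{index:04d}"
            with open(os.path.join(out_dir, name), "wb") as out:
                out.write(chunk)
            manifest.append({
                "name": name,
                "sha256": hashlib.sha256(chunk).hexdigest(),
                "bytes": len(chunk),
            })
            index += 1
    with open(os.path.join(out_dir, "manifest.json"), "w") as m:
        json.dump(manifest, m, indent=2)
    return manifest
```

The manifest also doubles as a cheap integrity check against the slow bit-rot you sometimes see with long-retention providers.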

Two thoughts on that. Encoding it to a text format with CRC data for posting to Usenet is highly inefficient in terms of data storage. And 33TB of stuff is not going to be retained for 8-9 years; the last I checked, due to the huge volume of binaries traffic, the major commercial Usenet feed providers have at most 6-9 months of retention for the major binary groups. Beyond that it becomes cost-prohibitive for them in terms of disk storage requirements. This is not an issue for the majority of their customers; 6-9 months is more than long enough retention to go find a 40GB 2160p copy of some recently-released-on-bluray movie.

Entirely agree about the lack of efficiency. No question about that.

However, in my personal experience, I have seen no issues downloading old data from any binary group. At least not with the provider I have. In fact, just this past week I obtained something sizable (several GBs) with no damaged parts, so I didn't even need the parchive recovery files at all. This has always been my experience. I've never seen anything like the pruning you are talking about. That sounds more like an issue with your specific provider to me.

yEnc overhead is about 2% and there are plenty of providers with ~10 year retention.

Wow. I can't even imagine how much disk space ten years of retention of alt.binaries.* takes up. It's been literally ten years since I last did anything serious Usenet related.

At least in my experience, 10-year-retention providers ask for more money and provide less high-speed bandwidth (after which your up/down is usually limited to around 10 or 1 Mbps).

To me, an aspiring scholar, LibGen is the most amazing tool ever. Things like inter-library loan and access to databases on university networks already make life so much easier than it used to be, but nothing beats LibGen in terms of convenience. I'm in the nowadays obscure field of patristic theology, and I can't believe how much stuff I can find on LibGen, often things that even highly specialized research libraries like Harvard's don't have.

The hours that LibGen saved me in gathering all the sources for my research must be in the hundreds. Thank you!

There is a huge amount of duplication there (i.e. books that have many scans), I wonder if it would be better to tackle that versus doing a straight backup.

There are groups behind data curation as well, though it is much harder. LibGen sees an addition rate of about 230 GBs per month, while SciMag's is around 1.10 TBs per month. We should expect those numbers to increase in the future. The man-hours required to curate those databases may very well cost much more than the storage and bandwidth required to store duplicates and incorrectly tagged files. In any case, as I said, there are people seriously interested in curating the LibGen database, though most efforts I know of are still in the earliest stages.
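A first automated pass at curation could at least surface duplicate candidates for human review, e.g. by grouping records on a normalized (author, title, edition) key. A toy sketch (the record schema here is hypothetical, and a real pass over the LibGen database would need fuzzier matching, transliteration handling, and file hashes):

```python
import re
from collections import defaultdict


def duplicate_candidates(records):
    """Group catalogue records by a normalized key to flag likely duplicates.

    Each record is assumed to look like {"id", "author", "title", "edition"};
    only groups with more than one member are returned, as candidates for
    human review rather than automatic deletion.
    """
    def norm(s):
        # Lowercase and collapse punctuation/whitespace so trivial
        # formatting differences don't hide duplicates.
        return re.sub(r"[^a-z0-9]+", " ", (s or "").lower()).strip()

    groups = defaultdict(list)
    for r in records:
        key = (norm(r.get("author")), norm(r.get("title")), norm(r.get("edition")))
        groups[key].append(r["id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}
```

Keeping humans in the loop matters here: as others note below, many "duplicates" are actually different editions or translations worth preserving.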

Do you know if they process PDFs to reduce file size?

A lot of the data is in the djvu format which is very efficient for scanned books.

This is a downside of Libgen: duplicate uploads, missing or erroneous metadata. You start wishing that there was at least some curation of the collection, so it could approach the quality of an academic library catalogue as many users are used to. But I guess the people behind Libgen want to keep the number of people with database edit rights small. (When you upload a book, you yourself can edit the metadata for that book for 24 hours, but you cannot go through the rest of LibGen's database and make corrections.)

Maybe they should consider a system where users can suggest tags/metadata or flag erroneous data that can be reviewed and allowed by a select few?

Integration with BookBrainz would be nice. The Brainz projects already consist of massive amounts of metadata curation and it would be possible to transfer that knowledge a bit.

I think the duplication issue is probably overstated. I doubt tackling that would shave off more than 20% of the total backup size.

Speaking from personal experience, I usually see several results for any search. Granted, there's a big selection bias there, but 20% seems way too small.

Because you or anyone else is most likely to search for relatively popular books. So those books will have multiple copies. But for every popular book, there are many unpopular, but still useful, books that only have a single copy.

To be fair, for textbooks at least, I often see several results, but of different editions (1x edition 1, 2x edition 2, 1x edition 3, etc.). In some cases I think it's worthwhile keeping the different editions around, unless it becomes a huge burden.

Usually the different results have meaningful differences, often a different edition or translator, etc.

In my experience it's different editions or mirrors.

It's probably more of a nuisance for people wanting to use the content. E.g., copies with different metadata or tags.

20% is not insignificant.

Forking LibGen to save 20% of file size would be counterproductive. Yes, you save some storage, but the network effect is more important: people willing to contribute to "the one true thing" provide more seeding capacity than the 20% saves.

What's interesting is that 32TB is becoming more and more affordable and the research material is roughly staying about the same size.

That might change though as people start including video + data within papers and have new notebook formats that are live and contain docker containers/ipython, etc.

It's a shame we can't just mail these around.

You can buy 48TB (4x12TB) for €1000. Store some index on an SSD, and you have another full node.

If you don't care about warranty, 8 and 12TB drives routinely go for $15/TB on sale inside WD Elements.

I picked up 32TB for just under $500 with discount over the holiday that way.

Even if you shuck the drive, as long as you keep the enclosure Western Digital will still honor the warranty.

I’ve heard that you can send the drives back without the enclosure and they have still honored the warranty. https://www.reddit.com/r/DataHoarder/comments/am9vdv/easysto...

Can you elaborate? What's the catch?

You need to take 10 minutes out of your day to remove the plastic enclosure. Depending on your setup, you may also need to make some minor modifications to the drive: google.com/search?q=3.3V+wd+easystore

The theory is that this is a form of market segmentation, where enthusiasts/companies are willing to pay more for a bare drive than regular consumers are.

The only catch is that it's a minor lottery which model drive you're getting.

For instance, I got all white label WD80EMAZs (256MB cache, non-SMR, same firmware as the Reds) in this batch, so I had to insulate the 3.3V pins.

There are also true Reds, 128MB- and 512MB-cache drives, helium-filled drives, 7.2K HGSTs slowed to 5.4K, and other variants.

Or use a traditional power-supply-to-SATA cable.

When people publish data it's typically uploaded to a public repository anyway. Supplementary videos are a thing, but in my field at least they generally stay in the supplementary and aren't the raw data so file sizes are reasonable, while still images are used in the text. Journals are still printed works first, believe it or not.

The bandwidth to upload to people can get expensive depending on where you live. Most home connections don't have symmetric fiber, so you are stuck with a crippled amount of upload bandwidth.

I feel like this is the crux of the matter. You could easily get 32 people on this site to volunteer 1 TB each, if it were just cold storage. However, making those resources accessible and searchable (with all the pitfalls of compliance, uptime, legality, etc) is a totally different ballgame.

Encrypted shards partially solve this, but then you hit the quandary of "but what if I have a shard of something illegal or undesired enough to upset the wrong people?", which has not been thoroughly tested in our legal system.

Related: looking at hard disk cost per terabyte, external drives are quite often cheaper than internal ones.

For example right now in Germany I can get a WD 8TB USB 3.0 drive for 135€ but the cheapest internal 8TB drive costs 169€.

Any idea why? It's puzzling.

It is very common these days to buy the WD 8TB, 10TB and 12TB external USB3 hard drives and remove their cases, and put them in some sort of home built file server or NAS. There's a technique to put a thin section of kapton tape on one of the SATA pins so that they will power up from ordinary PC/ATX type power supplies with regular SATA power connectors.


In large ZFS arrays, many people are using them with great success, at no greater or lesser annual failure rate than the expensive enterprise hard drives.

> at no greater or lesser annual failure rate than the expensive enterprise hard drives.

I've read these reports as well, but I can say that it's not my experience (we've gone through a few rounds of shucking at the Internet Archive, for economy and in one case necessity after the 2011 Thailand floods pinched the supply chain). Our raw failure rates on shucked drives are significantly higher, and the drives themselves are typically non-performant for high-throughput workloads (often being SMR disks/etc, though hopefully the move away from drive-managed SMR will finally kill that product category off).

I'm apparently out of the loop w.r.t. non-solid state storage. For people in the same boat:

Shingled magnetic recording (SMR) is a magnetic storage data recording technology used in hard disk drives (HDDs) to increase storage density and overall per-drive storage capacity ... The overlapping-tracks architecture may slow down the writing process since writing to one track overwrites adjacent tracks, and requires them to be rewritten as well.


From reddit.com/r/datahoarder I don't believe I've seen a single instance of 8, 10 or 12TB consumer USB3 drives coming out of the plastic case as a model that is shingled recording. The average consumer trying to copy many dozens of GB onto an external drive would not tolerate SMR write performance.

Some of the disks in the market over the last few years were not well-labeled in terms of revealing their SMR internals, and use larger media caches to disguise write performance issues (at least, until they don't). For example, the Seagate STGY8000400 is an external 8TB SMR drive. But the industry as a whole is moving to host-managed SMR, so hopefully that specific issue with external disks will soon go away.

Could you publish some statistics?

It has been like this for years. You'll often see people refer to "shucking" them, taking the drives out to use in a NAS.

My best guess would be that more people buy external drives than internal, and those that "manufacture" external drives (ie - buy internal drives and repackage them) purchase in larger volumes than those that sell bare internal drives.

Of course, that wouldn't explain the difference between a WD external drive and that same drive as an internal drive - assuming that WD actually manufactures both (and doesn't just license the name provided the 3rd party uses their drives)...

I noticed this yesterday while shopping for cyber Monday deals. If you want to load up a server with drives, perhaps the external drives can be removed from their cases and used internally?

I did exactly this when setting up a NAS. Saved $50 / drive, and it took me a few minutes to remove the drive from each.

/r/DataHoarder might be a good place to ask (or search). Also, this is known as "shucking".

Check it before you buy it. Years ago I bought a 1TB WD external drive where the USB interface was connected directly to the drive's board, so there was no SATA connector to use internally.

Thanks! I will have to read up on shucking and /r/DataHoarder. I would think someone already has a list of which external drives can be used this way.

For me on amazon.de the WD 8TB USB 3.0 drive is currently at EUR 159.99. Where do you get it for EUR 135?

Let me say this: I fucking love libgen. It actually makes my life better and I'm so thankful to the people running it.

Posting that here only creates problems for them. The more it's known in the west the more likely it will go down.

+1, bookwarrior has warned about this.

who is bookwarrior?

Is there a way to just download the whole 32TB to your own machine? I see a ton of mirrors but the content seems to be highly fragmented between them

There are ways to do so. The archive is made up of many, many torrents (I believe it's a monthly if not biweekly update of the database). If you have the storage/bandwidth availability for the whole 32TBs, please get in touch and I may be able to help you get the whole deal without too much hassle. Otherwise, just pick some torrents (it would be best to pick them based on torrent health, but there are too many to check manually) and try to keep seeding as much as possible.

EDIT: To find libgen's torrents health, check out this google sheet: https://docs.google.com/spreadsheets/d/1hqT7dVe8u09eatT93V2x...

Thanks frgtpsswrdlame for the heads up.

If LibGen can announce all of the torrents in a JSON payload with health metadata, that can be consumed for automated seedbox consumption and prioritization. Check out ArchiveTeam's Warrior JSON project payload [1] for inspiration. It need not even be generated on-demand; render it on a schedule and distribute at known endpoints.

[1] https://warriorhq.archiveteam.org/projects.json
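A sketch of what consuming such a payload might look like on a seedbox: fetch the list, filter to under-seeded torrents, and greedily fill a storage budget with the most at-risk ones first. (The per-torrent schema below is hypothetical; the real format would be whatever LibGen chooses to publish.)

```python
def pick_torrents(torrents, budget_bytes, max_seeders=2):
    """Pick the most at-risk torrents that fit in a storage budget.

    `torrents` is assumed to be a list of dicts like
    {"name": ..., "seeders": ..., "size_bytes": ...} -- an invented
    schema for illustration. Torrents with the fewest seeders come
    first; ties are broken by preferring larger torrents, since those
    are hardest for casual seeders to pick up.
    """
    at_risk = sorted(
        (t for t in torrents if t["seeders"] <= max_seeders),
        key=lambda t: (t["seeders"], -t["size_bytes"]),
    )
    chosen, used = [], 0
    for t in at_risk:
        if used + t["size_bytes"] <= budget_bytes:
            chosen.append(t)
            used += t["size_bytes"]
    return chosen
```

Run on a schedule, each seedbox would re-fetch the payload and re-prioritize as swarm health changes, which is exactly the auto-balancing behavior discussed elsewhere in this thread.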

Actually there is now a google sheet which shows the health of the torrents so it should be easy to pick the most helpful torrents. It's linked in this post: reddit.com/e3yl23

I'm pretty surprised by the lack of seeders. Out of the 2438 torrents listed, a third have 0 seeders, another third have 1 seeder, and all but 5 have less than 10. Hopefully the publicity boosts those numbers.

From what I've heard a good chunk of people rotate their seeds for LibGen because their seedboxes can't handle all the connections for every torrent at once.

Is there some tool or documentation describing this practice?

I'm sure someone could get you the info to get set up as a seeder. For modern clients it's rather trivial to manage that many torrents. Get any decent modern CPU, 4GB+ RAM, and $560 in storage and you're off.

I think the problem is that because of the size of each torrent, and the fact that there are 1000 of them, it's difficult to effectively seed them all at once, so people would rather seed sections at a time and rotate through them.

I'm not sure how people setup the rotation though, that can't be an incredibly common feature but I could be wrong.

There are features that prioritize those with a low seed/leech ratio in a sort of periodic fashion. It also partially auto-balances, because a swarm only needs a little more than unity ratio injected into it to get itself fully replicated. So each torrent that gets chosen because of a low seed/leech ratio will inherently drop out of that criterion as soon as the swarm is self-sufficient.
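For the simpler "rotate through them" approach people describe above, a toy scheduler might just slide a fixed-size window over the full torrent list each period. (A minimal sketch; real setups would weight by seed/leech ratio rather than rotating blindly, and the function names are made up.)

```python
def rotation_schedule(torrent_ids, active_slots, rounds):
    """Yield, per round, the subset of torrents to seed.

    With more torrents than a seedbox can keep active at once, seed
    `active_slots` of them per period and advance the window (wrapping
    around), so every torrent gets seeded eventually.
    """
    n = len(torrent_ids)
    for r in range(rounds):
        start = (r * active_slots) % n
        yield [torrent_ids[(start + i) % n] for i in range(active_slots)]
```

Each yielded window would be fed to the client (e.g. by moving .torrent files in and out of a watchfolder) at the start of a rotation period.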

Why doesn't someone maintain a single torrent containing a snapshot of the full archive at a given point in time, updated (say) monthly?

I want a full mirror, and ain't nobody got time to deal with 2000 torrents, many of which have no seeders. That's a really dumb way to run this particular railroad.

Because torrent clients can't handle that many pieces in a single torrent. There are algorithms involved that are super-linear, maybe even quadratic or worse. They start causing trouble in the TB range.

Also, the UI for adding many torrents is much nicer than the UI for selecting a non-trivial subset of files inside a single torrent. And many parts of the ecosystem handle partial seeds poorly: peers that only seed a subset, and for the near future will never leech the other parts, often get treated as leechers, despite not really being leechers.

TL;DR: 2k files are just a watchfolder and a cp * watchfolder/ away from working. One fat 32TB torrent does not scale, however.

Thanks! I don't have 32TB free locally at the moment but I might soon. If and when that happens, I'll get in touch :)

Why not publish the site over IPFS, that would make P2P hosting much simpler?

In my experience IPFS doesn't actually work. I'd love to be proven wrong, but the reason why nobody uses IPFS even when it seems like a great fit is because it's not really usable.

This is my experience as well. In theory, IPFS is exactly the right thing for LibGen, but in practice I consider it unusable.

FWIW: StavrosK has actually been putting some serious effort into making IPFS accessible.

See here for example: https://news.ycombinator.com/item?id=16521385

It's not really an accessibility issue so much as a performance one. If it was a reasonable alternative to something like the rsync daemon I'd use it all the time.

Unfortunately the performance issues and overhead are just too much.

IPFS isn't really an alternative to rsync, but the rest of your point stands.

Not in the general case, no. But one of the major uses for rsync in synchronizing mirrors, and it could very much be a replacement for rsync in that particular niche.

For example mirroring project gutenburg.

Better rsync support is actually in the works (though probably still a ~quarter out to land end to end in go-ipfs). The two main components are improving ipfs "mount" support, and adding metadata support into unixfs so we can only update modified files. See details here: https://github.com/ipfs/team-mgmt/blob/master/OKR/PACKAGE-MA... -- mount currently has IPNS read support but write support needs an owner to get it over the finish line and unixfsv1.5 (with metadata) should be landing in js-ipfs later this week!

Ah, yes, you mean pinning a specific set, you're right. Unfortunately I've found that the way the daemon does pinning currently doesn't lend itself to that use case (a single unavailable file will stall the pin for hours).

Thank you, I really hope IPFS improves.

Sorry to hear about your bad experience, StavrosK.

I think this perspective really depends how you're trying to use IPFS. For example, the ease of use of running a local IPFS node has improved a ton with IPFS-desktop & companion, and tools like ipfs-cohost (https://github.com/ipfs-shipyard/ipfs-cohost) also improve usability and tooling for shared community hosting. I think this has actually seen a ton of progress and end consumer usability has improved significantly in the past year (and is now even coming out-of-the-box in browsers like Brave and Opera!)

I definitely hear that running a beefy IPFS node for local hosting/pinning still needs work, but pinning services like Infura, Temporal, and Pinata have helped abstract some of those challenges from individual applications like this. From a developer perspective, there are a lot of performance improvements for adding, data transfer, and content resolution coming down the line very soon (https://github.com/ipfs/go-ipfs/issues/6776), and there's also been a lot of work improving the ipfs gateways and docs to support the dev community better. I definitely think there is still lots of room for improvement - but also lots of progress to recognize in making IPFS usable to exactly these sorts of applications. Check out qri.io - they're doing collaborative dataset hosting like this and it's pretty slick!

You are correct, the end user experience has improved tremendously, I tried the desktop bundle the other day and it was indeed very easy to get started with.

> pinning services like Infura, Temporal, and Pinata have helped abstract some of those challenges

I wonder if you omitted Eternum on purpose :P

(For context, I created and run Eternum, and that experience is mostly where my opinion of IPFS comes from.)

Gotcha! Thank you! Running a pinning service definitely still has rough edges =/ but I know the Infura team recently open sourced some of the tooling they built to make it a bit easier: https://blog.infura.io/new-tools-for-running-ipfs-nodes-196d.... Might help others who are self hosting a large chunk of data on a persistent node too...

If you ever want to chat about how we can make pinning services on IPFS easier to run, would love to chat! I know cluster has been researching how to improve "IPFS for enterprise" usage and would really appreciate the user feedback!

Ah, thanks for that link, that would have come in handy a few weeks ago when I migrated the node to a new server.

I would love to chat. My #1 request is to make pinning asynchronous, and generally improve pinning performance. I think that's most of my frustration, followed by slow DHT resolves, followed by large resource usage by the node.

Pinning services are nice, but the idea of pinning services is a bit antithetical to the basic philosophy of p2p. If the only way to make something available is e.g. putting something on pinata, I might as well put it on S3.

The basic problem is that the DHT is currently not working, and IPFS is using the DHT in a very demanding way compared to, say, bittorrent or DAT.

I know that there are some fixes in the works, but the next releases really need to solve the DHT problem, otherwise no amount of usability improvements is going to matter...

How would that work with adding new books and metadata? IPFS archives are immutable, right? I think something like Dat might be better because the people with the secret keys could update the archive and everyone else would automatically seed the updated version

You can just have it pin an IPNS CID, or you can publish a new hash for people to pin. There are ways.

That said, maybe Dat would be better, especially if it works well.

Currently (at least for the-eye) it's about IPFS's barrier of entry. I expect LibGen's case to be similar. Most people don't know about it, and if even those that knew about it had to learn how IPFS works etc, they would probably just try to find the book they're looking for elsewhere.

No need to conflate the frontend (the end-user interface that 'most people' use when trying to 'find the book they're looking for') with the mirroring/archiving backend (the distributed/p2p technology used to 'make sure LibGen never goes down').

The frontend would still be a user-friendly HTTP web-application (or collection of several) that pulls (portions of) the archive from the distributed/resilient backend to serve individual files to clients.

The backend can be a relatively obscure, geeky, post-BitTorrent p2p software like IPFS or Dat, as long as those willing to donate bandwidth/storage can run it on their systems. This is a vastly different audience from 'most people'.

The real question is which software's features best fits the backend use-case (efficiently hosting a very large and growing/evolving, IP-infringing dataset). Dat [1] has features to (1) update data and efficiently synchronize changes, and to (2) efficiently provide random-access data from larger datasets. Two quite compelling advancements over BitTorrent for this use-case.

[1] https://docs.datproject.org/docs/faq#how-is-dat-different-th...

I am not fully aware how IPFS operates, but wouldn't it at least solve the back-end mirroring? Front-end servers would then "only" need to access IPFS for continuous syncing of metadata (for search) and fetching user-requested files (upon request).

True, I too find it not ideal, but having such a massive library available over it surely would increase the interest in lowering the barrier of entry?

How about Tahoe-LAFS? I haven't used it, but it should be stable by now.

There's also ZeroNet, though IDK if it can handle the traffic.

Are there any i2p torrents? I guess anonymity might be helpful if I want to mirror/seed this data...

I assume anyone could simply seed the "official" torrents via i2p? Not sure how that system actually works, it's interesting for sure but a lot less well-known than the alternatives.

One of the next interplanetary or interstellar probes should carry a copy of the Sci-Hub torrent in some kind of permanent storage.

Do we have anything rated for a few millennia of interstellar radiation besides etched gold plates?

Microsoft's project Silica [0] may hopefully provide really long term, large capacity archive grade storage on earth. I wonder what effects interstellar radiation has on them.

[0] https://www.theverge.com/2019/11/4/20942040/microsoft-projec...

Glass is pretty inert, full stop. It would depend on the voxel size, but I imagine as long as you have more than a few hundred atoms per voxel/bit you will have survivability on the order of millennia, even in high-radiation environments. Someone would have to do the nuclear cross-section calculations to get a real bit error rate, but glass is very tough stuff.

I think millipede memory[0] could have managed that, too bad the technology got shelved.

[0]: https://en.wikipedia.org/wiki/Millipede_memory

What's wrong with etched gold disks?

I wonder if that's actually feasible for this application.

Microdots managed about 32MiB per square centimeter (from the "you could fit the Bible 50 times in a square inch" measure). That's a completely arbitrary density to achieve, since those were photographs which were enlarged and shrunk, and you could hypothetically use any encoding for your gold etchings; it also leaves open the question of "how fine can you etch gold plates while still having the engraving be 'robust'?"

But in any case, that gives you a target area of ~325 square meters for a 100TiB archive.

That's a lot, but not a crazy, obviously impossible number like a million square km or something.

Assuming you can cut the etching down to 1m squares and stack them on top of one another that should be no problem at all. Assuming each layer is 1.2mm thick (same as a CD) that's a volume of 1m x 1m x .390m, it would fit easily in a payload fairing of any rocket that can get it into orbit.

The volume isn't a real problem, but nearly half a cubic meter of gold should also have a mass of ~7500 kilograms. You're also looking at a cost of ~$350 million, US.

(Which isn't necessarily impossible, but still a lot, especially for an interstellar probe.)
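The back-of-envelope numbers in this subthread can be bundled into one small calculation, using the thread's own assumptions (32 MiB/cm² microdot-era density, 100 TiB of data, 1.2 mm layers, solid gold at ~19,300 kg/m³ and roughly $45,000/kg; all of these are illustrative, not engineering figures):

```python
def gold_archive_estimate(data_tib=100, density_mib_per_cm2=32,
                          layer_thickness_m=1.2e-3,
                          gold_density_kg_m3=19300, gold_usd_per_kg=45000):
    """Rough numbers for the etched-gold archive discussed above.

    Returns plate area, stacked volume (as 1 m^2 plates), mass, and
    material cost. Every default is an assumption from the thread.
    """
    area_cm2 = data_tib * 2**40 / (density_mib_per_cm2 * 2**20)
    area_m2 = area_cm2 / 1e4
    volume_m3 = area_m2 * layer_thickness_m
    mass_kg = volume_m3 * gold_density_kg_m3
    cost_usd = mass_kg * gold_usd_per_kg
    return {"area_m2": area_m2, "volume_m3": volume_m3,
            "mass_kg": mass_kg, "cost_usd": cost_usd}
```

With the defaults this reproduces the thread's figures: ~328 m² of plate area, ~0.39 m³ stacked, roughly 7,500 kg of gold, and a material cost in the neighborhood of $350M.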

7.5 metric tons is well within the weight limit for a Falcon 9, so add about $57m for the launch costs for a total project cost of probably around $500m.

That puts it in the realm of the Voyager probes for total cost.

7.5 metric tons is also over ten times the mass of each Voyager probe.

Thank SpaceX for slashing launch costs and me for not factoring in much of a second stage to boost that mass out of orbit.

But I was also extremely generous with the thickness in the first post. In real life they would almost certainly be closer to .05mm than 1.2mm, and probably not made out of solid gold.

The point was to show that even with some rather pessimistic assumptions the project was within human scale and even had some precedent.

Yeah, it's all within the realm of possibility, with too many unknowns (eg, what is a realistic information density) to really easily say how easy or hard it is.

Re: Falcon 9's, Voyager was lifted on a Titan IIIE, which had a LEO payload of 15,300 kg, compared to 22,800 kg for the Falcon 9 to LEO. Assuming Voyager was near the capacity of what could be ejected from the Solar System by a Titan IIIE, you'd need to send up ~7 Falcon 9's to Voltron together in orbit.

I would argue that at this point, actually building Voltron might be more worthwhile.

Then they could be double-sided. And aluminum is cheap ... just hope it doesn't hit a rock.

There is no need to put a data storage archive on something shot out into interstellar space. Geostationary telecommunications satellites are in a sufficiently high orbit that they will likely outlast human civilization. We could destroy ourselves with nuclear war, regress to a stone age level of technology, rediscover spaceflight, and go find them long before the orbits of any of them decay.

too close to the warzone to survive

Okay, high Mars orbit. Quite a lot less delta-v requirement for some theoretical several-thousand-kilogram chunk of long-lifespan data storage, and easier to discover and retrieve, than achieving solar escape velocity (look at the delta-v budget for the New Horizons probe vs. its total size and weight, for instance).

I'd probably put a signaling system on it. Some large RTG using elements with a long half life and extremely redundant systems that send a weak signal towards earth, just over the background noise. Some regular signal ("DATA" as a morse code) that repeats from an unknown object orbiting mars?

I'd bet that any civilization would be flying there ASAP to see what it is about.

The crux is really making a simple radio transmitter that can survive that long, but the requirements are loose: the frequency is allowed to drift so long as it remains somewhere in a band receivable on Earth's surface; it needs a very long-lasting power source but not much power; and unidirectional transmission is fine as long as it arrives.

Storing that amount of information in a way that an unknown alien species would be able to read (even assuming technical expertise greater than our own) is a huge problem.

Keep in mind that they don't know our written or computational language and there's nothing about our technology that is inherently self-explaining/obvious.

Even the assumption that they'd use binary computers (rather than trinary, or other technology not based around electrical voltages) is open to debate.

An idea I've seen is including messages at several levels. At the outermost level you describe in very basic format how to build a magnifying glass. From there you have diagrams that are legible that describes how to build a microscope. From there you have more than enough space to describe the basics of what else is in there and to start describing your language. I'm thinking optical storage in a clear rock of some sort, as has already been prototyped.

If you assume motivated readers and human-level intelligence, you could end up with good results. It might take a decade or three, and a lot of mental firepower, but they could get there.

(The outer layer is the hardest, since our information density is lowest. Our "description of how to build a magnifying glass" might cover just the basic optics of curved glass and a very basic description of how to get to glass and how to curve it correctly, leaving a lot of the details up to the finder. After all, we did it without help. We're not so much trying to solve this problem for the finder as help them on their way.)

So, before jumping in to argue, remember I'm stipulating decades of dedicated effort by presumably an interested consortium of... whatever they are. I think we can safely stipulate an amount of effort at least as large as our society has dedicated to, say, Linear A and B, or the Voynich manuscript. I'm not trying to spec "Ugh wanders out of the jungle, sees our pretty rock, and personally has a 20th century civilization up and running in 10 years" or anything crazy.

>I'm not trying to spec "Ugh wanders out of the jungle, sees our pretty rock, and personally has a 20th century civilization up and running in 10 years" or anything crazy.

Quite coincidentally, there is currently an anime running named "Dr. Stone" which is about exactly that: jump-starting human civilization from the Stone Age to the modern day as fast as possible. At least in-story it's been a few months and they're currently building radios and have a waterwheel generator.

Yeah, it's on my list. My strategy is generally to wait until I can just mainline the entire season, so I tend not to watch the latest seasonals, but it's definitely on my list when the season is done.

In Vernor Vinge's "A Fire Upon the Deep", a very advanced civilization in the outer galaxy that can't reach where we are for $REASONS has as a persistent hobby speculation on the fastest way to bootstrap advanced civilizations, assuming essentially-perfect knowledge of physics instead of blundering around.

> an unknown alien species

Not necessarily for aliens ... but why not keep a backup in a safe place, outside the dangers of earthlings?

OTOH, I think a sufficiently advanced alien intelligence will be able to decipher the information structures we use regardless of differences in technology. It's possible, though, that there will be missing links in that archive, which will need to be supplemented with a primary, secondary, and high school curriculum.

What about a satellite backup in orbit around earth? Maybe an elliptical orbit, coming around a few times a year or something

What format? I'm eager to know the solution to this non-problem.

pdf of course!

/Gödel, Escher, Bach/ has an interesting thought experiment about exactly this.

If one were to receive an object, how would it be indicated that there is a message embedded in there? Given that an intelligence could recognize that there was a message embedded, could it eventually be deciphered?

This is an interesting idea because there's a lot of radical political and philosophical publications on there. Brill's Historical Materialism book series is on there almost in its entirety.

I did not know about LibGen until this post. Too bad for me, living in a cave. Anyway, this is an amazing project. Best of luck to them and similar efforts.

Imagine this:

- A tiny well behaved client that starts with the OS.

- It downloads rare bits of the archive at 1 kB/s, obtaining 1 GB every ~278 hours. It should stop somewhere around 100 MB to 5 GB.

- It periodically announces what chunks/documents it has.

- It seeds those chunks at 1 kB/s

- Chunks/documents that have thousands of seeds already are not announced. Eventually those are pruned.

This scales to the point where everyone can help without it costing them anything.

If someone is trying to obtain a 20 MB PDF it would take about five and a half hours from a single 1 kB/s seed. With just 50 seeds it's under 7 minutes.
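The transfer-time arithmetic works out; a quick sketch, assuming 1 kB/s per seed and decimal units:

```python
# Back-of-envelope transfer times for a swarm where each seed uploads 1 kB/s.
BYTES_PER_SEC_PER_SEED = 1_000

def hours_to_fetch(size_bytes, seeds=1):
    """Hours to download `size_bytes` from `seeds` seeds at 1 kB/s each."""
    return size_bytes / (BYTES_PER_SEC_PER_SEED * seeds) / 3600

print(hours_to_fetch(1_000_000_000))        # 1 GB from one seed: ~278 hours
print(hours_to_fetch(20_000_000))           # 20 MB PDF, one seed: ~5.6 hours
print(hours_to_fetch(20_000_000, 50) * 60)  # 20 MB, 50 seeds: ~6.7 minutes
```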

I'd like to dedicate 1TB of my FreeNAS to something like this. Would be nice to run a small container with some P2P service that contained that chunk.

Can't Tahoe-LAFS help with this kind of challenge? I don't have experience with it, but it looks stable.

I've thought that we could build an end-to-end encrypted datastore within Polar, and possibly add IPFS support to help with this issue.

Here's a blog post about our datastores for some background.


... essentially Polar is a PDF manager and knowledge repository for academics, scientists, intellectuals, etc.

One secondary challenge we have is allowing for sharing of research but I'd like to do it in a secure and distributed manner.

Some of our users are concerned about their eBooks being stored unencrypted and while for the majority of our users this will never be a problem I can see this being an issue in countries with political regimes that are hostile to open research.

In the US we have an issue of researchers being harassed over climate change btw. Having a way to encrypt your knowledge repository (ebooks) would help academic freedom as your employer or government couldn't force you to give them your repository.

But what if we went beyond this and provided a way to ADD documents to the repository from a site like LibGen?

Then we'd have the ability to easily, with one click, encrypt the document (end to end) and add it to our repository.

If we can add support for Polar to allow colleagues to share directly, this would be a virtual mirror of LibGen.

Alice could add books b1, b2, b3 to her repo and then share them with Bob; only he would be able to see b1, b2, b3, since the two of them would generate a shared symmetric key to exchange the books.

No 3rd party (including me) would have any knowledge of what's going on.
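The per-document key scheme could look something like this. A toy sketch only -- the SHA-256 counter-mode keystream stands in for a real vetted cipher (e.g. an AEAD like AES-GCM), and the names are hypothetical, not Polar's actual API:

```python
import hashlib
import secrets

def keystream(key: bytes, length: int) -> bytes:
    """Toy SHA-256 counter-mode keystream. Illustration only --
    a real implementation would use a vetted AEAD cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def encrypt(key: bytes, data: bytes) -> bytes:
    # XOR the plaintext against the keystream.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

decrypt = encrypt  # XOR stream ciphers are symmetric

# Alice encrypts each book with a fresh per-document key...
doc_key = secrets.token_bytes(32)
b1_cipher = encrypt(doc_key, b"contents of b1")

# ...and shares only doc_key with Bob (out of band); the server never sees it.
assert decrypt(doc_key, b1_cipher) == b"contents of b1"
```

The design point is that the storage backend (Firebase, IPFS, whatever) only ever holds ciphertext; sharing a book means sharing its key, not re-uploading plaintext.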

I'm going to assume our users are not going to do anything nefarious or pirate any books. I'm also certain that they're conforming to the necessary laws ...

The challenge though is that while we'd be able to have a mirror of LibGen and more material, it would be a probabilistic mirror - I'm sure we'd have like 60% of it but the obscure material wouldn't be mirrored.

Right now our datastores support just local disk, and Firebase (which is Google Cloud basically). While we would encrypt the data end to end in Google Cloud I can totally understand why users might not like to use that platform.

One major issue is China where it's blocked.

Something like IPFS could go a long way to solving this but it's still very new and I haven't hacked on it much.

I'd say IPFS, but that's a pretty big commitment from an entire community to keep alive.

It's best to split it into small torrents of a few 1-2 GB each so normal users can seed.

If only some of the money made would reach the scientists. Most of them will send you their paper by email if you ask. The majority don't want their work to sit behind paywalls...

One could use FAANG data centers to host them for free; it would be really great.

Look at the Google Books project. That got shut down hard due to copyright issues and litigation, after they invested a ton of money in digitizing some of the most valuable library collections in the world.

> Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.

Indeed, what an intellectual tragedy.

> In August 2010, Google put out a blog post announcing that there were 129,864,880 books in the world. The company said they were going to scan them all.

That seems like a surprisingly "small" number.

Well, in trying to picture a physical library with 130 million books, maybe that's a realistic estimate. But compared to, say, the recently discovered data hoard of more than 2 billion online identities, it's minuscule.

SciHub and LibGen are truly the modern-day Library of Alexandria. The fact that they're being called "Pirate Bays of Science" - and that providing free and open access to all books in the world is illegal - just goes to show that our civilization's priorities are misdirected.

Until fairly recently (historically), books were overwhelmingly scarce. A few datapoints:

- The total number of books -- not titles, but actual bound volumes -- in Europe as of 1500 CE, was about 50,000. By 1800, the total was just under one billion.

- The library of the University of Paris circa 1000 CE comprised about 2,000 volumes. It was among the largest in Europe.

- The Library of Constantinople in the 5th century had 120,000 volumes, the largest in Europe at the time.

- A fair-sized city public library today has on the order of 300,000 volumes. A large university library generally a million or so. The Harvard Library contains 20 million volumes. The University of California collection, across all ten campuses, totals more than 34 million volumes.

- The total surviving corpus of Greek literature is a few hundred titles. I believe many of those were only preserved through Arabic scholars, some possibly in Arabic translation, not the original Greek.

- There's an online collection of cuneiform tablets. These generally correspond to a written page (or less) of text, with the largest collections numbering in the tens of thousands of items.

- As of about 1800, the library of the British Museum (now the British Library) had 50,000 volumes. Again, among the largest of its time.

- From roughly 1950 to 2000, some 300,000 titles were published annually in the United States and/or English-language editions. R.R. Bowker issues ISBNs and tracks this. From ~2005 onward, "nontraditional" books (self- / vanity-published) have been at about or above 1 million annually.

- The US Library of Congress, the largest contemporary library in the world, holds 24 million books in its main collection (another 16 million in large type), and has 126 million catalogued items in total (2015).

- At about 5 MB per book, in PDF form, total storage for the 38 million volumes of the Library of Congress would be slightly under 200 TB. At about $50/TB, that's $10,000 of raw disk storage. (Actual provisioning costs would be higher.) Costs are falling at 15%/year.

- Total data in the world comprises far more than books, and has been doubling about every 2 years. Or stated inversely: half of all the recorded information of humankind was created in the past two years.
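The storage arithmetic in the points above is easy to verify; a sketch, where the 5 MB/book average and $50/TB price are the comment's stated assumptions:

```python
# Library of Congress back-of-envelope: 38M volumes at ~5 MB each.
volumes = 38_000_000
mb_per_book = 5
tb_total = volumes * mb_per_book / 1_000_000  # decimal MB -> TB

usd_per_tb = 50
cost_usd = tb_total * usd_per_tb

print(tb_total)  # 190.0 TB -- "slightly under 200 TB"
print(cost_usd)  # 9500.0  -- roughly the $10,000 quoted
```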


Some of this is off the top of my head, but partial support for the facts from:









Thank you for that, very interesting and educational. I love how you led up to the punchline. It made me see that books as a technology and artifact are part of the "history of information", and how books are becoming subsumed in a shared trajectory with media/data in general.

> half of all the recorded information of humankind was created in the past two years

That is shocking to imagine, and it's exponentially growing.

It reminds me of Vannevar Bush's "As We May Think", pointing out the emerging information overload in society. It certainly puts things in perspective, how we (humanity) have been making a conscious, collaborative effort to develop globally networked computers, one of whose important functions is to help us organize all the information, including books.

The conundrum, it seems, is that technology is also a massive multiplier/amplifier of the amount of data, such that its capacity to help us organize may never catch up with what it's helping to produce.

> total storage for the 38 million volumes of the Library of Congress would be slightly under 200 TB

I guess it's redundant to say, but I'm sure in the near future that would fit on a thumb drive!

Bush's essay is of course a classic. There are some precursors -- there's a BBC interview of H.G. Wells describing something similar from the 1940s.[1] E.M. Forster's The Machine Stops has some similar ideas. And various encyclopaedists very much embodied similar ideals.

I've been listening to Peter Adamson's "History of Philosophy Without Any Gaps" podcast, which is excellent, and spends a fair bit of time looking at the historiography of the topic -- what works were preserved, how, various interpretations, practices, preservation, and losses. Interesting to note that most of the preserved Greek and Roman works were found in obscure Arabian monasteries and libraries. The mainstream collections themselves were often lost in raids, fires, or other mishaps. Which makes the LibGen situation all the more relevant and urgent.

(I'm a huge user of the site and others like it, for what it's worth.)

On the amount of total data being captured: there's a huge difference between quantity and quality measures of information. They're almost certainly inversely related.

Of what books were written in antiquity, up to the time of the printing press, say, odds were fairly strong that a work would be read.

At 1 million new titles being published per year, there are only 330 people in the US per book, or roughly 400 native English speakers worldwide. (With ~2 billion speakers worldwide, the total audience might reach 2,000 per book). Clearly, most of what's being written will have a very small, or no, audience.

For machine-captured data, the likelihood that any of it is seen directly by a human is vanishingly small. More of it will undergo some level of machine processing or interpretation, though even that only applies to a fairly small fraction of data. Insert old joke about the WORN drive: write once, read never.

As for storage costs (and/or size), at a 15% cost reduction per year, storage cost halves roughly every 4.3 years (about 4 years and 3 months), which means that in 10 years, the $10k price tag becomes about $2k, and in 20 years, it should be under $400. For the entire Library of Congress collection.
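The decay arithmetic can be checked directly; a sketch, treating the 15%/year decline as exact:

```python
import math

RATE = 0.85  # cost multiplier per year at a 15% annual reduction

halving_years = math.log(0.5) / math.log(RATE)
print(halving_years)        # ~4.27 years per halving

print(10_000 * RATE ** 10)  # ~$1,969 after 10 years (about $2k)
print(10_000 * RATE ** 20)  # ~$388 after 20 years (under $400)
```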

Flash drives seem to be increasing in capacity by a factor of 10 every 2.5 years. There are now 2 TB flash drives, so 200 TB might be as little as 5 years out. That ... still sounds optimistic to me.



The more practical problems are simply organising, cataloguing, and accessing the archives. This is an area that still needs help.



1. I think that's from "Science and the Citizen", 1943, though the BBC and I have a disagreement concerning access. https://www.bbc.co.uk/archive/hg-wells--science-and-the-citi...

While brushing up on the encyclopaedists, I found this little gem:

"Among some excellent men, there were some weak, average, and absolutely bad ones. From this mixture in the publication, we find the draft of a schoolboy next to a masterpiece." — Denis Diderot

Taking the quote out of context (and aside from its historical male-centered language) - it sure rings true of the current state of the web, as well as books.

About the inverse relationship of quantity vs quality, we seem to be drowning in quantity! As you've pointed out, there's great need for thoughtful organization and curation.

I like how you break down the quantifiable aspects to draw a historical trend and future projection. The rise of "data science" and "big data" in the past few decades really makes sense in this light.

I'm sure machine learning and "AI" will play an increasing role in the task of organizing and processing all this information, but at the bottom I feel that the most value probably comes from human curation.

LibGen has been an amazing resource for me as a lover of knowledge, a life-long book worm. I've got bookshelves and boxes full of physical books as well, but it's a drop in the ocean..

I love the Diderot quote. I'd also encountered earlier:

"As long as the centuries continue to unfold, the number of books will grow continually, and one can predict that a time will come when it will be almost as difficult to learn anything from books as from the direct study of the whole universe. It will be almost as convenient to search for some bit of truth concealed in nature as it will be to find it hidden away in an immense multitude of bound volumes. When that time comes, a project, until then neglected because the need for it was not felt, will have to be undertaken...."

... and on for another several paragraphs. It's an extraordinarily keen observation on the state and future of knowledge. At the always excellent History of Information website:


(Diderot is on my list of authors to explore in more depth.)

The fact that the quality of any given information or exchange is often (though not always) entirely divorced from its source (or author) is another interesting note. There are a few points here worth expanding on.

At least probabilistically, there are spaces (real or virtual) in which it's more likely to encounter good ideas. HN, for its various failings, does well in today's Net. Google+, for all its faults, was similarly useful.

Size matters far less than selection. The tendency for centres of learning, research, and/or inquiry (and not necessarily in that order) to emerge is one that's been long observed, and their durability is remarkable. The first universities (Bologna, Padua, Oxford, Paris, Cambridge, Heidelberg, and others, see: https://en.wikipedia.org/wiki/Medieval_university) are often still, 600-700 years later, among the best in the world. Certainly in the US, Harvard, Yale, Princeton, M.I.T., among the earliest founded, remain the most prestigious. Though as noted in the conversation with Tyler Cowen and Patrick Collison, the list from 1920 is "completely the same, except we’ve added on California".


What happens as the overall quantity and flux of information increases is that more effective rejection systems are required. That is: you've got too much information flowing in, and you want a way to cheaply, with minimal effort or consequential residual load, reject information that may be irrelevant, with minimal bias.

There are numerous systems that have been arrived at, and many of our cognitive biases or informal tests for truth arise out of these (optimism, pessimism, availability, sunk-cost, tradition, popularity, socio-ethnic prejudice, etc.). Randomised methods are probably far fairer and less prone to category error. Michael Schulson's sortition essay in Aeon remains among the best articles I've read in the past decade, if not several:

"If You Can't Choose Wisely, Choose Randomly"


Another fundamental problem is self-dealing and self-selection within institutions. Much of the failure within academia (also touched on by Cowen and Collison, who, I'll note, I don't generally agree with, though they are touching on and making many points I've been pursuing for some years) comes from the fact that it's internal selection of students, faculty, articles, topics, and ideologies, rather than strict tests of real-world validity, that promotes these structures.

The same problems infect government and business -- it's not as if any one social domain is immune to this.

Oh, and another lecture by H.G. Wells on that topic:

"...When I go to see my government in Westminster I find presiding over it the Speaker in a wig and a costume of the time of Dean Swift, the procedure is in its essence very much the same. The Members debate bring motions and when they divide the art of counting still in governing bodies being in its infancy they crowd into lobbies and are counted just as a drover would have counted his sheep two thousand years ago...."


(Audio quality is exceptionally poor, 1931 recording.)

Partial transcript: http://www.aparchive.com/metadata/INTERVIEW-WITH-H-G-WELLS-S...

AI ... may be useful, but seems to be result-without-explanation, a possible new form of knowledge, to go with revelation (pervasive if not particularly accurate), technical (means), and scientific (causes / structural).

Wholehearted agreement on LibGen.

Very enjoyable conversation BTW, thank you.

Nature shows us how to process information at ever increasing noise and scale - https://www.edge.org/response-detail/10464

Yes and no.

Briefly: the article distinguishes "endocrinal" vs. "distributed" decisionmaking.

This applies at some levels, but not at others.

For individual humans, we don't have the option of rewiring our consciousnesses, which are rather pathetically single-threaded, and can at best multitask poorly by task-switching, at a very great loss of task proficiency.

Even within collective organisations (companies, governments, communities), the multiple-independent-actors approach works where those actors' actions are autonomous and independent of others. Or, in the alternative, where they work without mutual conflict toward a common goal.

But you get problems where either individual actors' motivations and actions are in conflict, or in which a single global decision must be made (as with various global catastrophic risks), and multiple independent decisions cannot be arrived at. Even for noncritical arbitrary decisions, such as which side of the road to drive on, in which there is no compelling argument to be made for one side or the other, but in which both sides cannot be simultaneously selected, you need some global decisionmaking capacity.

When you reach the point of either an existing decisionmaking system (as in: a single human, with the finite and largely immutable information acquisition and processing capabilities corresponding), or a multi-agent system which must reach a common decision, you've got the challenge of limiting data intake to that amount which allows effective function within the environment, and avoids overloading capabilities or ineffective action.

The article "Evolving the Global Brain" was thought-provoking, especially in the context of our discussion about the history of information and the exponentially increasing amount of information for humanity to gather/produce, process, curate, archive.

It's an attractive concept, that human society is structurally similar to a brain, and that an individual is a neuron. (If humanity is the brain, I suppose the rest of the Earth is the body. We're not doing too well as the self-appointed brain of the operation.)

My first reaction to the analogy of "endocrinal" (one-to-many) and "neural" (many-to-many) decision making, is that it's missing a primal psychological/biological motivation of humans to seek to dominate others of its own kind as well as all of nature. I'm not familiar enough with biology to say definitively, but I'm pretty sure the endocrinal system does not actively seek to subjugate the neural system (or vice versa) and dominate the whole body.

Social organization, it seems to me, is more a function of power, very small groups gaining advantage and dominance over vastly larger groups of people, than that of collaboration for mutual benefit. (I might be a bit too cynical of political motivations and authentic democracy these days.)

From the final paragraph:

> ..the current global brain is only tenuously linked to the organs of international power. Political, economic and military power remains insulated from the global brain, and powerful individuals can be expected to cling tightly to the endocrine model of control and information exchange.

I'd disagree with this, and say that the global brain (if we mean the Internet and its empowerment of globally networked intelligence) was born from the wombs of "political, economic and military power". It never achieved escape velocity to become a truly free, autonomous and collaborative, neural model of decision making.

To backtrack a bit:

> Well-connected collective entities like Google and Wikipedia will play the role of brainstem nuclei to which all other information nexuses must adapt.

The most powerfully well-connected collective entities are international political/financial/corporate entities, and indeed do they more or less dictate how all information nexuses (nexii?) must adapt.

One biological analogy that comes to mind, is how propaganda and "disinformation" act like neurotoxins in the social brain, introducing noise/entropy, skewing its coherence, and preventing well-informed and orchestrated cooperation.

Another is how established political powers have a well-developed "immune system", composed of mass media, legal structures, military/police force, surveillance of the public. This immune system could be seen at work, for example, at the environmental protests at the Standing Rock Indian Reservation.

The final sentence of the article:

> This formidable design task is left up to us.

By this I assume the author means, evolving the global brain. Quite a challenge! From my perspective, it's going to be a historic struggle: design or be designed.

There are also multiple petabytes of microfiche scans of old newspapers. And of course nobody cares about it. The project was shut down around 2011 and the data became "owned" by a team that didn't care for it. There was talk of just deleting the data because the team didn't want to pay for it. Ugh.

Huh interesting. Have you got any details? Names of the project? There are probably institutions willing to host that content.

I'm incredibly thankful that the public library system was invented before copyright maximalists got control of Congress.

In this case the issue seems to have come from "copyright minimalists" instead : wanting the books to be freely available, rather than making money for Google...

I wonder why the Copyright Office didn't just buy Google Books, would only have cost a few hundred million $ ?

> Upon hearing that Google was taking millions of books out of libraries, scanning them, and returning them as if nothing had happened, authors and publishers filed suit against the company, alleging, as the authors put it simply in their initial complaint, “massive copyright infringement.”

This is where the project derailed and never quite recovered.

Did we read the same article ?


> As Tim Wu pointed out in a 2003 law review article, what usually becomes of these battles—what happened with piano rolls, with records, with radio, and with cable—isn’t that copyright holders squash the new technology. Instead, they cut a deal and start making money from it.


> now, in 2011, there was a plan—a plan that seemed to work equally well for everyone at the table


> DOJ’s intervention likely spelled the end of the settlement agreement. No one is quite sure why the DOJ decided to take a stand instead of remaining neutral. Dan Clancy, the Google engineering lead on the project who helped design the settlement, thinks that it was a particular brand of objector—not Google’s competitors but “sympathetic entities” you’d think would be in favor of it, like library enthusiasts, academic authors, and so on—that ultimately flipped the DOJ.

I’m fairly confident that Google Books is a huge money loser for Google. The only reason it’s still online is because there are people within Google willing to stick their necks out to spend the money on it.

Indeed, but for how long, considering the Google Books team itself seemingly wants to delete the ~100M books database ?

P.S.: Might be soon, considering that it was basically what Google was initially about, and one of the founders just resigned from Alphabet...

Source on the deletion initiative?

Thanks, though I was hoping for better than an HN comment...
