
Archivists Are Trying to Make Sure LibGen Never Goes Down - legatus
https://www.vice.com/en_us/article/pa7jxb/archivists-are-trying-to-make-sure-a-pirate-bay-of-science-never-goes-down
======
legatus
This is an extremely important effort. The LibGen archive contains around 32
TB of books (by far the most common being scientific books and textbooks,
with a healthy dose of non-STEM). The SciMag archive, backing up Sci-Hub,
clocks in at around 67 TB [0]. This is invaluable data that should not be
lost. If you want to contribute, here are a few ways to do so.

If you wish to donate bandwidth or storage, I personally know of at least a
few mirroring efforts. Please get in touch with me over at
legatusR(at)protonmail(dot)com and I can help direct you towards those behind
this effort.

If you don't have storage or bandwidth available, you can still help.
Bookwarrior has requested help [1] in developing an HTTP-based decentralizing
mechanism for LibGen's various forks. Those with experience in software may
help make sure those invaluable archives are never lost.

Another way of contributing is by donating bitcoin, as both LibGen [2] and
The-Eye [3] accept donations.

Lastly, you can always contribute books. If you buy a textbook or book,
consider uploading it (and scanning it, should it be a physical book) in case
it isn't already present in the database.

In any case, this effort has a noble goal, and I believe people of this
community can contribute.

P.S. The "Pirate Bay of Science" is actually LibGen, and I favor a title
change (I posted it this way as to comply with HN guidelines).

[0] [http://185.39.10.101/stat.php](http://185.39.10.101/stat.php)

[1] [https://imgur.com/a/gmLB5pm](https://imgur.com/a/gmLB5pm)

[2] bitcoin:12hQANsSHXxyPPgkhoBMSyHpXmzgVbdDGd?label=libgen, as found at
[http://185.39.10.101/](http://185.39.10.101/), listed in
[https://it.wikipedia.org/wiki/Library_Genesis](https://it.wikipedia.org/wiki/Library_Genesis)

[3] Bitcoin address 3Mem5B2o3Qd2zAWEthJxUH28f7itbRttxM, as found in
[https://the-eye.eu/donate/](https://the-eye.eu/donate/). You can also buy
merchandising from them at [https://56k.pizza/](https://56k.pizza/).

~~~
canuckintime
> Lastly, you can always contribute books. If you buy a textbook or book,
> consider uploading it (and scanning it, should it be a physical book) in
> case it isn't already present in the database.

There's no easy solution for scanning physical books, is there?

~~~
toomuchtodo
There are providers [1] that will destructively scan the book for you and
return a PDF. If you want to preserve the book, you're stuck using a scanning
rig [2]. The Internet Archive will also non-destructively scan as part of Open
Library [3], but they only permit one checkout at a time of scanned works, and
the latency can be high between sending them a book and it becoming available.
FYI, 600 DPI is preferred for archival purposes.

[1] [http://1dollarscan.com/](http://1dollarscan.com/) (no affiliation, just a
satisfied customer, can't scan certain textbooks due to publisher threats of
litigation)

[2] [https://www.diybookscanner.org/](https://www.diybookscanner.org/)

[3] [https://openlibrary.org/help/faq](https://openlibrary.org/help/faq)

~~~
bumbledraven
A big +1 for 1dollarscan.com. They've scanned many hundreds of books for me.
The quality of the resulting PDFs is uniformly excellent, their turnaround
time is fast, and their prices are cheap ($1 per 100 pages).

I've visited their office -- located in an inexpensive industrial district of
San Jose -- on multiple occasions. They have a convenient process for
receiving books in person.

I believe the owners are Japanese and the operation reminds me of the
businesses I visited in Tokyo: quiet, neat, and über-efficient.

~~~
dunstad
> quiet, neat, and über-efficient

I wish the same could be said for the Tokyo office I work in!

------
miki123211
The new architecture of pirate sites, what I call the Hydra architecture,
seems pretty interesting to me. There isn't a single site hosting the content,
but a group of mirrors freely exchanging data between one another. In case
some of them go down, the other ones still remain and new ones can appear,
copying data from the remaining mirrors. This is like a hydra that grows two
heads every time you chop one off. It's absolutely unkillable, as there's no
single group or server to sue.

A more advanced version of this architecture is used by pirate addons for the
Kodi media center software. Basically, you have a bunch of completely legal
and above board services like Imdb that contain video metadata. They provide
the search results, the artworks, the plot descriptions, episode lists for TV
shows etc. Impossible to sue and shut down, as they're legal. Then, you have a
large number of illegal services that, essentially, map IDs from websites like
IMDB to links. Those links lead to websites like Openload, which let you host
videos. They're in a gray area: if they comply with DMCA requests and are in
a reasonably safe jurisdiction, they're unlikely to be shut down. On the Kodi
side, you have a bunch of addons. There are the legitimate ones that access
IMDB and give you the IDs, the not-so-legitimate ones that map IDs to URLs,
and the half-legitimate ones that can actually play stuff from those URLs (not
an easy task, as websites usually try to prevent you from playing something
without seeing their ads). Those addons are distributed as libraries, and are
used as dependencies by user-friendly frontends. Those frontends usually
depend on several addons in each category, so, in case one goes down, all the
other ones still remain. It's all so decentralized and ownerless that there's
no single point of failure. The best you can do is killing the frontend addon,
but it's easy to make a new one, and users are used to switching them every
few months.
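
To make that layering concrete, here's a minimal sketch of the resolution chain (my own illustration; the provider/scraper/resolver names are hypothetical and none of this is real Kodi addon API):

```python
# Hypothetical sketch of the layered "Hydra" resolution chain described above.
# Each list stands in for a set of interchangeable addons; if one entry
# disappears, the frontend just falls through to the next.

METADATA_PROVIDERS = [lambda title: {"id": "tt0000001", "title": title}]     # legal metadata (IMDb-style)
LINK_SCRAPERS = [lambda vid_id: [f"https://host.example/{vid_id}/mirror1"]]  # gray-area ID -> URL mappers
RESOLVERS = [lambda url: url + "?stream=1"]                                  # turn a page URL into a playable stream

def play(title):
    for provider in METADATA_PROVIDERS:
        meta = provider(title)
        for scraper in LINK_SCRAPERS:            # scraper gone? try the next one
            for url in scraper(meta["id"]):
                for resolver in RESOLVERS:       # resolver broken? try the next one
                    stream = resolver(url)
                    if stream:
                        return stream
    return None

print(play("Some Show S01E01"))
```

The point is that every layer is replaceable: killing any single addon only removes one entry from one of the lists.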

~~~
MadWombat
> It's absolutely unkillable

Just like any other distributed system, this is vulnerable to organized take
downs and scare tactics. There was a whole bunch of mirrors of Pirate Bay, yet
once most of Europe's legal systems adopted the "sharing is theft" mindset, it
became pretty much impossible to find one.

~~~
asdff
But now the main site seems to be bulletproof. There was a time when there
would be a new official link weekly. I'm not sure what changed structurally
with hosting TPB.

~~~
wyxuan
They just stopped going after it, and focused resources on stopping streaming
websites

------
sanxiyn
The Yongle Encyclopedia was a similar project in 15th-century China. It was
the largest encyclopedia in the world for 600 years, until surpassed by
Wikipedia.

Alas, Yongle Encyclopedia is almost completely lost now. Archiving is harder
than you think.

[https://en.wikipedia.org/wiki/Yongle_Encyclopedia](https://en.wikipedia.org/wiki/Yongle_Encyclopedia)

~~~
weinzierl
I read the Wikipedia article about it, and the sad thing is that the majority
of the Yongle Encyclopedia seems to have been destroyed only in quite recent
times.

~~~
knolax
> but 90 percent of the 1567 manuscript survived until the Second Opium War in
> the Qing dynasty. In 1860, the Anglo-French invasion of Beijing resulted in
> extensive burning and looting of the city,[16] with the British and French
> soldiers taking large portions of the manuscript as souvenirs.

Preservation is easy if you don't get invaded.

~~~
asdff
It's easy if you anticipate these things. Who put the Dead Sea Scrolls in that
cave in the middle of nowhere? Not someone who went in and forgot their scroll
one day. Someone who had the foresight that this would be a safe space in the
face of who knows what future threat. And it paid off.

~~~
zaarn
I doubt it was that intentional. I would wager that "someone forgot about it"
is the more likely explanation.

------
EthanHeilman
Maybe we should print this out on acid-free, paper-thin, flexible wood-pulp
sheets stitched together to form linear organized aggregations. Each
aggregation would contain one or more works and be searchable using a SQL-like
database. To make this plan really work, there would need to be a collection of
geographically distributed long-term physical repositories that would receive
periodic updates as new material became available.

All joking aside, I do wonder whether digital or analogue formats are better
able to survive into the distant future.

* What impact will DRM have on the accessibility of our knowledge to future historians?

* Is anything recoverable from a hard drive or flash media after 500 years in a landfill?

* Will compressed files be more or less recoverable? What about git archives?

* Will the future know the shape of our plastic G.I. Joe toys but not the content of the G.I. Joe cartoon?

~~~
frobozz
> I do wonder whether digital or analogue formats are better able to survive
> into the distant future.

There are 5000 year old clay tablets we can still read.

There are centuries old documents on paper, vellum etc. that we can still
read.

I personally have decades-old paper documents I can easily read, and a box of
floppies I can't.

It's not just a problem of unreadable physical media, I have a database file
on a perfectly readable HD that was generated by an application that is no
longer available. I might be able to interrogate it somehow, but it won't be
easy.

Digital formats and connectivity make LOCKSS easier, so that's a plus. There's
less chance of a fire or flood or space-limited librarian destroying the last
known copy. However, without archivists actively transforming content to new
formats as required, it might only take a few decades before a lot of content
starts to require a massive effort to read.

~~~
EthanHeilman
Clay is the plastic of the ancient world.

Let's say the probability that a single copy of a physical book survives
1,000 years, is found, and is understood by an archaeologist* is pB, and the
probability that a single copy of a book on an SSD survives 1,000 years, is
found, and is understood by an archaeologist* is pD. Even if pB is far larger
than pD, there might be so many more copies of a given book held on SSDs that
the book is more likely to survive via an SSD than via a physical copy. On the
other hand, the technology to recover data from SSDs might not exist in 1,000
years.
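
To put the copy-count effect in numbers (my own toy illustration, not figures from the comment): if each of N independent copies survives with probability p, the chance that at least one survives is 1 - (1 - p)^N, so sheer replication can swamp a large per-copy disadvantage.

```python
# Toy illustration with made-up probabilities: many fragile copies can beat
# a few durable ones.
def p_any_survives(p_single, n_copies):
    """Probability that at least one of n independent copies survives."""
    return 1 - (1 - p_single) ** n_copies

pB, nB = 1e-3, 50        # physical copies: better per-copy odds, few copies
pD, nD = 1e-5, 500_000   # SSD copies: far worse per-copy odds, vastly more copies

print(p_any_survives(pB, nB))   # ~0.049
print(p_any_survives(pD, nD))   # ~0.993
```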

It could also be the case that each generation would copy these books onto new
digital media, providing an unbroken chain of copies. The oldest complete copy
of the Iliad is the Venetus A, which is from around 1000 AD (1,000 years ago),
despite the Iliad probably first being written down around 800 BC (2,800 years
ago). It was copied from earlier copies of copies of copies.

I really don't know how this will play out and I've been unable to find
research on how long SSD and flash memory based media survives especially if
buried in a landfill.

* - If archaeologists exist in the future. The current push from the STEM boosters to defund and de-emphasize the humanities may result in a near-future without archaeologists or funded archaeological projects. Over 1,000 years the entire field could die.

~~~
asdff
Would an SSD even function after 1000 years? Unless sealed, I imagine ambient
moisture would do a number inside the drive. The same is true for books of
course, but we still have 1000 year old books that have lasted by sitting on a
shelf in churches and temples, etc., without any specific care until recent
history.

The nice part of a book in an apocalyptic scenario is that you can copy it
even if you don't know the language. You don't need a special tool for this,
only one capable of marking a surface. It wouldn't be fun or fast, but it's
possible and it's what monks did for centuries. Would archeologists 1000 years
from now be lucky enough to find a SATA cable too?

~~~
mrob
It doesn't really matter if the SSD as a whole still works, because after 1000
years you'll never recover the data via the normal interface. Modern MLC flash
is often specified for less than 1 year data retention, and even SLC is
unlikely to make it to 1000 years. Attempting to read it will only make things
worse ("read disturb"). The best hope of saving the data is with some future
nanotech that directly probes each floating gate transistor and counts the
electrons, and reverse engineering all the error correction and wear leveling.

~~~
EthanHeilman
I would assume they would read the SSD not by powering it on and plugging it
into a computer, but by disassembling it and physically imaging its structure.
This would also bypass all the wear-leveling infrastructure, allowing them to
recover deleted data. It reminds me of the current techniques of using X-rays
to read writing on the odd scraps of paper used to bind a book [0].

[0]: "X-rays reveal 1,300-year-old writings inside later bookbindings"
[https://www.theguardian.com/books/2016/jun/04/x-rays-
reveal-...](https://www.theguardian.com/books/2016/jun/04/x-rays-reveal-
medieval-manuscripts)

------
knzhou
Libgen is one of the greatest contributors to scientific productivity
worldwide, possibly beaten only by Sci-Hub. Just about everybody in academia
knows about it. If it ever vanished, some of us could probably still get by
trading files from person to person, but nothing could be as perfect as what
we've got now.

~~~
lioeters
> possibly beaten only by Sci-Hub

Today I learned that Library Genesis is actually "powered by Sci-Hub" as its
primary source.

So I guess they're sister projects by similarly minded people (who seem to be
mostly/originally based in Slavic countries, which I find interesting
culturally - perhaps it's due to a looser legal environment + activist
academics?).

> Just about everybody in academia knows about it.

That really says something about the state of society, this tension between
copyright laws (and the motivations behind them) and the intellectual ideal of
free and open access to knowledge.

~~~
mikorym
I am not an expert on the topic, but I believe that in the former Soviet Union
it was common among mathematicians to pass around preprints (a la arXiv).
These then percolated through to the West. I think it had to do with the USSR
and their restrictive (if we are being euphemistic) policies towards
academics.

~~~
gdy
"the USSR and their restrictive (if we are being euphemistic) policies towards
academics."

What do you mean?

~~~
alteria
Their policies were far more than "restrictive" is how I'm reading it.

See [1]

[1]
[https://en.wikipedia.org/wiki/Suppressed_research_in_the_Sov...](https://en.wikipedia.org/wiki/Suppressed_research_in_the_Soviet_Union)

~~~
mikorym
Yes. I'm saying restrictive to describe the effect on academic papers. The
effect on (oppression of) the academics themselves was much worse.

~~~
gdy
No, that's BS.

There are well known cases of genetics and cybernetics being banned for
ideological reasons during Stalin's time. Scientific books and articles of
convicted 'enemies of the state' were dangerous to possess in that time too.
Some scientists used ideological 'arguments' in scientific debates which were
dangerous to argue against.

But all that, AFAIK, ended after Stalin's death in 1953.

Moreover, I've never heard anything about mathematics in this regard.

~~~
mikorym
Not sure what you are saying. Mathematicians were not even allowed to travel
abroad [1] and any "concessions" were essentially as it pleased the USSR
state. Only from 1990 was movement free in the true sense of the word.

[1] An example was when Margulis won the Fields medal:
[https://en.wikipedia.org/wiki/Grigory_Margulis](https://en.wikipedia.org/wiki/Grigory_Margulis).
There are many other examples too.

~~~
gdy
What does that have to do with sharing knowledge in the USSR and the countries
in the Soviet bloc?

It was never in Soviet ideology to hide knowledge behind paywalls. See, for
example, this [0] post about the Mir publishing house and the warm comments
from Indians who grew up with their books. Sci-Hub's ideology is just a
continuation of this approach.

[0]
[https://news.ycombinator.com/item?id=21352277](https://news.ycombinator.com/item?id=21352277)

~~~
mikorym
> Sci-Hub's ideology is just a continuation of this approach.

Actually, that was the point of what I was saying—the mathematicians had to be
inventive and thus passed around preprints that they knew would also be read
in the West.

~~~
gdy
Why did they have to be inventive? Please provide a source.

------
turc1656
I don't see anyone having mentioned the possibility of posting this data to
Usenet at all - at minimum for archival purposes, which should be good for ~8-9
years. That way at least the data isn't lost. With so many of those torrents
having 0 or 1 seed, this is a serious risk I think, despite the comments
elsewhere about people rotating what they seed.

I realize that doesn't solve the access problem for most people, as most of the
users who need this research might not know how to use Usenet or even be
familiar with it at all, but I think the first major concern would be to
secure the entire repository on a stable network. Usenet seems like a good
place for that even if it doesn't serve as a means of distribution.
Encrypting the uploads would make them immune to DMCA takedowns, provided that
the decryption keys weren't made public and were only shared with individuals
related to the maintenance of the LibGen project.
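
As a sketch of the encrypt-before-posting step (my own illustration; Fernet from the `cryptography` package is just one reasonable choice, and key distribution is the part that actually matters):

```python
# Symmetrically encrypt an archive chunk before it gets posted to Usenet.
# Sketch only: the filename is made up, and key handling/distribution is not shown.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # would be held privately by the maintainers
fernet = Fernet(key)

with open("libgen_chunk_0001.tar", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

with open("libgen_chunk_0001.tar.enc", "wb") as f:
    f.write(ciphertext)            # this opaque blob is what would be posted
```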

~~~
walrus01
Two thoughts on that. Encoding it to a text format with CRC data for posting
to Usenet is highly inefficient in terms of data storage. And 33 TB of stuff is
not going to be retained for 8-9 years; the last I checked, due to the huge
volume of binaries traffic, the major commercial Usenet feed providers have at
most 6-9 months of retention for the major binary groups. Beyond that it
becomes cost-prohibitive for them in terms of disk storage requirements. This
is not an issue for the majority of their customers; 6-9 months is more than
long enough retention to go find a 40 GB 2160p copy of some recently-released-
on-Blu-ray movie.

~~~
trevyn
yEnc overhead is about 2% and there are plenty of providers with ~10 year
retention.

~~~
walrus01
Wow. I can't even imagine how much disk space ten years of retention of
alt.binaries.* takes up. It's been literally ten years since I last did
anything serious Usenet related.

~~~
zaarn
At least in my experience, 10-year providers ask for more money and provide
less high-speed bandwidth (after which your up/down is usually limited to
around 10 or 1 Mbps).

------
lukebuehler
To me, an aspiring scholar, LibGen is the most amazing tool ever. Things like
inter-library loan and access to databases on university networks already make
life so much easier than it used to be—but nothing beats LibGen in terms of
convenience. I'm in the nowadays obscure field of patristic theology and I
can't believe how much stuff I can find on LibGen, often things that even
highly specialized research libraries like Harvard's don't have.

The hours that LibGen saved me in gathering all the sources for my research
must be in the hundreds. Thank you!

------
dooglius
There is a huge amount of duplication there (i.e. books that have many scans),
I wonder if it would be better to tackle that versus doing a straight backup.

~~~
Invictus0
I think the duplication issue is probably overstated. I doubt tackling that
would shave off more than 20% of the total backup size.

~~~
dooglius
Speaking from personal experience, I usually see several results for any
search. Granted, there's a big selection bias there, but 20% seems way too
small.

~~~
abdullahkhalids
Because you, or anyone, are most likely to search for relatively popular books,
so those books will have multiple copies. But for every popular book, there
are many unpopular but still useful books that only have a single copy.

------
burtonator
What's interesting is that 32 TB is becoming more and more affordable and the
research material is staying roughly the same size.

That might change though as people start including video + data within papers
and have new notebook formats that are live and contain docker
containers/ipython, etc.

It's a shame we can't just mail these around.

~~~
jbverschoor
You can buy 48TB (4x12TB) for €1000. Store some index on an SSD, and you have
another full node.

~~~
washadjeffmad
If you don't care about warranty, 8 and 12TB drives routinely go for $15/TB on
sale inside WD Elements.

I picked up 32TB for just under $500 with discount over the holiday that way.

~~~
StavrosK
Can you elaborate? What's the catch?

~~~
washadjeffmad
The only catch is that it's a minor lottery which model drive you're getting.

For instance, I got all white label WD80EMAZs (256MB cache, non-SMR, same
firmware as the Reds) in this batch, so I had to insulate the 3.3V pins.

There are also true Reds, 128MB and 512MB cache drives, helium-filled ones,
7.2K HGSTs slowed to 5.4K, and other variants.

~~~
jbverschoor
Or use a traditional power-supply-to-SATA adapter cable (which avoids the 3.3V
pin issue).

------
Tepix
Related: looking at hard-disk cost per terabyte, external drives are quite
often cheaper than internal ones.

For example right now in Germany I can get a WD 8TB USB 3.0 drive for 135€ but
the cheapest internal 8TB drive costs 169€.

Any idea why? It's puzzling.

~~~
LameRubberDucky
I noticed this yesterday while shopping for cyber Monday deals. If you want to
load up a server with drives, perhaps the external drives can be removed from
their cases and used internally?

~~~
javitury
Check it before you buy it. Years ago I bought a 1TB WD external drive where
the USB interface was connected directly to the drive's board, so it couldn't
be shucked and used as an internal SATA drive.

~~~
LameRubberDucky
Thanks! I will have to read up on shucking and /r/DataHoarder. I would
think someone already has a list of which external drives can be used this
way.

------
sandov
Let me say this: I fucking love libgen. It actually makes my life better and
I'm so thankful to the people running it.

------
nullifidian
Posting this here only creates problems for them. The more it's known in the
West, the more likely it is to go down.

~~~
coffee12345
+1, bookwarrior has warned about this.

~~~
news_hacker
who is bookwarrior?

------
voldacar
Is there a way to just download the whole 32TB to your own machine? I see a
ton of mirrors but the content seems to be highly fragmented between them

~~~
legatus
There are ways to do so. The archive is made up of many, many torrents (I
believe the database is updated monthly, if not biweekly). If you have
the storage/bandwidth available for the whole 32 TB, please get in touch
and I may be able to help you get the whole deal without too much hassle.
Otherwise, just pick some torrents (it would be best to pick them based on
torrent health, but there are too many to check manually) and try to keep
seeding as much as possible.

EDIT: To check the health of LibGen's torrents, see this Google Sheet:
[https://docs.google.com/spreadsheets/d/1hqT7dVe8u09eatT93V2x...](https://docs.google.com/spreadsheets/d/1hqT7dVe8u09eatT93V2xvth-
fUfNDxjE9SGT-KjLCj0/edit?usp=sharing)

Thanks frgtpsswrdlame for the heads up.
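
If you'd rather not eyeball the sheet, here is a rough sketch of the "grab the least-seeded torrents first" step (assuming a CSV export of that spreadsheet; the column names here are guesses and will need adjusting to the real headers):

```python
# Pick the N least-seeded LibGen torrents from a CSV export of the health sheet.
# The "torrent" and "seeders" column names are assumptions; check the real headers.
import csv

def least_seeded(csv_path, n=20):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    rows.sort(key=lambda r: int(r.get("seeders") or 0))
    return [(r["torrent"], r["seeders"]) for r in rows[:n]]

for name, seeds in least_seeded("libgen_torrent_health.csv"):
    print(seeds, name)
```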

~~~
ihuman
I'm pretty surprised by the lack of seeders. Out of the 2438 torrents listed,
a third have 0 seeders, another third have 1 seeder, and all but 5 have less
than 10. Hopefully the publicity boosts those numbers.

~~~
penagwin
From what I've heard a good chunk of people rotate their seeds for LibGen
because their seedboxes can't handle all the connections for every torrent at
once.

~~~
js8
Is there some tool or documentation describing this practice?

~~~
Red_Leaves_Flyy
I'm sure someone could get you the info to get set up as a seeder. For modern
clients it's rather trivial to manage that many torrents. Get any
decent modern CPU, 4GB+ RAM, and $560 in storage and you're off.

~~~
penagwin
I think the problem is that because of the size of each torrent, and the fact
that there are ~1,000 of them, it's difficult to effectively seed them all at
once, so instead people would rather seed sections at a time and rotate
through them.

I'm not sure how people set up the rotation though; that can't be an incredibly
common feature, but I could be wrong.

~~~
namibj
There are features that prioritize torrents with a low seed/leech ratio in a
sort of periodic fashion. It also partially auto-balances, because a swarm only
needs a little more than unity ratio injected into it to get itself fully
replicated. So each torrent that gets chosen because of a low seed/leech ratio
will inherently drop out of that criterion as soon as the swarm is self-
sufficient.
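
For anyone wondering what the rotation could look like in practice, a minimal sketch (the client here is a stub; you'd swap in whatever your torrent software's API actually provides, and the threshold is arbitrary):

```python
# Sketch of seed rotation: keep only the most under-seeded torrents active,
# pause the rest so a modest box isn't juggling thousands of swarms at once.
from dataclasses import dataclass

@dataclass
class Torrent:
    info_hash: str
    seeds: int
    leechers: int

class StubClient:
    """Stand-in for a real torrent-client API (qBittorrent, rTorrent, etc.)."""
    def __init__(self, torrents):
        self.torrents = torrents
    def list_torrents(self):
        return list(self.torrents)
    def start(self, info_hash):
        print("seeding", info_hash)
    def stop(self, info_hash):
        print("paused ", info_hash)

MAX_ACTIVE = 100  # how many torrents to seed at a time (arbitrary)

def rotate(client):
    ts = client.list_torrents()
    # Lowest seed/leech ratio first: those swarms need help the most.
    ts.sort(key=lambda t: t.seeds / max(t.leechers, 1))
    keep = {t.info_hash for t in ts[:MAX_ACTIVE]}
    for t in ts:
        (client.start if t.info_hash in keep else client.stop)(t.info_hash)

rotate(StubClient([Torrent("aaa", 0, 3), Torrent("bbb", 12, 1)]))
```

Run from cron every few hours, and the selection naturally shifts as swarms become self-sufficient, as described above.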

------
Avamander
Why not publish the site over IPFS? That would make P2P hosting much simpler.

~~~
legatus
Currently (at least for The-Eye) it's about IPFS's barrier to entry. I expect
LibGen's case to be similar. Most people don't know about it, and if even
those who knew about it had to learn how IPFS works etc., they would probably
just try to find the book they're looking for elsewhere.

~~~
archi42
I am not fully aware of how IPFS operates, but wouldn't it at least solve the
back-end mirroring? Front-end servers would then "only" need to access IPFS
for continuous syncing of metadata (for search) and for fetching
user-requested files.
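
For what it's worth, the "fetch user-requested files" part of a front end could be as thin as a call to a local IPFS gateway (sketch only; the CID below is a placeholder and this assumes the standard local gateway on port 8080):

```python
# Fetch a user-requested file through a local IPFS gateway (default port 8080).
# The CID is a placeholder, not a real LibGen object.
import requests

cid = "QmPlaceholderPlaceholderPlaceholderPlaceholder"
resp = requests.get(f"http://127.0.0.1:8080/ipfs/{cid}", timeout=60)
resp.raise_for_status()

with open("requested_book.pdf", "wb") as f:
    f.write(resp.content)
```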

------
fghtr
Are there any i2p torrents? I guess anonymity might be helpful if I want to
mirror/seed this data...

~~~
zozbot234
I assume anyone could simply seed the "official" torrents via i2p? Not sure
how that system actually works, it's interesting for sure but a lot less well-
known than the alternatives.

------
buboard
One of the next interplanetary or interstellar probes should carry a copy of
the Sci-Hub torrent in some kind of permanent storage.

~~~
saalweachter
Do we have anything rated for a few millennia of interstellar radiation
besides etched gold plates?

~~~
legatus
Microsoft's Project Silica [0] may hopefully provide really long-term, large-
capacity, archive-grade storage on Earth. I wonder what effects interstellar
radiation would have on it.

[0] [https://www.theverge.com/2019/11/4/20942040/microsoft-
projec...](https://www.theverge.com/2019/11/4/20942040/microsoft-project-
silica-glass-storage-warner-bros-features-details)

~~~
kortex
Glass is pretty inert, full stop. It would depend on the voxel size, but I
imagine as long as you have more than a few hundred atoms per voxel/bit you
will have survivability on the order of millennia, even in high-radiation
environments. Someone would have to do the nuclear cross-section calculations
to get a real bit error rate, but glass is very tough stuff.

------
FpUser
I did not know about LibGen until this post. Too bad for me, living in a cave.
Anyway, this is an amazing project. Best of luck to them and similar efforts.

------
6510
Imagine this:

\- A tiny well behaved client that starts with the OS.

\- It downloads rare bits of the archive at 1 kB/s, obtaining 1 GB every 278
hours. It should stop at somewhere between 100 MB and 5 GB.

\- It periodically announces what chunks/documents it has.

\- It seeds those chunks at 1 kB/s

\- Chunks/documents that have thousands of seeds already are not announced.
Eventually those are pruned.

This escalates the situation to the point where everyone can help without it
costing anything.

If someone is trying to obtain a 20 MB PDF it would take 5 and a half hours
using a single 1 kB/s seed. With just 50 seeds it's roughly 7 minutes.
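
Back-of-the-envelope arithmetic behind those figures (decimal units; the exact minutes shift a bit depending on unit conventions):

```python
# Rough check of the rates above, at 1 kB/s per participant.
RATE = 1_000                          # bytes per second

print(1_000_000_000 / RATE / 3600)    # ~278 hours to fetch 1 GB
print(20_000_000 / RATE / 3600)       # ~5.6 hours for a 20 MB PDF from one seed
print(20_000_000 / (50 * RATE) / 60)  # ~6.7 minutes with 50 such seeds
```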

------
milofeynman
I'd like to dedicate 1TB of my FreeNAS to something like this. It would be nice
to run a small container with some P2P service that contained that chunk.

------
skjoldr
Can't Tahoe-LAFS help with this kind of a challenge? I don't have experience
with it, but it looks stable.

------
burtonator
I've thought that we could potentially build an end-to-end encrypted datastore
within Polar and possibly add IPFS support to help with this issue.

Here's a blog post about our datastores for some background.

[https://getpolarized.io/2019/03/22/portable-datastores-
and-p...](https://getpolarized.io/2019/03/22/portable-datastores-and-platform-
independence.html)

... essentially Polar is a PDF manager and knowledge repository for academics,
scientists, intellectuals, etc.

One secondary challenge we have is allowing for sharing of research but I'd
like to do it in a secure and distributed manner.

Some of our users are concerned about their eBooks being stored unencrypted
and while for the majority of our users this will never be a problem I can see
this being an issue in countries with political regimes that are hostile to
open research.

In the US we have an issue of researchers being harassed over climate change
btw. Having a way to encrypt your knowledge repository (ebooks) would help
academic freedom as your employer or government couldn't force you to give
them your repository.

But what if we went beyond this and provided a way to ADD documents to the
repository from a site like LibGen?

Then we'd have the ability to easily, with one click, encrypt the document
(end to end) and add it to our repository.

If we can add support for Polar to allow colleagues to share directly, this
would be a virtual mirror of LibGen.

Alice could add books b1, b2, b3 to her repo and then share them with Bob;
only he would be able to see b1, b2, b3, because the two of them would generate
a shared symmetric key to exchange the books.
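
A minimal sketch of that Alice/Bob exchange, using PyNaCl's public-key boxes as an illustration (this is not Polar's actual datastore code; it just shows the shared-key idea):

```python
# Illustration only: Alice and Bob derive a shared key from their keypairs,
# so whatever sits on a server is ciphertext that only the two of them can open.
from nacl.public import PrivateKey, Box

alice_key = PrivateKey.generate()
bob_key = PrivateKey.generate()

# Each Box derives the same shared symmetric key from (my private, their public).
alice_box = Box(alice_key, bob_key.public_key)
bob_box = Box(bob_key, alice_key.public_key)

ciphertext = alice_box.encrypt(b"contents of b1.pdf")   # what any third party sees
assert bob_box.decrypt(ciphertext) == b"contents of b1.pdf"
```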

No 3rd party (including me) would have any knowledge what's going on.

I'm going to assume our users are not going to do anything nefarious or pirate
any books. I'm also certain that they're conforming to the necessary laws ...

The challenge though is that while we'd be able to have a mirror of LibGen and
more material, it would be a probabilistic mirror - I'm sure we'd have like
60% of it but the obscure material wouldn't be mirrored.

Right now our datastores support just local disk, and Firebase (which is
Google Cloud basically). While we would encrypt the data end to end in Google
Cloud I can totally understand why users might not like to use that platform.

One major issue is China where it's blocked.

Something like IPFS could go a long way to solving this but it's still very
new and I haven't hacked on it much.

------
mutant
I'd say IPFS, but that's a pretty big commitment from an entire community to
keep alive.

------
boksiora
It's best to split it into small torrents of a few GB (1-2 GB each) so normal
users can seed.

------
asdernr
If only some of the money made would reach the scientists lel. Most of them
will send you their paper by email if you ask them. The majority does not want
their papers to sit behind paywalls...

------
mister_hn
One could use FAANG data centers to host them for free, it would be really
great

~~~
woofcat
Look at the Google Books project. That got shut down hard due to copyright
issues and litigation, after they invested a ton of money in digitizing some of
the most valuable library collections in the world.

~~~
lovecg
It’s incredibly sad:
[https://www.theatlantic.com/technology/archive/2017/04/the-t...](https://www.theatlantic.com/technology/archive/2017/04/the-
tragedy-of-google-books/523320/)

~~~
lioeters
> Somewhere at Google there is a database containing 25 million books and
> nobody is allowed to read them.

Indeed, what an intellectual tragedy..

> In August 2010, Google put out a blog post announcing that there were
> 129,864,880 books in the world. The company said they were going to scan
> them all.

That seems like a surprisingly "small" number.

Well, in trying to picture a physical library with 130 million books, maybe
that's a realistic estimate. But compared to, say, the recently discovered
data hoard of more than 2 billion online identities, it's minuscule.

SciHub and LibGen are truly the modern-day Library of Alexandria. The fact
that they're being called "Pirate Bays of Science" - and that providing free
and open access to all books in the world is illegal - just goes to show that
our civilization's priorities are misdirected.

~~~
dredmorbius
Until fairly recently (historically), books were overwhelmingly scarce. A few
datapoints:

\- The _total_ number of books -- not titles, but actual bound volumes -- in
Europe as of 1500 CE, was about 50,000. By 1800, the total was just under one
billion.

\- The library of the University of Paris circa 1000 CE comprised about 2,000
volumes. It was among the largest in Europe.

\- The Library of Constantinople in the 5th century had 120,000 volumes, the
largest in Europe at the time.

\- A fair-sized city public library today has on the order of 300,000 volumes.
A large university library generally holds a million or so. The Harvard Library
contains 20 million volumes. The University of California collection, across
all ten campuses, totals more than 34 million volumes.

\- The total surviving corpus of Greek literature is a few hundred titles. I
believe many of those were only preserved through Arabic scholars, some
possibly in Arabic translation, not the original Greek.

\- There's an online collection of cuneiform tablets. These generally
correspond to a written page (or less) of text, with the largest collections
numbering in the tens of thousands of items.

\- As of about 1800, the library of the British Museum (now the British
Library) had 50,000 volumes. Again, among the largest of its time.

\- From roughly 1950 to 2000, about 300,000 _titles_ were published annually
in the United States and/or English-language editions. R.R. Bowker issues
ISBNs and tracks this. From ~2005 onward, "nontraditional" books (self- /
vanity-published) have been at or above 1 million annually.

\- The US Library of Congress, the largest contemporary library in the world,
holds 24 million books in its main collection (another 16 million in large
type), and has 126 million catalogued items in total (2015).

\- At about 5 MB per book, in PDF form, total storage for the 38 million
volumes of the Library of Congress would be slightly under 200 TB. At about
$50/TB, that's $10,000 of raw disk storage. (Actual provisioning costs would
be higher.) Costs are falling at 15%/year. (See the quick arithmetic sketch
after this list.)

\- Total _data_ in the world comprises far more than books, and has been
doubling about every 2 years. Or stated inversely: half of all the recorded
information of humankind was created in the past two years.
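
The quick arithmetic behind the Library of Congress storage estimate above (my own back-of-the-envelope, using the figures from that list item):

```python
# Back-of-the-envelope for the Library of Congress figure above (decimal units).
volumes = 38_000_000
mb_per_book = 5

tb = volumes * mb_per_book / 1_000_000   # MB -> TB
print(tb)        # 190 TB, i.e. "slightly under 200 TB"
print(tb * 50)   # ~$9,500 of raw disk at $50/TB
```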

Sources:

Some of this is off the top of my head, but partial support for the facts
from:

[https://en.wikipedia.org/wiki/History_of_printing#/media/Fil...](https://en.wikipedia.org/wiki/History_of_printing#/media/File%3AEuropean_Output_of_Printed_Books_ca._1450–1800.png)

[https://en.wikipedia.org/wiki/History_of_libraries](https://en.wikipedia.org/wiki/History_of_libraries)

[http://www.bowker.com/tools-resources/Bowker-
Data.html](http://www.bowker.com/tools-resources/Bowker-Data.html)

[https://www.loc.gov/item/prn-16-023/the-library-of-
congress-...](https://www.loc.gov/item/prn-16-023/the-library-of-congress-by-
the-numbers-in-2015/2016-02-01/)

[https://en.wikipedia.org/wiki/Harvard_Library](https://en.wikipedia.org/wiki/Harvard_Library)

[https://en.wikipedia.org/wiki/University_of_California_Libra...](https://en.wikipedia.org/wiki/University_of_California_Libraries)

[https://www.techpowerup.com/249972/ssds-are-cheaper-than-
eve...](https://www.techpowerup.com/249972/ssds-are-cheaper-than-ever-hit-the-
magic-10-cents-per-gigabyte-threshold)

[https://qz.com/472292/data-is-expected-to-double-every-
two-y...](https://qz.com/472292/data-is-expected-to-double-every-two-years-
for-the-next-decade/)

~~~
lioeters
Thank you for that, very interesting and educational. I love how you led up to
the punchline. It made me see that books as a technology and artifact are part
of the "history of information", and how books are becoming subsumed in a
shared trajectory with media/data in general.

> half of all the recorded information of humankind was created in the past
> two years

That is shocking to imagine, and it's exponentially growing.

It reminds me of Vannevar Bush's "As We May Think", pointing out the emerging
information overload in society. It certainly puts things in perspective, how
we (humanity) have been making a conscious, collaborative effort to develop
globally networked computers, one of whose important functions is to help us
organize all the information, including books.

The conundrum, it seems, is that technology is also a massive
multiplier/amplifier of the amount of data, such that its capacity to help us
organize may never catch up with what it's helping to produce.

> total storage for the 38 million volumes of the Library of Congress would be
> slightly under 200 TB

I guess it's redundant to say, but I'm sure in the near future that would fit
on a thumb drive!

~~~
dredmorbius
Bush's essay is of course a classic. There are some precursors -- there's a
BBC interview of H.G. Wells describing something similar from the 1940s.[1]
E.M. Forster's _The Machine Stops_ has some similar ideas. And various
encyclopaedists very much embodied similar ideals.

I've been listening to Peter Adamson's "History of Philsophy Without Any Gaps"
podcast, which is _excellent_ , and spends a fair bit of time looking at the
historiography of the topic -- what works were preserved, how, various
interpretations, practices, preservation, and losses. Interesting to note that
most of the preserved Greek and Roman works were found in _obscure_ Arabian
monastaries and libraries. The mainstream collections themselves were often
lost in raids, fires, or other mishaps. Which makes the LibGen situation all
the more relevant and urgent.

(I'm a huge user of the site and others like it, for what it's worth.)

On the amount of total data being captured: there's a huge difference between
_quantity_ and _quality_ measures of information. They're almost certainly
inversely related.

Of what books were written in antiquity, up to the time of the printing press,
say, odds were fairly strong that a work would be read.

At 1 million new titles being published per year, there are only 330 people in
the US per book, or roughly 400 native English speakers worldwide. (With ~2
billion speakers worldwide, the total audience _might_ reach 2,000 per book).
Clearly, most of what's being written will have a very small, or no, audience.

For machine-captured data, the likelihood that _any_ of it is seen directly by
a human is vanishingly small. More of it will undergo some level of machine
processing or interpretation, though even that only applies to a fairly small
fraction of data. Insert old joke about the WORN drive: write once, read
never.

As for storage costs (and/or size), at a 15% cost reduction per year, storage
prices halve roughly every 4.3 years (about 4 years and 3 months), which means
that in 10 years the $10k price tag becomes about $2k, and in 20 years it
should be under $400. For the entire Library of Congress collection.
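
The compounding, spelled out (same 15%/year assumption as above):

```python
# Compounding a 15%/year price decline.
import math

annual = 0.85
print(math.log(0.5) / math.log(annual))   # ~4.3 years per halving
print(10_000 * annual ** 10)              # ~$1,970 after 10 years
print(10_000 * annual ** 20)              # ~$390 after 20 years
```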

Flash drives seem to be increasing in capacity by a factor of 10 every 2.5
years. There are now 2 TB flash drives, so 200 TB might be as little as 5
years out. That ... still sounds optimistic to me.

[https://m.eet.com/media/1171702/digital_storage_in_consumer_...](https://m.eet.com/media/1171702/digital_storage_in_consumer_electronics_fig4.12.jpg)

[https://www.digitaltrends.com/computing/largest-flash-
drives...](https://www.digitaltrends.com/computing/largest-flash-drives/)

The more practical problems are simply organising, cataloguing, and accessing
the archives. This is an area that still needs help.

________________________________

Notes:

1\. I _think_ that's from "Science and the Citizen*, 1943, though the BBC and
I have a disagreement concerning access. [https://www.bbc.co.uk/archive/hg-
wells--science-and-the-citi...](https://www.bbc.co.uk/archive/hg-wells--
science-and-the-citizen/zmwcpg8)

~~~
lioeters
While brushing up on the encyclopaedists, I found this little gem:

"Among some excellent men, there were some weak, average, and absolutely bad
ones. From this mixture in the publication, we find the draft of a schoolboy
next to a masterpiece." — Denis Diderot

Taking the quote out of context (and aside from its historical male-centered
language) - it sure rings true of the current state of the web, as well as
books.

About the inverse relationship of quantity vs quality, we seem to be drowning
in quantity! As you've pointed out, there's great need for thoughtful
organization and curation.

I like how you break down the quantifiable aspects to draw a historical trend
and future projection. The rise of "data science" and "big data" in the past
few decades really makes sense in this light.

I'm sure machine learning and "AI" will play an increasing role in the task of
organizing and processing all this information, but at the bottom I feel that
the most value probably comes from human curation.

LibGen has been an amazing resource for me as a lover of knowledge, a life-
long book worm. I've got bookshelves and boxes full of physical books as well,
but it's a drop in the ocean..

~~~
dredmorbius
I love the Diderot quote. I'd also encountered earlier:

"As long as the centuries continue to unfold, the number of books will grow
continually, and one can predict that a time will come when it will be almost
as difficult to learn anything from books as from the direct study of the
whole universe. It will be almost as convenient to search for some bit of
truth concealed in nature as it will be to find it hidden away in an immense
multitude of bound volumes. When that time comes, a project, until then
neglected because the need for it was not felt, will have to be
undertaken...."

... and on for another several paragraphs. It's an extraordinarily keen
observation on the state and future of knowledge. At the always excellent
History of Information website:

[http://www.historyofinformation.com/detail.php?entryid=2877](http://www.historyofinformation.com/detail.php?entryid=2877)

(Diderot is on my list of authors to explore in more depth.)

The fact that the _quality_ of any given information or exchange is often
(though not always) entirely divorced from its _source_ (or author) is another
interesting note. There are a few points here worth expanding on.

At least probabilistically, there are spaces (real or virtual) in which it's
more likely to encounter good ideas. HN, for its various failings, does well in
today's Net. Google+, for all its faults, was similarly useful.

Size matters far less than selection. The tendency for centres of learning,
research, and/or inquiry (and not necessarily in that order) to emerge is one
that's been long observed, and their durability remarkable. The first
universities (Bologna, Padua, Oxford, Paris, Cambridge, Heidelberg, and
others, see:
[https://en.wikipedia.org/wiki/Medieval_university](https://en.wikipedia.org/wiki/Medieval_university))
are often _still_, 600-700 years later, _among the best in the world_.
Certainly in the US, Harvard, Yale, Princeton, M.I.T., among the earliest
founded, remain the most prestigious. Though as noted in the conversation with
Tyler Cowen and Patrick Collison, the list from 1920 is "completely the same,
except we’ve added on California".

[https://conversationswithtyler.com/episodes/mark-
zuckerberg-...](https://conversationswithtyler.com/episodes/mark-zuckerberg-
interviews-patrick-collison-and-tyler-cowen/)

What happens as the overall quantity _and flux_ of information increases is
that _more effective rejection systems are required_. That is: you've got _too
much_ information flowing in, and you want a way to _cheaply, with minimal
effort or consequential residual load_, reject information that _may_ be
irrelevant, with minimal bias.

There are numerous systems that have been arrived at, and many of our
cognitive biases or informal tests for truth arise out of these (optimism,
pessimism, availability, sunk-cost, tradition, popularity, socio-ethnic
prejudice, etc.). Randomised methods are probably far fairer and less prone to
category error. Michael Schulson's sortition essay in _Aeon_ remains among the
best articles I've read in the past decade, if not several:

"If You Can't Choose Wisely, Choose Randomly"

[https://aeon.co/essays/if-you-can-t-choose-wisely-choose-
ran...](https://aeon.co/essays/if-you-can-t-choose-wisely-choose-randomly)

Another fundamental problem is self-dealing and self-selection within
institutions. Much of the failure within academia (also touched on by Cowen
and Collison, who, I'll note, I don't _generally_ agree with, though they are
touching on and making many points I've been pursuing for some years) comes
from the fact that it's _internal_ selection of students, faculty, articles,
topics, and ideologies, rather than strict tests of real-world validity, that
promotes these structures.

The same problems infect government and business -- it's not as if any one
social domain is immune to this.

Oh, and another lecture by H.G. Wells on that topic:

"...When I go to see my government in Westminster I find presiding over it the
Speaker in a wig and a costume of the time of Dean Swift, the procedure is in
its essence very much the same. The Members debate bring motions and when they
divide the art of counting still in governing bodies being in its infancy they
crowd into lobbies and are counted just as a drover would have counted his
sheep two thousand years ago...."

[https://invidio.us/watch?v=qRgP-46AC_o](https://invidio.us/watch?v=qRgP-46AC_o)

(Audio quality is exceptionally poor, 1931 recording.)

Partial transcript: [http://www.aparchive.com/metadata/INTERVIEW-WITH-H-G-
WELLS-S...](http://www.aparchive.com/metadata/INTERVIEW-WITH-H-G-WELLS-
SOUND/02408f42a2314a78a46c0e7b847f4107)

AI ... _may_ be useful, but seems to be result-without-explanation, a possible
new form of knowledge, to go with revelation (pervasive if not particularly
accurate), technical (means), and scientific (causes / structural).

Wholehearted agreement on LibGen.

Very enjoyable conversation BTW, thank you.

~~~
hos234
Nature shows us how to process information at ever increasing noise and scale
- [https://www.edge.org/response-detail/10464](https://www.edge.org/response-
detail/10464)

~~~
dredmorbius
Yes and no.

Briefly: the article distinguishes "endocrinal" vs. "distributed"
decisionmaking.

This applies at some levels, but not at others.

For individual humans, we don't have the option of rewiring our
consciousnesses, which are rather pathetically single-threaded, and can at
best multitask poorly by task-switching, at a very great loss of task
proficiency.

Even within collective organisations (companies, governments, organisations,
communities), the multiple-independent-actors model works _where those actors'
actions are autonomous and independent of others_. Or, in the alternative,
where they work _without mutual conflict_ toward a common goal.

But you get problems where either individual actors' motivations and actions
are in conflict, or in which a _single_ global decision must be made (as with
various global catastrophic risks), and multiple independent decisions cannot
be arrived at. Even for noncritical arbitrary decisions, such as which side of
the road to drive on, in which _there is no compelling argument to be made for
one side or the other_ , but in which _both_ sides cannot be simultaneously
selected, you need some global decisionmaking capacity.

When you reach the point of _either_ an existing decisionmaking system (as in:
a single human, with the finite and largely immutable information acquisition
and processing capabilities corresponding), _or_ a multi-agent system _which
must reach a common decision_, you've got the challenge of limiting data
intake to that amount which allows effective function within the environment,
and avoids overloading capabilities or ineffective action.

~~~
lioeters
The article "Evolving the Global Brain" was thought-provoking, especially in
the context of our discussion about the history of information and the
exponentially increasing amount of information for humanity to gather/produce,
process, curate, archive.

It's an attractive concept, that human society is structurally similar to a
brain, and that an individual is a neuron. (If humanity is the brain, I
suppose the rest of the Earth is the body. We're not doing too well as the
self-appointed brain of the operation.)

My first reaction to the analogy of "endocrinal" (one-to-many) and "neural"
(many-to-many) decision making, is that it's missing a primal
psychological/biological motivation of humans to seek to dominate others of
its own kind as well as all of nature. I'm not familiar enough with biology to
say definitively, but I'm pretty sure the endocrinal system does not actively
seek to subjugate the neural system (or vice versa) and dominate the whole
body.

Social organization, it seems to me, is more a function of power, very small
groups gaining advantage and dominance over vastly larger groups of people,
than that of collaboration for mutual benefit. (I might be a bit too cynical
of political motivations and authentic democracy these days.)

From the final paragraph:

> ..the current global brain is only tenuously linked to the organs of
> international power. Political, economic and military power remains
> insulated from the global brain, and powerful individuals can be expected to
> cling tightly to the endocrine model of control and information exchange.

I'd disagree with this, and say that the global brain (if we mean the Internet
and its empowerment of globally networked intelligence) was born from the
wombs of "political, economic and military power". It never achieved escape
velocity to become a truly free, autonomous and collaborative, neural model of
decision making.

To backtrack a bit:

> Well-connected collective entities like Google and Wikipedia will play the
> role of brainstem nuclei to which all other information nexuses must adapt.

The most powerfully well-connected collective entities are international
political/financial/corporate entities, and indeed they do more or less
dictate how all information nexuses (nexii?) must adapt.

One biological analogy that comes to mind, is how propaganda and
"disinformation" act like neurotoxins in the social brain, introducing
noise/entropy, skewing its coherence, and preventing well-informed and
orchestrated cooperation.

Another is how established political powers have a well-developed "immune
system", composed of mass media, legal structures, military/police force,
surveillance of the public. This immune system could be seen at work, for
example, at the environmental protests at the Standing Rock Indian
Reservation.

The final sentence of the article:

> This formidable design task is left up to us.

By this I assume the author means, evolving the global brain. Quite a
challenge! From my perspective, it's going to be a historic struggle: design
or be designed.

------
whydoyoucare
Isn't scanning a physical book and uploading a soft copy a landmine of
hazards (both legal and moral)? Essentially you are encouraging (some)
unlawful activity... I am not so sure I am on board with this idea!

~~~
guidoism
It’s easy to take this stance in a rich country. But what about the people in
countries where one of these books cost the equivalent of a year’s wages. Not
so black and white eh?

~~~
whydoyoucare
As far as I know, prices of books differ between rich and developing nations.
For example, The C Programming Language, which costs $50 in the US [1], is sold
for Rs. 259 (~$4 US) in India [2]. I believe that is the case with most "economy
editions" specifically targeted at developing nations. It certainly isn't a
"year's wages".

While I do understand your point, it still does not justify encouraging
modern-day Robin Hoods and breaking the law.

[1] [https://www.amazon.com/Programming-Language-2nd-Brian-
Kernig...](https://www.amazon.com/Programming-Language-2nd-Brian-
Kernighan/dp/0131103628) [2] [https://www.amazon.in/Programming-Language-
Kernighan-Dennis-...](https://www.amazon.in/Programming-Language-Kernighan-
Dennis-Ritchie/dp/9332549443/)

~~~
esotericn
Hi.

You're speaking to someone from the UK. When I grew up we had to walk uphill
both ways, to quote the Four Yorkshiremen sketch.

I couldn't afford to buy things like books, operating systems, games, etc. We
spent the money on food, rent, and on the Internet.

People like that past version of me are going to pirate regardless of what you
say. Law breaking? You can't see me grinning ear to ear.

I like to think my tax bill makes up for that.

