In Internet scale it's not a lot of data. Most people who think they have big da...

andyxor · on June 9, 2021

The entire archive actually fits in a small desktop NAS (e.g. QNAP or Synology) with a few 14-18TB drives, you don't even need a server rack.

There is existing index in sql format distributed by libgen: https://www.reddit.com/r/scihub/comments/nh5dbu/a_brief_intr..., it is around 30GB uncompressed.

Those 851 torrents uncompressed would probably take half a petabyte of storage, but I guess for serving pdfs you could extract individual files on demand from zip archive and (optionally) cache them. So the scihub "mirror" could run on a workstation or even laptop with 32-64GB memory connected to 100TB NAS over 1GBE, serving pdfs over VPN and using unlimited traffic plan. The whole setup including workstation, NAS and drives would cost $5-7K.

it's not a very difficult project and can be done DIY style, if you exclude the proxy part (which downloads papers using donated credentials). Of course it would still be as risky as running Scihub itself which has $15M lawsuit pending against it.

dredmorbius · on June 9, 2021

The entire Library of Congress books collection is on the order of 40 million items.

At 5 MB per book, this works out to about 200 TB of disk storage.

At about $12/TB, hosting the entire LoC collection would cost roughly $2,400 presently, with prices halving about every three years.

dredmorbius · on June 9, 2021

Note that $2,400 is disks alone. You'd obviously need chassis, powere supplies, and racks. Though that's only 17 12 TB drives.

Factor in redundancy (I'd like to see a triple-redundant storage on any given site, though since sites are redundant across each other, this might be forgoable). Access time and high-demand are likely the big factor, though caching helps tremendously.

My point is that the budget is small and rapidly getting smaller. For one of the largest collections of written human knowledge.

There are some other considerations:

- If original typography and marginalia are significant, full-page scans are necessary. There's some presumption of that built into my 5 MB/book figure. I've yet to find a scanned book of > 200MB (the largest I've seen is a scan of Charles Lyell's geology text, from Archive.org, at north of 100 MB), and there are graphics-heavy documents which can run larger.

- Access bandwidth may be a concern.

- There's a larger set of books ever published, with Google's estimate circa 2014 being about 140 million books.

- There are ~300k "conventionally published" books in English annually, and about 1-2 million "nontraditional" (largely self-published), via Bowker, theh US issuer of ISBNs.

- LoC have data on other media types, and their own complete collection is in the realm of 140 million catalogued items (coinciding with Google's alternate estimate of total books, but unrelated). That includes unpublished manuscripts, maps, audio recordings, video, and other materials. The LoC website has an overview of holdings.

Published document scarcity is entirely imposed.

HWR_14 · on June 9, 2021

It still amazes me that 77TB is considered "small". Isn't that still in the $500-$1,000 range of non-redundant storage? Or if hosted on AWS, isn't that almost $1,900 a month if no one accesses it?

I know it's not Big Data(tm) big data, but it is a lot of data for something that can generate no revenue.

smichel17 · on June 9, 2021

> Isn't that still in the $500-$1,000 range of non-redundant storage?

Sure. Let's add redundancy and bump by an order of magnitude to give some headroom -- $5-10k is a totally reasonable amount to fundraise for this sort of application. If it were legal, I'm sure any number of universities would happily shoulder that cost. It's miniscule compared to what they're paying Elsevier each year.

HWR_14 · on June 9, 2021

Sorry. My point was it was a lot of money precisely because it cannot legally exist. If it could collect donations via a commercial payment processor, it could raise that much money from end users easily. Or grants from institutions. But in this case it seems like it has to be self-funded.

pbhjpbhj · on June 9, 2021

I'm prepared to accept "does generate no revenue" but "can generate no revenue" ...?

Perhaps some sort of MTurk or captcha-like tasks per access? Patr[e]ons? Donation drives? Micro-payments? Something else??

HWR_14 · on June 9, 2021

Oh, it could generate revenue if it was legal. But it is not, so it seems difficult.

dredmorbius · on June 9, 2021

For an institution, it's a rounding error.

AWS is not the cheapest bulk-storage hosting possible.

matthewdgreen · on June 10, 2021

Google already does a pretty good job with search. Sci-Hub really just needs to handle content delivery, instead of kicking you to a scientific publisher's paywall.