
Wayback Machine director outlines the scale of everyone's favorite archive - jor-el
https://arstechnica.com/gaming/2018/10/the-internets-keepers-some-call-us-hoarders-i-like-to-say-were-archivists/
======
AdmiralAsshat
Wonder what the operating budget of the organization is.

I donate $50 a year, because it's pretty much the only site I can check every
month or so and reliably expect that it will be better than it was the last
time I visited.

~~~
gojomo
Since it's a charitable non-profit, a lot of its financial info is available
via the public-record "Form 990" filed with the IRS. See, for example:

[https://projects.propublica.org/nonprofits/organizations/943...](https://projects.propublica.org/nonprofits/organizations/943242767)

The form for tax year 2016, the most recent available, shows that the Internet
Archive's expenditures that year were about $16.4 million.

------
ScottBurson
Odd this story doesn't mention Brewster Kahle, who AFAIK came up with the idea
of archiving the Web. I heard people joke about it in its early years, too.
I'm glad he took it seriously.

~~~
gkoberger
He may not be in the article, but he did keynote the Internet Archive’s Annual
Bash last week! So he's not forgotten.

------
Sargos
The Internet Archive is definitely up there with Wikipedia as one of mankind's
greatest treasures. It really should be globally funded and endowed for the
foreseeable future.

------
homero
How do they afford all those disks?

~~~
mekarpeles
- foundations + government grants
[https://archive.org/post/2776/who-funds-this](https://archive.org/post/2776/who-funds-this)

- donations (in-cash / in-kind)
[https://archive.org/about/credits.php](https://archive.org/about/credits.php)

- digitization services
[https://archive.org/scanning](https://archive.org/scanning)

- web archiving services [https://archive-it.org](https://archive-it.org)

src: I work @ Internet Archive on
[https://openlibrary.org](https://openlibrary.org)

~~~
PM_ME_YOUR_CAT
This is pretty great, actually. I notice ads aren't among your points: does
the Wayback Machine avoid them entirely, or are they just a negligible amount?

~~~
aaroninsf
No ads full stop!

------
codeulike
TLDR: 22 petabytes, stored redundantly as 44 petabytes, and a warehouse of
physical media (books, vinyl records) that grows by a 'shipping container'
every two weeks.

edit: Also, this article mentions a bunch of things I didn't know about (e.g.
a playable '80s video game archive), so it's worth reading.

~~~
exikyut
Actually about 100 PB of raw capacity once redundancy is counted, I was told
once.

Unsure whether this is on RAID or ZFS.

~~~
ddorian43
I was expecting some kind of Reed-Solomon encoding.

~~~
patrickg_zill
ZFS does have Reed-Solomon in its double- and triple-parity modes of
operation.

~~~
romed
I guess I would expect something more like massively distributed erasure
encodings rather than ZFS directly. For example, something like Dropbox's
Magic Pocket but taken to extremes with, say, 100 data stripes and 20 parity
stripes, all on different machines.

~~~
klodolph
One of the known problems with these large encodings is that the cost of
reconstructing lost data increases as the size of the data increases. If you
use Reed-Solomon (100,20), then you only have 20% overhead and have a
vanishingly small probability of losing data, but if you lose a single 10TB
disk, you need to do 1PB of I/O to rebuild it! Even forgetting I/O for a
moment, you might be churning through a bunch of CPU time just to rebuild a
single block of data.

Of course, you don't need to rebuild immediately. You're effectively working
with (100,19), and you can put off the reconstruction as long as you like:
maybe you don't reconstruct until someone wants to read the data or until
enough other disks fail, and you can prioritize the I/O as low as you like.
But in practice, super large encodings become more and more expensive as the
size increases.
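
A back-of-the-envelope sketch of that tradeoff (Python; the 10TB disk and the
group shapes are illustrative, not anyone's actual layout):

    # Storage overhead vs. rebuild I/O for an RS(k, m) group, assuming one
    # lost 10TB disk is rebuilt by reading k surviving shards.
    DISK_TB = 10

    for k, m in [(10, 4), (50, 10), (100, 20)]:
        overhead = m / k                 # extra space spent on parity
        rebuild_read_tb = k * DISK_TB    # reads needed to reconstruct one disk
        print(f"RS({k},{m}): {overhead:.0%} overhead, rebuild reads {rebuild_read_tb} TB")

    # RS(10,4):   40% overhead, rebuild reads 100 TB
    # RS(50,10):  20% overhead, rebuild reads 500 TB
    # RS(100,20): 20% overhead, rebuild reads 1000 TB (the 1PB above)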

~~~
Dylan16807
Let's see. At 1-3% annual failure rate, we expect to need to rebuild a couple
drives in each array per year. To make the math simpler, let's have each
server send and receive 8.5TB and do 1/120 of the parity math.

Since we have plenty of redundancy, let's keep things low-priority up to 3
drive failures, and try to rebuild each drive in 90 days. For bonus points, if
two drives are rebuilding at once the increase in bandwidth is negligible.

8.5TB in 90 days is less than 10mbps. That means we could build servers with
50 drives and a single gigabit connection and if they were rebuilding every
array at the same time it wouldn't even use half that bandwidth.

Real servers are going to have vastly faster connections and probably a lot
fewer drives, so honestly that petabyte of I/O is not a big deal in context.
In practice you could replace failed drives in a day, and the limiting factor
is the speed of a single drive, not the network.
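
A quick sanity check on that arithmetic (a Python sketch using the same
assumed numbers: 8.5TB per server over 90 days, 50 drives behind a single
gigabit link):

    # Per-server rebuild traffic under the assumptions above.
    TB = 1e12  # bytes
    seconds = 90 * 86400

    mbps = 8.5 * TB * 8 / seconds / 1e6
    print(f"{mbps:.1f} Mbit/s per rebuilding drive")   # ~8.7, under 10mbps

    # 50 drives per server, all rebuilding at once, vs. a 1 Gbit/s link:
    print(f"{50 * mbps / 1000:.2f} Gbit/s")            # ~0.44, under half the link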

~~~
klodolph
> ...8.5TB in 90 days is less than 10mbps.

The arithmetic is correct but that's not the correct value. The I/O necessary
for reconstructing one 8.5TB drive in a (100,20) Reed-Solomon group is 850 TB,
which comes out to 875 Mbit/s, averaged over 90 days. It's not uncommon to see
data centers with 10 Gbit/s connections, and sure, maybe it's 10 Gbit/s per
link on a Clos fabric, but it's hard to claim that this bandwidth usage is
trivial.

The point is not that the rebuild is prohibitively expensive or impossible,
the point is that as the group size increases, the cost of data reconstruction
increases and the cost of the encoding overhead decreases. At some point the
cost savings from reduced encoding overhead are smaller than the additional
I/O costs incurred by reconstruction. So the ideal encoding size is not as
large as possible, but some medium size which balances the cost of the
encoding overhead with the cost of reconstruction.

And consider that if any of this data is being served, you incur the 100x I/O
penalty immediately.

> Real servers are going to have vastly faster connections and probably a lot
> fewer drives, so honestly that petabyte of I/O is not a big deal in context.
> In practice you could replace failed drives in a day, and the limiting
> factor is the speed of a single drive, not the network.

The Internet Archive has 24 disks per machine, I believe.

[https://en.wikipedia.org/wiki/PetaBox](https://en.wikipedia.org/wiki/PetaBox)

> In practice you could replace failed drives in a day, and the limiting
> factor is the speed of a single drive, not the network.

To rebuild an 8.5TB drive in 1 day requires 78 Gbit/s of bandwidth. Even
inside a data center, oof.
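
For concreteness, the aggregate version of that arithmetic (Python sketch,
same assumed drive size):

    # All sibling reads funneled toward the one server holding the new disk:
    # 100 surviving shards x 8.5TB = 850TB.
    TB = 1e12  # bytes
    bits = 100 * 8.5 * TB * 8

    print(f"{bits / (90 * 86400) / 1e9:.3f} Gbit/s averaged over 90 days")  # ~0.875
    print(f"{bits / 86400 / 1e9:.1f} Gbit/s to finish in 1 day")            # ~78.7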

~~~
Dylan16807
> The I/O necessary for reconstructing one 8.5TB drive in a (100,20) Reed-
> Solomon group is 850 TB, which comes out to 875 Mbit/s, averaged over 90
> days.

You missed the part where I'm dividing the work over all the servers.

We're not having 119 disks read 8.5TB each and sending it all to the server
with the new disk.

Instead, 119 servers are each recovering 85GB and then sending _that_ to the
server with the new disk.

Each server sends 8.5TB over the network, and receives 8.5TB over the network.
You only need 10Mbps per server for slow mode.

> And consider that if any of this data is being served, you incur the 100x
> I/O penalty immediately.

1% of blocks take 100x the I/O. That's an average of only 2x. But because of
data locality, you needed to read most of those blocks anyway. You might only
have a few percent penalty.
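
Spelled out, assuming 1% of reads land on the lost disk and each of those
costs 100 sibling reads:

    print(0.99 * 1 + 0.01 * 100)   # 1.99, i.e. roughly 2x on average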

> To rebuild an 8.5TB drive in 1 day requires 78 Gbit/s of bandwidth. Even
> inside a data center, oof.

It requires 78Gbps of _switching bandwidth_. It's easy to get a switch with
_terabits per second_ of total capacity. (less than $100 per port on eBay)
[https://i.dell.com/sites/csdocuments/Shared-Content_data-She...](https://i.dell.com/sites/csdocuments/Shared-Content_data-Sheets_Documents/en/Dell_Networking_S4048T_ON_Spec_Sheet.pdf)

So I assert that even 100ish drives per group is actually quite easy to
handle.

~~~
klodolph
> We're not having 119 disks read 8.5TB each and sending it all to the server
> with the new disk.

> Instead, 119 servers are each recovering 85GB and then sending that to the
> server with the new disk.

Could you explain how this is possible? In other words, how do you arrange the
blocks so that each individual device has enough data to reconstruct 85GB from
another drive without doing network I/O? I'm not aware of a scheme where this
is possible. Can you point to a paper or describe how this scheme works?

Simple arithmetic suggests that it is not possible to rebuild 100 different
85GB chunks from a single 8.5TB disk selected from a larger amount of data
encoded with Reed-Solomon (100,20). That arithmetic suggests you would only be
able to reconstruct a total of 1.7TB from any individual 8.5TB disk, which is
only 20% of what your scheme suggests. It is also unclear to me what would
happen in this scheme if two disks were lost simultaneously, which is nearly
guaranteed to happen at some point if you are playing around with 120 disks.

Simple erasure coding means that, e.g. with (100,20) Reed-Solomon, you spread
each stripe across 120 devices, and therefore need to read from 100
_different_ devices to recover any one lost chunk from a single stripe. There
are some schemes which reduce this, but the gains are modest and there are
tradeoffs involving e.g. encoding time or overhead; see, for example, the
"Hitchhiker" schemes published by researchers at Facebook:

[https://www.cs.cmu.edu/~nihars/publications/Hitchhiker_SIGCO...](https://www.cs.cmu.edu/~nihars/publications/Hitchhiker_SIGCOMM14.pdf)

~~~
Dylan16807
Let's say three drives have failed.

Each server with a working drive reads a contiguous run of 117MB off its hard
drive. It then sends the first MB to server 1, the second MB to server 2, the
third MB to server 3... the last MB to server 117.

Each of those servers receives 117 chunks of a single 1MB stripe.

It then does a parity calculation to figure out the missing 3 1MB chunks, and
sends them to the rebuilding servers.

We have now recovered 117MB of each failed disk. Our network data sent was
120MB per server. Every lost chunk was recovered via sibling chunks from 100+
different devices.

You can then optimize this by having the servers only send 100 chunks of each
stripe across the network.

Once you repeat this operation 85470 times, sending 100+3MB each time, you
have recovered three 10TB disks with a network use of 8.8TB per server. (The
efficiency is slightly worse than the 8.5TB needed for recovering one disk.)

So you're still sending over a petabyte through your switch, but that's what
good switches are built for, handling cross-traffic from every single port at
the same time. A gigabit per server is acceptable, and 10gig approaches
overkill.
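
To make the bookkeeping concrete, here's a small Python tally of that scheme
(it only counts traffic; it is not an actual Reed-Solomon implementation):

    # 117 working servers, 3 failed drives, 1MB chunks, and the optimized
    # variant that ships 100 chunks per stripe across the network.
    stripes_per_round = 117        # one contiguous 117MB run read per server
    failed_disks = 3

    sent_per_server_mb = 100 + failed_disks     # fan-out chunks + rebuilt chunks
    recovered_per_disk_mb = stripes_per_round   # 117MB of each failed disk per round

    rounds = 85470
    print(recovered_per_disk_mb * rounds / 1e6, "TB recovered per failed disk")  # ~10.0
    print(sent_per_server_mb * rounds / 1e6, "TB sent per server")               # ~8.8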

~~~
klodolph
This seems pretty obvious to me once you say it; I guess I didn't understand
what you were trying to say. It seems like that goes both ways, too... what
I'm trying to say is that the amount of reconstruction work increases as the
encoding size increases. You're offering suggestions for how to accomplish
that work but it doesn't change the underlying fact that wider encodings
require more work for reconstruction.

And that really was _all_ I was trying to say... you don't make encodings
arbitrarily wide because the space savings must be weighed against the
reconstruction costs.

~~~
Dylan16807
Not _arbitrarily_ wide, agreed. But it's also true that you can go a lot wider
than a typical raid array without much difficulty.

I guess something to keep in mind is that you need bigger chunks to read
efficiently, and smaller stripes to write efficiently. But in archival storage
you don't care about write speed so the balance changes a lot.

