
Designing the Wayback Machine Loading Animation - foob
https://intoli.com/blog/designing-the-wayback-machine-loading-animation/
======
johansch
Seems like the real problem that should be addressed is the fact that each
request typically takes like 3-5 seconds to complete.

And it feels like it's been like this pretty much since the Wayback Machine
was launched. Back then I expected that annoyance to be fixed over time. Now,
almost 16 years later, I'm less hopeful. :/

~~~
blauditore
Well, they're storing vast amounts of data. As far as I know, most
applications dealing with those kinds of issues use large caches, such that
only a fraction of requests actually touch underlying databases. This is
fairly effective for highly overlapping requests, such as Facebook posts that
are read by many users.

However, I suppose the Wayback Machine receives many unique requests that
would hit the database anyway, so caching may not be super effective.

I'm not an expert on big-data systems though.
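
To make the caching argument concrete, here is a minimal sketch of an LRU cache
hit-rate comparison. The workloads and the `hit_rate` helper are made up for
illustration and say nothing about real Wayback Machine traffic; they only show
why a cache helps when requests overlap and does nothing when every request is
unique.

```python
from collections import OrderedDict

def hit_rate(requests, capacity):
    """Replay requests through an LRU cache and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True              # miss: would hit the database
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(requests)

# Hypothetical workloads: 10 popular items requested over and over,
# versus 1000 requests that each ask for something different.
overlapping = [i % 10 for i in range(1000)]
unique = list(range(1000))

print(hit_rate(overlapping, capacity=100))
print(hit_rate(unique, capacity=100))
```

With the overlapping workload only the first request for each item misses; with
the unique workload the cache never gets a chance to help.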

~~~
bnewbold
The Wayback backend is much simpler than one might imagine. The index lookup
("where is the data for this URL at this timestamp?") is called the CDX API
and is pretty fast, given that it's basically looking up a line in a sorted
many-terabyte text file. Slower are the processes that extract the original
HTTP response from what is effectively a giant multi-GB tarball sitting on an
archival-grade (not IOPS-optimized) spinning disk, and the code that parses
HTML and/or JavaScript and re-writes "embed" URLs for playback. Another speed
limit is that we serve everything from our datacenters in the Bay Area with no
CDN for most content, which you'll really notice when connecting from outside
North America.
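
As a rough illustration of why looking up a line in a sorted text file is fast,
here is a sketch of a seek-based binary search. The five-column line layout
(`urlkey timestamp offset length filename`) and the function names are
simplified assumptions for this example, not the actual CDX format or the
Wayback code.

```python
import io

def line_at_or_after(f, pos):
    """Return the first full line that starts at byte offset >= pos."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # consume the rest of the line containing byte pos-1
    return f.readline()

def lookup(f, size, urlkey, timestamp):
    """Binary-search a sorted index file for an exact (urlkey, timestamp)."""
    key = f"{urlkey} {timestamp} ".encode()
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        line = line_at_or_after(f, mid)
        if not line or line >= key:
            hi = mid       # first match starts at or before this point
        else:
            lo = mid + 1   # everything up to here sorts before the key
    line = line_at_or_after(f, lo)
    if not line.startswith(key):
        return None
    _, _, offset, length, filename = line.decode().split()
    return int(offset), int(length), filename

# Tiny stand-in for the many-terabyte index (hypothetical entries).
DATA = (
    b"com,example)/ 20100101000000 0 512 a.warc.gz\n"
    b"com,example)/ 20150601120000 512 300 a.warc.gz\n"
    b"org,archive)/ 20010101000000 812 1024 b.warc.gz\n"
)
INDEX = io.BytesIO(DATA)

print(lookup(INDEX, len(DATA), "com,example)/", "20150601120000"))
```

Each step halves the search range, so even a file with billions of lines needs
only a few dozen seeks.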

The priority is to have as much data accessible as possible, and for the
same cost we can crawl and store far more web content on slow spinning disks
than with SSDs or large RAM caches. Most RAM on storage nodes, which would
otherwise be used as disk cache, gets used for derive tasks (like OCR or video
conversion) or crawling/crunching tasks. That being said, if the service is so
slow as to be unusable then there's no point operating it in the first place;
hopefully we can get the latency a bit lower.

~~~
johansch
So the primary reason for the multiple second response times is that many
sequential megabytes have to be read from an HD in order to return an often
quite tiny HTTP response for a particular {url, timestamp}? That would
explain the nearly constant (slow) response time.

(I was puzzled why it didn't seem to vary much depending on the time of day,
etc.)

~~~
bnewbold
Well, I oversimplified a bit. The "tarball" (WARC or ARC file) is a single
large file with individually compressed HTTP response objects concatenated
together. The index stores the byte offset and length (compressed) into this
file, so an HTTP/1.1 range request can pull only the data needed. So in theory
it's efficient. I don't work on that team directly and haven't looked into
exactly what the latency bottleneck is. Sorry for the misleading
response.
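
A small sketch of the layout described above, using plain gzip members in
memory rather than a real WARC file. The helper names are hypothetical; the
point is only that the compressed offset and length from the index are enough
to pull out and decompress one record without reading the rest of the file.

```python
import gzip

def build_archive(responses):
    """Concatenate individually gzip-compressed records.

    Returns the combined bytes plus an index of (offset, compressed_length).
    """
    blob = bytearray()
    index = []
    for body in responses:
        member = gzip.compress(body)
        index.append((len(blob), len(member)))
        blob += member
    return bytes(blob), index

def range_header(offset, length):
    """HTTP Range header value covering exactly one record (end is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"

def read_record(blob, offset, length):
    """Decompress a single record given its compressed offset and length."""
    return gzip.decompress(blob[offset:offset + length])

blob, index = build_archive([
    b"HTTP/1.1 200 OK\r\n\r\nfirst page",
    b"HTTP/1.1 200 OK\r\n\r\nsecond page",
    b"HTTP/1.1 404 Not Found\r\n\r\n",
])

offset, length = index[1]
print(range_header(offset, length))
print(read_record(blob, offset, length))
```

In a real deployment the slice would come from a range request against the
storage node instead of an in-memory `bytes` object.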

~~~
johansch
Ok. Well, I do hope you work on improving the response times. I think it would
do wonders for the popularity of the Wayback Machine (and in the end,
financial contributions).

------
bflesch
I'm always amazed how people come up with artsy ways of using SVG. IMO this is
akin to pixel art where tinkerers need to spend a lot of time to become an
expert in creating these things.

Really appreciate the writeup.

------
pimlottc
So, is the first gif the final animation? It isn't clear.

~~~
foob
I don't think that they've decided on the final animation yet. Once they have,
I'll update the post to default to the final animation settings and mention
that more explicitly.

------
greenpizza13
Cool demo... completely hijacks the browser history.

~~~
foob
Thanks for letting me know! That was accidental and it should be fixed now.

