Designing the Wayback Machine Loading Animation (intoli.com)
42 points by foob 11 months ago | 13 comments

Seems like the real problem that should be addressed is the fact that each request typically takes like 3-5 seconds to complete.

And it feels like it's been like this pretty much since the Wayback Machine was launched. Back then I expected that annoyance to be fixed over time - now, almost 16 years later, I'm less hopeful. :/

Well, they're storing vast amounts of data. As far as I know, most applications dealing with those kinds of issues use large caches, so that only a fraction of requests actually touch the underlying databases. This is fairly effective for highly overlapping requests, such as Facebook posts that are read by many users.

However, I suppose the Wayback Machine receives many unique requests that would hit the database anyway, so caching may not be super effective.

I'm not an expert on big-data systems though.

The Wayback backend is much simpler than one might imagine. The index lookup ("where is the data for this URL at this timestamp?") is called the CDX API and is pretty fast, given that it's basically looking up a line in a sorted many-terabyte text file. Slower are the processes that extract the original HTTP response from what is effectively a giant multi-GB tarball sitting on an archival-grade (not IOPS-optimized) spinning disk, and the code that parses HTML and/or JavaScript and rewrites "embed" URLs for playback. Another speed limit is that we serve everything from our datacenters in the Bay Area with no CDN for most content, which you'll really notice when connecting from outside North America.
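
To make the "looking up a line in a sorted text file" idea concrete, here is a rough sketch of a binary search over sorted index lines. It is only an illustration, not the actual CDX implementation: the line layout ("<url key> <timestamp> <file> <offset> <length>"), the file names, and the `lookup` helper are all made up for the example, and a small in-memory list stands in for the on-disk file.

```python
import bisect

# Hypothetical, simplified index lines:
#   "<url key> <timestamp> <archive file> <offset> <length>"
# Kept in plain lexicographic order, like a sorted text file.
INDEX = sorted([
    "com,example)/ 20100101000000 crawl-a.warc.gz 1024 512",
    "com,example)/ 20150601120000 crawl-b.warc.gz 2048 700",
    "com,example)/about 20120301000000 crawl-a.warc.gz 4096 300",
    "org,archive)/ 20050101000000 crawl-c.warc.gz 0 900",
])


def lookup(index, url_key, timestamp):
    """Return the latest capture of url_key at or before timestamp, or None."""
    # The trailing high code point makes the probe sort after any line that
    # starts with exactly "<url_key> <timestamp>", so exact hits are included.
    probe = f"{url_key} {timestamp}\uffff"
    i = bisect.bisect_right(index, probe)
    if i == 0:
        return None
    line = index[i - 1]
    # The preceding line may belong to a different URL; reject it if so.
    if not line.startswith(url_key + " "):
        return None
    _, ts, filename, offset, length = line.split(" ")
    return {
        "timestamp": ts,
        "file": filename,
        "offset": int(offset),
        "length": int(length),
    }


if __name__ == "__main__":
    print(lookup(INDEX, "com,example)/", "20120101000000"))
```

The search takes O(log n) comparisons, which is why this kind of lookup can stay fast even when the sorted file is very large; on disk the same idea would be done by seeking to byte positions rather than indexing a list.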

The priority is to have as much data accessible as possible, and for the same cost we can crawl and store far more web content on slow spinning disks than with SSDs or large RAM caches. Most RAM on storage nodes, which would otherwise be used as disk cache, gets used for derive tasks (like OCR or video conversion) or crawling/crunching tasks. That being said, if the service is so slow as to be unusable then there's no point operating it in the first place; hopefully we can get the latency a bit lower.

So the primary reason for the multi-second response times is that many sequential megabytes have to be read from an HD in order to return the often quite tiny HTTP response data for a particular {url, timestamp}? That would explain the nearly constant (slow) response time.

(I was puzzled why it didn't seem to vary much depending on the time of day, etc.)

Well, I oversimplified a bit. The "tarball" (WARC or ARC file) is a single large file with individually compressed HTTP response objects concatenated together. The index stores the byte offset and (compressed) length of each object within this file, so an HTTP/1.1 range request can pull only the data needed. So in theory it's efficient. I don't work on that team directly and haven't looked into exactly what the latency bottleneck is; sorry for the misleading response.
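
As a rough sketch of why the offset-plus-length scheme avoids reading the whole file: each record is compressed on its own, so a slice of exactly that many bytes can be decompressed in isolation. The helpers below (`build_archive`, `read_record`, `range_header`) are hypothetical names for illustration, and an in-memory byte string stands in for the real archive file; nothing here is taken from the Wayback codebase.

```python
import gzip


def build_archive(records):
    """Concatenate individually gzip-compressed records.

    Returns the archive bytes and an index of (offset, compressed_length).
    """
    archive = bytearray()
    index = []
    for payload in records:
        member = gzip.compress(payload)
        index.append((len(archive), len(member)))
        archive.extend(member)
    return bytes(archive), index


def read_record(archive, offset, length):
    """Decompress one record using only its own byte range."""
    return gzip.decompress(archive[offset:offset + length])


def range_header(offset, length):
    """HTTP Range header value covering the same bytes (end is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


if __name__ == "__main__":
    records = [
        b"HTTP/1.1 200 OK\r\n\r\n<html>first</html>",
        b"HTTP/1.1 200 OK\r\n\r\n<html>second</html>",
        b"HTTP/1.1 404 Not Found\r\n\r\n",
    ]
    archive, index = build_archive(records)
    offset, length = index[1]
    print(range_header(offset, length))
    print(read_record(archive, offset, length))
```

Over the network, the slice in `read_record` would correspond to a request carrying the `Range` header built by `range_header`, so only the compressed bytes of the one record are transferred.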

Ok. Well, I do hope you work on improving the response times. I think it would do wonders for the popularity of the Wayback Machine (and in the end, financial contributions).

Seemingly consistent response-times of multiple seconds when the underlying storage medium (mechanical hard drives, even for the most popular content) responds in tens of milliseconds does indicate something.

I'm always amazed how people come up with artsy ways of using SVG. IMO this is akin to pixel art, where tinkerers need to spend a lot of time to become an expert in creating these things.

Really appreciate the writeup.

So, is the first gif the final animation? It isn't clear.

I don't think that they've decided on the final animation yet. Once they have, I'll update the post to default to the final animation settings and mention that more explicitly.

Cool demo... completely hijacks the browser history.

Thanks for letting me know! That was accidental and it should be fixed now.

Ha! I just manually reloaded HN to write the exact same comment almost word for word...
