The Wayback Machine data is stored in WARC or ARC files which are written at web crawl time by the Heritrix crawler (or other crawlers) and stored as regular files in the archive.org storage cluster.
Playback is accomplished by binary searching a 2-level index of pointers into the WARC data. The second level of this index is a 20TB compressed sorted list of (url, date, pointer) tuples called CDX records. The first level fits in core, and is a 13GB sorted list of every 3000th entry in the CDX index, with a pointer to larger CDX block.
Index lookup works by binary searching the first level list stored in core, then HTTP range-request loading the appropriate second-level blocks from the CDX index. Finally, web page data is loaded by range-requesting WARC data pointed to by the CDX records. Before final output, link re-writing and other transforms are applied to make playback work correctly in the browser.
The server stack:
- frontend: Tengine + HAProxy to a pool of Wayback tomcat app servers
- backend: The redis-backed archive.org metadata API for object location and nginx on linux (via ext4) for data service
Some Wayback Machine queries require sorted key traversal: listing all dates for which captures of an URL are available, the discovery of the nearest-date for an URL, and listing all available URLs beginning with a certain URL-prefix.
Maintaining the canonically-ordered master index of (URL, date, pointer) – that 20TB second-level index rajbot mentions – allows both kinds of queries to be satisfied. And once you've got that artifact, the individual capture lookups can be satisfied fairly efficiently, too. (A distributed-hashtable would then be something extra to maintain.)
Also, the queries aren't random: there are hot ranges, and even a single user's session begins with a range query (all dates for an URL), then visits one URL from that same range. Then loading nearest-date captures for the page's inline resources starts hitting similar ranges, as do followup clicks on outlinks or nearby dates. So even though the master index is still on spinning disk (unless there was a recent big SSD upgrade that escaped my notice), the ranges-being-browsed wind up in main-memory caches quite often.
There's no doubt many places that could be improved, but this basic sorted-index model has fit the application well for a long while, avoided too much domain-specific complexity, and been amenable to many generations of index/sharding/replication/internal-API tweaks.
BTW, the Archive is hiring for multiple technical roles, including a senior role developing a next-generation of the Wayback Machine: https://archive.org/about/jobs.php
Just my guess.