Former Archive employee (& still occasional contract contributor) here. This was one of my first questions when joining in 2003!
Some Wayback Machine queries require sorted-key traversal: listing all dates for which captures of a URL are available, finding the nearest-date capture of a URL, and listing all available URLs beginning with a certain prefix.
Maintaining the canonically-ordered master index of (URL, date, pointer) tuples – that 20TB second-level index rajbot mentions – allows all of these queries to be satisfied. And once you've got that artifact, individual capture lookups can be served fairly efficiently too. (A distributed hashtable would then be something extra to maintain.)
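To make the query shapes concrete, here's a toy sketch (mine, not Archive code) of the three operations over a small in-memory sorted index, using Python's bisect. The names, sample data, and nearest-date closeness measure are all invented for illustration; the real index is a sharded on-disk structure, but the lookups have the same shape:

    import bisect

    # Toy sorted index of (url_key, timestamp, pointer) rows. Timestamps use
    # the 14-digit YYYYMMDDhhmmss form the Wayback Machine exposes; the keys,
    # pointers, and captures here are made up.
    INDEX = sorted([
        ("example.com/",      "20030115000000", "ptr1"),
        ("example.com/",      "20091231235959", "ptr2"),
        ("example.com/",      "20160601120000", "ptr3"),
        ("example.com/about", "20100305080000", "ptr4"),
        ("example.net/",      "20050505050505", "ptr5"),
    ])
    KEYS = [(url, ts) for url, ts, _ in INDEX]

    def dates_for(url):
        # Query 1: all capture dates for one URL -- a bounded range scan.
        lo = bisect.bisect_left(KEYS, (url, ""))
        hi = bisect.bisect_left(KEYS, (url + "\x00", ""))
        return [ts for _, ts, _ in INDEX[lo:hi]]

    def nearest(url, ts):
        # Query 2: nearest-date capture -- bisect into the URL's range, then
        # compare the neighbors on either side by numeric timestamp distance
        # (a toy closeness measure, not a true time delta).
        lo = bisect.bisect_left(KEYS, (url, ""))
        hi = bisect.bisect_left(KEYS, (url + "\x00", ""))
        i = bisect.bisect_left(KEYS, (url, ts), lo, hi)
        neighbors = INDEX[max(lo, i - 1):min(hi, i + 1)]
        return min(neighbors, key=lambda r: abs(int(r[1]) - int(ts)), default=None)

    def urls_with_prefix(prefix):
        # Query 3: all URLs under a prefix -- another bounded range scan.
        lo = bisect.bisect_left(KEYS, (prefix, ""))
        hi = bisect.bisect_left(KEYS, (prefix + "\uffff", ""))
        return sorted({url for url, _, _ in INDEX[lo:hi]})

    print(dates_for("example.com/"))                  # three capture dates
    print(nearest("example.com/", "20080101000000"))  # the 2009 capture wins
    print(urls_with_prefix("example.com"))            # both example.com URLs

The point of the sketch: all three query types reduce to bisect-plus-scan over one sorted artifact, which is why keeping that artifact sorted pays for itself.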
Also, the queries aren't random: there are hot ranges, and even a single user's session begins with a range query (all dates for a URL), then visits one URL from that same range. Loading nearest-date captures for the page's inline resources then hits similar ranges, as do follow-up clicks on outlinks or nearby dates. So even though the master index is still on spinning disk (unless there was a recent big SSD upgrade that escaped my notice), the ranges being browsed wind up in main-memory caches quite often.
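A toy illustration of that locality claim: if index reads go through fixed-size blocks behind an ordinary LRU cache, one session's lookups keep landing in the same block(s). Everything below (the block size, the read_block/lookups_for helpers, the sample rows) is invented for the sketch, not how the Archive's caching actually works:

    from functools import lru_cache

    # A captured page and its inline resources sit near each other in key
    # order, so they share index blocks.
    INDEX = [
        ("example.com/",          "20091231235959"),
        ("example.com/about",     "20100305080000"),
        ("example.com/logo.png",  "20091231235958"),
        ("example.com/style.css", "20100101000002"),
    ]
    BLOCK_SIZE = 4  # entries per "disk block" in this toy; real blocks are far larger

    @lru_cache(maxsize=64)
    def read_block(block_no):
        # Stand-in for one disk read of a sorted-index block.
        print(f"disk read: block {block_no}")  # fires only on a cache miss
        return INDEX[block_no * BLOCK_SIZE:(block_no + 1) * BLOCK_SIZE]

    def lookups_for(url_prefix):
        # Deliberately naive: scan every block, keep entries under the prefix.
        hits = []
        for b in range((len(INDEX) + BLOCK_SIZE - 1) // BLOCK_SIZE):
            hits.extend(e for e in read_block(b) if e[0].startswith(url_prefix))
        return hits

    lookups_for("example.com/")           # miss: prints "disk read: block 0"
    lookups_for("example.com/logo.png")   # hit: served from the cached block
    lookups_for("example.com/style.css")  # hit: served from the cached block

The first query in a session pays the disk read; the nearest-date fetch and the inline-resource lookups that follow reuse the same cached block.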
There are no doubt many places that could be improved, but this basic sorted-index model has fit the application well for a long while, avoided too much domain-specific complexity, and been amenable to many generations of index/sharding/replication/internal-API tweaks.
BTW, the Archive is hiring for multiple technical roles, including a senior role developing the next generation of the Wayback Machine: https://archive.org/about/jobs.php