Thanks! We were writing up a response at the same time: The Wayback Machine data...

_wiv7 · on May 9, 2014

And despite the size, it's so much faster than it used to be! Good work.

sytelus · on May 10, 2014

Why not use hashtable instead of binary search? I'm assuming your index is immutable and query for the data structure is essentially random. Another advantage of looking up item by URL hash may be that you can use hash-prefix to direct the query to appropriate machine (so basically, your infrastructure simply looks like giant distributed hashtable top to bottom with no binary searches required).

gojomo · on May 10, 2014

Former Archive employee (& still occasional contract contributor) here. This was one of my 1st questions when joining in 2003!

Some Wayback Machine queries require sorted key traversal: listing all dates for which captures of an URL are available, the discovery of the nearest-date for an URL, and listing all available URLs beginning with a certain URL-prefix.

Maintaining the canonically-ordered master index of (URL, date, pointer) – that 20TB second-level index rajbot mentions – allows both kinds of queries to be satisfied. And once you've got that artifact, the individual capture lookups can be satisfied fairly efficiently, too. (A distributed-hashtable would then be something extra to maintain.)

Also, the queries aren't random: there are hot ranges, and even a single user's session begins with a range query (all dates for an URL), then visits one URL from that same range. Then loading nearest-date captures for the page's inline resources starts hitting similar ranges, as do followup clicks on outlinks or nearby dates. So even though the master index is still on spinning disk (unless there was a recent big SSD upgrade that escaped my notice), the ranges-being-browsed wind up in main-memory caches quite often.

There's no doubt many places that could be improved, but this basic sorted-index model has fit the application well for a long while, avoided too much domain-specific complexity, and been amenable to many generations of index/sharding/replication/internal-API tweaks.

BTW, the Archive is hiring for multiple technical roles, including a senior role developing a next-generation of the Wayback Machine: https://archive.org/about/jobs.php

bglazer · on May 10, 2014

Perhaps caching reasons? If people follow links internally within a domain then randomly distributed hashes would mean a visit to randomly distributed data servers to retrieve every new page. With their architecture the 'CDX' block should contain similar URLs and accessing the linked URLs would could be a seek within the already retrieved block.

Just my guess.