Hacker News new | past | comments | ask | show | jobs | submit login

Thanks! We were writing up a response at the same time:

The Wayback Machine data is stored in WARC or ARC files[0] which are written at web crawl time by the Heritrix crawler[1] (or other crawlers) and stored as regular files in the archive.org storage cluster.

Playback is accomplished by binary searching a 2-level index of pointers into the WARC data. The second level of this index is a 20TB compressed sorted list of (url, date, pointer) tuples called CDX records[2]. The first level fits in core, and is a 13GB sorted list of every 3000th entry in the CDX index, with a pointer to larger CDX block.

Index lookup works by binary searching the first level list stored in core, then HTTP range-request loading the appropriate second-level blocks from the CDX index. Finally, web page data is loaded by range-requesting WARC data pointed to by the CDX records. Before final output, link re-writing and other transforms are applied to make playback work correctly in the browser.

The server stack:

- frontend: Tengine + HAProxy to a pool of Wayback tomcat app servers[3]

- backend: The redis-backed archive.org metadata API[4] for object location and nginx on linux (via ext4) for data service

  [0] http://en.wikipedia.org/wiki/Web_ARChive
  [1] https://github.com/internetarchive/heritrix3
  [2] https://github.com/internetarchive/CDX-Writer
  [3] https://github.com/internetarchive/wayback
  [4] http://blog.archive.org/2013/07/04/metadata-api/
-sam and raj, Internet Archive

And despite the size, it's so much faster than it used to be! Good work.

Why not use hashtable instead of binary search? I'm assuming your index is immutable and query for the data structure is essentially random. Another advantage of looking up item by URL hash may be that you can use hash-prefix to direct the query to appropriate machine (so basically, your infrastructure simply looks like giant distributed hashtable top to bottom with no binary searches required).

Former Archive employee (& still occasional contract contributor) here. This was one of my 1st questions when joining in 2003!

Some Wayback Machine queries require sorted key traversal: listing all dates for which captures of an URL are available, the discovery of the nearest-date for an URL, and listing all available URLs beginning with a certain URL-prefix.

Maintaining the canonically-ordered master index of (URL, date, pointer) – that 20TB second-level index rajbot mentions – allows both kinds of queries to be satisfied. And once you've got that artifact, the individual capture lookups can be satisfied fairly efficiently, too. (A distributed-hashtable would then be something extra to maintain.)

Also, the queries aren't random: there are hot ranges, and even a single user's session begins with a range query (all dates for an URL), then visits one URL from that same range. Then loading nearest-date captures for the page's inline resources starts hitting similar ranges, as do followup clicks on outlinks or nearby dates. So even though the master index is still on spinning disk (unless there was a recent big SSD upgrade that escaped my notice), the ranges-being-browsed wind up in main-memory caches quite often.

There's no doubt many places that could be improved, but this basic sorted-index model has fit the application well for a long while, avoided too much domain-specific complexity, and been amenable to many generations of index/sharding/replication/internal-API tweaks.

BTW, the Archive is hiring for multiple technical roles, including a senior role developing a next-generation of the Wayback Machine: https://archive.org/about/jobs.php

Perhaps caching reasons? If people follow links internally within a domain then randomly distributed hashes would mean a visit to randomly distributed data servers to retrieve every new page. With their architecture the 'CDX' block should contain similar URLs and accessing the linked URLs would could be a seek within the already retrieved block.

Just my guess.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact