I guess I'm just not in the demographic archive.org is targeting now despite being a heavy user for more than a decade. It's not the backend guy's fault but it does mean lots of us won't benefit from any backend improvements.
(For example, Google's Blogspot has this amazingly horrible habit of serving the same blog across dozens of national TLDs like 'blog.com', 'blog.co.uk', 'blog.fr', etc. So there may be a full copy of that blog in the WM, except it's split across dozens of domains, and the Wayback Machine forces you to check every possible permutation of the domain name since it can't do a full-text search.)
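To make the "check every permutation" chore concrete, here is a rough sketch that lists the hostnames you would have to try by hand. The suffix list is only an illustrative assumption, not a complete list of Blogspot's national domains, and `blogspot_variants` is a made-up helper name.

```python
from typing import List, Sequence

# Illustrative, non-exhaustive sample of national suffixes (an assumption).
SUFFIXES = ["com", "co.uk", "fr", "de", "ca"]


def blogspot_variants(blog_name: str, suffixes: Sequence[str] = SUFFIXES) -> List[str]:
    """Return candidate hostnames for one blog, one per national suffix."""
    return [f"{blog_name}.blogspot.{suffix}" for suffix in suffixes]


if __name__ == "__main__":
    # Each of these would need its own lookup in the Wayback Machine.
    for host in blogspot_variants("example"):
        print(host)
```

Each hostname still has to be looked up separately, which is exactly the problem.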
How is the content crawled?
@devhead: To import the data into ES we used a custom application to extract the text from the OCR'd documents.
This is required to support our bookreader software. A complete ingestion takes a few days; we rate-limit indexing so that we don't overload the cluster and can maintain reasonable search performance.
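For anyone wondering what rate-limited indexing can look like in general, here is a minimal sketch. It is not the actual import application: `throttled_index`, `send_batch`, and the default batch size and rate are all hypothetical, and `send_batch` stands in for whatever call pushes a batch of documents to the cluster.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group documents into lists of at most `size` items."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def throttled_index(
    docs: Iterable[dict],
    send_batch: Callable[[List[dict]], None],  # hypothetical: pushes one batch to the cluster
    batch_size: int = 500,                     # assumed value
    max_batches_per_sec: float = 2.0,          # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send documents in batches, never starting batches faster than the limit."""
    min_interval = 1.0 / max_batches_per_sec
    sent = 0
    last_start = None
    for batch in batches(docs, batch_size):
        if last_start is not None:
            # Wait out the remainder of the interval since the previous batch.
            wait = min_interval - (time.monotonic() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = time.monotonic()
        send_batch(batch)
        sent += len(batch)
    return sent
```

Lowering `max_batches_per_sec` trades a longer total ingestion time for less load on the cluster, which matches the trade-off described above.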
It'd be awesome if you could share how you imported the data into ES and how long it took to fully go through that large dataset.