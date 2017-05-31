The switch to a JavaScript-rendered site frontend has made the Wayback Machine pretty much inaccessible to all but the most modern and bloated of browsers. Emails to support about keeping the old site going at an alternate URL or fixing fall-back rendering/operation for no-JS have been met with silence.
I guess I'm just not in the demographic archive.org is targeting now despite being a heavy user for more than a decade. It's not the backend guy's fault but it does mean lots of us won't benefit from any backend improvements.
(For example, Google's Blogspot has this amazingly horrible thing of providing the same blog across dozens of national TLDs like 'blog.com', 'blog.co.uk', 'blog.fr' etc - so there may be a full copy of that blog in the WM except it's split across dozens of domains and the Wayback Machine forces you to check every possible permutation of domain name since it can't do a fulltext search.)
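To make the "check every permutation" chore concrete, here is a rough sketch that builds the candidate hostnames for one blog. The suffix list is a small illustrative sample, not the full set Blogspot uses, and the `web.archive.org/web/*/...` listing pattern is an assumption about how you would look each one up; both function names are made up for this example.

```python
# Illustrative, incomplete sample of national suffixes (assumption: the real list is much longer).
SUFFIXES = ["com", "co.uk", "fr", "de", "ca"]


def blogspot_variants(blog_name, suffixes=SUFFIXES):
    """Return the candidate hostnames for one blog across national Blogspot domains."""
    return [f"{blog_name}.blogspot.{suffix}" for suffix in suffixes]


def wayback_lookup_urls(blog_name, suffixes=SUFFIXES):
    """Build one capture-listing URL per hostname (assumed URL pattern)."""
    return [
        f"https://web.archive.org/web/*/{host}/*"
        for host in blogspot_variants(blog_name, suffixes)
    ]


if __name__ == "__main__":
    for url in wayback_lookup_urls("example"):
        print(url)
```

Even with a script doing the typing, each hostname still has to be checked separately, which is the whole complaint.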
How is the content crawled?
@devhead: To import the data into ES we used a custom application to extract the text from the OCR'd documents.
This is required to support our bookreader software. A complete ingestion takes a few days; we rate-limit indexing in order not to overload the cluster, and maintain reasonable search performance.
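For anyone curious what rate-limited indexing can look like, here is a minimal sketch. It is not the custom application mentioned above: `throttled_index`, `send_batch`, and the default numbers are all hypothetical, and `send_batch` stands in for whatever actually ships a batch of documents to the cluster.

```python
import time
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def throttled_index(docs, send_batch, batch_size=500, max_batches_per_sec=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Send docs in batches, pausing so the batch rate never exceeds the cap.

    `send_batch` is a hypothetical callable that indexes one list of documents.
    Returns the number of documents sent.
    """
    min_interval = 1.0 / max_batches_per_sec
    sent = 0
    for chunk in batches(docs, batch_size):
        started = clock()
        send_batch(chunk)
        sent += len(chunk)
        # Sleep off whatever is left of this batch's time slot.
        elapsed = clock() - started
        if elapsed < min_interval:
            sleep(min_interval - elapsed)
    return sent
```

The cap trades ingestion time for search latency: lowering `max_batches_per_sec` leaves more headroom for queries while stretching the full run.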
It'd be awesome if you could share how you imported the data into ES and how long it took to fully go through that large dataset?