
Honestly, I think possibly the biggest problem with indexing the Wayback Machine is simply its size. I'm pretty sure it's growing far faster than anyone can pull WARCs out for indexing, especially because, well, it's not exactly high throughput on the download side. I don't blame anyone for that, but it does make the prospect of indexing it externally feel a bit bleak.

At this point, I’d like it if there were just tools to index huge WARCs on their own. Maybe it’s time to write that.




Right, the download speed is definitely an issue (and like you say, it's quite understandable considering the volume/traffic they deal with), and the continual growth is one of many factors I didn't consider.

I wonder if the IA would allow someone to interconnect directly with their storage datacenter, if one were to submit a well-articulated plan to create this search index/capability.

Also, what do you mean by tools to index WARCs? Specifically, the gzip + WARC parsing + HTML parsing steps? Would the (CLI?) result be text extracted from the original HTML pages, i.e. something along the lines of running `strings` or BeautifulSoup?


Yeah, pretty much. Though being able to directly load data into a search cluster like Elastic would be nice.
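For anyone curious, the gzip + WARC parsing + HTML text-extraction steps described above can be sketched in pure-stdlib Python. This is a hedged toy parser for illustration only (record names and the sample page are made up; a real tool would use a library like warcio and handle chunked encoding, charsets, malformed records, etc.):

```python
# Toy pipeline: gunzip a .warc.gz, walk WARC records, strip the embedded
# HTTP headers from response records, and extract visible text from the HTML.
import gzip
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def iter_warc_records(raw: bytes):
    """Yield (headers, body) for each record in uncompressed WARC bytes.

    WARC records are a version line + header lines, a blank line, then
    exactly Content-Length bytes of body, followed by two CRLFs.
    """
    pos = 0
    while pos < len(raw):
        while raw.startswith(b"\r\n", pos):  # skip inter-record blank lines
            pos += 2
        if pos >= len(raw):
            break
        head_end = raw.index(b"\r\n\r\n", pos)
        header_block = raw[pos:head_end].decode("utf-8", "replace")
        headers = {}
        for line in header_block.split("\r\n")[1:]:  # [0] is "WARC/1.0"
            key, _, value = line.partition(":")
            headers[key.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        body = raw[head_end + 4 : head_end + 4 + length]
        yield headers, body
        pos = head_end + 4 + length


def extract_text_from_response(body: bytes) -> str:
    """Response bodies are raw HTTP messages: drop HTTP headers, parse HTML."""
    _, _, payload = body.partition(b"\r\n\r\n")
    parser = TextExtractor()
    parser.feed(payload.decode("utf-8", "replace"))
    return " ".join(parser.parts)


# Build a tiny single-record .warc.gz in memory (hypothetical sample data).
html = (b"<html><head><title>Hi</title><script>x=1</script></head>"
        b"<body><p>Hello WARC</p></body></html>")
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" + html
warc = (b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        b"Content-Length: " + str(len(http)).encode() + b"\r\n\r\n"
        + http + b"\r\n\r\n")
raw = gzip.decompress(gzip.compress(warc))  # stands in for reading a .warc.gz

for headers, body in iter_warc_records(raw):
    if headers.get("warc-type") == "response":
        print(headers["warc-target-uri"], "->", extract_text_from_response(body))
        # -> http://example.com/ -> Hi Hello WARC
```

From there, feeding each extracted (URL, text) pair into something like Elasticsearch's bulk API is the natural next step, as the comment above suggests.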



