
Honestly, I think possibly the biggest problem with indexing the Wayback Machine is simply its size. I'm pretty sure it's growing far faster than anyone can pull WARCs out for indexing, especially because, well, it's not exactly high throughput on the download side. I don't blame anyone for that, but it does make the prospect of indexing it externally feel a bit bleak.

At this point, I’d like it if there were just tools to index huge WARCs on their own. Maybe it’s time to write that.




Right, the download speed is definitely an issue (and like you say, it's quite understandable considering the volume/traffic they deal with), and the continual growth is one of many factors I didn't consider.

I wonder if the IA would allow someone to interconnect directly with their storage datacenter, if one were to submit a well-articulated plan to create this search index/capability.

Also, what do you mean by tools to index WARCs? Specifically, the gzip + WARC parsing + HTML parsing steps? Would the (CLI?) result be text extracted from the original HTML pages, i.e. something along the lines of running `strings` or BeautifulSoup?


Yeah, pretty much. Though being able to directly load data into a search cluster like Elastic would be nice.
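For anyone curious, the gzip + WARC parsing + HTML text-extraction steps described above can be sketched in pure-stdlib Python. This is a hedged toy parser for illustration only (record names and the sample page are made up; a real tool would use a library like warcio and handle chunked encoding, charsets, malformed records, etc.):

```python
# Toy pipeline: gunzip a .warc.gz, walk WARC records, strip the embedded
# HTTP headers from response records, and extract visible text from the HTML.
import gzip
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def iter_warc_records(raw: bytes):
    """Yield (headers, body) for each record in uncompressed WARC bytes.

    WARC records are a version line + header lines, a blank line, then
    exactly Content-Length bytes of body, followed by two CRLFs.
    """
    pos = 0
    while pos < len(raw):
        while raw.startswith(b"\r\n", pos):  # skip inter-record blank lines
            pos += 2
        if pos >= len(raw):
            break
        head_end = raw.index(b"\r\n\r\n", pos)
        header_block = raw[pos:head_end].decode("utf-8", "replace")
        headers = {}
        for line in header_block.split("\r\n")[1:]:  # [0] is "WARC/1.0"
            key, _, value = line.partition(":")
            headers[key.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        body = raw[head_end + 4 : head_end + 4 + length]
        yield headers, body
        pos = head_end + 4 + length


def extract_text_from_response(body: bytes) -> str:
    """Response bodies are raw HTTP messages: drop HTTP headers, parse HTML."""
    _, _, payload = body.partition(b"\r\n\r\n")
    parser = TextExtractor()
    parser.feed(payload.decode("utf-8", "replace"))
    return " ".join(parser.parts)


# Build a tiny single-record .warc.gz in memory (hypothetical sample data).
html = (b"<html><head><title>Hi</title><script>x=1</script></head>"
        b"<body><p>Hello WARC</p></body></html>")
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" + html
warc = (b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        b"Content-Length: " + str(len(http)).encode() + b"\r\n\r\n"
        + http + b"\r\n\r\n")
raw = gzip.decompress(gzip.compress(warc))  # stands in for reading a .warc.gz

for headers, body in iter_warc_records(raw):
    if headers.get("warc-type") == "response":
        print(headers["warc-target-uri"], "->", extract_text_from_response(body))
        # -> http://example.com/ -> Hi Hello WARC
```

From there, feeding each extracted (URL, text) pair into something like Elasticsearch's bulk API is the natural next step, as the comment above suggests.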



