Ask HN: I want to crawl every plain HTML website. Where do I begin?

ratio11 · on Dec 2, 2022

You might be interested in Common Crawl. They crawl the internet and make the full dataset downloadable.
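
For anyone starting from there, here is a minimal sketch of querying the Common Crawl URL index rather than crawling yourself. It assumes the public index server at index.commoncrawl.org and uses `CC-MAIN-2022-49` as an example crawl ID; the helper names `build_index_query` and `html_captures` are hypothetical, and the record fields (`status`, `mime`) should be checked against a real response before relying on them.

```python
import json
from urllib.parse import urlencode

# Assumptions: the public index server and an example crawl ID.
# Check the index server for the list of crawls actually available.
INDEX_SERVER = "https://index.commoncrawl.org"
CRAWL_ID = "CC-MAIN-2022-49"


def build_index_query(url_pattern, crawl_id=CRAWL_ID):
    """Return the index URL that lists captures matching url_pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def html_captures(index_response_text):
    """Parse a newline-delimited JSON response, keeping successful HTML captures."""
    records = []
    for line in index_response_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        # Compare status as a string so "200" and 200 are both accepted.
        if str(record.get("status")) == "200" and record.get("mime") == "text/html":
            records.append(record)
    return records
```

You would fetch the URL from `build_index_query("example.com/*")` with any HTTP client and pass the response text to `html_captures`. Filtering on the `text/html` MIME type is only a rough first cut at "plain HTML"; it says nothing about whether the page depends on scripts.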

heresjohnny · on Dec 2, 2022

More and more sites are driven by lazily loaded content, though – for which JavaScript is a prerequisite. Do note that you’re excluding a significant number of sites this way.

dwrodri · on Dec 2, 2022

I'm aware of the severe limitations of this first pass. For now, the main focus is building a corpus without jumping through too many hoops. Rendering JS is both a security risk and a technical lift I didn't want to start with, but it'd be a top priority once I have something I can show to other people.

deepsy · on Dec 2, 2022

I'd probably look into AWS Lambda, as you can query in parallel from different IPs at scale, cheaply.
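
A rough sketch of what one such function could look like, assuming the Python runtime and an event shaped like `{"urls": [...]}`. The event shape, the `fetch` helper, the `fetcher` parameter, and the User-Agent string are all made up for illustration. Fanning out many invocations in parallel and storing the page bodies are left out.

```python
import urllib.request


def fetch(url, timeout=10):
    """Fetch one URL and return (HTTP status, body bytes)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-crawler/0.1"}  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


def lambda_handler(event, context, fetcher=fetch):
    """Entry point: fetch each URL in event["urls"] and report the outcome.

    The fetcher argument is not part of Lambda's interface; it only exists
    so the handler can be exercised locally without network access.
    """
    results = []
    for url in event.get("urls", []):
        try:
            status, body = fetcher(url)
            results.append({"url": url, "status": status, "bytes": len(body)})
        except Exception as exc:  # one bad URL should not fail the whole batch
            results.append({"url": url, "error": str(exc)})
    return {"results": results}
```

Each invocation handles a small batch of URLs, so throughput comes from invoking the function many times concurrently rather than from anything inside the handler.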