Ask HN: I want to crawl every plain HTML website. Where do I begin?
4 points by dwrodri on Dec 1, 2022 | 4 comments
Web crawling in this day and age is loaded with caveats, and must be done carefully so as not to put unnecessary load on other people's infrastructure. I'd really like to try to make my own web search tool, so I'm trying to scope out something simple enough to get me started. Here's what I have so far.

- I don't want to parse anything in the SimilarWeb Top 50.

- I don't want to render JS

- I'd like to keep the web index small enough that it is still measured in TBs

I've built search engines for research papers in the past. The key difference was that I could collect that data very easily through a documented API. Now I need to either build or use a crawler and I'm not sure where to begin. Here are some thoughts that I have so far.

- I'm probably going to write the crawler in Go. It seems like a good fit for this sort of software.

- How do I start collecting lists of domains? Do I just start hitting public IPv4 addresses on ports 80 and 443?

- If I run something like this with proper rate limiting on a server, would Cloudflare inevitably just start blocking me?

- If I were to run this from a machine connected to the Internet via a residential ISP, then would I get a nasty letter from my ISP?

Any advice or feedback is appreciated. The goal of this project is to learn more about web crawling rather than to build a product that would be sold.




You might be interested in Common Crawl. They crawl the internet and make the full dataset downloadable.

https://commoncrawl.org
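For a sense of how the Common Crawl data could be queried before downloading anything large, here is a small sketch in Go that only builds a query URL for its CDX-style URL index. The `indexQueryURL` helper is hypothetical, and the crawl ID `CC-MAIN-2022-49` is an assumed example value; check the list of available crawls on the Common Crawl site before relying on it.

```go
package main

import (
	"fmt"
	"net/url"
)

// indexQueryURL (hypothetical helper) builds a query URL for Common Crawl's
// CDX-style index. crawlID names a crawl, and pattern is a URL or a prefix
// pattern such as "example.com/*".
func indexQueryURL(crawlID, pattern string) string {
	q := url.Values{}
	q.Set("url", pattern)
	q.Set("output", "json") // ask for one JSON record per line
	return fmt.Sprintf("https://index.commoncrawl.org/%s-index?%s", crawlID, q.Encode())
}

func main() {
	// Assumed crawl ID, used here only for illustration.
	fmt.Println(indexQueryURL("CC-MAIN-2022-49", "example.com/*"))
}
```

The sketch stops at the URL on purpose: fetching it with `net/http` and decoding each JSON line is the natural next step, and the index records point to the archive files that hold the page content.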


More and more sites are driven by lazily loaded content, though, for which JavaScript is a prerequisite. Do note that you're excluding a significant number of sites this way.


I'm aware of the severe limitations of this first pass. For now, the main focus is building a corpus without jumping through too many hoops. Rendering JS is both a security risk and a technical lift I didn't want to start with, but it'd be a top priority once I get something that I can show to other people.


I'd probably look into AWS Lambda, since it lets you send requests in parallel from different IPs cheaply and at scale.



