Web crawling in this day and age is loaded with caveats, and must be done carefully so as not to put unnecessary load on other people's infrastructure. I'd really like to try building my own web search tool, so I'm trying to scope out something simple enough to get me started. Here's what I have so far.
- I don't want to parse anything in the SimilarWeb Top 50.
- I don't want to render JS.
- I'd like to keep the web index small enough that it is still measured in TBs.
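The first constraint above could be enforced with a plain host blocklist check before a URL ever reaches the fetch queue. This is only a rough sketch in Go (the language the post leans toward further down): `skipHosts` and `shouldSkip` are hypothetical names, and the entries are placeholders rather than the real SimilarWeb ranking, which would have to be filled in by hand.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// skipHosts is a hypothetical, hand-maintained blocklist.
// The entries are placeholders, not the actual SimilarWeb Top 50.
var skipHosts = map[string]bool{
	"example.com": true,
	"example.org": true,
}

// shouldSkip reports whether rawURL points at a blocked domain
// or at any subdomain of a blocked domain.
func shouldSkip(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return true // no usable host, so nothing to crawl
	}
	host := strings.ToLower(u.Hostname())
	for {
		if skipHosts[host] {
			return true
		}
		// Drop the leftmost label and try the parent domain.
		i := strings.IndexByte(host, '.')
		if i < 0 {
			return false
		}
		host = host[i+1:]
	}
}

func main() {
	for _, raw := range []string{
		"https://www.example.com/page",
		"https://notexample.com/",
		"http://blog.example.net/post",
	} {
		fmt.Printf("%-32s skip=%v\n", raw, shouldSkip(raw))
	}
}
```

Walking up the labels means `www.example.com` is caught by the `example.com` entry, while a lookalike such as `notexample.com` is not.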
I've built search engines for research papers in the past. The key difference was that I could collect that data very easily through a documented API. Now I need to either build or use a crawler, and I'm not sure where to begin. Here are some thoughts and questions that I have so far.
- I'm probably going to write the crawler in Go. It seems like a good fit for this sort of software.
- How do I start collecting lists of domains? Do I just start hitting public IPv4 addresses on ports 80 and 443?
- If I run something like this with proper rate limiting on a server, would Cloudflare inevitably just start blocking me?
- If I were to run this from a machine connected to the Internet via a residential ISP, then would I get a nasty letter from my ISP?
Any advice or feedback is appreciated. The goal of this project is to learn more about web crawling, more than to build a product to sell.
https://commoncrawl.org