
In my case I was just fetching the home page of every known domain name. The first issue I noticed was that DNS requests needed to be asynchronous; I used libunbound for that. I wanted to rate-limit fetching per /28 IPv4 subnet to be polite to the hosts being crawled, but I couldn't do that without knowing the IP beforehand (while keeping the crawler busy), so I ended up creating a queue keyed by IP. I found that some subnets host hundreds of thousands of sites, so although the crawl starts quickly, you end up rate-limited on those.
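The per-/28 queueing described above can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual code: the class and parameter names (`SubnetScheduler`, `min_interval`) are invented, and a real crawler would also need the async DNS stage and eviction of drained queues.

```python
import ipaddress
import time
from collections import defaultdict, deque

class SubnetScheduler:
    """Hypothetical sketch: after DNS resolution, group target IPs into
    per-/28 queues and only dequeue from a subnet when its rate limit
    allows, so dense subnets don't get hammered."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval   # seconds between fetches per /28
        self.queues = defaultdict(deque)   # /28 network -> pending IPs
        self.next_ok = defaultdict(float)  # /28 network -> earliest next fetch

    def bucket(self, ip):
        # Collapse an IPv4 address to its containing /28 (16-address block).
        return ipaddress.ip_network(f"{ip}/28", strict=False)

    def enqueue(self, ip):
        self.queues[self.bucket(ip)].append(ip)

    def dequeue(self, now=None):
        # Return one IP whose /28 is currently allowed, or None if every
        # non-empty subnet queue is still inside its back-off window.
        now = time.monotonic() if now is None else now
        for net, q in self.queues.items():
            if q and now >= self.next_ok[net]:
                self.next_ok[net] = now + self.min_interval
                return q.popleft()
        return None
```

With this shape, a crawl of a dense subnet degrades exactly as described: the scheduler keeps returning IPs from other /28s while the crowded one drains one fetch per `min_interval`.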

At the higher end of the scale, I'm also interested in how 'hard'/polite you should be with authoritative nameservers, since some of them rate-limit as well.
