While I don’t disagree with the idea that all crawlers should have equal access, we also need to address the quality of many crawlers.
Google and Microsoft have never hammered any website I’ve run into the ground. Crawlers from other, smaller search engines have, to the point where it was easier to just block them entirely.
Part of the problem is that sites want search engines to index their site, but don’t want random people scraping the entire thing. So they do the best they can and forget that Google isn’t the web. I doubt it’s shady deals with Google; it’s just small teams doing the best they can, and sometimes they forget to think ideas through because it’s good enough.
I think this is a problem that should be solved by automatic rate limiting and throttling at the application/caching layer (or just the individual web server for smaller sites). Requests with a non-browser UA get put into a separate bots-only queue that drains at roughly one request per second. If the queue fills up, you start sending 429s, with random early failures for bots (UA/IP/subnet pairs) that are overrepresented in the traffic flow.
I don't know if such software exists, but it should. It would be a hell of a lot healthier for the web than "everyone but Google f*ck off", and it creates an incentive for bots to throttle themselves (a well-behaved bot gets faster responses than one that requests as fast as possible).
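To make the idea concrete, here's a minimal sketch of that bots-only queue, assuming a single-process Python server; everything in it (BotThrottle, admit, BOT_QUEUE_SIZE, the thresholds) is made up for illustration, not an existing library:

```python
import random
import threading
import time
from collections import Counter

BOT_QUEUE_SIZE = 50         # max queued bot requests before we start shedding load
DRAIN_RATE = 1.0            # bot requests served per second
EARLY_FAIL_THRESHOLD = 0.3  # queue share above which a UA/IP pair risks random 429s

class BotThrottle:
    def __init__(self):
        self.lock = threading.Lock()
        self.queue_depth = 0
        self.per_client = Counter()   # (user_agent, ip) -> requests currently queued
        self.next_slot = time.monotonic()

    def admit(self, user_agent, ip):
        """Return (ok, wait_seconds). ok=False means respond 429 immediately."""
        key = (user_agent, ip)
        with self.lock:
            if self.queue_depth >= BOT_QUEUE_SIZE:
                return False, 0.0
            share = self.per_client[key] / max(self.queue_depth, 1)
            # Random early failure for clients hogging the bot queue.
            if share > EARLY_FAIL_THRESHOLD and random.random() < share:
                return False, 0.0
            self.queue_depth += 1
            self.per_client[key] += 1
            # Hand out serving slots spaced 1/DRAIN_RATE seconds apart.
            now = time.monotonic()
            self.next_slot = max(self.next_slot, now) + 1.0 / DRAIN_RATE
            return True, self.next_slot - now

    def release(self, user_agent, ip):
        with self.lock:
            self.queue_depth -= 1
            self.per_client[(user_agent, ip)] -= 1
```

The non-browser request path would call admit(), sleep for the returned delay (or send a 429 if rejected), serve the request, then call release(). A real deployment would need this state shared across workers and some care in deciding what counts as a "browser" UA, both of which this sketch skips.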
I suspect that at least some of the bots use web server response times and response codes as part of the signal for ranking. If your website does not appear capable of handling load then it won't rank as highly, because it is not in their best interests to have search results that don't load.