AI companies have started using a technique to evade rate limits where they will...

Nathanba · 2025-04-13T04:09:00 1744517340

Tens of thousands of scraper bots for a single site? Is that really the case? I would have assumed that maybe 3-5 bots send, let's say, 20 requests per second in parallel to scrape. Sure, they might eventually start trying different IPs and bots if the others are timing out, but ultimately it's still the same end result: all they will realize is that they have to increase the timeout and use headless browsers to cache results, and the entire protection is gone. But yes, I think for big bot farms it will be a somewhat annoying cost increase to do this. This should really be combined with the Cloudflare captcha to make it even more effective.

marginalia_nu · 2025-04-13T08:36:26 1744533386

A lot of the worst offenders seem to be routing the traffic through a residential botnet, which means that the traffic really does come from a huge number of different origins. It's really janky and often the same resources are fetched multiple times.

Saving and re-using the JWT cookie isn't that helpful, as you can effectively rate limit using the cookie as identity, so to reach the same request rates you see now they'd still need to solve hundreds or thousands of challenges per domain.
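To illustrate the point about using the cookie as identity, here is a minimal sketch of a token-bucket limiter keyed on a cookie ID. The class name `CookieRateLimiter`, its parameters, and the idea of passing in some stable identifier extracted from the JWT are all assumptions made for illustration, not how any particular challenge system implements it.

```python
import time


class CookieRateLimiter:
    """Token bucket per cookie identity (hypothetical sketch)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens refilled per second
        self.burst = burst         # maximum tokens a bucket can hold
        self.clock = clock         # injectable clock, handy for testing
        self.buckets = {}          # cookie id -> (tokens, last timestamp)

    def allow(self, cookie_id):
        """Return True if this cookie may make one more request now."""
        now = self.clock()
        tokens, last = self.buckets.get(cookie_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[cookie_id] = (tokens - 1, now)
            return True
        self.buckets[cookie_id] = (tokens, now)
        return False
```

Under a scheme like this, one solved challenge buys only one bucket's worth of throughput, so a scraper wanting N times the per-cookie rate needs roughly N solved challenges.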

Hasnep · 2025-04-13T04:59:40 1744520380

If you're sending 20 requests per second from one IP address you'll hit rate limits quickly. That's why they're using botnets to DDoS these websites.

vhcr · 2025-04-13T04:25:04 1744518304

Until someone writes the proof of work code for GPUs and it runs 100x faster and cheaper.

marginalia_nu · 2025-04-13T09:50:22 1744537822

A big part of the problem with these scraping operations is how poorly implemented they are. They could get much cheaper gains by simply cleaning up how they operate, for example by not redundantly fetching the same documents hundreds of times.

Regardless of how they solve the challenges, creating an incentive to be efficient is a victory in itself. GPUs aren't cheap either, especially not if you're renting them via a browser farm.

runxiyu · 2025-04-13T07:25:32 1744529132

Anubis et al. are also looking into alternative algorithms. There seems to be consensus that SHA-256 PoW is not appropriate.
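For readers unfamiliar with what a SHA-256 proof of work looks like, here is a minimal hashcash-style sketch: the client searches for a nonce whose hash has a given number of leading zero bits, and the server verifies with a single hash. The function names and the challenge format are assumptions for illustration; this is not Anubis's actual code.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def pow_hash(challenge: str, nonce: int) -> bytes:
    # Hypothetical format: challenge string followed by the decimal nonce.
    return hashlib.sha256(f"{challenge}{nonce}".encode()).digest()


def solve(challenge: str, difficulty_bits: int) -> int:
    """Brute-force the smallest nonce meeting the difficulty (client side)."""
    for nonce in itertools.count():
        if leading_zero_bits(pow_hash(challenge, nonce)) >= difficulty_bits:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """One hash is enough to check a claimed solution (server side)."""
    return leading_zero_bits(pow_hash(challenge, nonce)) >= difficulty_bits
```

The inner loop is nothing but SHA-256 over a tiny input, which is exactly the kind of work that parallelizes well on GPUs and dedicated hardware, hence the concern raised above.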

genewitch · 2025-04-13T09:17:31 1744535851

There are lots of other ones, but you want hashes that use lots of RAM. Stuff like scrypt used to be the go-to, but I am sure there are better ones now.
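As a rough sketch of the memory-hard idea, Python's standard `hashlib.scrypt` (available when Python is built against OpenSSL 1.1 or newer) exposes the cost parameters directly. The helper names, the way the challenge and nonce are fed in, and the parameter values below are illustrative assumptions, not a recommendation.

```python
import hashlib


def approx_memory_bytes(n: int, r: int) -> int:
    """Approximate scrypt working memory: about 128 * n * r bytes."""
    return 128 * n * r


def scrypt_work(challenge: bytes, nonce: int, n: int = 2**12, r: int = 8, p: int = 1) -> bytes:
    """Hash one nonce with scrypt; each call needs roughly 128*n*r bytes of RAM."""
    # Hypothetical layout: nonce as the password, challenge as the salt.
    return hashlib.scrypt(
        nonce.to_bytes(8, "big"), salt=challenge, n=n, r=r, p=p, dklen=32
    )
```

With `n=2**12` and `r=8`, each attempt touches about 4 MiB, so running thousands of attempts in parallel becomes bounded by memory capacity and bandwidth rather than raw hash throughput.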