Cloudflare released a feature to block these sorts of crawlers, even on free tiers. I don't have a need to block crawlers, but I'm curious how effective it would be in this case.
I don't expect most companies to have their static content cached well enough to largely prevent this sort of thing, but certainly by the time you're iFixit's size you should have your shit together, right? Their content seems like some of the best suited for caching, since it shouldn't need to be dynamic and shouldn't change much over time.
Further, on the content-scraping side, aren't there better ways to determine when to re-scrape so you're not completely overloading the servers of folks who don't have easily cacheable static content? Are "HTTP/1.1 304 Not Modified" responses too difficult to implement or honor? Seems like failures on both sides are what lead to these situations.
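Honoring validators on the scraper side isn't hard, either. Rough sketch using Python's requests library; the in-memory cache and the URL are just placeholders, not what any real crawler does:

```python
import requests

# Toy in-memory cache: url -> {"etag": ..., "last_modified": ..., "body": ...}.
# A real crawler would persist this between runs and add rate limiting.
cache = {}

def polite_fetch(url):
    headers = {}
    cached = cache.get(url)
    if cached:
        # Conditional request: give the server the chance to answer 304
        # instead of re-sending the whole page.
        if cached.get("etag"):
            headers["If-None-Match"] = cached["etag"]
        if cached.get("last_modified"):
            headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get(url, headers=headers, timeout=30)

    if resp.status_code == 304 and cached:
        # Nothing changed since last time; reuse the stored copy.
        return cached["body"]

    resp.raise_for_status()
    cache[url] = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "body": resp.text,
    }
    return resp.text

# The second call should come back as a cheap 304 if the server sends
# validators. URL is just a placeholder.
page = polite_fetch("https://example.com/guide/123")
page = polite_fetch("https://example.com/guide/123")
```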
When I worked at iFixit ~12-13 years ago most of the site was already aggressively cached whenever possible, and I’m sure it’s only gotten better since.
They are no strangers to bursty traffic. We would get insane bursts of traffic many, many times greater than typical whenever a teardown of a new iDevice was published shortly after release.
If Kyle and the devops folks there are noticing it, it’s definitely disruptive behavior.
I believe you. But at the same time it's insane to me that we have these standards that aren't upheld on one end or the other. I'm fully willing to acknowledge that Anthropic is the problem here. But at some level, and at some point, don't you lose the ability to say the site is down because it's too popular? I'm totally open to the argument that the traffic overloaded the Cloudflare limits or whatever. But outages are caused not just by Anthropic-level traffic; a random HN post generates enough traffic to take down websites, and it happens frequently. I guess this is just my cognitive dissonance between being told on HN "you don't need to scale! You're not Facebook!" and "OMG our site went down because we have a few thousand concurrent visitors!"
I don't think the site itself went down, so much as the scraping was excessive, and beyond that, the presumed use of the scraped content is against the site's TOS and copyright.
On a smaller site, it can absolutely be brutal. I worked on a site where literally 2/3 of requests were bots, and a lot of those requests were for dynamically constructed pages that were harder to cache. In the end, switching to a supplemental search database helped significantly, but that doesn't mean an aggressive bot is an okay thing.
From the tweet:
> Hey @AnthropicAI: I get you're hungry for data. Claude is really smart! But do you really need to hit our servers a million times in 24 hours?
> You're not only taking our content without paying, you're tying up our devops resources. Not cool.
Assuming they're literally getting hit "a million times in 24 hours", wouldn't it still cost bandwidth? Also, perhaps iFixit hasn't seen this kind of traffic in the past, so they didn't bother optimizing prematurely.
I feel the blame mostly falls on Anthropic not being considerate when they scrape other people's content.
> I feel the blame mostly falls on Anthropic not being considerate when they scrape other people's content.
That's certainly a fair assumption. I haven't looked into it in any detail. Is iFixit sending appropriate cache headers, or is Anthropic just ignoring them? I don't know. But at some point, your caching strategy and your ability to respond to spikes in demand are your own problem. We live in reality, and in reality you can't really control who is hitting your website when. Sure, it would be nice if they respected your cache-control policies, but what are you doing when you know they won't? I'll refer back to my previous statement: it's totally understandable for someone who never expected HN levels of attention to be overwhelmed. But if you're a site as large and respected as iFixit, you've got to do better than the "unwashed masses".
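For what it's worth, "appropriate cache headers" doesn't have to mean much more than an ETag plus a Cache-Control a CDN can act on. Rough sketch in Flask; the route and renderer are made up, not what iFixit actually runs:

```python
import hashlib
from flask import Flask, make_response, request

app = Flask(__name__)

# Placeholder for whatever actually renders a guide page.
def render_guide(guide_id):
    return f"<html><body>Guide {guide_id}</body></html>"

@app.route("/guide/<int:guide_id>")
def guide(guide_id):
    body = render_guide(guide_id)
    resp = make_response(body)
    # Content-derived validator so repeat fetches can be answered with 304.
    resp.set_etag(hashlib.sha256(body.encode()).hexdigest())
    # Let shared caches (a CDN, for example) keep the page for a day.
    resp.headers["Cache-Control"] = "public, max-age=86400"
    # Turns this into a bodyless 304 when the request's If-None-Match
    # matches the ETag we just set.
    return resp.make_conditional(request)
```

None of that helps against a scraper that ignores validators entirely, but it does mean a CDN or any well-behaved client only costs you headers on a repeat visit.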