What is the solution to automation then? What do I do when someone hits my content-rich WordPress blog with a scraper that hits 100 pages a second to download my content, and my database falls over, leading to real, legitimate users being unable to use my site? What if it’s not a legitimate scraper but someone with hundreds of proxies uses them to DDoS my site for days? Should I sacrifice my uptime to protect the freedom of those unwilling to attest that they’re running on real hardware?
The method to stop a (D)DoS is the same as it always was: caching and rate limiting.
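For the rate-limiting half, a token bucket keyed on client IP goes a long way. Here's a minimal Python sketch, not tied to any particular framework -- the `client_ip` extraction, the 429 response, and the rate/burst numbers are all stand-ins you'd adapt to your own stack:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow `rate` requests per second per client, with short bursts up to `burst`."""

    def __init__(self, rate: float = 5.0, burst: float = 10.0):
        self.rate = rate
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)        # client -> tokens remaining
        self.last_seen = defaultdict(time.monotonic)    # client -> time of last refill

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen[client_ip]
        self.last_seen[client_ip] = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(self.burst, self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False   # over the limit: answer 429 instead of touching the backend

limiter = TokenBucket(rate=5, burst=10)
# in the request handler:
# if not limiter.allow(client_ip): return a 429
```

It's in-memory and single-process, so with multiple workers you'd back it with something shared (Redis or the like), but the shape is the same.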
Re: content scraping -- I was an indie web dev of a sort for a while and people always ask this question, and the answer is it's impossible to stop. Not even Facebook or big content sites like CNet or The Verge can stop it. At the bottom of it, you can just access the site in a browser and save the source. Content scraping is a rephrasing of "viewing content even just once". Stopping it is antithetical to the web and technologically infeasible.
it's probably actually cheaper to pay people piece rates to do it for you in a browser than to pay a developer to write and maintain a scraping script anyway, so if the later became genuinely impossible moving to the former isn't a big deal.
Put your WordPress blog behind a caching proxy with a 5s TTL - that way any amount of traffic to a URL will produce at most one hit every 5 seconds to your backend.
I've used this trick to survive surprise spikes of traffic in multiple projects for years.
Doesn't help for applications where your backend needs to be involved in serving every request, but WordPress blogs serving static content are a great example of something where that technique DOES work.
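To make the 5s-TTL trick concrete, here's a rough Python sketch of the same micro-caching idea. `render_page` is a stand-in for whatever actually runs WordPress/queries the database; in practice you'd let nginx, Varnish, or a CDN do this rather than application code:

```python
import time

CACHE_TTL = 5.0                               # seconds, matching the 5s TTL above
_cache: dict[str, tuple[float, str]] = {}     # url -> (fetched_at, rendered body)

def cached_fetch(url: str, render_page) -> str:
    """Serve a cached copy if it's younger than CACHE_TTL; otherwise hit the
    backend once and cache the result, so any number of requests per URL
    still costs at most one backend hit every 5 seconds."""
    now = time.monotonic()
    hit = _cache.get(url)
    if hit is not None and now - hit[0] < CACHE_TTL:
        return hit[1]                         # cache hit: the backend never sees it
    body = render_page(url)                   # cache miss: the one backend hit
    _cache[url] = (now, body)
    return body
```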
Proof-of-work schemes such as Hashcash[1] and simple ratelimiting algorithms can act as deterrents to spamming and scraping attacks.
There are other kinds of non-invasive bot management you can do as well; however, for various reasons I'm not in a position to talk about them. A few other methods are mentioned at the end of the post being discussed[2].
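For anyone curious what the proof-of-work side actually looks like, here's a toy Hashcash-style challenge/response in Python. SHA-256 and the 20-bit difficulty are arbitrary choices for the sketch (real Hashcash specifies its own stamp format), but the asymmetry is the point: verifying costs one hash, producing costs ~2^20 of them, which is nothing for one human and painful at 100 requests a second:

```python
import hashlib
import secrets

DIFFICULTY_BITS = 20   # ~1M hash attempts on average per request; tune to taste

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def issue_challenge() -> str:
    """Server: hand out a random challenge so work can't be precomputed or reused."""
    return secrets.token_hex(16)

def solve(challenge: str, difficulty: int = DIFFICULTY_BITS) -> int:
    """Client: burn CPU until some nonce clears the difficulty bar."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int = DIFFICULTY_BITS) -> bool:
    """Server: a single hash confirms the client paid the cost."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty
```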
Can privacy be preserved with zero knowledge proofs? I don't like the idea of universal fingerprinted devices in an already heavily authoritarian world.
Semantic quibble: it's less "proof of work" and more "proof of hardware+work". Or, as they call it, hardware-bound proof of work. The reason you can't offload the challenge to a more powerful device is that they rely on identifying stable differences for each device class that ultimately trace down to the hardware they're running on.
Wasn't mining in the browser basically shut down by every major browser?
It was done super fast... one can't help but think that Google pulled all the levers they had at Apple/Mozilla to make sure the first viable alternative to advertising was killed before it was born. But I think, as a side effect, it might make PoW sort of impossible?
I don't really know how mining "fingerprinting" works exactly - so I'd be curious to know if I'm wrong.
What killed "mining in the browser", more than anything else, was:
1) It was almost exclusively used for malicious purposes. Very few legitimate web sites used cryptominers, and it was never considered a viable substitute for display advertising; it was primarily deployed on hacked web sites. Browser vendors were relatively slow to react; many of the first movers were actually antivirus/antimalware vendors adding blocks on cryptominer scripts and domains.
2) The most popular cryptominer scripts, like Coinhive, all mined the Monero coin. (Most other cryptocurrencies were impractical to mine without hardware acceleration.) Monero prices were at an all-time high at the time; when Monero prices crashed in late 2018, the revenue from running cryptominer scripts dropped dramatically, making these scripts much less profitable to run. (This is ultimately what led Coinhive to shut down.)
I guess slow/fast is subjective. It didn't seem like enough time passed for a legitimate ecosystem to develop. Just the basic idea of, say, hosting a static site/blog on a VPS with a cryptominer that could pay for itself would have been a game changer - and that was probably just the tip of the iceberg of possibilities. Instead we’re still stuck either selling our traffic/info to Google/Microsoft, putting up ads, or paying for it out of pocket. The entrenched players won.
The hacked site boogieman felt overblown (and from what you're saying, it sounds like it would have died out anyway). I'm sure it happened, but at least personally I never once came across it. Or if I did, then my CPU spun a bit more and I didn't notice. No real harm done.
More fundamentally we're now in territory where the browser vendors get to decide what javascript is okay to run and which isn't.
Anyway, it's just complaining into the ether :) It is what it is. Thanks for the context on the market forces and antivirus companies.
> I guess slow/fast is subjective. It didn't seem like enough time passed for a legitimate ecosystem to develop.
Coinhive was live from 2017 - 2019, and it basically ran the whole course from exciting new tech to widely abused to dead over those two years. I don't think it needed more time.
> The hacked site boogieman felt overblown...
Troy Hunt acquired several of the Coinhive domains in 2021 -- two years after the service shut down -- and it was still getting hundreds of thousands of requests a day, mostly from compromised web sites and/or infected routers. It was a serious problem, albeit one which mostly affected smaller and poorly maintained web sites.
Make it someone else's problem; put a caching CDN in front of it, like Cloudflare, who have experience with these problems (like intentional or accidental DDoS).
I understand and agree with the suggestion of putting a CDN, but it's somewhat ironic to suggest the use of Cloudflare when that very same company is advocating for the DRM-for-webpages scheme.
Is it not fair to assume that Cloudflare, as a company who have made a name for themselves selling various DDoS protection services, realize they’re in an arms race with the old-school way of handling these problems and are pursuing more advanced solutions before the current techniques are entirely useless?
It would be easy to point to the irony of saying "instead of supporting Cloudflare's proposals for PATs, use their CDN product for brute force protection" but on the other hand, they employ a lot of experts in this space and might see the writing on the wall in an increasingly adversarial public internet.
This is a good question, but if you look at it closely, Cloudflare seems to be the only company advocating for attestation schemes for the web.
It’s almost as if the conspiracy theory of Cloudflare acting as an arm of the US government and helping in the centralization of the internet is actually true.
is there such a thing as a caching CDN that effectively protects against scrapers? generally if somebody is going to try and scrape a whole bunch of old, infrequently-accessed but dynamically generated pages, most of those won't be in the cache and so the caching proxy isn't going to help at all.
i'm honestly asking, not just trying to disprove you. this is a real problem i have right now. ideally i'd get all my thousands of old, never-updated but dynamically generated pages moved over to some static host, but that's work and if i could just put some proxy in front to solve this for me i'd be pretty happy. but afaik, nothing actually solves this.
Akamai has a scraper filter (I think it just rate limits scrapers out of the box but can be configured to block if you want).
I'm not sure how good it is at detecting what is a scraper and what isn't though.
Yeah, AWS has one of these, a set of firewall rules called "bot control". it seems to work well enough for blocking the well-behaved bots who request pages at a reasonable rate and self-identify with user-agent strings (which i'm not really concerned about blocking, but it does give me some nice graphs about their traffic). it doesn't seem to do a whole lot to block an unknown scraper hitting pages as fast as it can.
> What do I do when someone hits my content-rich WordPress blog with a scraper that hits 100 pages a second to download my content, and my database falls over
It's a blog. Blogs are not complex. Why is your blog's database so awfully designed that 100 pages a second causes it to fall over?
> leading to real, legitimate users being unable to use my site?
You assume that a scraper is not a legitimate user. I argue otherwise. If you don't want a scraper to use your site then put your site behind a paywall.
> What if it’s not a legitimate scraper but someone with hundreds of proxies uses them to DDoS my site for days?
If it's a network bandwidth problem, then a reverse proxy (e.g., a CDN) solves that.
> Should I sacrifice my uptime to protect the freedom of those unwilling to attest that they’re running on real hardware?
All software runs on real hardware. What is your exact question?
I am accessing this site in a virtual machine. I could be doing it with a headless browser. Why does that matter at all?