I run a system at my employer that occasionally gets scraped by malicious users. It can be used to infer whether a specific domain is available for purchase, which makes it a moderately interesting API endpoint, since answering that requires talking to domain registries. For a while, nobody cared enough about it to abuse the endpoint. But then we started getting about 40 QPS of traffic. We normally get less than 1.
I was keeping an eye on it, because we are hard-capped at 100 QPS to our provider; beyond that they start dropping our traffic (and it is an outside provider, bundling domain registries like Verisign and such), which breaks regular users if their traffic gets unlucky.
Anyway, after a week of 40 QPS, they start spiking to 200+, and we pull the plug on the whole thing: now each request to our endpoint requires a reCAPTCHA token. This is not great (more friction for legit users = more churn) but it is successful. If they had only kept their QPS low, nobody would have cared. I wanted to send some kind of response code like, "nearing quota".
FTR before people ask: it was quite difficult to stop this particular attack, since it worked like a DDOS, smeared across a _large_ number of ipv4 and ipv6 requesters. 50 QPS just isn't enough quota to do stuff like reactively banning IP numbers if the attacker has millions of IPs available.
Maybe mCaptcha [0] is worth a look. It applies a Proof-of-Work-like algorithm (not blockchain-related) which makes it very expensive for scrapers to get data in bulk, but poses the least friction to individual users. The project is implemented in Rust and received NGI.eu/NLnet funding. I don't know its state of production-readiness, but Codeberg.org is considering using it (a choice informed by higher respect for privacy and improved a11y compared to hCaptcha).
In my opinion, PoW is the only reliable way to avoid DDoS attacks. Scraping too much, too quickly is a light form of DDoS, even if unintentional. PoW was proposed back in the 1990s (Hashcash and similar anti-abuse schemes) exactly for that purpose.
The genius of bitcoin (not BTC) is that it provides an organized and practical way for PoW to be used by everyone on the planet. Some people find it strange, because there is an imaginary token created out of pure nothing which represents the PoW. It would work fine without that imaginary token, i.e. bitcoin, using just dollars or euros, but it wouldn't be internet-native. The point is always to just send a minimal PoW with every HTTP or TCP/IP request.
I hate to be a pedant but I see it used wrong a lot. "DDoS" stands for distributed denial of service, specifically indicating the traffic is coming from many sources (a wide range of IPs), which is what makes it so hard to defend against. Someone scraping too fast would be performing an unintentional DoS, because they probably aren't scraping using a botnet (and if they were, they probably do in fact intend to attack).
And if they’re scraping using “serverless” compute? It’s distributed, but could be used without “meaning” to attack. That seems to be what happened to Archive (they said AWS, could be ECS or Lambda, idk).
FWIW this thread inspired me to implement Cloudflare Turnstile on one of my pages - highly recommend. As compared to reCAPTCHA your users are never wasting their life away clicking on traffic signs, and as compared to mCaptcha you don’t have to host the server yourself.
In practice they're often using botnets. It's what makes this such a difficult problem to deal with. If they weren't it would be trivial to rate-limit or block problematic IPs or subnets.
In most cases, they aren't legitimate scrapers. They have a complete disregard for the resources they use from the websites they scrape and use a botnet to hide their point of origin.
Pretty sure my second-grade teacher invented it. Writing "I will not talk while the teacher is talking" 100 times on the chalkboard is proof of work, and it definitely cut down on unwanted chatter.
Yes, DDoS attacks are spam over the HTTP protocol. Email spam is spam over the SMTP protocol. Overwhelming a server with too many download requests is not spam, but it has the same effect. Calling the police every ten minutes because the door sounds like someone is trying to break in is spam.
Would it be possible to use bitcoin PoW as a replacement for captchas? It would both block DDoS and create income for the host through client-side mining.
No. In fact, using such a well-known proof of work function would defeat the purpose of using PoW. Instead of having something that's cheap for occasional users but expensive for spammers, you'd have something that's cheap for people who own mining ASICs and expensive for occasional users.
Hardware specialization breaks the economics behind a CAPTCHA. To fight that you need to use a PoW that hasn't been ASIC'd yet, and be willing to change PoW functions at the drop of a hat. PoW functions that stress memory or cache are also helpful here, though you run the risk of browsers flagging you as a cryptominer (which is technically correct, even if economically wrong).
> cheap for people who own mining ASICs and expensive for occasional users.
Seems like it should be the same cost (barring the friction of having a wallet, etc.) for both sets of people since Bitcoin is just a commodity and the value is the same to everyone, miner or not.
For example, a miner should value some fraction of a BTC the same way anyone else does, since they can sell or buy it at the same price a normal user can. The fact that they can profitably mine BTC just means they have a profitable business on the side, it doesn’t mean they should prefer to pay for services (or emails) in BTC.
Miners have ASICs - specialized silicon that ONLY mines Bitcoin, but does so many times faster than a CPU or GPU can. The average low-demand user does not have such hardware. Furthermore, because Bitcoin increases difficulty to maintain a fixed block rate, ASICs have to be replaced with newer models every few years. So older ASICs are going to be cheaper, but they might still be worth buying if you're a spammer.
So in order to moderately inconvenience said spammer, you have to make each and every ordinary user wait hours mining a few satoshis' worth of hashes in order to be let in. This is the exact opposite of what you want.
Yes, sorry I wasn't expressing myself very clearly. I agree if the goal is to impose a uniform (per-unit) cost to users, you should avoid widely-optimized hash functions.
But even if you pick a novel function that doesn't have special-purpose hardware, spammers can still optimize their setups to lower the effective unit cost below what a legitimate user faces, by doing normal miner activities (picking hardware, scaling up, moving to where electricity is cheap, etc.).
Since your goal is to maximally discriminate between legitimate and spam use cases, you'd want spammers to face at least the same per-unit cost as legitimate users.
What's one way you can do that? Well, how about charging actual currency, whether Bitcoin or fiat? Money has the useful property of having the same nominal value for everyone, and not being amenable to further optimization.
In short, forcing users to actually run PoW themselves doesn't really make economic sense. Even if you're avoiding existing hash functions, it's mostly worse than just charging money because spammers have better ability to optimize against it.
And if charging money doesn't work, switching to local PoW is unlikely to be better.
It wouldn't create any income due to the electricity costs associated with mining, unless the users start buying ASICs. Also, the most straightforward way of implementing this would be to have the user spend the token they generated in a transaction to prove their work, which would nullify their income. Maybe there is some workaround by monitoring the blockchain and seeing if a certain user generated some tokens at a certain time, but would it be worth the extra energy required due to the use of bitcoin's PoW? I think not.
The problem to me seems that it needs to scale with the number of requests. So PoW effectively burns energy.
And since the real problem with captchas is who profits from them and who gets to track users, if the system being protected is an OCR service, it seems most reasonable to actually give OCR tasks to humans. IMHO this is far more sustainable and people would understand the value: there is nothing bad about improving ML in general. Maybe someone could even define a sensible PoW task for OCR, but I doubt it...
If it doesn’t involve my free labor training machine learning models like reCAPTCHA and hCAPTCHA and I can still visit sites from a non-Western IP, then it’s already an improvement.
Really interesting! How efficiently does a web browser compute the PoW? I'm concerned that a bot would use an efficient GPU implementation while real users would run an inefficient JS/webcrypto version.
I tried to implement PoW in the browser for the same concept. I think it's probably at least marginally useful, but practically you're limited by WebCrypto, which outperformed anything I could find in pure JS or WASM. The disadvantage of WebCrypto is that there's a limited set of algorithms you can use and also, understandably, calculating hashes is async, so if you want a lot of rounds, you'll spend a lot of time jumping in and out of the event loop. It's still probably a useful speed bump or price increase for expensive operations: what costs a few seconds on a typical phone or desktop is at least enough of a speed bump to keep attackers from performing it quickly (especially combined with other measures like throttling and progressively increasing PoW difficulty for an IP). Maybe WebGPU could change this, but I'm wary of relying on all users having a fast GPU enabled in their browser just for this use case.
Though thinking about it, I wonder if there is a hybrid: start with a difficulty that's just a few seconds for CPU/WebCrypto and ramp up quickly, but also support WebGPU where possible so that web users on abusive connections may still succeed? I am not sure though; I guess this depends on the feasibility of using WebGPU, etc.
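Roughly the shape of the loop I have in mind, as a sketch (the challenge format and difficulty scheme here are made up, not any particular library's protocol):

    // Hypothetical client-side proof-of-work: find a nonce such that
    // SHA-256(challenge + ":" + nonce) has at least `difficultyBits` leading zero bits.
    // Each hash goes through the async WebCrypto call, hence the event-loop overhead mentioned above.
    async function solveChallenge(challenge: string, difficultyBits: number): Promise<number> {
      const encoder = new TextEncoder();
      for (let nonce = 0; ; nonce++) {
        const data = encoder.encode(`${challenge}:${nonce}`);
        const digest = new Uint8Array(await crypto.subtle.digest("SHA-256", data));
        if (leadingZeroBits(digest) >= difficultyBits) return nonce;
      }
    }

    function leadingZeroBits(bytes: Uint8Array): number {
      let bits = 0;
      for (const b of bytes) {
        if (b === 0) { bits += 8; continue; }
        bits += Math.clz32(b) - 24; // zero bits inside the first non-zero byte
        break;
      }
      return bits;
    }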
Requiring every user to compute its own PoW is a terrible idea. It defeats the whole purpose. One person's expensive computation is another person's almost free computation.
I commented in the past about it:
"At first glance, yes, we can create intentionally expensive computations without relying on a blockchain, that would serve the same purpose. In reality we cannot. Special computer hardware (ASICs) could generate much cheaper PoW annotations than general purpose computers, and sell it to spammers. Blockchain economic incentives ensure that ASICs will be used by the miners first and foremost."
Looking across the comments, I think you're advocating for micro-payments or micro-proof-of-burn. This is better than having every user compute their own PoW, since they'd purchase a small amount of cryptocurrency from an ASIC farm, which should be cheaper/faster/more efficient than using their local hardware. I mostly agree with the concept, but I don't think any such API exists yet with enough adoption to consider this a real option any time soon.
Right now, if I was to ask my users to do a proof of burn for $0.0001 of BTC most of them would just close my site as they don't have any BTC. The process of setting up an account on an exchange, waiting several hours/days for KYC checks to clear, adding a credit card, buying BTC, sending it to a browser extension, and then trying the signup again is a *significant* initial hurdle. If we were in a world where I could assume all my users already owned BTC, that's a different story, but we haven't seen adoption of cryptocurrency anywhere near that level. I don't expect PoW schemes to drive that adoption either, so this seems like a poor solution today.
Do you happen to know if a hCaptcha/mCaptcha-like micro-proof-of-burn tool exists already? I'd be happy to be proven wrong.
That brings us back to options that exist today, which includes every user computing their own PoW. Looking at mCaptcha some more, it uses a SHA256 derivative so it's compute-hard and vulnerable to GPUs/ASICs. The author mentions some of those concerns in an earlier thread [1]. I wonder if a different proof-of-work algorithm would be better, like a memory-hard PoW, proof-of-space, and/or proof-of-wait. I'm skeptical of those too, unfortunately.
You are right on most of your points. Today no API for micro-proof-of-burn exists. Economic incentives, however, require a little patience, because they need to work their way through the system. 0.0001 BTC is hugely expensive. A blockchain with millionth-of-a-cent transactions is certainly possible, and it will exist in roughly 2-3 years.
By my calculations, at a millionth of a cent per transaction, even paying for torrent blocks (64 KB), not torrent pieces (16 KB), will soon become profitable.
Micropayments are 'profitable' save for anti-money-laundering laws (not to speak of anti-terrorism laws), which give any such payment significant compliance costs. The only way to bypass that would be centralized verification agencies that take on liability (and demonetize anyone deemed controversial). Since they're centralized, they'll have zero use for a blockchain (good riddance). Does anyone want that future?
Security is not all or nothing.
There are many applications where adding a small bit of friction in the form of compute will stop 99.9% of abusive traffic.
Visual captchas are a plague on the internet, but so is Blockchain mania.
I added FriendlyCaptcha to some of my sites, and stopped 100% of abusive traffic. Open source, user friendly, accessible to people with disabilities.
Thanks for the clarification. Indeed it is not open source in the truest sense, but I don't consider source-available to be closed-source. Especially when the main limitation seems to be a no reselling clause. This would not prevent most people from deploying to their SaaS for example. Even having the code available to audit is a plus. Again, perfection is the enemy of good.
I've found that black and white thinking in security is very dangerous, as you often end up with very "secure" controls that have terrible UX, which users bypass completely via byob etc... And pwnage ensues. UX is a primary pillar of security.
Nice solution, if that works for your site, perfect.
If I may add: in case someone desires a little bit of revenue from a website, one very popular solution is to put advertisements in some places. Well, some people consider that a security hole, including me. So I guess the definition of security varies, but the security mania has gone on for many decades. I am one of those security maniacs, and any tool to enhance security is important; blockchain is one of them.
Ethereum was pretty successfully ASIC-resistant. Other PoW algorithms are GPU-resistant. Even so, I doubt most spammers are sophisticated enough to create ASICs just to make cheaper requests to this guy's service.
Specialized ASICs are not easy to come up with; what's more, in a tug of war you can expect those to suddenly become obsolete, rendering this approach too costly for a scraper. Also, you may have missed the part in the original link where they were using AWS to do the scraping, so no ASICs there, and you pay by the minute: two things that would make PoW a valid countermeasure, with no "free computation" in that setup.
As far as I know, BTC and Ethereum actively fight ASICs, because they diminish the profits of amateur miners and bring great centralization to the network. Which is true, and it is no problem.
The difference between this use case and bitcoin is algorithm mobility. All bitcoin miners must use the same algorithm and get the same result for it to work. This makes it very hard (read slow) to change the algorithm, so there's time and incentive to design, fabricate and deploy ASICs. For client-server PoW the server can choose a new algorithm whenever the old algorithm becomes ineffective.
> Blockchain economic incentives ensure that ASICs will be used by the miners first and foremost.
So instead, what, you want people to buy BTC and send small amounts of it to websites?
Problem is, given varying income levels across society and across the world, one person's expensive micropayment is another person's almost free micropayment.
In a comment down below, I made it clear I was not referring to BTC. On the internet there is no centralized registry of project names. There are currently three projects whose communities call themselves bitcoin. Either of the other two bitcoins supports much smaller microtransactions than BTC.
Very true that micropayments could vary in their relative cheapness across the global population. Let's put a number on it: is 0.01 cents affordable for most people on the planet, for every HTTP request?
I am talking about Bitcoin BSV. If the purpose of blockchain were solely to provide PoW, then there could exist a lot of blockchains, a thousand maybe. Economic incentives ensure that the most inefficient blockchains will be pushed out of the market. So some of them, like BTC, will soon be put out of the market, because speed and microtransactions are the two factors every blockchain competes on.
I answered in the other comment, but at that point, if there is a blockchain which supports millionth-of-a-cent transactions, then exchanges (Binance, Coinbase, etc.) are not so useful. If every person just needs one cent for a million HTTP requests, then one guy in your neighbourhood, your town, or your city might have some of it; you message him and he will send you a cent for free. You buy him a coffee, and he will give you billions of HTTP requests in return.
But again, economic incentives need to work their way into the system.
Yikes, that sounds terrible. The current best alternative is a visual captcha. I'm not driving around town to buy coffee for some random person just to be able to use the internet. That's not the sort of internet I want.
I am not suggesting that you should do that. As soon as we reach the sweet spot of transaction fees, between a thousandth of a cent and a millionth of a cent, every person will have a wallet with one dollar inside it, and will send a tiny amount to any site he visits where he desires not to be considered a spammer.
In case you run out of proof-of-burn, a friend of yours might send you just a cent, and you are good to go for another hundred thousand HTTP requests. There is no need for a normal person to hold more than a handful of dollars for a year's internet use.
That renders exchanges almost useless. Not totally useless, but much less relevant than they are today.
I personally couldn't care less whether there is one bitcoin/blockchain which reaches that sweet spot, or a hundred of them, including Litecoin, Ripple, etc. PoW is meant to be used for practical reasons. Blockchain, however, constitutes an economic system of suppliers (miners) and consumers (users); it is more than just a software program. It will evolve in the future, and it requires some crucial time.
For the moment there is no API which provides this kind of PoW service in a useful form, and some hacks might be required to mitigate the side effects of relentless scraping. These are just hacks, useful today, but the real solution is coming soon. Web3 some people call it; Cryptocosm is another name for it.
Unlike a blockchain, where the PoW algorithm is part of basic compatibility, in mCaptcha it isn't at all: the users see a checkbox. The rest is implementation details.
If someone creates an ASIC, you can break it very easily by changing the PoW algorithm a bit, and no users are affected.
Even if someone has the money to keep up with you, which is remarkably expensive, they will be too slow right now.
>One's person expensive computation is another person's almost free computation
We have pretty good guides to what's an expensive computation for everyone. That's how password-hashing algorithms work, and they do it much better than Bitcoin does. Besides, the existence of specialized methods of compute is hardly determinative. We're not trying to stop TLAs. It just needs to be expensive enough to stop 99%, and at most we'll later update again.
I got interested in mCaptcha, followed your link, but couldn’t find anywhere an example of what the end user would deal with. What kind of PoW are we talking about?
> It applies a Proof-of-Work like algorithm (not blockchain-related)
If you're going to waste my energy to do proof-of-work anyway, I'd rather you use it for something useful (even mining crypto-currency to pay for server costs) rather than let it go to waste.
How does it work? Perhaps it's proprietary and the authors don't want to disclose it, but I did not see anything referenced on the site. I wouldn't plug it into mine without knowing.
I disagree. To "solve" their CAPTCHAs I had to register by providing a working email address. I don't encounter HCAPTCHA problems that often, so when I need to solve a new one usually my cookie has expired and I have to reopen the link to get a new one before being able to continue. I just store such links in my password manager, but imagine you have to find an email with the link they sent you over a year ago before being able to just continue on with what you were doing. And even then, depending on ad blocking/privacy settings their cookie may not even work.
I think this whole thing is a big hurdle just because I'm unable to solve visual puzzles. Besides, having a company collecting email addresses of people who are disabled in one way or another and giving them an identifying cookie is a privacy/data disaster waiting to happen.
That being said, I think the audio alternatives for visual CAPTCHAs are also unacceptable. Even if you can hear them, they may be hard to solve especially if they are not provided in your mother tongue. I think we can and should be able to do better by now
I had a similar issue for one of our clients.
My strategy to not affect legit users was to enable mitigations if global traffic crossed some threshold.
E.g. in your case this could mean that if traffic is above, say, 75 QPS then the captcha is enabled, and if it's below that it's disabled.
I don't know what tech stack you are using, but a nice trick I figured out was to abuse rate limiting to detect global traffic (an if branch on a rate limiter keyed with a constant instead of the client id).
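Something like this, as a sketch (the counter interface and the 75 QPS threshold are just illustrative):

    // Hypothetical: reuse a per-client rate counter as a global traffic detector
    // by keying it on a constant instead of the client id.
    interface RateCounter { hit(key: string): number; } // assumed to return the current QPS for that key

    const GLOBAL_KEY = "__global__";
    const CAPTCHA_THRESHOLD_QPS = 75;

    function needsCaptcha(counter: RateCounter): boolean {
      const globalQps = counter.hit(GLOBAL_KEY); // every request increments the same bucket
      return globalQps > CAPTCHA_THRESHOLD_QPS;  // above the threshold: require a captcha token
    }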
That is more or less what I landed on myself. (Not quite, but similar reactive configs based on traffic thresholds).
For a while, we had to just set off pagers when global traffic exceeded a threshold and manually toggle the extra hardening, but eventually it became a lot more reactive.
Temporarily add 500ms of latency to all ipv6 users, backoff timers for ipv4 addresses. Since there's only 4 billion v4 addresses, it's easy enough to just track them all in a sqlite db.
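As a rough sketch of that bookkeeping (not my actual code; this assumes Node with better-sqlite3, and the schema and backoff numbers are made up):

    import Database from "better-sqlite3";

    // Hypothetical per-IPv4 backoff tracker: count strikes per address and
    // derive an exponential backoff from them. No decay/cleanup shown here.
    const db = new Database("ip-backoff.db");
    db.exec(`CREATE TABLE IF NOT EXISTS hits (
      ip TEXT PRIMARY KEY, strikes INTEGER NOT NULL, last_seen INTEGER NOT NULL)`);

    const upsert = db.prepare(`
      INSERT INTO hits (ip, strikes, last_seen) VALUES (?, 1, ?)
      ON CONFLICT(ip) DO UPDATE SET strikes = strikes + 1, last_seen = excluded.last_seen`);
    const lookup = db.prepare(`SELECT strikes FROM hits WHERE ip = ?`);

    // Returns how many milliseconds this IP should wait before being served.
    function backoffMs(ip: string): number {
      upsert.run(ip, Date.now());
      const row = lookup.get(ip) as { strikes: number };
      return Math.min(60_000, 500 * 2 ** Math.min(row.strikes - 1, 7)); // 500ms, 1s, 2s, ... capped at 60s
    }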
Mostly works, but there do exist a couple of botnets that contain 1 million compromised machines. If each makes one request before hitting backoff, spread evenly throughout the day, that's about 10 QPS alone before they use an IP number twice. But they tend to not actually level out their usage (which is a bummer - if they did they could have kept using it). Instead they hit with a lot of parallel queries all at once.
There's only so much you can really do when your underlying resource is so limited. Luckily the value of the query is lower than the cost of a recaptcha solve, so the attackers moved on to some other target.
Ironically I could now turn off the endpoint protection (or have it responsive to traffic load), until the attackers return. I shall not go into too many details, no need to give people a map.
How are they getting millions of ipv4 addresses? IIUC that’s at least the equivalent of a /12 block. Do those shady residential proxies really operate at that scale?
If they're IPv6 addresses, wouldn't it be safe to block them across large ranges?
It is trivially easy to get millions of IPv6’s, even spread out across a thousand ranges.
It is also trivially "easy" to get past reCAPTCHA, but it costs more. My guess is that a domain name checker tool isn't worth the cost per request to bypass reCAPTCHA (approximately 0.02 cents per session).
So.. make the requests serial. Just dump all the requests coming in over IPv6 into a queue, service that queue at a rate that's higher than the non-bot traffic requires but still low enough to be a problem for bots that can't self-limit to a reasonable rate. And of course, manage that queue intelligently so that you start dropping requests before you run out of RAM.
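As a sketch (the rate, capacity, and names are all made up):

    // Hypothetical: serialize IPv6 requests through a bounded queue drained at a
    // fixed rate, and shed load before memory runs out.
    const MAX_QUEUE = 1000;
    const SERVICE_RATE_QPS = 5;
    const queue: Array<() => void> = [];           // each entry resumes one parked request

    function enqueueV6(resume: () => void): boolean {
      if (queue.length >= MAX_QUEUE) return false; // queue full: drop the request
      queue.push(resume);
      return true;
    }

    setInterval(() => {
      const next = queue.shift();
      if (next) next();                            // let one queued request proceed
    }, 1000 / SERVICE_RATE_QPS);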
That just sounds like DDoSing yourself? It costs bots almost nothing to wait in the queue, and exactly zero real IPv6 clients will wait 10+ seconds to complete the request.
If somebody wants to access the endpoint, you might send them a challenge first: random text. The client must append to the text some other text of their choosing, so that when you calculate SHA-256 on the concatenated text, the first byte or two of it will be zeros.
To access your actual endpoint, the client needs to send that generated text, and you can check whether it results in the required number of zeros. You might demand more zeros in times of heavier load: each additional bit that you require to be zero doubles the expected number of random attempts needed to find the text.
To make things easier for yourself, the challenge text, instead of being random, might be a hash of the client's request parameters and some secret salt. Then you don't have to remember the text or any context at all between client requests: you can regenerate it when the client sends the second request with the answer.
Honestly, I don't know why this isn't a standard option in frameworks for building public-facing APIs that are expected to be used reasonably.
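As a sketch of the stateless variant, using Node's crypto module (the salt, the two-zero-byte difficulty, and the names are illustrative):

    import { createHash, createHmac } from "node:crypto";

    const SECRET_SALT = "replace-me"; // server-side secret (placeholder)

    // Stateless challenge: derived from the request parameters plus a secret,
    // so nothing needs to be remembered between the two requests.
    function makeChallenge(requestParams: string): string {
      return createHmac("sha256", SECRET_SALT).update(requestParams).digest("hex");
    }

    // Verify: SHA-256(challenge + solution) must start with two zero bytes.
    // Under heavier load you could require more zero bits instead.
    function verifySolution(requestParams: string, solution: string): boolean {
      const challenge = makeChallenge(requestParams); // regenerated, not stored
      const digest = createHash("sha256").update(challenge + solution).digest();
      return digest[0] === 0 && digest[1] === 0;
    }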
Because it doesn’t accomplish anything. Things take longer for honest users while botnet abusers don’t even notice that the rented hardware is burning more CPU. Nor does it matter because each request still goes through.
For honest users it's not noticeable in normal conditions. But the DDoSer will DDoS themselves. They might make your service slow down for users while they attack, not because your server gets swamped, but because you choose to increase the difficulty of the challenge while you are under attack. DDoSers will notice, because they won't be able to make requests as fast; at least not the ones that are costly for you. It's enforced client-side throttling that affects even bad actors.
Generating challenges is really cheap: calculating a single SHA-256 or similar out of a querystring+salt (or better yet, 128-bit SipHash). Generation and validation can be done on a separate layer/server, so requests without valid PoW won't even register on your main system.
I think the part you're not addressing is the fact that DDOSers aren't using their own hardware and more importantly: their request still goes through at the end so what do they care if it took 2000ms longer?
I actually do like your puzzle challenge idea from an obfuscation standpoint of making an API less attractive for clients that aren't your own, though. A challenge system + an annoying format like Protobufs instead of JSON is too much work for a lot of abusers.
Biggest problem with DDOS though is that if the volume is even reaching your application server, you're probably hosed.
> their request still goes through at the end so what do they care if it took 2000ms longer?
It means that on this hardware they can make one request per two seconds, not thousands per second.
And it affects all of the hardware under the control of an attacker, so the attack becomes thousands of times less dangerous.
> Biggest problem with DDOS though is that if the volume is even reaching your application server, you're probably hosed.
That's why I'm suggesting that challenge generation and validation can be done on separate machines. So your application servers can be safe.
Obfuscation is a valid point too.
I thought about it for a bit and made some experiments. Now I think the challenge should be completely random, but the server needs to keep track of recently issued and solved challenges to prevent reuse of already calculated solutions. I think Bloom filters would be perfect for that, because a small percentage of false positives (a fresh solution occasionally treated as already used) doesn't matter.
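A sketch of what such a filter could look like (the hash derivation and sizes are made up; in practice you'd also rotate filters periodically so old entries age out):

    import { createHash } from "node:crypto";

    // Hypothetical Bloom filter for remembering recently seen PoW solutions.
    // A false positive only costs a legitimate client one extra challenge round.
    class SeenSolutions {
      private bits: Uint8Array;
      constructor(private sizeBits: number, private numHashes: number) {
        this.bits = new Uint8Array(Math.ceil(sizeBits / 8));
      }
      private positions(item: string): number[] {
        const out: number[] = [];
        for (let i = 0; i < this.numHashes; i++) {
          const h = createHash("sha256").update(`${i}:${item}`).digest();
          out.push(h.readUInt32BE(0) % this.sizeBits); // derive k bit positions
        }
        return out;
      }
      add(item: string): void {
        for (const p of this.positions(item)) this.bits[p >> 3] |= 1 << (p & 7);
      }
      has(item: string): boolean {
        return this.positions(item).every(p => (this.bits[p >> 3] & (1 << (p & 7))) !== 0);
      }
    }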
There's a demand, why not supply it, and make money while you are at it?
This reminds me of the RMT-driven botting problem in WoW (World of Warcraft). Instead of fighting the neverending game of cat and mouse against botters, Blizzard just decided to supply the long-reprimanded demand for in-game currency by creating the WoW token, and they make money while they're at it.
Hmmm, I'd have to think about this more to be sure. I'm not immediately positive of the answer.
If the registrant creates any DNS records at all, then the SOA will need to exist for the zone to be valid; but I don't recall whether the registrar is or isn't required to publish a zone for a registered domain that otherwise contains no records. (Also whether the registrar is required to inform the registry of authoritative nameservers for the domain at all times from the moment that the domain is registered, and then whether that information would lead to the synthesis of a SOA record.)
I guess I can either try this (if I could find a no-frills-enough registrar that doesn't create any records at all for add-on hosting services or domain parking) or try to take some more ICANN coursework to find out the answer.
Or maybe someone else reading this thread knows whether we can have a domain in practice that is registered but has no published SOA.
The endpoint was invoked in our signup funnel, so there was a bootstrapping problem for quota enforcement: the attackers weren't completing a whole signup, just getting to the point where the domain search ran.
Understandable. I've seen services that offer residential IP address proxies for as low as $1/GB. FWIW the particular service in mind actually pays the IP owners who opt into it.
I guess your tool is asking the registries if domains are registered.
It's always interesting to see these prisoner's dilemma / tragedy-of-the-commons dynamics show up in networking. If they hadn't abused the commons, it would have worked out better for everyone.
See sibling, but the endpoint was part of a signup funnel, so short of rearchitecting it completely to put that check after customer creation, there's no real persistent key to rate limit on. Any one IP ended up getting rate limited to 5 requests per hour on that API, but the attack was incoming from what looked like a botnet, so it was tricky.
Be careful with that. A friend once did something similar to an active scraper who then went off and set his site up for a full blown DDoS.
He knew that was the reason for the intentional DDoS due to messages (along the lines of “think it is funny to poison my information do you?”) in the query string of the bulk requests. Like unpleasant fools making a big noise in shops because they aren't served immediately after cutting in line or some such, the entitled can be quite petty when actively taken to task for the inconvenience they cause.
Passive defences are safer in that regard, assuming you don't get into an arms race with the scrapers, though are unfortunately more likely to mildly inconvenience your good users.
That's fair. Perhaps a middle-ground of poisoned answers that look like overload related errors.
Though I did have success once stopping hot-linked images by serving up images that didn't fit well with the sensibilities of the internet forum that was hotlinking them. It had the advantage of looking like the people in the forum posting had deliberately chosen to post the image. Serving different images depending on client ip made for fun too, as they argued with each other about why they posted "that".
Most of the time, the new CAPTCHAs do that and then never show an actual interactive element. The interactive gimmicks are a fallback in cases of high uncertainty.
The webmaster doesn’t need to worry about it, the anti-bot services handle who gets what difficulty of challenge. But the webmaster can specify whether they’d like to be more or less strict/difficult than usual.
Yeah, "recaptcha" in this case is shorthand for a lot of stuff we did to harden the endpoint that ultimately represents a minor shift in balance from no friction to some friction for otherwise legitimate users, but a pretty significant drop in illegitimate traffic.
The main idea here is that at some point just leaving the scrape running in a way that didn't overwhelm my backend would have resulted in me not caring enough to do that. But now they get nothing. Even if you're borrowing bandwidth and not paying, you should be a good neighbor is all.
I just wonder, is archive.org getting any government grant money? If they aren't they should. And I'm not even talking about just the US. All sorts of countries (US, UK, Germany - to name the few) and international organisations like EU pour hundreds of millions into "cultural projects" of very questionable value.
How about they actually fund something really worthy of preservation? Of course it is archive.org's role to reach out first. For example, the EU could fund an archive.org mirror in the EU (with certain throughput requirements, etc.).
Of course, opponents of public/government funding have a very good point in that many organisations, when they get public money, find a way to burn through all of it far less efficiently. This can be mitigated by attaching concrete conditions to the grants. One example is a mirror in a specific location.
The Internet Archive has many smaller donors, too. I use the Archive frequently for a variety of purposes and derive immense value from it, so I donate to it every year.
Public money would also mean regulation and I doubt the IA wants to be regulated. They already see themselves apart from things like web standards given their approach to robots.txt [1]
There are various underfunded digital archives in the EU and elsewhere that have legal duties to preserve online content relating to their specific countries.
They may be "of very questionable value" to you, but your solution of removing funding from them to channel it to a much wealthier organisation in a wealthier country is neither ethical, legal, nor practical.
I think he or she meant some other cultural projects rather than digital archives.
There are indeed lots of cultural projects that get funding that I consider not that important in comparison, but of course the people involved would think differently. (Opera, for example, is heavily subsidized.)
For real? I am not against subsidizing art and culture, but I am against selective subsidizing. For example, in Germany there is a strong divide between "serious art" like opera and classical music, which gets lots of money directly or indirectly, and trivial art, i.e. everyone else, getting almost nothing. So it boils down to taste and the favourite culture of the establishment. But there is so much other good music and so many performers besides the mainstream out there, who get categorized as "entertainment" and have to struggle on their own.
So back on topic, I would be fine with taking money from opera to give it to internet archives. But of course, I rather would have more money for everyone involved in arts and culture.
Not at all. The suggestion was to shift countries government funding from their own projects to a foreign, better funded initiative on the implied assumption they don't know what they are doing.
The suggestion was well-motivated but presumably guided by a lack of understanding of the existing digital landscape of international publicly funded projects and their obligations and constraints.
Government money would risk making them dependent on it, which would reduce their ability to be self sustainable and independent, which risks becoming political. Best avoided if possible.
Unpopular opinion: Severely rate limit retrieving the files from the website / HTTP endpoint, and loudly point towards downloading the files via torrents.
The torrent protocol was designed to relieve exactly this kind of server load.
Torrents have the habit of disappearing when no users keep them alive. It has happened to me enough times to be wary of such a solution. If there's a way to keep them alive regardless of interest, I'm all for it.
How is that worse than serving files via HTTP? With HTTP archive.org is the only peer that can serve files to you, with BitTorrent it will be one of the (hopefully) many peers, and will degrade to the same level of redundancy as HTTP if all other peers leave.
BitTorrent also supports web seeds and they don't even really have to keep a full client running, just embed an HTTP link into the .torrent file.
Have you tried an Archive.org torrent yet? It's backed by the servers, but has the advantages of selecting which parts of the archive you want and being verifiable and being able to have more bandwidth on popular archives. The Archive.org servers show up on the "HTTP sources" tab next to the "Peers" tab for me.
The torrents the archive provides are backed by web seeds served by the internet archive web servers.
This approach allows the internet archive to effectively rate limit hosts without making content unavailable entirely, and allows others to help carry the load.
Thanks everyone for your reply. It made me reconsider my opinion. Though the downvotes are unnecessary. Use them when they're actually necessary. Not as gate keeping or as a "don't agree" button or whatever frivolous thing is passing through your mind.
This is what "proof of storage" or "proof of data availability" blockchain networks are for. They use economic incentives to continuously pay nodes a small amount to store some data and keep it available, and the cryptographic sampling mechanism ensures that less popular data must remain in the available dataset for nodes to be paid, even if it is rarely requested in full.
If you are willing to spend money, there is no problem keeping content up with existing technology. Like always, blockchain solutions miss the problem entirely.
It would be nice not to shoot the messenger. The problem described by maksumur is a real problem people encounter often - torrents for less popular large files that people only occasionally want, but do want or need, have a habit of disappearing, or being non-functional when you do find a seed. Popularity plays too strong a role.
As far as I know the proof-of-data-availability folks (whether using a blockchain or not) are the only folks trying to solve this problem with a serious technical solution at a cost scale lower than "rent your own server and have deep pockets for bandwidth" at the moment.
It's not about "willing to spend money", as if there is only one threshold to pass.
Torrents are used largely because they diffuse the cost of hosting, so that people (and organisations) don't need deep pockets to distribute large files to many others. There is an inevitable infrastructure cost but it's spread out more fairly among users.
The proof-of-data-availability stuff is similar, but takes it further so that a wider range of data stays publically available to whoever wants to download it, whenever they want, than it otherwise would be. It is really just a more fancy version of torrenting that is less prone to the excessive per-file popularity fluctuations that torrents suffer from. The objective is to lower costs compared with the current state of the internet, not add more. And it does not specifically require a blockchain.
If you know of something else tackling the problem, I'd love to hear about it. Torrent seed sites don't qualify, as they don't solve the problem: most files aren't available on them either, and you have to pay for the more obscure content they do have.
No, that still misses the mark entirely. We are in a thread about archive.org's server being hit too hard, someone proposed they should offer distributed downloads via torrent, and someone commented asking for "a way to keep them alive regardless of interest".
You are proposing a system based on financial rewards for hosting. Who pays those rewards, for those files in which there is no interest? If archive.org is to pay for it, we are back to square one. They are already very good at hosting content, within the limits of the resources they have. If people with an interest in downloading the files pay for it, no go, files with no interest go away. If you propose that people currently abusing the free service of archive.org, to the point of bringing it down, would pay a fee per download, you must be joking.
> You are proposing a system based on financial rewards for hosting.
No, I'm not. You are incorrectly assuming that blockchains are necessarily financial or that cryptoeconomic incentive structure involves net pay to someone.
> If you propose that people currently abusing the free service of archive.org, to the point of bringing it down, would pay a fee per download, you must be joking.
I'm not proposing that.
People "pay" for hosting by participating in some amount of upload to offset their download, in order to be granted higher download rates. That is the same principle as BitTorrent has used since its inception: Upload is measured and download is traded for upload to ensure users choose to upload for a while.
The difference is that information is split and diffused in a different way, which ensures that some amount of upload bandwidth and temporary storage is available for less popular data as long as there are people participating in the network, mostly while they are downloading something more popular and providing some upload in exchange. The network power law helps by ensuring the long tail of less popular content needs relatively little "extra" bandwidth, so it is not onerous on the users who, most of the time, are downloading and storing popular data.
No money needs to be involved.
Probably some financial things will emerge much like paid Torrent sites do at the moment for enhanced access which some people prefer. You can't prevent them, but they are not required for the network's operation nor required to access it.
Torrents aren't inherently illegal; the content downloaded is often a copyright violation, yes, but using the torrent protocol itself isn't illegal. Most Linux distributions provide a torrent option for downloading their OS image files.
Torrents are the backbone of a lot of academic research downloads.
They're the absolute best way to distribute neural network weights, or large models used in physical science studies. That or IPFS.
If you use huggingface or dockerhub for anything like this, seriously consider switching, as it will be extremely useful when those centralized systems inevitably fail.
I have a side project that scrapes thousands and thousands of pages of a single website.
So as not to piss them off (and so they don't try to block me), my script will take about 6 hours. Between each page fetch it sleeps for a small, random amount of time. It's been working like that for years.
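Not my actual script, but the pattern is roughly this (delays and names are made up):

    // Hypothetical polite scraper loop: a random delay between page fetches.
    async function scrapeAll(urls: string[]): Promise<void> {
      for (const url of urls) {
        const res = await fetch(url);
        console.log(url, res.status, (await res.text()).length); // hand off to your parser here
        const delayMs = 2000 + Math.random() * 8000;             // sleep 2-10 s between requests
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }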
Yes, this is the way. When I was purchasing a car last year I scraped a popular used car website for cars I liked to keep track of deals. I added a random sleep in between each page so the entire script took a few hours to run.
I built a system that scraped GitHub. Even though GitHub can clearly handle lots of traffic, I still rate limited the hell out of it. The only time I've ever seen scrapers get banned is when they go super fast. Unless it's LinkedIn or Instagram, who guard their product (aka their data) as much as possible.
Merely for your consideration, they actually do a great job of indicating in the response how many more requests per "window" the current authentication is allowed, and a header containing the epoch at which time the window will reset: https://docs.github.com/en/rest/overview/resources-in-the-re...
I would suspect, all things being equal, that politely spaced requests are better than "as fast as computers can go" but I was just trying to point out that one need not bend oppressively over the other direction when the site is kind enough to tell the caller about the boundaries in a machine-readable format
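Something along these lines, reading the x-ratelimit-remaining / x-ratelimit-reset headers from those docs (the surrounding client code is just a sketch):

    // Hypothetical polite client: back off when the rate-limit window is exhausted.
    async function politeFetch(url: string, token: string): Promise<Response> {
      const res = await fetch(url, {
        headers: { Authorization: `Bearer ${token}`, "User-Agent": "my-scraper" },
      });
      const remaining = Number(res.headers.get("x-ratelimit-remaining") ?? "1");
      const resetEpoch = Number(res.headers.get("x-ratelimit-reset") ?? "0");
      if (remaining === 0) {
        const waitMs = Math.max(0, resetEpoch * 1000 - Date.now());
        await new Promise(resolve => setTimeout(resolve, waitMs)); // sleep until the window resets
      }
      return res;
    }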
Yes. I did the same. My side project takes 4-5 days for the whole routine because there are random waits between requests and only 2 active requests at any given moment.
I can't say how the website I'm scraping would respond if I just went full throttle. But it's also just a matter of courtesy not to make ten thousand requests per second anyway.
Funny story - at work we once had a huge spike in requests from a single IP. We all crowded around, thinking it was some malicious hacker from France. How exciting - we're now interesting enough to warrant a DoS! Turns out another team in the company was just pulling all our data into Algolia to improve search. They were clearly not very courteous!
So on the other end (building APIs) I certainly do pay attention to traffic and have Grafana alerts set up around it.
I would block it, if I was the maintainer. As the linked post mentions, automated requests can "ramp up" at any moment, risking server stability. By preemptively blocking automated parsing (on a resource which primary usage is individual requests, not mass ones) I would avoid future problems for myself. Let them contact us via support if they really need an exception.
In general I would rate limit by IP anything connected to the internet.
Kind of a snarky response. Obviously this has been seen in the wild. If you created an intrusion detection system to look for suspicious requests, I think one occurring over and over and at a regular interval would clearly be seen as malicious and not a genuine user.
Can you provide an actual example? I see it come up a lot in these conversations, but I'm really skeptical that anyone actually does this analysis. It seems like rate analysis (either requests or bandwidth) would achieve the same result in a far simpler manner, so I suspect that is what actually happens.
I recall back when I was less experienced I wrote a scraper for downloading recipes from Taste of Home and let Go firebomb them with as many goroutines as it could handle.
It's not just about "wishing"; (D)DoSing is a punishable offence in many jurisdictions, and considering the damage that taking down such a popular and essential website causes, I hope the IA will report them to law enforcement if it happens a third time.
Is there an open-source rate limiter that works well for sites large and small?
It just strikes me as surprising that sites are still dealing with problems like this in 2023.
The idea that a site is still manually having to identify a set of IP addresses and block them with human intervention seems absolutely archaic by this point.
And I don't think different sites have particularly different needs here... basic pattern-matching heuristics can identify IP addresses/blocks (plus things like HTTP headers) that are suddenly ramping up requests, and use CAPTCHAs to allow legitimate users through when one IP address is doing NAT for many users behind it. Really the main choice is just whether to block spidering as much as possible, always allow it as long as it's at a reasonable speed, or limit it to a standard whitelist of known orgs (Google, Bing, Facebook, Internet Archive, etc.).
It just strikes me as odd that when you follow a basic tutorial for installing Apache on Linux on Digital Ocean, rate limiting isn't part of it. It seems like it should be almost as basic as an HTTPS certificate by this point.
It isn't that hard to set up naive rate limiting per IP address. It's a few lines in haproxy, and there is documentation on how to do it. There are a couple of problems that make it more complicated, though. The first is that with NATs you can have a lot of users behind a single IP address, which can result in legitimate requests getting blocked. The second is that, while it can help against a DoS, it doesn't help that much against a DDoS, because the requests are coming from a large number of distinct IP addresses.
But it's precisely those complicated parts that seem like they should be solved by now. The solution to both of the problems is that when traffic gets unexpectedly high, either behind a NAT or globally, everybody gets a CAPTCHA, which gives you a token, and then you continue to rate-limit each token. (If users log in, then each login already does this; no CAPTCHA needed.) This strategy is basically what e.g. Cloudflare does with its CAPTCHAs.
Obviously if someone is attempting a large-scale DDoS your servers can't handle it and you'll be using CloudFlare for its scale. But otherwise, for basic protection against greedy spiders who are even trying to evade detection across a ton of VPN/cloud IP's, this strategy works fine. It's exactly the kind of thing that I would expect any large website to implement.
If there isn't an open-source tool that does this, I wonder why not. Or if there is, I wonder why IA isn't using something like it. But heck, IA wasn't even using a simple version -- it was just 64 IP addresses where basic rate-limiting would have worked fine.
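As a sketch of the kind of per-key limiting being discussed, keyed by IP or by a CAPTCHA-issued token (in-memory only; a real setup needs shared state and eviction of idle keys):

    // Hypothetical token-bucket rate limiter.
    type Bucket = { tokens: number; last: number };

    class RateLimiter {
      private buckets = new Map<string, Bucket>();
      constructor(private ratePerSec: number, private burst: number) {}

      allow(key: string): boolean {
        const now = Date.now();
        const b = this.buckets.get(key) ?? { tokens: this.burst, last: now };
        // Refill proportionally to elapsed time, capped at the burst size.
        b.tokens = Math.min(this.burst, b.tokens + ((now - b.last) / 1000) * this.ratePerSec);
        b.last = now;
        this.buckets.set(key, b);
        if (b.tokens < 1) return false; // over quota: reject, or serve a CAPTCHA
        b.tokens -= 1;
        return true;
      }
    }

    // e.g. new RateLimiter(5, 20).allow(clientIp) on every request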
Indeed. Tens of thousands of requests per second from just 64 hosts? So they allow individual hosts to make hundreds of requests per second, sustained? That sounds crazy. Even for a burst limit hundreds per second would be extremely high.
Archive.org is a core utility for the web to the point where Wikipedia and many other sites would collapse without it in the sense that many if not most of their outbound links would be dead forever. I’m pretty sure it would even impact the US justice system [1].
Obviously judges aren’t going to have to worry about reasonable rate limits but if these DDoSes are rare, I’d much rather they dealt with them on a case by case basis. Without some complex dynamic rate limit that scales based on available compute and demand, rate limiting would be a blunt solution that will necessarily generate false positives.
Rate limiting is (usually) done on a per-host basis. The only users who would be adversely affected are those who perform hundreds of requests per second from a single system.
Virtually every major website has per-host inbound request limits. This is completely standard practice and Archive.org is the odd one here.
And DDoS isn't the only concern. Legitimate users that run poorly written Python scripts that make insane numbers of requests can hog server resources, and rate limits with appropriate error messages linking to resources documenting efficient access patterns can improve the experience for everyone, and drastically cut costs for the service operators.
It‘s rather per public IP, such that e.g. behind a corporate proxy you may experience rate limiting even for regular use, just because you’re sharing a few IP addresses with a large number of users. Better hope that you get an increased quota for your external IPs in that case.
Just yesterday there was a comment here on HN [1] about https://jsonip.com, which is essentially supported by a single person (all operational costs included) and gets abused in a somewhat similar manner. I am not even sure what to think: do the folks not understand what they do, or are they just bluntly ignorant of it?
Yeah I'm getting this shit too with Marginalia Search. I'm getting about 2-2.5M queries per day that are definitely from bots that would 100% sink my server if they somehow go through the bot mitigation. It peaks at hundreds of search queries per second. To be clear these are search queries, and search queries typically trigger disk reads of about ~10-20 Mb.
I get about 20,000 queries per day that may be human.
Archive.org is a bit of a special case, you need to call them repeatedly to archive a website. They do have a rate limit there, it's pretty aggressive* to the point you could trip it by manually using the site. They must have forgotten to limit the OCR files download.
* If they had a better API (a simple non-synchronous API would be enough, one where we could send a list of URLs would be even better), one could have made a lot less calls.
Last time I wanted to bulk-archive a bunch of urls, I asked about it, and sent a txt file full of URLs to someone and they put it in the archival queue.
I believe you can upload WARC files to IA and ask them to index the content. That saves them the need to do the archiving and you won't be rate limited on their end.
I would have reached out to AWS techies. The services of archive.org are known and important enough for the AWS folks to chase down the sources and/or initiate some action against the perpetrators, malicious or not as they might be.
What an incredibly well-phrased notice from archive.org! Clear, no assumptions, no irony, no accusations, just reaching out, as they request from us.
Very few people have been more dedicated to the original spirit of service of the World Wide Web than Brewster Kahle. It's as though he's been personally monitoring his server logs for the past 30 years.
We've had various web-crawling businesses promoting their crawl-etiquette-busting technologies to approval and acclaim on HN; undermining of services for the public good like this is exactly the end result we are supporting when we do that.
OCRs of PD material? Is someone about to train a new AI model, perhaps? I'd think the Archive would be happy to arrange for receiving hard drives, filling them up, and sending them back. They seem like a very helpful bunch.
I'm waiting for Cloudflare to open source their web server as promised.
But failing that I'm likely to implement a web proxy that utilizes the wirefilter library that is already open sourced by Cloudflare (but no longer updated).
The tools to stop these attacks are reasonably trivial.
Most of the art of stopping them is observability.
When you can see every dimension of a TCP/UDP connection, the TLS handshake, the HTTP communication... Then the dimension by which an attack is being conducted is glaringly obvious.
Once you have the very obvious correlation, then you only need a blunt instrument of a tool that can block/deny/nullroute the attack.
What's hard isn't the block rules, What's hard is instrumenting enough observability to see the obvious correlation of an attack.
Even on my personal nginx a single request logs about 20 things. But there are literally hundreds of properties one can log if you have access to the entirety of the stack, and if you have access to log on it, then you have the ability to carry that context to a point in the code where you can block it.
Also, 10k qps really isn't a lot. So this should be treated as a warning sign, this attack was low volume even for amateur booter services.
Why are there always super obnoxious scrapers out there? It doesn't even make sense from the scrapers perspective because it raises the chance of being blocked and having annoying anti bot measures introduced. It especially makes no sense in this case because archive.org would be happy to give them the data for probably less than whatever they were going to spend on AWS. There's no manners anymore.
The fact that IA is so liberal with their scraping policies is laudable, because on the other end of the spectrum lies Wikimedia, where you're lucky if you manage to download their dumps at 500kbps, which makes their dumps pretty much impossible to obtain...
$ curl https://dumps.wikimedia.org/enwiki/20230520/enwiki-20230520-pages-articles-multistream.xml.bz2 -o/dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
2 20.4G 2 476M 0 0 4387k 0 1:21:38 0:01:51 1:19:47 4393k^C
$ curl https://wikidata.aerotechnet.com/enwiki/20230520/enwiki-20230520-pages-articles-multistream.xml.bz2 -o/dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 20.4G 0 45.9M 0 0 1953k 0 3:03:23 0:00:24 3:02:59 2257k^C
$ curl https://mirror.clarkson.edu/wikimedia/enwiki/20230520/enwiki-20230520-pages-articles-multistream.xml.bz2 -o/dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
1 20.4G 1 344M 0 0 35.2M 0 0:09:56 0:00:09 0:09:47 36.8M^C
maybe you should check your own internet connection first?
It depends. A CDN-like approach wouldn't really work because all requests are different and legit, so plain old caching isn't going to cut it.
On the other hand, Cloudflare has quite some experience with DDoS protection, so they definitely have the right infrastructure in place to stop this kind of abuse. What is the bill going to be, though?
It would definitely help, yes; I also don't know why they are not using it. And probably work on auto-scaling as well, in case the API requests are legit.
10,000 requests/second is nowadays an average number for load testing in the companies I work with, for websites / mobile apps deployed at this scale.
Huh, meanwhile, I recently wanted to grab ~200 pages from the archive where I put a 1-2 minute delay between the requests and had a bunch of repeated failures.
I wish there was a way to just get a whole archived site N links deep as a .zip file - I tried a few tools for this and none of them worked as well as going over it manually and recording a .har file with all the contents.
I recall tripping the rate limiter at least once by accident - I was clicking around the archived site for reference while working on creating a restored version.
There's no reason for Archive.org to allow any traffic from AWS, Azure or Google Cloud. Archive.org should just block the CIDRs for all those networks. Most traffic that comes from them is scraping traffic and usually malicious.
I recently fought and won against 429s... on a website run by the very company I work for. They use a vendor for certain services, and that vendor sells (terrible) API access for bulk operations, so I whipped out Selenium instead; which meant I, the site "admin", was being throttled with 429s. A few pauses here and there, and all was well.
I use the Internet Archive a lot, and I get rate limited after opening (manually!) a few dozen pages now. The message is clear: archive.org is meant for one-off lookups, if you're doing real historical research, you're not welcome. I thought creating an account and even donating money might help me, but nope.
It's a shame that AWS can be used to do this and they'll either do nothing or take their sweet time doing anything, and half the time they reply saying that the abuse is no longer happening from the reported IPs (because they've already moved to new IPs).
For actual spammers this won't do much. You can create a new account with a new email and presumably use one of those gift cards for payments (or use Apple wallet to regenerate card numbers).
What is the legal ground for archive.org to copy websites? Shouldn't copyright forbid that?
They don't even respect robots.txt, so content creators can't even opt out. Not that copyright requires copyright holders to opt out of copying in the first place.
Same way a library does. The information is made available to anyone who wants it, the copyrights are maintained intact, and they don’t attempt to profit from the material.
Copyright law specifically allows for libraries and archives to make copies of copyrighted material.
Without such laws, without libraries, knowledge could not be guaranteed to be shared freely among the public, resulting in ever growing knowledge and education gaps between those with means and those without.
Edit: Since you asked for the legal ground, here it is specifically:
I just wanted to add: while I don't agree with the assessment that copyright holders need to be protected from something like archive.org, I don't think the parent comment deserved to be flagged, so I vouched for it. I think the question was raised in sincerity, and I think it offers a point of discussion for those not familiar with the issues.
I don’t think it’s helpful to flag things people disagree with, as long as they don’t attempt to spread misinformation, or trolling etc. The parent phrased the topic as a question, meaning I believe they were open to understanding.
I also think it's relevant to the topic posted, as we're talking about either an attack on archive.org or a massive recopying of archive data by an unknown party.
Talking to people with opposing views is important. Let’s not just shut people down if we don’t agree with something, especially when someone is asking a question.
> i don’t think the parent comment deserved to be flagged, so I vouched for it.
Since archive.org has now posted an update that the problem scraper is evading countermeasures and bringing them down repeatedly, and has been identified as an "AI" company, it could end up being an existential risk: if companies start using archive.org as a large-scale commercial IP theft proxy, it will likely face even more legal challenges than it already does.
A question indeed that has to be answered before everything else. We can't criticize, e.g., search engines for "previewing" content on search result pages and for sending you to copycat sites with the most ads, while at the same time ignoring gross content copying. We also have these "helpful" insertions of links to archive.ph copies here whenever the original story is paywalled, or sometimes even just ad-ridden: links to copied content on archive.ph with third-party ads, which I find at least unethical.
That said, personally I do use Wayback machine about once a month for genuine search of historic content.