For example, with HAProxy (my favorite load-balancer and HTTP "router") this can easily be achieved by using `nbsrv`: creating an ACL and only routing requests to the backend based on that ACL. Based on the documentation below:
One can write this:
    acl site_alive nbsrv(dynamic) gt 2
    use_backend dynamic if site_alive
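For context, a slightly fuller sketch of how that might sit in a config, with made-up backend names and addresses (not from the original discussion):

    frontend www
        bind :80
        # only use the app servers while more than 2 of them are up
        acl site_alive nbsrv(dynamic) gt 2
        use_backend dynamic if site_alive
        default_backend sorry_page

    backend dynamic
        balance roundrobin
        server web1 10.0.0.11:80 check
        server web2 10.0.0.12:80 check
        server web3 10.0.0.13:80 check

    backend sorry_page
        # a static "we're over capacity" page instead of melting the app servers
        server sorry 10.0.0.20:80 check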
I actually found the original solution interesting, and I also proposed an alternative achievable with HAProxy as a load-balancer.
Wouldn't it make more sense simply to limit the number of connections per back-end server?
However, I assume that the customer (the one the original author was referring to) didn't want to apply any limitations. (Why is beyond my reasoning...)
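For what it's worth, that per-server limit is just `maxconn` on the server line in HAProxy; a sketch with made-up numbers:

    backend dynamic
        balance roundrobin
        # hold excess requests in HAProxy's queue for up to 5s instead of
        # piling them onto the app servers
        timeout queue 5s
        server web1 10.0.0.11:80 maxconn 100 check
        server web2 10.0.0.12:80 maxconn 100 check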
(Many of the recent (last ~5 years) AWS failures have been due to cascading failures and it's neat to read their postmortems).
If there’s some sort of natural limit to how many simultaneous connections they can handle, why can’t they just return some 4xx error code for connections beyond that? (And have clients implement an exponential back off?)
Or if that’s too difficult, the load balancer could keep track of some maximum number of connections (or even requests per second) each backend is capable of, and throttle (again with some 4xx error code) when the limit has been reached by all backends? This is pretty basic functionality for load balancers.
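On the client side, the back-off half of that idea is simple enough; a minimal Python sketch (the helper name and the choice of status codes to retry on are my own assumptions, not from the thread):

    import random
    import time
    import urllib.error
    import urllib.request

    def fetch_with_backoff(url, max_retries=5):
        """Retry on 'too busy' responses with exponential backoff plus jitter."""
        delay = 0.5
        for attempt in range(max_retries):
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    return resp.read()
            except urllib.error.HTTPError as err:
                # 429/503 mean "back off and try again later"; anything else is a real error
                if err.code not in (429, 503) or attempt == max_retries - 1:
                    raise
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids a thundering herd
            delay *= 2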
You’re going to need actual congestion control anyway, when the number of client connections is unbounded like this. Even when your servers aren’t crashing, what if the client apps whose clicks you’re tracking suddenly become more popular and you can’t handle the load even with all of your servers up?
Yup, it's pretty much a DDoS. It's simple to test. Spin up WordPress + WooCommerce (a common ecommerce stack) on a DigitalOcean 1 GB droplet.
Now run `ab -n 1000 -c 30` against the home page; that's 30 concurrent clients.
Watch MySQL die, Apache get killed because it's out of memory, and...
-bash: fork: Cannot allocate memory
Why is Apache continuing to fork new workers ad infinitum? That’s a denial of service attack waiting to happen, and the answer surely isn’t “oh let’s just automate the rebooting”...
Edit: you’ve edited your comment to say it’s a DoS as well, so looks like we’re on the same page here, the software is garbage.
I try to be slow to jump to that conclusion ... so often there are reasons for things to be the way they are. The software tools and the hardware capabilities were very, very different in the aughties, and many people currently writing software don't really seem to appreciate by just how much. It may be the case that things could have been done differently, perhaps better, but then again, once you know the full story, maybe not.
Some time ago I wrote up a war story from the mid-nineties and had some current software people crap all over it. When I started to explain about the machine limitations of the time the bluster increased, and among the replies I got was "The software was crap".
Ever since then I've been interested in the contexts for these stories. So often I hear "Well you shouldn't have done it like that!" rather than enquiring:
"OK, let's assume some clever people wrote this, I wonder what the pressures were that made them come out with that solution."
I've learned a lot by approaching things that way, instead of simply declaring: The software was garbage.
Pressure from everyone to get something shipped and out the door - never mind the technical debt - and after it ships, management is only interested in further growth fuelled by more features - deepening the technical debt.
As for the problem of Apache keeling over too easily - that’s because Apache (at the time - and I think still now) is based on “one thread per HTTP request” - or one thread per connection. Asynchronous (or rather: coroutine-based) request handling was very esoteric in the late-1990s and early-2000s - and required fundamentally different program architecture - no-one was going to rewrite their CGI programs to do that - and Perl script based applications couldn’t do anything at all. Asynchronous HTTP request handling only really became mainstream when NodeJS started getting popular about 7-8 years ago - but, to my knowledge, besides ASP.NET Core - no other web application platform treats asynchronous request handling as a first-class feature.
Nginx development started in 2002 and it was first publicly released in 2004. It was being used widely by something like 2008.
> no other web application platform treats asynchronous request handling as a first-class feature
Huh? Web application platforms that do this have been around since the late 1990s. (Heck, there was one written in Python in the late 1990s, it was called "Medusa".) They just weren't "popular" back then.
Nginx, OTOH, was popular before NodeJS even existed, let alone before NodeJS became mainstream.
Nginx does the async stuff properly I believe but there's still the nonsense of forking child processes.
Besides that, Python, Java and PHP have had mature async web stacks available for years as well. Is there any commonly used language that doesn’t?
Edit: or maybe a blog post to be repeatedly deployed in threads like this one.
Yeah, I agree in many cases that's the issue, as the author pointed out in the last paragraph.
Let me also add that on this particular stack this happens with caching enabled (W3 Total Cache). Without cache it struggles with 10 concurrent clients (TTFB > 5 seconds). 100 MB per client to serve what's pretty much a grid of products and a button; it's scary.
WordPress itself is not garbage (at least not for those reasons), but WordPress + a bunch of plugins hammered together is a different beast.
The problem is you have to manually tune the number of max workers a bit based on estimated resource usage.
For example, maybe in the average second you see 1 login request (CPU-bound password hashing), 10 page views (database connection bound), 100 image views (IO/client speed bound), and 0.01 login sessions being bounced between servers (???? bound). And your server is limited to 111 connections per second.
But when the site goes down? Users can't load images until they've seen pages, and can't see pages until they've logged in, so now you're processing 111 login attempts per second when your tuning assumed a fraction of that.
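With Apache's prefork MPM that tuning usually ends up as a hard cap on workers, so the box degrades to queueing instead of swapping itself to death; an illustrative fragment (the numbers are a guess for a small box, not a recommendation):

    <IfModule mpm_prefork_module>
        # cap forked workers so memory stays bounded; excess requests
        # wait in the listen backlog instead of spawning more processes
        StartServers             5
        MinSpareServers          5
        MaxSpareServers         10
        MaxRequestWorkers      150
        ServerLimit            150
    </IfModule>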
This was pretty much not easily possible (with suPHP or mod_php) until PHP-FPM, which has a per-site worker pool.
Solved a lot of problems for us as a shared hosting provider.
Restarting was a bear. We had to "warm up" the application server by hitting some of the web pages so caches would get populated etc., before opening the system to public traffic (which, as I recall, we controlled with IP address aliases) or else it would die again.
Finally we put the system behind an F5 load balancer, which resulted in a much more even distribution of traffic -- that, coupled with the "warm up" crawl, led to a greatly stabilized system and the highest ever (and growing) page views.
Start with a super low limit then slowly increase while watching reqs/sec. When reqs/sec stops increasing or you run out of ram then there’s your sweet spot.
Perhaps I'm missing something terribly obvious here, but why would that happen?
I can understand requests being dropped and processing-times worsening, but a full system-wide crash?
edit My bad, I'd missed this in the article:
> they could have rewritten their web site code so it didn't send the machines into a many-GB-deep swap fest. They could have done that without getting any hosting people involved. They didn't, and so now I have a story
As mentioned in other comments, a proper defense against this would be to have a front-line system that anticipates such a situation and actively manages it: puts outstanding requests in a queue (throttling the load) and then errors out on new requests if there are still too many of them coming in.
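As a toy illustration of the "bounded queue, then error out" idea, here's a minimal Python sketch (stdlib only; the capacity and timeout numbers are arbitrary assumptions):

    import threading
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    MAX_IN_FLIGHT = 50                      # capacity guess; tune per backend
    slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    class ThrottlingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # wait briefly for a free slot (a bounded queue), then shed load with a 503
            if not slots.acquire(timeout=2.0):
                self.send_response(503)
                self.send_header("Retry-After", "1")
                self.send_header("Content-Length", "0")
                self.end_headers()
                return
            try:
                body = b"hello\n"           # the real request handling would go here
                self.send_response(200)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            finally:
                slots.release()

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), ThrottlingHandler).serve_forever()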
I would assume that if you left it alone for an hour or so it might eventually unfuck itself, but for production purposes, that counts as dead, especially when you can’t even ssh into it because the memory allocations for your ssh session are also in that gigantic queue for disk bandwidth via swap.
edit I hadn't noticed this at the bottom of the article:
> they could have rewritten their web site code so it didn't send the machines into a many-GB-deep swap fest. They could have done that without getting any hosting people involved. They didn't
When a server hits its capacity, its options are basically:
1. A throttling scheme of some sort
2. Ignore the possibility, and just crash disastrously if/when it occurs
Is that right? If so, surely that's what we should be talking about here, no? (I see ninkendo mentioned it in their comment too.)
Am I right in thinking the whole problem that the blog-post discusses - avoiding a small pool of servers getting overloaded and blowing up - follows from poorly engineered servers that go with Option 2?
Edit: This is what I get for reading the article too quickly. She addresses this quite explicitly:
> Don't assume it's an easy thing to avoid.
I'm really not sold on this attitude. Ordinary exception handling may be a lot of effort, but if you don't do it, that means your work is sloppy. Same here.
More generally, a program shouldn't ever outright explode, regardless of input. That's not quite the same thing as exit-with-error, but I suppose the distinction is ultimately fuzzy.
> The JVM for example can only kill its process once its memory usage exceeds `Xmx`, but it can't prevent that from happening in the first place
I don't see that the particulars of JVM memory-management have any bearing here. A web server should be capable of detecting when it has reached capacity and rejecting additional requests as appropriate. There's no reason this couldn't be done in Java. From a quick search, this is indeed how modern Java servers behave.
If there are many communicating processes, I can imagine that could complicate things greatly. That's a downside of using many communicating processes.
Why shouldn't the system be able to just drop incoming requests until it's able to handle them? That's what routers do.
An even better load balancer will poll a load endpoint, representing CPU load, queue length, percentage of time spent GC'ing, or some similar metric, and scale back requests as that metric gets too high.
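HAProxy can approximate this with agent checks, where a small agent on each backend reports a weight (or "drain") based on local load; a sketch with made-up addresses and ports:

    backend dynamic
        balance roundrobin
        # each server runs a tiny agent on :9700 that replies with a weight
        # percentage (or "drain"/"down") based on CPU, queue depth, etc.
        server web1 10.0.0.11:80 check agent-check agent-port 9700 agent-inter 5s weight 100
        server web2 10.0.0.12:80 check agent-check agent-port 9700 agent-inter 5s weight 100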
Another reason is backend constraints. Adding another server might simply cause more pressure on backend systems like a database. So they can't simply add more servers, as then the additional load on the database might cause that to crash, and result in even longer downtime.
I ended up leveraging Consul to do leadership election (only for the alerting bit) and monitor the health of all the nodes in the cluster. If one of them failed, the load was redistributed equally among the remaining nodes.
I think what Rachel calls a "suicide pact" is now commonly called a circuit breaker. After a certain number of requests fail, the load balancer simply removes all the backends for a certain period of time, and causes all requests to immediately fail. This attempts to mitigate the cascading failure by simply isolating the backend from the frontend for a period of time. If you have something like a "stateless" web-app that shares a database with the other replicas, and the database stops working, this is exactly what you want. No replica will be able to handle the request, so don't send it to any replica.
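For illustration, the core of the circuit-breaker idea fits in a few lines; a minimal Python sketch (thresholds and names are made up, and this is not how any particular load balancer implements it):

    import time

    class CircuitBreaker:
        """Fail fast after repeated failures; allow a trial request after a cooldown."""

        def __init__(self, max_failures=5, reset_timeout=30.0):
            self.max_failures = max_failures
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None       # half-open: let one request probe the backend
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            return result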
Another option to look into is the balancer's "panic threshold". Normally your load balancer will see which backends are healthy, and only route requests to those. That is what the load balancer in the article did, and the effect was that it overloaded the other backends to the point of failure (and this is a somewhat common failure mode). With a panic threshold set, when that many backends become unhealthy, the balancer stops using health as a routing criterion. It will knowingly send some requests to an unhealthy backend. This means that the healthy backends will receive traffic load that they can handle, so at least (healthy/total)% of requests will be handled successfully (instead of causing a cascading failure).
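In Envoy this is the cluster's healthy panic threshold; roughly (field names as I remember them from the Envoy docs, and 50% is the default):

    common_lb_config:
      healthy_panic_threshold:
        value: 50.0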
Finally, other posts mention a common case like running ab against apache/mysql/php on a small machine. The OOM killer eventually kicks in and starts killing things. Luckily, people are also more careful on that front now. Envoy, for example, has the overload manager, so you can configure exactly how much memory you are going to use, and what happens when you get close to the limits. For my personal site, I use 64M of RAM for Envoy, and when it gets to 99% of that, it just stops accepting new connections. This sucks, of course, but it's better than getting OOM killed entirely. (A real website would probably want to give it more than 64M of RAM, but with my benchmarking I can't get anywhere close with 8000 requests/second going through it... and I'm never going to see that kind of load.)
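The overload manager configuration looks roughly like this (paraphrased from memory of the Envoy docs; check the exact field names there):

    overload_manager:
      refresh_interval: 0.25s
      resource_monitors:
      - name: envoy.resource_monitors.fixed_heap
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
          max_heap_size_bytes: 67108864   # ~64M
      actions:
      - name: envoy.overload_actions.stop_accepting_requests
        triggers:
        - name: envoy.resource_monitors.fixed_heap
          threshold:
            value: 0.99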
I guess the TL;DR is that in 2011 it sounded scary to have a "suicide pact" but now it's normal. Sometimes you've got to kill yourself to save others. If you're a web app, that is.
Even though it seems Hystrix is no longer maintained, the circuit breakers, failover modes and all that are well documented.
And I don't know why Hystrix hasn't been adopted by a wide audience yet. It seems like a necessity in the microservice landscape.
In general, a load-balancer can:
* limit the number of concurrent requests, and drop the others;
* limit the number of concurrent requests, but queue the others (with a timeout);
* distribute all requests uniformly (randomly or in round-robin fashion) to all backends;
* (any combination of the above);
However if the "customer" asks you to not drop or queue requests, then there is nothing the load-balancer can actually do...