I found it interesting because such a simple task (requiring a minimum number of on-line servers before the load-balancer starts serving requests) required a custom binary that controlled the webserver and had to cross-monitor each server.
For example, with HAProxy (my favorite load-balancer and HTTP "router") this can easily be achieved by using `nbsrv`, creating an ACL, and only routing requests to the backend based on that ACL.
I just wanted to mention that I didn't fault the original author for the proposed solution. It was 2011, they seemed to be using some commercial load-balancer, and the customer didn't seem to want to actually solve the problem.
I actually found the original solution interesting, and I also proposed an alternative achievable with HAProxy as the load-balancer.
Indeed if I were to configure the load-balancer I would limit both the number of concurrent requests and the number of queued requests. (In fact this is what I do in my HAProxy deployments.)
However, I assume that the customer (the one the original author was referring to) didn't want to apply any limitations. (Why is beyond my reasoning...)
Travel, immigration, entertainment, restaurants, etc. are severely affected. A lot of people have lost their income and are spending less.
There is also some fear that this will exacerbate problems with an already stressed financial system. Banks are massively increasing their cash reserves, and many organisations have a grim outlook on the upcoming months.
There may also end up being a potential flip-side to this. Many people (dubbed essential) have found over the last couple of months that the lack of open businesses + increased pay means that in a few months they are going to be in the market for larger purchases (i.e. houses, cars, etc.). Their savings have gone up, and at least one person I've talked to has mentioned opening up a new business after all of this is over.
That show's first season gives an excellent picture of how interconnected technology is and how ideas spread and combine. The surprising connections are analogous to the cascading failures we see in other systems. Well worth a watch.
Wait, why are the servers “crashing” when under too much load in the first place?
If there’s some sort of natural limit to how many simultaneous connections they can handle, why can’t they just return some 4xx error code for connections beyond that? (And have clients implement an exponential back off?)
Or if that’s too difficult, the load balancer could keep track of some maximum number of connections (or even requests per second) each backend is capable of, and throttle (again with some 4xx error code) when the limit has been reached by all backends? This is pretty basic functionality for load balancers.
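For what it's worth, that kind of per-backend cap is only a couple of lines in something like HAProxy. A minimal sketch (server addresses and limits are invented; note that HAProxy answers overflow with a 503 once the queue times out, rather than a 4xx):

```
backend app
    balance leastconn
    # Each server handles at most 100 requests at once; anything beyond
    # that waits in the backend queue for up to 5 seconds, then gets an
    # error instead of piling onto an already-saturated server.
    timeout queue 5s
    server web1 10.0.0.1:80 check maxconn 100
    server web2 10.0.0.2:80 check maxconn 100
```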
You’re going to need actual congestion control anyway, when the number of client connections is unbounded like this. Even when your servers aren’t crashing, what if the client apps whose clicks you’re tracking suddenly become more popular and you can’t handle the load even with all of your servers up?
Why is Apache continuing to fork new workers ad infinitum? That’s a denial of service attack waiting to happen, and the answer surely isn’t “oh let’s just automate the rebooting”...
Edit: you’ve edited your comment to say it’s a DoS as well, so looks like we’re on the same page here, the software is garbage.
I try to be slow to jump to that conclusion ... so often there are reasons for things to be the way they are. The software tools and the hardware capabilities were very, very different in the aughties, and many people currently writing software don't really seem to appreciate by just how much. It may be the case that things could have been done differently, perhaps better, but then again, once you know the full story, maybe not.
Some time ago I wrote up a war story from the mid-nineties and had some current software people crap all over it. When I started to explain about the machine limitations of the time the bluster increased, and among the replies I got was "The software was crap".
Ever since then I've been interested in the contexts for these stories. So often I hear "Well you shouldn't have done it like that!" rather than an enquiring:
"OK, let's assume some clever people wrote this, I wonder what the pressures were that made them come out with that solution."
I've learned a lot by approaching things that way, instead of simply declaring: The software was garbage.
> “OK, let's assume some clever people wrote this, I wonder what the pressures were that made them come out with that solution."
Pressure from everyone to get something shipped and out the door - never mind the technical debt - and after it ships, management is only interested in further growth fuelled by more features - deepening the technical debt.
As for the problem of Apache keeling over too easily - that’s because Apache (at the time - and I think still now) is based on “one thread per HTTP request” - or one thread per connection. Asynchronous (or rather: coroutine-based) request handling was very esoteric in the late-1990s and early-2000s - and required a fundamentally different program architecture - no one was going to rewrite their CGI programs to do that - and Perl-script-based applications couldn’t do anything about it at all. Asynchronous HTTP request handling only really became mainstream when NodeJS started getting popular about 7-8 years ago - but, to my knowledge, besides ASP.NET Core - no other web application platform treats asynchronous request handling as a first-class feature.
> Asynchronous HTTP request handling only really became mainstream when NodeJS started getting popular about 7-8 years ago
Nginx first came out in 2002. It was being used widely by something like 2008.
> no other web application platform treats asynchronous request handling as a first-class feature
Huh? Web application platforms that do this have been around since the late 1990s. (Heck, there was one written in Python in the late 1990s, it was called "Medusa".) They just weren't "popular" back then.
Perhaps the ones from the late 1990s were (depends on what you think qualifies as "esoteric", since Medusa, for example, was serving high volume websites in the late 1990s), but I wasn't claiming they weren't; I was only pointing out that they did in fact support asynchronous request handling as a first class feature, even before Nginx did.
Nginx, OTOH, was popular before NodeJS even existed, let alone before NodeJS became mainstream.
It's not the async logic that's hard, it's the parsing and handling of the HTTP protocol. Apache had years to get it right and figure out the workarounds for non-compliant black boxes. It's this that keeps it around unfortunately. Its architecture is obsolete and a waste of resources.
Nginx does the async stuff properly I believe but there's still the nonsense of forking child processes.
Not sure where you draw the line between a server and a platform, but Go and Elixir are async by default. (Not in the async await sense, but in the sense that they handle requests using lightweight userspace threads).
Besides that, Python, Java and PHP have had mature async web stacks available for years as well. Is there any commonly used language that doesn’t?
How many years? More than 15? This conversation is getting into decades, which many developers (just by nature of growth of the industry) haven't experienced.
Now done that, and submitted it here[0]. I have no doubt it will sink without trace, but it's there for me to point at in the future. Thanks for the encouragement.
Thank you! It didn't get much attention, so in accordance with the guidelines I've given it one re-submission. I expect it will sink without trace, although I may be surprised. Regardless, it's there as a resource so I can point people at it.
Yeah, I agree in many cases that's the issue, as the author pointed out in the last paragraph.
Let me also add that on this particular stack this happens with caching enabled (W3 Total Cache). Without the cache it struggles with 10 concurrent clients (TTFB > 5 seconds). 100 MB per client to serve what's pretty much a grid of products and a button; it's scary.
By default, Apache is rather poorly optimised. MySQL is even worse. WordPress' poor performance is the result of mashing components together without any concern for performance.
WordPress itself is not garbage (at least not for those reasons), but WordPress + a bunch of plugins hammered together is a different beast.
There's an Apache setting to limit that, or at least there was when I last had to deal with this problem 20 years ago on webservers that had less processing power than today's midrange smartphones.
The problem is you have to manually tune the number of max workers a bit based on estimated resource usage.
And even if you've got a really good idea of how many connections you can handle at a time, when problems start happening the patterns might change.
For example, maybe in the average second you see 1 login request (CPU-bound password hashing), 10 page views (database connection bound), 100 image views (IO/client speed bound), and 0.01 login sessions being bounced between servers (???? bound). And your server is limited to 111 connections per second.
But when the site goes down? Users can't load images until they've seen pages, and can't see pages until they've logged in, so now you're processing 111 login attempts per second when your tuning assumed a fraction of that.
Part of the problem is that you usually need a lot more Apache workers to process static file requests, but want to limit the number of expensive PHP invocations.
This was pretty much not easily possible (with suphp or mod_php) until FPM, which has a per-site worker pool.
Solved a lot of problems for us as a shared hosting provider.
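For anyone curious what that looks like, a rough per-site pool definition (the pool name, socket path, and numbers here are hypothetical): PHP is capped at 10 workers for this site no matter how many Apache workers are busy shovelling static files.

```
; Hypothetical per-site FPM pool: caps the expensive PHP workers
; independently of the (much larger) Apache worker count.
[example-site]
user = example-site
group = example-site
listen = /run/php-fpm/example-site.sock
pm = dynamic
; Hard cap on concurrent PHP processes for this site.
pm.max_children = 10
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 3
```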
I dealt with a similar issue in 2001. About 6 mid-range Sun servers (like small refrigerators in size), running Netscape Application Server, with the load distributed using round-robin DNS (A record round-robin). Because some DNS servers cached the A record longer than intended, the load was distributed unevenly -- too much on one server and it'd go bad: requests would start taking too long and it'd start throwing 500 errors or just wouldn't respond. Once it stopped accepting connections, the other servers would take on its load (the client would retry with the next IP in the round-robin), but then the same thing would happen to one of the remaining five, and the rest would follow shortly thereafter in a death spiral.
Restarting was a bear. We had to "warm up" the application server by hitting some of the web pages so caches would get populated etc., before opening the system to public traffic (which, as I recall, we controlled with IP address aliases) or else it would die again.
Finally we put the system behind an F5 load balancer, which resulted in a much more even distribution of traffic -- that, coupled with the "warm up" crawl, led to a greatly stabilized system and the highest ever (and growing) page views.
This. Least-conns load balancing with max-conns per host allows this type of failure to be handled as gracefully as possible without human intervention or apps needing to share state. It amazes me that these aren’t defaults on very popular load balancing platforms in 2020.
Exactly this, limit your apache processes to match the ram you have available so the machine returns 4xx or 5xx to excess requests but otherwise remains responsive and normal. Problem solved. The options are right there in the conf.
Start with a super low limit then slowly increase while watching reqs/sec. When reqs/sec stops increasing or you run out of ram then there’s your sweet spot.
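A sketch of what that tuning looks like with the prefork MPM (the numbers are placeholders, not a recommendation; the point is that MaxRequestWorkers times your per-process memory footprint should fit in RAM, with excess connections waiting in the listen backlog instead of driving the box into swap):

```
<IfModule mpm_prefork_module>
    StartServers              5
    MinSpareServers           5
    MaxSpareServers          10
    # Called MaxClients on Apache 2.2 and earlier.
    MaxRequestWorkers       150
    # Recycle children periodically to contain leaky apps.
    MaxConnectionsPerChild 1000
</IfModule>
```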
> The load now rebalanced to four remaining machines is just far too big, and they all die as a result.
Perhaps I'm missing something terribly obvious here, but why would that happen?
I can understand requests being dropped and processing-times worsening, but a full system-wide crash?
edit My bad, I'd missed this in the article:
> they could have rewritten their web site code so it didn't send the machines into a many-GB-deep swap fest. They could have done that without getting any hosting people involved. They didn't, and so now I have a story
In the article, swap is mentioned. Apparently either processing each request requires some amount of memory, or new processes are fired up for concurrent requests (the Apache model). So, what I've seen happen in similar circumstances is that once swapping starts, some of the processes run 10x slower, while new load keeps coming in and piling up, running the server even deeper into the swap. Soon you have a thousand processes, load-average >100, and you can't run anything in the terminal.
As mentioned in other comments, a proper defense against this would be to have a front-line system that anticipates such a situation and actively manages it: puts outstanding requests in a queue (throttling the load) and then errors out on new requests if there's still too many of them coming in.
System latency is not linear with load. At a certain threshold it starts swapping way beyond the capacity of the disk i/o and basically nothing gets done.
I would assume that if you left it alone for an hour or so it might eventually unfuck itself, but for production purposes, that counts as dead, especially when you can’t even ssh into it because the memory allocations for your ssh session are also in that gigantic queue for disk bandwidth via swap.
Why doesn't the server just drop incoming requests until it's able to handle them again? As I mentioned in another comment, this is what routers do.
edit I hadn't noticed this at the bottom of the article:
> they could have rewritten their web site code so it didn't send the machines into a many-GB-deep swap fest. They could have done that without getting any hosting people involved. They didn't
This will easily happen with systems that do not have any kind of upper limits on what they _try_ to do.
For example: not limiting your thread pool size so that your server might try to create more threads than the system has memory for, dying due to lack of memory.
Similar things can happen even if you're using async IO as you still need memory for all those stacks being run asynchronously.
So there are two ways a server can be engineered to handle overload:
1. A throttling scheme of some sort
2. Ignore the possibility, and just crash disastrously if/when it occurs
Is that right? If so, surely that's what we should be talking about here, no? (I see ninkendo mentioned it in their comment too.)
Am I right in thinking the whole problem that the blog-post discusses - avoiding a small pool of servers getting overloaded and blowing up - follows from poorly engineered servers that go with Option 2?
Edit This is what I get for reading the article too quickly. She addresses this quite explicitly:
> they could have rewritten their web site code so it didn't send the machines into a many-GB-deep swap fest. They could have done that without getting any hosting people involved. They didn't
I think options 1 and 2 are exactly right. Depending on your system, you may not even have choice 1... it's quite hard actually to control a system so well that you can confidently say it will never exhaust the available resources. The JVM for example can only kill its process once its memory usage exceeds `Xmx`, but it can't prevent that from happening in the first place. If you had a few JVMs (or Python, Ruby or any other process) taking up all of your memory and the system started swapping disk like crazy, you could easily have the system crash anyway as different processes don't "throttle" or coordinate with each other.
I'm really not sold on this attitude. Ordinary exception handling may be a lot of effort, but if you don't do it, that means your work is sloppy. Same here.
More generally, a program shouldn't ever outright explode, regardless of input. That's not quite the same thing as exit-with-error, but I suppose the distinction is ultimately fuzzy.
> The JVM for example can only kill its process once its memory usage exceeds `Xmx`, but it can't prevent that from happening in the first place
I don't see that the particulars of JVM memory-management have any bearing here. A web server should be capable of detecting when it has reached capacity, and to reject additional requests as appropriate. There's no reason this couldn't be done in Java. From a quick search, this is indeed how modern Java servers behave.
If there are many communicating processes, I can imagine that could complicate things greatly. That's a downside of using many communicating processes.
I've never had to think about this. Always had the option of just opening up more capacity. My choice for application server (uWSGI) doesn't have easily found configurations for dropping connections on high load. Lots of documentation on fallbacks but no pre-emptive dropping (that I can find).
A better solution would be to simply configure the loadbalancer to have a maximum number of requests per second per endpoint and then to drop any requests over that.
An even better loadbalancer will poll a load endpoint, representing CPU load, queue length, percentage of time GC'ing, or some similar metric, and scale back requests as that metric gets too high.
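HAProxy's agent-check is one concrete way to get part of this (a sketch, not necessarily what the parent has in mind): a small agent on each backend periodically reports something like "75%" or "drain" based on CPU load or queue depth, and the proxy scales that server's share of traffic accordingly.

```
backend app
    balance leastconn
    # The agent listening on port 9999 replies with a weight ("75%"),
    # "drain", "down", etc., letting the backend itself shed load
    # proactively before it falls over.
    server web1 10.0.0.1:80 check weight 100 agent-check agent-port 9999 agent-inter 2s
    server web2 10.0.0.2:80 check weight 100 agent-check agent-port 9999 agent-inter 2s
```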
Was it just a cost thing that would prevent people from just adding another server into the mix? Given that 4 was the magic number, add another server or two to buy time between servers dying and 'it all breaks'. I'm realizing the cost factor may have been it, depending on size/location/etc. - would there be any other reason?
Might have been tried and not worked. Might have been a limitation in scaling (f.ex. only being able to do master-slave replication and not being able to add more master nodes). Remember: it's bad software to begin with.
Not necessarily the reason here, but to answer your question: "would there be any other reason?"
Another reason is backend constraints. Adding another server might simply cause more pressure on backend systems like a database. So they can't simply add more servers, as then the additional load on the database might cause that to crash, and result in even longer downtime.
Oh boy! I had a similar cascading failure situation once with a Nagios "cluster" I inherited. The previous engineer distributed the work between a master and 3 slave nodes, with a backup mechanism such that if any of the slaves died, its load would go to the master. This was fine when he first created it, but as more slaves were added, the master was running at capacity just dealing with the incoming data. So with each additional slave node, the probability of one of them failing and sending its load to overwhelm the master increased. Sometimes a poorly designed distributed system is worse than a single big server.
I ended up leveraging Consul to do leadership election (only for the alerting bit) and monitor the health of all the nodes in the cluster. If one of them failed, the load was redistributed equally among the remaining nodes.
HA is definitely super tricky. Not many products do it well. One of the last NoSQL databases I used for instance was quicker to restart than for failover to be detected so DBAs would just restart the cluster instead of waiting for failover to happen during an upgrade.
There is actually quite a bit of complexity with load balancing, but the good news is that a lot of the complexity is understood and is configurable on the load balancer.
I think what Rachel calls a "suicide pact" is now commonly called a circuit breaker. After a certain number of requests fail, the load balancer simply removes all the backends for a certain period of time, and causes all requests to immediately fail. This attempts to mitigate the cascading failure by simply isolating the backend from the frontend for a period of time. If you have something like a "stateless" web-app that shares a database with the other replicas, and the database stops working, this is exactly what you want. No replica will be able to handle the request, so don't send it to any replica.
Another option to look into is the balancer's "panic threshold". Normally your load balancer will see which backends are healthy, and only route requests to those. That is what the load balancer in the article did, and the effect was that it overloaded the other backends to the point of failure (and this is a somewhat common failure mode). With a panic threshold set, when that many backends become unhealthy, the balancer stops using health as a routing criterion. It will knowingly send some requests to an unhealthy backend. This means that the healthy backends will receive traffic load that they can handle, so at least (healthy/total)% of requests will be handled successfully (instead of causing a cascading failure).
Finally, other posts mention a common case like running ab against apache/mysql/php on a small machine. The OOM killer eventually kicks in and starts killing things. Luckily, people are also more careful on that front now. Envoy, for example, has the overload manager, so you can configure exactly how much memory you are going to use, and what happens when you get close to the limits. For my personal site, I use 64M of RAM for Envoy, and when it gets to 99% of that, it just stops accepting new connections. This sucks, of course, but it's better than getting OOM killed entirely. (A real website would probably want to give it more than 64M of RAM, but with my benchmarking I can't get anywhere close with 8000 requests/second going through it... and I'm never going to see that kind of load.)
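For reference, roughly what those two knobs look like in Envoy config (illustrative fragments and values, not the commenter's actual setup; double-check the field names against the Envoy docs for your version):

```yaml
# (Cluster-level fragment) Panic threshold: once fewer than 50% of the
# hosts are healthy, stop using health as a routing criterion instead
# of piling all the traffic onto the survivors.
common_lb_config:
  healthy_panic_threshold:
    value: 50.0

# (Bootstrap-level fragment) Overload manager: cap Envoy's own heap and
# stop accepting new work near the limit instead of being OOM-killed.
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: envoy.resource_monitors.fixed_heap
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 67108864   # ~64 MiB
  actions:
    - name: envoy.overload_actions.stop_accepting_requests
      triggers:
        - name: envoy.resource_monitors.fixed_heap
          threshold:
            value: 0.99
```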
I guess the TL;DR is that in 2011 it sounded scary to have a "suicide pact" but now it's normal. Sometimes you've got to kill yourself to save others. If you're a web app, that is.
I have always preferred the library approach myself, but it seems like people are converging on "sidecar" proxies to connect up their microservices. Istio and Linkerd are the big ones. Istio uses Envoy which you can use without a whole "service mesh" to add things like circuit breaking, load balancing, rate limiting, etc.
Unfortunately the load-balancer is not a magic bullet that cures every issue a system has. A load-balancer can be configured to do lots of things, for example:
* limit the number of concurrent requests, and drop the others;
* limit the number of concurrent requests, but queue the others (with a timeout);
* distribute all requests uniformly (randomly or in round-robin fashion) to all backends;
* (any combination of the above);
However if the "customer" asks you to not drop or queue requests, then there is nothing the load-balancer can actually do...
I took it from the article that dropping requests was permitted (since that happens when all servers go down). So my assumption is still that a better solution is that the load balancer allows only a specific number of requests per server and rejects or caches the rest. I would even argue that requests being rejected is more understandable for the user than the website simply not being there for a certain time.
I think you are right, except that the article reads as if they were using a loadbalancer that wasn't in their control (a third-party service). If you can't control your loadbalancer to not pass requests when you're overloaded, the next best thing is to keep track of it on each node, and basically do what the article described.
For example with HAProxy (my favorite load-balancer and HTTP "router") this can easily be achieved by using `nbsrv`, creating an ACL, and only routing requests to the backend based on that ACL. Per the documentation below:
* http://cbonte.github.io/haproxy-dconv/2.1/configuration.html...
* http://cbonte.github.io/haproxy-dconv/2.1/configuration.html...
One can write this:
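(A sketch along the lines of the documentation above; the backend names, server addresses, and the threshold of three are placeholders.)

```
frontend www
    bind :80
    # Only serve from the app backend while at least 3 servers are up;
    # otherwise fail fast instead of melting the survivors.
    acl enough_servers nbsrv(app) ge 3
    use_backend app if enough_servers
    default_backend sorry_page

backend app
    balance roundrobin
    server web1 10.0.0.1:80 check
    server web2 10.0.0.2:80 check
    server web3 10.0.0.3:80 check
    server web4 10.0.0.4:80 check

backend sorry_page
    # Immediately answer 503 while the pool is below the threshold.
    http-request deny deny_status 503
```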
[This article was linked from the original article described in (https://news.ycombinator.com/item?id=23099347).]