
A self-killing web site requested by a customer (2011) - ciprian_craciun
https://rachelbythebay.com/w/2011/06/28/sp/
======
ciprian_craciun
I found it interesting that such a simple requirement (having at least a
certain number of on-line servers before the load-balancer starts serving
requests) needed a custom binary that controlled the webserver and had to
cross-monitor each server.

For example with HAProxy (my favorite load-balancer and HTTP "router") this
can be easily achieved by using `nbsrv`, creating an ACL, and only routing
requests to the backend based on that ACL. Based on the documentation below:

* [http://cbonte.github.io/haproxy-dconv/2.1/configuration.html...](http://cbonte.github.io/haproxy-dconv/2.1/configuration.html#7.3.1-nbsrv)

* [http://cbonte.github.io/haproxy-dconv/2.1/configuration.html...](http://cbonte.github.io/haproxy-dconv/2.1/configuration.html#4.2-monitor%20fail)

One can write this:

    frontend www
        mode http
        acl site_alive nbsrv(dynamic) gt 2
        use_backend dynamic if site_alive

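And since the second link covers `monitor fail`, the same ACL can also drive
HAProxy's own monitor URI, so that an external checker (or an upstream
load-balancer) sees the whole site as failed when too few servers are up. A
sketch, with `/site_alive` as an arbitrary choice of monitor URI:

    frontend www
        mode http
        # HAProxy answers health probes on this URI itself...
        monitor-uri /site_alive
        acl site_alive nbsrv(dynamic) gt 2
        # ...and reports failure when too few backends remain
        monitor fail if !site_alive
        use_backend dynamic if site_alive
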
[This article was linked from the discussion of the original article
([https://news.ycombinator.com/item?id=23099347](https://news.ycombinator.com/item?id=23099347)).]

~~~
jefftk
Things are a lot easier these days, mostly because there's been more time for
people to have problems and code up good generalizable solutions.

~~~
ciprian_craciun
I just wanted to mention that I didn't fault the original author for the
proposed solution. It was 2011, they seemed to use some commercial load-
balancer, and the customer didn't seem to want to actually solve the
underlying problem.

I actually found the original solution interesting, and I also proposed an
alternative achievable with HAProxy as the load-balancer.

~~~
dumbfounder
This was written in 2011, but I think it describes something that happened
many years before. It sounds more like a 2001 thing to do.

------
nicbou
This is called a cascading failure. It's also a problem with the electric
grid, and more terrifyingly with global finance.

[https://en.wikipedia.org/wiki/Cascading_failure](https://en.wikipedia.org/wiki/Cascading_failure)

~~~
Beltiras
I'd imagine this is what the next couple of years in finance will look like
due to COVID: shocks and reverberations for a long, long while.

~~~
nicbou
Travel, immigration, entertainment, restaurants etc are severely affected. A
lot of people have lost their income, and spend less.

There is also some fear that this will exacerbate problems with an already
stressed financial system. Banks are massively increasing their cash reserves,
and many organisations have a grim outlook on the upcoming months.

~~~
nvahalik
There may also end up being a flip-side to this. Many people (dubbed
essential) have found these last couple of months that the lack of open
businesses + increased pay means that in a few months they are going to be in
the market for larger purchases (i.e. houses, cars, etc.). Their savings have
gone up, and at least one person I've talked to has mentioned opening up a new
business after all of this is over.

------
ninkendo
Wait, why are the servers “crashing” when under too much load in the first
place?

If there’s some sort of natural limit to how many simultaneous connections
they can handle, why can’t they just return some 4xx error code for
connections beyond that? (And have clients implement an exponential back off?)

Or if that’s too difficult, the load balancer could keep track of some maximum
number of connections (or even requests per second) each backend is capable
of, and throttle (again with some 4xx error code) when the limit has been
reached by all backends? This is pretty basic functionality for load
balancers.

You’re going to need actual congestion control anyway, when the number of
client connections is unbounded like this. Even when your servers aren’t
crashing, what if the client apps whose clicks you’re tracking suddenly become
more popular and you can’t handle the load even with all of your servers up?

~~~
laurentdc
> If there’s some sort of natural limit to how many simultaneous connections
> they can handle

Yup, it's pretty much a DDoS, and it's simple to test. Spin up Wordpress +
Woocommerce (a common ecommerce stack) on a DigitalOcean 1 GB droplet.

Now run `ab -n 1000 -c 30` against the home page; that's 30 concurrent
clients.

Watch MySQL die, Apache get killed because it's out of memory, and...

    root@wootest01:~# reboot
    -bash: fork: Cannot allocate memory

~~~
ninkendo
So, maybe fix that?

Why is Apache continuing to fork new workers ad infinitum? That’s a denial of
service attack waiting to happen, and the answer surely isn’t “oh let’s just
automate the rebooting”...

Edit: you’ve edited your comment to say it’s a DoS as well, so looks like
we’re on the same page here, the software is garbage.

~~~
ColinWright
> _... the software is garbage._

I try to be slow to jump to that conclusion ... so often there are reasons for
things to be the way they are. The software tools and the hardware
capabilities were very, very different in the aughties, and many people
currently writing software don't really seem to appreciate just how much. It
may be the case that things could have been done differently, perhaps better,
but then again, once you know the full story, maybe not.

Some time ago I wrote up a war story from the mid-nineties and had some
current software people crap all over it. When I started to explain about the
machine limitations of the time the bluster increased, and among the replies I
got was "The software was crap".

Ever since then I've been interested in the contexts for these stories. So
often I hear "Well you shouldn't have done it like that!" rather than an
enquiring:

 _" OK, let's assume some clever people wrote this, I wonder what the
pressures were that made them come out with that solution."_

I've learned a lot by approaching things that way, instead of simply
declaring: The software was garbage.

~~~
DaiPlusPlus
> “OK, let's assume some clever people wrote this, I wonder what the pressures
> were that made them come out with that solution."

Pressure from everyone to get something shipped and out-the-door - never mind
the technical debt - and after it ships, management is only interested in
further growth fuelled by more features - deepening the technical debt.

As for the problem of Apache keeling over too easily - that’s because Apache
(at the time - and I think still now) is based on “one thread per HTTP
request” - or one thread per connection. Asynchronous (or rather: coroutine-
based) request handling was very esoteric in the late-1990s and early-2000s -
it required a fundamentally different program architecture - no-one was going
to rewrite their CGI programs to do that - and Perl-script-based applications
couldn’t do it at all. Asynchronous HTTP request handling only really became
mainstream when NodeJS started getting popular about 7-8 years ago - but, to
my knowledge, besides ASP.NET Core, no other web application platform treats
asynchronous request handling as a first-class feature.

~~~
pdonis
_> Asynchronous HTTP request handling only really became mainstream when
NodeJS started getting popular about 7-8 years ago_

Nginx first came out in 2002. It was being used widely by something like 2008.

 _> no other web application platform treats asynchronous request handling as
a first-class feature_

Huh? Web application platforms that do this have been around since the late
1990s. (Heck, there was one written in _Python_ in the late 1990s, it was
called "Medusa".) They just weren't "popular" back then.

~~~
monknomo
I think not being popular would qualify them as esoteric.

~~~
pdonis
Perhaps the ones from the late 1990s were (depends on what you think qualifies
as "esoteric", since Medusa, for example, was serving high volume websites in
the late 1990s), but I wasn't claiming they weren't; I was only pointing out
that they _did_ in fact support asynchronous request handling as a first class
feature, even before Nginx did.

Nginx, OTOH, _was_ popular before NodeJS even existed, let alone before NodeJS
became mainstream.

------
MaxBarraclough
> The load now rebalanced to four remaining machines is just far too big, and
> they all die as a result.

Perhaps I'm missing something terribly obvious here, but why would that
happen?

I can understand requests being dropped and processing-times worsening, but a
full system-wide crash?

 _edit_ My bad, I'd missed this in the article:

> they could have rewritten their web site code so it didn't send the machines
> into a many-GB-deep swap fest. They could have done that without getting any
> hosting people involved. They didn't, and so now I have a story

~~~
sneak
System latency is not linear with load. At a certain threshold it starts
swapping way beyond the capacity of the disk i/o and basically nothing gets
done.

I would assume that if you left it alone for an hour or so it might eventually
unfuck itself, but for production purposes, that counts as dead, especially
when you can’t even ssh into it because the memory allocations for your ssh
session are also in that gigantic queue for disk bandwidth via swap.

~~~
MaxBarraclough
Why doesn't the server just drop incoming requests until it's able to handle
them again? As I mentioned in another comment, this is what routers do.

 _edit_ I hadn't noticed this at the bottom of the article:

> they could have rewritten their web site code so it didn't send the machines
> into a many-GB-deep swap fest. They could have done that without getting any
> hosting people involved. They didn't

~~~
rovr138
Apache has a way to limit the number of processes it forks. Tuning that would
have helped.
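
For instance, with Apache's prefork MPM something like this caps the process
count (the directives are standard Apache 2.4 ones; the numbers are
illustrative, and the right values depend on per-process memory use):

    <IfModule mpm_prefork_module>
        StartServers              5
        MinSpareServers           5
        MaxSpareServers          10
        # cap concurrent workers so RAM is never exhausted; requests
        # beyond this wait in the listen queue instead of forking more
        MaxRequestWorkers        30
        MaxConnectionsPerChild 1000
    </IfModule>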

------
klausjensen
Lovely hack, and an example of how thinking outside the box can create
solutions that are an order of magnitude cheaper than the "obvious" solution.

------
londons_explore
A better solution would be to simply configure the loadbalancer to have a
maximum number of requests per second per endpoint and then to drop any
requests over that.

An even better loadbalancer will poll a load endpoint, representing CPU load,
queue length, percentage of time GC'ing, or some similar metric, and scale
back requests as that metric gets too high.
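
For reference, a rough HAProxy version of the first idea, using session rate
as a stand-in for request rate (the backend name and the threshold here are
invented):

    frontend www
        mode http
        # if the backend is already taking more than ~200 new
        # sessions per second, reject further requests outright
        acl overloaded be_sess_rate(dynamic) gt 200
        http-request deny deny_status 503 if overloaded
        default_backend dynamic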

~~~
MichaelApproved
Maybe their “load balancer” was just a simple round robin service which
wouldn’t have those abilities.

------
mgkimsal
Was it just a cost thing that would prevent people from just adding another
server into the mix? Given that 4 was the magic number, why not add another
server or two to buffer the time between servers dying and 'it all breaks'?
I'm realizing the cost factor may have been it, depending on
size/location/etc. - would there be any other reason?

~~~
Beltiras
Might have been tried and not worked. Might have been a limitation in scaling
(f.ex. only being able to do master-slave replication and not being able to
add more master nodes). Remember: it's bad software to begin with.

------
hangonhn
Oh boy! I had a similar cascading-failure situation once with a Nagios
"cluster" I inherited. The previous engineer distributed the work between a
master and 3 slave nodes, with a backup mechanism such that if any of the
slaves died, its load would go to the master. This was fine when he first
created it, but as more slaves were added, the master ended up running at
capacity just dealing with the incoming data. So with each additional slave
node, the probability of one of them failing and sending its load to
overwhelm the master increased. Sometimes a poorly designed distributed
system is worse than a single big server.

I ended up leveraging Consul to do leadership election (only for the alerting
bit) and monitor the health of all the nodes in the cluster. If one of them
failed, the load was redistributed equally among the remaining nodes.

------
rjkennedy98
HA is definitely super tricky. Not many products do it well. For instance, one
of the last NoSQL databases I used was quicker to restart than its failover
was to be detected, so during an upgrade the DBAs would just restart the
cluster instead of waiting for failover to happen.

------
jrockway
There is actually quite a bit of complexity with load balancing, but the good
news is that a lot of the complexity is understood and is configurable on the
load balancer.

I think what Rachel calls a "suicide pact" is now commonly called a circuit
breaker. After a certain number of requests fail, the load balancer simply
removes all the backends for a certain period of time, and causes all requests
to immediately fail. This attempts to mitigate the cascading failure by simply
isolating the backend from the frontend for a period of time. If you have
something like a "stateless" web-app that shares a database with the other
replicas, and the database stops working, this is exactly what you want. No
replica will be able to handle the request, so don't send it to any replica.
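
In Envoy terms, the closest built-in knob is outlier detection, which ejects
backends that keep failing; with a 100% ejection ceiling, the whole pool can
"trip" at once like a breaker. A sketch of the relevant cluster snippet
(values are illustrative):

    outlier_detection:
      consecutive_5xx: 5          # eject a host after 5 consecutive 5xx responses
      base_ejection_time: 30s     # keep the ejected host out for a while
      max_ejection_percent: 100   # allow ejecting every backend at once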

Another option to look into is the balancer's "panic threshold". Normally your
load balancer will see which backends are healthy, and only route requests to
those. That is what the load balancer in the article did, and the effect was
that it overloaded the other backends to the point of failure (and this is a
somewhat common failure mode). With a panic threshold set, when that many
backends become unhealthy, the balancer stops using health as a routing
criterion. It will knowingly send some requests to an unhealthy backend. This
means that the healthy backends will receive traffic load that they can
handle, so at least (healthy/total)% of requests will be handled successfully
(instead of causing a cascading failure).
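
In Envoy this knob is the cluster's healthy panic threshold (a sketch; 50%
also happens to be Envoy's default):

    common_lb_config:
      healthy_panic_threshold:
        value: 50.0   # below 50% healthy hosts, route to all hosts,
                      # healthy or not, instead of concentrating load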

Finally, other posts mention a common case like running ab against
apache/mysql/php on a small machine. The OOM killer eventually kicks in and
starts killing things. Luckily, people are also more careful on that front
now.
Envoy, for example, has the overload manager, so you can configure exactly how
much memory you are going to use, and what happens when you get close to the
limits. For my personal site, I use 64M of RAM for Envoy, and when it gets to
99% of that, it just stops accepting new connections. This sucks, of course,
but it's better than getting OOM killed entirely. (A real website would
probably want to give it more than 64M of RAM, but with my benchmarking I
can't get anywhere close with 8000 requests/second going through it... and I'm
never going to see that kind of load.)
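
The stanza for that setup looks roughly like this (a sketch of Envoy's
overload manager; the 64 MiB heap and 99% threshold mirror the numbers above,
and `stop_accepting_requests` is one of the available overload actions):

    overload_manager:
      refresh_interval: 0.25s
      resource_monitors:
        - name: envoy.resource_monitors.fixed_heap
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
            max_heap_size_bytes: 67108864   # 64 MiB
      actions:
        - name: envoy.overload_actions.stop_accepting_requests
          triggers:
            - name: envoy.resource_monitors.fixed_heap
              threshold:
                value: 0.99   # start shedding load at 99% of the heap budget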

I guess the TL;DR is that in 2011 it sounded scary to have a "suicide pact"
but now it's normal. Sometimes you've got to kill yourself to save others. If
you're a web app, that is.

~~~
rrmoelker
While not actively developing infrastructure myself, I've always liked the
concepts presented in the Hystrix package:
[https://github.com/Netflix/Hystrix](https://github.com/Netflix/Hystrix)

Even though it seems it is no longer maintained, the circuit breakers,
failover modes and all that are well documented.

And I don't know why Hystrix hasn't been adopted by a wider audience yet. It
seems like a necessity in the microservice landscape.

~~~
jrockway
I have always preferred the library approach myself, but it seems like people
are converging on "sidecar" proxies to connect up their microservices. Istio
and Linkerd are the big ones. Istio uses Envoy which you can use without a
whole "service mesh" to add things like circuit breaking, load balancing, rate
limiting, etc.

------
Random_ernest
I am not a webdev, but isn't that a task for the loadbalancer in the first
place?

~~~
ciprian_craciun
Unfortunately the load-balancer is not a magic bullet curing every issue a
system has. A load-balancer can be configured to do lots of things, for
example:

* limit the number of concurrent requests, and drop the others;

* limit the number of concurrent requests, but queue the others (with a timeout);

* distribute all requests uniformly (randomly or in round-robin fashion) to all backends;

* (any combination of the above);

However if the "customer" asks you to not drop or queue requests, then there
is nothing the load-balancer can actually do...
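
As a sketch of the first two in HAProxy (server names, addresses, and limits
are invented):

    backend dynamic
        mode http
        balance roundrobin
        # each server handles at most 50 concurrent requests; the excess
        # waits in the queue for up to 5s and is rejected after that
        timeout queue 5s
        server web1 10.0.0.1:80 check maxconn 50
        server web2 10.0.0.2:80 check maxconn 50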

~~~
Random_ernest
I took it from the article that dropping requests was permitted (since that
happens anyway when all servers go down). So my assumption is still that a
better solution is for the load balancer to allow only a specific number of
requests per server and to reject or queue the rest. I would even argue that
having requests rejected is more understandable for the user than the website
simply not being there for a certain time.

~~~
legacynl
I think you are right, except that the article reads as if they were using a
loadbalancer that wasn't under their control (a third-party service). If you
can't configure your loadbalancer to stop passing requests when you're
overloaded, the next best thing is to keep track of it on each node, and
basically do what the article described.

