
The load-balanced capture effect - luu
https://rachelbythebay.com/w/2015/02/16/capture/
======
johngalt
Anyone looking for more ideas on statistical anomaly detection should check
out this talk:

[https://www.usenix.org/conference/lisa14/conference-program/...](https://www.usenix.org/conference/lisa14/conference-program/presentation/boubez)

Specifically, the section about 20 minutes in where he talks about K-S
windowing.
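
For a rough idea of what that windowed comparison can look like, here's an
untested sketch using scipy's two-sample K-S test - the window contents and
the alpha cutoff are arbitrary choices of mine, not anything from the talk:

    # Untested sketch: compare a recent window of response times against a
    # longer baseline window with a two-sample Kolmogorov-Smirnov test.
    from scipy.stats import ks_2samp

    def window_looks_anomalous(baseline_latencies, recent_latencies, alpha=0.001):
        # A small p-value means the two samples likely come from different
        # distributions, i.e. the recent window has drifted from the baseline.
        statistic, p_value = ks_2samp(baseline_latencies, recent_latencies)
        return p_value < alpha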

~~~
jaytaylor
This is an interesting talk, thanks for sharing!

For those not already familiar with K-S tests, I'll save you from a google
query:
[http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test](http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)

------
waterside81
AWS uses health checks to solve this problem. When one of your load-balanced
instances fails enough health checks within a fixed-size window, AWS
automatically takes that instance out of rotation. It works pretty well.

[http://docs.aws.amazon.com/ElasticLoadBalancing/latest/Devel...](http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/configure-healthcheck.html)
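
For what it's worth, that health check can also be set up programmatically.
Untested sketch with boto3 against a classic ELB - the load balancer name,
target path, and thresholds are all made-up examples:

    # Untested sketch: configure a classic ELB health check with boto3.
    # The name, path, and thresholds below are illustrative only.
    import boto3

    elb = boto3.client("elb")
    elb.configure_health_check(
        LoadBalancerName="my-load-balancer",  # hypothetical name
        HealthCheck={
            "Target": "HTTP:80/healthcheck",  # must return 200 to count as healthy
            "Interval": 30,                   # seconds between checks
            "Timeout": 5,                     # seconds before a check counts as failed
            "UnhealthyThreshold": 2,          # consecutive failures before marking down
            "HealthyThreshold": 3,            # consecutive successes before marking up
        },
    )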

~~~
nbm
Pretty much all load balancers have health checks - active, where they reach
out to each server, or passive, where they observe the responses to real
requests when they can.

One of the issues is making your active health check more like a doctor's
physical than "'tis but a scratch" self-reporting. But also ensuring you're
not dealing with a whole bunch of hypochondriacs.

Passive health checks at least have the property that they fail servers when
the servers are unable to serve, even if the active health check does not
consider some subsystem in its response. But alone they can easily be fooled
by really fast non-error responses.
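
One crude guard against that last failure mode (untested sketch; the
threshold is an arbitrary assumption of mine): treat a "success" that comes
back implausibly fast as suspect too, since a handler that short-circuits
into an error path usually answers much quicker than one doing real work.

    # Untested sketch of a passive-check heuristic: a response counts against
    # a server if it's an error, OR if it's a "success" that returned
    # suspiciously fast. MIN_PLAUSIBLE_MS is an arbitrary guess at how long
    # real work takes.
    MIN_PLAUSIBLE_MS = 20

    def response_is_suspect(status_code, duration_ms):
        if status_code >= 500:
            return True
        return duration_ms < MIN_PLAUSIBLE_MS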

Anyway, saying "[brand of load balancer] solves this problem" only covers the
most basic cases. General solutions are at best the first step of a full
solution. You need to think about the edges - which I suspect is what Rachel
is advocating.

~~~
klaruz
Cloudwatch does what you're referring to as well. It's more of a basic server
monitoring system that happens to integrate with the load balancer.

You get a set of basic VM-level metrics, and you can feed it custom metrics
from your app or from log files, all of which can be configured to alarm. I
don't think it's possible to run advanced statistics on the metrics for
alarming (e.g., deviation from a 30-minute average exceeding N standard
deviations), but it may be. Usually it's just an event count, like more than
N 500 errors over X time.
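
An alarm of the "more than N 500 errors over X time" variety looks roughly
like this with boto3 (untested sketch - the namespace, metric name, numbers,
and SNS topic are all made up):

    # Untested sketch: push a custom error-count metric and alarm on it.
    # Namespace, metric name, thresholds, and the SNS topic ARN are hypothetical.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # From the app or a log tailer: report how many 500s this host just served.
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[{"MetricName": "Http500Count", "Value": 3.0, "Unit": "Count"}],
    )

    # Alarm if the fleet serves more than 50 errors in any 5-minute period.
    cloudwatch.put_metric_alarm(
        AlarmName="my-app-too-many-500s",
        Namespace="MyApp",
        MetricName="Http500Count",
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=50,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )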

I do agree you need to think deeper than basic health checks, though; 'broken
server' is always a hard boolean to nail down.

------
peterwwillis
Production load balancers don't treat an HTTP response of 5xx as "success"
and therefore won't continue to send traffic there. They also have periodic
sanity checks of dynamic content which must match certain criteria or the
host gets flagged. Monitoring systems also keep track of various periodic
lows/highs/averages, tail the access and error logs, alert on unusual
criteria, and can trim hosts from the scoreboard in extreme cases.

You'd typically learn this after probably six months of running a large-scale,
continuously-deployed dynamic website, when it breaks from poorly-tested
configuration changes plus hardware issues. Sysadmins know this stuff. That's
why there is a job title of sysadmin and not "developer who does sysadmin
stuff sometimes".

~~~
nbm
5xx may be the correct response - sometimes the server is asked to do
something valid (i.e., not a 4xx where the client screwed up), but had an
error when it tried it.

No load balancer I know of will remove a web server that returns a single 5xx
from its healthy pool. It will need to use some heuristic, as Rachel points
out - some percentage based on the statistical norm. Otherwise it'll fail out
too many hosts and cause a problem.
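
Something along the lines of this untested sketch: only eject a host whose
error rate is far out of line with the rest of the pool, so a backend-wide
failure doesn't fail everyone out at once (the multiplier and floor are
arbitrary):

    # Untested sketch: fail out only hosts whose 5xx rate is wildly above the
    # pool average. The multiplier and minimum rate are arbitrary choices.
    def hosts_to_eject(error_rates, multiplier=5.0, min_rate=0.05):
        # error_rates maps host -> fraction of recent requests that were 5xx.
        pool_average = sum(error_rates.values()) / len(error_rates)
        return [
            host
            for host, rate in error_rates.items()
            if rate > min_rate and rate > multiplier * pool_average
        ]

If every host is erroring because a shared downstream dependency is broken,
nobody is five times the average, and nothing gets (incorrectly) removed.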

I think you're severely underselling developers. I've met people who have
only ever had the Software Engineer title and never installed a Linux
distribution who get this stuff at least as well as anyone who calls
themselves a systems administrator. Sysadmins don't have a monopoly on
understanding failure cases and failure handling - false positives, false
negatives, outliers and outlier detection, which metrics to look at. It's a
skill that comes from experience, which can happen (or not) whatever you call
yourself or whatever others call you.

I'm lucky enough that I get to focus on this type of problem, and while
there's definitely an aptitude portion, it is also a teachable skill - I see
my job partly as getting the team I'm working with to be able to do this stuff
when I'm not around. That usually means finding two or more people in the team
and cultivating their interest in it.

~~~
peterwwillis
If your sanity check in your load balancer passes only on a 200, it will fail
on a 500, disable the host, and keep retrying until it gets a 200 again. It
helps for there to be more than one single request to try in your sanity
check.

For "random" requests, if you get a 500 response, requests of the same "type"
should no longer be sent to that host. This can be changed based on
scoreboard settings. Depending on the context, you may choose to serve cached
content on 500s. This is one of the reasons multiple layers of cache and
application intelligence are so handy.

I'm not underselling anything. Domain-specific knowledge comes with
experience. If you ask a mechanical engineer 'What's wrong with my car if it
makes the noise "bang-sputz-sputz-screech-screech-screech?"', the engineer
will start making you lists of what parts can make each of those noises and
begin cross-referencing to see under what conditions a combination of them
might occur. The mechanic will immediately tell you that for your 1991
Mercury Sable, the A/F mixture is off, the MAF sensor needs cleaning, the
radiator has a crack, and the accessory belt needs replacing. Sysadmin is a
trade, not a skill.

~~~
nbm
Okay, if we're only considering active health checks, then I'm not sure any
load balancer considers a 5xx a success by default, let alone a "production"
load balancer.

For a non-healthcheck 5xx response, it is almost never clear that the host
itself is responsible for the 5xx. 5xx is the correct response when there is
an error on the server side (i.e., neither a client-side error nor a
successful response), but it doesn't mean the server is the problem - it just
means the server experienced a problem in serving the request. That failure
may come from one of the many RPCs the server made to other services. As
such, all web servers behind the load balancer will exhibit the 5xx response
for that request type (at some rate, depending on any state in connection
sharing/reuse between the servers and their upstream service), and all would
subsequently be removed - which isn't the correct response at all.

As someone who has had the job title "Systems Administrator" and the job title
"Software Engineer", and currently has neither but still does exactly what
he's always done - solving problems by understanding systems and, among other
things, by writing code - I wouldn't consider load balancing and failure
domains/types/handling as the sole or even primary purview of a systems
administrator - especially in the case of large installations.

~~~
peterwwillis
There are different kinds of load balancers, and as such different responses
to different criteria. If you don't want to serve 500 error pages to all your
users, one of your load balancers (or "proxy layers", for more or less
intelligent forms of load balancer) should be doing something when you're
getting 500s, like moving traffic around or serving different content. 500s
are far too often due to a machine-specific or network-specific problem to
just assume they'll resolve themselves or are unresolvable.

------
bjwbell
Corollary: Steve Yegge's comment in his Platforms Rant that monitoring and QA
are the same thing ([http://steverant.pen.io/](http://steverant.pen.io/))

~~~
slashnull
This piece is the gift that keeps on giving

------
mortehu
One way to alleviate this problem is to treat all failures as having a fixed
cost equal to an expensive successful request - e.g., treat all responses
with an HTTP status >= 400 as having taken 500ms. This works well even if
there's a steady stream of faulty requests, since it'll affect all backends
equally.
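
An untested sketch of what that looks like in a least-average-latency
balancer (the 500ms penalty is the number from above; the window size is
arbitrary):

    # Untested sketch: least-average-latency balancing where any response with
    # status >= 400 is charged a fixed 500ms, so "fast failures" don't make a
    # broken backend look attractive.
    from collections import defaultdict, deque

    FAILURE_COST_MS = 500.0
    WINDOW = 100  # recent responses to remember per backend (arbitrary)

    recent_costs = defaultdict(lambda: deque(maxlen=WINDOW))

    def record_response(backend, status_code, duration_ms):
        cost = FAILURE_COST_MS if status_code >= 400 else duration_ms
        recent_costs[backend].append(cost)

    def pick_backend(backends):
        def average_cost(b):
            costs = recent_costs[b]
            return sum(costs) / len(costs) if costs else 0.0
        return min(backends, key=average_cost)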

------
cakoose
Reminds me of the differential gear in cars and what happens when one wheel is
up in the air with no traction.

------
jrochkind1
> That is, it doesn't attempt to do any work, and instead just throws back a
> HTTP 5xx error

I'm surprised the OP doesn't suggest having your load balancer pay attention
to returned error codes.

If the load balancer knows what the average rate of 500 or non-200 responses
is, and one unit is returning unsuccessful responses at a much higher rate
than average, it would make sense for the load balancer to back off sending
to that machine - but maybe still send an occasional request there, so it can
notice when/if its error rate returns to normal.
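
i.e. something like this untested sketch, where a high-error host keeps a
small residual weight so a trickle of requests still reaches it and recovery
can be noticed (the threshold and weights are arbitrary):

    # Untested sketch: weighted host selection where error-prone hosts are
    # mostly, but never entirely, drained. Threshold and weights are arbitrary.
    import random

    def pick_host(hosts, error_rate, high_error=0.10, probe_weight=0.01):
        # hosts: list of host names; error_rate: host -> recent 5xx fraction.
        weights = [
            probe_weight if error_rate.get(h, 0.0) > high_error else 1.0
            for h in hosts
        ]
        return random.choices(hosts, weights=weights, k=1)[0]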

Do any load balancers work like this?

~~~
sytringy05
Not all load balancers support this (I recall some very expensive alteons that
didnt) but worse is when the LB does support it but it isn't configured to
check. Out of the box most LB's will do a TCP bind and if that works, then
fire away.
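
The difference in practice (untested sketch - the host, port, path, and
timeout are placeholders): the first check passes as long as something is
listening on the port, even if every real request would come back a 500.

    # Untested sketch: the "out of the box" TCP check vs. an actual HTTP check.
    # Host, port, URL, and timeout values are placeholders.
    import socket
    import urllib.request

    def tcp_check(host, port, timeout=2.0):
        # Passes if anything accepts the connection - even a broken app server.
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    def http_check(url, timeout=2.0):
        # Passes only if the health endpoint actually returns a 200.
        try:
            return urllib.request.urlopen(url, timeout=timeout).status == 200
        except Exception:
            return False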

The reverse of this issue can also cause problems, where you get a bad node
that accepts the request but never returns (or takes 2-3 minutes to respond).
As a rule, the timeouts will cause queuing, thread pool starvation, and
general breakage all the way back up your request chain to whatever is facing
the internet, where your site will either hang or give back a 503 page.

------
pmontra
Some applications could have to serve fast requests (maybe ajax calls to a
json api) and slow requests (full page rendering) from the same servers. In
this case the technique from the post should be tweaked by creating two
classes of requests with different averages and variances. A server should be
taken out from the pool only if it doesn't belong to any of the classes.
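
Roughly like this untested sketch: keep stats per request class and only fail
a server that is an outlier in every class it serves (the z-score cutoff is
arbitrary):

    # Untested sketch: per-class outlier test. A server is only suspect if it
    # looks like an outlier in every request class it serves (e.g. both the
    # fast JSON-API class and the slow page-render class). Cutoff is arbitrary.
    import statistics

    def is_outlier(server_mean, class_means, cutoff=3.0):
        # class_means: this class's mean response time on every server in the pool.
        mu = statistics.mean(class_means)
        sigma = statistics.pstdev(class_means)
        return sigma > 0 and (server_mean - mu) / sigma > cutoff

    def should_eject(server_means_by_class, all_means_by_class):
        # Eject only if the server is an outlier in every class it serves.
        return all(
            is_outlier(server_means_by_class[cls], all_means_by_class[cls])
            for cls in server_means_by_class
        )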

------
dkarapetyan
I think HAProxy can make sure the response code for HTTP backends is actually
200 and will mark the connection down if it gets 500 or 400.

Just goes to show you that thinking only of the happy path can often lead you
astray.

~~~
memnips
HAProxy absolutely can support an HTTP response code health check, but in my
experience out of the box it just makes sure the port (say 80) is open. I
learned this once the hard way and will never make that mistake again... ;)

------
packetized
The article is an interesting mental exercise in using statistical analysis to
identify issues in infrastructure, but the configurations that are assumed to
be in use demonstrate an exceptional lack of understanding around modern L4-7
load-balancing solutions, whether hardware or software. 99% of the 'issues' in
the article are solved by features that are present in nearly every COTS or
open-source solution - they just require knowledgeable people to configure and
tune them for the system in question.

More than a few real-world front-end implementations lack the kind of rigorous
instrumentation that's necessary to identify problems of the kinds mentioned
in the article. While proper configuration and experienced ops folks tuning
said infrastructure can solve most of the stated problems, very fine-grained
monitoring is sometimes the only thing that allows for effective
troubleshooting when things really go L-shaped.

