
Interns with toasters: how I taught people about load balancers - deafcalculus
https://rachelbythebay.com/w/2018/04/21/lb/
======
snovv_crash
This reminds me strongly of another system of distribution, which suffers from
the same effect: differential gearboxes in cars.

Since the torque to all the wheels is equal, if one wheel slips it very
quickly takes all of the engine's power, because power = rotational
velocity * torque. Only if the speeds are similar is the allocation of power
to the wheels similar.
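
(Back-of-the-envelope illustration of that, with made-up numbers: an open diff
gives both outputs the same torque, so whichever side spins faster takes
proportionally more of the power.)

    torque = 100                                  # N*m to each wheel through an open diff
    p_grip, p_slip = torque * 10, torque * 100    # P = torque * angular velocity (rad/s)
    print(p_slip / (p_grip + p_slip))             # ~0.91: the slipping wheel takes ~91% of the power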

In cars, the solution is to make sure that the power is distributed fairly
evenly even if a wheel loses traction. This used to be done with thick grease
inside the differential gearbox to limit the difference in speeds - more
recently complex gearboxes and traction control achieve the same thing, but
through constantly monitoring the speeds of each wheel and braking wheels that
are spinning out of control.

It's interesting how such disparate distribution systems seem to have such
similar failure modes. I wonder if the two sides could learn something from
each other here.

~~~
abakker
There are a variety of mechanical systems used to combat this problem, not
just thick grease.

Limited Slip differentials can be employed to ensure that one wheel can only
slip so far before the differential locks and then both wheels are forced to
rotate together. LSDs can be employed in 3 positions front, rear, and center.
The front and rear will lock left and right wheels together, the center one
will lock the rotation speed of the front and back axles.

You can also add manually selectable locking differentials in any of those
positions instead. Frequently, you see LSDs in front and back and manual
lockers in the center.

In off road applications it is more common to see manually selectable lockers
in every position. Because of the lack of differentials, though, manual
lockers put a lot of load on the drivetrain if the surface isn’t slippery.

The alternative, traction control, can use variable differentials to send
different amounts of power to different wheels, typically by applying the
brakes on the slipping wheel to force power to distribute more evenly.

Obviously, there is a lot to this, but that highlights some of the options.

~~~
pitaj
There's also the Torsen differential [1]

[1]: [https://youtu.be/JEiSTzK-A2A](https://youtu.be/JEiSTzK-A2A)

~~~
bigiain
And for those on a budget - what we in Australia call "The CIG Locker"...

[http://www.billzilla.org/diffs.htm](http://www.billzilla.org/diffs.htm)

(They _kinda_ work on rally cars and hillclimbers...)

------
vasco
And that's why you always set up health checks on servers behind a load
balancer, so that bad ones get taken out as soon as possible. You then get
another interesting problem: if the server only spits errors when serving
requests, it'll be marked healthy again and forever toggle between healthy
and unhealthy. But that you can also solve.

~~~
mabbo
The mistake I've seen here is having the "health check" do little or nothing.
In many web services, I've seen a /ping or /health API that the load balancer
calls to ask "are you healthy?". But people get rushed or lazy, or requirements
change, and they wind up with that API just doing "return true".

Now you've got a host that can't access a database or do much of anything -
but it can return true!

Health check APIs should always do some level of checks on dependencies and
complex internal logic (maybe a few specific unit/integration tests?) to
ensure things are truly healthy.

~~~
s_kilk
At work we have two endpoints on every service, `/status` and `/health-check`.

The former is basically a "return true" endpoint, which can tell you if the
service is alive and reachable. The latter will usually do something like
"select 1;" from any attached databases and only succeed if everything is OK.

~~~
randerson
And the former is the one you want your Load Balancers to be checking. With a
deep health check, even a brief database outage will cause every web server to
be taken out of rotation, and then you're completely down for at least as many
health check intervals as it takes for the LB to consider a host healthy
again. Same goes for any other shared resource that is likely to affect all
web servers if it becomes unavailable.

~~~
azernik
Or you just have smarter load balancers that realize when _all_ of their
servers are acting funky and stop taking servers out of rotation.

~~~
Terr_
Or to riff on this idea further: You make the concept of a node's "relative
stability" part of the load-balancer logic the same way that "relative
idleness" is a factor.

Then if all 100/100 nodes get taken down by some shared problem, the system
simply degenerates into picking the idle-est of the 100.
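
One way to sketch the idea (purely illustrative, not how any particular load
balancer does it): score each node by its outstanding requests plus a penalty
for its recent error rate, and pick the lowest score. When a shared problem
gives every node roughly the same error rate, the penalty cancels out and the
choice falls back to plain least-busy.

    import random

    def pick_backend(backends):
        # backends: [{"in_flight": int, "error_rate": float}, ...]
        # Busy nodes and flaky nodes are both penalized; the weight of 10 is
        # arbitrary. If all nodes share the same error rate (say, a common
        # dependency is down), the penalty stops discriminating and this
        # degenerates into least-outstanding-requests.
        def score(b):
            return b["in_flight"] + 10 * b["error_rate"]
        best = min(score(b) for b in backends)
        return random.choice([b for b in backends if score(b) == best])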

------
stephengillie
At a real estate webhost in 2014, we had a small web farm behind a single load
balancer. I've written some previous HN posts about the interesting
architecture choices made by the lead architect. On top of that, those were the
days of "move fast and break things", so developers got admin access to servers
and would develop against live sites. Fun times keeping a web farm online all
night long.

Partly because it was such a small operation, we heavily instrumented the web
servers with PRTG, along with hitting a number of key sites every minute, on
each web server. "When XYZRealty goes down, so do all of these other sites!"
"We'll put a sensor on XYZRealty."

This gave us great data about the health of the servers, including identifying
bad apples, and even aiding in performance testing of new modules. We were
able to catch memory leaks and processing spikes before they broke our sites.
And when 64-bit modules were ready to replace the 32-bit modules, we had
baseline data ready to compare and evaluate.

Not that this won't scale - quite the contrary. It does generate a lot of
data, though, and requires dedication to maintain.

------
cecilpl2
I see that on my continuous integration system. We use TeamCity with ~50
agents for build tasks that take 20-30 minutes. Each agent can only run a
single task at a time.

During the day, all agents are busy and the queue fills up with 30 or so
pending tasks.

If one agent gets into a bad state where, say, it fails to checkout from
source control and fails within the first 20 seconds of a build, then it will
very quickly chew through the entire queue of pending tasks, failing them all.

You'd think the more agents you have the better insulated you are from the
failure of a single one, but this particular failure mode actually becomes
more common the more agents you add!!!
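
(A toy simulation makes it obvious - this isn't TeamCity's actual scheduling
logic, just "first free agent wins":)

    GOOD, BAD = 25 * 60, 20            # seconds per job: a normal build vs. an instant failure
    durations = [GOOD] * 49 + [BAD]    # agent 49 fails every job within 20 seconds
    free_at = list(durations)          # all 50 agents start out busy with one job each
    failed = 0

    for _ in range(30):                # 30 jobs waiting in the queue
        a = min(range(50), key=lambda i: free_at[i])   # next agent to become free
        if durations[a] == BAD:
            failed += 1
        free_at[a] += durations[a]

    print(f"{failed} of 30 queued jobs were grabbed (and failed) by the bad agent")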

~~~
ithkuil
> this particular failure mode actually becomes more common the more agents
> you add

I'm surprised. I would expect the length of the pending job queue to also have
an impact on this failure _mode_ (as opposed to the failure _cause_), and the
queue length is inversely proportional to the number of agents, flooring at 0
when you have more agents than necessary to deal with peak demand.

If the bad agents (I'm assuming that the failure to check out from SCM lies in
the agents themselves, not in an SCM server overloaded by too many agents
polling; otherwise this is not an example relevant to the thread) are e.g. 1 in
every 50 agents, and the number of agents is much larger than the peak load
(e.g. a million agents), the probability of a job failing approaches 1/50,
because the good agents blocked on doing work are just a small fraction and
thus don't significantly skew the probability of a new job getting scheduled to
a bad agent (which is more likely to be ready).

If on the other hand you have fewer agents than peak load, the queue will
contain some jobs, and bad agents will chew through the whole queue when they
get the chance, as you described.

And they always get the chance, since they are almost always ready (if they
fail very quickly), so that the pool of ready agents will almost always
contain only bad agents.

If you add more agents, you have to deal with these bad agents "floating on
top of the pool", plus the new ones you add (at a rate of e.g. 1 in 50). So, if
you have 5 bad agents and you add another 5, you still have a pretty high
chance that new jobs will land on the bad ones.

But you can keep adding more, and the probability of success just gets better.

~~~
cecilpl2
That is a good point and I like the visual metaphor of bad agents "floating on
top of the pool".

I was considering the fairly narrow problem domain of, say, 20 to 40 agents,
where the queue fills up to perhaps 50 pending tasks mid-day. You are correct
that once you add enough agents to keep the queue empty, additional agents
reduce the chance of failure.

Sadly, though, the CI system I use (TeamCity) prioritizes agents that have a
history of completing that particular task the fastest. That's great if your
agents all have different specs, but it fails spectacularly in this case, where
a bad agent will be selected over a good one.

------
logronoide
To summarize: “implement health checks to take bad servers out of the service
pool, always!”

Love the very didactic way of writing. Perfect for managers, not interns!

------
grosjona
That's why I like random load balancing. If each machine is powerful enough
and can handle a few thousand users, then the distribution averages out.

Smart load balancers are only really necessary if you have inefficient servers
that can't handle more than 100 connections per second, and they're difficult
to get right.

If you toss a coin 10 times, you're much more likely to get >=80% heads than
if you were to toss that coin 1000 times.
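
(Quick sanity check of the coin-toss claim with the binomial distribution:)

    from math import comb

    def p_at_least(k, n, p=0.5):
        # probability of getting k or more heads in n fair coin tosses
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    print(p_at_least(8, 10))      # >= 80% heads in 10 tosses: about 0.055
    print(p_at_least(800, 1000))  # >= 80% heads in 1000 tosses: effectively zero (~1e-85)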

~~~
jakobegger
That model works if you have a large number of short lived requests.

On the other hand, if you have a small number of long running requests that
you want to distribute over a number of servers, then your load balancer needs
to track which servers are busy, or you'll spend a lot of time waiting for the
randomly assigned server to finish, while others are idle.

~~~
rollcat
Then you design your system to always respond as quickly as possible. HTTP
even has a status code for that: 202 Accepted. You put the job in a queue and
return some ID or cookie to the client, possibly also an ETA. The client can
then poll for results (with bounded exponential back-off); again, polling is
just a quick key look-up. Alternatively, if you have some means to push results
to clients (like WebSockets), use that.

If Python is your thing, Celery makes this pattern trivial to implement.
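
(Very rough sketch of the shape of it, assuming Flask plus Celery with a Redis
broker; the routes and the task are made up for illustration:)

    import time
    from celery import Celery
    from flask import Flask, jsonify, request

    tasks = Celery("tasks", broker="redis://localhost:6379/0",
                   backend="redis://localhost:6379/0")
    app = Flask(__name__)

    @tasks.task
    def process_upload(payload):
        time.sleep(5)                  # stand-in for the slow work
        return {"echoed": payload}

    @app.route("/jobs", methods=["POST"])
    def submit():
        job = process_upload.delay(request.get_json())
        return jsonify({"job_id": job.id}), 202       # respond immediately

    @app.route("/jobs/<job_id>")
    def poll(job_id):
        result = tasks.AsyncResult(job_id)            # polling is just a key lookup
        if not result.ready():
            return jsonify({"state": result.state}), 200
        return jsonify({"state": result.state, "result": result.result}), 200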

This also simplifies so many other things: for example, deployments (and
rollbacks!) can proceed much faster, since you don't need to wait obnoxiously
long for all connections to drain.

Of course you're pushing more complexity up the stack, but in my experience
it's a good trade-off.

~~~
hluska
I've spent enough time building systems and doing ops that this complexity
genuinely scares me. Have you used this pattern in highly scalable production
apps? What's it like to debug when everything goes to hell?

~~~
rollcat
So first, this is not my idea, but something that people much smarter than me
came up with.

This is the pattern that AppEngine enforces. Apps that regularly take more
than 1s to respond are penalized, and there's a 60s hard deadline on all
requests. Things like a task queue or push channels are built-in to the
platform, with a high-level API exposed, so you just focus on writing the
application / business logic.

Celery tries to basically do the same for you, w/o the proprietary Google
SDKs.

> What's it like to debug when everything goes to hell?

From my experience so far... You will find bugs in your application code way
more often than in a battle-tested task queuing system, much like you're quite
unlikely to find a bug in nginx or Python.

Fixing things is MUCH easier than in a request / response model. I know it's
not a "web scale" example, but imagine an installation involving camera rigs,
some networked hardware, a transcoding server, etc. Dude walks into a photo
booth, types in his email/phone number on a tablet, etc. When I see a failure
in a component of the pipeline, I can hack together and deploy a patch in
seconds / minutes (from alert to running the fix in production) and just tell
the system to retry a failed job. Dude gets his silly photo via email/sms
within two minutes instead of one.

Picture a similar situation in a web application that processes user-uploaded
media. You write your batch processing logic to bail early on any sign of
error, and when you get the chance to fix some bug that e.g. affected 3% of
your users, again - you resubmit the failed jobs and users just see the result
with a delay, instead of having to re-upload.

~~~
hluska
Thanks for the response - I understand it much better now.

------
indigodaddy
Not sure how relevant this is to Rachel's incident, but it looks like they use
ECMP/BGP->shiv(L4)->proxygen(L7), so it's hard to believe that health checking
wasn't in the mix. If the nodes were passing the health check but still not
properly serving requests, then I'd assume one of the post-mortem items would
have involved improving the health checks.

Found this pretty cool presentation/PDF about FB's load-balancing
architecture. Stays fairly high level:
[http://www.esnog.net/gore16/gore16-files/Mikel_Billion_user....](http://www.esnog.net/gore16/gore16-files/Mikel_Billion_user.pdf)

------
lazyant
Great "teaching by showing" example but I don't get the initial technical
scenario; you have sophisticated (ie, not RR / random) load balancers that
keep track of queues in web servers, so they have to get some information back
from them (when job was completed, from HTTP response for ex), but somehow
don't react to 500 errors? seems like badly configured LBs. Still scenario can
be used as teaching or interview question.

~~~
camtarn
I've worked on a system which communicated errors by responding with a 200 and
then indicating the error in the response body!

Similarly, you might have e.g. a server which is responding to a 'get all
widgets in category' call by quickly responding '200 OK, no widgets found',
and thus ends up sucking up all the traffic.

~~~
asmithmd1
Exactly, you need to have a health check that actually exercises all the
layers of the system and returns real data.

You don't want to use actual data, so you add a dummy account.

Someone deletes the dummy account, and all the servers that are working
perfectly start reporting that they are sick.

------
pulkitsh1234
Maybe some people dismissed her problem as 'impossible' because she didn't say
what kind of load-balancing technique the load balancer was using?

      I suspect what happened is that they didn't understand the problem, and
      so resorted to ineffective means to steer the attention away from their
      own inadequacies.

^ This statement is kinda harsh.

~~~
gaius
Every post on that blog is about how the author is incredibly clever and
surrounded by dolts.

~~~
bestes
Many of her posts are about teaching or communicating issues like this to
others. In other words, there is no reason to give a class like this if she
doesn’t actually know more than the interns.

------
KaiserPro
> "some of the commenters dismissed either it ("impossible") or me ("doesn't
> know anything about load balancing"). "

This to me is odd. Was this posted when AWS's ELBs were new and shiny?

Most of the big failure cases I've dealt with are along these lines.

One server does something stupid and gobbles up the world.

That being said, this is a very neat way of describing the problem. I shall be
referencing it in the future; I might start putting it in an interview
question...

~~~
AdmiralAsshat
Given the author's presumed gender, I understood it to be thinly-veiled
sexism. But I could be wrong.

~~~
KaiserPro
Perhaps, or just typical dickishness. (Mind you, it's the same thing really.)

As a devop I get told all sorts of "facts" about various building blocks.

We are currently looking at a low-latency API (read: not using HTTP); half of
the suggestions I got back were about using Haskell or Go. No one bothered with
the low-hanging fruit, like re-using TCP sockets, using predefined schemas
instead of JSON, etc.

Not one person thought about LB latency...

------
hightowk
>and so on down the line until everyone had bread. At various points, a
toaster would finish and would pop up. ... I'd notice that they were done and
would run over to give them more toast.

You'd give them more bread, not toast. Toast is already processed bread :).

~~~
rachelbythebay
Ah! Correct you are. Fixed. Thanks!

------
sjwright
A great story, but I'm distracted by the word _caromed._ How did I not know
about that word? Now I feel the need to inject it into my active lexicon.

------
ttflee
A protocol designed without considering the possibility of an insane
counterpart is doomed to DDoS the systems that use it.

------
unlivingthing
If possible can you share the link to your lectures? (:

------
amingilani
Who is Rachel and how do her short and simple stories always hit the front
page?

They're interesting but I always think they're a little _too_ simple. I mean
this entire thing can be summed up with:

 _500s (and other errors) are returned faster than processed requests. Load
balancers will find a misbehaving server's queue empty more often and give it
all the requests_
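
(That summary is easy to sanity-check with a toy model - an event-loop
simulation of least-outstanding-requests balancing with one server that fails
instantly; all numbers are made up:)

    import heapq

    service_ms = [200] * 9 + [5]       # nine healthy servers; the tenth 500s in 5 ms
    in_flight = [0] * 10
    served = [0] * 10
    finishing, now = [], 0.0           # min-heap of (finish_time, server)

    for _ in range(10_000):            # requests arrive every 10 ms
        now += 10
        while finishing and finishing[0][0] <= now:    # retire completed requests
            _, s = heapq.heappop(finishing)
            in_flight[s] -= 1
        s = min(range(10), key=lambda i: in_flight[i]) # least outstanding requests
        in_flight[s] += 1
        served[s] += 1
        heapq.heappush(finishing, (now + service_ms[s], s))

    print(served)   # the fast-failing server ends up with the majority of the traffic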

~~~
jakobegger
Everything seems trivial once you understand it. The author seems to be good
at making people understand, so that it seems trivial afterwards.

Knowing nothing about load balancers, had I read your summary before Rachel's
article, I would probably have just glossed over it, and I would not have
understood its profound importance for designing load balancers.

Rachel's story kept me engaged and explained the problem in a way that I will
surely remember a lot longer.

Teaching is not about reciting facts -- that's what reference manuals are for.
Teaching is for engaging students, making them interested, giving background
on a topic, telling you why something matters, making it easy to understand,
and making sure the student remembers the most important bits.

It's something I have been struggling with a lot with junior devs and interns.
I show them what they are doing wrong, how to do it correctly, and the next day
they make the same mistake again. If people were perfect logical machines,
just telling them facts might be sufficient. Unfortunately people are people,
and it's really really hard to teach them something effectively.

~~~
arkades
I think the truth lies in the middle somewhere. I knew absolutely nothing
about load balancing before reading OP's comment and the blog post, and while
you're correct that I wouldn't have retained OP's bullet-point version as well
as Rachel's, Rachel really went on at a length far beyond what was needed to
tell the story / explain the concept (e.g., the beach-ball-sized bag of bread).

People need an intuitive concept and image to grasp and reason on while they
master an idea, but that doesn't justify any arbitrary amount of
over-elaborating.

~~~
mrmanner
I found that the (over)-elaboration made the article funnier to read. Since
it's a Sunday, I wouldn't have read it if it was a boring, efficient,
fact-crammed text reading like something from one of the more boring computer
science classes I take. I did finish this though, and learned something on the
way.

So my point is: There are times when efficient and to the point is called for,
and there are times when it's not.

------
kelukelugames
>When it landed on certain web sites, some of the commenters dismissed either
it ("impossible") or me ("doesn't know anything about load balancing").

Does anyone have a link to what she is talking about?

------
darshitpp
> This is what happened when one bad web server decided it was going to fail
> all of its requests, and would do so while incurring the absolute minimum
> amount of load on itself.

Good ELI5 explanation, but it doesn't really explain why the webserver failed
the requests as it did. Or maybe I'm missing something?

~~~
masklinn
> it doesn't really explain why the webserver failed the requests as it did.
> Or maybe I'm missing something?

That the article is about load balancers and how a single "rogue actor" can
have an outsized effect on the entire thing. The failing webserver is beside
the point.

~~~
darshitpp
Oh, ok

------
BillinghamJ
I don't really see how this is an issue engineers need to be particularly wary
of?

Firstly, your typical load balancer doesn't work this way anyway. It will just
keep feeding requests to the application hosts on a round robin or random
basis. Most don't keep track of how busy each instance is.

Secondly, any decent (HTTP/layer 7) load balancer will notice if an instance
is returning exclusively 5xx errors and will stop routing requests to it. It
would fail even the most basic of health checks.

~~~
captn3m0
AWS ELBs work this way. They route more requests to the instances with the
shortest backlog.

~~~
BillinghamJ
Do you have a link to any documentation mentioning this?

As far as I’m aware - using ELBs heavily day to day - it doesn’t maintain a
backlog and simply distributes requests on a round robin basis the moment they
come in. There is no queue of waiting requests I don’t believe. If the request
can’t be fulfilled immediately, it is rejected.

~~~
nvarsj
From the AWS docs:

> With Classic Load Balancers, the load balancer node that receives the
> request selects a registered instance using the round robin routing
> algorithm for TCP listeners and the least outstanding requests routing
> algorithm for HTTP and HTTPS listeners.

------
andrew_wc_brown
Based on how verbose this article is, I think maybe her original article was
misunderstood because it was a chore to sort out the junk from the content,
just like this one.

Too much flourishing, poor pacing.

~~~
mrmanner
Just the right amount of flourishing for a lazy Sunday :)

------
scopecreep
I feel like if you have to teach them about load balancers as college interns,
FB needs to find a better school to pull interns from.

~~~
rajacombinator
What kind of college teaches about load balancers?

~~~
BillinghamJ
If they’re not teaching about the most basic principles of building scalable
systems, what is the point of doing a degree, or hiring people who have done a
degree?

~~~
mrgordon
Several of the top CS programs specifically do not teach details like
configuring a load balancer; with a proper background in computer science, it
is assumed that you can figure out the latest programming languages and the
install instructions for popular software packages. It would thus be a waste of
a very expensive degree, as you could learn about configuring LAMP and a load
balancer at a trade school instead of at a theoretical research university.

No one hires MIT grads because they graduate with more extensive load balancer
knowledge.

~~~
walshemj
That is what you get the technicians to do.

