Because the torque to all the wheels is equal, if one wheel slips it very quickly takes all of the engine's power, since power = rotational velocity * torque. Only if the wheel speeds are similar is the power allocated to the wheels similar.
In cars, the solution is to make sure that the power is distributed fairly evenly even if a wheel loses traction. This used to be done with thick grease inside the differential gearbox to limit the difference in speeds - more recently complex gearboxes and traction control achieve the same thing, but through constantly monitoring the speeds of each wheel and braking wheels that are spinning out of control.
It's interesting how such disparate distribution systems seem to have such similar failure modes. I wonder if the two sides could learn something from each other here.
Limited-slip differentials can be employed to ensure that one wheel can only slip so far before the differential locks and both wheels are forced to rotate together. LSDs can be employed in three positions: front, rear, and center. The front and rear ones lock the left and right wheels together; the center one locks the rotation speed of the front and rear axles.
You can also add manually selectable locking differentials in any of those positions instead. Frequently, you see LSDs in front and back and manual lockers in the center.
In off-road applications it is more common to see manually selectable lockers in every position. Because they eliminate the differential action, though, manual lockers put a lot of load on the drivetrain if the surface isn't slippery.
The alternative, traction control, can use variable differentials to send different amounts of power to different wheels, typically by applying the brakes to the slipping wheel to force the power to distribute evenly.
Obviously, there is a lot more to this, but that highlights some of the options.
The design consists of a series of plates with holes cut in them and relies on a dilatant fluid (silicone paste) filling the holes, which gives significant resistance to shearing. This allows the outputs to run at different speeds from each other, up to the point where the paste gets hot from shearing and the pressure increase locks the plates to each other. When the output speeds are similar, the diff doesn't lock up but will allow one to overspeed the other.
One main benefit of this design over others is that by changing the number of plates you can change the amount of torque sent to the outputs. When installed with twice as many discs at the rear as at the front, you end up with twice as much torque to the back wheels, as in the transfer case of an E30 BMW 325iX. Effectively there is no mechanical connection between the engine and the wheels until the transfer case input starts moving, at which point it starts to lock up and transfer torque to the rear. The rear diff outputs are not mechanically connected until the shearing in it causes an increase in pressure. But the increase in pressure happens so quickly - lockup within milliseconds - that you wouldn't know it while driving.
(They _kinda_ work on rally cars and hillclimbers...)
Now you've got a host that can't access a database or can't do much of anything - but it can return true!
Health check APIs should always do some level of checks on dependencies and complex internal logic (maybe a few specific unit/integration tests?) to ensure things are truly healthy.
10 500s in a row? Go to the timeout chair until you get better.
The former is basically a "return true" endpoint, which can tell you if the service is alive and reachable. The latter will usually do something like "select 1;" from any attached databases and only succeed if everything is OK.
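For illustration, a minimal sketch of the two kinds of endpoint in Flask + SQLAlchemy (the endpoint names and DSN are made up, not from the article):

```python
from flask import Flask
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine("postgresql://app@db/main")  # hypothetical DSN

@app.route("/alive")
def alive():
    # Shallow check: the process is up and answering. Nothing more.
    return "OK", 200

@app.route("/healthy")
def healthy():
    # Deep check: actually touch a critical dependency.
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    except Exception:
        return "database unreachable", 503
    return "OK", 200
```

A balancer can then poll the shallow endpoint every few seconds and the deep one less often.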
Then if all 100/100 nodes get taken down by some shared problem, the system simply degenerates into picking the idle-est of the 100.
This allows you to have less frequent checks that are more in-depth, and basic ones that are just the `return true` type.
I guess you're correct that you can have a widespread db outage that makes many of them fail, but then new ones should be coming up very fast, since you can set a minimum number of deployed instances taking requests. I think you can get very close to stability even in this circumstance.
It can check the database connection, redis, cache, email, up-to-date migrations, and S3 credentials.
There are so many types of load balancing, and so many different ways to guard, simply with best practices, against what this story explains.
I also found the 'quadruple hump' graph reference difficult to follow without more backstory on what the graph was representing.
I also don't understand why it would be so difficult to grep through the logs of that load balancer (or load balancers) and find that common fault on the backend.
During the day, all agents are busy and the queue fills up with 30 or so pending tasks.
If one agent gets into a bad state where, say, it fails to check out from source control and fails within the first 20 seconds of a build, then it will very quickly chew through the entire queue of pending tasks, failing them all.
You'd think the more agents you have the better insulated you are from the failure of a single one, but this particular failure mode actually becomes more common the more agents you add!!!
I'm surprised. I would expect the length of the pending job queue to also have an impact on this failure _mode_ (as opposed to the failure _cause_), and the queue length is inversely proportional to the number of agents, flooring at 0 when you have more agents than necessary to deal with peak demand.
If the bad agents are, say, 1 in every 50, and the number of agents is much larger than the peak load (e.g. a million agents), the probability of a job failing approaches 1/50: the good agents blocked doing work are just a small fraction of the pool, so they don't significantly skew the probability of a new job getting scheduled onto a bad agent (which is more likely to be ready). (I'm assuming the failure to check out from SCM lies in the agents themselves, not in an SCM server overloaded by too many agents polling; otherwise this isn't an example relevant to the thread.)
If on the other hand you have fewer agents than peak load, the queue will contain some jobs, and bad agents will chew through the whole queue when they get the chance, as you described.
And they always get the chance, since they are almost always ready (if they fail very quickly), so that the pool of ready agents will almost always contain only bad agents.
If you add more agents you still have to deal with these bad agents "floating on the top of the pool", plus the new bad ones you add (at the rate of e.g. 1 every 50). So, if you have 5 bad agents and you add another 5, you still have a pretty high chance that new jobs will land on the bad ones.
But you can add more and the probability of success just gets better.
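Out of curiosity, here's a quick throwaway simulation of that dispatch dynamic - one bad agent (as in the anecdote upthread) that fails jobs in seconds, good agents that take minutes, and each job handed to a random free agent; all timings are made up:

```python
import heapq, random

def failure_rate(n_agents, n_jobs=5000, arrival_gap=30,
                 good_time=600, bad_time=20):
    """Agent 0 fails every job after `bad_time` seconds; the rest
    succeed after `good_time`. A job arrives every `arrival_gap`
    seconds and goes to a random free agent (or waits in the queue)."""
    free = list(range(n_agents))
    busy = []          # (finish_time, agent) min-heap
    queued = 0
    failed = 0
    t = 0.0
    for _ in range(n_jobs):
        t += arrival_gap
        # agents that finished since the last arrival become free again
        while busy and busy[0][0] <= t:
            _, agent = heapq.heappop(busy)
            free.append(agent)
        queued += 1
        # hand queued jobs to free agents
        while queued and free:
            queued -= 1
            agent = free.pop(random.randrange(len(free)))
            if agent == 0:
                failed += 1
                heapq.heappush(busy, (t + bad_time, agent))
            else:
                heapq.heappush(busy, (t + good_time, agent))
    return failed / n_jobs

for n in (15, 25, 40, 100):
    print(f"{n:3d} agents: {failure_rate(n):.0%} of jobs failed")
```

Below the point where the good agents can keep up with arrivals, the bad agent soaks up the backlog because it is nearly always free; well past that point it is just one of many idle agents, so its share of jobs keeps shrinking.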
I was considering the fairly narrow problem domain between, say, 20 and 40 agents, where the queue fills up to perhaps 50 pending tasks mid-day. You are correct that once you add enough agents to keep the queue empty, additional agents reduce the failure chance.
Sadly, though, the CI system I use (TeamCity) prioritizes agents that have a history of completing that particular task the fastest. That's great if your agents are all different specs, but fails spectacularly in this case, where a bad agent will be selected above a good one.
Partly because it was such a small operation, we heavily instrumented the web servers with PRTG, along with hitting a number of key sites every minute, on each web server. "When XYZRealty goes down, so do all of these other sites!" "We'll put a sensor on XYZRealty."
This gave us great data about the health of the servers, including identifying bad apples, and even aiding in performance testing of new modules. We were able to catch memory leaks and processing spikes before they broke our sites. And when 64-bit modules were ready to replace the 32-bit modules, we had baseline data ready to compare and evaluate.
Not that this won't scale - quite the contrary. Though it creates some data and requires dedication to maintain.
Love the very didactic way of writing. Perfect for managers, not interns!
Smart load balancers are only really necessary if you have inefficient servers that can't handle more than 100 connections per second, and they're difficult to get right.
If you toss a coin 10 times, you're much more likely to get >=80% heads than if you were to toss that coin 1000 times.
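For concreteness, the exact binomial numbers (straightforward to compute, not from the article):

```python
from math import comb, ceil

def p_at_least(frac, n):
    """Probability of getting at least frac*n heads in n fair coin tosses."""
    k_min = ceil(frac * n)
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

print(p_at_least(0.8, 10))    # roughly 0.055
print(p_at_least(0.8, 1000))  # effectively zero
```

Which is also why a handful of servers behind a random balancer can end up badly unbalanced while a large fleet generally won't.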
On the other hand, if you have a small number of long running requests that you want to distribute over a number of servers, then your load balancer needs to track which servers are busy, or you'll spend a lot of time waiting for the randomly assigned server to finish, while others are idle.
If Python is your thing, Celery makes this pattern trivial to implement.
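For anyone who hasn't seen it, a minimal Celery sketch (the broker URL and task body are placeholders, not anything from the article):

```python
from celery import Celery

# Point the broker at your own Redis or RabbitMQ instance.
app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def transcode(video_id):
    # The long-running work happens in a worker process,
    # not in the web request that enqueued it.
    ...

if __name__ == "__main__":
    # From a web handler you would just enqueue and return immediately:
    transcode.delay(42)
```

The web tier answers immediately and the workers pull jobs at their own pace.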
This also simplifies so many other things: for example, deployments (and rollbacks!) can proceed much faster, since you don't need to wait obnoxiously long for all connections to drain.
Of course you're pushing more complexity up the stack, but in my experience it's a good trade-off.
This is the pattern that AppEngine enforces. Apps that regularly take more than 1s to respond are penalized, and there's a 60s hard deadline on all requests. Things like a task queue or push channels are built into the platform, with a high-level API exposed, so you just focus on writing the application / business logic.
Celery basically tries to do the same for you, without the proprietary Google SDKs.
> What's it like to debug when everything goes to hell?
From my experience so far... You will find bugs in your application code way more often than in a battle-tested task queuing system, much like you're quite unlikely to find a bug in nginx or Python.
Fixing things is MUCH easier than in a request / response model. I know it's not a "web scale" example, but imagine an installation involving camera rigs, some networked hardware, a transcoding server, etc. Dude walks into a photo booth, types in his email/phone number on a tablet, etc. When I see a failure in a component of the pipeline, I can hack together and deploy a patch in seconds / minutes (from alert to running the fix in production) and just tell the system to retry a failed job. Dude gets his silly photo via email/sms within two minutes instead of one.
Picture a similar situation in a web application that processes user-uploaded media. You write your batch processing logic to bail early on any sign of error, and when you get the chance to fix some bug that e.g. affected 3% of your users, again - you resubmit the failed jobs and users just see the result with a delay, instead of having to re-upload.
Think of something like video transcoding, or complex database queries. At some point you are going to have a number of servers that do the work, and a load balancer needs to balance the work between them. If one server is busy with a job that will take an hour, there's no point in randomly queueing up lots of work while others are idle.
Except now you can use a round-robin/"dumb" LB on the frontend, fixing that exact issue (and a whole class of other problems).
Yes, a broken worker will quickly drain the queue, except once you find out and remove it, you can (hopefully) resubmit the failed jobs. It gives you a new primitive to work with - HTTP requests are 1:1 to HTTP responses, but a job's result can be updated, and the new result propagated up the chain. (Think: media transcoding / post-processing, rendering pipelines, data analysis, etc.)
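A hedged sketch of that retry primitive with Celery (the task name, helper, and exception type are invented for illustration):

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

class TranscodeError(Exception):
    """Placeholder for whatever your pipeline actually raises."""

def run_transcode(media_id):
    ...  # the real work would live here

@app.task(bind=True, max_retries=5, default_retry_delay=60)
def process_upload(self, media_id):
    try:
        run_transcode(media_id)
    except TranscodeError as exc:
        # Instead of surfacing an error to the user, put the job back
        # on the queue; the result simply shows up a bit later.
        raise self.retry(exc=exc)
```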
> that would require totally changing the application architecture
True, usual cost/benefit assessment when dealing with technical debt, etc.
When you're dealing with big numbers across relatively few servers, randomness averages out everything.
When you have a sample of 10k users, they tend to behave consistently with any other 10k sample... Assuming that all machines are the same size and the code is the same on all machines.
Found this pretty cool presentation/PDF about FB's load-balancing architecture. Stays fairly high level: http://www.esnog.net/gore16/gore16-files/Mikel_Billion_user....
And what if the responses are opaque to the LB? TCP-level load balancing, maybe? There are reasons you might choose that. Or maybe the protocol you use is not HTTP-based and you don't have a way to teach your LB about it.
Similarly, you might have e.g. a server which is responding to a 'get all widgets in category' call by quickly responding '200 OK, no widgets found', and thus ends up sucking up all the traffic.
You don't want to use actual data, so you add a dummy account.
Someone deletes the dummy account, and all the servers that are working perfectly start reporting they are sick.
I suspect what happened is that they didn't understand the problem, and so resorted to ineffective means to steer the attention away from their own inadequacies.
Anyway, several of the posts tend to do quite well on HN, so it looks like commenters look past that when they engage with the stories.
This to me is odd. Was this posted when AWS's ELBs were new and shiny?
Most of the big failure cases I've dealt with are along these lines.
One server does something stupid and gobbles up the world.
That being said, this is a very neat way of describing the problem. I shall be referencing this in the future, and I might start putting it in an interview question...
As a devops person I get told all sorts of "facts" about various building blocks.
We are currently looking at a low-latency API (read: not using HTTP); half of the suggestions I got back were about using Haskell or Go. No one bothered with the low-hanging fruit, like reusing TCP sockets, using predefined schemas instead of JSON, etc.
Not one person thought about LB latency...
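One of those low-hanging fruits - reusing the TCP connection instead of opening a new one per call - is a one-liner in Python (the endpoint is made up):

```python
import requests

# A Session keeps the underlying TCP (and TLS) connection alive,
# so repeated calls skip the connection handshake entirely.
session = requests.Session()

for item_id in range(100):
    resp = session.get(f"https://api.example.com/items/{item_id}")  # hypothetical endpoint
    resp.raise_for_status()
```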
You'd give them more bread, not toast. Toast is already processed bread :).
They're interesting but I always think they're a little _too_ simple. I mean this entire thing can be summed up with:
500s (and other errors) are returned faster than processed requests. Load-balancers will find a misbehaving server's queue empty more often and give it all the requests.
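If you want to see that mechanism in a few lines, here's a toy simulation of a "least outstanding requests" balancer with one backend that fails instantly (every number is made up):

```python
import heapq, random

N_SERVERS = 10
BROKEN = 3                  # this server 500s every request almost instantly
HEALTHY_TIME = 0.100        # seconds to serve a real request
BROKEN_TIME = 0.001         # seconds to return an error
ARRIVAL_GAP = 0.005         # a new request every 5 ms

outstanding = [0] * N_SERVERS   # in-flight requests per server
handled = [0] * N_SERVERS       # total requests routed to each server
completions = []                # (finish_time, server) min-heap
t = 0.0

for _ in range(20_000):
    t += ARRIVAL_GAP
    # retire requests that have finished by now
    while completions and completions[0][0] <= t:
        _, s = heapq.heappop(completions)
        outstanding[s] -= 1
    # route to the server with the fewest outstanding requests
    s = min(range(N_SERVERS), key=lambda i: (outstanding[i], random.random()))
    outstanding[s] += 1
    handled[s] += 1
    service = BROKEN_TIME if s == BROKEN else HEALTHY_TIME
    heapq.heappush(completions, (t + service, s))

total = sum(handled)
print([round(h / total, 2) for h in handled])  # the broken server's share dwarfs the rest
```

The broken backend is always the least busy, so it keeps winning the routing decision - which is the whole story in miniature.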
Knowing nothing about load balancers, had I read your summary before Rachel's article, I would probably have just glossed over it, and I would not have understood its profound importance for designing load balancers.
Rachel's story kept me engaged and explained the problem in a way that I will surely remember a lot longer.
Teaching is not about reciting facts -- that's what reference manuals are for. Teaching is for engaging students, making them interested, giving background on a topic, telling them why something matters, making it easy to understand, and making sure the student remembers the most important bits.
It's something I have struggled with a lot with junior devs and interns. I show them what they are doing wrong, how to do it correctly, and the next day they make the same mistake again. If people were perfect logical machines, just telling them facts might be sufficient. Unfortunately people are people, and it's really, really hard to teach them something effectively.
People need an intuitive concept and image to grasp and reason on while they master an idea, but that doesn’t justify any arbitrary amount of over-elaborating.
So my point is: There are times when efficient and to the point is called for, and there are times when it's not.
That has an interesting consequence with patents. In a patent infringement trial plaintiff is going to have to explain to the jury what the patent covers. But if plaintiff succeeds in getting the jury to understand the patent well enough to be able to realize that defendant infringed, there is a good chance they will understand the patent well enough to think that it was not non-obvious. The defendant also explains to the jury what the patent does, in order to explain their theory as to why they are not infringing.
However, it might be prudent to point out that these things will feature in the first chapter of any book on load balancing, as an example of the pros and cons of different load balancing methods: least connections, least load, round robin etc.
(Sometimes I suspect picking up a book is becoming a lost art in this day of devops and breaking things fast. Learning from mistakes is all well and good, but there are other ways too.)
Given that the subject matter strongly overlaps with what HN is about and the quality is high, I'm not surprised it gets HN attention fairly frequently.
She experienced trouble explaining a problem in a way the students would understand, so she came up with a better method.
Now, the actual technical or learning content is lower than in the usual articles of the same length, but it was a pleasanter read.
Which is important context for many of her posts in the last month or so.
If this had been some post about lessons learned with the latest whizbang JS framework, people wouldn't have had a second thought.
I've read plenty of her posts and have sympathized, as being a long time sysadmin and learning the exact same lessons. She indeed does have a story telling skill which is sorely needed: ops people tend to suck at effectively sharing knowledge - she's doing valuable work here.
I don't understand why there is criticism?
If I were to be cynical, I'd say it's because it's a woman writing about information technology. Or because someone is writing about information technology without it being strict and to the point and boring to everyone that doesn't have a geeky interest in that specific piece of technology. But most likely the former.
If you then move the health check into the main app, maybe it eliminates a lot of false positives but the same thing can then happen further into the stack...
Does anyone have a link to what she is talking about?
Good ELI5 explanation, but it doesn't really explain why the webserver failed the requests as it did. Or maybe I'm missing something?
That the article is about load-balancers and how a single "rogue actor" can have an outsized effect on the entire thing. The failing webserver is beside the point.
Firstly, your typical load balancer doesn't work this way anyway. It will just keep feeding requests to the application hosts on a round robin or random basis. Most don't keep track of how busy each instance is.
Secondly, any decent (HTTP/layer 7) load balancer will notice if an instance is returning exclusively 5xx errors and will stop routing requests to it. Would fail even the most basic of health checks.
As far as I'm aware - using ELBs heavily day to day - it doesn't maintain a backlog and simply distributes requests on a round-robin basis the moment they come in. I don't believe there is a queue of waiting requests. If a request can't be fulfilled immediately, it is rejected.
> With Classic Load Balancers, the load balancer node that receives the request selects a registered instance using the round robin routing algorithm for TCP listeners and the least outstanding requests routing algorithm for HTTP and HTTPS listeners.
Too much flourishing, poor pacing
Also, don't mistake her teaching approach as indicating the people in the room weren't aware of what load balancers are or how they work. It's a good teaching technique to start out with some basic ground-work leading in to the point you wish to make. Starting out as she does ensures that everyone in the group knows _exactly_ what she's talking about before she gets to the key point, _and_ should be able to immediately understand what is going on and why.
Regardless of that, there's a lot of subtleties of dealing with load balancers that people rarely think of until they've been bitten by them. I've used load balancers quite regularly when interviewing candidates because I can almost always find some aspect of load balancer behaviour that people aren't aware of. That gives me an ideal chance to explore a subject with a candidate and find out how quickly they can piece things together and learn.
No one hires MIT grads because they graduate with more extensive load balancer knowledge.
What it did teach me: locks, semaphores, mutual exclusion, the memory hierarchy, threading... the core concepts of concurrent and distributed systems that are independent of the specifics of running a large-scale web application.
The point of a technology degree had better be to teach you how to learn about new technology and experiment with it, or else a 2014 degree will be worthless in 2034.
Some degrees in CS are really vocational certificates, while more traditional degrees are not.