Interns with toasters: how I taught people about load balancers (rachelbythebay.com)
336 points by deafcalculus 10 months ago | 117 comments



This reminds me strongly of another system of distribution, which suffers from the same effect: differential gearboxes in cars.

Since the torque to all the wheels is equal, if one wheel slips it very quickly takes all of the power of the engine, since power = rotational velocity * torque. Only if the speeds are similar is the allocation of power to the wheels similar.

In cars, the solution is to make sure that the power is distributed fairly evenly even if a wheel loses traction. This used to be done with thick grease inside the differential gearbox to limit the difference in speeds - more recently complex gearboxes and traction control achieve the same thing, but through constantly monitoring the speeds of each wheel and braking wheels that are spinning out of control.

It's interesting how such disparate distribution systems seem to have such similar failure modes. I wonder if the two sides could learn something from each other here.


There are a variety of mechanical systems used to combat this problem, not just thick grease.

Limited Slip differentials can be employed to ensure that one wheel can only slip so far before the differential locks and both wheels are forced to rotate together. LSDs can be employed in three positions: front, rear, and center. The front and rear ones lock the left and right wheels together; the center one locks the rotation speed of the front and rear axles together.

You can also add manually selectable locking differentials in any of those positions instead. Frequently, you see LSDs in front and back and manual lockers in the center.

In off road applications it is more common to see manually selectable lockers in every position. Because of the lack of differentials, though, manual lockers put a lot of load on the drivetrain if the surface isn’t slippery.

The alternative, traction control, can use variable differentials to send different amounts of power to different wheels, typically by employing the brakes on the slipping wheel to force power to distribute evenly.

Obviously, there is a lot to this, but that highlights some of the options.


The Viscous Coupling is really neat.

The design consists of a series of plates with holes cut in them and relies on a dilatant fluid (silicone paste) filling the holes, which has a significant resistance to shearing. This allows the outputs to run at different speeds from each other, up to the point where the paste gets hot from shearing and the pressure increase locks the plates to each other. When the output speeds are similar, the diff doesn't lock up but will allow one to overspeed the other.

One main benefit to this design over others is that by changing the number of plates you can change the amount of torque sent to the outputs. When installed with twice as many discs on the rear output as the front, you end up with twice as much torque to the back wheels, as in the transfer case of an E30 BMW 325iX. Effectively there is no mechanical connection between the engine and the wheels until the transfer case input starts moving, which starts to lock up and transfer torque to the rear. The rear diff outputs are not mechanically connected until the shearing in it causes an increase in pressure. But the increase in pressure happens so quickly, lockup being within milliseconds, that you wouldn't know it while driving.


There's also the Torsen differential [1]

[1]: https://youtu.be/JEiSTzK-A2A


And for those on a budget - what we in Australia call "The CIG Locker"...

http://www.billzilla.org/diffs.htm

(They _kinda_ work on rally cars and hillclimbers...)


It takes a while to get your head around how these work but they are fantastic. Fortunately there are a few videos on it now to help.


Chevy made a brilliant video in 1937 explaining how differentials work:

https://www.youtube.com/watch?v=yYAw79386WI


If I remember the movie My Cousin Vinny correctly, is this called Positraction? Marisa Tomei’s character says positraction prevents being “stuck in the mud”, where “one tire spins, the other tire does nuthin’”


Positraction was a brand name used by GM for some limited slip differential systems.


My Cousin Vinny


And that's why you always set health checks on servers behind a load balancer that should take bad ones out as soon as possible. You then get another interesting problem, which is if the server is only spitting errors when serving requests, it'll then be marked healthy again and forever toggle between healthy and unhealthy. But that you can also solve.
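
One way to stop the flapping (a minimal sketch, not any particular load balancer's built-in logic; the thresholds are made up) is hysteresis: eject a host after a few consecutive failures, but only re-admit it after several consecutive passes:

  import urllib.request

  class HealthChecker:
      # Hysteresis: eject after `unhealthy_after` consecutive failures,
      # re-admit only after `healthy_after` consecutive passes.
      def __init__(self, url, unhealthy_after=3, healthy_after=5):
          self.url = url
          self.unhealthy_after = unhealthy_after
          self.healthy_after = healthy_after
          self.fails = 0
          self.passes = 0
          self.in_rotation = True

      def probe(self):
          try:
              ok = urllib.request.urlopen(self.url, timeout=2).getcode() == 200
          except Exception:
              ok = False
          if ok:
              self.passes, self.fails = self.passes + 1, 0
              if not self.in_rotation and self.passes >= self.healthy_after:
                  self.in_rotation = True
          else:
              self.fails, self.passes = self.fails + 1, 0
              if self.in_rotation and self.fails >= self.unhealthy_after:
                  self.in_rotation = False
          return self.in_rotation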


The mistake I've seen here is having the "health check" do little or nothing. In many web services, I've seen a /ping or /health api that the load balancer calls to ask "are you healthy?". But people get rushed or lazy or requirements change, and they wind up with that api just doing "return true".

Now you've got a host that can't access a database or can't do much of anything- but it can return true!

Health check APIs should always do some level of checks on dependencies and complex internal logic (maybe a few specific unit/integration tests?) to ensure things are truly healthy.


Outlier detection also fixes this, without having to implement health checks that replicate the whole functionality (complexity) of the rest of the service.

10 500s in a row? Go to the timeout chair until you get better.
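
Roughly this logic, as a sketch (the numbers and names are illustrative, not any specific proxy's implementation):

  import time

  class OutlierDetector:
      # Eject a backend after N consecutive 5xx responses, for a fixed
      # cool-off period (the "timeout chair"), then let it back in.
      def __init__(self, consecutive_5xx=10, ejection_seconds=30):
          self.consecutive_5xx = consecutive_5xx
          self.ejection_seconds = ejection_seconds
          self.streak = {}          # backend -> consecutive 5xx count
          self.ejected_until = {}   # backend -> timestamp it may return

      def record(self, backend, status_code):
          if status_code >= 500:
              self.streak[backend] = self.streak.get(backend, 0) + 1
              if self.streak[backend] >= self.consecutive_5xx:
                  self.ejected_until[backend] = time.time() + self.ejection_seconds
                  self.streak[backend] = 0
          else:
              self.streak[backend] = 0

      def available(self, backend):
          return time.time() >= self.ejected_until.get(backend, 0.0)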


At work we have two endpoints on every service, `/status` and `/health-check`.

The former is basically a "return true" endpoint, which can tell you if the service is alive and reachable. The latter will usually do something like "select 1;" from any attached databases and only succeed if everything is OK.
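
Something like this, as a sketch (Flask and SQLite here stand in for whatever framework and database the service actually uses):

  import sqlite3
  from flask import Flask

  app = Flask(__name__)
  DB_PATH = "app.db"   # hypothetical path to the service's database

  @app.route("/status")
  def status():
      # Liveness only: the process is up and reachable.
      return "OK", 200

  @app.route("/health-check")
  def health_check():
      # Deep check: run a trivial query against the attached database.
      try:
          sqlite3.connect(DB_PATH).execute("select 1;")
      except Exception:
          return "database unreachable", 503
      return "OK", 200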


And the former is the one you want your Load Balancers to be checking. With a deep health check, even a brief database outage will cause every web server to be taken out of rotation, and then you're completely down for at least as many health check intervals as it takes for the LB to consider a host healthy again. Same goes for any other shared resource that is likely to affect all web servers if it becomes unavailable.


Presumably your service would not be in a very useful state anyway if the main data store it needs to function is out, so which kind of health-check you use will depend on the failure mode you want to expose to your users.


Or you just have smarter load balancers, that realize when all of their servers are acting funky and stop taking servers out of rotation.


Or to riff on this idea further: You make the concept of a node's "relative stability" part of the load-balancer logic the same way that "relative idleness" is a factor.

Then if all 100/100 nodes get taken down by some shared problem, the system simply degenerates into picking the idle-est of the 100.


That's not necessarily true. In kubernetes, for example, you have Liveness and Readiness probes, which can both have a period, timeout, initial delay and importantly a number of failures before you kill the service and spawn a new one.

This allows you to have less frequent checks that are more in-depth, and basic ones that are just the `return true` type.

I guess you're correct that you can have a widespread db outage that makes many of them fail, but then there should be new ones coming very fast, as you can set the deployment minimum for services taking requests. I think you can get very close to stability even in this circumstance.

https://kubernetes.io/docs/tasks/configure-pod-container/con...


Databases are 10-100x as reliable as the application tier in my experience


Although the OP didn't describe it as such, health checks that do a simple query are also testing connectivity to the db. Someone might have hardcoded a db address in an environment variable, or there might be connection pooling issues.


The best way I can think of is to aggregate errors over time, categorize them and build a health check around those metrics.
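
As a rough sketch (the window size and threshold are made-up numbers): keep a sliding window of request outcomes and let the health check threshold the resulting error rate:

  import time
  from collections import deque

  class RollingErrorRate:
      # Sliding-window error rate that a health check endpoint can threshold.
      def __init__(self, window_seconds=60):
          self.window = window_seconds
          self.events = deque()   # (timestamp, was_error)

      def record(self, was_error):
          now = time.time()
          self.events.append((now, was_error))
          while self.events and self.events[0][0] < now - self.window:
              self.events.popleft()

      def error_rate(self):
          if not self.events:
              return 0.0
          return sum(1 for _, err in self.events if err) / len(self.events)

      def healthy(self, max_error_rate=0.5):
          return self.error_rate() <= max_error_rate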


There's a great gem for Rails that you can use for comprehensive health checks: https://github.com/ianheggie/health_check

It can check the database connection, redis, cache, email, up-to-date migrations, and S3 credentials.


I was going to say something similar. I get that this was a very basic explanation of what load balancing can be, but very rarely have I seen it implemented in the fashion described - that is, "serving toast; when a toaster finishes its toast, that toaster gets more toast, etc."

There are so many types of load balancing, and so many different ways to guard, simply with best practices, against what this story explains.

I also found the 'quadruple hump' in a graph reference difficult to follow without more of a backstory of what the graph was representing.

I also don't understand why it would be so difficult to grep through the logs of that load balancer (or load balancers) and find that common fault on the backend.


If you are doing an L7 HTTP health check, then you should be able to validate a specific URI's expected successful/healthy response.


I see that on my continuous integration system. We use Teamcity with ~50 agents for build tasks that take 20-30 minutes. Each agent can only be running a single task.

During the day, all agents are busy and the queue fills up with 30 or so pending tasks.

If one agent gets into a bad state where, say, it fails to checkout from source control and fails within the first 20 seconds of a build, then it will very quickly chew through the entire queue of pending tasks, failing them all.

You'd think the more agents you have the better insulated you are from the failure of a single one, but this particular failure mode actually becomes more common the more agents you add!!!


> this particular failure mode actually becomes more common the more agents you add

I'm surprised. I would expect the length of the pending job queue to also have an impact on this failure _mode_ (as opposed to the failure _cause_), and the queue length is inversely proportional to the number of agents, flooring at 0 once you have more agents than necessary to deal with peak demand.

If the bad agents (I'm assuming the failure to check out from SCM lies in the agents themselves, not in an SCM server overloaded by too many agents polling, otherwise this isn't an example relevant to the thread) are e.g. 1 in every 50 agents, and the number of agents is much larger than the peak load (e.g. a million agents), the probability of a job failing approaches 1/50, because the good agents blocked on doing work are just a small fraction and thus don't significantly skew the probability of a new job getting scheduled onto a bad agent (which is more likely to be ready).

If on the other hand you have fewer agents than peak load, the queue will contain some jobs, and bad agents will chew through the whole queue when they get the chance, as you described.

And they always get the chance, since they are almost always ready (if they fail very quickly), so that the pool of ready agents will almost always contain only bad agents.

If you add more agents you have to deal with these bad agents "floating on the top of the pool", and the new ones you add (at the rate of e.g. 1 every 50). So, if you have 5 bad agents and you add another 5 you still have a pretty high chance that new jobs will land on the bad ones.

But you can add more and the probability of success just gets better.


That is a good point and I like the visual metaphor of bad agents "floating on top of the pool".

I was considering the fairly narrow problem domain between say 20 and 40 agents, and the queue fills up to perhaps 50 pending tasks mid-day. You are correct in the case that once you add enough agents to keep the queue empty, then additional agents reduce the fail chance.

Sadly, though, the CI system I use (Teamcity) prioritizes agents that have a history of completing that particular task the fastest. That's great if your agents are all different specs, but fails spectacularly in this case, where a bad agent will be selected above a good one.


At a real estate webhost in 2014, we had a small web farm behind a single load balancer. I've written in previous HN posts about the interesting architecture choices made by the lead architect. Along with these, it was the days of "move fast and break things", so developers got admin access to servers and would develop against live sites. Fun times keeping a web farm online all night long.

Partly because it was such a small operation, we heavily instrumented the web servers with PRTG, along with hitting a number of key sites every minute, on each web server. "When XYZRealty goes down, so do all of these other sites!" "We'll put a sensor on XYZRealty."

This gave us great data about the health of the servers, including identifying bad apples, and even aiding in performance testing of new modules. We were able to catch memory leaks and processing spikes before they broke our sites. And when 64-bit modules were ready to replace the 32-bit modules, we had baseline data ready to compare and evaluate.

Not that this won't scale - quite the contrary. It does generate a lot of data, though, and requires dedication to maintain.


To summarize: “implement health checks to take bad servers out of the service pool, always!”

Love the very didactic way of writing. Perfect for managers, not interns!


That's why I like random load balancing. If each machine is powerful enough and can handle a few thousand users then the distribution averages out.

Smart load balancers are only really necessary if you have inefficient servers that can't handle more than 100 connections per second and they're difficult to get right.

If you toss a coin 10 times, you're much more likely to get >=80% heads than if you were to toss that coin 1000 times.
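
A quick simulation makes the point (an illustrative sketch; exact numbers will vary from run to run):

  import random

  def prob_at_least(p_heads, tosses, threshold, trials=100_000):
      # Empirical probability of getting at least `threshold` fraction heads.
      hits = 0
      for _ in range(trials):
          heads = sum(random.random() < p_heads for _ in range(tosses))
          if heads / tosses >= threshold:
              hits += 1
      return hits / trials

  print(prob_at_least(0.5, 10, 0.8))    # roughly 0.05
  print(prob_at_least(0.5, 1000, 0.8))  # effectively 0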


That model works if you have a large number of short lived requests.

On the other hand, if you have a small number of long running requests that you want to distribute over a number of servers, then your load balancer needs to track which servers are busy, or you'll spend a lot of time waiting for the randomly assigned server to finish, while others are idle.


Then you design your system to always respond as quickly as possible. HTTP even has a status code for that: 202 Accepted. You put the job in a queue and return some ID or cookie to the client, possibly also an ETA. Client can then poll for results (with bounded exponential back-off), again polling is just a quick key look-up; alternatively, if you have some means to push results to clients (like via WebSockets), use that.

If Python is your thing, Celery makes this pattern trivial to implement.
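
The whole pattern is roughly this (a sketch using Flask + Celery with a Redis broker; the endpoint names and the `transcode` task are illustrative, not any particular production setup):

  from celery import Celery
  from flask import Flask, jsonify, request

  celery_app = Celery("tasks", broker="redis://localhost:6379/0",
                      backend="redis://localhost:6379/1")
  app = Flask(__name__)

  @celery_app.task
  def transcode(video_id):
      # ... long-running work happens in a worker process, not the web tier
      return {"video_id": video_id, "status": "done"}

  @app.route("/jobs", methods=["POST"])
  def submit():
      result = transcode.delay(request.json["video_id"])  # enqueue and return fast
      return jsonify(task_id=result.id), 202              # 202 Accepted + job ID

  @app.route("/jobs/<task_id>")
  def poll(task_id):
      res = celery_app.AsyncResult(task_id)               # quick key lookup
      if res.ready():
          return jsonify(status="done", result=res.result), 200
      return jsonify(status="pending"), 202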

This also simplifies so many other things: for example, deployments (and rollbacks!) can proceed much faster, since you don't need to wait obnoxiously long for all connections to drain.

Of course you're pushing more complexity up the stack, but in my experience it's a good trade-off.


I've spent enough time building systems and doing ops that this complexity genuinely scares me. Have you used this pattern in highly scalable production apps? What's it like to debug when everything goes to hell?


So first, this is not my idea, but something that people much smarter than me came up with.

This is the pattern that AppEngine enforces. Apps that regularly take more than 1s to respond are penalized, and there's a 60s hard deadline on all requests. Things like a task queue or push channels are built-in to the platform, with a high-level API exposed, so you just focus on writing the application / business logic.

Celery tries to basically do the same for you, w/o the proprietary Google SDK's.

> What's it like to debug when everything goes to hell?

From my experience so far... You will find bugs in your application code way more often than in a battle-tested task queuing system, much like you're quite unlikely to find a bug in nginx or Python.

Fixing things is MUCH easier than in a request / response model. I know it's not a "web scale" example, but imagine an installation involving camera rigs, some networked hardware, a transcoding server, etc. Dude walks into a photo booth, types in his email/phone number on a tablet, etc. When I see a failure in a component of the pipeline, I can hack together and deploy a patch in seconds / minutes (from alert to running the fix in production) and just tell the system to retry a failed job. Dude gets his silly photo via email/sms within two minutes instead of one.

Picture a similar situation in a web application that processes user-uploaded media. You write your batch processing logic to bail early on any sign of error, and when you get the chance to fix some bug that e.g. affected 3% of your users, again - you resubmit the failed jobs and users just see the result with a delay, instead of having to re-upload.


Thanks for the response - I understand it much better now.


I'm not sure how that would solve the problem. Not every server is a front end web server.

Think of something like video transcoding, or complex database queries. At some point you are going to have a number of servers that do the work, and a load balancer needs to balance the work between them. If one server is busy with a job that will take an hour, there's no point in randomly queueing up lots of work while others are idle.


Read up on how task queues work - they do fill a role of a load balancer, your video transcoding case is a perfect example application. Basically workers can leave/join the pool at any time, poll for more work whenever idle/done, and signal failures so that jobs can be retried later. There's no such thing as an idle worker, unless the queue is empty. No "overloaded" workers either, running two CPU-bound tasks concurrently won't get either of them done sooner. Your monitoring / auto-scaling system can watch the length of the task queue and add/remove workers as needed. (You can also try to combine this with AWS spot instances.)
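
The pull model is simple enough to sketch with the standard library (a toy illustration, not how Celery works internally): workers block on the queue when idle, so nothing sits idle while work is pending.

  import queue, threading, time

  jobs = queue.Queue()

  def worker(name):
      while True:
          seconds = jobs.get()        # blocks until work is available
          time.sleep(seconds)         # stand-in for a CPU-bound transcode
          print(f"{name} finished a {seconds}s job")
          jobs.task_done()

  for i in range(4):                  # workers can join or leave the pool at any time
      threading.Thread(target=worker, args=(f"worker-{i}",), daemon=True).start()

  for seconds in [8, 1, 1, 5, 1, 1]:  # a mix of long and short jobs
      jobs.put(seconds)

  jobs.join()                         # block until the queue drains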


Okay, so that would require totally changing the application architecture, and in the end it would suffer from exactly the same issue: A broken worker that takes tasks from the queue and immediately returns an incorrect result will quickly take over a significant portion of the tasks. That’s exactly the issue the article was talking about.


> A broken worker that takes tasks from the queue and immediately returns an incorrect result will quickly take over a significant portion of the tasks.

Except now you can use a round-robin/"dumb" LB on the frontend, fixing that exact issue (and a whole class of other problems).

Yes, a broken worker will quickly drain the queue, except once you find out and remove it, you can (hopefully) resubmit the failed jobs. It gives you a new primitive to work with - HTTP requests are 1:1 to HTTP responses, but a job's result can be updated, and the new result propagated up the chain. (Think: media transcoding / post-processing, rendering pipelines, data analysis, etc.)

> that would require totally changing the application architecture

True, usual cost/benefit assessment when dealing with technical debt, etc.


If each server can handle thousands of operations at a time, the duration of each operation shouldn't matter because long operations should also be evenly spread across servers... Also due to probability.

When you're dealing with big numbers across relatively few servers, randomness averages out everything.


It depends on how long your long operations are. Back when I was working on Justin.tv, we had to write a custom load balancer because we had to maintain a sizable number of TCP connections that lasted for hours or days without disconnecting.


I've worked on a cryptocurrency trading platform that had WebSocket connections which also sometimes lasted for days (including trading bots). Each user-facing server could handle 10k concurrent sockets and the long connections were spread out evenly.

When you have a sample of 10k users, they tend to behave consistently with any other 10k sample... Assuming that all machines are the same size and the code is the same on all machines.


Also if you have moderately long lived requests, if one server starts to get overloaded and it starts to slow down, new requests will still be sent to that server, which will make it more overloaded and slower, and jobs start to build up on that server, making it slower and slower.


Random balancing only works with homogeneous servers.


You can weight them.


Isn't it almost always advisable to have homogeneous servers in a web farm? What are some example cases where that doesn't make sense?


I'm on mobile, so won't look for the source. I heard Netflix benchmarks each new instance bootstrapped in AWS because they have found a dramatically high level of variability in performance from what should be homogeneous hardware. They toss out low performers and re-bootstrap the instance.


Containerization and spot instances.


This is true, but if you have a tighter tolerance on how much load each machine needs to be able to handle, then you may be able to waste fewer resources.


Is this better than round-robin somehow?


Not sure how relevant this is to Rachel's incident, but it looks like they use ECMP/BGP -> Shiv (L4) -> Proxygen (L7), so it's hard to believe that health checking wasn't in the mix. If the nodes were passing the health check but still not properly serving requests, then I'd assume one of the post-mortem items would have been improving the health checks.

Found this pretty cool presentation/PDF about FB's load-balancing architecture. Stays fairly high level: http://www.esnog.net/gore16/gore16-files/Mikel_Billion_user....


Great "teaching by showing" example but I don't get the initial technical scenario; you have sophisticated (ie, not RR / random) load balancers that keep track of queues in web servers, so they have to get some information back from them (when job was completed, from HTTP response for ex), but somehow don't react to 500 errors? seems like badly configured LBs. Still scenario can be used as teaching or interview question.


Suppose that a machine succeeds fast, not fails fast. Say a machine's job is to run N things on its input, and the identities of these N things are picked up from configuration somewhere. One machine gets a corrupted config somehow, and has only 1 thing to run. The others are running 10 things. So one machine is (barring framework/networking/etc. overhead) 10x faster than the rest, and it thinks it's doing fine. Now even if your mind is stuck on HTTP, you have to think about the same problem but with 200 response codes. Still a non-issue?

And what if the responses are opaque to the LB? TCP-level load balancing, maybe? There are reasons you might choose that. Or maybe the protocol you use is not HTTP-based and you don't have a way to teach your LB about it.


I've worked on a system which communicated errors by responding with a 200 and then indicating the error in the response body!

Similarly, you might have e.g. a server which is responding to a 'get all widgets in category' call by quickly responding '200 OK, no widgets found', and thus ends up sucking up all the traffic.


Exactly, you need to have a health check that actually exercises all the layers of the system and returns real data.

You don't want to use actual data, so you add a dummy account.

Someone deletes the dummy account, and all the servers that are working perfectly start reporting they are sick.


Maybe some people dismissed her problem as 'impossible' because she didn't say what kind of load balancing technique the load balancer was using?

  I suspect what happened is that they didn't understand the problem, and so resorted to ineffective means to steer
  the attention away from their own inadequacies.
^ This statement is kinda harsh.


I think it would be better phrased as suggesting they thought they understood but really didn’t understand the problem and therefore made a preemptive conclusion.


Every post on that blog is about how the author is incredibly clever and surrounded by dolts.


Many of her posts are about teaching or communicating issues like this to others. In other words, there is no reason to give a class like this if she doesn’t actually know more than the interns.


And I thought I was the only one that gets that vibe from the site's writing.

Anyway, several of the posts tend to do quite well on HN, so it looks like commenters look past that when they engage with the stories.


> "some of the commenters dismissed either it ("impossible") or me ("doesn't know anything about load balancing"). "

This to me is odd. Was this posted when AWS's ELBs were new and shiny?

Most of the big failure cases I've dealt with are along these lines.

One server does something stupid and gobbles up the world.

That being said, this is a very neat way of describing the problem, I shall be referencing this in the future, I might start putting that in an interview question....


Given the author's presumed gender, I understood it to be thinly-veiled sexism. But I could be wrong.


Perhaps, or just typical dickishness. (Mind you, it's the same thing really.)

As a devop I get told all sorts of "facts" about various building blocks.

We are currently looking at a low-latency API (read: not using HTTP); half of the suggestions I got back were about using Haskell or Go. No one bothered with the low-hanging fruit, like re-using TCP sockets, using predefined schemas instead of JSON, etc.

Not one person thought about LB latency...


>and so on down the line until everyone had bread. At various points, a toaster would finish and would pop up. ... I'd notice that they were done and would run over to give them more toast.

You'd give them more bread, not toast. Toast is already processed bread :).


Ah! Correct you are. Fixed. Thanks!


A great story, but I'm distracted by the word caromed. How did I not know about that word? Now I feel the need to inject it into my active lexicon.


A protocol designed without considering the possibility of an insane counterpart is doomed to DDoS the systems that utilize it.


If possible can you share the link to your lectures? (:


Who is Rachel and how do her short and simple stories always hit the front-page?

They're interesting but I always think they're a little _too_ simple. I mean this entire thing can be summed up with:

500s (and other errors) are returned faster than processed requests. Load-balancers will find a misbehaving server's queue empty more often and give it all the requests
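
You can see the effect with a toy model of a least-busy balancer where one backend "fails fast" (illustrative numbers only):

  import heapq

  def simulate(servers=10, requests=10_000, work=1.0, fail_time=0.01):
      # (time the server frees up, server id); always dispatch to the least busy
      free_at = [(0.0, i) for i in range(servers)]
      heapq.heapify(free_at)
      served = [0] * servers
      for _ in range(requests):
          t, i = heapq.heappop(free_at)
          served[i] += 1
          cost = fail_time if i == 0 else work   # server 0 errors out instantly
          heapq.heappush(free_at, (t + cost, i))
      return served

  served = simulate()
  print(served[0], sum(served[1:]))   # the fast-failing server absorbs ~90% of requests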


Everything seems trivial once you understand it. The author seems to be good at making people understand, so that it seems trivial afterwards.

Knowing nothing about load balancers, had I read your summary before Rachel's article, I would probably have just glossed over it, and I would not have understood its profound importance for designing load balancers.

Rachel's story kept me engaged and explained the problem in a way that I will surely remember a lot longer.

Teaching is not about reciting facts -- that's what reference manuals are for. Teaching is for engaging students, making them interested, giving background on a topic, telling you why something matters, making it easy to understand, and making sure the student remembers the most important bits.

It's something I have struggled with a lot with junior devs and interns. I show them what they are doing wrong, how to do it correctly, and the next day they make the same mistake again. If people were perfect logical machines, just telling them facts might be sufficient. Unfortunately people are people, and it's really, really hard to teach them something effectively.


I think the truth lies in the middle somewhere. I knew absolutely nothing about load balancing before reading OP's comment and the blog post, and while you're correct that I wouldn't have retained OP's bullet-point version as well as Rachel's, Rachel really went on at a length far beyond what was needed to tell the story / explain the concept (e.g., the beach-ball-sized bag of bread).

People need an intuitive concept and image to grasp and reason on while they master an idea, but that doesn’t justify any arbitrary amount of over-elaborating.


I found that the (over)-elaboration made the article funnier to read. Since it's a Sunday, I wouldn't have read it if it was a boring, efficient, fact-crammed text reading like something from one of the more boring computer science classes I take. I did finish this though, and learned something on the way.

So my point is: There are times when efficient and to the point is called for, and there are times when it's not.


> Everything seems trivial once you understand it

That has an interesting consequence with patents. In a patent infringement trial plaintiff is going to have to explain to the jury what the patent covers. But if plaintiff succeeds in getting the jury to understand the patent well enough to be able to realize that defendant infringed, there is a good chance they will understand the patent well enough to think that it was not non-obvious. The defendant also explains to the jury what the patent does, in order to explain their theory as to why they are not infringing.


Light-hearted introductions to work-related stuff you might encounter are always a crowd pleaser. They make for great lightning talk ideas too.

However, it might be prudent to point out that these things will feature in the first chapter of any book on load balancing, as an example of the pros and cons of different load balancing methods: least connections, least load, round robin etc.

(Sometimes I suspect picking up a book is becoming a lost art in this day of devops and breaking things fast. Learning from mistakes is all well and good, but there are other ways too.)


Do you recommend any books in particular in this area? It's something I need to learn more about.


They are short and simple, but looking at her blog it seems she also writes one every single day.

Given that the subject matter strongly overlaps with what HN is about and the quality is high, I'm not surprised it gets HN attention fairly frequently.


As I read it, this piece wasn't that much about load balancers but rather about teaching.

She experienced trouble explaining a problem in a way the students would understand, so she came up with a better method.


I think it has to do with how much (quality) content she puts out there. She posts consistently, which helps build an audience. That audience posts her stuff to HN and also upvotes her posts when they see them on HN.


I wonder that too. But it seems she has a gift: telling tech stuff as a story that resonates with the HN crowd.


Good writing, good story telling skills. I read this the whole way through, and I usually just skim. So she is able to write in such a way to do that.

Now, the actual technical or learning content is lower than in the usual articles of the same length, but it was a pleasanter read.


Apparently you aren't the first one to wonder that [1]. It's Rachel Kroll, an engineer at FB [2]. Good point about the summary.

[1] https://news.ycombinator.com/item?id=13401293 [2] https://www.facebook.com/wogrammer/posts/1748187012080407:0


Recently left FB, in fact.

Which is important context for many of her posts in the last month or so.


So basically she's using HN audience as a marketing blog and people are eating it up?


Marketing? What pray tell is she selling?

If this had been some post about lessons learned with the latest whizbang JS framework, people wouldn't have had a second thought.

I've read plenty of her posts and have sympathized, as being a long time sysadmin and learning the exact same lessons. She indeed does have a story telling skill which is sorely needed: ops people tend to suck at effectively sharing knowledge - she's doing valuable work here.

I don't understand why there is criticism?


> I don't understand why there is criticism?

If I were to be cynical, I'd say it's because a woman is writing about information technology. Or because someone is writing about information technology without being strict and to the point and boring to everyone who doesn't have a geeky interest in that specific piece of technology. But most likely the former.


Load balancers can do health checks, though, right? Like wanting a 200 OK response versus just any response.


The problem comes when enough of the server is up to do a health check, but not enough to actually serve the request. For instance, a Java app server which has one app to respond to health checks, and another app to do processing - the second one can be jammed while the first one's happily reporting that everything's okay.

If you then move the health check into the main app, maybe it eliminates a lot of false positives but the same thing can then happen further into the stack...


My first encounter was the "yak shaving" article that had a bunch of inaccuracies and it was pretty offputting. If that article was any reflection on the general sophistication I wouldn't really want to read any more. It felt like I was reading something out of the 90s.


It's really low value compared to most of the rest of the front page stuff. Interesting in a way, but there is a lot of interesting stuff around that does not hug the front page.


>When it landed on certain web sites, some of the commenters dismissed either it ("impossible") or me ("doesn't know anything about load balancing").

Does anyone have a link to what she is talking about?


> This is what happened when one bad web server decided it was going to fail all of its requests, and would do so while incurring the absolute minimum amount of load on itself.

Good ELI5 explanation, but it doesn't really explain why the webserver failed the requests as it did. Or maybe I'm missing something?


> it doesn't really explain why the webserver failed the requests as it did. Or maybe I'm missing something?

That the article is about load balancers and how a single "rogue actor" can have an outsized effect on the entire thing. The failing webserver is beside the point.


Oh, ok


Take your pick. There's a remarkable number of things that can cause a server to start returning 500s while the rest of the fleet is fine. It doesn't even have to be that specific server's fault (e.g. the database behind it could have reached a connection limit handling the connection pools from all the other webservers, leaving this one in the dust). The fun part is that this can result in the misbehaving server _moving_ as connections close and re-open.


Understood! Thanks for explaining!


I don't really see how this is an issue engineers need to be particularly wary of?

Firstly, your typical load balancer doesn't work this way anyway. It will just keep feeding requests to the application hosts on a round robin or random basis. Most don't keep track of how busy each instance is.

Secondly, any decent (HTTP/layer 7) load balancer will notice if an instance is returning exclusively 5xx errors and will stop routing requests to it. Would fail even the most basic of health checks.


AWS ELBs work this way. They send more requests to the backends with the shortest backlog.


You can configure health checks for ELBs: https://docs.aws.amazon.com/elasticloadbalancing/latest/clas...


Do you have a link to any documentation mentioning this?

As far as I'm aware - using ELBs heavily day to day - it doesn't maintain a backlog and simply distributes requests on a round-robin basis the moment they come in. I don't believe there is a queue of waiting requests. If the request can't be fulfilled immediately, it is rejected.


From the AWS docs:

> With Classic Load Balancers, the load balancer node that receives the request selects a registered instance using the round robin routing algorithm for TCP listeners and the least outstanding requests routing algorithm for HTTP and HTTPS listeners.
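
"Least outstanding requests" reduces to something like this (a sketch of the idea, not how ELB is actually implemented; the addresses are hypothetical):

  import random

  outstanding = {"10.0.0.1": 3, "10.0.0.2": 0, "10.0.0.3": 7}  # requests in flight

  def pick_backend():
      # Choose among the backends with the fewest requests in flight.
      fewest = min(outstanding.values())
      candidates = [b for b, n in outstanding.items() if n == fewest]
      return random.choice(candidates)

  backend = pick_backend()
  outstanding[backend] += 1   # increment on dispatch; decrement when the response completes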


Based on how verbose this article is, I think maybe her original article was misunderstood because it was a chore to sort the junk from the content, just like this one.

Too much flourishing, poor pacing


Just the right amount of flourishing for a lazy Sunday :)


I feel like if you have to teach them about load balancers as college interns, FB needs to find a better school to pull interns from.


Why would someone need to learn about load balancers in high school or earlier?

Also, don't mistake her teaching approach as indicating the people in the room weren't aware of what load balancers are or how they work. It's a good teaching technique to start out with some basic ground-work leading in to the point you wish to make. Starting out as she does ensures that everyone in the group knows _exactly_ what she's talking about before she gets to the key point, _and_ should be able to immediately understand what is going on and why.

Regardless of that, there are a lot of subtleties to dealing with load balancers that people rarely think of until they've been bitten by them. I've used load balancers quite regularly when interviewing candidates because I can almost always find some aspect of load balancers and load balancer behaviour that people aren't aware of. That gives me an ideal chance to explore a subject with a candidate and find out how quickly they can piece things together and learn.


What kind of college teaches about load balancers?


If they’re not teaching about the most basic principles of building scalable systems, what is the point of doing a degree, or hiring people who have done a degree?


Several of the top CS programs specifically do not teach details like configuring a load balancer; with a proper background in computer science, it is assumed that you can figure out the latest programming languages and the install instructions for popular software packages. It would thus be a waste of a very expensive degree, as you could learn about configuring LAMP and a load balancer at a trade school instead of a theoretical research university.

No one hires MIT grads because they graduate with more extensive load balancer knowledge.


That is what you get the technicians to do


My university taught me the most basic principles of building scalable systems. It didn't teach me about setting up load balancers.

What it did teach me: locks, semaphores, mutual exclusion, the memory hierarchy, threading... the core concepts of concurrent and distributed systems that are independent of the specifics of running a large-scale web application.


In the mid-1990s, a respectable CS program would certainly have all of its graduates familiar with TCP/IP. They would know a little about routing and have built a few toy programs that implemented socket-based communication. Hot topics included "firewalls" and the race to build a gigabit-speed router. (But packet filtering at gigabit speed was quite a ways off in the future.)

The point of a technology degree had better be to teach you how to learn about new technology and experiment with it, or else a 2014 degree will be worthless in 2034.


The S in CS is science. Load balancers have nothing to do with the science of computation. It’s just like no self-respecting university will teach PowerPoint or “fixing Windows” in their CS curriculum.


In the 90s my uncle was disgusted with me because I couldn’t fix his TV — didn’t I have an EE degree?


Rightly or wrongly, universities are not vocational colleges. They exist to teach the theoretical principles of a field, not the practical skills of a particular job.


> If they’re not teaching about the most basic principles of building scalable systems, what is the point of doing a degree, or hiring people who have done a degree?

Some degrees in CS are really vocational certificates, while more traditional degrees are not.


Where else might they learn about load balancers? I doubt there are many high schools teaching these concepts.


Vendors like Cisco have excellent training, but it's biased towards their own gear.



