Ways to do load balancing wrong (acm.org)
222 points by alanfranz on Dec 25, 2016 | 29 comments


As someone who has worked at high scale (1,000+ load-balanced servers), I'd call this a nice "primer" to share with a newer engineer, or someone who hasn't worked in a horizontally scaled environment.

A few things I'll point out in addition to this article:

[1] I've found I greatly prefer databases that handle resiliency and horizontal scalability natively (or natively in their respective client libraries). Natively scaling DBs have been one of the greatest advancements in web technology during my career, and I'd really recommend taking a look at technologies like Riak and Cassandra for inspiration.

[2] When scaling horizontally with load balancing, connection pooling becomes something to always pay attention to. If you have 1,000 servers and 100 database instances, you likely don't want 100,000 active connections. Focusing on things like intra-zone preferences for servers (Server A prefers all DBs in its own rack, "zone", or "region") can be a real headache saver as you scale up (a rough sketch of the idea follows after this list).

[3] Capacity planning is always key. What are you trying to plan for? Do you want to survive an entire AWS zone outage? Do you want to be able to survive a region outage? Do you worry about single servers going down, and if something "bigger" happens you'll just eat the downtime? Make sure you and your business stakeholders agree on what "resiliency" looks like to your business. It'll help you balance cost and downtime appropriately.

[4] When you're looking at spare capacity, remember to take into account the situation where a cache layer suffers an outage. If you have to fall back to the primary source of data, and it's potentially also suffering from degradation of service, what sort of capacity will you have?
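
To make the connection-budget idea in point [2] a bit more concrete, here's a rough Python sketch of a zone-preferring connection budget. The names (Endpoint, pick_endpoints) and numbers are made up for illustration; this isn't any particular pool library's API, just the shape of the idea.

    # Sketch only: prefer same-zone DBs and cap connections per app server.
    from dataclasses import dataclass
    import random

    @dataclass
    class Endpoint:
        host: str
        zone: str

    def pick_endpoints(my_zone, endpoints, max_conns_per_server=10):
        """Prefer databases in our own zone; spill over to remote zones
        only if the local zone can't fill the connection budget."""
        local = [e for e in endpoints if e.zone == my_zone]
        remote = [e for e in endpoints if e.zone != my_zone]
        random.shuffle(local)
        random.shuffle(remote)
        return (local + remote)[:max_conns_per_server]

    # 1,000 app servers each holding ~10 connections is ~10,000 connections
    # to the DB tier, instead of 100,000 if every server talked to every DB.
    dbs = [Endpoint(f"db-{i}", zone=f"zone-{i % 3}") for i in range(100)]
    print([e.host for e in pick_endpoints("zone-1", dbs)])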


There need to be more publications covering this, because I don't see it becoming common knowledge or being practiced very often.

Then there are the modern design decisions that make some of these subjects taboo, like "you should never cache anything," "a single node can handle a million concurrent connections, so why worry," and "let's make all our machines use 99% of their capacity with containers because it's more efficient, and hope we can deploy standbys before the database queues too much and blows the entire stack up." With the emergence of the pseudo-DevOps cargo cult, people who like using JavaScript for server backend code are designing infrastructure they've never administered before. (Sorry for the rant.)

There was a really great paper published a few years ago that was basically an Everything You Need To Know To Run Large Scale Enterprise Systems type paper. I can never find it when these types of threads come up, but it was dozens of pages long and really in-depth.


The SRE Book (http://shop.oreilly.com/product/0636920041528.do) has a number of chapters on load balancing that are well worth reading.

One thing the book covers that I think this article glossed over is that in sufficiently large systems there's never a single "load balancer"; instead there are many layers of load balancing at different levels of the stack. E.g.:

DNS load balancing -> high capacity network-level load balancing -> shared reverse HTTP proxy -> application server -> database (with a "load balancer" internal to the application server load balancing among DB replicas).


They mention there are different applications for load balancers, and then use one or two examples. But their main points are the most important ones: if you don't know what you're using it for, it might be worthless.

The SRE book seems to have only a couple of sub-chapters on it, so I would recommend looking up white papers and best practices for your particular application of load balancing. Safari Books has over 17,000 matches for "load balancing" in its library, for example, and F5 has 12 white papers on it just for their products.



Replying just in case someone finds out what paper that was.


Thanks for the contribution. What is the preferred approach for geographical redundancy in an active/standby site configuration? F5 offers BIG-IP GTM. I'm wondering if there are good open-source equivalents.


What sort of availability guarantees do you need? If you can tolerate a minute or two of less-than-full availability, Route53 DNS using geographical failover is a decent solution.


Looking at four 9s (about 4.38 minutes a month). The solution must be self-hosted and not reliant on commercial cloud infrastructure.


I haven't been in a situation where we had geographic redundancy, that small an SLA, and a self-hosting requirement. I've got a couple of ideas, but most of them likely violate one of your requirements.

The best advice I can give is to see if you can get someone from a large company that self-hosts to chat with you. Not too many companies have the sort of requirements you're looking at, so it's likely going to be someone at Microsoft, Google, FB, Oracle, etc. Search your network, see if you've got a connection to someone on an SRE team there, and ask to buy them a coffee. :)


> Not too many companies have the sort of requirements you're looking at

Telco is always the oddball with redundancy. This requirement is somewhat relaxed; generally, mobile core networks are designed for five nines with geographical redundancy. This is all baked into the products and protocols (e.g., SIGTRAN).

In this instance I'm looking at traditional HTTP for an API that interfaces with location-based services.

One idea is to use DNS with the records dynamically modified based on availability monitoring, with the TTL set very low (a rough sketch follows below). I see possible caveats with this solution though.

Another idea could be to maintain sessions across sites and have an active/active setup with round robin, with traffic then routed to the active site. The cost here is inter-site bandwidth being wasted and additional complexity being introduced.
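
For what it's worth, here's a rough sketch of the DNS idea, assuming a self-hosted authoritative server (e.g. BIND) that accepts TSIG-signed RFC 2136 dynamic updates, plus the dnspython library. The IPs, key, zone, and health endpoint are placeholders, and a real version would run in a loop with some damping so a flapping health check doesn't thrash the record:

    import urllib.request
    import dns.query
    import dns.tsigkeyring
    import dns.update

    PRIMARY = "203.0.113.10"     # active site
    STANDBY = "203.0.113.20"     # standby site
    DNS_SERVER = "198.51.100.1"  # authoritative server accepting dynamic updates
    KEYRING = dns.tsigkeyring.from_text({"failover-key.": "bWFkZS11cC1zZWNyZXQ="})

    def site_healthy(ip, timeout=3):
        """Very naive health check: can we fetch the API's health endpoint?"""
        try:
            urllib.request.urlopen(f"http://{ip}/healthz", timeout=timeout)
            return True
        except OSError:
            return False

    def point_record_at(ip):
        """Point api.example.com at the given site with a 30-second TTL."""
        update = dns.update.Update("example.com", keyring=KEYRING)
        update.replace("api", 30, "A", ip)
        dns.query.tcp(update, DNS_SERVER, timeout=10)

    point_record_at(PRIMARY if site_healthy(PRIMARY) else STANDBY)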


Route53 is basically the solution you describe with DNS records, and you could build out anycast DNS servers yourself if you wanted. The main thing to be aware of is that not all clients respect TTLs. We set a 30s TTL and see a 99% switchover in about 60s, but if you have some control over the clients you could probably go lower.


Sasas, I've got some thoughts that are probably easier done via a back-and-forth outside of HN. Hit me up on Twitter if you use it: @dustywes


I'm at Basho, makers of Riak. If you're looking for HA at the DB layer, take a look.


Thanks. Not looking at HA with the DB. I took a look at Riak briefly; looks interesting.


https://github.com/blblack/gdnsd/

I know of someone running their own CDN with this.


Great bits of knowledge, though to throw something out there: in some circles, when presented with the definition of high scale being 1,000+ servers, they will likely reply with "I forgot how to count that low." Life gets interesting at 10,000+ servers.


Feel free to elaborate on how life gets interesting at larger scale, if you want to be constructive.

Or not.

The parent post gives a good template for how to be constructive, if you are inspired to take that path.


Life was plenty interesting at 1,000 (although we were plenty larger when you added in non-user-facing servers). :)


Can I ask how big these servers were, and whether it would be nicer to have half as many servers at twice the size?


Really depends on the role of the server. We didn't tend to have "monster" servers that were gigantic and served multiple purposes. We had "flavors" that matched workflow needs and were sized to make sure we could maximally utilize our bottleneck resource.

So, for example, our app servers were typically scaled by CPU capacity. Thus, we tried to minimize the cost per compute unit (so if you were in Amazon this would correlate to one of their "c" flavors). Even if a fleet of servers was in the dozens or hundreds, we didn't usually try to use the largest flavor available. The reasoning behind this was mostly that making web apps and databases fully utilize massive machines can sometimes be a bit of a dark art. We repeatedly found a horizontally scaled fleet of "commodity"-sized boxes was less of a headache than a small cluster of vertically scaled boxes.

You'll notice in my post above that I favor horizontally scalable technologies at every layer, and it's for that reason. Our database instances went wide before they went tall. If we needed more memory or disk we added a server of the same type, which adds incremental capacity. In some cases, we had emergencies which necessitated a vertical scaling event, so it was always nice having that bullet to fire when needed.

So, to wrap up my response to your question: no, not really. We didn't have any issues at 1,000 that we didn't also have at 500. Most of the servers we ran were no bigger or faster than what you could put in a small tower in your bedroom (mind you, with Xeons instead of i7s). If I had to do it over again, I'd stay middle of the pack as soon as I had enough traffic to fully balance across three servers, and then scale from there.


Is the discussion of "load balancers for resilience vs. scale" even relevant in a cloud-enabled world? Just use an AWS Elastic Load Balancer with an Auto Scaling Group, plus CloudWatch alarms for monitoring, and you get the best of all worlds... :-)


Load balancing strategies are also important. For a service that receives requests that take different amounts of effort to fulfill, weighted least-connections is almost certainly better than random or round robin.
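
Purely as an illustration of the weighted least-connections idea (made-up backend names and weights; HAProxy, nginx, and most hardware balancers implement this for you):

    class Backend:
        def __init__(self, name, weight):
            self.name = name
            self.weight = weight     # higher weight = can take more traffic
            self.active_conns = 0    # in-flight requests

    def pick_backend(backends):
        # Lowest active-connections-to-weight ratio wins.
        return min(backends, key=lambda b: b.active_conns / b.weight)

    backends = [Backend("app-1", 4), Backend("app-2", 2), Backend("app-3", 1)]
    for _ in range(7):
        b = pick_backend(backends)
        b.active_conns += 1          # pretend each request is long-running
        print(b.name)                # app-1 ends up with ~4x app-3's share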


How does something like this end up published by the ACM? Is this a special publication that isn't related to research at all?

I don't mean to sound harsh, but there was nothing in this article even close to novel in an academic sense, and none of it is even new field knowledge if you've spent any non-trivial time around load balancers. Perhaps my notions of what the ACM publishes are wrong...

Make no mistake, it's a good article for things to look out for if you're getting into load balancing. It's just nothing new if you are already experienced in the field.


Yes, ACM Queue is explicitly not a research publication; it's a general-interest magazine for practitioners: https://en.m.wikipedia.org/wiki/ACM_Queue


ACM is about 'advancing computing as a science and profession'. It's not just for academia. One of the goals of the modern ACM is exactly that of closing the gap between research and practice.


Doesn't such a vague definition just make it a blog platform for anything CS-related? Nothing in this article was even new to practice.


Well, yes, ACM itself is quite broad, with many focused groups as well as other, more outreach-oriented sections.

But ACM Queue isn't "a blog platform"; it's more like a magazine. Some of it is new, some is a recap, some is opinion.


Thanks! I'm glad you liked it.

The target audience of ACM Queue is the practitioner, not the researcher. My column tends to try to bring people up to a baseline of sanity rather than explore new research. 90% of the world is below (what I consider to be) the baseline for IT sanity. :-)



