
Stop Buying Load Balancers and Start Controlling Your Traffic Flow with Software - danmccorm
http://bits.shutterstock.com/2014/05/22/stop-buying-load-balancers-and-start-controlling-your-traffic-flow-with-software/
======
mattzito
I'm really reaching back into the depths of my memory, but I've implemented
this in the past. It's not quite as simple as they make it sound - there are a
lot of sticky edge cases that crop up here (some of which have no doubt been
addressed in subsequent years).

\- It heavily limits the number of nodes you can have - that is something the
article does say, but I want to highlight it here. It strikes me as a really
bad strategy for scale-out.

\- I've run into weirdness with a variety of different router platforms
(Linux, Cisco, Foundry) when you withdraw and publish BGP routes over and over
and over again (i.e. you have a flapping/semi-available service).

\- It is true that when a node goes down, the BGP dead peer detection will
kick in and remove the node. _However_ the time to remove the node will vary,
and require tuning on the router/switch side of things.

This is a fairly crude implement to swing - a machete rather than a scalpel.
You lose a lot of the flexibility load balancers give you, and you depend a
lot more on software stacks (routers/switches) that you have less insight and
visibility into and that weren't designed to do this.

My suggestion would be that this is a great way to scale across multiple load
balancers/haproxy nodes. Use BGP to load balance across individual haproxy
nodes - that keeps the neighbor count low, minimizes flapping scenarios, and
you get to keep all the flexibility a real load balancer gives you.

One last note - the OP doesn't talk about this, but the trick I used back in
the day was that I actually advertised a /24 (or /22, maybe?) from my nodes to
my router, which then propagated it to a decent chunk of the Internet. This is
good for doing CloudFlare-style datacenter distribution, but has the added
benefit that if all of your nodes go down, the BGP route will be withdrawn
automatically, and traffic will stop flowing to that datacenter. Also makes
maintenance a lot easier.

~~~
Florin_Andrei
> _My suggestion would be that this is a great way to scale across multiple
> load balancers/haproxy nodes. Use BGP to load balance across individual
> haproxy nodes_

Exactly. BGP, while it may work like the OP said, was not meant to live this
close to the actual server nodes.

You could push BGP even further away. In a more traditional model, it's meant
to be used to switch (or load balance) between geographically separated
datacenters.

------
donavanm
This works well if you need to push high bit rates and are looking for
relatively simple load balancing. A Trident-based box, a la the Juniper QFX,
can push a few hundred Gb/s for ~$25,000. That's an incredibly low price point
compared to any other LB solution.

Some caveats and comments about the technique.

BGP & ExaBGP are implementation details. OSPF, quagga, & bird will all
accomplish the same thing. Use whatever you're comfortable with.

Scale-out can get arbitrarily wide. In a simplistic design you'll ECMP on the
device (ToR) where your hosts are connected. Any network device will give you
8-way ECMP. Most Junos stuff does up to 32-way today, and 64-way with an
update. You can ECMP before that as well, in your agg or border layer. That
would give you 64 x 64 = 4096 endpoints per external "vip."

ECMP giveth and taketh away. If you change your next hops, expect all those
flows to scramble. The reason is that the ordering of next hops / egress
interfaces is generally included in the assignment of flows to a next hop. In
a traditional routing application this has no effect. When the next hops are
terminating TCP sessions, you'll be sending RSTs to half of your flows.

For this same reason you'll have better luck advertising more specific routes,
like /32s instead of a whole /24. This can help limit the blast radius of
flow rehash events to a single destination "vip."

There are more tricks you can play to mitigate flow rehashes. It's quite a bit
of additional complexity though.

For the same reason, make double-plus sure that you don't count the ingress
interface in the ECMP hash key. On Junos this is incoming-interface-index and
family inet { layer-4 }, IIRC.
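
A toy sketch of why a next-hop change scrambles flows, assuming a simple
hash-mod-N next-hop selection (real routers use vendor-specific hash functions
and keys, so this is purely illustrative):

    # Toy ECMP model: hash the flow 5-tuple, index into the ordered next-hop
    # list. Removing one next hop changes the modulus *and* the ordering, so
    # far more than 1/N of the flows land on a different box.
    import hashlib

    def pick_next_hop(flow, next_hops):
        digest = hashlib.md5(repr(flow).encode()).hexdigest()
        return next_hops[int(digest, 16) % len(next_hops)]

    flows = [("198.51.100.%d" % i, 40000 + i, "192.0.2.10", 80, "tcp")
             for i in range(1000)]
    before = ["lb1", "lb2", "lb3", "lb4"]
    after = ["lb1", "lb2", "lb4"]   # lb3 withdrawn

    moved = sum(1 for f in flows
                if pick_next_hop(f, before) != pick_next_hop(f, after))
    print("flows rehashed: %d / %d" % (moved, len(flows)))
    # Expect roughly 3/4 of flows to move, not just the 1/4 that were on lb3;
    # each moved flow hits a box with no TCP state and gets an RST.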

You _really_ don't want to announce your routes from the same host that will
serve your traffic. Separate your control plane and data plane. It's terrible
when a host has a gray failure, say OOM or a read-only disk, and route
announcements stay up while the host fails to serve traffic. You end up null
routing or throwing 500s for 1/Nth of your traffic.

~~~
phil21
Great post, you said things better than I ever could.

So I'll nitpick instead :)

> You really don't want to announce your routes from the same host that will
> serve your traffic. Separate your control plane and data plane. It's terrible
> when a host has a gray failure, say OOM or a read-only disk, and route
> announcements stay up while the host fails to serve traffic.

This was probably the largest area I spent architectural time on before
deployment: was it better to run a tiny little healthcheck script on my
HAProxy boxes to tear down bgpd, or to run a route server?

In the end, I went with what I felt was the simpler solution of the two for
our scale (~60 total HAProxy boxes spread around the world) and used an
extremely simple "is haproxy accepting connections on the VIP or not" script
that stayed in memory. It was also tossed in init, just in case it did get
OOM-killed or crashed - and in the end we never had an outage from something
the service check should have caught. Knock on wood. Completely agree that a
route server is better than a naive script run from cron as a service check.

The route-server method is more interesting to me at scale, but adds
additional complexity to the problem. Now people need to know how this all
works, your service checks need to execute remotely (not a bad thing!), and
probably a few other things I'm forgetting. However, it makes management and
scaling much better and is the way I'd go if I did it all over again.

The failure model for this setup is basically: is HAProxy up? Yes? Then
announce routes. If not, pull routes. HAProxy was responsible for detecting
the health of the application itself and deciding what to do with an app
failure. We did add some code later on to down haproxy should it be unable to
reach _any_ webservers, but honestly the complexity and additional failure
modes this adds usually aren't worth it, since it's such a rare event.
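
For flavor, a minimal sketch of that kind of check written as an ExaBGP
process script (the article's tool; the comment above used bgpd, and the VIP,
port, and timings here are made up). ExaBGP runs the script and applies
whatever announce/withdraw lines it prints to stdout:

    #!/usr/bin/env python
    # Minimal "is HAProxy accepting connections on the VIP?" check as an
    # ExaBGP process script. VIP, port, and timings are illustrative only.
    import socket
    import sys
    import time

    VIP = "192.0.2.10"
    ANNOUNCE = "announce route 192.0.2.10/32 next-hop self"
    WITHDRAW = "withdraw route 192.0.2.10/32 next-hop self"

    def haproxy_up():
        try:
            socket.create_connection((VIP, 80), timeout=2).close()
            return True
        except OSError:
            return False

    announced = False
    while True:
        up = haproxy_up()
        if up != announced:
            sys.stdout.write((ANNOUNCE if up else WITHDRAW) + "\n")
            sys.stdout.flush()
            announced = up
        time.sleep(1)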

~~~
donavanm
Ha! I _did_ have an outage because route announcements and the data plane were
on the same host a few years back. Having a separate health check service &
route server is a trade-off of complexity vs. control. I could see the
argument when you only have a couple of hosts total. With dozens of endpoints
in a fleet it's quite nice to tolerate more wonky grey failures.

Unfortunately I don't know of any existing public lib/application/framework
that does this type of layer 2/3/4 load balancing for fleets of endpoints. The
vrrp/carp/keepalived/heartbeat crew seem focused on master/slave failover,
which is totally uninteresting to me.

~~~
phil21
Yeah, it's pretty custom. We started doing this nearly 10 years ago, and when
we explained it to vendors their eyes would glaze over. These days it seems
quite a bit more common - and I hope I had a tiny bit to do with that by
evangelizing it wherever I could.

Curious how you solved the hash redistribution problem? We never came up with
anything good (some clever hacks though!), but luckily for our uses it wasn't
a big deal and we could do away with a whole shedload of complexity.

The best we came up with was to pre-assign all the IPs (or over-assign, if you
wanted more fine-grained balancing) a given cluster could ever maximally
utilize. Then distribute those IPs evenly across the load balancers, and have
the remaining machines take over those IPs should there be a failure. This was
complicated as hell, and obviously broke layer 3 to the access port, so it was
a non-starter.

I'm sure we had better/more clever ideas, but we never had reason to chase
them down so I honestly forget. At this point, if someone needs to refresh a
page once out of every 100 million requests I'm pretty happy.

------
keepper
Half a million dollars of load balancers? Either you are buying from the wrong
vendor, or you have some wonky ideas of how many load balancers you need per
data center, or you are not using them correctly. (Hint: check A10 Networks
and Zeus.)

The reality is that if your problem is only L3, then arguably this can be
solved many ways. For example, networks have been doing tens of gigabits of L3
load balancing using DSR for ages. Dynamic route propagation doesn't have a
monopoly on this (albeit it's more "elegant").

But most people do more than L3, and really do L4-L7 load balancing, and most
modern "application load balancing" platforms are really software packages
bundled up in a nice little appliance. This is where packages like Varnish
with its VCL/VMODs and caching, aFleX (from A10 Networks), and TrafficScript
from Zeus, amongst others, come in. Shuffling bits is the easy part!
Understanding the request, and making decisions based on it, is the harder
part.

If you split the problem and are using Varnish or nginx as your application
load balancer, you can't claim you've gotten rid of load balancers; you were
either not buying the right platform initially, or not using it correctly.
When you put out a "stop buying load balancers"… you must first define what
you mean by "load balancer" ;)

For the record, I've used commercial load balancing platforms, and I've also
contributed patches to and used OSS load balancing platforms.

~~~
jsmthrowaway
Half a million dollars of NetScalers is extremely easy, and given that Google
originally ran all NetScalers, a lot of ex-Google people who run operations
at startups now default to them as a kneejerk. Similarly, F5 is a pretty easy
budget torpedo, too.

~~~
tootie
Nah, $20K and free shipping:
[http://www.amazon.com/F5-Networks-Standard-Contract-F5-BIG-L...](http://www.amazon.com/F5-Networks-Standard-Contract-F5-BIG-LTM-1600-4G-SBDL/dp/B00B9HZ328)

~~~
jsmthrowaway
You'd have a compelling case if I only needed one.

------
at-fates-hands
This is actually a great idea. The last company I worked for ran into several
issues with load balancing servers. Two out of the three major releases I was
around for were unmitigated disasters.

The first was because of old load balancing servers getting bogged down with
traffic. The CIO got pissed and dropped three million on brand spanking new
SSD drives and "state-of-the-art" servers. Cue the next release.

Pretty much the same issue. It was two lines in a program that were calling a
file from SharePoint - thousands of times a second - and bogged down all three
of the load balancer servers with traffic within minutes of the release. It
took the back-end developers a week and some help from Microsoft to fix the
bug. I just sat back and giggled, since the CIO had spent two hours in a
meeting with the whole IT department lecturing them on the importance of load
testing immediately after the first release's failure.

Needless to say, they didn't do any load testing for the applications either
time, which contributed to the issue. Of course, it just goes to show that
even with the bestest, newest hardware, you can still bring your
site/applications to their knees.

~~~
Florin_Andrei
> _It was two lines in a program that were calling a file from SharePoint -
> thousands of times a second - and bogged down all three of the load balancer
> servers with traffic within minutes of the release._

That type of issue can be quickly identified if you have, on the team, the
sort of mind that is inquisitive about what goes on at low levels, and is not
afraid of poking around with tcpdump and network interface traffic counters
and stuff like that.

------
lazyjones
This article is a bit low on details, so it's hard to judge the quality of the
proposed solution (without having tested a similar setup).

We faced the choice of either upgrading our aging Foundry load balancers or
building our own solution a few years ago and came up with a very stable and
scalable setup:

2+ load balancers (old web servers sufficed) running Linux and:

* wackamole for IP address failover (detects peer failure with very low latency, informs upstream routers; identical setup for all load balancers works, can be tuned to have particular IP addresses preferably on particular load balancer hosts) [http://www.backhand.org/wackamole/](http://www.backhand.org/wackamole/)

* Varnish for HTTP proxying and load balancing (identical setup on all load balancers) - www.varnish.org

* Pound for HTTPS load balancing (identical configuration on all load balancers, can handle SNI, client certificates etc. ...) [http://www.apsis.ch/pound/](http://www.apsis.ch/pound/)

This scales pretty much arbitrarily: just add more load balancers for more
Varnish cache or SSL/TLS handshakes per second. We also have nameservers on
all load balancers (also with replicated configuration and IP address
failover). Configuration is really easy; only Varnish required some tuning
(larger buffers etc.), and Pound (OpenSSL really) was set up carefully for PFS
and good compatibility with clients.

The only drawback is that actual traffic distribution over the load balancers
is arbitrary and thus unbalanced (wackamole assigns the IP addresses randomly
unless configured to prefer a particular distribution), but the more IP
addresses your traffic is spread out over, the less of a problem this becomes.

~~~
phil21
This solution works, but as you point out balancing over your load balancers
is a huge hack and relies basically on DNS.

If you replaced wackamole/DNS RR with BGPd on your varnish/pound boxes, you
would achieve the same goals but be able to fully direct your traffic flows
yourself vs. a random RFC-busting DNS cache somewhere.

The other big downside to this solution is being forced to run an L2 broadcast
domain for failover to work. Fine at your scale here, but when you get into
dozens or hundreds of switches and larger scale, I firmly believe that
dropping L3 down to the access port (if at all possible) is the way to go.
Debugging STP issues on such networks is about the last thing in the world I'd
like to be doing on a Friday night.

------
peterwwillis
Really he's talking about layer 4 load balancing, not 3, and assuming your
Juniper router has an Internet Processor II ASIC to juggle TCP flows. You're
still buying hardware to do the load balancing; you just use software to do
the BGP announce.

Honestly it all seems a bit crude and unreliable. If I'm writing a software
load balancer I'm not going to use curl, bash scripts, and pipes to do it. But
this is why devops people shouldn't be designing highly available traffic
control software.

~~~
mprovost
It's a cool hack because you probably already have bought the hardware to do
this. Most decent switches these days can run BGP (or at least OSPF, which is
capable of the same thing). And switches are usually way cheaper than hardware
load balancers, at least per-port. Sure it's not perfect but for a startup
trying not to spend a lot it can get you pretty far.

~~~
peterwwillis
I don't know. From an infrastructure perspective I don't like the idea of
having too many eggs in one basket, like combining the router/switch with the
lb vips. Ideally you'd get a couple commodity boxes and configure them with
LVS or pfsense or something. That way things like maintenance and access
control of different parts of your network are separated based on the
resource, and stability of one component won't necessarily affect another. It
would also probably be cheaper to buy a couple servers than buy a couple
routers/switches for redundancy.

~~~
phil21
You're not understanding how this works at all. Your router/switch doesn't
"combine" the vips.

Ignore the running of bgpd on the webserver - that's really an extremely
specific use-case that is not useful for most folks.

However, imagine your scenario where you have two routers, two switches, and
two load balancers configured in failover (LVS per your example) - with
webservers behind that stack.

Now you need more than a single load balancer's worth of capacity? How do you
scale it?

Generally, you're pretty much stuck doing DNS RR to load balance across VIPs,
and you add at least one VIP per load balancer you have. Need to do
maintenance? Good luck not directing traffic to the load balancer you want to
take out of service :) You can wait 3 days for all the DNS caches in the world
to purge, or you're going to be killing sessions when you fail that VIP over.

Now consider instead running BGPd vs. DNS RR.

You have a single VIP, and as many load balancers as you like. I enjoy
HAProxy, so I'll use it here.

    haproxy01  1.1.1.1
    haproxy02  1.1.1.2
    haproxy03  1.1.1.3

All these machines advertise the VIP to the switch they are connected to. You
set up path cost on your network so all these routes share an equal cost at
your routers, and your router will ECMP across them. It doesn't matter what
switch or rack your HAProxy box is connected to in your network (I suggest
paying attention for traffic management reasons) - as long as it can speak BGP
to the switch/router, the traffic flows.

Need to do maint? Kill bgpd on one of the HAProxy boxes. Current sessions stay
up, you go get a coffee, and when you come back you have an empty session
table and are free to do whatever you like to the machine. Turn it back on by
starting up bgpd, and watch your traffic instantly rebalance.
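
Sketching that drain, assuming an ExaBGP-style announcer as in the article and
a systemd unit name I made up; the idea is just "stop announcing, then wait
for the session table to empty":

    #!/usr/bin/env python
    # Rough maintenance-drain sketch: stop announcing the VIP, then poll the
    # established-connection count until existing flows have finished.
    import subprocess
    import time

    VIP = "192.0.2.10"   # illustrative VIP

    def established_sessions():
        # Count ESTABLISHED TCP sessions involving the VIP on this box.
        out = subprocess.check_output(["ss", "-tn", "state", "established"])
        return sum(1 for line in out.decode().splitlines() if VIP in line)

    # Stop announcing; "exabgp" as the unit name is an assumption.
    subprocess.call(["systemctl", "stop", "exabgp"])

    while True:
        n = established_sessions()
        if n == 0:
            break
        print("draining, %d sessions left" % n)
        time.sleep(10)
    print("session table empty - safe to work on this box")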

You are completely correct that if you don't need more than one load
balancer's worth of capacity (in a pair for HA), this solution is likely
overkill. But once you need to start scaling, you're quickly going to learn
that it will either be prohibitively expensive or come with lots of downsides.
There are pretty much no downsides to this architecture, other than needing
someone with a small amount of clue to operate it.

~~~
peterwwillis
What I was saying with that "combining" is that your router is now essentially
the VIP, in the sense that it is the load balancer and the peers are chosen
and routed to from it - as opposed to a normal router, which merely passes
routed traffic into the network and lets a different device handle load
balancing. The idea being that different devices are used for different
purposes, and separation of their functions may improve the overall stability
and increase the flexibility of your network services.

One downside here is that ECMP assumes all paths cost the same, which is
ridiculous in real-world load balancing. One of your haproxies is going to get
overloaded, and then traffic to your site is going to intermittently suck
balls as sessions stream into both under-loaded and over-loaded boxes.

Of course, you have the same problem with round-robin DNS to load balancers,
but in the case of a DR LVS load balancer for example, at least it's just
starting the connection and handing it off to the appropriate proxy instead of
randomly pinning sessions to specific interfaces. With DR it's the backend
proxy that determines its return path; the LVS VIP isn't in the path. With LVS
it can pick a destination path based on real-world load.

The other downside that you seem to gloss over with regard to scaling is the
maximum of 16 ECMP addresses in the forwarding table. I'm sure we'll never
need more than 16 of those, though... (For reference: the company I used to
work for had up to 23 proxies just for one application... might cause some
hiccups with this setup.)

Doing maintenance on a VIP address and doing maintenance on one of these BGP
peers work about the same. You stop accepting new connections, let old
connections expire, then take down the VIP. As for changing DNS records,
instead of that you can either add a hot-spare VIP with the IP of the one you
want to maintain, or add the IP of that VIP to an existing load balancer.

~~~
phil21
Ok, I understand what you meant now. I do disagree - it's simply doing what
routers do, and has no specific knowledge or configuration for the VIP. It's
simply forwarding traffic based on a destination table, just like any other
packet. If this were problematic in any way, your average backbone would
implode - ECMP is utilized extensively to balance busy peers. Also, routers
already do redundancy (at least via L3) extremely robustly - so it's basically
a "free" way to load balance your load balancers. You are simply not going to
get the same level of performance out of an LVS/DR solution, as it's competing
with very mature implementations done in silicon. We'll have to agree to
disagree here.

Of course in ECMP all paths are the same - I don't see this as a downside
though. Most router vendors do support ECMP weights if really needed, but
there are better ways to architect things. I've run this setup with over
1,500 Gb/s of Internet-facing traffic and never ran into a full 10G line,
because it was engineered properly. An in-house app that lowers my hashing
inputs would probably require a different setup though, I agree.

16-way ECMP is a decent number, but these days most routers I work with
support 32-way. Some are supporting 64-way now. But that's almost irrelevant,
unless you're stuffing all your load balancers on a single switch. It's
per-device, so you have 8 load balancers connected (and peering via BGP) to
one switch, 8 to another, and so on. Those then forward those routes up to the
router(s), which then ECMP from there (up to 16/32 downstream switches per
VIP). I've never needed more than "two levels" of this, so I haven't really
played with a sane configuration for more than 1024 load balancers for a
single VIP (or 512 in your 16-way case). It scales more than perhaps a dozen
companies in the world would need it to. Note that this explanation may sound
complicated, but in a well-engineered network (aka not a giant L2 broadcast
domain that spans the entire DC) it just happens without you even specifically
configuring for it.

Since my knowledge is dated - how do you "stop accepting new connections" with
the LVS/DR model? I'm sure you can, I just can't mentally model it at the
moment. You need to have the VIP bound to the host in question for the current
connections to complete; how do you re-route new connections to a different
physical piece of gear at the same time, utilizing the same VIP?

There are certainly downsides to this model as well; I don't want to pretend
it's the ultimate solution. But it's generally leaps and bounds better than
any vendor trying to sell you a few million dollars of gear to do the same
job. The biggest downside to ECMP-based load balancing is the hash
redistribution after a load balancer enters/leaves the pool. I know some
router vendors support persistent hashing, but my use case didn't make this a
huge problem. There are of course ways to mitigate this as well, but they get
complicated.

In the end, for the scale you can achieve with this, the simplicity is
absolutely wonderful. It's one of those implementations you look at when
you're done and say "this is beautiful", since there are no
horrible-to-troubleshoot things doing ARP spoofing and other fuckery on the
network to make it work. ECMP+BGP is what you get: you can traceroute, look at
route tables, etc., and that displays reality with no room for confusion. No
STP debugging to be found anywhere :)

------
mjolk
This is a cool setup but, with the caveat that Allan stated, it forces you to
think a little more about a layer that most systems people are less
experienced in. The software approach is particularly useful because one could
take the "healthcheck" setup and have it keep your alerting/dashboards in sync
with reality (e.g. do healthcheck, fork: return exit code; POST {$hostname:
'ok'} to metric collector).
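
A sketch of that pattern, with a made-up collector endpoint and a plain TCP
check standing in for whatever the real healthcheck does:

    #!/usr/bin/env python
    # Health check that doubles as a dashboard feed: run the check, exit 0/1
    # for the route announcer, and POST the result to a metrics collector.
    # The collector URL, VIP, and payload shape are illustrative assumptions.
    import json
    import socket
    import sys
    import urllib.request

    VIP = "192.0.2.10"
    COLLECTOR = "http://metrics.example.com/api/v1/health"  # hypothetical

    def haproxy_up():
        try:
            socket.create_connection((VIP, 80), timeout=2).close()
            return True
        except OSError:
            return False

    ok = haproxy_up()

    # Best-effort report; never let the metrics path break the check itself.
    try:
        payload = json.dumps({socket.gethostname(): "ok" if ok else "down"})
        req = urllib.request.Request(COLLECTOR, data=payload.encode(),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=2)
    except Exception:
        pass

    sys.exit(0 if ok else 1)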

I also see that Shutterstock is actively hiring. For anyone looking,
Shutterstock is a great place to work and employs some really brilliant
people.

Disclaimer: Ex-Shutterstock employee

------
jdubs
I can boil water for tea in my oven, but should I?

This is totally nonstandard, and it will be a nightmare to document and
difficult to hand off to a new teammate.

~~~
subway
This is becoming increasingly standard, and is quite easy to document.

~~~
yelloblac
+1

------
transitorykris
This can shift complexity elsewhere in your stack. A couple of points to add.

Be mindful of the specific routing hardware you're using:

Announcing and withdrawing prefixes can cause the router to select new next
hops (i.e. servers). This is mostly a problem with TCP and other
connection-oriented protocols (or even connectionless ones if you're expecting
a client to be sticky to a server).

You may also lose the ability to do unequal-cost load balancing.

~~~
yelloblac
I think that's an important catch: flow breaking can happen depending on what
hardware you're using and how you swap in next hops. It definitely requires a
deep understanding of how flows are hashed.

~~~
phil21
Agreed. Also, for many applications of this you might just not care if a flow
gets broken. It really depends on your use case.

Persistent hashing is key if you do care about it, though; I honestly don't
know which routers do this off the top of my head :)

------
jauer
You can also do this (Equal Cost MultiPath to servers) without a dynamic
routing protocol, but you are at the mercy of whatever health checks your
top-of-rack switch supports.

On Cisco switches you can use an IP SLA check to monitor for DNS replies from
a DNS server and then have a static route that tracks the SLA check. If your
DNS server stops responding, the route is withdrawn and traffic routed away.
This can happen within a few seconds. Slides from a NANOG talk about this
(PDF):
[http://www.nanog.org/meetings/nanog41/presentations/Kapela-l...](http://www.nanog.org/meetings/nanog41/presentations/Kapela-lightning.pdf)

------
nonuby
>"it is actually more of a load-balance per-flow since each TCP session will
stick to one route rather than individual packets going to different backend
servers."

This strikes me as expensive: does this mean packets no longer pass through
the ASIC-only side of a router, and thus the software in the router has to do
some of the heavy lifting, limiting the capacity/throughput to a mere fraction
of what the router is really capable of?

disclaimer: I have only a high-level overview of router tech

------
dmourati
Been running software load balancers for over a decade. I started with LVS
(Linux Virtual Server), now called IPVS. Now we run HAProxy and we're looking
at Apache Traffic Server.

Some of the load balancers have even run BGP as called out in the OP. Nothing
really fancy but enough to be interesting.

One of the coolest things I built was a Global Server Load Balancer to balance
load balancers. We needed it initially to move data centers. It was built on
top of PowerDNS and a ketama hash.
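
Not the actual implementation, but a minimal sketch of the ketama idea behind
such a GSLB: hash each datacenter onto a ring many times, hash the resolver's
IP onto the same ring, and hand back the first datacenter clockwise
(datacenter codes here are hypothetical):

    # Minimal ketama-style consistent hash: map resolver IPs to datacenters so
    # that adding/removing a datacenter only moves a small slice of clients.
    import bisect
    import hashlib

    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class KetamaRing:
        def __init__(self, datacenters, points_per_dc=160):
            self.ring = sorted(
                (_hash("%s#%d" % (dc, i)), dc)
                for dc in datacenters
                for i in range(points_per_dc)
            )
            self.keys = [h for h, _ in self.ring]

        def lookup(self, client_ip):
            if not self.ring:
                raise ValueError("no datacenters")
            idx = bisect.bisect(self.keys, _hash(client_ip)) % len(self.ring)
            return self.ring[idx][1]

    ring = KetamaRing(["iad", "sjc", "ams"])   # hypothetical datacenter codes
    print(ring.lookup("203.0.113.7"))          # a DNS backend could answer
                                               # with this DC's VIP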

------
vhost-
As someone who works with NetScalers on the regular, I have to say I love this
idea. Citrix support is terrible and NetScalers are such a pain to configure.
Then I see the bill and it frosts the cake.

We recently upgraded from version 9 to version 10 and it took down our
production site because of some asinine undocumented rate limiting they
"finally enforced" in version 10.

I'd like to play with software load balancing in the testing facility.

------
mseebach
_Even though the above says load-balance per-packet, it is actually more of a
load-balance per-flow since each TCP session will stick to one route rather
than individual packets going to different backend servers. As far as I can
tell, the reasoning for this stems from legacy chipsets that did not support a
per-flow packet distribution._

Is this not fairly risky? It's essentially relying on a bug?

------
contingencies
Yeah, I spent last weekend configuring Cisco gear for something I feel
basically should have been done in software on Linux. The era of the hardware
firewall slash load balancer is over. Buy a dedicated box (or two) and
configure it .. it's faster and more predictable/reliable.

------
techprotocol
AWS's offering, which is software-based:
[http://aws.amazon.com/elasticloadbalancing/](http://aws.amazon.com/elasticloadbalancing/)

~~~
Xorlev
Doesn't work for datacenters. It's also implemented with round-robin DNS to
nodes (1-N, check X-Forwarded-For) in each AZ, which then handle the
balancing.

Also worth noting that unless you turn on cross-zone balancing, if an AZ
doesn't have a node in it and the RR DNS points clients at that AZ, they'll be
turned away. Additionally, without it you need to scale by multiples of the
AZs you run in, otherwise you'll have unbalanced traffic.

On another note, I've always been curious whether they're just abstractions
around HAProxy at the per-node level.

~~~
skyebook
FWIW, Amazon recently added the ability to load balance across AZs (still
round-robin though).

~~~
Xorlev
Very true. Cross-AZ load balancing works quite well. I believe Amazon has said
it's RR across the servers with the least connections, but degenerates to a
simple RR without many nodes per AZ.

