
True Zero-Downtime HAProxy Reloads - meatmanek
http://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html
======
lobster_johnson
This was more complicated than I hoped it would be. Is there any reason
HAProxy couldn't simply add a way to reload its config file while running?

It already supports a bunch of commands to modify the config while it is
running, via the Unix socket, but there's no way to pipe an entire config file
to that socket.

It wouldn't work for version upgrades, of course, but that's not really the
main use case.

Another option could be to start up a parallel HAProxy that listens to a
different port. Then use iptables to route traffic to that port instead of the
old one. Not sure if this would break connections, though, or if there is a
way to prevent breaking connections.

~~~
moe
_Another option could be to start up a parallel HAProxy that listens to a
different port._

That's how we do it and the script isn't even complicated.

Use iptables to shift the traffic to the new instance, wait for the old
instance to drain (netstat is your friend), and kill it.
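
For illustration, a rough sketch of that flow, not our actual script; the
ports, the pidfile location, and the use of an iptables REDIRECT rule are all
assumptions:

    import os, signal, subprocess, time

    OLD_PORT, NEW_PORT, FRONT_PORT = 10001, 10002, 80   # assumed port layout
    OLD_PIDFILE = "/var/run/haproxy-old.pid"             # assumed pidfile of the old instance

    def redirect(action, port):
        # Insert ("-I") or delete ("-D") a NAT rule sending FRONT_PORT to a local port.
        subprocess.run(["iptables", "-t", "nat", action, "PREROUTING",
                        "-p", "tcp", "--dport", str(FRONT_PORT),
                        "-j", "REDIRECT", "--to-ports", str(port)], check=True)

    redirect("-I", NEW_PORT)   # new rule matches first, so traffic shifts to the new instance
    redirect("-D", OLD_PORT)   # remove the old rule

    def old_established():
        # Count established connections still held by the old instance.
        out = subprocess.run(["ss", "-tn", "state", "established",
                              "(", "sport", "=", f":{OLD_PORT}", ")"],
                             capture_output=True, text=True).stdout
        return len(out.strip().splitlines()) - 1   # minus the header line

    while old_established() > 0:   # wait for the old instance to drain
        time.sleep(1)

    with open(OLD_PIDFILE) as f:   # then kill it
        os.kill(int(f.read().split()[0]), signal.SIGTERM)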

Much respect to the effort that went into the Yelp solution, but it does feel
a bit brittle and over-engineered for such a simple problem.

~~~
snowman17
The solution you mention has some tradeoffs. Imagine that you have hundreds of
services, each of which has a port and some long-running connections. With
your solution you can only restart the load balancer as fast as the
longest-running connections drain, or until you run out of ports, whichever
comes first.

You'd also then have to maintain the logic to do the port mapping, iptables
switching, draining, etc.

The post mentions they considered this option and they claim that they were
worried about engineering risk; seems reasonable to be worried about that.

~~~
moe
_With your solution you can only restart the load balancer at the rate of the
longest running connections_

Well, perhaps I should have provided more detail. Our ansible deployment
script actually counts up from a base port until it finds a free one (we keep
a port range reserved for this purpose). So if there are multiple changes in
rapid succession then more than two haproxy instances may linger around for a
while.

The "port discovery" is a shell-task that registers the port to be used as a
variable, which is then used by the templates of the haproxy and iptables
roles.

The cleanup is done by a 15min-cronjob which kills all haproxy instances that
have no connections in netstat and don't match the haproxy pidfile.
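
A rough sketch of what that port-discovery task boils down to (the range below
is made up):

    import socket

    RESERVED_RANGE = range(10800, 10900)   # assumed reserved port range

    def find_free_port():
        for port in RESERVED_RANGE:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                try:
                    s.bind(("0.0.0.0", port))   # bind only succeeds if the port is free
                except OSError:
                    continue
                return port
        raise RuntimeError("reserved port range exhausted")

    print(find_free_port())   # this value feeds the haproxy and iptables templates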

------
click170
My favorite part was when you were rewarded for scouring the source code by
finding an undocumented qdisc; that was golden. :)

You mentioned that your colleague had produced a patch that did file handle
passing? Is that patch available? My first thought on reading that was: how
many other people have tried that very task? I'm wondering if it would be
helpful to HAProxy if your patch was made available.

Thank you for sharing the plug qdisc, this is hugely useful in any number of
situations. Did you find any others by chance?

I can see why some would say this solution is brittle, but actually I quite
like it. It feels elegant and The Right Way to do it.

~~~
josnyder
Yep, I found a number of them, but I promptly forgot them in excitement when I
found the one I actually wanted. If you want to check them out yourself, they
appear to congregate in the kernel tree at net/sched/sch_* [1].

The only other qdisc I have much experience with is sch_netem, which emulates
behaviors of a WAN (delay, loss, etc). I used it in this post [2] to conduct
adversarial testing of MySQL replication (search 'tc qdisc').

[1] [http://lxr.free-electrons.com/source/net/sched/](http://lxr.free-electrons.com/source/net/sched/)

[2] [http://engineeringblog.yelp.com/2014/03/mysql-replication-ne...](http://engineeringblog.yelp.com/2014/03/mysql-replication-network-issues-and-why-you-might-want-to-upgrade.html)

------
wereHamster
How about using IPVS? To reload HAProxy, start a new instance, add it as a new
IPVS real server, set the weight of the old HAProxy to zero, wait until its
connection count drops to zero, and then remove the old HAProxy real server.

IPVS also lets you gradually shift traffic from one instance to the other. So
if you see elevated error reports from the new server, you can quickly drop it
and go back to the old setup.
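
Roughly, with ipvsadm (addresses, ports, and the forwarding method are made-up
examples; this assumes the virtual service already exists with the old HAProxy
as a real server):

    import subprocess, time

    VIP = "10.0.0.1:80"                             # assumed virtual service address
    OLD, NEW = "127.0.0.1:8001", "127.0.0.1:8002"   # old and new HAProxy instances

    def ipvs(*args):
        subprocess.run(["ipvsadm", *args], check=True)

    ipvs("-a", "-t", VIP, "-r", NEW, "-m", "-w", "1")   # add the new HAProxy (NAT mode)
    ipvs("-e", "-t", VIP, "-r", OLD, "-w", "0")         # weight 0: no new connections to the old one

    def old_active_conns():
        # Parse the ActiveConn column for the old real server from `ipvsadm -L -n`.
        out = subprocess.run(["ipvsadm", "-L", "-n"], capture_output=True, text=True).stdout
        for line in out.splitlines():
            if OLD in line:
                return int(line.split()[-2])
        return 0

    while old_active_conns() > 0:   # wait for the old instance to drain
        time.sleep(1)

    ipvs("-d", "-t", VIP, "-r", OLD)   # then remove it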

~~~
jolynch
I am not familiar with IPVS, I'll look into it.

From your description, though, I imagine it might have the same drawbacks that
exist with the multiple-HAProxy + iptables swap solution. With a solution
where we run multiple HAProxies and move traffic between them, I worry about
port exhaustion, long-running connections, and the maintenance costs of the
infrastructure. It could just be the case that we're talking about different
problems. At Yelp we have hundreds of services that are added, changed, moved,
and removed from physical machines, causing our internal HAProxies (which
listen on hundreds of ports, many with long-running connections) to reload
pretty constantly.

As with all engineering decisions, I could very easily be wrong and be
overestimating the complexity involved in the multiple-HAProxy-instance
solution. At the end of the day I made a call based on the data available, and
I decided the solution I talked about in the blog post was the lowest risk.

------
jpgvm
Everyone seems to have overlooked the fact that this only works for outgoing
traffic; you can't use this strategy on a public-facing load balancer without
additional work.

Specifically you will need to accept the traffic on one interface and forward
to a new local interface in order to apply the qdisc plug.

This is cool, but the above is still a pain in the neck.

~~~
jolynch
Totally agree, I mention this as the largest drawback of this approach in the
post.

------
gawry
Don't the Unix socket commands work without dropping connections?

[http://cbonte.github.io/haproxy-dconv/configuration-1.5.html...](http://cbonte.github.io/haproxy-dconv/configuration-1.5.html#9.2)

~~~
lobster_johnson
Yes, but you can only apply small updates. You can't reload the whole config.
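
To illustrate the kind of "small updates" the socket supports (the socket
path, backend, and server names here are made up; admin-level commands need a
`stats socket ... level admin` line in the config):

    import socket

    def haproxy_cmd(cmd, sock_path="/var/run/haproxy.sock"):
        # Send one command to the HAProxy stats socket and read the full reply.
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.connect(sock_path)
            s.sendall(cmd.encode() + b"\n")
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks).decode()

    haproxy_cmd("disable server my_backend/srv1")   # drain a single backend server
    haproxy_cmd("set weight my_backend/srv2 50")    # tweak a server's weight
    print(haproxy_cmd("show stat"))                 # but there is no "load this new config file"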

------
ksri
Why not simply let DNS handle the routing and load balancing?

Bring up another HAProxy with the new configuration. Then swap the new and old
IP addresses in DNS. Wait for DNS to propagate. When traffic on the old
HAProxy is zero, bring it down.

~~~
mrphoebs
DNS is a fragile system to use within a DC for service/endpoint discovery,
because DNS tends to be a single point of failure.

Though the DNS system can inherently be resilient, within a DC most people
only operate a single DNS server, because a hierarchical domain scheme and DNS
setup within a DC is too cumbersome and is much less reactive to endpoint
changes. E.g. changing a service endpoint in an emergency takes way too long.

A single DNS server still means propagation delays (local caches) and a single
point of failure at critical moments (when there is a thundering herd).

~~~
hueving
That's a pretty weak argument. It's like calling a database a single point of
failure because the people who set them up tend to be too lazy to set up HA.

------
cagenut
wow, ton of respect for solving that and proving it out so clearly.

but also, soooo glad to have 'real' loadbalancers and not have had to do this.

------
nodesocket
When you run `/etc/init.d/nginx reload`, I was under the impression that this
is zero-downtime. Is that not true, then?

~~~
fweespeech
If you truly meant Nginx and not HAProxy, Nginx does its best to avoid
dropping connections on a reload:

[http://nginx.org/en/docs/control.html](http://nginx.org/en/docs/control.html)

> In order for nginx to re-read the configuration file, a HUP signal should be
> sent to the master process. The master process first checks the syntax
> validity, then tries to apply new configuration, that is, to open log files
> and new listen sockets. If this fails, it rolls back changes and continues
> to work with old configuration. If this succeeds, it starts new worker
> processes, and sends messages to old worker processes requesting them to
> shut down gracefully. Old worker processes close listen sockets and continue
> to service old clients. After all clients are serviced, old worker processes
> are shut down.

It brings up the new worker processes to handle new connections and waits for
the existing connections to terminate. The key phrase is "does its best": I've
seen edge cases that aren't really Nginx's fault cause connections to be
dropped (e.g. OOM issues when it tries to bring up new workers).
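
For reference, the reload described above is just a HUP to the master process;
a minimal sketch (the pidfile path is an assumption, it varies by distro):

    import os, signal

    with open("/var/run/nginx.pid") as f:   # assumed default pidfile location
        master_pid = int(f.read().strip())

    # Master re-reads the config, starts new workers, and gracefully retires old ones.
    os.kill(master_pid, signal.SIGHUP)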

------
donflamenco
I'm not familiar with HAProxy, but I am with other load balancers. I've been
looking into it as a replacement for the hardware load balancers we use.

Why do you have to reload haproxy? When you update the configuration?

I reload nginx all the time (nginx -s reload) and I'm not sure if that is a
true zero-downtime reload either.

Interesting hack nonetheless (stopping SYNs).

~~~
fxx2
Reloading HAProxy is part of the SmartStack architecture, IIRC.
SmartStack/synapse polls, say, ZooKeeper or Docker or whatever, generates a
new HAProxy configuration, and then reloads it.

~~~
e12e
That's an important detail, as it explains why the tests aren't entirely
flawed: Testing 10 reloads every second really distorts the numbers, assuming
a reload every hour, or every few hours is more realistic. But if reloads are
part of what amounts to doing live testing on a large part of live traffic --
then the exercise makes more sense.

And to be sure: being _able_ to do ten reloads every second with few ill
effects enables different, more nimble systems engineering.

But if we assume 2000 requests per second, per box, fighting ~100 reset
connections a day (assuming two ha-proxy reconfigs) doesn't really seem worth
the effort; packet loss and other outages would probably(?) dominate anyway.
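
Back-of-the-envelope on those numbers:

    reqs_per_day = 2000 * 86400            # = 172,800,000 requests per box per day
    reset_fraction = 100 / reqs_per_day    # ~ 5.8e-7, i.e. roughly 0.00006% of requests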

~~~
eropple
_> Testing 10 reloads every second really distorts the numbers, assuming a
reload every hour, or every few hours is more realistic._

It depends on what you do. I've seen shops (successfully and, IMO, correctly)
scaling AWS instances for services with a threshold of every fifteen minutes,
and I've seen Mesos clusters dynamically spinning up web instances much more
nimbly than that (think every two minutes under spiky load; the instances
would come up in five seconds, so it didn't hurt to down them).

~~~
e12e
Well, once every 120 seconds is still quite a leap from 10 every second...

~~~
eropple
Sure, but if it doesn't work there, I don't trust it to work if, say, a piece
of my scheduler goes nuts and suddenly starts upping and downing containers
every few seconds. The problem remains; it's just not as acute, and it still
must be fixed.

------
mobiplayer
I'm afraid I'm also of the feeling that this is too complex, especially moving
forward with the business. It was probably fun to work on and it solved your
issue, but somehow it looks like a patch for an architecture issue.

Does this add extra complexity to platform (software) updates?

------
pitr
Does this issue exist in servers like Unicorn? A new instance forks from the
old one, inheriting sockets, and starts handling requests.

[http://unicorn.bogomips.org/SIGNALS.html](http://unicorn.bogomips.org/SIGNALS.html)

~~~
thwarted
I suspect inherited sockets across forks don't exhibit this, if a test done by
Nicholas Grilly described on the haproxy mailing list [0] is to be believed
(no code though, but shouldn't be hard to reproduce).

One thing that makes doing this in haproxy difficult is that there is not
really any shared state between the parent and child processes, so the child
doesn't have a good way to know which file descriptor maps to which listening
endpoint in the configuration, since the child pretty much throws its entire
state away and reads the config file anew. It's not that there can't be any
shared state, but that's not how it's been architected. Finding out the
endpoint via something like getsockname(2) might be doable, but the mapping of
listening endpoints to listen configuration blocks isn't one-to-one, so it's
actually "safer" (from an amount of code standpoint) to use SO_REUSEPORT and
let the OS handling the shared listening.

[0]
[http://permalink.gmane.org/gmane.comp.web.haproxy/14143](http://permalink.gmane.org/gmane.comp.web.haproxy/14143)
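
A minimal illustration of the SO_REUSEPORT behavior being relied on (address
and port are arbitrary): run this script twice and both processes bind the
same port, with the kernel spreading incoming connections between them, which
is what lets a new HAProxy bind its listeners while the old one is still
running.

    import os, socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)   # Linux >= 3.9
    s.bind(("0.0.0.0", 8080))
    s.listen(128)

    conn, addr = s.accept()
    print("accepted", addr, "in pid", os.getpid())
    conn.close()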

------
halayli
My gut feeling is that this approach can be prone to silent misbehavior that
they cannot detect.

There's no reason not to patch HA to reload its config and reapply it.

~~~
meatmanek
We're using this for internal load balancing, so we control both ends of the
connection. If this started misbehaving, we'd see timeouts, dropped
connections, or errors, which would show up in logs.

~~~
halayli
I see. Any reason not to patch HA to reload config?

~~~
jolynch
Patching HAProxy to reload config is really hard. There have been ideas,
patches, and discussions on the HAProxy mailing list for a few years now
trying to get zero downtime reloads natively supported in HAProxy, but the
reality is that it just is not as easy as it might seem.

For more details check out the mailing list:
[http://marc.info/?l=haproxy](http://marc.info/?l=haproxy)

