True Zero-Downtime HAProxy Reloads (yelp.com)
128 points by meatmanek on Apr 13, 2015 | 49 comments

This was more complicated than I hoped it would be. Is there any reason HAProxy couldn't simply add a way to reload its config file while running?

It already supports a bunch of commands to modify the config while it is running, via the Unix socket, but there's no way to pipe an entire config file to that socket.

It wouldn't work for version upgrades, of course, but that's not really the main use case.

Another option could be to start up a parallel HAProxy that listens to a different port. Then use iptables to route traffic to that port instead of the old one. Not sure if this would break connections, though, or if there is a way to prevent breaking connections.

> Another option could be to start up a parallel HAProxy that listens to a different port.

That's how we do it and the script isn't even complicated.

Use iptables to shift the traffic to the new instance, wait for the old instance to drain (netstat is your friend), and kill it.

Much respect to the effort that went into the Yelp solution, but it does feel a bit brittle and over-engineered for such a simple problem.
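For anyone trying to picture the iptables switch described in this comment, here is a minimal Python sketch that only builds the commands and does not run them. The ports (80, 8001, 8002), the function names, and the choice of the nat table's PREROUTING chain with a REDIRECT target are illustrative assumptions, not the commenter's actual script. NAT rules are only consulted for the first packet of a connection, so flows already tracked by conntrack should keep going to the old instance while it drains.

```python
def redirect_rule(action: str, public_port: int, instance_port: int) -> list[str]:
    """Build an iptables command redirecting public_port to instance_port.

    action is "-A" (append) or "-D" (delete). PREROUTING only sees traffic
    arriving from other hosts; locally generated traffic would need an
    equivalent rule in the nat OUTPUT chain (assumption: remote clients).
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be -A or -D")
    return [
        "iptables", "-t", "nat", action, "PREROUTING",
        "-p", "tcp", "--dport", str(public_port),
        "-j", "REDIRECT", "--to-ports", str(instance_port),
    ]


def switch_commands(public_port: int, old_port: int, new_port: int) -> list[list[str]]:
    """Commands that move new connections from the old instance to the new one.

    The new rule is appended first, so the old rule keeps matching until it
    is deleted and there is no moment without a matching rule.
    """
    return [
        redirect_rule("-A", public_port, new_port),
        redirect_rule("-D", public_port, old_port),
    ]


if __name__ == "__main__":
    # Hypothetical ports: public 80, old instance 8001, new instance 8002.
    for cmd in switch_commands(80, 8001, 8002):
        print(" ".join(cmd))
```

Waiting for the old instance to drain and killing it is left out; the comment does that by watching netstat.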

The solution you mention has some tradeoffs. Imagine that you have hundreds of services, each of which has a port and some long running connections. With your solution you can only restart the load balancer at the rate of the longest running connections or whenever ports run out, whichever comes first.

Also then you'd have to maintain the logic to do the port mapping, iptables switching, draining, etc ...

The post mentions they considered this option and they claim that they were worried about engineering risk; seems reasonable to be worried about that.

> With your solution you can only restart the load balancer at the rate of the longest running connections

Well, perhaps I should have provided more detail. Our ansible deployment script actually counts the port up until it finds a free one (we keep a port-range reserved for this purpose). So if there are multiple changes in rapid succession then more than two haproxy instances may dangle around for a while.

The "port discovery" is a shell-task that registers the port to be used as a variable, which is then used by the templates of the haproxy and iptables roles.

The cleanup is done by a 15min-cronjob which kills all haproxy instances that have no connections in netstat and don't match the haproxy pidfile.
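As a rough illustration of the "count the port up until it finds a free one" step and the cleanup rule, here is a small Python sketch. The actual setup described above uses an ansible shell task and a cron job; the function names and the port range in the usage lines are made up for the example. Note that probing by binding is racy: another process could take the port between the probe and the real HAProxy start.

```python
import socket


def find_free_port(start: int, end: int, host: str = "127.0.0.1") -> int:
    """Return the first port in [start, end] that can currently be bound."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken, try the next one
            return port
    raise RuntimeError(f"no free port in {start}-{end}")


def stale_instances(pids, pidfile_pid, connections_by_pid):
    """Pids that are safe to kill: not the current instance, no connections.

    connections_by_pid maps pid -> open connection count, e.g. parsed from
    netstat output (the parsing itself is not shown here).
    """
    return [
        pid for pid in pids
        if pid != pidfile_pid and connections_by_pid.get(pid, 0) == 0
    ]


if __name__ == "__main__":
    # Hypothetical reserved range for parallel instances.
    print(find_free_port(20000, 20100))
    print(stale_instances([10, 11, 12], pidfile_pid=12,
                          connections_by_pid={10: 3, 11: 0}))
```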

The HAProxies in question are dealing with at least one port per service, with dozens of services. If you were to remap ports, you'd end up with a lot of port mapping rules.

We also considered using different loopback IPs per HAProxy, so ports could stay consistent, but decided against it:

- We have other things (like scribe) listening on the same loopback IP address, so we'd either have to move those to different IPs or exclude them from iptables rules.

- We thought it could be confusing/misleading to see HAProxy listening on one IP/port, but connections being made to a different IP/port.

There has been ongoing work in HAProxy for many years to support zero downtime reloads via a few different potential mechanisms (file descriptor passing, a socket server, etc ...). Unfortunately it turns out that it is really hard given the architecture of HAProxy. That being said, I'm sure patches are always welcome.

The post does mention that they did consider something similar to what you are suggesting with the multiple HAProxy instances but decided against it due to engineering uncertainties. Could be that they just overestimated how hard it would be.

> This was more complicated than I hoped it would be

I also hoped that it would be less complicated, but after prototyping a few different options, many of which I mention in the blog post (e.g. a patch to do fd passing, huptime, or running multiple parallel HAProxies), I decided that this was the lowest risk given the engineering constraints that we had. Also I wasn't entirely sure of an implementation plan for how to cleanly do port mappings for the hundreds of services. For example our service provisioning system adds, changes, moves, or removes services in various Yelp environments dynamically, so the port mapping logic would have to stay in sync with that. It is certainly possible but I stand by my concern.

To put this in perspective, I got this working in production at Yelp in about two or three days, and it has not had a single issue since. With this solution we apply some TC rules and some iptables rules once, and then the act of restarting is just a set of TC commands. It solves all of our use cases and does so without significant infrastructure that someone else has to maintain two years from now.

If a month from now HAProxy releases native zero downtime restarts, I don't feel bad about throwing all of this away.
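The "plug, reload, release" sequence can be sketched as a small Python wrapper. This is only an outline under assumptions: the three commands are passed in by the caller because the exact TC invocations are in the blog post, not in this thread, and the names `plugged` and `reload_behind_plug` are invented for the example. The one design point it tries to show is that the plug should be released even if the reload fails, otherwise new connections would stay queued.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def plugged(plug_cmd, release_cmd, run=subprocess.check_call):
    """Hold new connections in the queue for the duration of the block."""
    run(plug_cmd)
    try:
        yield
    finally:
        run(release_cmd)  # always release, even if the body raised


def reload_behind_plug(plug_cmd, reload_cmd, release_cmd,
                       run=subprocess.check_call):
    """Plug, reload the proxy, then release the plug."""
    with plugged(plug_cmd, release_cmd, run):
        run(reload_cmd)


if __name__ == "__main__":
    # Placeholder commands; substitute the real plug/reload/release commands.
    reload_behind_plug(["echo", "plug"], ["echo", "reload"], ["echo", "release"])
```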

I think you can use -m state --state ESTABLISHED,RELATED to match existing tcp flows.

But that builds on conntrack; couldn't you get in trouble if this HAProxy is public facing and handles a gazillion requests?

Probably, but realistically, it's probably fine for most shops.

Most shops don't handle a gazillion requests. And if you don't handle a gazillion requests, you don't really care about 20ms of downtime while HAProxy is reloading?

Another approach: have haproxy fork/exec a child, inheriting the same file descriptor for the listening socket, then have the child tell the parent when it should stop accepting connections on that socket.

No idea how that works in practice, but it seems like a sound concept? :)

Yes, that's how Unicorn works. But HAProxy isn't designed to inherit another process's current state (see this other comment: https://news.ycombinator.com/item?id=9371064), and it's probably quite complicated and error-prone.

If you're going down that route, it's probably much easier to write the code required to simply re-read the config and apply a diff to the internal data structures.

A less fragile approach would be for the old process to simply pass the listen fd to the new process over a unix socket. You could also pass any active connections in this way.
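To make the fd-passing idea concrete, here is a minimal Python sketch (Python 3.9+ on a Unix system, which is where `socket.send_fds` and `socket.recv_fds` are available). It says nothing about how HAProxy would do this internally; the function names are invented, and `demo()` plays both the "old" and "new" process inside a single process purely so the example is runnable.

```python
import socket


def send_listener(channel: socket.socket, listener: socket.socket) -> None:
    """Old process: pass the listening socket's fd over a Unix socket."""
    socket.send_fds(channel, [b"listener"], [listener.fileno()])


def recv_listener(channel: socket.socket) -> socket.socket:
    """New process: receive the fd and wrap it in a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(channel, 1024, 1)
    return socket.socket(fileno=fds[0])  # family/type detected from the fd


def demo() -> bytes:
    """Single-process stand-in for the old and new processes."""
    old_end, new_end = socket.socketpair()  # AF_UNIX pair on Unix systems
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]

    send_listener(old_end, listener)
    inherited = recv_listener(new_end)
    listener.close()  # the "old process" lets go; the passed fd stays open

    # A client connecting now is accepted through the inherited socket.
    client = socket.create_connection(("127.0.0.1", port))
    conn, _peer = inherited.accept()
    client.sendall(b"hi")
    data = conn.recv(2)
    for s in (client, conn, inherited, old_end, new_end):
        s.close()
    return data


if __name__ == "__main__":
    print(demo())
```

The listening socket is never closed from the kernel's point of view during the handover, which is the property this approach is after.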

My favorite part was when you were rewarded for scouring the source code by finding an undocumented qdisc, that was golden. :)

You mentioned that your colleague had produced a patch that did file handle passing? Is that patch available? My first thought on reading that was to wonder how many other people have tried that very task. I'm wondering if it would be helpful to HAProxy if your patch was made available?

Thank you for sharing the plug qdisc, this is hugely useful in any number of situations. Did you find any others by chance?

I can see why some would say this solution is brittle, but actually I quite like it. It feels elegant and The Right Way to do it.

Yep, I found a number of them, but I promptly forgot them in excitement when I found the one I actually wanted. If you want to check them out yourself, they appear to congregate in the kernel tree at net/sched/sch_* [1].

The only other qdisc I have much experience with is sch_netem, which emulates behaviors of a WAN (delay, loss, etc). I used it in this post [2] to conduct adversarial testing of MySQL replication (search 'tc qdisc').

[1] http://lxr.free-electrons.com/source/net/sched/

[2] http://engineeringblog.yelp.com/2014/03/mysql-replication-ne...

Josh gets all the credit for finding the plug qdisc. I was originally planning on using netem and having a fixed delay of 50ms or something, but Josh decided that wasn't elegant enough and went looking for a better solution.

As for the patch, I can certainly ask John but I do know that the proof of concept was written during a hackathon so I imagine there could either be fundamental flaws or just need a lot of work to get ready to merge. From reading the HAProxy mailing list I think they have been working on this for a while, but the issue seems to be merge risk (it's a fairly large architecture change). I'll mention it to John though.

I'm glad you think it's elegant :-) I tried hard to find a solution that was minimally invasive (no code changes, no significant infrastructure, etc ...).

How about using IPVS? To reload HAProxy, start a new instance, add as a new IPVS real server, set the weight of the old HAProxy to zero, wait until the connection count drops to zero, and then you can remove the old HAProxy real server.

IPVS also lets you gradually shift traffic from one instance to the other. So if you see elevated error reports from the new server, you can quickly drop it and go back to the old setup.
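A hedged sketch of what that gradual shift could look like, written as a Python function that only builds `ipvsadm` command lines without running them. The virtual service and real server addresses, the weight steps, and the masquerading (`-m`) forwarding mode are assumptions for illustration, and the weights only split traffic proportionally if the virtual service uses a weighted scheduler such as wrr or wlc.

```python
def ipvs_shift_commands(vip, old_rs, new_rs, steps=(25, 50, 75, 100)):
    """Build ipvsadm commands shifting weight from old_rs to new_rs.

    A real server with weight 0 is quiescent: it receives no new
    connections but keeps serving the ones it already has.
    """
    # Add the new instance with weight 0 so it takes no traffic yet.
    cmds = [["ipvsadm", "-a", "-t", vip, "-r", new_rs, "-m", "-w", "0"]]
    for new_weight in steps:
        cmds.append(["ipvsadm", "-e", "-t", vip, "-r", new_rs,
                     "-m", "-w", str(new_weight)])
        cmds.append(["ipvsadm", "-e", "-t", vip, "-r", old_rs,
                     "-m", "-w", str(100 - new_weight)])
    return cmds


def remove_command(vip, old_rs):
    """Command to delete the old real server once it has drained."""
    return ["ipvsadm", "-d", "-t", vip, "-r", old_rs]


if __name__ == "__main__":
    # Hypothetical addresses for the virtual service and the two instances.
    for cmd in ipvs_shift_commands("10.0.0.1:80", "10.0.0.2:8001", "10.0.0.2:8002"):
        print(" ".join(cmd))
    print(" ".join(remove_command("10.0.0.1:80", "10.0.0.2:8001")))
```

Rolling back is the same loop in reverse: raise the old server's weight and set the new one to zero.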

I am not familiar with IPVS, I'll look into it.

From your description though I imagine it might have the same drawbacks that exist with the multiple HAProxy + iptables swap solution. With a solution where we run multiple HAProxies and move traffic between them I worry about port exhaustion, long running connections, and the maintenance costs of the infrastructure. It could just be the case that we're talking about different problems. At Yelp we have hundreds of services that are added, changed, moved, and removed from physical machines causing our internal HAProxies (which listen on hundreds of ports and many have long running connections) to reload pretty constantly.

As with all engineering decisions, I could very easily be wrong and be overestimating the complexity involved with the multiple HAProxy instance solution. At the end of the day I made a call based on the data available, and I decided the solution I talked about in the blog post was lowest risk.

Everyone seems to have overlooked the fact this only works for outgoing traffic, you can't use this strategy on a public facing load-balancer without additional work.

Specifically you will need to accept the traffic on one interface and forward to a new local interface in order to apply the qdisc plug.

This is cool, but the above is still a pain in the neck.

Totally agree, I mention this as the largest drawback of this approach in the post.

Don't Unix socket commands work without dropping connections?


Yes, but you can only apply small updates. You can't reload the whole config.

Why not simply let DNS handle the routing and load balancing?

Bring up another HAProxy with the new configuration. Then swap the new and old IP addresses in DNS. Wait for DNS to propagate. When traffic on old HAProxy is zero, bring it down.

DNS is a fragile system to use within a DC to do service/endpoint discovery. That's because DNS tends to be a single point of failure.

Though the DNS system can inherently be resilient, within a DC most people only operate a single DNS server because a hierarchical domain scheme and DNS setup within a DC is too cumbersome and is much less reactive to endpoint changes. E.g., changing a service endpoint in an emergency takes way too long.

A single DNS server still means propagation delays (local caches) and a single point of failure at critical moments (when there is a thundering herd).

That's a pretty weak argument. It's like calling databases a single point of failure because the people who set them up tend to be too lazy to set up HA.

DNS is surprisingly tricky as a service discovery tool. A lot of clients are poorly behaved, and will cache values for too long (or forever). It'd also be another dependency critical for site functionality.

Heh, the number of times DNS has bitten me in production... My main objections to DNS are:

1) You have no control over clients, there is absolutely no rule that says clients have to respect TTLs (I'm looking at you Java)

2) We are talking about hundreds of HAProxy load balancers, thousands of clients and hundreds of backend services which are moving around all the time. I just honestly didn't want to deal with DNS propagation limiting my flexibility.

3) At least at Yelp, we don't really have particularly nice control apis for DNS. This is sort of specific to Yelp, but it was a factor.

Don't muck with the DNS record. Use an "A record" associated with a secondary/floating IP; then simply move the IP associated with the A record to the new machine and issue a gratuitous ARP.

You can avoid setting arbitrarily low--but not low enough--TTL, etc.

wow, ton of respect for solving that and proving it out so clearly.

but also, soooo glad to have 'real' loadbalancers and not have had to do this.

When you run `/etc/init.d/nginx reload` I was under the impression that this is zero-downtime. Is that not true then?

If you truly meant Nginx and not HAProxy, Nginx does its best to avoid dropping connections on a reload:


> In order for nginx to re-read the configuration file, a HUP signal should be sent to the master process. The master process first checks the syntax validity, then tries to apply new configuration, that is, to open log files and new listen sockets. If this fails, it rolls back changes and continues to work with old configuration. If this succeeds, it starts new worker processes, and sends messages to old worker processes requesting them to shut down gracefully. Old worker processes close listen sockets and continue to service old clients. After all clients are serviced, old worker processes are shut down.

It brings up the new worker processes to handle new connections and waits for the existing connections to terminate. Key phrase is "does its best": I've seen edge cases that aren't really Nginx's fault that cause connections to be dropped [e.g. OOM issues when it tries to bring up new workers].

Correct - article explains what HAProxy reload does behind the scenes and how it can drop connections for a very short period (many if you have many connections like yelp does!)

I'm not familiar with HAProxy, but with other load balancers. I've been looking into it as a replacement for the hardware load balancers we use.

Why do you have to reload haproxy? When you update the configuration?

I reload nginx all the time (nginx -s reload) and I'm not sure if that is a true zero-downtime reload either.

Interesting hack nonetheless (stopping SYNs.)

You should read that, it's very interesting:


It is. Nginx launches a new worker while the worker with the old config is still running; the master directs new traffic to the new worker and keeps the old worker until all previous requests have been handled. The master shuts the old worker down once it has finished or when it times out. It works even for updating nginx binaries.

It is a very useful approach and I use it all time as well.

I implemented the same in node by using the cluster api [1].

[1]: http://joseoncode.com/2015/01/18/reloading-node-with-no-down...

Right the configuration is loaded into memory, and you need to reload it to get any changes in.

I believe its API lets you enable/disable existing configured servers but not dynamically add or remove them.

Reloading HAProxy is part of the SmartStack architecture, IIRC. SmartStack/synapse does polling on say, ZooKeeper or Docker or whatever, generates a new HAProxy configuration and then reloads it.

That's an important detail, as it explains why the tests aren't entirely flawed: Testing 10 reloads every second really distorts the numbers, assuming a reload every hour, or every few hours is more realistic. But if reloads are part of what amounts to doing live testing on a large part of live traffic -- then the exercise makes more sense.

And to be sure; being able to do ten reloads every second with few ill effects enables different, more nimble systems engineering.

But if we assume 2000 requests per second, per box - fighting ~100 reset connections a day (assuming two HAProxy reconfigs) doesn't really seem worth the effort - packet loss and other outages would probably(?) dominate anyway.
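To put those numbers side by side, here is the arithmetic as a few lines of Python. The inputs are the commenter's own assumptions (2000 requests per second per box, roughly 100 resets a day), not measurements.

```python
requests_per_second = 2000   # assumed per-box load from the comment
resets_per_day = 100         # assumed resets from two reloads a day

requests_per_day = requests_per_second * 24 * 60 * 60
reset_fraction = resets_per_day / requests_per_day

print(requests_per_day)            # 172800000
print(f"{reset_fraction:.2e}")     # 5.79e-07
```

Under those assumptions, fewer than one request in a million would be affected, which is the point being made here.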

> Testing 10 reloads every second really distorts the numbers, assuming a reload every hour, or every few hours is more realistic.

It depends on what you do. I've seen shops (successfully and, IMO, correctly) scaling AWS instances for services with a threshold of every fifteen minutes, and I've seen Mesos clusters dynamically spinning up web instances much more nimbly than that (think every two minutes under spiky load--the instances would come up in five seconds, so it didn't hurt to down them).

Well, once every 120 seconds is still quite a leap from 10 every second...

Sure, but if it doesn't work there, I don't trust it to work if, say, a piece of my scheduler goes nuts and suddenly is upping and downing containers every few seconds. The problem remains, it's just not as acute and still must be fixed.

I'm afraid I'm also of the feeling that this is too complex, especially moving forward with the business. It was probably fun to work on it and solved your issue, but somehow it looks like a patch for an architecture issue.

Does this add extra complexity to platform (software) updates?

Does this issue exist in servers like Unicorn? A new instance forks from the old one, inheriting sockets, and starts handling requests.


I suspect inherited sockets across forks don't exhibit this, if a test done by Nicholas Grilly described on the haproxy mailing list [0] is to be believed (no code though, but shouldn't be hard to reproduce).

One thing that makes doing this in haproxy difficult is that there is not really any shared state between the parent and child processes, so the child doesn't have a good way to know which file descriptor maps to which listening endpoint in the configuration, since the child pretty much throws its entire state away and reads the config file anew. It's not that there can't be any shared state, but that's not how it's been architected. Finding out the endpoint via something like getsockname(2) might be doable, but the mapping of listening endpoints to listen configuration blocks isn't one-to-one, so it's actually "safer" (from an amount of code standpoint) to use SO_REUSEPORT and let the OS handle the shared listening.

[0] http://permalink.gmane.org/gmane.comp.web.haproxy/14143
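For readers who haven't used SO_REUSEPORT, this minimal Python sketch shows the behaviour the comment relies on: two independent sockets listening on the same address and port at once, with no shared state between them. It assumes a platform that defines `socket.SO_REUSEPORT` (for example Linux 3.9 or later), and the function names are made up for the example; it is not how HAProxy itself is implemented.

```python
import socket


def reuseport_listener(host: str, port: int) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEPORT set before bind."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


def demo() -> bool:
    """Bind an "old" and a "new" listener to the same port at the same time."""
    old = reuseport_listener("127.0.0.1", 0)      # let the OS pick a port
    port = old.getsockname()[1]
    new = reuseport_listener("127.0.0.1", port)   # second bind succeeds
    same_port = new.getsockname()[1] == port
    old.close()
    new.close()
    return same_port


if __name__ == "__main__":
    print(demo())
```

Without the option, the second `bind` would fail with "address already in use"; with it, the kernel distributes incoming connections between the two listeners.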

My gut feeling is that this approach can be prone to silent misbehavior that they cannot detect.

There's no reason not to patch HA to reload its config and reapply it.

We're using this for internal load balancing, so we control both ends of the connection. If this started misbehaving, we'd see timeouts, dropped connections, or errors, which would show up in logs.

I see. Any reason not to patch HA to reload config?

Patching HAProxy to reload config is really hard. There have been ideas, patches, and discussions on the HAProxy mailing list for a few years now trying to get zero downtime reloads natively supported in HAProxy, but the reality is that it just is not as easy as it might seem.

For more details check out the mailing list: http://marc.info/?l=haproxy

Not sure; I'm not the author of the post (a colleague), but I'd guess it's a fairly complicated patch. I know the author has made some changes to the HAProxy codebase[1], so I'm sure he considered it.

[1] http://comments.gmane.org/gmane.comp.web.haproxy/21025
