It already supports a bunch of commands to modify the config while it is running, via the Unix socket, but there's no way to pipe an entire config file to that socket.
It wouldn't work for version upgrades, of course, but that's not really the main use case.
Another option could be to start up a parallel HAProxy that listens on a different port, then use iptables to route traffic to that port instead of the old one. Not sure if this would break connections, though, or if there is a way to prevent breaking connections.
That's how we do it and the script isn't even complicated.
Use iptables to shift the traffic to the new instance, wait for the old instance to drain (netstat is your friend), and kill it.
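For the "wait for the old instance to drain" step, here is a minimal sketch in Python, assuming `netstat -tn`-style output is passed in as text. The function names are made up for illustration; this is not the script mentioned above.

```python
def count_established(netstat_output: str, port: int) -> int:
    """Count ESTABLISHED TCP connections whose local port is `port`."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skips the header row and anything that is not TCP
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port == str(port) and fields[5] == "ESTABLISHED":
            count += 1
    return count


def is_drained(netstat_output: str, port: int) -> bool:
    """True once no established connection is left on the old port."""
    return count_established(netstat_output, port) == 0
```

A deploy script could poll `is_drained` for the old instance's port and only kill the process once it returns True.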
Much respect to the effort that went into the Yelp solution, but it does feel a bit brittle and over-engineered for such a simple problem.
Also then you'd have to maintain the logic to do the port mapping, iptables switching, draining, etc ...
The post mentions they considered this option and they claim that they were worried about engineering risk; seems reasonable to be worried about that.
Well, perhaps I should have provided more detail. Our ansible deployment script actually counts the port up until it finds a free one (we keep a port-range reserved for this purpose). So if there are multiple changes in rapid succession then more than two haproxy instances may dangle around for a while.
The "port discovery" is a shell-task that registers the port to be used as a variable, which is then used by the templates of the haproxy and iptables roles.
The cleanup is done by a 15min-cronjob which kills all haproxy instances that have no connections in netstat and don't match the haproxy pidfile.
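A rough Python sketch of the two pieces described above: counting up through a reserved port range until a bind succeeds, and picking which instances the cleanup job may kill. The function names, the loopback default, and the shape of the `conn_counts` mapping are assumptions for illustration; the setup described here uses an Ansible shell task and a cron job instead.

```python
import socket


def find_free_port(start: int, end: int, host: str = "127.0.0.1") -> int:
    """Return the first port in [start, end] that can be bound on `host`."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken, count up to the next one
            return port
    raise RuntimeError(f"no free port in {start}-{end}")


def stale_instances(conn_counts: dict[int, int], current_pid: int) -> list[int]:
    """PIDs with zero connections that are not the instance in the pidfile.

    `conn_counts` maps each haproxy PID to its connection count
    (assumed to be gathered from netstat by the caller).
    """
    return sorted(
        pid
        for pid, conns in conn_counts.items()
        if conns == 0 and pid != current_pid
    )
```

Note that the probe only shows the port was free at the moment of the check; something else could still grab it before the new instance starts.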
We also considered using different loopback IPs per HAProxy, so ports could stay consistent, but decided against it:
- We have other things (like scribe) listening on the same loopback IP address, so we'd either have to move those to different IPs or exclude them from iptables rules.
- We thought it could be confusing/misleading to see HAProxy listening on one IP/port, but connections being made to a different IP/port.
The post does mention that they did consider something similar to what you are suggesting with the multiple HAProxy instances but decided against it due to engineering uncertainties. Could be that they just overestimated how hard it would be.
I also hoped that it would be less complicated, but after prototyping a few different options, many of which I mention in the blog post (e.g. a patch to do fd passing, huptime, or running multiple parallel HAProxies), I decided that this was the lowest risk given the engineering constraints that we had. Also, I wasn't entirely sure of an implementation plan for how to cleanly do port mappings for the hundreds of services. For example, our service provisioning system adds, changes, moves, or removes services in various Yelp environments dynamically, so the port mapping logic would have to stay in sync with that. It is certainly possible, but I stand by my concern.
To put this in perspective, I got this working in production at Yelp in about two or three days, and it has not had a single issue since. With this solution we apply some TC rules and some iptables rules once, and then the act of restarting is just a set of TC commands. It solves all of our use cases and does so without significant infrastructure that someone else has to maintain two years from now.
If a month from now HAProxy releases native zero downtime restarts, I don't feel bad about throwing all of this away.
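To make the idea concrete, here is a toy Python model of what a plug does while HAProxy restarts: packets that arrive while plugged are held, then delivered in order on release. It only illustrates the buffering behaviour; `Plug`, `deliver`, and the packet strings are hypothetical, and none of this touches the real TC or iptables setup.

```python
from collections import deque


class Plug:
    """Toy model of a plug: hold packets while plugged, flush on release."""

    def __init__(self, deliver):
        self._deliver = deliver  # callback that hands a packet to the listener
        self._buffer = deque()
        self._plugged = False

    def plug(self) -> None:
        self._plugged = True

    def enqueue(self, packet) -> None:
        if self._plugged:
            self._buffer.append(packet)  # held back, not dropped
        else:
            self._deliver(packet)

    def release(self) -> None:
        self._plugged = False
        while self._buffer:
            self._deliver(self._buffer.popleft())  # original arrival order


delivered = []
plug = Plug(delivered.append)
plug.enqueue("syn-1")  # delivered immediately
plug.plug()            # restart begins
plug.enqueue("syn-2")  # buffered
plug.enqueue("syn-3")  # buffered
plug.release()         # restart done, buffered packets flow again
```

From the client's point of view the connection attempt is just a little slower; nothing is refused while the listener is being swapped.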
But that builds on conntrack. Couldn't you get into trouble if this HAProxy is public-facing and handles a gazillion requests?
No idea how that works in practice, but it seems like a sound concept? :)
If you're going down that route, it's probably much easier to write the code required to simply re-read the config and apply a diff to the internal data structures.
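A minimal sketch of that diff-and-apply idea in Python, assuming the config has already been parsed into a dict that maps a service name to its list of backends. That dict layout and both function names are assumptions for illustration, not HAProxy internals.

```python
def diff_config(old: dict, new: dict) -> dict:
    """Describe what changed between two parsed configs."""
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": sorted(old.keys() - new.keys()),
        "changed": {
            k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]
        },
    }


def apply_diff(state: dict, diff: dict) -> None:
    """Mutate the running state in place; untouched entries stay as they are."""
    for name in diff["removed"]:
        del state[name]
    state.update(diff["added"])
    state.update(diff["changed"])


running = {"web": ["10.0.0.1:80"], "api": ["10.0.0.2:80"]}
wanted = {"web": ["10.0.0.1:80", "10.0.0.3:80"], "db": ["10.0.0.4:5432"]}
changes = diff_config(running, wanted)
apply_diff(running, changes)
```

The hard part in a real proxy is everything this sketch leaves out: listeners, health checks, and in-flight connections attached to the entries being changed.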
You mentioned that your colleague had produced a patch that did file handle passing? Is that patch available? My first thought on reading that was to wonder how many other people have tried that very task. I'm wondering if it would be helpful to HAProxy if your patch were made available?
Thank you for sharing the plug qdisc, this is hugely useful in any number of situations. Did you find any others by chance?
I can see why some would say this solution is brittle, but actually I quite like it. It feels elegant and The Right Way to do it.
The only other qdisc I have much experience with is sch_netem, which emulates behaviors of a WAN (delay, loss, etc). I used it in this post to conduct adversarial testing of MySQL replication (search 'tc qdisc').
As for the patch, I can certainly ask John but I do know that the proof of concept was written during a hackathon so I imagine there could either be fundamental flaws or just need a lot of work to get ready to merge. From reading the HAProxy mailing list I think they have been working on this for a while, but the issue seems to be merge risk (it's a fairly large architecture change). I'll mention it to John though.
I'm glad you think it's elegant :-) I tried hard to find a solution that was minimally invasive (no code changes, no significant infrastructure, etc ...).
IPVS also lets you gradually shift traffic from one instance to the other. So if you see elevated error reports from the new server, you can quickly drop it and go back to the old setup.
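A small Python sketch of a gradual shift with a quick rollback. It only computes the weight pairs; `error_rate_ok` is a hypothetical health callback, and the weights would still have to be applied to IPVS separately.

```python
def shift_traffic(total: int, step: int, error_rate_ok) -> list[tuple[int, int]]:
    """Return the (old_weight, new_weight) pairs that were applied, in order.

    `error_rate_ok(new_weight)` is a hypothetical callback that reports
    whether the new instance still looks healthy at that weight.
    """
    history = [(total, 0)]  # start with all traffic on the old instance
    new = 0
    while new < total:
        new = min(new + step, total)
        history.append((total - new, new))
        if not error_rate_ok(new):
            history.append((total, 0))  # drop the new instance, back to the old setup
            break
    return history
```

With `total=100` and `step=25`, a healthy rollout passes through 25, 50, 75, and 100 percent on the new instance; an unhealthy one ends with everything back on the old instance.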
From your description though I imagine it might have the same drawbacks that exist with the multiple HAProxy + iptables swap solution. With a solution where we run multiple HAProxies and move traffic between them I worry about port exhaustion, long running connections, and the maintenance costs of the infrastructure. It could just be the case that we're talking about different problems. At Yelp we have hundreds of services that are added, changed, moved, and removed from physical machines causing our internal HAProxies (which listen on hundreds of ports and many have long running connections) to reload pretty constantly.
As with all engineering decisions, I could very easily be wrong and be overestimating the complexity involved with the multiple HAProxy instance solution. At the end of the day I made a call based on the data available, and I decided the solution I talked about in the blog post was lowest risk.
Specifically you will need to accept the traffic on one interface and forward to a new local interface in order to apply the qdisc plug.
This is cool, but the above is still a pain in the neck.
Bring up another HAProxy with the new configuration. Then swap the new and old IP addresses in DNS. Wait for DNS to propagate. When traffic on old HAProxy is zero, bring it down.
Though the DNS system can inherently be resilient, within a DC most people only operate a single DNS server because a hierarchical domain scheme and DNS setup within DC is too cumbersome and is much less reactive to end point changes.
E.g., changing a service endpoint in an emergency takes way too long.
Even with a single DNS server there are still propagation delays (local caches), plus a single point of failure at critical moments (when there is a thundering herd).
1) You have no control over clients; there is absolutely no rule that says clients have to respect TTLs (I'm looking at you, Java).
2) We are talking about hundreds of HAProxy load balancers, thousands of clients and hundreds of backend services which are moving around all the time. I just honestly didn't want to deal with DNS propagation limiting my flexibility.
3) At least at Yelp, we don't really have particularly nice control apis for DNS. This is sort of specific to Yelp, but it was a factor.
You can avoid setting arbitrarily low--but not low enough--TTL, etc.
but also, soooo glad to have 'real' loadbalancers and not have had to do this.
> In order for nginx to re-read the configuration file, a HUP signal should be sent to the master process. The master process first checks the syntax validity, then tries to apply new configuration, that is, to open log files and new listen sockets. If this fails, it rolls back changes and continues to work with old configuration. If this succeeds, it starts new worker processes, and sends messages to old worker processes requesting them to shut down gracefully. Old worker processes close listen sockets and continue to service old clients. After all clients are serviced, old worker processes are shut down.
It brings up the new worker processes to handle new connections and waits for the existing connections to terminate. Key phrase is "does its best"; I've seen edge cases that aren't really Nginx's fault that cause connections to be dropped (e.g. OOM issues when it tries to bring up new workers).
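The sequence in the quoted documentation (try the new configuration, keep the old one if that fails) can be sketched in Python roughly like this. `Master` and `open_resources` are hypothetical stand-ins for illustration; this is not how nginx itself is implemented.

```python
class Master:
    """Toy master process: swap in a new config only if it can be opened."""

    def __init__(self, config, open_resources):
        # `open_resources` is a hypothetical callable that validates the config
        # and opens log files / listen sockets, raising ValueError or OSError
        # on failure.
        self._open = open_resources
        self.config = config
        self.resources = open_resources(config)
        self.generation = 1  # bumped each time new workers would be started

    def reload(self, new_config) -> bool:
        try:
            new_resources = self._open(new_config)
        except (ValueError, OSError):
            return False  # keep running with the old configuration
        # Success: new workers would start here, and the old ones would be
        # asked to finish their existing clients and exit.
        self.config, self.resources = new_config, new_resources
        self.generation += 1
        return True
```

In a real daemon `reload` would be triggered from a SIGHUP handler; the sketch leaves the signal wiring and the worker processes out.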
Why do you have to reload haproxy? When you update the configuration?
I reload nginx all the time (nginx -s reload) and I'm not sure if that is a true zero-downtime reload either.
Interesting hack nonetheless (stopping SYNs.)
It is a very useful approach and I use it all the time as well.
I implemented the same in node by using the cluster API.
I believe its API lets you enable/disable existing configured servers but not dynamically add or remove them.
And to be sure; being able to do ten reloads every second with few ill effects enables different, more nimble systems engineering.
But if we assume 2000 requests per second per box, fighting ~100 reset connections a day (assuming two HAProxy reconfigs) doesn't really seem worth the effort; packet loss and other outages would probably(?) dominate anyway.
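Spelling out that arithmetic in Python, using only the numbers assumed in the comment (2000 requests per second and roughly 100 resets a day):

```python
REQUESTS_PER_SECOND = 2000  # per box, as assumed above
RESETS_PER_DAY = 100        # rough figure for two reconfigs a day
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = REQUESTS_PER_SECOND * SECONDS_PER_DAY
reset_fraction = RESETS_PER_DAY / requests_per_day

print(f"{requests_per_day:,} requests/day")          # 172,800,000 requests/day
print(f"{reset_fraction:.2e} of requests are reset")  # 5.79e-07 of requests are reset
```

That is fewer than one request in a million under these assumptions, which is the basis for the "not worth the effort" argument at two reloads a day.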
It depends on what you do. I've seen shops (successfully and, IMO, correctly) scaling AWS instances for services with a threshold of every fifteen minutes, and I've seen Mesos clusters dynamically spinning up web instances much more nimbly than that (think every two minutes under spiky load--the instances would come up in five seconds, so it didn't hurt to down them).
Does this add extra complexity to platform (software) updates?
One thing that makes doing this in haproxy difficult is that there is not really any shared state between the parent and child processes, so the child doesn't have a good way to know which file descriptor maps to which listening endpoint in the configuration, since the child pretty much throws its entire state away and reads the config file anew. It's not that there can't be any shared state, but that's not how it's been architected. Finding out the endpoint via something like getsockname(2) might be doable, but the mapping of listening endpoints to listen configuration blocks isn't one-to-one, so it's actually "safer" (from an amount of code standpoint) to use SO_REUSEPORT and let the OS handle the shared listening.
There's no reason not to patch HAProxy to reload its config and reapply it.
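For reference, a minimal Python sketch of two listeners sharing a port through SO_REUSEPORT, assuming a platform that defines the option (Linux 3.9 or later, for example). `reuseport_listener` is a made-up helper; in the reload case the two sockets would belong to the old and new processes rather than living in one script.

```python
import socket


def reuseport_listener(host: str, port: int) -> socket.socket:
    """Open a listening TCP socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket that wants to share the port must set the option itself.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


old = reuseport_listener("127.0.0.1", 0)  # port 0: let the OS pick one
port = old.getsockname()[1]
new = reuseport_listener("127.0.0.1", port)  # second bind on the same port succeeds
shared = new.getsockname()[1] == port
old.close()
new.close()
```

No file descriptors are passed between processes here; each process simply binds again and the kernel decides which listener receives a given connection.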
For more details check out the mailing list: http://marc.info/?l=haproxy