
How we handle deploys and failover without disrupting user experience - notknifescience
http://code.mixpanel.com/2012/09/28/how-we-handle-deploys-and-failover-without-disrupting-user-experience/
======
TimothyFitz
By far the easiest way to get similar behavior is to run your Django app via
Gunicorn (proxied from Nginx). Gunicorn supports hot code reloading via
SIGHUP, and it does so by forking and gracefully killing old processes.
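A minimal sketch of triggering that reload from Python, assuming Gunicorn was started with `--pid` so it writes a pidfile (the `/var/run/gunicorn.pid` path here is a hypothetical example):

```python
import os
import signal


def read_master_pid(pidfile):
    """Read the Gunicorn master's PID from its pidfile."""
    with open(pidfile) as f:
        return int(f.read().strip())


def reload_gunicorn(pidfile="/var/run/gunicorn.pid"):
    """Send SIGHUP to the master: it forks fresh workers with the new
    code, then gracefully shuts down the old ones."""
    os.kill(read_master_pid(pidfile), signal.SIGHUP)
```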

If your requirements don't match Gunicorn (not Django, not Python, etc.), then
you can use <https://github.com/TimothyFitz/zdd>, a project I wrote to automate
rewriting nginx config files to deal with changing proxied portfiles. To
integrate any existing server, all you have to do is make it bind to port 0
(let the OS choose a port) and then write a foo.port file that contains the
port number (like a pidfile). That's it.
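The integration described above, sketched in Python: bind to port 0 so the OS picks a free port, then record that port in a `foo.port` file for the config-rewriting tool to read.

```python
import os
import socket

# Bind to port 0: the OS assigns a free ephemeral port.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
sock.listen(5)

# Discover which port we got and record it, pidfile-style, so a tool
# like zdd can rewrite the nginx upstream to point at this process.
port = sock.getsockname()[1]
with open("foo.port", "w") as f:
    f.write(str(port))
```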

~~~
mattyb
I mentioned this in a thread the other day; I've yet to see a good use case
for hot code reloading. Can you really not drain requests to that host via
HAProxy (or similar), and then actually restart the service? The nice thing
about that approach is that your choice of service runtime doesn't matter.

~~~
TimothyFitz
Your method will stall responses for server shutdown + server startup time,
which for Ruby/Python apps is usually measured in tens of seconds, and for
other web servers can be much worse. Hot code reloading lets you avoid any
downtime at all, and since it's usually built into the framework- or
language-specific server, you get the functionality "for free".

Zdd (the project I linked to) is all about spawning a new process in parallel.
All the advantages of your approach (switch to an entirely different language?
Who cares) but without the stalls.

Zdd also lets you keep the old process alive through the duration of the
deploy (and after), and with a little work could let you switch back in the
event of a bad deploy without having to start the old version up again.

~~~
jonhohle
The method he's describing requires no downtime or stalled requests.
Connections are drained from some pool in a load-balanced set of servers. The
service is restarted with the new code. The hosts are given traffic again
once they are initialized and healthy.

The advantages to this method include being completely platform-agnostic, as
well as giving you a window to verify a successful update without worrying
about production traffic.

~~~
TimothyFitz
Thanks for the clarification. The downsides to that approach are that you need
multiple machines, and the duration of your deploys is much longer. Not to
mention, you'd have to script a deploy process across multiple machines (which
is not easy, in the way that "SIGHUP Gunicorn" is easy).

Personally I've found the "put new instances into a load balancer" method to
make more sense for system changes (packages, kernels, OS versions) where
deploying the change is inherently slow or expensive, but the method doesn't
make sense for code deploys where deploy time is important.

------
ivix
I thought the commonly accepted way to do this was to

* Have more than one application server

* Deploy new code to a non-active application server

* Send some traffic to the new app server (perhaps based on cookie)

* When confident, switch traffic (using Varnish/other load balancer) to the new application server
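The cookie-based step above boils down to a routing decision like this sketch, where the `canary` cookie name and the backend addresses are hypothetical:

```python
# Hypothetical addresses for the active and newly deployed servers.
OLD = "app-old.internal:8000"
NEW = "app-new.internal:8000"


def pick_backend(cookies):
    """Route requests carrying the canary cookie to the freshly deployed
    server; everyone else stays on the active one until the full switch."""
    return NEW if cookies.get("canary") == "1" else OLD
```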

------
SpikeGronim
I'll share how Wavii does this. We use Amazon ELB and Chef. The first Chef
recipe that runs on our frontend is "touch /tmp/website_should_drain". The
website then returns 404 from the /status health check URL. We wait 30s,
update the site, and then rm the draining sentinel file. If the deployment
fails the host stays out of the ELB. If more than 1/3 of hosts fail we abort
the whole deployment. This works well for us and was very simple to implement.
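The sentinel check described above is small enough to sketch; the `/status` handler just keys its response off the file's existence:

```python
import os

# Sentinel path from the comment above; the Chef recipe touches it to
# start a drain and removes it after a successful update.
DRAIN_FILE = "/tmp/website_should_drain"


def status_code(drain_file=DRAIN_FILE):
    """What the /status health check returns: 404 while the drain
    sentinel exists (so the load balancer pulls the host), 200 otherwise."""
    return 404 if os.path.exists(drain_file) else 200
```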

~~~
zwily
We've found that when the ELB health check returns non-ok, the ELB will
immediately kill any requests in flight to that box. So it's kind of
impossible to do a real "drain" with ELB. Are you seeing different behavior?

~~~
mattyb
We've seen the same behavior, so we do the draining via HAProxy, which does
the right thing.

------
mpd
Mixpanel had some downtime 2 days ago, from 22:59 until 23:23 UTC according to
our error logs. It might have only been for API users, but downtime
nonetheless.

It's certainly possible it was scheduled, but it's not clear where that
downtime is announced if so.

------
grosskur
Another method for doing this with haproxy+iptables:
http://www.igvita.com/2008/12/02/zero-downtime-restarts-with-haproxy/

------
xxpor
I hope this is all automated :)

This seems very tightly coupled to the front end webapp? Do you guys have a
completely different system for backend services? Or is there a level of
abstraction that isn't being revealed?

------
drumdance
I like the way they implemented the custom connection timeout for queries.
Great example of using exceptions intelligently.

