
Deploying a Django App with No Downtime - cuu508
https://medium.com/@healthchecks/deploying-a-django-app-with-no-downtime-f4e02738ab06
======
Beltiras
I was responsible for a website running on Django that served about 10^6 views/day
(since most traffic came during the daytime, that works out to 10-20 req/sec). I
went through hell arriving at this solution because the system was in disarray when
I took over and was deployed on an expensive cloud solution (so resources were
limited). Finally I just put my foot down and demanded we buy servers instead: we
would have four times more bare metal than we needed and would pay only a fraction
of what we did for the previous "solution". I could provision all the VMs I wanted,
provided I didn't break the host.

I arrived at this procedure: Spin up a dyno, worker and caches, and warm everything
up. Run smoketests to validate that everything from the app server onwards is lit
and firing on all cylinders. Redirect nginx traffic to the new setup. If everything
turns to shit in a second, reverse the traffic change and re-evaluate. If everything
works and nothing is broken after an hour, take the old dynos/workers/caches offline.
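
A minimal sketch of the smoketest-then-switch step, in case it helps picture it. The
health endpoints, nginx include path and reload command below are made up for
illustration, not taken from my actual setup:

    
    
        # Hit a few known-good URLs on the new backend, then repoint nginx at it.
        import subprocess
        import urllib.request
    
        NEW_BACKEND = "http://127.0.0.1:8001"                   # freshly provisioned app server
        SMOKETEST_PATHS = ["/healthz", "/login/", "/api/ping"]  # hypothetical endpoints
    
        def smoketest(base):
            for path in SMOKETEST_PATHS:
                if urllib.request.urlopen(base + path, timeout=5).getcode() != 200:
                    raise RuntimeError("smoketest failed: " + path)
    
        def switch_upstream():
            # Swap a symlinked upstream include and reload nginx;
            # rolling back is the same swap in the other direction.
            subprocess.check_call(["ln", "-sfn", "/etc/nginx/upstreams/new.conf",
                                   "/etc/nginx/upstreams/current.conf"])
            subprocess.check_call(["nginx", "-s", "reload"])
    
        smoketest(NEW_BACKEND)
        switch_upstream()
    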

The trick was that if a migration broke the older version, you did it in two or more
steps to isolate the changed model interactions.

And nobody should be using gunicorn; it's way less performant than uWSGI (citing:
[http://blog.kgriffs.com/2012/12/18/uwsgi-vs-gunicorn-vs-node-benchmarks.html](http://blog.kgriffs.com/2012/12/18/uwsgi-vs-gunicorn-vs-node-benchmarks.html)).

~~~
CoffeeDregs

        And nobody should be using gunicorn, it's way
        less performant than uWSGI
    

uWSGI may be different now, but 4 years ago I built a browser toolbar C&C API
server that peaked at about 2k requests per second. We used uWSGI to front our
Django application with SoftLayer HTTP load balancers in front of uWSGI. uWSGI
was speaking HTTP and translating to WSGI, but did so only in one
thread/process. So while 8-16 Django threads were tootling along, the single
uWSGI HTTP->WSGI thread was pegged. We switched to gunicorn and had
dramatically better performance.

~~~
Veratyr
You can deal with this by having nginx load balance and translate between HTTP and
uWSGI's binary protocol:
[https://uwsgi-docs.readthedocs.org/en/latest/Nginx.html](https://uwsgi-docs.readthedocs.org/en/latest/Nginx.html)

~~~
Beltiras
It's well worth the effort to get uWSGI running under its native protocol.

Here's a must-read if you use it:
[http://uwsgi-docs.readthedocs.org/en/latest/ThingsToKnow.html](http://uwsgi-docs.readthedocs.org/en/latest/ThingsToKnow.html)

------
smilliken
To simplify this process further, you can run both the new and old gunicorn
processes on the same port using the SO_REUSEPORT feature in Linux 3.9. This way
you don't need to listen on a new port and update the nginx config (potentially
once for each upstream web server).

I submitted a patch to gunicorn just this week about this:
[https://github.com/benoitc/gunicorn/issues/598](https://github.com/benoitc/gunicorn/issues/598).

Ideally I'd love to see all server software that listens on a port have a
SO_REUSEPORT option for hot-swapping. This feature makes operations so much
simpler.
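
To make the idea concrete, here is a rough Python illustration of what SO_REUSEPORT
buys you (the port and backlog are arbitrary, and socket.SO_REUSEPORT is only
exposed where the OS supports it):

    
    
        # Both the old and the new server process can bind the same port at
        # the same time on Linux 3.9+; the kernel spreads incoming connections
        # between them. Stop the old process once the new one is healthy and
        # everything flows to the survivor -- no port change, no nginx edit.
        import socket
    
        def make_listener(port=8000):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
            s.bind(("0.0.0.0", port))
            s.listen(128)
            return s
    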

~~~
abrookewood
But if both processes are listening on the same port, how would he check the health
of only the new process? How do you determine which process answers each incoming
connection?

~~~
andy_ppp
Will the old processes not be successfully killed and the new processes just
take over? Healthwise supervisor would be the one checking the gunicorn
processes?

Looking at this article is just seems to make ports operate either with any
new server initialisation just automatically takes over the port (or maybe
they become effectively load balanced between all running processes, it's hard
to tell conclusively)...

[http://freeprogrammersblog.vhex.net/post/linux-39-introdued-...](http://freeprogrammersblog.vhex.net/post/linux-39-introdued-
new-way-of-writing-socket-servers/2)

------
crdoconnor
UWSGI also handles zero-downtime restarts:

[https://uwsgi-docs.readthedocs.org/en/latest/articles/TheArtOfGracefulReloading.html](https://uwsgi-docs.readthedocs.org/en/latest/articles/TheArtOfGracefulReloading.html)

~~~
Beltiras
If you seriously want no downtime you should never touch-reload. ALWAYS spin up new
infrastructure, smoketest, switch the load balancer. You might run into problems
under load that you didn't see in testing, but you can reverse the switch and
minimize downtime.

~~~
hackerboos
This. I load tested my zero downtime unicorn setup and got a few lost requests
during the reload.

That said, if you boot a VM and it's bad, you are going to lose requests then
too.

~~~
Beltiras
First of all, what is a "bad vm"?

Second: smoketests should show whether the engine is up and serving requests when
not under load.

Third: you have the fallback of reversing the flow.

You will _always_ lose requests. All you can do is minimize the damage.

EDIT: I want to clarify what I mean by losing requests, since someone is downvoting.
When administrating a site with that many requests and a moving infrastructure, you
can plan as much as you want and try to employ hedging strategies all you want. You
will STILL run into problems once in a while that drop messages. You don't have to
lose requests when handing load-balancing off to another handler, and you can reload
apps without losing requests if the app server is properly coded.

------
rer0tsaz
There was a good talk about deployments at DjangoCon US 2015.

Django Deployments Done Right by Peter Baumgartner:
[https://www.youtube.com/watch?v=SUczHTa7WmQ&list=PLE7tQUdRKcyaRCK5zIQFW-5XcPZOE-y9t&index=12](https://www.youtube.com/watch?v=SUczHTa7WmQ&list=PLE7tQUdRKcyaRCK5zIQFW-5XcPZOE-y9t&index=12)

~~~
ipmb
That's me! (shameless plug) I'm also running a Kickstarter campaign for a
screencast series that will go into detail on Django deployments
[http://kck.st/1FRZyMx](http://kck.st/1FRZyMx)

------
danialtz
Nice approach! The "reload" was a nice touch to learn.

What I usually find missing are two parts of deploys that are often ignored, and are
critical in production environments:

1. Revert/rollback to older versions. We're human, and despite having all the
processes in place, Murphy's law sometimes applies on moderately to highly complex
server setups.

2. There is git/svn for code tracking, but even more important is database
consistency. Versioned backups and restores should also be part of the whole setup.

I'm currently looking into a full no-downtime setup myself, so I'm all ears.

~~~
irremediable
Any advice about versioned database backups?

~~~
smt88
In my opinion, this is one of the major unsolved (or under-solved) issues in
web development. You have to write and test migrations, and then you have to
make sure the migrations are tied to your deployment/rollback process.

If there's a service/tool that automates a lot of this or makes it safer, I'd
be really happy to learn about it!

~~~
UK-AL
When I used Django ages ago, there was Django South. I think it's built in now.

~~~
metachris
Yes, it's built in now:
[https://docs.djangoproject.com/en/1.8/topics/migrations/](https://docs.djangoproject.com/en/1.8/topics/migrations/)

Of course a database rollback may lose information from any newly created fields.

~~~
jeffasinger
So if you're aiming for zero downtime rollbacks with Django, here's how adding
a new field to a model might work:

    
    
      1) Add a migration that adds the new field, but allows null. Deploy.
      2) Add the field to the model. Also make sure it sets its default on write. Deploy.
      3) Execute a background task to set the field to whatever its default value should be.
      4) Add migrations to enforce integrity and add indexes. Deploy.
      5) Actually deploy code that needs the new field.
    

Yes, this is super annoying, no, most people don't do this. Separating out #1
and #2 means you can always roll back all the way to right after #1 without
losing any data. An extra, nullable field on a model with no indices on it
shouldn't hurt anything.
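
A rough sketch of what steps 1 and 2 might look like, with a made-up
Invoice.reference field (the app name, migration name and the generate_reference
helper are all hypothetical):

    
    
        # Step 1: migration only, the model is not touched yet -- you can
        # always roll back to this point without losing data.
        from django.db import migrations, models
    
        class Migration(migrations.Migration):
            dependencies = [("billing", "0007_previous")]
            operations = [
                migrations.AddField("invoice", "reference",
                                    models.CharField(max_length=32, null=True)),
            ]
    
        # Step 2, in a separate deploy: expose the field on the model and
        # set its default on write.
        class Invoice(models.Model):
            reference = models.CharField(max_length=32, null=True)
    
            def save(self, *args, **kwargs):
                if self.reference is None:
                    self.reference = generate_reference()  # hypothetical helper
                super(Invoice, self).save(*args, **kwargs)
    

Steps 3 and 4 then backfill the existing rows and tighten the schema once both code
paths write the field.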

~~~
micah_chatt
Totally agree. I've been advocating this for months, but the ease and simplicity of
just restarting the server with new code has won out for now.

------
methodover
It's interesting reading about how my counterparts at other shops do it.

In our own Django web app, we basically just use the load balancer for deploys. Our
service provider (Linode) has a nice API, so when deploying we just do one server at
a time and direct traffic away from the ones being upgraded. It's not complicated
and works just fine... at least when there are just two machines.

~~~
joeyspn
+1 for rolling upgrades... Having 2+ appservers makes upgrades easier and
increases resilience in general. Docker makes this really easy...

~~~
cdnsteve
Could you provide more details on this?

------
sidmitra
I'm curious: unless I'm missing something, couldn't the OP have tried a graceful
reload of the gunicorn process?

>kill -HUP <pid>

EDIT: That probably applies to nginx too.

~~~
vacri
nginx also has a 'reload' command like apache - service nginx reload. No need
to hunt down the pid yourself...

~~~
klibertp

        nginx -s reload
    

also works, you don't need service/systemctl at all for this.

------
IgorPartola
As I understand it, mod_wsgi with Apache will do a code reload when you either touch
your WSGI file or do an apache2 reload (that is, send it a SIGHUP). The big downside
there is having my Python code be re-loaded and re-parsed as the new processes get
spun up. That usually results in the first few requests being really slow. But it is
effectively a zero-downtime system. In general, if the components of your stack
support being reloaded via SIGHUP instead of having to be restarted, use that. If
they do not, consider using different components.

However, I'd say that this gets a whole lot easier once you add even a "local" load
balancer: that is, a load balancer that lives on the same box but does not require a
restart during a deploy. In this case you can start up your
gunicorn/apache2/uwsgi/whatever on a new internal port, then remap which port is
"live" using your firewall or by updating your load balancer's config and reloading
it. This is how dokku works with Docker containers, and I love that I get
zero-downtime deploys with it out of the box.

------
ipmb
Instead of rewriting your supervisord config every time, you should be able to use a
symlink that points to the newest virtualenv. I haven't tested gunicorn, but
reloading uWSGI will resolve the symlink to your new virtualenv. Rolling back is
simply adjusting the symlink.

Also make sure you're on pip 7+ to take advantage of the automatic wheel cache
to speed up building new virtualenvs.
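
If it helps, a minimal Fabric-style sketch of the flip (the paths, the version
argument and the touch-reloaded vassal ini are invented for illustration):

    
    
        # Build a fresh virtualenv for the release, repoint the 'current'
        # symlink, then reload uWSGI so workers pick up the new symlink target.
        from fabric.api import run, sudo
    
        def deploy(version):
            venv = "/srv/app/venvs/%s" % version
            run("virtualenv %s" % venv)
            run("%s/bin/pip install -r /srv/app/releases/%s/requirements.txt"
                % (venv, version))
            # ln -sfn swaps the symlink atomically; rolling back is the same
            # command pointing at the previous virtualenv.
            run("ln -sfn %s /srv/app/venvs/current" % venv)
            sudo("touch /etc/uwsgi/vassals/app.ini")  # uWSGI touch-reload
    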

As another commenter noted, I discussed a similar approach at DjangoCon US
this year
[https://www.youtube.com/watch?v=SUczHTa7WmQ&t=1083](https://www.youtube.com/watch?v=SUczHTa7WmQ&t=1083)

------
Pephers
For gunicorn running Flask I use something along the lines of the following for
zero-downtime deploys, which works really well as long as your app runs on a single
server:

    
    
        rsync -avz ./ user@remote:/home/user/web-app
        ssh user@remote 'cd /home/user/web-app; \
        venv/bin/pip install -r requirements.txt; \
        venv/bin/alembic upgrade head; \
        supervisorctl pid web-app | xargs kill -HUP'
    

The full deployment script has a few extra options which I've omitted for
clarity, but this is basically it.

~~~
Galanwe
I don't really think you should promote this.

The _real_ way to perform a clean update is using a load balancer and taking
instances out of rotation for upgrade, one at a time.

Also, please, people, stop deploying straight from GitHub. Package management &
versioning IS a thing.

> rsync -avz ./ user@remote:/home/user/web-app

This is not an atomic update, meaning you could end up with code from the old
version loading modules from the new version. Upload the new files to a new
directory instead.

> venv/bin/pip install -r requirements.txt

You do not delete the old requirements, so this incrementally bloats the virtualenv.
Just create a new one; virtualenvs are designed to be created and removed quickly.

> venv/bin/alembic upgrade head

Huh, that's nice if you only have one instance to upgrade at a time. Otherwise it
will just blow up.

------
xmatos
I use git hooks for my Django deployments and it couldn't be simpler. My
post-receive hook looks like this:

    
    
        #!/bin/sh
        # Check out the pushed code into the web root, collect static files,
        # run migrations, then touch wsgi.py so the app server reloads the code.
        dest=/var/www/myproj
        GIT_WORK_TREE=$dest git checkout -f
        $dest/manage.py collectstatic --noinput
        $dest/manage.py migrate
        touch $dest/myproj/wsgi.py
    

Running a git push to the prod server fires this hook, and that's all. I could
improve it by first checking out to a test environment to run the Django tests and,
if everything passes, doing the checkout to production.

Plus, am I the only one using Apache?

~~~
mangeletti
Back before WSGI really took off, mod_python made Apache fairly prolific, but
most Python apps these days are using Gunicorn (or some other Python-based
WSGI server), or Nginx in front of uWSGI, so yes, to some degree, you are the
only one using Apache.

As a side note: I've run copious tests against Nginx -> uWSGI setups and Gunicorn,
etc. setups, and Nginx -> uWSGI is in a whole different league (way faster, way more
requests per second, way less memory, and you CAN, despite what I'm reading herein,
very EASILY deploy with uWSGI without losing a single request). However, I still use
Gunicorn, because it's quick and easy, it works well, and I don't have to sysadmin
the thing.

------
dbravender
[https://github.com/dbravender/gitric](https://github.com/dbravender/gitric)
(which I wrote) is a much more generic Fabric module that will work for more cases.
It looks like this particular solution only works because the database query was
copied to the web server, which might not work for everyone. Check out the sample
blue/green deployment fabfile here:
[https://github.com/dbravender/gitric/blob/master/bluegreen-example/fabfile.py](https://github.com/dbravender/gitric/blob/master/bluegreen-example/fabfile.py).
We used the same technique at my last job on a Django site and deployed several
times a day for two and a half years with zero downtime.

------
glogla
That SQL statement that does update and insert is pretty cool. It chains both parts
together into a single statement so it doesn't even need to start and commit a
transaction, right?

It makes me wish for some kind of "advanced Postgres" guide.

~~~
cuu508
I think PostgreSQL still executes it as a transaction, even though it is a single
statement.

While benchmarking this, I learned that with the default, safe settings (synchronous
commit), PostgreSQL can only do a few hundred write transactions per second, however
simple they are. Set synchronous_commit=off in postgresql.conf and TPS goes into the
thousands.
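
For anyone who wants to reproduce this, a quick psycopg2 sketch (the DSN and table
are made up; synchronous_commit can also be toggled per session rather than in
postgresql.conf):

    
    
        # Compare single-row insert throughput with synchronous_commit on vs. off.
        import time
        import psycopg2
    
        def tps(sync):
            conn = psycopg2.connect("dbname=bench")  # hypothetical database
            conn.autocommit = True                   # one transaction per statement
            cur = conn.cursor()
            cur.execute("SET synchronous_commit TO " + ("on" if sync else "off"))
            start, n = time.time(), 2000
            for i in range(n):
                cur.execute("INSERT INTO ping (n) VALUES (%s)", (i,))
            return n / (time.time() - start)
    
        print("on: %.0f tps, off: %.0f tps" % (tps(True), tps(False)))
    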

~~~
anarazel
With synchronous_commit = on, the throughput heavily depends on your IO subsystem
and the number of concurrent transactions.

As every transaction has to be durable at commit, an fsync (well, fdatasync() if
supported) of a sequential log is required. Most single drives can do roughly 100
(rotational disks, normal speeds) to a few thousand fsyncs/sec. If you have a
battery-backed RAID controller you can do tens to hundreds of thousands, since the
data does not actually have to reach the disk before the call returns.

Concurrency matters because several commits can be made durable with a single flush
to disk if they're happening concurrently (sometimes called "group commit").

EDIT: spelling, grammar, minor clarification.

------
PythonicAlpha
Really nice, but when you have bigger DB changes this might run into trouble, when
the old code runs against changed tables or the new code runs against unchanged
tables...

At the very least, such cases should IMHO be considered before doing such upgrades.

------
annnnd
Nice!

I would indeed sleep better if the service monitored my cron jobs. But I would sleep
even better if I knew you were making some money from this, because that gives you
an incentive to keep providing the service.

For instance, you could offer a paid plan for users with more than N requests/day,
or something like that...

Not sure if you want to go there or not (and I am not currently in the market for
such a solution, as I have solved my need with a proprietary one), but the fact that
there is no paid plan lowers my expectations of the service.

EDIT: clarified my point.

~~~
cuu508
Thanks for the comment! I've received similar feedback a few times already,
and will be thinking about aligning incentives. I do want to keep the existing
feature set free though.

------
seanwilson
Have you considered using anything like AWS or Heroku? I know you say in the article
that you want to keep it simple, but services like those can hide this complexity
for you, and their solution is likely to be more robust. Also, if you're only using
a single DigitalOcean droplet at the moment, you will have to consider the changes
you'll need to make when you require more than one droplet or the droplet you're
using goes down. Looks like a fun project; good luck!

~~~
cuu508
I might look into making it easily deployable on Heroku. Would still prefer
plain VMs (like the ones from AWS and DigitalOcean). Figuring out deployment
is part of the fun!

------
mrweasel
I might be missing something, but he's using Fabric and Python 3? Fabric
doesn't seem to support Python3 yet.

Or is he using Python 2 to deploy a Python 3 application?

~~~
cuu508
Correct, I run Fabric with Python 2. The app itself works with both Python 2
and Python 3.

~~~
mrweasel
Aah okay, we're trying to eliminate Python 2 completely, so sadly no Fabric
for us.

~~~
sidmitra
Even though I've moved to Python 3, I keep Fabric installed via apt-get, using the
system Python, which is usually 2.7. I don't have anything in my fabfile that needs
access to my Django app or virtualenv.

------
shubhamjain
I am new to the Python world, but is there any reason one must necessarily use
virtualenv on a production server? If I am not wrong, its purpose is to avoid
conflicting dependencies, but on the deployment server it can be assumed that only
one project will be deployed, so I don't believe there can be many instances of
conflicting dependencies.

~~~
riquito
It's not a strictly Python problem. He's deploying new code with updated
dependencies in a separate directory, while the original site is running.

If you manage globally the dependencies you would probably brake the running
version of the site during deploy, because it may expects different (old) API
from his dependencies. For example the new code may NOT require a previously
used dependency, and thus the deploy would remove it and the running code
would collapse.

Also, if the deploy fail midway you'd like the global environment to be
unaffected.

------
leetrout
OT but I think you can make the location a bit more readable by using the
count quantifier...

^/(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})/?$
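
For example, a quick sanity check in Python that the shortened pattern still matches
a ping URL (the UUID below is made up):

    
    
        import re
    
        UUID_PATH = re.compile(r'^/(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})/?$')
        m = UUID_PATH.match('/0f3de855-4e86-4c36-9b07-268e0b367e13/')
        print(m.group(1))  # 0f3de855-4e86-4c36-9b07-268e0b367e13
    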

------
mvermaat
Cool! We have a pretty similar setup implemented in Ansible for our Mutalyzer
app:

[https://github.com/mutalyzer/ansible-role-mutalyzer](https://github.com/mutalyzer/ansible-role-mutalyzer)

healthchecks.io looks pretty nice, I might start using it!

------
Kiro
I don't understand how Django apps can have downtime. When I deploy I just do
a git pull and all new requests end up in the new code. Either your request
goes in before the pull or after, there's no in between. What am I missing?

~~~
andybak
1. Update your code (a few seconds)
2. Update your requirements (this can take several minutes in some cases)
3. Run migrations
4. Run collectstatic
5. Restart any daemons
6. Restart your app

Even if you can guarantee no processes will restart in this period (maybe you
pause your daemons and you've configured your webserver to not spawn any new
workers for a while) you've got the issue of migrations.

Personally I stop the webserver before this sequence and restart it afterwards
as I can't reliably reason about all the things that can go wrong in between.

~~~
andybak
(meta: Once again markdown and its ilk prove to be the bane of my life... It's like
WordStar never died.)

------
elktea
What about simply spinning up a new VM on a service like AWS, keeping the database,
and switching the load balancer once you've confirmed the app is responding on the
new VM?

~~~
abrookewood
Yeah, that's how I would typically do it, but he isn't on AWS and he's aiming
to keep the cost down. And he's also keeping it simple, so no load balancers
etc.

~~~
mrtron
If you are getting a request per second, I think a load balancer on AWS has
tremendous value.

~~~
ngrilly
1 request per second is nothing for a modern server. Most applications are
able to handle hundreds or thousands of transactions on a single machine. Why
do you think a load balancer would be useful?

------
mherrmann
Very cool. Also cool to see a free alternative to cronitor.io!

------
stefantalpalaru
> To verify this in practice, I wrote a quick script that requests a
> particular URL again and again in an infinite loop.

That's not how you verify it. Use a load benchmark like "ab -c 10 -n 3000 -q
domain.tld/", and if that doesn't report non-200 responses while you deploy the new
code, then you can talk about no downtime.

My solution for this problem (tested and deployed on Django sites) is this:
[https://github.com/stefantalpalaru/uwsgi_reload](https://github.com/stefantalpalaru/uwsgi_reload)

------
GPGPU
I update my Erlang-based ChicagoBoss/Cowboy web services live all the time, without
a single transaction being lost.

Erlang lets you update live code, while it's running.

Still, I'm a big fan of Python and Django, and it's nice to see people
realizing the value of "live" upgrades, even in a round-about way.

