

Ask HN: Patterns for deploying webapp updates with no downtime - simonw

I'm interested in techniques that can be used for deploying new versions of web applications with no perceived downtime for end users, without having to disable writes.

I think I know how to do this while disabling writes: run two copies of the database (one replicated from the other), disable writes at the application level, separate the slave and continue to serve reads from the master, upgrade the slave's schema, "activate" the slave (essentially telling it it's now a master), point a new instance of the application running the updated application code at it and switch the HTTP traffic over - then set the original master up as a slave to the new master and enable writes again.

First question: is this sane / best practice?

Second question: if I want to do this without disabling writes for the period of the upgrade, what are my options?

Plenty of sites seem to manage to deploy new features without noticeable periods of downtime or read-only mode, so presumably there are a bunch of patterns for dealing with this. Where can I learn about them?

To clarify: I'm talking about updates that include some kind of modification to the database schema.
======
jeremyw
It sounds like you're making rolling updates across your app server cluster,
version n -> n + 1. You have to separate database updates into innocuous and
harmful, and your developers have to signal that state for deploy.

Changes:

a) Schema changes and row updates that are compatible with 'n'. No downtime,
no worries.

b) Schema changes and row updates that are _in_compatible with 'n'. Normally
this requires downtime, but I've seen architectures that get away with live
rolls by grouping app server updates by shard.

c) Database changes that will have a severe performance impact, e.g.
index/update a massive table or hit some other perf corner of your db.
Downtime or you invest in a key-value or FriendFeed-style architecture.

Most agile updates tend to be (a), luckily.

------
wizard_2
After reading <http://highscalability.com/> for a while I've happily found
myself using some of its tips from time to time. I've only used these
methods a few times; I usually just push large database-changing releases at
night and try not to do anything that takes longer than 20 minutes.

One way is to use two tables and have the application logic read from both and
write to the new one. This can't always work easily, but for tables that don't
get a lot of joins it's not that hard. Deploy the code, migrate the data into
the new table and then drop the old table. I've used this once to update a
user_profile table for a busy forum.
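A minimal sqlite3 sketch of that read-from-both, write-to-new pattern (table and column names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_profile_old (user_id INTEGER PRIMARY KEY, bio TEXT)")
conn.execute("CREATE TABLE user_profile_new (user_id INTEGER PRIMARY KEY, bio TEXT, avatar_url TEXT)")
conn.execute("INSERT INTO user_profile_old VALUES (1, 'old bio')")  # pre-existing row

def write_profile(user_id, bio, avatar_url=None):
    # All new writes go to the new table only.
    conn.execute(
        "INSERT OR REPLACE INTO user_profile_new VALUES (?, ?, ?)",
        (user_id, bio, avatar_url),
    )

def read_profile(user_id):
    # Prefer the new table; fall back to the old one for unmigrated rows.
    row = conn.execute(
        "SELECT bio FROM user_profile_new WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        row = conn.execute(
            "SELECT bio FROM user_profile_old WHERE user_id = ?", (user_id,)
        ).fetchone()
    return row[0] if row else None

write_profile(2, "new bio")
```

Once a background job has copied the remaining rows into the new table, the fallback read and the old table can both be dropped.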

Another way to mitigate downtime on table changes is to have lots of tables. I
believe one of the larger Chinese social networks was reviewed on the HS blog
and boasted that they found it easier not to have more than two columns on a
table (pk and value). That's a little crazy imho, but I have seen it working
with smaller column groups.

You use a lot of 1 to 1 relations and each column or logical group of columns
gets its own table with a foreign key to the main object. This way you can
modify a column without restricting access to most of the object at the cost
of more joins. I worked on a django project where we had a users table and any
user information (there was a lot) was a different table. The data models were
related to the user model and handled all the lookups. (User.profile,
User.contact, User.reporting_prefrences, User.support_requests, etc.)
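A sketch of that vertical split in sqlite3 (table names are made up); the point is that altering one satellite table doesn't touch the core table or its siblings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Core users table stays small and stable.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
# Each logical group of columns lives in its own 1-to-1 table keyed on the user.
conn.execute("CREATE TABLE user_profile (user_id INTEGER PRIMARY KEY REFERENCES users(id), bio TEXT)")
conn.execute("CREATE TABLE user_contact (user_id INTEGER PRIMARY KEY REFERENCES users(id), email TEXT)")

conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO user_profile VALUES (1, 'hi')")
conn.execute("INSERT INTO user_contact VALUES (1, 'alice@example.com')")

# A schema change on user_contact touches neither users nor user_profile,
# so reads against the rest of the object carry on during the migration.
conn.execute("ALTER TABLE user_contact ADD COLUMN phone TEXT")

row = conn.execute(
    "SELECT u.username, p.bio FROM users u JOIN user_profile p ON p.user_id = u.id"
).fetchone()
```

The cost, as noted above, is an extra join per column group you need.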

I've never used mongodb or couch, but with a NoSQL store you can just have the app
logic take care of upgrading records on read. Run a script to upgrade
everything. Drop the app logic.
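An upgrade-on-read sketch with a plain dict standing in for the document store (the `_v` version field and record shapes are invented):

```python
CURRENT_VERSION = 2

def upgrade(doc):
    """Lazily migrate a stored record to the current schema version."""
    doc = dict(doc)
    if doc.get("_v", 1) == 1:
        # v1 -> v2: split the single "name" field into first/last.
        first, _, last = doc.pop("name").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
        doc["_v"] = 2
    return doc

def read(store, key):
    # Upgrade on read, and write the migrated record back so the
    # one-off catch-up script has less to do.
    doc = upgrade(store[key])
    store[key] = doc
    return doc

store = {
    "u1": {"_v": 1, "name": "Ada Lovelace"},
    "u2": {"_v": 2, "first_name": "Alan", "last_name": "Turing"},
}
```

Running `upgrade` over every key is the "script to upgrade everything" step; after that the lazy path is dead code and can be dropped.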

------
stingraycharles
What we're doing is this: upgrades that actually change the data
model go in two phases:

* First, an upgrade that understands both the old model and the new model, internally uses the new model, but still writes in the old model. This means the new version is 100% compatible with the old version. We launch new services, test them, add them to the load balancer, and remove the old services from the load balancer.

* Secondly, a new update is launched: this one is almost the same as the previous version, except that it writes its data in the new model too. The same process with launching new services and adding to the load balancer is repeated.

Using this two-phase upgrade has the major advantage that you're always
running the new services next to an old version that is completely compatible,
data-model-wise, and thus allows you to do an emergency rollback to a previous
version if required. The trick with adding to the load balancer also ensures
that no downtime is experienced for the clients.

All this requires quite a bit of work (especially since you need to deploy
multiple releases), so it depends on how much zero-downtime upgrades are worth
to you.
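A toy sketch of the two phases, using a plain dict as the "database" (the record shapes and field names are invented for illustration):

```python
def read_user(db, uid):
    # Both phases understand old ({"fullname": ...}) and
    # new ({"first": ..., "last": ...}) records.
    rec = db[uid]
    if "fullname" in rec:
        first, _, last = rec["fullname"].partition(" ")
        return {"first": first, "last": last}
    return {"first": rec["first"], "last": rec["last"]}

def write_user_phase1(db, uid, first, last):
    # Phase 1 still writes the OLD format, so any old app servers
    # remaining behind the load balancer can read everything it produces.
    db[uid] = {"fullname": f"{first} {last}"}

def write_user_phase2(db, uid, first, last):
    # Phase 2 switches writes to the new format once no old servers remain.
    db[uid] = {"first": first, "last": last}

db = {}
write_user_phase1(db, 1, "Ada", "Lovelace")
write_user_phase2(db, 2, "Alan", "Turing")
```

Because phase 1's output is readable by the old version, rolling back from phase 1 is always safe; phase 2 is only shipped once nothing older than phase 1 is serving traffic.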

------
kylecordes
I wrote up how we attack this problem a couple of years ago:

<http://kylecordes.com/2007/01/20/web-app-swap/>

including how we handle schema changes.

~~~
huherto
Great post. The idea is very clearly explained in a few paragraphs, and it's
quite complete, since several variations are considered, such as clustering,
bookmarks, and schema changes. Everything just makes a lot of sense.

------
lamby
Are you just trying to avoid it looking "bad" for visitors, or do you actually
require your site to be up that long?

If the former, one hack is just to make the downtime more fun for users - I
added a chat interface so that anyone waiting doesn't get too bored and can
interact with other members.

Screenshot: <http://lamby.uwcs.co.uk/b/playfire_maintenance.png>

Architecturally, it doesn't touch our database or "main" site at all so we are
free to break everything during an upgrade.

~~~
simonw
That's really cool. We're aiming for almost full functionality in this case,
but I can see how that would work great for some projects.

------
thegoleffect
Depends on what scale you're dealing with. If you have a high traffic site,
the db should be sharded so if you do a manual switch master-slave, only a
small piece would be affected at a given time.

But I'm guessing you're dealing with a single M-S setup. I've asked around and
it seems the standard practice for that type of setup is to create a second
table for each one you are attempting to modify, 'insert into table2 select *
from table1;', modify table2, rename table1 out of the way, rename table2 to
table1. Then, script or manually cope with any 'leftovers' in the old table
that didn't get ported to table2.
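The copy-modify-rename dance, sketched with sqlite3 (table names are made up; on MySQL the renames can be done atomically with a single RENAME TABLE statement):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [(1, "a"), (2, "b")])

# Build the modified copy alongside the live table.
conn.execute("CREATE TABLE posts_new (id INTEGER PRIMARY KEY, title TEXT, slug TEXT)")
conn.execute("INSERT INTO posts_new (id, title) SELECT id, title FROM posts")

# Swap: rename the old table out of the way, then rename the copy in.
conn.execute("ALTER TABLE posts RENAME TO posts_old")
conn.execute("ALTER TABLE posts_new RENAME TO posts")

# "Leftovers": rows written to posts_old between the copy and the swap
# would be reconciled here by diffing posts_old against posts.
rows = conn.execute("SELECT id, title, slug FROM posts ORDER BY id").fetchall()
```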

Would be interesting to have an in-memory (but written to disk) NoSQL layer
sandwiched between MySQL and the user. Then, you can change schema all you
want or switch in/out DB servers without any visible impact. Might be a leaky
abstraction though. Not like I tried that out.

~~~
simonw
You know what, I never actually thought about doing a migration by having a
duplicate of the tables running in the same database. That sounds like it
could work really well - thanks for the tip.

------
SlyShy
There are also options like Erlang and Node.js where hot code-swapping is
possible. Although having a second database is useful as a slave, of course, I
don't think it is necessary to run two copies of the database just to
redeploy.

Github just redeploys by killing and restarting Unicorn workers gradually.
It's graceful, because any worker that is handling a connection won't be
killed, so you won't get any dropped connections.
<http://github.com/blog/517-unicorn>

~~~
simonw
I'm not hugely concerned about swapping out application logic, since with a
bunch of application servers it's possible to pull some out of the pool,
upgrade them, then use the HTTP load balancer to redirect all traffic to the
servers running the updated code.

The big challenge is making changes to the database schema and co-ordinating
the deployment of those changes with the switch over of traffic to the new
application logic.

~~~
mpfefferle
This may not be a perfectly generalized solution, but is it possible to
structure your system such that when upgrading from version i to version i+1,
version i of your app is compatible with both versions i and i+1 of your
database and vice versa?

Say, for example, you're factoring a column out into its own table. Don't just
drop the column; set up a trigger to synchronize the original column value with
a value from one of the rows in your new table.

You could then finally drop the column in version i+2.
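A sketch of that trigger in sqlite3 (schema and trigger names are invented): version i+1 moves email into its own table, while the trigger keeps the legacy column populated so version i keeps working until the column is dropped in i+2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
# Version i+1 writes email to its own table...
conn.execute("CREATE TABLE user_emails (user_id INTEGER, email TEXT)")
# ...but a trigger mirrors it back into the legacy users.email column,
# so version i of the app still sees the value it expects.
conn.execute("""
    CREATE TRIGGER sync_email AFTER INSERT ON user_emails
    BEGIN
        UPDATE users SET email = NEW.email WHERE id = NEW.user_id;
    END
""")

conn.execute("INSERT INTO users (id, email) VALUES (1, NULL)")
conn.execute("INSERT INTO user_emails VALUES (1, 'ada@example.com')")
legacy = conn.execute("SELECT email FROM users WHERE id = 1").fetchone()[0]
```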

I seem to remember finding a book on database refactorings that covered this
technique in more detail. It was online so you could try Googling for it.

------
wvenable
I've got a pretty good setup going, most of the changes do not require any
downtime at all. Adding a table or column rarely requires any downtime (the
existing code knows nothing about the table/column and continues on its way)
-- push the DB change first, then the code. Removing a table or column can work
as well, push the code change first and then remove them.

For more grueling changes (those that require data conversion), I still take
down the site. I script the changes and then take the site down, convert,
deploy, bring the site back up. Smaller changes take only a few minutes,
longer changes can take hours. However, the length of the downtime is
inversely proportional to how frequently you need to do it.

Sometimes taking the site down is appropriate. For really big changes, users
simply cannot continue to use the site and be unaffected.

------
jhancock
It's a pretty special app that can't handle a few seconds of downtime. The
first thing I would do is be very certain this is a requirement.

I thought it was a requirement for a couple of webapps I manage and I now
think otherwise. I have scripts for starting and stopping various server
processes, and other scripts that pull them together to do a full deploy
like the one you are talking about. I could optimize it, but after doing it this
way for a bit I realized I feel safer by keeping it simple. I'm fairly certain
I've never had a quality of service problem with my users.

------
thinkbohemian
Not sure if this is exactly what you're getting at, but I use Capistrano. It
was built for Rails deployments and does require some scripting/setup, but once
you've got that down, I can push changes to any of my sites all day long. I
have a few WordPress installs that I deploy with Capistrano as well. Once
you've got everything set up, there is no noticeable downtime.

<http://www.capify.org> ... to modify the database schema you can put
migration logic in your scripts.

------
emmett
For simple things like adding tables or adding columns, just do it. Add the
column/table, then release new app code relying on it.

For more complex things (changing the name of an existing column, or breaking
a table into two parts, etc.) you need to write a compatibility mode into the
application code. New writes go to the new column/table name, reads go to both
places. Once that's released, migrate all the data as slowly as you like
behind the scenes. When you're done, you can drop the old column or table.
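For the column-rename case, the compatibility mode might look like this sqlite3 sketch (names are invented): new writes touch only the new column, reads coalesce across both until the backfill finishes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Renaming "uname" to "username": add the new column first (cheap and
# harmless to old code), then ship the dual-read application code.
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, uname TEXT)")
conn.execute("INSERT INTO accounts (id, uname) VALUES (1, 'ada')")
conn.execute("ALTER TABLE accounts ADD COLUMN username TEXT")

def set_name(uid, name):
    # New writes go only to the new column.
    conn.execute("UPDATE accounts SET username = ? WHERE id = ?", (name, uid))

def get_name(uid):
    # Reads check both places until the migration is complete.
    row = conn.execute(
        "SELECT COALESCE(username, uname) FROM accounts WHERE id = ?", (uid,)
    ).fetchone()
    return row[0]

conn.execute("INSERT INTO accounts (id) VALUES (2)")
set_name(2, "alan")

# Later, migrate "as slowly as you like", then drop uname and the fallback:
conn.execute("UPDATE accounts SET username = uname WHERE username IS NULL")
```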

------
kfool
This is not sane practice. There's a technology trying to address this without
you having to modify your web application at all:

<http://chronicdb.com>

The idea is to allow both the old version of the web application and the new
version to work concurrently, with no errors.

------
lamby
Can you change your database paradigm? ¬_¬ A document-oriented database like
CouchDB would "just work" in the most common database schema changes. Or
perhaps you could throw upgrade-friendly data in a KV store encoded with
Google Protocol Buffers.

------
lol_Sprint
Are you working for vendor X on the Sprint.com upgrade? 'cuz they seem to be
having this precise problem lately. Down since 2300 on Saturday with no end in
sight.

