
How Balanced Does Database Migrations With Zero Downtime - mahmoudimus
http://blog.balancedpayments.com/payments-infrastructure-suspending-traffic-zero-downtime-migrations
======
sigil
You mean you didn't have to "build a custom HTTP server and application
dispatching infrastructure around Python's Tornado and Redis" in order to
suspend traffic? [1]

I like the simplicity of your approach, it's literally 2 lines of code. Bravo
guys.

[1] https://www.braintreepayments.com/blog/how-we-built-the-software-that-processes-billions-in-payments
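
For the curious, a rough guess at what those two lines might look like -- this is a sketch, not the actual code from the post. The socket path and frontend name are assumptions, though HAProxy's "set maxconn frontend" runtime command (added in 1.5-dev) and the socat invocation are real:

    # Suspend/resume traffic via HAProxy's stats socket.
    # "/var/run/haproxy.sock" and the "api" frontend are hypothetical names.
    from fabric.api import sudo, task

    HAPROXY = 'echo "%s" | socat stdio unix-connect:/var/run/haproxy.sock'

    @task
    def suspend_traffic():
        # New connections queue in the kernel instead of being accepted.
        sudo(HAPROXY % "set maxconn frontend api 0")

    @task
    def resume_traffic(limit=4000):
        # Restore the limit; queued connections drain to the app servers.
        sudo(HAPROXY % ("set maxconn frontend api %d" % limit))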

~~~
jtdowney
To be fair, they had to upgrade to a development release (1.5) of HAProxy to
configure it in this way. When we introduced the broxy at Braintree, that
feature did not exist.

(Disclosure - I work at Braintree)

~~~
pgr0ss
The broxy also adds more functionality, such as intelligent rate limiting by
merchant (no one merchant can consume all of our backend app processes).

(Disclosure - I also work at Braintree)

~~~
msherry
We actually have much the same functionality. We can dynamically route/rate-
limit requests on a per-marketplace (or anything else, really) basis, using
Nginx's Lua integration capabilities. We may end up writing another post on
this, if there's any interest from the community.
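
The real implementation lives in Nginx's Lua layer as described above, but the underlying idea is an ordinary token bucket keyed by marketplace. A minimal Python sketch of that idea (all names hypothetical):

    import time

    class TokenBucket(object):
        def __init__(self, rate, capacity):
            self.rate = rate          # tokens refilled per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.last = time.time()

        def allow(self):
            now = time.time()
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    buckets = {}

    def should_allow(marketplace_id, rate=50, burst=100):
        # One bucket per marketplace: no single merchant can consume
        # all of the backend app processes.
        bucket = buckets.setdefault(marketplace_id, TokenBucket(rate, burst))
        return bucket.allow()  # False -> reject with 429 or queue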

~~~
rurounijones
I think you can safely assume that any post along the lines of

"How we handle something <TECHNICAL> at balanced"

or

"How we handled <SITUATION> using <TECHNOLOGY> at balanced"

is of interest to this community.

No need to ask, get writing, hop to it! :)

------
bokonist
This strikes me as a very risky way to do a migration. If there are bugs in
the new application code you basically cannot roll back, since the old version
of the code ran against an old database schema. The slower approach of making
the schema change backwards compatible, deploying new code, and then dropping
the old columns seems a lot safer.

~~~
msherry
You're definitely correct that bugs in the new application code would be a
showstopper. Another bit of infrastructure we plan to write about if there's
interest is our testing setup.

Basically, before any deploy (not just drastic ones like this), all existing
unit tests for the new code have to pass (with a certain minimum code coverage
threshold), and a full acceptance suite has to pass as well, which tests the
new code and how it connects to all of our other bits of infrastructure.

We simulate an extensive set of operations that a client might perform against
a test instance of the server, and also run a number of tests with all of our
services loaded in-memory, which allows us to mock/patch arbitrary points in
the code and assert that what we expect to happen is actually happening. We
also run each of our various clients' test suites against the new code, to
make sure that each client sees the behavior it expects to see.

This testing suite has dramatically increased our confidence any time we have
to do a deploy, and best of all, it's all done automatically.
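
To give a flavor of the in-memory part, here's a minimal sketch -- the app factory, endpoint, and module paths are invented for illustration, not Balanced's actual code:

    import unittest
    from unittest import mock

    from myapp import create_app  # hypothetical application factory

    class AcceptanceTest(unittest.TestCase):
        def setUp(self):
            # Every service runs in-memory in this process, so the test
            # can reach into any of them with mock.patch.
            self.app = create_app(config="test")
            self.client = self.app.test_client()

        def test_debit_hits_the_ledger(self):
            with mock.patch("myapp.ledger.record_entry") as record:
                resp = self.client.post("/debits", data={"amount": 1000})
                self.assertEqual(resp.status_code, 201)
                # Assert that what we expect to happen actually happened.
                record.assert_called_once()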

~~~
mahmoudimus
If HN is interested, I can write about how this works -- I've demonstrated it
to a few people, and it allows us to move FAST and confidently.

Since we use services internally, being able to confidently test the
interactions between all of our services (6+ at this point) is a HUGE win for
us.

Open up an issue here: <https://github.com/balanced/balanced.github.com> if
you're interested.

~~~
krichman
Are you seriously questioning that there's interest? I come here for the few
links to tech articles with high signal:noise ratios. There are perhaps one or
two a day.

Judging by what just got posted, please assume you can post anything about
your tech stack and we will love it.

------
bobf
I gave a presentation on zero downtime database migrations at a devops
conference in Boston in November 2012. Slides are here:
http://www.completefusion.com/zero-downtime-deployments-with-mysql/

I also wrote a more detailed post about it for SysAdvent 2012 -
http://sysadvent.blogspot.com/2012/12/day-3-zero-downtime-mysql-schema-changes.html

------
victortrac
What's the point of ELB, Nginx, and then HAProxy?

~~~
msherry
Nginx is primarily for SSL termination and static assets. At the time we set
up our infrastructure, I don't believe HAProxy supported SSL termination.
According to Willy Tarreau's comment on the first answer to this question
(http://serverfault.com/questions/426919/should-i-use-an-ssl-terminator-or-just-haproxy),
it was added in the same release as the patch I mentioned, coincidentally.

~~~
victortrac
Why not let ELB handle SSL termination and load balancing (ignoring the fact
that HAProxy can delay connections by 15 seconds)?

~~~
msherry
Because we process credit card payments and thus fall under PCI scope, we have
to adhere to the PCI DSS (Data Security Standard). There's a "quick" summary
of it here:
https://www.pcisecuritystandards.org/documents/pci_ssc_quick_guide.pdf
Section 4.1 in particular specifies that we have to secure cardholder
data all the way to our servers -- Amazon's ELB doesn't quite count.

~~~
stephen
I believe Amazon is PCI compliant now? Would that change things?

~~~
msherry
Amazon being PCI-compliant was a requirement for us using them in the first
place :) We could have possibly made a case for their PCI-compliance obviating
the need for us to do our own SSL termination, but that could have gone either
way, depending on our PCI audits.

Using Nginx also lets us do fun stuff with routing using Nginx's Lua
integration, which we may end up writing about in the future as well.

~~~
zwily
OK then why HAProxy? Why not just let nginx do the load balancing? (Obviously
you have a reason now if you plan to use the method in the blog post again,
but what about before?)

~~~
AaronBBrown
I use nginx + haproxy and use haproxy for the load balancing piece, too.
haproxy simply has much more visibility into the queue. I'm not aware of
anything built into nginx that is as robust as the logging and stats page from
haproxy. This makes horizontal scaling decisions infinitely easier.
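
For example, haproxy's stats socket will dump per-frontend/backend/server CSV with current sessions, queue depth, and error counts ("show stat" is a real haproxy command; the socket path below is an assumption):

    import socket

    def haproxy_show_stat(path="/var/run/haproxy.sock"):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(path)
        s.sendall(b"show stat\n")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        s.close()
        # One CSV row per frontend/backend/server -- queue depth (qcur),
        # current sessions (scur), error counts, etc.
        return b"".join(chunks).decode().splitlines()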

~~~
zwily
I see... Do you run nginx and haproxy on the same box?

~~~
AaronBBrown
Yes.

------
gngeal
"- perform schema changes in a way that won’t break existing code (e.g.,
temporarily allowing new non-nullable columns to be NULL). \- deploy code that
works with both the old and new schema simultaneously, populating any new rows
according to the new schema. \- perform a data migration that backfills any
old data correctly, and updates all constraints properly. \- deploy code that
only expects to see the new schema."

That sounds a lot like transactional schema updates in Firebird to me. Plus
being careful about how the app handles the data. Schema updates in Firebird
are essentially instantaneous, with the row updates performed lazily (although
if you need it, a SELECT COUNT(*) will force an update of all rows
immediately).

~~~
joseph_cooney
Oracle is the same - DDL is automatically committed. I think in SQL Server DDL
can be transactional. Not sure about Postgres.

~~~
joevandyk
PostgreSQL can have almost all DDL changes inside a transaction.
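
Which means a multi-statement schema change can commit or roll back as a unit. A quick illustration with psycopg2 (connection string, table, and column are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=example")
    try:
        with conn:  # commits on success, rolls back on any error
            with conn.cursor() as cur:
                cur.execute("ALTER TABLE payments ADD COLUMN memo text")
                cur.execute("UPDATE payments SET memo = ''")  # backfill
                cur.execute(
                    "ALTER TABLE payments ALTER COLUMN memo SET NOT NULL")
    finally:
        conn.close()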

------
brianr
That's a great approach for the cases where the migration takes only a few
seconds. When you start running into situations where the migration takes
hours or days, it's back to the "normal" way.

~~~
msherry
Sure, this definitely isn't the right approach for all migrations everywhere.
When we have migrations that we know will take a long time, we try to make
them non-invasive enough that we can have code that works with old and new
schemas simultaneously. That way, the migration can take as long as it needs
to, and everything Just Works™.
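
The pattern in that case looks something like this sketch (a hypothetical rename of `description` to `memo`: write both columns, and read whichever is populated, until the backfill completes):

    def set_memo(row, text):
        row["memo"] = text         # new column
        row["description"] = text  # old column; dropped once backfill is done

    def get_memo(row):
        # Old rows may not have `memo` populated while the migration runs.
        return row.get("memo") or row.get("description")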

------
dantiberian
This was a really interesting article. I feel silly for asking this but why
didn't you set up a script which did all of the maintenance, migration and
deploy tasks automatically?

You obviously thought about this a lot but I don't see why you'd want two
humans doing it instead of a script.

~~~
msherry
Thanks! We had been practicing on test servers and so we had the commands
ready to be repeated, but you're right -- for reproducibility, a script would
have been the way to go.

~~~
hvidgaard
Not just for reproducibility, but to avoid human errors. I always just get
nervous when humans make changes to production environments. I'd much rather
have a tested script do it.

------
AaronBBrown
This isn't zero downtime. It's still 15s of downtime (which is really a
trivial amount of time for a migration). As a user, I would rather see a
maintenance page up than have my connections stall out while I stare at a
blank page.

~~~
mahmoudimus
(Posting for msherry since he can't seem to respond at the moment)

The whole point of this was so API requests _wouldn't_ fail; they would just
take slightly longer. API requests don't get the option to see a maintenance
page -- they would just return errors to the client, which potentially means
lost business for them.

------
jstanley
This sounds incredibly rudimentary compared to
http://www.percona.com/doc/percona-toolkit/2.1/pt-online-schema-change.html

~~~
msherry
Hi there. I'm the author of this post.

If I'm reading this correctly, your suggestion would alter a single table
online, and at the end, I would end up with a table with a new schema
(assuming I had no foreign keys referencing the table being modified, which
seems to introduce additional complications). Presumably, this change happens
while my application is running, which means that during the migration, I
would have to use the old table format, and then cut over to the new one
instantly once the migration has completed.

Our migration at the time involved multiple table changes, many of which had
foreign keys referencing each other. It doesn't sound like this tool would
atomically switch all tables to the new schemas, which would have led to
broken data for us. Does that make sense?

EDIT: grammar

~~~
jstanley
You're right about needing to switch your application over instantly. Where
I've used it, it has mostly been to add columns to a table and thus the old
application code continues to run perfectly fine.

While you can't use pt-osc to do multi-table updates directly, you can use the
same strategy. All it does is create a new table with the new schema, add
triggers to the old table to duplicate row modifications to the new table, and
then copy old rows across. Then, when all the copies are done, it atomically
renames the new table into place and deletes the old table.

There is nothing to stop you from delaying the rename until all new tables are
ready, except that it is more hassle than just using pt-osc as it comes.
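
Spelled out by hand, the strategy is roughly this (a sketch with placeholder table/column names; the real pt-osc also adds UPDATE/DELETE triggers and copies rows in chunks):

    steps = [
        # 1. New table with the new schema:
        "CREATE TABLE payments_new LIKE payments",
        "ALTER TABLE payments_new ADD COLUMN memo TEXT",
        # 2. Trigger mirrors writes on the old table into the new one:
        "CREATE TRIGGER payments_ins AFTER INSERT ON payments"
        " FOR EACH ROW REPLACE INTO payments_new (id, amount)"
        " VALUES (NEW.id, NEW.amount)",
        # 3. Copy existing rows across (one statement here for brevity):
        "INSERT IGNORE INTO payments_new (id, amount)"
        " SELECT id, amount FROM payments",
        # 4. Atomic swap, then drop the old table:
        "RENAME TABLE payments TO payments_old, payments_new TO payments",
        "DROP TABLE payments_old",
    ]

    for sql in steps:
        cursor.execute(sql)  # cursor: any MySQL DB-API cursor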

But, point taken: your case is more complicated than the one I was thinking
of. And thanks for your thoughtful response to my somewhat dismissive comment
:)

------
jpollock
I would have used sharding, and then normal failure handling and
resynchronisation between nodes. If you have the requirement that two adjacent
versions have to be able to resynchronise after failure, upgrades (and
rollback) become equivalent to normal node failure.

As it stands, during your upgrade you've lost all of your fault tolerance and
can't meet your performance requirements - you've gone from 5 nodes able to
process the traffic to 1!

~~~
msherry
You make a good point re: fault tolerance. As it happens, I've simplified the
diagram quite a bit to make it easier to visualise. We have more than 2 Nginx
instances, and more than one shard of our app was running the new code.

~~~
jpollock
It sounds like a shard collapsed down to a single database instance?
Otherwise, you wouldn't have had to turn off requests to the shard; you could
have quiesced one of the nodes, taken it out of service, and brought it back
up with the new release.

------
perkof
It seems like the interaction required by the two engineers could have been
scripted to remove the human element. Was there a reason you chose not to do
this?

------
orofino
Perhaps I'm missing something, but why not down a segment (shard I guess),
upgrade the application and database, then fail over to that shard?

This would provide a fail back mechanism assuming you could resolve data
continuity.

I'm sure there is a reason this wasn't viable (possibly the data issue), but I
was curious.

~~~
msherry
This is basically what we did, except during the "upgrade the application and
database" step, we suspended all traffic to our app servers. The schema change
was an incompatible change (I think this is what you mean by "resolve data
continuity"?), so our old code and new code could not run simultaneously,
because they were designed against incompatible schemas.

~~~
orofino
So if this was the case, why couldn't the other shards handle application
requests while you casually upgraded the application and database tier in
say... 30 seconds (or even minutes) as opposed to sub 15s?

~~~
jdunck
As stated in the post, the db migration was a large enough change that having
both codebases work against the migrating db would have come at a high cost.

------
akoumjian
I've been eyeing Soundcloud's Large Hadron Migrator
(<https://github.com/soundcloud/large-hadron-migrator>). I would love to see a
Django/South specific implementation.

~~~
matclayton
We're a Django shop, and pt-online-schema-change is an amazing tool. We've run
it on tables with 10-50M rows in production with minimal downtime, <1 second.
I can't speak highly enough of it if you are a MySQL shop.

------
ultimoo
Great writeup! Loved reading it and adding this to my stash of known HA
strategies. Did you folks also chart out historical traffic and carry out this
migration when it was the most sparse? Like early morning on a weekend or
something.

~~~
msherry
Absolutely.

Being a payments-processing company, we have a variety of users
(marketplaces), each of whom has users who are widely geographically
distributed, so our usage doesn't drop off on weekends as much as it might for
some other types of sites. That said, we do see slightly lower traffic on
weekends than during the week, so we performed this migration on a Saturday
evening.

------
trungonnews
What happens if it takes more than 15 seconds to migrate the database?

~~~
msherry
Then some of our clients would have been disappointed by the timeout errors
they had started seeing, if they happened to make a request at the very start
of the migration ;)

We ran our migrations multiple times on test instances of our database,
because we were worried about this exact issue. We optimized the migration a
few times, removing extraneous changes to cut down the time taken. Also, 13
seconds was actually the upper bound of what we saw -- on many runs it took
closer to 9-10 seconds.

~~~
trungonnews
The time to migrate your database will only get larger as the data grows.
Looking forward to your follow-up post. :)

------
lsh123
"Zero Downtime" migration but if the migration will take less than 15 seconds.
Honestly, this is not very interesting. The same effect could be achieved by
increasing connection timeout on the clients and the server and then just
letting the clients to wait while the DB schema changes takes place. This
works great for small tables/data sets. When you get to a bigger scale and you
migration takes longer (minutes, hours, days) then you might start to look at
more advanced tools including percona tools or custom migration code in your
app.

~~~
msherry
Are you saying it would work without code changes by increasing the timeout
limit? We had to deploy new code to work with the new schemas, since the
underlying models went through drastic changes. Having our code be compatible
with old and new schemas simultaneously, as it is during most of our
migrations, would have been extremely difficult in this specific case, which
is what prompted this solution. I don't think this is a problem that could
have been solved by simple timeout changes, but I'd love to hear your thoughts
on it.

~~~
lsh123
For, say, adding a new column or changing indexes, one can probably just run
the update w/o any code changes. For more complex cases, you might need to
modify the code to work with both old/new schemas. It's all about the details
:)

------
ralph
Is the socat required in the Fabric bits at the end?

