
Online migrations at scale - hepha1979
https://stripe.com/blog/online-migrations
======
docaholic
We're doing nearly the exact same approach in my office out of necessity.

We "allowed" small amounts of downtime previously (say 10-15 minutes at a time
for migrations), but over the last year as our customer base has expanded, the
window has shrunk smaller and smaller as to avoid service disruptions.

We're now at a stage where downtime is effectively not allowed anymore, so
we've taken to this approach (make a new table, dual-write new records, then
backfill all the old records, and finally remove references to the old table)
to mitigate our downtime.
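
In rough pseudocode, the general pattern looks something like this (a
hypothetical Python sketch; `old_db`/`new_db`, their methods, and the record
shape are all made up for illustration):

```python
# Hypothetical sketch of the four phases described above.
PHASE = "dual_write"  # "old_only" -> "dual_write" -> "backfilled" -> "new_only"

def save_subscription(customer_id, plan):
    record = {"customer_id": customer_id, "plan": plan}
    if PHASE == "old_only":
        old_db.subscriptions.insert(record)
    elif PHASE in ("dual_write", "backfilled"):
        # New records go to both stores so the new table is always current.
        old_db.subscriptions.insert(record)
        new_db.subscriptions.insert(record)
    else:  # "new_only": the old table is no longer referenced
        new_db.subscriptions.insert(record)

def backfill():
    # One-off job: copy every historical record that dual writes never touched.
    # Upserts keep it idempotent, so the job is safe to re-run.
    for record in old_db.subscriptions.find_all():
        new_db.subscriptions.upsert(record)
```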

It's nice to know that other companies have taken this approach as well, my
team honestly didn't know if we were doing it "properly" or not (if there is
such a thing as properly)

~~~
user5994461
> It's nice to know that other companies have taken this approach as well, my
> team honestly didn't know if we were doing it "properly" or not (if there is
> such a thing as properly)

Incidentally, I was thinking about doing a blog post about that too.

That's good. You're confirming there's an audience for my writing. Thank you
:D

------
davidcaseria
Avoiding large data migrations is one of the reasons to consider whether
Command Query Responsibility Segregation (CQRS) and Event Sourcing (ES) are
right for the application domain. It seems in this example the domain events
haven't changed (i.e. subscription created, subscription canceled) but the
logic enforcing subscriptions has (i.e. only one subscription, then allowing
multiple). If the events were stored in an append-only journal, then only the
subscription business logic needs to change and no data needs to be migrated.
Additionally, since the catalyst for this change seemed to be query-side
performance, the query service can be optimized for reading the records
separately from the command service that writes to the event journal.
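
As a toy illustration (hypothetical Python with made-up event names): the
stored events never change, only the fold that derives current state from
them, so relaxing "one subscription per customer" to "many" is a code change
rather than a data migration.

```python
# Hypothetical event journal: append-only, never rewritten.
events = [
    {"type": "subscription_created", "customer": "lisa", "plan": "tv_service"},
    {"type": "subscription_canceled", "customer": "lisa", "plan": "tv_service"},
    {"type": "subscription_created", "customer": "lisa", "plan": "phone"},
]

def active_subscriptions(events, customer):
    # Old rule: at most one subscription; new rule: a set of them.
    # Either way, the stored events stay exactly as they were.
    active = set()
    for e in events:
        if e["customer"] != customer:
            continue
        if e["type"] == "subscription_created":
            active.add(e["plan"])
        elif e["type"] == "subscription_canceled":
            active.discard(e["plan"])
    return active

print(active_subscriptions(events, "lisa"))  # {'phone'}
```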

Event sourcing isn't without its difficulties, but if the domain is pretty
well understood then it can be a very powerful pattern.

Stripe has an awesome product and seemingly an equally great engineering
culture. I wonder if they have debated using CQRS/ES.

~~~
chatmasta
Does this require a separate data source (e.g. a message queue) to ingest and
log the events before reacting to them? Or could you retroactively use the
mongodb oplog? I imagine there is way too much data for the oplog to be
feasible in this case. And if you didn't design the system from day 1 around
an event model, wouldn't it be difficult to retroactively introduce this new
architecture?

------
Diederich
It's nice to see this kind of thing covered.

A company I worked for some years ago, LiveOps, also followed this
methodology, because we had to.

LiveOps was and is a 'telephony in the cloud' provider, and the databases are
in-line with some of the important call flows.

Last I looked, over the previous ten years LiveOps had handled at least one
call from something like 15% of all of the possible phone numbers in the
United States, and all of that data was available in a huge replicated MySQL
cluster spanning multiple datacenters. We had MySQL tables with tens of
billions of rows, replicated across many servers.

And just to be crystal clear: there was absolutely no downtime allowed, ever,
because downtime of any kind meant that phone calls would not go through.

We used extensive feature flags, along with the techniques described in the
stripe.com article, to achieve exceptional uptime, relatively ok operational
overhead and pretty good development velocity.

The point I'd like to drive home is that this is all possible even for medium
sized organizations and moderately sized staffs. The important part is that
everyone needs to keep these priorities, methods and techniques in mind, all
the time, to make it work.

------
smnc
The article refers interchangeably to tables and collections. Reading this[1],
it seems they are using MongoDB as their transactional database. If they do,
it would be interesting to know at what point they decided to split the
subscriptions data into its own collection. The article just states "As we
added new features, this data model became problematic."

Mongo modelling being a black art of sorts (to me, at least), I'd be more
interested in what the tipping point was for them (data size and shape, usage
patterns) than in the relatively straightforward (conceptually at least;
operationally these things are never to be underestimated) table-doubling
approach to changing a data model.

[1] [https://stripe.com/blog/announcing-mosql](https://stripe.com/blog/announcing-mosql)

~~~
mglukhovsky
We plan to share more posts on our data infrastructure: like you pointed out,
there are lots of useful patterns to share. In the meantime, I’ve updated the
post to be more consistent (thanks for mentioning it.)

------
misterbowfinger
This article seems to be about really fundamental changes to your data
modeling. If you're using MySQL and interested in only changing one table,
i.e. adding/removing a column or an index, check out LHM:

[https://github.com/soundcloud/lhm](https://github.com/soundcloud/lhm)

I've used it in the past a lot. You have to make sure to throttle the
migration, or else it'll max out your CPU & memory.

~~~
wolf550e
Is it better than pt-online-schema-change from percona-toolkit? That thing
creates a new table which is the same shape as the old one, runs an ALTER on
the new table, sets up triggers on the old table to update the new table, and
then copies the old rows in batches while monitoring replication slave lag.
Then it renames the tables and drops the old one. You do need storage for
twice the space of the table, at least temporarily.

~~~
mcpherrinm
Square's wrapper around pt-online-schema-change is really slick:
[https://github.com/square/shift](https://github.com/square/shift)

It has a nice webapp to monitor and run schema migrations, including peer
review. I use it all the time at Square, and it's pretty great.

------
YZF
An alternate pattern goes like this:

1. Create the new data model and adjust reads to read from the new model
first and fall back to the old model. Update ALL your readers first.

2. Change writes to use the new model.

3. Now you have different options. You can leave things as they stand. You can
actively migrate data from old to new. Or you can migrate lazily, for example
migrating a record only when you read the old data.

There are some gotchas depending on your system specifics, esp. races between
writes from the migration process and writes coming from the application. How
you address those depends on the specifics of your situation...
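
A minimal sketch of the read-with-fallback and lazy-migration steps
(hypothetical Python; `new_store`, `old_store`, and `transform` are stand-ins
for whatever your system uses):

```python
# Hypothetical read path for steps 1 and 3: prefer the new model, fall back
# to the old one, and optionally migrate lazily on that read.
def read_record(key):
    record = new_store.get(key)
    if record is not None:
        return record
    old = old_store.get(key)
    if old is None:
        return None
    migrated = transform(old)  # old model -> new model
    # Lazy migration: persist on first read. An insert-if-absent avoids
    # clobbering a concurrent application write to the new store (one of
    # the races mentioned above).
    new_store.put_if_absent(key, migrated)
    return new_store.get(key)
```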

------
rattray
I'm curious about how this edge case was handled:

1. Lisa has old tv_service subscription, saved to snapshot, but not
live-copied to Subscription table.

2. Lisa removes tv_service subscription.

3. Before snapshot is updated, batch processing adds lisa_tv_service to the
Subscription table.

4. Subscription now has lisa_tv_service, but Lisa.subscriptions does not
include tv_service.

My only idea: wait a day, then run a batch-process job looking for data that
exists in the Subscription table but not in Customer.subscriptions.

~~~
smnc
In step two both collections/tables are updated. The backfilling routine is
hopefully "smart" enough to leave stuff already in the new table alone.

~~~
rattray
Right, but there's nothing to delete in the Subscription table at step 2.
`lisa_tv_service` isn't added until the backfilling step.

~~~
smnc
Ah, I see, you're right.

Well, one solution would involve a log table used to log/look up the
migration status: any time the real-time system touches data (e.g. deletes the
subscription in the source table and has nothing to do in the new table), the
event is logged. The backfilling routine uses that data to decide whether its
data or the current data is "fresher".

If the old subscription data is never (or for a certain period) deleted but
merely marked inactive (e.g. with fields "timestamp_end" and
"timestamp_modified" updated), then that information can be used during the
backfilling to decide which data is more recent. [edit: I guess this is what
haldean is referring to above by "tombstone records."]
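
A rough sketch of that backfill check (hypothetical Python; the table handles
and field names are stand-ins, assuming old rows keep "active" and
"timestamp_modified" fields as above):

```python
# Hypothetical backfill step: only copy a row into the new table if the new
# table doesn't already hold something at least as fresh, and skip rows the
# live system has since marked inactive (the "tombstone" case).
def backfill_row(old_row):
    existing = new_table.find_one({"id": old_row["id"]})
    if existing and existing["timestamp_modified"] >= old_row["timestamp_modified"]:
        return  # the dual-writing live path already wrote something newer
    if not old_row["active"]:
        return  # canceled before the backfill got here; nothing to copy
    new_table.upsert({"id": old_row["id"]}, old_row)
```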

That makes sense, I think. Does it? Elsewhere in the thread I referred to the
procedure used as "conceptually relatively simple". Most things are
conceptually simple, but always with pitfalls and dark things lurking around
the edges.

------
dorianm
There are other articles for those interested:

- [https://blog.codeship.com/rails-migrations-zero-downtime/](https://blog.codeship.com/rails-migrations-zero-downtime/)

- [https://blog.engineyard.com/2011/zero-downtime-deploys-with-...](https://blog.engineyard.com/2011/zero-downtime-deploys-with-migrations)

In the Ruby on Rails world, it's definitely a thing (made easier with
attribute reader / writer, and db/migrations/)

------
d7
Considering doing something similar while we add a graph database for
relationships between users in addition to Postgres. The consistency checks
between the two databases can get messy. How long did you test to make sure no
niche case fell through the cracks?

------
tybit
Nice article. The use of Scientist before switching over reads was especially
good to know.

------
overcast
The big question that wasn't really covered is how they are selectively
updating the writers and readers in a live environment, especially migrating
on a per-object basis to have them start dual writing and reading.

~~~
luhn
Feature flags, most likely. Deploy the dual-write code behind a feature flag,
enable the feature flag for a small fraction of requests, start notching up
the percentage until you're at 100%, then remove the feature flag and the old
single-write code. Rinse and repeat for the remaining three steps.
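
Roughly something like this (a hypothetical Python sketch; the hashing scheme
and the write helpers are made up, not anything Stripe described):

```python
import hashlib

# Hypothetical percentage-based feature flag guarding the dual-write path.
def flag_enabled(flag_name, customer_id, rollout_percent):
    # Hash to a stable bucket so each customer keeps the same yes/no answer
    # as the rollout percentage is notched up.
    digest = hashlib.sha1(f"{flag_name}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

def save_subscription(customer_id, record):
    write_to_old_table(customer_id, record)  # existing single-write path
    if flag_enabled("dual_write_subscriptions", customer_id, rollout_percent=25):
        write_to_new_collection(customer_id, record)  # new dual-write path
```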

~~~
overcast
The actual deployment was my curiosity :)

~~~
luhn
Ah, my bad, I misunderstood.

It's pretty normal for tech companies to do dozens of deploys a day, thanks to
CI/CD. There are lots of ways to achieve this, but they all pretty much boil
down to the same idea: boot up the new version before shutting down the old
version.

------
etaty
> Once the migration is complete, run the Scalding job once again to make sure
> there are no existing subscriptions missing from the Subscriptions
> collection.

That sounds scary! How many times do you need the "once again"?

~~~
mglukhovsky
This is a safety check. Because of the dual-writing pattern, there's no reason
it should be necessary, but with migrations it makes sense to be extra-
cautious and to double-check your work.

------
ploggingdev
In Part 2 where they start reading from the new table, what happens when old
subscription data is not present in the new table? Do they then copy the data
from the old table to the new table when the need arises?

~~~
MBCook
You have two options. The first is to not move to phase two until EVERY record
is duplicated. Looks like that's what they did.

The other option is the on-demand option you mentioned. The biggest issue I
see is that the latency to fetch that information suddenly becomes a lot more
variable. If it's not in DB B, then you have to fetch from A and write into B
before returning the information.
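
Roughly, that slow branch looks like this (a hypothetical Python sketch;
`db_a`/`db_b` are stand-ins), which is where the extra round trip and the tail
latency come from:

```python
# Hypothetical on-demand read path: the slow branch does a read from A plus
# a write to B before it can answer.
def get_subscription(sub_id):
    record = db_b.get(sub_id)
    if record is not None:
        return record                # fast path: already migrated
    record = db_a.get(sub_id)        # extra round trip to the old store
    if record is None:
        return None
    db_b.put(sub_id, record)         # extra write before responding
    return record
```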

------
tbrock
Sure, but what do you do if your migration requires merging rows of existing
data?

You can't rewrite the pointers to the merged rows while the application is
online.

------
andyfleming
How were the animated graphics for this article created?

~~~
awinter-py
looks like custom tweening code they wrote for their blog

[https://stripe.com/assets/blog/animate-svg-fe67ad3b4fe397689...](https://stripe.com/assets/blog/animate-svg-fe67ad3b4fe3976891c53f1bd707c64a.js)

------
yeukhon
But how are you going to do this if you are constantly changing models? Like
every other day?

------
jaequery
So to sum it up: create table, copy data, rename table. Isn't this already
tackled by tools created by Percona and many others out in the market? Other
than them adding in Hadoop to speed things up, what else is new here?

~~~
adevine
"What else is new here" is you can't have any downtime, and you have to have a
strategy for having your code work against the data in all of those
transitional states during that process.

A lot of migrations would be pretty trivial if you could have even a couple
hours of downtime, but in an expectation of 24/7 availability that is no
longer acceptable in most situations.

------
danthaman44
This is some very impressive stuff!

------
spullara
Yikes. Use evolvable schemas and never do mass migrations.

~~~
icebraining
How would that work? If you never do an explicit migration, you'll always have
records on very old schema versions. Doesn't that force you to support every
version forever in the code?
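
In other words, do you end up carrying a chain of per-version upgraders like
this forever (a hypothetical Python sketch of the concern)?

```python
# Hypothetical upgrade-on-read chain: every old schema version needs an
# upgrader, and the chain only ever grows.
def v1_to_v2(r):
    # v1 had a single "plan" field; v2 stores a list of plans.
    return {"id": r["id"], "plans": [r["plan"]], "version": 2}

def v2_to_v3(r):
    # v3 added a currency field with a default.
    return {**r, "currency": "usd", "version": 3}

UPGRADERS = {1: v1_to_v2, 2: v2_to_v3}

def read_record(raw):
    record = dict(raw)
    while record.get("version", 1) in UPGRADERS:
        record = UPGRADERS[record.get("version", 1)](record)
    return record  # latest shape on every read; old rows never rewritten
```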

------
mighty_warrior
This sounds a little overcomplicated, all in an effort to decrease the amount
of code changed at any given time. They took great pains to keep data in sync
across A and B datastores and I'm not so sure that extra cost was worth the
perceived stability of this approach.

~~~
perlgeek
> They took great pains to keep data in sync across A and B datastores and I'm
> not so sure that extra cost was worth the perceived stability of this
> approach.

Such great pains come with huge systems. What's the alternative?

Taking the platform offline for a few hours? Management will say no. Or maybe
management will say yes once every three years, severely limiting your ability
to refactor.

Doing a quick copy and hoping nobody complains about inconsistencies? Their
reputation would suffer severely.

~~~
mighty_warrior
They maintained a replication process across both tables as they updated the
read processes before updating the write process. Say for whatever reason
their offline replication process broke for 2 hours. For those 2 hours that
replication is broken, the system is reading from a table that is not in sync
with the table that is receiving writes. At that point you are displaying
incorrect subscription data to your customers.

~~~
sorkin2
> They maintained a replication process across both tables as they updated the
> read processes before updating the write process. Say for whatever reason
> their offline replication process broke for 2 hours.

From the article I got the impression that both tables were being written to
in the same database transaction, so this is not a possible failure scenario
at all.

