
Continuous Deployment at Instagram - nbm
http://engineering.instagram.com/posts/1125308487520335/continuous-deployment-at-instagram/
======
madetech
Shameless Plug: I've recently been involved in writing a book on Continuous
Deployment, which covers many of the points Instagram are writing about here
(but in greater detail).

I've got ~1,000 printed copies to give away. So if anyone wants one, go here:
[http://madete.ch/1S3OGvl](http://madete.ch/1S3OGvl), follow the link on
the left-hand side, and we'll mail a copy to you.

~~~
lpbonenfant
Do you ship outside the United States?

~~~
madetech
We're based in the UK and have shipped plenty overseas so far. I'm sure we'll
be able to get something to you, if you're not in _too_ remote a location.

~~~
salex89
Strange, I'm not from the US (I'm from Serbia), and it says I can't even buy
the book on Amazon (I was hoping for the Kindle edition). I did apply for the
hard copy, though.

------
iyn
What are the best practices for database migrations when trying to set up
continuous deployment? Are there any existing tools/solutions that
solve/simplify the problem? This is the issue that is almost always missing
from articles/tutorials about CD.

~~~
nbm
Beyond a certain size (basically, once a migration takes too long because of
the amount of data it applies to), migrations are a heavy investment - in
elapsed time, I/O, and so forth. As such, they are planned to a degree that CD
probably isn't the right tool for them (for example, you can probably only
have one migration in flight at a time).

They aren't done live as a single big process that has the potential to lock
all queries/updates for the duration of its execution, but rather as a set of
smaller steps that don't lock.

Facebook has spoken about its online schema change process before -
[https://www.facebook.com/notes/mysql-at-facebook/online-
sche...](https://www.facebook.com/notes/mysql-at-facebook/online-schema-
change-for-mysql/430801045932/) and its follow-up at
[https://www.facebook.com/notes/mysql-at-facebook/online-
sche...](https://www.facebook.com/notes/mysql-at-facebook/online-schema-
change-for-mysql-part-2/431123910932) for example, and I'm sure elsewhere.

Most people using MySQL would potentially first use something like
[https://www.percona.com/doc/percona-toolkit/2.1/pt-online-
sc...](https://www.percona.com/doc/percona-toolkit/2.1/pt-online-schema-
change.html) instead of trying to create their own.

The same principles apply to other data stores that have more rigid schemas.
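
To make the "smaller steps" idea concrete, here's a rough Python sketch of the
shadow-table approach that tools like pt-online-schema-change automate. The
table and column names are invented, and the triggers the real tool installs
to capture writes made during the copy are left out:

```python
# Sketch of an online schema change done as small, non-locking steps:
# build a shadow table with the new schema, copy rows over in small
# batches, then swap. A real tool (pt-online-schema-change) also installs
# triggers so writes that happen during the copy aren't lost.
import time
import MySQLdb  # assumes the MySQLdb/mysqlclient driver

conn = MySQLdb.connect(host="db", user="app", passwd="secret", db="app")
cur = conn.cursor()

# 1. Shadow table with the desired schema change applied (fast, no data yet).
cur.execute("CREATE TABLE photos_new LIKE photos")
cur.execute("ALTER TABLE photos_new ADD COLUMN caption TEXT")

# 2. Copy rows in small primary-key ranges so no single statement holds
#    locks or saturates I/O for long.
BATCH = 1000
last_id = 0
while True:
    cur.execute(
        "SELECT MAX(id) FROM (SELECT id FROM photos WHERE id > %s "
        "ORDER BY id LIMIT %s) AS batch",
        (last_id, BATCH),
    )
    upper = cur.fetchone()[0]
    if upper is None:
        break
    cur.execute(
        "INSERT INTO photos_new (id, user_id, url) "
        "SELECT id, user_id, url FROM photos WHERE id > %s AND id <= %s",
        (last_id, upper),
    )
    conn.commit()
    last_id = upper
    time.sleep(0.05)  # back off between batches to keep I/O under control

# 3. Atomic rename; this takes a brief metadata lock, not a long table lock.
cur.execute("RENAME TABLE photos TO photos_old, photos_new TO photos")
```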

~~~
henrikschroder
MySQL has online, non-blocking schema changes since version 5.6. But the
underlying data file has to be upgraded to the latest format version for it to
work first, and to do that in a non-blocking way on a master server you are
probably best off running percona toolkit one first. ALTER TABLE --- FORCE
does a data file rewrite.

~~~
Rapzid
That's a big "yeah but".

You can't rate-limit the I/O. That's huge, and the Percona tool and LHM offer
ways to keep I/O under control. Along with that, you have to pay attention to
the live schema update matrix in the docs; many things will upgrade without
locking but require copying the table, and that will blast your I/O.

In addition, even with the tools, some changes require exclusive table locks.
They're only needed for a short period of time, but if the lock can't be
acquired because of a long-running transaction, a queue of transactions, etc.,
it can block everything up.

------
avolcano
Also seconding the confusion that other commenters have regarding the "three
commits max" rule for automated deploys. Maybe engineers at Facebook are just
big fans of rebasing, but I often make commits on feature branches that don't
"stand on their own" - i.e., they would break some functionality without
subsequent commits. I'm not sure why you'd want to deploy one commit at a time
unless you kept a very strict "one commit == one standalone feature/bugfix"
rule, which isn't mentioned in this post.

(I suppose it's also possible that that's referring specifically to _merge
commits_ into master, which would make a lot more sense to me)

~~~
nbm
I'm not sure about Instagram, but Facebook is a fan of rebasing in general.
Nothing should ever appear as a commit in master that isn't something that
should be used in production - i.e., it should never be intentionally broken
in isolation.

In general, feature branches are relatively short-lived, and will be code
reviewed, rebased, and landed as a single commit onto master.

Features are often feature-flagged off anyway, so it is acceptable to commit a
partially-functional feature to master while it is flagged off.
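
Roughly, the flag check itself is trivial - something like this sketch (the
flag store and names here are made up, not the actual gating system):

```python
# Toy feature-flag check: half-finished features can live in master but
# stay gated off, so landing them can't break users.
FLAGS = {
    "new_photo_filters": False,  # still under development
    "reshare_button": True,      # fully rolled out
}

def is_enabled(flag, user_id=None):
    # Real systems also support percentage rollouts, employee-only
    # gating, per-country gating, and so on.
    return FLAGS.get(flag, False)

def render_photo_page(user_id):
    parts = ["<photo>"]
    if is_enabled("new_photo_filters", user_id):
        parts.append("<filter-picker>")  # only runs once the flag is flipped
    return "".join(parts)
```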

There is a concept of stacked commits, but each commit in the stack needs to
be a working step towards the end goal, and as such each can (and will) be
landed in isolation as it is code reviewed.

~~~
serge2k
> Nothing should ever appear as a commit in master that isn't something that
> should be used in production - ie, should never intentionally be broken in
> isolation.

I don't understand why people do it any other way.

~~~
ihsw
Some people Ctrl+S and commit every few files, to keep from pushing changes to
10+ files in a single commit.

------
seanwilson
If they're doing up to 50 commits a day and deploy all commits to master
automatically how does that line up with "It makes it much easier to identify
bad commits. Instead of having to dig through tens or hundreds of commits to
find the cause of a new error, the pool is narrowed down to one, or at most
two or three"?

If you do a commit, and find out in the middle of the day that the latest
deploy is having problems while people are still committing new code, wouldn't
this make things much harder to narrow down?

~~~
nbm
Assuming they're spread out over ~10-12 hours (some crazy morning people, some
crazy nocturnal people), that's only ~4-6 commits per hour.

Most problems will be discovered by someone and reported within an hour, and
most of those will also be discoverable in a dataset on a system like Scuba -
[https://www.facebook.com/notes/facebook-engineering/under-th...](https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920)
- and you can identify the first time that particular issue happened.

If you're lucky, it lines up exactly with a commit landing, and you only need
to look at that. Otherwise, due to sampling, maybe you need to look at the two
or three commits before your first report/dataset hit. You can also use some
intuition about which of the last n commits is the likely cause. A URL
generation issue? Probably the commit in the URL generation code. You'd do the
same thing with a larger bundled rollout, but over a larger number of commits
(50, in the case of a daily push).
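
Roughly, the narrowing-down step looks like this (the deploy log, window, and
commit hashes below are invented for illustration - Scuba's actual query
interface isn't shown):

```python
# Sketch: given the timestamp of the first sighting of a new error and a
# log of deploys, list the handful of commits that could plausibly be the
# cause.
from datetime import datetime, timedelta

# (deploy time, commits included in that deploy) - invented sample data.
DEPLOYS = [
    (datetime(2016, 2, 1, 14, 0), ["a1f3c2"]),
    (datetime(2016, 2, 1, 14, 20), ["b9d411", "c07e55"]),
    (datetime(2016, 2, 1, 14, 40), ["d2aa90"]),
]

def suspects(first_error_seen, lookback=timedelta(minutes=30)):
    """Commits deployed shortly before the error was first observed."""
    return [
        commit
        for when, commits in DEPLOYS
        if first_error_seen - lookback <= when <= first_error_seen
        for commit in commits
    ]

print(suspects(datetime(2016, 2, 1, 14, 45)))  # ['b9d411', 'c07e55', 'd2aa90']
```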

------
ed_blackburn
I've always found that the biggest hurdle with CD is never the tech. It's
overcoming fear and tweaking culture.

Nevertheless it's always great to read how others accomplish their goals and
even better that they're willing to share the journey.

Personally, I find it incredibly frustrating to see code that I write not ship
for weeks or even months at some clients. It's a slow process, but we'll get
there...

------
lopatin
Looks like there's a lot of schema migration talk here. Out of curiosity, does
anyone have production experience with lazy migrations for serialized data?
Where your model migrations exist as code: an array of functions that convert
one version of the model schema to the next. The schema version is encoded
into the data. The migrations are lazy because the model is fast-forwarded to
its latest version at the last possible moment, when the code reads the
serialized model. I know Braintree does this with Riak. Anyone else?
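
For anyone who hasn't seen the pattern, here's a minimal sketch of what I mean
(the field names are made up, and this isn't Braintree's actual code):

```python
# Lazy, code-as-data migrations: each function upgrades a record from
# schema version N to N+1, and records are fast-forwarded to the latest
# version only when they are read back.
import json

def v1_to_v2(doc):
    # v2 split "name" into first/last name fields.
    first, _, last = doc.pop("name").partition(" ")
    doc["first_name"], doc["last_name"] = first, last
    return doc

def v2_to_v3(doc):
    # v3 added an explicit "country" field with a default.
    doc.setdefault("country", "US")
    return doc

MIGRATIONS = [v1_to_v2, v2_to_v3]  # index i upgrades version i+1 -> i+2
LATEST = len(MIGRATIONS) + 1

def load(serialized):
    doc = json.loads(serialized)
    version = doc.pop("_schema_version", 1)
    for migrate in MIGRATIONS[version - 1:]:  # fast-forward lazily, on read
        doc = migrate(doc)
    doc["_schema_version"] = LATEST
    return doc

print(load('{"_schema_version": 1, "name": "Ada Lovelace"}'))
# {'first_name': 'Ada', 'last_name': 'Lovelace', 'country': 'US', '_schema_version': 3}
```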

~~~
siculars
I work for Basho. Would love to hear more about that use case. Regardless,
unstructured data with encoded schema/explicit versioning ftw!

------
HealthyTree
Regarding canary releases and detecting errors, one aspect that is sometimes
overlooked is the possibility of bugs on the client side. At work, we have a
fairly large JS-centric app with a non-negligible number of purely client-side
bugs. While tracking HTTP status codes on the backend is fairly
straightforward, we find it much harder to get the same type of information
from the frontend. Would love to hear if anyone has experience in that area.

~~~
nevon
Look into client-side error tracking. We're using Sentry where I work.

It will catch any uncaught exceptions, group them together and normalize the
stack traces (because of course they look different across browsers). If you
tag the errors by their release, you can see if your release introduced any
regressions.

We're not using it in any automated way, though, because there's so much
noise. Any time a phone happens to run out of memory or some extension
crashes, you'll get an error report.

Of course, it also doesn't cover non-crashing regressions, where you may have
incorrect behavior rather than crashes. Those are much harder to catch, unless
your integration tests are incredibly granular.

------
aytekin
We do something very similar at JotForm. The hardest part has always been
the "keeping it fast" part. The deployment should take less than a minute;
otherwise developers spend too much time waiting to see if the commit went
live without problems.

The solution is usually to run things in parallel. Examples: run tests in
parallel. Keep things in separate repos and push them into separate folders on
the app servers.
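
A rough sketch of the parallel-tests idea (the shard paths and the pytest
runner here are placeholders, not our actual setup):

```python
# Cut deploy time by running independent test shards at the same time; the
# same idea applies to pushing separate repos in parallel.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["tests/unit", "tests/api", "tests/web"]

def run_shard(path):
    # Each shard runs in its own process; the threads just wait on them.
    return subprocess.run(["pytest", path]).returncode

with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
    if any(code != 0 for code in pool.map(run_shard, SHARDS)):
        raise SystemExit("a test shard failed; aborting the deploy")
```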

------
krzyk
Almost all CD cases I've seen talk about canary or blue/green (red/black)
releases - a case where there is a fleet of servers.

How do you do that on a much smaller scale, e.g. when I have only 3 servers
available? If I deploy to one of them, potentially 1/3 of customers might get
a broken version.

~~~
nbm
This is actually a problem whether you're using continuous deployment or not.
You need to put a release out, which you have done some testing for, but you
know that only real production traffic can shake out certain types of
problems.

You don't have to have all three servers getting the same amount of traffic,
and you don't have to have a single copy of your service on each server. So,
you could reduce the weight of a single server that does canary traffic to
reduce the pain, or you could run two copies of your service on a server, and
have the canary copy get a trickle of traffic.

Another approach is to use shadow traffic - instead of handling a request on
either the production host or the canary host, you handle it on the production
host _and_ the canary host. You'd need to ensure the canary can't modify the
production database, for example - or maybe you only shadow read requests. If
you don't get any errors, or you're able to prove to yourself that the two
function the same, you can then move to a more traditional canary.

You definitely need to adjust your continuous deployment implementation to
your environment, whatever it is.
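
A minimal sketch of the weighted-canary idea, assuming you control the
load balancing in application code (the hostnames and weights are made up):

```python
# Weighted canary routing on a tiny fleet: rather than sending a full 1/3
# of traffic to the canary, give it a small weight so only a trickle of
# real requests exercises the new build.
import random

BACKENDS = [
    ("app1.example.com", 49),  # stable release
    ("app2.example.com", 49),  # stable release
    ("app3.example.com", 2),   # canary running the new release
]

def pick_backend():
    hosts, weights = zip(*BACKENDS)
    return random.choices(hosts, weights=weights, k=1)[0]

print(pick_backend())
```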

------
brown9-2
What is the purpose of the backlog of deploys?

For example, let's say 50 commits all land on master within the same second.
Why break those into many deployments stretched across hours instead of
deploying them all in the next event?

If you landed a bad commit in the middle of that 50, it seems like it might
not be immediately obvious once it was deployed that it was bad - and then 5
or 30 minutes later another commit is deployed on top of it.

You might not notice a problem until hours after all of the commits have been
deployed, which leaves you in the same situation as if you had deployed all 50
changes in one event, but in this model those 50 commits have been stretched
over a much longer period of time between commit and liveness to users.

~~~
mgorven
Splitting them into multiple deploys makes it easier to identify bad commits
when an error starts. We can usually correlate the start of an error with a
specific deploy, and instead of digging through 50 commits to find the cause
we only have to look at a few.

------
kaizendad
"Expect bad deploys: Bad changes will get out, but that's okay. You just need
to detect this quickly, and be able to roll back quickly."

That's an amazing statement to me. I've always worked in smaller environments
where we roll up many changes and try to deploy them perfectly. The penalty
for bad changes has been high. This is a really new way of thinking.

It's an exciting way of thinking, but I'm not sure I love it. I wonder how
well "sometimes we break things" scales with users of smaller services. I
guess the flip side is that "we often roll out cool new things" definitely is
desirable to users of small services.

~~~
nbm
As long as you're not in the spacefaring, automotive, banking/insurance, or
medical industries, it's probably acceptable to have some downtime and bugs -
nobody will die or have their livelihood destroyed by it.

Given this, your confidence threshold for a release isn't "approaching 100%";
it's hitting some "good enough" value, where the work needed to test for the
next 1% would double the testing you're doing now and is "not worth it". As
you burn through some sort of error/downtime budget, you'll adjust that level
of confidence - as you have more problems, you spend more time responding to
them.

Continuous deployment's upside is confidence in the release process (since you
do it so often), and some assurance that you'll be able to find the problem
reasonably fast (since you only have to look through a smaller number of
changes). You'll have fewer big problems, and more small problems. There
definitely are cases where ten smaller downtimes of five minutes each are
worse than one larger downtime of an hour, but usually it's better to have the
former.

------
yeukhon
What are the best CD practices for infrastructure? Especially when you have to
deal with commits which only need to be in one environment, or commits which
need to be in all environments?

~~~
txutxu
Not sure if I understand the question, but for infra, I would say: version it.

Switches have config files or firmware dumps; the same goes for BIOS and RAID
BIOS, for documentation of the infra and connections, and so on.

Infra will evolve, and so will the "version".

While in the "test" stage, it's the "next version" infra; in production, the
architecture, firmware, connections, and configuration run a tested "version".

It's not easy to integrate/automate infra from different vendors, but it can
be done. Been there, done that.

~~~
yeukhon
Hi, nbm answered first so I responded to his comment - please see below.
Versioning is not as simple as I thought :( It kind of works well IF we are
talking about versioning containers or versioning images.

------
brazzledazzle
The post refers to a system called "Landcastle". Is that an internal project
or an internal name for an existing open source project?

~~~
nbm
It's an internal project. It basically takes a "diff" from Phabricator, and
goes through the process of applying it to master and ensuring that all unit
and integration tests pass before committing it.

This removes the need for every developer to go through the rebase, run tests,
attempt to land, discover a conflict, rebase, run tests, ... loop.
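
Conceptually it's something like this loop, run by a service instead of by
each developer (the commands and the pytest runner here are placeholders -
Landcastle itself is internal and works off Phabricator diffs):

```python
# Sketch of an automated "land" step: rebase onto fresh master, apply the
# reviewed diff, run the test suite, and only push if everything passed.
import subprocess

def run(*cmd):
    subprocess.check_call(list(cmd))  # raises if any step fails

def land(diff_path, message):
    run("git", "checkout", "master")
    run("git", "pull", "--rebase")            # start from fresh master
    run("git", "apply", "--3way", diff_path)  # apply the reviewed diff
    run("pytest")                             # unit + integration tests must pass
    run("git", "add", "-A")
    run("git", "commit", "-m", message)
    run("git", "push", "origin", "master")    # only reached if everything above passed
```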

------
hitsurume
Kind of confused - is this setup for their production deployments only? Does
Instagram use any staging environments that would catch a lot of their issues
ahead of time, such as failed test builds, bad commits, etc.? Are their
developers allowed to commit directly to master, or do they go through a
formal pull-request process that gets signed off by someone?

~~~
nbm
As mentioned in the article itself, Instagram uses code review (using
Phabricator), and also uses automated continuous integration for running tests
on each "diff", before it allows the change to be landed to master.

~~~
UK-AL
Depends what automated tests you run. You could catch things that are
technically wrong, but what about things like a button being in the wrong
place?

That's not going to spew errors.

~~~
nbm
This class of errors is sometimes fairly easy to catch with canaries.

If a button is obscured or inactive/non-functional for some reason, then
chances are some metric is going to be statistically different enough to stand
out while in the canary phase.

For more manual canaries, this same approach can be used for metrics like
memory usage, latency, number of upstream/database connections, and so forth.
Of course, that could be the _purpose_ of the change, which is why it will
likely be checked with manual canaries (i.e., not the canaries used for the
continuous deployment process).

------
shockzzz
Is their test suite just unit tests, or does it include user-acceptance
testing?

------
smaili
Great writeup! Would love to have seen more details on actually getting the
newly shipped code recognized by the app servers gracefully, i.e., without
terminating any pending requests.

------
mark242
"The test suite needs to be fast. It needs to have decent coverage, but
doesn't necessarily have to be perfect."

Holy hell, what a telling statement that is. I get not unit testing for 1 ==
1, but come on, unit and integration tests for, say, user login should be
_difficult_, not _fast_. There are some test suites that actually do need to
be perfect, unless Instagram thinks that e.g. OWASP isn't "decent coverage".

~~~
467568985476
The probably use an existing, well tested tool for user authentication (why
reinvent the wheel?), or have comprehensive coverage for it already. There's
no reason for tests for logging in to be slow either way. And after all, it's
a social media app, it doesn't have to run perfectly all the time.

