Continuous Deployment at Instagram (engineering.instagram.com)
239 points by nbm on Apr 12, 2016 | 91 comments

Shameless Plug: I've recently been involved in writing a book on Continuous Deployment, which covers many of the points Instagram are writing about here (but in greater detail).

I've got ~1,000 printed copies to give away. So if anyone wants one, go here: http://madete.ch/1S3OGvl and follow the link on the left hand side and we'll mail a copy to you.

Filled out the form. Sending out free physical books seems like a lot though - what's the motivation behind that?

It was originally a content marketing experiment. We thought we'd rather have 100 people receive a free printed copy of the book and love us for it than have 1,000 people receive ebooks that end up in their trash.

The ROI from the 100 trial worked well, so we decided to try and scale it out to more people. TBD whether we'll see a good ROI from 1000 people, but we hope people enjoy it and like us for sending them a free book!

I ordered a copy from Buenos Aires, Argentina. Will you ship it? :)

I was going to order from Nicaragua, but decided to use a U.S. address after all

Hey, I'm currently also writing a book on automating deployments (https://deploybook.com), and if you're interested, I'd like to chat about your experience with writing and publishing this book.

Please drop me a line at moritz.lenz@gmail.com. Also, looking forward to reading your book :-)

Great, will do + you can get me on rory_at_madetech_com

Filled out the form. Definitely excited to get into a CD environment, rather than the 1-2 deploys/day that I've been exposed to in the past!

CD doesn't have to mean 'push every green build to prod'. It's more about the ability to push new functionality when asked by the business, than the fact of always pushing it by default. You may be doing CD well already, knowing just what you have said.

My preconception is that what you said is exactly right, but the idea that one maybe should push every green build to prod -- and the fact that a lot of big-scale companies are doing that -- is really intriguing.

Oops, I meant CD as in continuous delivery, but that is not what the article is discussing. Sorry! :)

Great service! Just for your information, the "Thanks for signing up!"-message[1] showed up twice when I ordered the book.

[1]"Thanks for signing up! Your book is on its way, you should receive it within a few working days."

Very interesting! I have signed up for a copy. Thank you

Do you ship outside the United States?

We're based in the UK and have shipped plenty overseas so far. I'm sure we'll be able to get something to you, if you're not in too remote a location.

Strange, I'm not from the US, (I'm from Serbia) and it says I can't even buy the book on Amazon (I was hoping for the kindle edition). I did apply for the hard copy though.

Can't seem to find the link on mobile safari, care to share in a reply? (I'm in SF)

Do you ship epub?

What are the best practices for database migrations when trying to set up continuous deployment? Are there any existing tools/solutions that solve/simplify the problem? This is the issue that is almost always missing in articles/tutorials about CD.

It's hard. The general rule of thumb is that you make database changes independent of code changes, and you always make the changes in such a way that they are compatible with the current deployed code AND with the next version that you plan to roll out.

For example: let's say you have a feature that uses a column, but you want to move it to using a separate table.

Step 1: design the new table

Step 2: deploy the new table - existing code continues to run against the column

Step 3: run a back-fill to ensure the new table has all of the data that exists in the current column

Step 4: deploy code to use the new table instead of the column

The above is the best-case scenario - but it often doesn't work like that, because you need to ensure that data added to the old column between steps 3 and 4 is correctly mirrored across. One approach here is to deploy code that dual-writes - that writes to both the old column and the new table - along with extra processes to sanity-check the conversion.

GitHub's Scientist library is a smart approach to these more complex kinds of migration - which can take months to fully deploy. https://github.com/github/scientist and http://githubengineering.com/scientist/

Obviously your mileage will vary, but I've also found that something that augments the above approach is to separate reads from writes in your app. Read from views instead of directly from the tables. You may also choose to write using sprocs and/or explicit 'writer' types/functions in your app code.

This can help in the migration process because you can version the views and sprocs whilst running more than one version at a time. Essentially you're creating a versioned API for your db.

Of course you now have more db objects to manage (boo, hiss, more moving parts) but it also encourages you down a saner path of versioning your db objects and rationalising your persistence somewhat (Do we really need 3NF? If I update this table in isolation..I jeopardise consistency of this entity etc.)

None of this _solves_ anything but I've found it mitigates a hard problem.
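
A small sketch of the "versioned views as a db API" idea, using Python and sqlite3. The table and view names (person, person_v1, person_v2) are hypothetical. Both views sit over the same table, so currently deployed code can keep reading v1 while the next release reads v2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, first TEXT, last TEXT)"
)
conn.execute("INSERT INTO person VALUES (1, 'Ada', 'Lovelace')")

# v1 of the read API: the shape the currently deployed code expects.
conn.execute(
    "CREATE VIEW person_v1 AS "
    "SELECT id, first || ' ' || last AS name FROM person"
)
# v2 of the read API: the shape the next release expects.
conn.execute(
    "CREATE VIEW person_v2 AS "
    "SELECT id, first AS first_name, last AS last_name FROM person"
)


def read_name_v1(conn, person_id):
    # Old code path: a single combined name.
    row = conn.execute(
        "SELECT name FROM person_v1 WHERE id = ?", (person_id,)
    ).fetchone()
    return row[0]


def read_name_v2(conn, person_id):
    # New code path: (first_name, last_name) tuple.
    return conn.execute(
        "SELECT first_name, last_name FROM person_v2 WHERE id = ?",
        (person_id,),
    ).fetchone()
```

The underlying table can then change shape as long as both views are kept returning what their readers expect, and person_v1 is dropped once nothing reads it.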

Beyond a certain size (basically, once the data is large enough that the migration takes too long), migrations are a heavy investment - in elapsed time, I/O, and so forth. As such, they are planned to a degree that CD probably isn't the solution for them (for example, you probably can only have one migration in flight at a time).

They aren't done live as a single big process that has the potential to lock all queries/updates for its duration, but rather as a set of smaller steps that don't lock.
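
As a rough illustration of "a set of smaller steps", here is a sketch in Python with sqlite3 that fills a new column in small batches, one short transaction per batch, instead of one giant UPDATE. The items table, its columns, and the pause_seconds knob are assumptions for the example, and real tools (such as the online schema change tools mentioned in this thread) do considerably more than this.

```python
import sqlite3
import time


def migrate_in_batches(conn, batch_size=1000, pause_seconds=0.0):
    """Fill items.name_lower from items.name, batch by batch.

    Returns the number of batches that were run.
    """
    batches = 0
    last_id = 0
    while True:
        # Find the next slice of primary keys after the last one handled.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM items WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, batch_size),
            )
        ]
        if not ids:
            return batches
        with conn:  # commits (or rolls back) this one small batch
            conn.execute(
                "UPDATE items SET name_lower = lower(name) "
                "WHERE id > ? AND id <= ?",
                (last_id, ids[-1]),
            )
        last_id = ids[-1]
        batches += 1
        time.sleep(pause_seconds)  # crude throttle to limit I/O pressure
```

Because each batch is keyed on the primary key, the job can be stopped and restarted, and the pause between batches gives a simple way to keep load under control.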

Facebook has spoken about its online schema change process before - https://www.facebook.com/notes/mysql-at-facebook/online-sche... and its follow-up at https://www.facebook.com/notes/mysql-at-facebook/online-sche... for example, and I'm sure elsewhere.

Most people using MySQL would potentially first use something like https://www.percona.com/doc/percona-toolkit/2.1/pt-online-sc... instead of trying to create their own.

The same principles apply to other data stores that have more rigid schemas.

MySQL has had online, non-blocking schema changes since version 5.6. But the underlying data file has to be upgraded to the latest format version first for that to work, and to do that in a non-blocking way on a master server you are probably best off running the Percona Toolkit one first. ALTER TABLE ... FORCE does a data file rewrite.

That's a big "yeah but".

You can't rate limit the I/O. That's huge, and the Percona tool and LHM offer ways to keep I/O under control. Along with that, you have to pay attention to the live schema update matrix in the docs; many things will upgrade without locking but require copying the table, and this will blast your I/O.

In addition, even with the tools, some changes require exclusive table locks. The lock is only needed for a short period of time, but if it can't be acquired because of a long-running transaction, a queue of transactions, etc., it can block everything up.

That's not the case for Instagram: they use PostgreSQL, so they can update tables without any downtime, and it is fast. The only problem is migrations that copy data, or that do more than just add or remove fields.

I work in a company that does several deployments a day, and has a giant database. The short answer is you design around it. If you can do something without majorly changing your data you do it. Another common thing is to deploy the code, but then to add a "feature toggle" so you just turn the code on when the data is ready. Basically, figure out how you can refuel while in the air.
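
A feature toggle in its simplest form might look like the sketch below (Python). The FLAGS dict, the flag name, and the user fields are all hypothetical. In practice the flag store is usually a database row or a config service that can be flipped without a deploy.

```python
# Hypothetical in-process flag store; real systems typically read flags
# from a database or config service so they can change at runtime.
FLAGS = {"use_new_email_table": False}


def is_enabled(flag, flags=FLAGS):
    # Unknown flags default to off, so new code ships dark.
    return flags.get(flag, False)


def load_email(user, flags=FLAGS):
    # Both code paths are deployed; the flag decides which one runs.
    if is_enabled("use_new_email_table", flags):
        return user["email_from_new_table"]
    return user["email_from_old_column"]
```

The new path is deployed but inactive until the data is ready, and turning it off again is a flag flip rather than a rollback.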

I did a gig where the databases were HUGE. Changing the schema was always avoided unless it was deemed that it would pay for itself, i.e. whether the accompanying feature warranted a change, and to what degree. To reduce entropy, a full-time DBA would tweak indexes and work with each component team to release db refactorings.

I don't think this is an approach I would recommend, but if your gig is a giant enterprisey shop that manages risk far too conservatively, it's a reasonable half-measure.

CD is an aspiration for a lot of shops and getting there is a cautious tale of lots of small victories and earning the trust of the decision makers. Sometimes that means:

> designing around it

> The short answer is you design around it.

Sorry but that seems like a pretty crap answer.

The answer to "what tools allow you to manage database migrations with CD" should not be "don't do database migrations with CD" or "roll your own toggling features."

I apologize for the disappointing answer, but continuously deploying in the sense of constantly pushing code to production multiple times a day isn't just a feature you add to your stack. It's not just another tool. It requires an engineering culture that is focused on it. We have testing and automated tools for sure, but we have A LOT more tools for logging and robustly handling errors. When a bad commit is pushed into production, I can pull up a graph that shows me the number of errors, and the full readout of every occurrence of it. I can also revert that code very quickly. We also have a team that does nothing but monitor the code and the deployments. You can probably find a bunch of tools... but my point is that it's something we're thinking about from the design stage... because from design to production it might just be a few days. Major features will get pushed in stages; it's a very different way of doing things (at least for me). Before I was here, deploying was something I did maybe once every few months. The code had to be absolutely perfect, because it would be a while before I could get a chance to fix it. Here, I deploy while I drink my morning coffee.

If you went back to the previous job, would you push toward more frequent deployments, or is there something that makes the two situations different?

If I was starting a new company I would without a doubt push towards doing frequent deployments all day long.

That said, I would never push for it at my previous company. At this company it's something they've been doing since day 1. It's always been built into our process, culture, and hiring practices. At my previous company the "agile" process was just a shortened waterfall process. Though many of the engineers were very talented and responsible, there were more than a handful that I would never trust to "self test" their code, or to monitor it as it deploys.

I definitely like that the more I think about it. We try to do deployments every couple months, but changes are so big and risky we end up pushing off deployments to the next release just because nobody wants to risk it.

How is lack of extant tooling swalsh's fault? Would you like to write some?

I've always done two-phase migrations.

Phase 1: Upgrade schema for new code. Migrate initial data from old schema to new schema.

[Deploy: New code starts taking requests, writing to new schema. Old code is drained from the pool of handlers, continues to write to the old schema. Once old code is drained from the pool and the new code is validated by production traffic, run Phase 2.]

Phase 2: Catch-up migration of old data to new schema. Drop old schema.

I used Liquibase for migrations - change-sets can be tagged with contexts[1] and when you run the migration you can specify contextual information that each change-set can target (e.g. development AND pre-deployment). The principal tags I used were pre-deployment and post-deployment (which map to Phase 1 and Phase 2 above).

Schema migrations were a little harder to write but it meant that we could migrate live without impact to customers.

[1]: http://www.liquibase.org/documentation/contexts.html
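
The pre-deployment/post-deployment split can be sketched roughly as below. This is plain Python with sqlite3, not Liquibase syntax. The change-set ids, the old_emails/user_emails tables, and the applied bookkeeping table are invented for the example. Each change-set carries a context tag, and the runner applies only the ones matching the phase being run, skipping anything already recorded as applied.

```python
import sqlite3

# (id, context, sql) - contexts mirror Phase 1 and Phase 2 above.
CHANGESETS = [
    ("001-create-user-emails", "pre-deployment",
     "CREATE TABLE user_emails (user_id INTEGER PRIMARY KEY, email TEXT)"),
    ("002-initial-copy", "pre-deployment",
     "INSERT INTO user_emails SELECT user_id, email FROM old_emails"),
    ("003-catch-up", "post-deployment",
     "INSERT OR IGNORE INTO user_emails SELECT user_id, email FROM old_emails"),
    ("004-drop-old-schema", "post-deployment",
     "DROP TABLE old_emails"),
]


def run(conn, context):
    """Apply, in order, the not-yet-applied change-sets for one context."""
    conn.execute("CREATE TABLE IF NOT EXISTS applied (id TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT id FROM applied")}
    ran = []
    for cs_id, cs_context, sql in CHANGESETS:
        if cs_context != context or cs_id in done:
            continue
        conn.execute(sql)
        conn.execute("INSERT INTO applied (id) VALUES (?)", (cs_id,))
        ran.append(cs_id)
    conn.commit()
    return ran
```

Usage would be run(conn, "pre-deployment") before the new code goes out, then run(conn, "post-deployment") once the old code has drained and the new code has been validated.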

I've used Flyway for db migrations and it works well.


Anyone else use Flyway? Looks compelling for anyone not using RoR/Active Record (where migrations are out of box).

We also use it on a multi (PostgreSQL) database setup where every customer has its own database. Works really well; we have never had any issues with migrations and do them frequently (maybe once a week).

We use it for a fairly large Postgres database (growing 4-5GB a day) and haven't had any big issues so far. We're currently looking at a sort of catch-22 issue where the application running Flyway also reads information from the database to initialize some caches, and that data is inserted and maintained by another application. This works fine on an already initialized database, but not on the automated testing server, where the data is wiped.

We use Flyway in a number of ways. Scripted as part of some deployment processes, included with Spring Boot applications for a RoR style migration on boot, and actually use it in Java code to manage other databases from our application during runtime.

An interesting way to use it is that you can have multiple applications that may use the same schema but are responsible for their own tables: you can specify a different schema metadata table for each application and they can generally live together fairly happily. We also have an application that 'deploys' data to target databases, so we are able to run Flyway via their Java library to make sure the target schemas are correct before running our statements against them.

Prior to Flyway I used Liquibase, which is also pretty powerful, but Flyway has just been so much more versatile, and the Spring Boot 'auto' integration has been awesome.

A somewhat similar question: what do you do with frontend assets? If you have JS/CSS/image files with a cache-busting URL, and you have different versions on different servers, you can run into problems.

For example, say most of your web servers are running version 1 and you're in the middle of rolling out version 2. A page request goes to one of the web servers with version 2, which returns HTML containing links to assets with the version 2 cache-busting URLs. The browser requests those assets, perhaps through the CDN, but the load balancer sends the requests to a web server still running version 1, where the assets don't exist yet. This means the browser will get a 404 error.

What are the best options for dealing with this problem?

Not sure that this is the best answer, but seems like a good use of feature flags. If the flag points to the old state, serve the old assets. If it points to the new state, serve the new assets. So make a deploy with both assets, deploy everywhere, then flip the flag.

A bit unsatisfying, because it probably isn't always easy to include both assets.

Another idea (not sure if a good one) would be having the load balancers pin a certain session to a certain backend machine. Seems like this would make it better without fixing it, though: that session will still need to switch to a different set of assets when "their" server is deployed.

The underlying problem is the use of a naive cache buster, which causes the old server to reply one way and the new one another, for both the people who want the old asset and the people who want the new asset.

If instead of being something the web server throws away before replying, the version number actually caused different assets to be returned, then you'd not have this problem.

One pattern is to separate out assets into a different package that is deployed to a separate host group and have your clients request a different host name, or have your load balancers use a path match to use those servers for those requests.

Another is to push asset updates first to all hosts. All hosts, even without code update, will now be able to respond for the new assets.

Another is to use a local cache plus some backend service or database to serve the assets from the web servers - again, all hosts will now respond correctly for the old assets and the new.
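
A sketch of the "version number actually selects a different asset" idea, in Python. The AssetStore class is a stand-in for whatever shared asset host, bucket, or backend service is used. It and hashed_name are hypothetical names. The file name embeds a hash of the content, and publishing is additive, so the old and new versions can both be served during a rollout.

```python
import hashlib


def hashed_name(filename, content):
    # Derive the URL name from the bytes themselves, e.g. app.<hash>.js
    digest = hashlib.sha256(content).hexdigest()[:12]
    stem, dot, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{filename}.{digest}"


class AssetStore:
    """Stand-in for a shared asset host that every web server reads from."""

    def __init__(self):
        self._files = {}

    def publish(self, filename, content):
        # Additive: publishing a new version never removes an old one.
        name = hashed_name(filename, content)
        self._files[name] = content
        return name

    def get(self, name):
        # None here would correspond to a 404.
        return self._files.get(name)
```

If assets are published before any web server starts emitting the new URLs, a request for either version resolves regardless of which server or release generated the page.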

We use Alembic, a sort of Python DSL, for migrations. We can go forward or backwards. Any schema change needs an Alembic script for both directions. We generally try to do 'mini-migrations' that go with each change, as opposed to big bang migrations. In the past, building stock exchanges, it was generally big bang with a mass of SQL that could only run once. Things are better these days. :)

A lot of it depends on the particulars of your database system. But there are certainly tools/solutions that exist, and essentially they all boil down to the same pattern: migration scripts should be executed exactly once in order. How this is done varies.
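
The "exactly once, in order" pattern can be shown in a few lines. This sketch uses Python with sqlite3 and SQLite's PRAGMA user_version as the bookkeeping. The MIGRATIONS list is made up for the example, and most tools use a dedicated metadata table rather than a pragma.

```python
import sqlite3

# Hypothetical migration scripts, identified by an increasing version.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
    (3, "CREATE INDEX idx_users_email ON users (email)"),
]


def migrate(conn):
    """Run every migration newer than the stored version, in order."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    applied = []
    for version, sql in sorted(MIGRATIONS):
        if version <= current:
            continue  # already ran on this database
        conn.execute(sql)
        # Record progress so a re-run skips this script.
        conn.execute(f"PRAGMA user_version = {int(version)}")
        conn.commit()
        applied.append(version)
    return applied
```

Running migrate() on every deploy is then safe: a database that is already up to date is left untouched.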

I do a lot in the .NET/SQL Server world, and my tool of choice is one that I wrote: http://josephdaigle.me/2016/04/03/introducing-horton.html. Conceptually, what this tool does could work for any RDBMS.

Similar tools for the .NET world include Roundhouse and fluentmigrator.net. I'm personally more of a fan of the latter, but we use both in our company in conjunction with Octopus Deploy. Roundhouse needs a bit too much Powershell for my liking.

Octopus Deploy also has a small tool, DBUp, which can run migration scripts. I've been using it for years with no issues.


Irrespective of language / platform does anyone ever write 'down' migrations? Do you ever use them?

I've always found they don't pay for themselves in terms of testing and likelihood of being required.

Alternatives are snapshots (depending on whether you can afford downtime) and simply writing recovery scripts that aren't tied to individual migrations but rather to a deployment, i.e. run a script before running any migrations, and if it goes wrong run a contra script to back out, as opposed to a cleanup script.

Also seconding the confusion that other commenters have regarding the "three commits max" rule for automated deploys. Maybe engineers at Facebook are just big fans of rebasing, but I often make commits on feature branches that don't "stand on their own" - i.e., would break some functionality without subsequent commits. I'm not sure why you'd want to deploy one-commit-at-a-time unless you kept a very strict "one commit == one standalone feature/bugfix" rule, which isn't mentioned in this post.

(I suppose it's also possible that that's referring specifically to merge commits into master, which would make a lot more sense to me)

I'm not sure about Instagram, but Facebook is a fan of rebasing in general. Nothing should ever appear as a commit in master that isn't something that should be used in production - ie, should never intentionally be broken in isolation.

In general, feature branches are relatively very short-lived, and will be code reviewed, rebased and landed as a single commit onto master.

Features are often feature flagged off anyway, so it is acceptable to commit partially-functional features to master while that feature is flagged away.

There is a concept of stacked commits, but each commit in the stack needs to be a working step towards the end goal, and as such can (and will) be landed in isolation as they are code reviewed.

> Nothing should ever appear as a commit in master that isn't something that should be used in production - ie, should never intentionally be broken in isolation.

I don't understand why people do it any other way.

Some people Ctrl+S and commit every few files, to keep from pushing changes to 10+ files in a single commit.

The biggest reason IMO is git bisect which is mostly broken without having each commit be "on its own".
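
To see why bisecting depends on every commit standing on its own, here is a sketch in Python of the binary search that git bisect performs conceptually. The function name and the is_bad callback are hypothetical. It assumes the commits are ordered oldest to newest, that the newest one is bad, and that once a commit is bad all later ones are too.

```python
def first_bad_commit(commits, is_bad):
    """Return (first bad commit, number of commits tested)."""
    lo, hi = 0, len(commits) - 1  # the first bad commit is in [lo, hi]
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid      # culprit is at mid or earlier
        else:
            lo = mid + 1  # culprit is after mid
    return commits[lo], tests
```

With 50 commits this needs at most 6 test runs. But if a commit in the middle is broken for unrelated reasons (because it only works together with a later commit), is_bad gives a misleading answer there and the search can converge on the wrong commit.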

If they're doing up to 50 commits a day and deploy all commits to master automatically how does that line up with "It makes it much easier to identify bad commits. Instead of having to dig through tens or hundreds of commits to find the cause of a new error, the pool is narrowed down to one, or at most two or three"?

If you do a commit and find out in the middle of the day the latest deploy is having problems and people are still committing in new code wouldn't this make things much harder to narrow down?

Assuming they're spread out over ~10-12 hours (some crazy morning people, some crazy nocturnal people), that's only ~4-6 commits per hour.

Most problems will be discovered by someone and reported in an hour, and most of those will also be discoverable in a dataset on a system like Scuba - https://www.facebook.com/notes/facebook-engineering/under-th... - and you can identify the first time that particular issue happened.

If you're lucky, it lines up exactly to a commit landing, and you only need to look at that. Otherwise, due to sampling, maybe you need to look at two or three commits before your first report/dataset hit. You can also use some intuition to look at which of the last n commits are the likely cause. A URL generation issue? Probably the commit in the URL generation code. You'd do the same thing with a larger bundled rollout, but over a larger number of commits (50, in the case of a daily push).
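
Lining up the first occurrence of an error with the deploy log could look like the sketch below (Python). The suspect_deploys function and the (timestamp, commit) tuple format are assumptions for illustration. It returns the last few deploys that landed at or before the first sighting of the error, newest first.

```python
from bisect import bisect_right


def suspect_deploys(deploys, first_error_at, lookback=3):
    """deploys: list of (timestamp, commit), sorted by timestamp.

    Returns up to `lookback` commits deployed at or before the first
    error, most recent first.
    """
    times = [t for t, _ in deploys]
    # Index just past the last deploy that happened by first_error_at.
    idx = bisect_right(times, first_error_at)
    window = deploys[max(0, idx - lookback):idx]
    return [commit for _, commit in reversed(window)]
```

For example, with deploys at t = 0, 10, 20, 30, 40 and a first error at t = 33, the suspects are the commits deployed at 30, 20, and 10, in that order.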

Why roll out to all servers at once? What if each new commit went out to 1% of servers, then spread? At most you'd need to roll back a small % of servers, as you'd clearly see which servers are misbehaving.
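
One way such a percentage rollout could pick its servers, as a sketch in Python: hash the commit and server id into a bucket from 0 to 99 and include the server if its bucket is below the current percentage. The function name and the id formats are hypothetical. Because the bucket is stable, raising the percentage only ever adds servers to the set.

```python
import hashlib


def in_rollout(server_id, commit, percent):
    """True if this server should run `commit` at the given percentage."""
    digest = hashlib.sha256(f"{commit}:{server_id}".encode()).digest()
    # Stable bucket in [0, 100) for this (commit, server) pair.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Salting the hash with the commit means a different subset of servers goes first for each commit, so the same machines are not always the guinea pigs.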

I've always found that the biggest hurdle with CD is never the tech. It's overcoming fear and tweaking culture.

Nevertheless it's always great to read how others accomplish their goals and even better that they're willing to share the journey.

Personally I find it incredibly frustrating to see code that I write not ship for sometimes weeks or even months at some clients. It's a slow process but we'll get there..

Looks like a lot of schema migration talk here. Out of curiosity, does anyone have production experience with lazy migrations for serialized data? Where your model migrations exist as code: an array of functions that convert one version of the model schema to the next. The schema version is encoded into the data. The migrations are lazy because the model is fast-forwarded to its latest version at the last possible moment, when the code reads the serialized model. I know Braintree does this with Riak. Anyone else?
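
For anyone unfamiliar with the pattern, a minimal sketch in Python of what is being described (not a claim about how any particular company implements it). The field names and the two migration functions are invented for the example. Each function upgrades a document by one schema version, and load() applies whichever ones are still needed at read time.

```python
import json


def v1_to_v2(doc):
    # v2 splits the single "name" field into first and last name.
    first, _, last = doc.pop("name").partition(" ")
    doc["first_name"], doc["last_name"] = first, last
    return doc


def v2_to_v3(doc):
    # v3 adds a list of emails, empty by default.
    doc.setdefault("emails", [])
    return doc


# MIGRATIONS[i] upgrades schema version i + 1 to version i + 2.
MIGRATIONS = [v1_to_v2, v2_to_v3]
LATEST = len(MIGRATIONS) + 1


def load(raw):
    """Deserialize and fast-forward a document to the latest schema."""
    doc = json.loads(raw)
    version = doc.get("schema_version", 1)
    for step in MIGRATIONS[version - 1:]:
        doc = step(doc)
    doc["schema_version"] = LATEST
    return doc
```

Old documents are never rewritten in bulk. They get upgraded in memory when read, and can be written back at the latest version whenever they are next saved.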

I work for Basho. Would love to hear more about that use case. Regardless, unstructured data with encoded schema/explicit versioning ftw!

Regarding canary releases and detecting errors, one aspect that is sometimes overlooked is the possibility for bugs on the client side. At work, we have a fairly large js centric app with a non-negligible amount of bugs pure client side. While tracking http status codes on the backend is fairly straightforward, we find it much harder to get the same type of information from the frontend. Would love to hear if anyone has experience in that area.

Look into client-side error tracking. We're using Sentry where I work.

It will catch any uncaught exceptions, group them together and normalize the stack traces (because of course they look different across browsers). If you tag the errors by their release, you can see if your release introduced any regressions.

We're not using it in any automated way, though, because there's so much noise. Any time a phone happens to run out of memory or some extension crashes, you'll get an error report.

Of course, it also doesn't cover non-crashing regressions, where you may have incorrect behavior rather than crashes. Those are much harder to catch, unless your integration tests are incredibly granular.

We do something very similar at JotForm. The hardest part has always been the "keeping it fast" part. The deployment should take less than a minute; otherwise developers spend too much time waiting to see if the commit went live without problems.

The solution is usually to run things in parallel. Examples: Run tests in parallel. Keep things in separate repos and push them into separate folders in the app servers.

Almost all CD cases I've seen talk about canary or black/white(red) releases - a case where there is a fleet of servers.

How do you do that on a much smaller scale, e.g. when I have only 3 servers available? If I deploy to one of them, potentially 1/3 of customers might get a broken version.

This is actually a problem whether you're using continuous deployment or not. You need to put a release out, which you have done some testing for, but you know that only real production traffic can shake out certain types of problems.

You don't have to have all three servers getting the same amount of traffic, and you don't have to have a single copy of your service on each server. So, you could reduce the weight of a single server that does canary traffic to reduce the pain, or you could run two copies of your service on a server, and have the canary copy get a trickle of traffic.
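
The "reduce the weight of the canary" idea in code form, as a sketch in Python. The backend names and weights are made up, and in practice this logic lives in the load balancer's configuration rather than in application code.

```python
import random


def pick_backend(weights, rng=random):
    """weights: dict of backend name -> relative weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical pool: two normal copies plus a canary that gets a trickle.
WEIGHTS = {"web1": 47.5, "web2": 47.5, "canary": 5}
```

With these weights the canary sees roughly 5% of requests, so a broken release affects far fewer than a third of customers even with only a handful of servers.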

Another approach is to use shadow traffic - instead of handling the requests on the canary host, you handle it on the production host _and_ the canary host. You'd need to ensure the canary can't adjust the production database, for example - or maybe you only shadow read requests. If you don't get any errors, or you're able to prove to yourself that they function the same, you can then move to a more traditional canary.
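
A sketch of shadowing in Python, under the assumption that requests are plain dicts and that production and canary are callables returning a comparable response (all hypothetical). Only read requests are mirrored, the caller always gets the production response, and differences or canary exceptions are just recorded.

```python
def handle_with_shadow(request, production, canary, mismatches):
    """Serve from production; mirror read requests to the canary."""
    response = production(request)
    if request.get("method") == "GET":  # never shadow writes
        try:
            shadow = canary(request)
            if shadow != response:
                mismatches.append((request, response, shadow))
        except Exception as exc:
            # A crashing canary must not affect the real response.
            mismatches.append((request, response, exc))
    return response
```

In a real setup the shadow call would be made asynchronously so the canary cannot add latency, but the comparison logic is the same.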

You definitely need to adjust your continuous deployment implementation to your environment, whatever it is.

Could you add a 4th (maybe even smaller) server, and configure your load balancer to route only 5% of traffic to it? We have the same problem: we have enough continuous traffic to implement a system like this, but we only run on a small number of production servers. We've recently added a canary server into the pool, which does increase costs a little, but we made it a small instance, and route only a small amount of traffic to it.

If you're paying by the 10 minute increment or less, you just have a 4th server for a few minutes. Something like:

1) Add 4th server at new version

2) Drop 1 old version

3) Launch 1 more new version

4) Drop 1 old version

5) Launch 1 more new version

6) Drop 1 old version

This way you never have less than 3 servers serving requests, but you never pay for more than 4. Should only cost a few cents for most cloud providers for this temp server.
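
The sequence above can be simulated in a few lines of Python to check the "never fewer than 3, never more than 4" claim. The function and server names are hypothetical, and launching or dropping a server is reduced to list operations.

```python
def rolling_replace(old_servers, make_new):
    """Replace servers one at a time, launching before dropping.

    Returns (final pool, pool size after every launch/drop step).
    """
    pool = list(old_servers)
    history = [len(pool)]
    for old in old_servers:
        pool.append(make_new())  # launch one new-version server first
        history.append(len(pool))
        pool.remove(old)         # then drop one old-version server
        history.append(len(pool))
    return pool, history
```

Starting from three old servers, the pool size goes 3, 4, 3, 4, 3, 4, 3 and ends with three new-version servers.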

What is the purpose of the backlog of deploys?

For example, let's say 50 commits all land on master within the same second. Why break those into many deployments stretched across hours instead of deploying them all in the next event?

If you landed a bad commit in the middle of that 50, it seems like it might not be immediately obvious once it was deployed that it was bad - and then 5 or 30 minutes later another commit is deployed on top of it.

You might not notice a problem until hours after all of the commits have been deployed, which leaves you in the same situation as if you had deployed all 50 changes in one event, but in this model those 50 commits have been stretched over a much longer period of time between commit and liveness to users.

By splitting them into multiple deploys it makes it easier to identify bad commits when an error starts. We can usually correlate the start of an error with a specific deploy, and instead of digging through 50 commits to find the cause we only have to look at a few.

"Expect bad deploys: Bad changes will get out, but that's okay. You just need to detect this quickly, and be able to roll back quickly."

That's an amazing statement to me. I've always worked in smaller environments where we roll up many changes and try to deploy them perfectly. The penalty for bad changes has been high. This is a really new way of thinking.

It's an exciting way of thinking, but I'm not sure I love it. I wonder how well "sometimes we break things" scales with users of smaller services. I guess the flip side is that "we often roll out cool new things" definitely is desirable to users of small services.

As long as you're not in the spacefaring, automotive, banking/insurance, or medical industries, it's probably the case that it's acceptable to have some downtime and bugs - nobody will die or have their livelihood destroyed by it.

Given this, your confidence threshold for a release is not approaching 100%, it's hitting some "good enough" value, where the work you're doing to test for the next 1% is 2x of the testing you're doing now and is "not worth it". As you burn through some sort of error/downtime budget, you'll adjust that level of confidence - as you have more problems, and take more time with responding to problems.

Continuous deployment's upside is a confidence in the release process (since you do it so often), and some assurance that you'll be able to find the problem reasonably fast (since you only have to look through a smaller number of changes). You'll have fewer bigger problems, and more smaller problems. There definitely are cases where 10 smaller downtimes of 5 minutes is worse than 1 larger downtime of one hour, but usually it's better to have the former.

The point here is that bad changes get out no matter how often or rarely you deploy. Everywhere has deployed buggy code. Doing rapid deploys simply decreases the amount of time it takes to recover from that.

> I wonder how well "sometimes we break things" scales with users of smaller services

Writing "business software", I have noticed that this doesn't scale at all. I mean, when you have a couple of thousand people depending on the software for work, bugs are really not tolerated that well.

It's probably different if you have hundreds of servers and can detect bugs on a deployment on one of them, so it only affects a small percentage of users and then you can roll back and try again. But if you have a single installation and you break that all the time with your commits, then it probably doesn't work so well. And for the majority of software you really do not need "webscale" installations with millions of "heroku boxen" or droplets etc... Sure, have some for redundancy, but it really doesn't help with this "deploy master on each commit" type of deal.

That depends on the service. If you can afford outages that may be fair game. But if you have a high traffic service that's running hundreds or thousands of hosts, you can't take them all offline at once. Deploys can take hours, so can rollbacks. In that situation with high SLA requirements you can't really "expect" bad deploys.

What are the best CD practices for infrastructure? Especially when you have to deal with commits which only need to be in one environment, or commits which need to be in all environments?

Not sure if I understand the question, but for infra, I could say: version it.

Switches have config files or firmware dumps, the same goes for bios and raid bios, for documentation in the infra and connections, etc...

Infra will evolve, and so will the "version".

While in the "test" stage, it's the "next version" of the infra; in production, the architecture, firmware, connections and configuration run a tested "version".

It's not easy to integrate/automate infra from different vendors, but it can be done. Been there, done that.

Hi, npm answered first so I responded to his comment. Please see below. Versioning is not so simple. I thought about it :( It kind of works well IF we are talking about versioning containers or versioning images.

Can you give an example? Your question seems too abstract for me to see if I have any experience with what you're talking about.

Sure. Well, not even with continuous deployment. We deploy infrastructure changes once a week. We have two teams (SRE & infrastructure) working on our Ansible repo. We currently commit everything into the master branch once the PR has been merged (we are on GitHub).

Now on Friday I will compile a list of ready-to-go commits for next week. These changes will be moved to staging, then to production. However, I am seeing pain managing the release process because:

* sometimes a bug fix is only required in one environment (could just be production), but we still merge into master.

* we can make a weekly release tag, but then we have to merge hot fixes in. okay, not a big deal, but this happens

* we also have changes which affect global deployment (for example logstash filter files are globally used, not versioned by environment). If someone wants to test a filter change in dev and only in dev for whatever legitimate reason, we will still have to push that change to production. However, this is a bad practice - I do not like pushing changes just because they are part of the tree.

I thought about branching and making use of GitHub tags to help identify the scope of changes (dev? stage? prod? all?) and the components affected (right now I have to read the commit to really understand what is being changed...). But maintaining dev, stage and prod branches is costly too; I have to cherry-pick commits into different branches.
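One way to identify scope without reading every commit is to record it in the commit message itself. The sketch below assumes a hypothetical convention where the last paragraph of a message carries `Environments:` and `Component:` trailer lines; the trailer names and the `Commit` type are made up for illustration.

```python
from dataclasses import dataclass

ALL_ENVS = ("dev", "stage", "prod")


@dataclass(frozen=True)
class Commit:
    sha: str
    message: str


def parse_trailers(message: str) -> dict:
    """Collect 'Key: value' lines from the last paragraph of a commit message."""
    paragraphs = message.strip().split("\n\n")
    if len(paragraphs) < 2:
        return {}  # subject line only, so there are no trailers
    trailers = {}
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            trailers[key.strip().lower()] = value.strip()
    return trailers


def environments(commit: Commit) -> tuple:
    """Environments a commit targets; a missing trailer means all of them."""
    raw = parse_trailers(commit.message).get("environments")
    if not raw or raw.lower() == "all":
        return ALL_ENVS
    return tuple(env.strip() for env in raw.split(",") if env.strip())


def release_list(commits, env: str) -> list:
    """SHAs from master, in order, that apply to the given environment."""
    return [c.sha for c in commits if env in environments(c)]
```

This only answers "which commits are meant for which environment" when compiling the weekly list. It does not by itself keep a dev-only change out of the tree that goes to production; that still needs cherry-picking or per-environment configuration.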

So here I am with a weekly release and I feel pain. I can't imagine myself doing CD (deploying at least once a day) any time soon.

Configuration is perhaps more complicated than binary deployment. With binary deployment, you can end up with, say, only the version in production, and the version that is about to be in production - the "old" and the "new". If you've made it from dev to staging with one binary, and you discover a bug, you go through dev and to staging again, just with a new "new" binary.

Configuration, especially configuration management, often needs a more staged/tagged approach (in fact, you may have moved from having n custom builds to having one build with n configurations). You turn on a feature for some people, for one cluster, for all clusters of one type (say, v6-only clusters), and so forth. The potential combinatorial explosion is huge.

For the feature-flag case, you can use a canary approach, at least.

It's a lot harder to canary a change on one of your two (or four, or whatever) core switches, though.
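For the feature-flag case mentioned above, a canary can be as simple as enabling the flag for a stable percentage of hosts or users, optionally restricted to one type of cluster. This is a minimal sketch; the function names, the tag strings and the hashing scheme are assumptions, not a description of any particular flag system.

```python
import hashlib


def bucket(flag: str, unit_id: str) -> int:
    """Map (flag, unit) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, unit_id: str, percent: int,
               cluster_tags=frozenset(), required_tag=None) -> bool:
    """True if the flag is on for this unit at the given rollout percentage.

    required_tag (hypothetical, e.g. "v6-only") limits the flag to clusters
    that carry that tag; None means no restriction.
    """
    if required_tag is not None and required_tag not in cluster_tags:
        return False
    return bucket(flag, unit_id) < percent
```

Because the bucket depends only on the flag name and the unit, raising the percentage from 10 to 50 keeps every unit that was already enabled, so the canary population only grows.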

A pattern I've seen is to move from a single weekly deploy of disparate changes (say, server config management, switch port config, switch ACL config, ...) to multiple smaller deploys (potentially done by fewer people) based on the type.

One "nice" thing about infrastructure is that most problems are fairly immediately apparent. There are also generally a lot fewer integration-style tests you need to consider. You can detect failures and roll back quickly. Unfortunately, you've usually had a huge impact when you fail. And it's also relatively hard to verify your change before you land it.

The post refers to a system called "Landcastle". Is that an internal project or an internal name for an existing open source project?

It's an internal project. It basically takes a "diff" from Phabricator, and goes through the process of applying it to master and ensuring that all unit and integration tests pass before committing it.

This removes the need for every developer to do a rebase, run test, attempt land, discover conflict, rebase, run test, ... loop.
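As a rough illustration of that loop being done by a service instead of by each developer, here is a toy, in-memory model of a landing queue. It is not Landcastle's actual implementation; `Diff`, `apply_diff`, `land_all` and the dict-based "tree" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Tree = Dict[str, str]  # path -> file content; a stand-in for a real repository


@dataclass
class Diff:
    name: str
    # path -> (content the author based the change on, new content)
    changes: Dict[str, Tuple[Optional[str], str]]


def apply_diff(master: Tree, diff: Diff) -> Optional[Tree]:
    """Return master with the diff applied, or None if master has moved on."""
    candidate = dict(master)
    for path, (base, new) in diff.changes.items():
        if master.get(path) != base:
            return None  # someone else changed this file first: conflict
        candidate[path] = new
    return candidate


def land_all(master: Tree, queue, run_tests: Callable[[Tree], bool]):
    """Try each diff against the current master; commit only if tests pass."""
    results = {}
    for diff in queue:
        candidate = apply_diff(master, diff)
        if candidate is None:
            results[diff.name] = "conflict"
        elif not run_tests(candidate):
            results[diff.name] = "tests failed"
        else:
            master = candidate  # the "commit": later diffs see this state
            results[diff.name] = "landed"
    return master, results
```

The point of the design is that diffs are serialized: each one is tested against the master it will actually land on, so a green result cannot be invalidated by a change that landed in between.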

It's an internal system, but it was discussed at F8 last year. Here's the video (https://www.youtube.com/watch?v=X0VH78ye4yY#t=31m30s) and someone's notes on it (http://gregoryszorc.com/blog/2015/03/28/notes-from-facebook'...).

Probably an internal sister project to Sandcastle, another Facebook project mentioned in the post.

Seems to be an internal Facebook tool.

Kind of confused: is this setup for their production deployments only? Does Instagram have any staging environments that would catch a lot of their issues ahead of time, such as failed test builds, bad commits, etc.? Are their developers allowed to commit directly to master, or do they go through a formal pull-request process that gets signed off by someone?

As mentioned in the article itself, Instagram uses code review (using Phabricator), and also uses automated continuous integration for running tests on each "diff", before it allows the change to be landed to master.

Depends what automated tests you run. You could catch things that are technically wrong, but what about things like the button being in the wrong place?

That's not going to spew errors.

This class of errors is sometimes fairly easy to catch with canaries.

If a button is obscured or inactive/non-functional for some reason, then chances are some metric (clicks on that button, say) will deviate enough, statistically, to be called out during the canary phase.

For more manual canaries, this same approach can be used for metrics like memory usage, latency, number of upstream/database connections, and so forth. Of course, changing one of those could be the _purpose_ of the change, which is why it will likely be checked with manual canaries (i.e., not the canaries used for the continuous deployment process).
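A minimal sketch of what "statistically varied enough" could mean for a rate-style metric such as button clicks per view: compare the canary's rate against the rest of the fleet with a two-proportion z-test. The function names and the threshold of 3.0 are assumptions chosen for illustration, not what Instagram uses.

```python
import math


def two_proportion_z(success_a: int, total_a: int,
                     success_b: int, total_b: int) -> float:
    """z-score for the difference between two observed rates (a minus b)."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0  # both rates are exactly 0 or exactly 1
    return (p_a - p_b) / se


def canary_is_suspicious(control, canary, z_threshold: float = 3.0) -> bool:
    """control and canary are (events, opportunities) pairs, e.g. (clicks, views)."""
    z = two_proportion_z(canary[0], canary[1], control[0], control[1])
    return abs(z) >= z_threshold
```

For example, `canary_is_suspicious((5000, 100000), (50, 2000))` compares a 5% click rate on the control hosts with 2.5% on the canary. A real system would also need to handle continuous metrics like latency, where a different test applies.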

Is their test suite just unit tests, or does it include user-acceptance testing?

Great writeup! Would love to have seen more details on actually getting the newly shipped code recognized by the app servers gracefully, i.e., without terminating any pending requests.

"The test suite needs to be fast. It needs to have decent coverage, but doesn't necessarily have to be perfect."

Holy hell, what a telling statement that is. I get not unit testing for 1 == 1, but come on, unit and integration tests for, say, user login should be difficult, not fast. There are some test suites that actually do need to be perfect, unless Instagram thinks that e.g. OWASP isn't "decent coverage".

They probably use an existing, well-tested tool for user authentication (why reinvent the wheel?), or have comprehensive coverage for it already. There's no reason for login tests to be slow either way. And after all, it's a social media app; it doesn't have to run perfectly all the time.
