I've got ~1,000 printed copies to give away. So if anyone wants one, go here: http://madete.ch/1S3OGvl and follow the link on the left-hand side and we'll mail a copy to you.
The ROI from the 100-copy trial was good, so we decided to try to scale it out to more people. TBD whether we'll see a good ROI from 1,000 people, but we hope people enjoy it and like us for sending them a free book!
Please drop me a line at firstname.lastname@example.org. Also, looking forward to reading your book :-)
"Thanks for signing up!
Your book is on its way, you should receive it within a few working days."
For example: let's say you have a feature that uses a column, but you want to move it to using a separate table.
Step 1: design the new table
Step 2: deploy the new table - existing code continues to run against the column
Step 3: run a back-fill to ensure the new table has all of the data that exists in the current column
Step 4: deploy code to use the new table instead of the column
The above is the best-case scenario - but it often doesn't work like that, because you need to ensure that data added to the old column between steps 3 and 4 is correctly mirrored across. One approach here is to deploy code that dual-writes - that writes to both the old column and the new table - along with extra processes to sanity-check the conversion.
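To make the dual-write step concrete, here's a minimal sketch, assuming a hypothetical timezone setting being moved from a users column into a user_settings table. The table, column, and db helper names are invented for illustration, and the SQL is MySQL-flavoured:

    # Hypothetical dual-write while migrating users.timezone -> user_settings.
    # `db` stands in for whatever database handle you use; only the general
    # pattern comes from the comment above.

    def set_timezone(db, user_id, timezone):
        # Write to the old column so existing code keeps working...
        db.execute(
            "UPDATE users SET timezone = %s WHERE id = %s",
            (timezone, user_id),
        )
        # ...and mirror the write into the new table so the back-fill
        # never falls behind between steps 3 and 4.
        db.execute(
            "INSERT INTO user_settings (user_id, timezone) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE timezone = VALUES(timezone)",
            (timezone, user_id),
        )

    def timezone_consistent(db, user_id):
        # Sanity-check process: compare the old and new locations for drift.
        old = db.execute(
            "SELECT timezone FROM users WHERE id = %s", (user_id,)
        ).fetchone()
        new = db.execute(
            "SELECT timezone FROM user_settings WHERE user_id = %s", (user_id,)
        ).fetchone()
        return old == new

Once the sanity checks have been quiet for a while, step 4 can flip reads over to the new table and the dual-write can eventually be removed.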
GitHub's Scientist library is a smart approach to these more complex kinds of migration - which can take months to fully deploy. https://github.com/github/scientist and http://githubengineering.com/scientist/
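The pattern Scientist implements is simple to sketch: run the old and new code paths side by side, always return the old result, and report mismatches. Here's a rough Python sketch of the idea, not the library's actual (Ruby) API:

    import logging

    log = logging.getLogger("experiments")

    def experiment(name, use, try_):
        # Run the control (old path) and the candidate (new path), return the
        # control's result, and report whether the candidate agreed.
        control = use()
        try:
            candidate = try_()
            matched = candidate == control
        except Exception as exc:
            candidate, matched = exc, False
        log.info("experiment %s matched=%s control=%r candidate=%r",
                 name, matched, control, candidate)
        return control

    # e.g. during the column -> table migration above (names hypothetical):
    # timezone = experiment("timezone-read",
    #                       use=lambda: read_timezone_from_column(user_id),
    #                       try_=lambda: read_timezone_from_table(user_id))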
This can help in the migration process because you can version the views and sprocs whilst running more than one version at a time. Essentially you're creating a versioned API for your db.
Of course you now have more db objects to manage (boo, hiss, more moving parts) but it also encourages you down a saner path of versioning your db objects and rationalising your persistence somewhat (Do we really need 3NF? If I update this table in isolation... I jeopardise the consistency of this entity, etc.)
None of this _solves_ anything but I've found it mitigates a hard problem.
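For concreteness, a sketch of what that versioned db API might look like - the view names and the db helper are invented for illustration, and tie back to the column-to-table example above:

    # Two versions of the "API" coexist while old and new code run side by side.
    # v1 reads the legacy column, v2 reads the new table; both are illustrative.
    CREATE_VIEW_V1 = """
    CREATE OR REPLACE VIEW user_settings_v1 AS
    SELECT id AS user_id, timezone FROM users;
    """

    CREATE_VIEW_V2 = """
    CREATE OR REPLACE VIEW user_settings_v2 AS
    SELECT user_id, timezone FROM user_settings;
    """

    def deploy_views(db):
        # Old application code keeps selecting from user_settings_v1; newly
        # deployed code selects from user_settings_v2. Once nothing references
        # v1 any more, a later migration drops it.
        db.execute(CREATE_VIEW_V1)
        db.execute(CREATE_VIEW_V2)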
They aren't done live as a single big process that has the potential to lock all queries/updates for the duration of its execution, but rather as a set of smaller steps that don't lock.
Facebook has spoken about its online schema change process before - https://www.facebook.com/notes/mysql-at-facebook/online-sche... and its follow-up at https://www.facebook.com/notes/mysql-at-facebook/online-sche... for example, and I'm sure elsewhere.
Most people using MySQL would potentially first use something like https://www.percona.com/doc/percona-toolkit/2.1/pt-online-sc... instead of trying to create their own.
The same principles apply to other data stores that have more rigid schemas.
You can't rate-limit the I/O. That's huge, and the Percona tool and LHM offer ways to keep I/O under control. Along with that, you have to pay attention to the live schema update matrix in the docs; many things will upgrade without locking but require copying the table, and this will blast your I/O.
In addition, even with the tools, some changes require exclusive table locks. The lock is only needed for a short period of time, but if it can't be acquired because of a long-running transaction, a queue of transactions, etc., it can block everything up.
I don't think this is an approach I would recommend, but if your gig is a giant enterprisey shop that manages risk far too conservatively, it's a reasonable half-measure.
CD is an aspiration for a lot of shops, and getting there is a slow, careful process of lots of small victories and earning the trust of the decision-makers. Sometimes that means:
> designing around it
Sorry but that seems like a pretty crap answer.
The answer to "what tools allow you to manage database migrations with CD" should not be "don't do database migrations with CD" or "roll your own toggling features."
That said, I would never push for it at my previous company. At this company it's something they've been doing since day 1. It's always been built into our process, culture, and hiring practices. At my previous company the "agile" process was just a shortened waterfall process. Though many of the engineers were very talented and responsible... there were more than a handful that I would never trust to "self test" their code, or to monitor it as it deploys.
Phase 1: Upgrade schema for new code. Migrate initial data from old schema to new schema.
[Deploy: New code starts taking requests, writing to new schema. Old code is drained from the pool of handlers, continues to write to the old schema. Once old code is drained from the pool and the new code is validated by production traffic, run Phase 2.]
Phase 2: Catch-up migration of old data to new schema. Drop old schema.
I used Liquibase for migrations - change-sets can be tagged with contexts, and when you run the migration you can specify contextual information that each change-set can target (e.g. development AND pre-deployment). The principal tags I used were pre-deployment and post-deployment (which map to Phase 1 and Phase 2 above).
Schema migrations were a little harder to write but it meant that we could migrate live without impact to customers.
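A rough sketch of how those two phases might be driven from a deploy script, assuming the Liquibase CLI's documented --contexts option (the changelog path is made up, and the exact flag spelling varies between Liquibase versions):

    import subprocess

    def run_liquibase(contexts):
        # Apply only the change-sets tagged with the given context(s).
        subprocess.run(
            ["liquibase",
             "--changeLogFile=db/changelog.xml",   # hypothetical path
             "--contexts=" + contexts,
             "update"],
            check=True,
        )

    # Phase 1, before the new code starts taking requests:
    # run_liquibase("pre-deployment")
    # Phase 2, once the old code is drained and the new code is validated:
    # run_liquibase("post-deployment")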
An interesting way to use it is that you can have multiple applications that may use the same schema but are responsible for their own tables - you can specify a different schema metadata table for each application and they can generally live together fairly happily. We also have an application that 'deploys' data to target databases, so we are able to run Flyway via their Java library to make sure the target schemas are correct before running our statements against them.
Prior to Flyway I used Liquibase, which is also pretty powerful - but Flyway has just been so much more versatile - and the Spring Boot 'auto' integration has been awesome.
For example, say most of your web servers are running version 1 and you're in the middle of rolling out version 2. A page request goes to one of the web servers with version 2, which returns HTML containing links to assets with the version 2 cache-busting URLs. The browser requests those assets, perhaps through the CDN, but the load balancer sends the requests to a web server still running version 1, where the assets don't exist yet. This means the browser will get a 404 error.
What are the best options for dealing with this problem?
A bit unsatisfying, because it probably isn't always easy to include both assets.
Another idea (not sure if a good one) would be having the load balancers pin a certain session to a certain backend machine. Seems like this would make it better without fixing it, though: that session will still need to switch to a different set of assets when "their" server is deployed.
If, instead of being something the web server throws away before replying, the version number actually caused different assets to be returned, then you'd not have this problem.
One pattern is to separate out assets into a different package that is deployed to a separate host group and have your clients request a different host name, or have your load balancers use a path match to use those servers for those requests.
Another is to push asset updates first to all hosts. All hosts, even without code update, will now be able to respond for the new assets.
Another is to use a local cache plus some backend service or database to serve the assets from the web servers - again, all hosts will now respond correctly for the old assets and the new.
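A hedged sketch of that last approach - each web server tries its locally bundled assets and falls back to a shared store that keeps every version ever deployed, so no host 404s on assets it doesn't have yet. The paths and the shared-store client are placeholders:

    import os

    ASSET_DIR = "/srv/app/assets"   # assets bundled with this host's build

    def serve_asset(versioned_name, shared_store):
        # Return the bytes for e.g. 'app.a1b2c3.js'. Try the local build
        # first, then fall back to a shared backend (S3, a database, ...).
        local = os.path.join(ASSET_DIR, versioned_name)
        if os.path.exists(local):
            with open(local, "rb") as f:
                return f.read()
        # This host hasn't been deployed to yet (or has already moved on):
        # fetch from the shared store instead of returning a 404.
        return shared_store.get(versioned_name)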
I do a lot in the .NET/SQL Server world, and my tool of choice is one that I wrote: http://josephdaigle.me/2016/04/03/introducing-horton.html. Conceptually, what this tool does could work for any RDBMS.
I've always found they don't pay for themselves in terms of testing and likelihood of being required.
Alternatives are snapshots (depending on whether you can afford downtime) and simply writing recovery scripts that aren't tied to individual migrations but rather to a deployment - i.e. run the script before running any migrations, and if it goes wrong, run the contra script to back out, as opposed to a cleanup script.
(I suppose it's also possible that that's referring specifically to merge commits into master, which would make a lot more sense to me)
In general, feature branches are relatively very short-lived, and will be code reviewed, rebased and landed as a single commit onto master.
Features are often feature flagged off anyway, so it is acceptable to commit partially-functional features to master while that feature is flagged away.
There is a concept of stacked commits, but each commit in the stack needs to be a working step towards the end goal, and as such can (and will) be landed in isolation as they are code reviewed.
I don't understand why people do it any other way.
If you do a commit and find out in the middle of the day that the latest deploy is having problems, and people are still committing new code, wouldn't this make things much harder to narrow down?
Most problems will be discovered by someone and reported in an hour, and most of those will also be discoverable in a dataset on a system like Scuba - https://www.facebook.com/notes/facebook-engineering/under-th... - and you can identify the first time that particular issue happened.
If you're lucky, it lines up exactly to a commit landing, and you only need to look at that. Otherwise, due to sampling, maybe you need to look at two or three commits before your first report/dataset hit. You can also use some intuition to look at which of the last n commits are the likely cause. A URL generation issue? Probably the commit in the URL generation code. You'd do the same thing with a larger bundled rollout, but over a larger number of commits (50, in the case of a daily push).
Nevertheless it's always great to read how others accomplish their goals and even better that they're willing to share the journey.
Personally I find it incredibly frustrating to see code that I write not ship for sometimes weeks or even months at some clients. It's a slow process, but we'll get there...
It will catch any uncaught exceptions, group them together and normalize the stack traces (because of course they look different across browsers). If you tag the errors by their release, you can see if your release introduced any regressions.
We're not using it in any automated way, though, because there's so much noise. Any time a phone happens to run out of memory or some extension crashes, you'll get an error report.
Of course, it also doesn't cover non-crashing regressions, where you may have incorrect behavior rather than crashes. Those are much harder to catch, unless your integration tests are incredibly granular.
The solution is usually to run things in parallel. Examples: Run tests in parallel. Keep things in separate repos and push them into separate folders in the app servers.
How do you do that on a much smaller scale, e.g. when I have only 3 servers available? If I deploy to one of them, potentially 1/3 of customers might get the broken version.
You don't have to have all three servers getting the same amount of traffic, and you don't have to have a single copy of your service on each server. So, you could reduce the weight of a single server that does canary traffic to reduce the pain, or you could run two copies of your service on a server, and have the canary copy get a trickle of traffic.
Another approach is to use shadow traffic - instead of handling a request only on the canary host, you handle it on the production host _and_ the canary host. You'd need to ensure the canary can't adjust the production database, for example - or maybe you only shadow read requests. If you don't get any errors, or you're able to prove to yourself that they function the same, you can then move to a more traditional canary.
You definitely need to adjust your continuous deployment implementation to your environment, whatever it is.
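For what it's worth, the shadow-traffic idea is only a few lines when sketched out - the request/handler objects here are placeholders, not any particular framework:

    import logging

    log = logging.getLogger("shadow")

    def handle_request(request, production, canary, shadow_reads=True):
        # Serve the user from production, exactly as before.
        response = production.handle(request)
        # Mirror read-only requests to the canary and compare the results,
        # but never let the canary's answer (or its failure) reach the user.
        if shadow_reads and request.method == "GET":
            try:
                canary_response = canary.handle(request)
                if canary_response.body != response.body:
                    log.warning("canary mismatch on %s", request.path)
            except Exception:
                log.exception("canary failed on %s", request.path)
        return response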
1) Add 4th server at new version
2) Drop 1 old version
3) Launch 1 more new version
4) Drop 1 old version
5) Launch 1 more new version
6) Drop 1 old version
This way you never have fewer than 3 servers serving requests, but you never pay for more than 4. This temp server should only cost a few cents with most cloud providers.
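In deploy-script form that rotation is roughly the following - the cloud.* calls are placeholders for whatever provider API you're using:

    def rolling_replace(cloud, old_servers, new_version):
        # Never fewer than len(old_servers) machines in rotation,
        # never paying for more than len(old_servers) + 1.
        cloud.wait_healthy(cloud.launch(new_version))      # 1) add the extra server
        for i, old in enumerate(old_servers):
            cloud.terminate(old)                           # 2, 4, 6) drop one old version
            if i < len(old_servers) - 1:
                # 3, 5) launch one more new version to take its place
                cloud.wait_healthy(cloud.launch(new_version))
        # Done: same server count as before, all on the new version.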
For example, let's say 50 commits all land on master within the same second. Why break those into many deployments stretched across hours instead of deploying them all in the next event?
If you landed a bad commit in the middle of that 50, it seems like it might not be immediately obvious once it was deployed that it was bad - and then 5 or 30 minutes later another commit is deployed on top of it.
You might not notice a problem until hours after all of the commits have been deployed, which leaves you in the same situation as if you had deployed all 50 changes in one event, but in this model those 50 commits have been stretched over a much longer period of time between commit and liveness to users.
That's an amazing statement to me. I've always worked in smaller environments where we roll up many changes and try to deploy them perfectly. The penalty for bad changes has been high. This is a really new way of thinking.
It's an exciting way of thinking, but I'm not sure I love it. I wonder how well "sometimes we break things" scales with users of smaller services. I guess the flip side is that "we often roll out cool new things" definitely is desirable to users of small services.
Given this, your confidence threshold for a release is not approaching 100%; it's hitting some "good enough" value, where the work you'd do to test for the next 1% is 2x the testing you're doing now and is "not worth it". As you burn through some sort of error/downtime budget, you'll adjust that level of confidence - as you have more problems, you take more time responding to them.
Continuous deployment's upside is confidence in the release process (since you do it so often), and some assurance that you'll be able to find the problem reasonably fast (since you only have to look through a smaller number of changes). You'll have fewer bigger problems, and more smaller problems. There definitely are cases where ten smaller downtimes of 5 minutes are worse than one larger downtime of an hour, but usually it's better to have the former.
Writing "business software" I have noticed that this doesn't scale at all. I mean when you have a couple of thousand people depending on the software for work bugs are really not tolerated that well.
It's probably different if you have hundreds of servers and can detect bugs on a deployment on one of them, so it only affects a small percentage of users and then you can roll back and try again.
But if you have a single installation and you break it all the time with your commits, then it probably doesn't work so well.
And for the majority of software you really do not need "webscale" installations with millions of "heroku boxen" or droplets etc. Sure, have some for redundancy, but it really doesn't help with this "deploy master on each commit" type of deal.
Switches have config files or firmware dumps; the same goes for BIOS and RAID BIOS, for documentation of the infra and its connections, etc.
Infra will evolve, and so will the "version".
While in the "test" stage it's the "next version" of the infra; in production, the architecture, firmware, connections and configuration run a tested "version".
It's not easy to integrate/automate infra from different vendors, but it can be done. Been there, done that.
Now on Friday I will compile a list of ready-to-go commits for next week. These changes will be moved to staging, then to production. However, I am seeing pain managing the release process because:
* sometimes a bug fix is only required in one environment (could just be production), but we still merge into master.
* we can make a weekly release tag, but then we have to merge hot fixes in. okay, not a big deal, but this happens
* we also have changes which affect global deployment (for example, logstash filter files are globally used, not versioned by environment). If someone wants to test a filter change in dev, and only in dev, for whatever legitimate reason, we still have to push that change to production. However, this is a bad practice - I do not like pushing changes just because they are part of the tree.
I thought about branching and making use of GitHub tags to help identify the scope of changes (dev? stage? prod? all?) and the components affected (right now I have to read the commit to really understand what is being changed...). But maintaining dev, stage, and prod branches is costly too; I'd have to cherry-pick commits into different branches.
So here I am with a weekly release and I feel the pain; I can't imagine myself doing CD (as frequently as once a day, at least) any time soon.
Configuration, especially configuration management, often needs a more staged/tagged approach (in fact, you may have moved from having n custom builds to having one build with n configurations). You turn on a feature for some people, for one cluster, for all clusters of one type (say, v6-only clusters), and so forth. The potential combinatorial explosion is huge.
For the feature-flag case, you can use a canary approach, at least.
It's a lot harder to canary a change on one of your two (or four, or whatever) core switches, though.
A pattern I've seen is to move from a single weekly deploy of disparate changes (say, server config management, switch port config, switch ACL config, ...) to multiple smaller deploys (potentially done by fewer people) based on the type.
One "nice" thing about infrastructure is that most problems are fairly immediately apparent. There are also generally a lot fewer integration-style tests you need to consider. You can detect failures and roll back quickly. Unfortunately, you've usually had a huge impact when you fail. And it's also relatively hard to verify your change before you land it.
This removes the need for every developer to go through the rebase, run tests, attempt to land, discover a conflict, rebase, run tests, ... loop.
That's not going to spew errors.
If a button is obscured or inactive/non-functional for some reason, then chances are some metric is going to vary enough, statistically, to stand out during the canary phase.
For more manual canaries, this same approach can be used for metrics like memory usage, latency, number of upstream/database connections, and so forth. Of course, that could be the _purpose_ of the change, which is why it likely will be checked with manual canaries (ie, not the canaries used for the continuous deployment process).
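The automated check behind this can be quite dumb - compare each metric on the canary hosts against the rest of the fleet and hold the push if it drifts too far. A sketch, with a placeholder metrics client:

    def canary_looks_healthy(metrics, metric_names, max_relative_increase=0.10):
        # Return False if any metric (error rate, memory, latency, connection
        # counts, ...) is more than 10% worse on the canary than on the fleet.
        # The `metrics` client and its methods are placeholders.
        for name in metric_names:
            baseline = metrics.fleet_average(name, exclude_canary=True)
            canary = metrics.fleet_average(name, only_canary=True)
            if baseline > 0 and (canary - baseline) / baseline > max_relative_increase:
                return False
        return True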
Holy hell, what a telling statement that is. I get not unit testing for 1 == 1, but come on, unit and integration tests for, say, user login should be difficult, not fast. There are some test suites that actually do need to be perfect, unless Instagram thinks that, e.g., OWASP isn't "decent coverage".