Even if you have unit tests, continuous integration, a staging environment, and fast rollbacks, deploying to production can still be tricky:
1) Do you want to run MySQL, PostgreSQL, Oracle or SQL Server? Watch out for ALTER TABLE; it may sometimes hold table locks or invalidate open cursors. Of course, you could always rewrite your application to use a schema-free NoSQL database and migrate your records at read time. But that has a lot of consequences. Maybe 10 minutes of scheduled maintenance is an acceptable tradeoff?
2) Does your staging environment exactly match your production environment? Can you generate a full-sized, real-world load on your staging servers? Alternatively, have you built all the tools required for a phased deployment of your back end?
Now, I'm all in favor of deploying to production 5 times a day. That's pretty easy for any group with good unit test coverage and a fast rollback mechanism. But it's expensive to eliminate the occasional 10 minute maintenance window. And if you're going to take the site down for maintenance, it makes sense to do it at an off-peak hour.
There are lots of things that make sense at the appropriate scale. Not having any tests. Completely manual processes for builds and deployments. Having only one server. Making backups every day or maybe even just every week. Etc. However, as your business grows you need to recognize that it's important to change these things and move to better processes. Because if you don't then those things could become a very serious drag on development velocity and even business capability. You could find yourself spending all your time drowning in process when you should be spending your efforts more efficiently.
You stay up until 3am to perform a risky manual deployment, and sometimes you spend more time fixing problems afterward. As a consequence you either spend a day in zombie mode with limited productivity or you start working much later. In either case you've taken away a fairly decent chunk of time that could have been used for productive work. Similarly, perhaps you spend too much of every day fighting your build system, or recovering from your build always being on the floor, or doing excessive operational support because your platform doesn't have sufficient redundancy at every tier. Etc, etc, etc.
From my experience, it's easier to start there from day one than to try to get there after years of dysfunctional deployment processes.
Interestingly, scaling is probably the least valuable thing to "overbuild" early and yet that's by far the most common thing teams tend to do.
As is suggested above, you could use NoSQL. But NoSQL still has schemas, in a sense - if you decide to restructure anything you may still need to knock the db out for a while (or write really hairy code to work around two different versions of your data, but be unable to restructure stuff because you're too scared to change anything on the db).
I saw a talk by Brett Durrett, VP Dev & Ops at IMVU last week. They don't make backwards incompatible schema changes. Ever. Also, on big tables, they don't use alter statements ever, because MySQL sometimes takes an hour to ALTER.
Instead, they create a new table that is the "updated" version of the old table, and then their code does copy-on-read. i.e. they look for the row in the new table, if it's not there, copy it out of the old table and insert into the new table. Later, a background job will come through and migrate all old records into the new table. Eventually, the old table will be deleted and the copy-on-read code will be removed.
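The copy-on-read idea looks roughly like this (a toy Python sketch using sqlite3 for illustration; the table names, the recursive retry, and the trivial "restructuring" step are all invented - IMVU's real system is MySQL-based):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users_old (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")
db.execute("INSERT INTO users_old VALUES (1, 'alice')")

def get_user(user_id):
    """Read the new table first; lazily copy from the old table on a miss."""
    row = db.execute(
        "SELECT full_name FROM users_new WHERE id = ?", (user_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    old = db.execute(
        "SELECT name FROM users_old WHERE id = ?", (user_id,)
    ).fetchone()
    if old is None:
        return None
    # Copy on read; INSERT OR IGNORE tolerates a concurrent migration
    # of the same row by another request or by the background job.
    db.execute(
        "INSERT OR IGNORE INTO users_new VALUES (?, ?)",
        (user_id, old[0].title()),  # the "restructured" format, trivial here
    )
    return get_user(user_id)
```

The background job is then just a loop doing the same old-to-new copy for every remaining row, after which the fallback branch can be deleted.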
It's a lot of extra work, but they think it's worth the effort.
I need to finish my blog post on the rest of the talk...
No matter what you end up doing it'll be extra work. It's best to find a way of working that provides enough flexibility to continue developing at a good pace without being an operational nightmare. Generally that means you'll have to take things a bit slower, but that'd be true almost regardless of the technology you're using; schema changes rarely have a trivial impact.
My app, on startup, ensures that it has all of the tables (with all of the columns) and any seed data that it may need (by issuing CREATE TABLE and/or ALTER TABLE). This allows me to simply roll out new (tiny, incremental) database changes and code and be sure that it works.
This also makes testing easier, in that my integration tests can start with a new database from scratch, build what needs to be built, run the tests that talk to the database, and then remove the database once it has finished.
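The startup check can be as small as this (a sqlite3 sketch; the table and column names are invented, and a real app would run this against its production database):

```python
import sqlite3

def ensure_schema(db):
    # CREATE TABLE IF NOT EXISTS is idempotent, so every startup can run it.
    db.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        " id INTEGER PRIMARY KEY,"
        " total REAL NOT NULL)"
    )
    # Adding a column is backwards compatible; only add it when missing.
    cols = {row[1] for row in db.execute("PRAGMA table_info(orders)")}
    if "currency" not in cols:
        db.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")

db = sqlite3.connect(":memory:")
ensure_schema(db)
ensure_schema(db)  # safe to run twice, as a fresh deploy or test run would
cols = {row[1] for row in db.execute("PRAGMA table_info(orders)")}
```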
If you must make non-backwards compatible changes (renames, whatever), I would suggest doing them one at a time.
Their examples show that it is not possible to do all application upgrades without downtime, and they show how to keep downtime to a minimum (e.g. by only making parts of your application inaccessible).
Another option is to just wake up much later: my office is dead until 2pm, and the "key players" often don't wake up until 6pm.
"Problem 1: You presume there will be problems that impact availability. You have no confidence in your code quality; or (or maybe, and), you have no confidence in your infrastructure and deployment process."
Or you're playing it safe. You absolutely cannot guarantee that every update you are deploying will have zero problems. If your business absolutely relies on users making payments online or anything of that ilk, you could lose a lot of money.
"Imagine, for a moment, that your team is rolling out an update to a service that monitors life-support systems in hospitals."
What? Why? Throwing out hypothetical "what if" scenarios that would affect 0.1% of your readership isn't a very useful thing to do.
You're exactly right. This whole article sounds like someone pretending to understand risk analysis. You can make as many technological and human process improvements as you want leading up to the deploy, but even after doing everything else possible to reduce potential impact you'll still further reduce potential impact by pulling triggers when your service is at minimum load. And there is always a trigger to pull, the article argues for gradual rollout (which is good), but one still has to introduce new code to replace or run side-by-side with old code sometime. What if v2 worked alongside v1 in testing and staging just fine but something in production makes it explode?
Assume that, everything else being equal, A is better than B. Also assume that, everything else being equal, C is better than D. This article says "You're doing A?!?!?! WTF DUDE!? Just do C instead of D, then forget about A and go back to B".
If he wants to argue that the benefits of A over B outweigh the costs of D over C, then he should do that instead of writing what comes across as saying A is a magic bullet that makes C and D equivalent. Not to mention that the value of A over B and the cost of C over D differ from organization to organization.
Besides, in the small event of deployment-caused downtime and problems, who is to say you have enough time to restore service by the time your customers come online? By taking advantage of the clock you've only given yourself a few more hours to deal with any problems, rather than finding ways to not have any problems in the first place (canary-in-the-coalmine-style deployments, etc).
Right, and that's where I strongly disagree with the author. Sometimes finding and taking care of that last 0.1% risk of failure just isn't worth it. Sometimes you're better off babysitting the deployment for 30 minutes once or twice a month rather than spending valuable development hours updating your deployment scripts to be fully automatic and handle every possible contingency.
If your deployment scripts are fully automatic, you can enjoy the (many) benefits of deploying more often than once or twice a month.
Shouldn't I decide where I want to spend my energy?
The people who use my production systems are people who are logging transactions which they have made (in the real world) into my system. It's not a hospital. If the system is down, you just come back later.
We do a lot to make sure our deployments will go smoothly, downtime is minimized, and they affect as few users as possible. But the effort required for my team to deliver "five nines" would be insane. It's much easier for one guy to take the application server down for 10 minutes (at midnight) once a month.
For the projects I've worked on lately, the ideal of "zero downtime deployments, fully automated, during the daytime, as non-events" isn't at all about getting a particular number of nines, it's about deploying more often than once every month.
When the deployment you've been working on for a whole month goes wrong, which of the many hundreds of changes are problematic?
I'd rather have a guy spend a whole day making sure everything is working, rechecking, etc.
I also hope that my colleagues will have the foresight to test an update/deployment on as fresh a mirror of the production environment (or a representative subset) as possible.
And I'd say that this is ESPECIALLY important for NoSQL environments.
That's the number 1 reason for an overnight deployment, by far! If you can deploy code without bringing the site down, then of course, do it during the day!
Also, we do a daily backup around midnight. If the deployment is botched, I can come in early in the morning, notice it and simply do a roll-back using the backup. Very easy.
Now, if you're actually staying up until 3 AM and doing things manually ... you need to automate things.
You roll out when it least impacts customers because:
1. Impacted customers affect your bottom line. Minimize them and you minimize losses if something screws up.
2. Murphy's law. I don't care how much testing and QA you do; stuff will ALWAYS creep through, and sometimes it'll be nasty. QA works to minimize this, but it can't eliminate it entirely due to diminishing returns. Show me your guaranteed bug free deployed code and then I'll consider changing my view.
3. If you've designed and tested your rollback procedure before you actually need it, the chances of being unable to roll back during the real deployment are orders of magnitude lower than the chances of a failed deployment requiring a rollback. And that, in turn, if you've done your testing and QA, is orders of magnitude less likely than a successful rollout (but not zero, thus the midnight rollout).
If you're worth your salt, you have a tested rollback procedure, laid out in simple to follow instructions (or better yet, an automated rollback mechanism with a simple-to-follow manual process when the automated method inevitably fucks up).
You rollout, and if it fails, you roll back. And if that fails, you use the manual procedure. You should have the entire process time boxed to the worst case scenario (assuming successful manual rollback) so that you know beforehand what the impact is, and won't need to go around waking people up asking what to do.
The way to not impact a customer is to make deploys trivial, automated and tolerant to failure because everything fails.
I basically agree with this idea, but when I'm selling people on the idea of making deployments trivial non-events that happen in the daytime, having the notion of "if something goes wrong, you can very easily jump back" gives people a sense of security.
In practice, when things go wrong, I've found it easier to roll forward than to roll backwards.
For a stable system that's in production, you don't need "guaranteed bug free deployed code" you need code that is not any worse than what's currently running out there. Doing frequent (daytime) deployments makes it easier to make a change, test (both with humans and robots) that change, and get it out there. You don't have to manually test everything in order to change anything when you're changing just one thing at a time.
I've come to believe that the far riskier approach is to make a bunch of changes at once, introduce a bunch of bugs, test fix bugs until you feel confident, and then release this huge change all at once in the middle of the night.
I'm sure this is a noob reaction, but can anyone point me to some good technical walkthroughs about how to deploy to live without taking your site down or interrupting your database connection? Is there a tool/term/practice I'm missing?
1. distribute new package to all servers;
2. run an additional application service on all servers and run some quick tests on each of them to verify they work properly;
3. add new application servers to load balancers;
4. upgrade the data model to the new version (we use postgresql, and this happens in a single db transaction; remember that our new version x is compatible with both the current and the previous data model);
5. remove old application services from load balancers;
6. upgrade successful.
If anything goes wrong, we can roll back each of these steps. Note that this whole process, perhaps needless to say, is fully automated.
By running multiple versions inside the load balancer at the same time, and having the requirement that version x + 1 is always compatible with version x, this procedure allows us to seamlessly upgrade to a new version without any downtime.
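The steps above can be simulated in a few lines (a toy Python model where the "load balancer" is just a set of (host, version) pairs and the real install/smoke-test/migration work is stubbed out; everything here is invented for illustration):

```python
def rolling_upgrade(pool, hosts, new_version, migrate):
    """pool is the live set of (host, version) pairs behind the LB."""
    old_entries = set(pool)
    # Steps 1-2: distribute the package, start the new service on each
    # host, and smoke-test it before it sees traffic (stubbed out here).
    new_entries = {(host, new_version) for host in hosts}
    # Step 3: add the new services; both versions now take traffic.
    pool |= new_entries
    # Step 4: single-transaction schema upgrade, compatible with both
    # the current and the previous data model.
    migrate(new_version)
    # Step 5: remove the old services from the load balancer.
    pool -= old_entries
    return pool

migrated = []
pool = {("app1", 4), ("app2", 4)}
pool = rolling_upgrade(pool, ["app1", "app2"], 5, migrated.append)
```

The key invariant is visible in the middle: there is a window where versions x and x+1 serve side by side, which is why the compatibility requirement exists.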
In most specific cases, there are alternatives to stopping the show during an upgrade. First, you check your backups! Then, there are various strategies. You can make many simple schema changes while the system is running. Or put the DB in read-only mode and still mostly manage to serve pages -- perhaps with an occasional HTTP 503, so hopefully that's okay -- during your upgrade. Or phase in changes over several releases, carefully architecting for backward compatibility. Or bring up a parallel system and gradually migrate active users to it. Depending on what you're doing, you may find yourself having to write a special upgrade script that migrates old formats to new formats, or even using database triggers to keep "old" and "new"-format tables in sync during the transition. A well-constrained database may help keep things sensible during the transition -- or, you may have to drop half of your supposedly-sensible constraints just to make the transition work.
Moral: Even if you know exactly what you're doing, live updates are more work: More planning, more code, more infrastructure, and/or more stress. In many real-world cases, you should just take the system down for a minute. Focus your engineering effort on making sure that "minute" is as short as possible, and in making sure that you can detect problems and roll back as quickly as possible.
> the general answer is "don't change the database
> structure of a database that's serving live traffic:
> You'll never enumerate all the things that could
> possibly go wrong."
"Don't do it" is not really an answer.
> Or put the DB in read-only mode and still mostly
> manage to serve pages
The reason there aren't a lot of general-purpose walkthroughs is that there is no general case.
The closest thing we have to a general-purpose solution is: Take the system down, do the upgrade, put the system back up. But, yes, this is often not a very good answer. It is often a lousy and expensive answer. In which case you hire engineers to build a better strategy. And I can't tell you, dear reader, what that better strategy is going to be, because it's different from case to case, and I don't know what your problem is.
And, yes, of course you shouldn't use the sloppy strategies on your financial-transaction processing app. Just as you probably shouldn't spend three engineer-weeks designing a complex zero-uptime rollout strategy for your blog comment system.
As for your second question, if you have a load balancer, you can always take nodes out of it in order to update them, before re-enabling them in the LB and moving on to the next one. It's called a rolling upgrade, and the ease and details of doing such depend on the actual pieces involved.
We normally disallow non-backwards compatible changes, such as renaming columns. We only drop tables after renaming and waiting a while (so we can quickly rename back).
When you have a lot of database servers this is pretty important since trying to keep them all exactly in-sync with the same schema at all times in the process is pretty much impossible. While doing the change, you are always going to end up with some finishing earlier than others.
I boil it down to this: The safest change to deploy to a stable system is the smallest change possible. Most of my changes don't require schema changes at all.
The application itself, on startup, verifies that the database has the tables/columns it needs in order to work. If it doesn't, it will CREATE TABLE or ALTER TABLE appropriately.
I try to avoid backwards-incompatible schema changes whenever possible, so I can rollback. It's always safer and easier to add a new table or column than to delete/rename an existing one. Something wrong with the code? Rolling back won't send you back to an incompatible state.
I use an ORM instead of stored procedures, because I find them a lot more friendly with this general process, you don't have procs that expect a particular parameter signature.
You may need to decouple your db-changes deployments from your non-db-changes deployments. Doing that can at least make the non-db-changes deployments a lot less painful.
Three words: Defense in depth. You don't always need every advantage you can get, and this is a pretty costly one. Still, it's simply wrong to assume that a particular precaution is always needless.
Interestingly, you hit a bit of a sweet spot if your primary customer base is located halfway around the world. You can roll out updates when your people are most wide awake, yet if something goes wrong you'll only break things for a few users.
That whole line about trusting your infrastructure, etc, is hog-wash. I say, don't trust your infrastructure. Don't trust your code. Be safe, be smart, do things at a time when the least number of people could be affected by a problem.
Humans are responsible for everything that is happening in a deployment, and humans make mistakes. So, no, I don't trust my servers or my code at a time of deployment.
It would be irresponsible of me to do so even if I had the most talented developers, and the most solid and secure platform in the world.
Actually, an 8am deployment in Europe has the advantages of a rested brain and a few hours before America hammers the Internet ;-)
If the software company is neither A nor B, then it needs to notify its users that they may experience a disruption.
And the disadvantage of happening in the middle of mainland Europe morning rush. I think the point is that you take the downtime when it's least disruptive to your users, who/wherever they might be.
Choosing the time should be a balance of off-peak usage time and a reasonable time for the developers. If you're a global business then you may not have a perfect time, but you can usually still choose a day and time that is less likely to cause inconvenience.
There is a great description of their web server stack that allows them to seamlessly push new code live without downtime (see the second Slow Deploys section.) It's a great idea about how to continue serving during a deployment, and I'm looking into using Unicorn like this for some projects I'm working on.
Anybody know of any articles explaining this process for a rails app?
For this to work, every action that will result in a database write needs to go in a log, and that action log needs to be replayable by the updated version.
In the context of a web application, you have:
1. Take your database backup, start recording action log.
2. Perform your database migration against your database backup.
3. Install your updated web application (not available to users yet).
4. Test your updated web application.
5. Disable your existing web application.
6. Replay the contents of your action log using the updated web application.
7. Make your updated web application available to your users.
It's definitely not a trivial matter to sort out: it means lots of time working on your deployment process, for example you have to be set up to have both versions deployed simultaneously, which likely means a lot of mucking about with URL rewriting. The application also has to be built to support it.
Maybe there is a better way?
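A toy model of the record-and-replay flow might look like this (the App class, the dict-backed store, and the "multiply by 10" migration are all invented stand-ins; a real action log would be durable and strictly ordered):

```python
class App:
    def __init__(self, store, log=None):
        self.store, self.log = store, log

    def write(self, key, value):
        if self.log is not None:      # step 1: record every write action
            self.log.append((key, value))
        self.store[key] = value

log = []
old_app = App({"a": 1}, log)          # live app, now recording actions
backup = dict(old_app.store)          # step 1: take the database backup

old_app.write("b", 2)                 # traffic keeps arriving meanwhile...

migrated = {k: v * 10 for k, v in backup.items()}  # step 2: migrate backup
new_app = App(migrated)               # steps 3-4: install and test new app

old_app.log = None                    # step 5: old app disabled (toy stand-in)
for key, value in log:                # step 6: replay the action log,
    new_app.write(key, value * 10)    # applying the new data format
```

Step 7 is then just switching user traffic to new_app.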
The author makes some amazing points that I'm surprised intelligent people are missing.
Let me give an example. We are currently migrating from storing blobs of data in Voldemort (a key/value store) to storing them in S3. They should have never been in there in the first place but whatever. We're going to do it with "zero" migration time. In fact we're already doing it.
- Set up a job that copies existing data in Voldemort to S3.
- Deploy a minor release of our code that multiplexes the current writes to both Voldemort and S3.
- Continue migrating existing data
- When existing data is finished migrating, deploy a new release that forces all traffic to S3 instead of Voldemort.
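The dual-write phase can be sketched like this (plain dicts stand in for the Voldemort and S3 clients; the class and function names are invented):

```python
class DualWriteStore:
    """The multiplexing phase: write to both stores, read from the old one."""

    def __init__(self, old_store, new_store):
        self.old, self.new = old_store, new_store

    def put(self, key, blob):
        self.old[key] = blob  # current source of truth ("Voldemort")
        self.new[key] = blob  # shadow write to the new store ("S3")

    def get(self, key):
        return self.old[key]  # reads flip to self.new in the final release

def backfill(old_store, new_store):
    """The background job copying pre-existing data into the new store."""
    for key, blob in old_store.items():
        new_store.setdefault(key, blob)  # don't clobber fresher dual writes

old, new = {"x": b"1"}, {}
store = DualWriteStore(old, new)
store.put("y", b"2")   # new writes land in both stores
backfill(old, new)     # old data catches up in the background
```

Once the backfill finishes and the stores agree, the final release drops DualWriteStore and talks to the new store directly.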
People need to learn to do things like dark launching and feature flags. Dark launching lets you exercise new code paths with no impact to the user. Feature flags give you the ability to enable features for some or all of your users. Feature flags are an awesome way to A/B test as well.
People need to stop doing stupid shit like redefining what some property means mid-release and instead define NEW properties and deprecate old ones in subsequent releases.
Same goes for schema changes. If some part of your code base cannot tolerate an additional column that it doesn't need, that's a bug.
You can also adopt some configuration management that allows you to provision duplicates of the components you're deploying so you can swap back and forth between them in the case where you might have a breaking release. That's what we do and it's one of the upsides of using EC2.
All of this requires discipline and dedication but the benefit is so worth it. You have to stay on top of dependencies. You can't let bitrot take hold by going 4 major revisions without upgrading some package. We did and it bit us in the ass.
This is why we adopted the swinging strategy of duplicating our entire stack (takes about 30 minutes depending on Amazon allocation latency) on major upgrades.
As for deploying to a segment of users put a "version" field in the user and then gate features based on the version. If you name your versions or use a unique numbering system it should be fairly easy to remove the old gating.
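A minimal version-gate along those lines (feature names and version thresholds are invented):

```python
# Minimum user "version" required to see each feature; once a feature is
# fully launched, its entry (and the gate) can simply be deleted.
FEATURE_MIN_VERSION = {
    "new_checkout": 7,  # only users rolled forward to version 7+ see this
    "dark_search": 9,
}

def feature_enabled(user, feature):
    # Unknown features default to fully launched (threshold 0).
    return user.get("version", 0) >= FEATURE_MIN_VERSION.get(feature, 0)
```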
Yes, none of these things are foolproof but then again your existing system probably isn't either. The more you deploy the better you get at it and the more automated it becomes. Force the issue by forcing a deploy everyday to begin with, then ramp up to twice a day, etc. After a month or two you'll learn solutions to almost all your deployment problems and there will be a well-known solution in the organization for solving the problem whether it be gating features or schema migrations.
What about having two code bases on your production server and having your web server (nginx) route the traffic accordingly? 85% goes to the current code base and 15% goes to the new one?
If that's possible it would seem pretty easy.
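nginx's split_clients directive does roughly this. A hedged sketch (this fragment lives inside the http block; the upstream names and ports are invented):

```nginx
# Hash each client onto a bucket so ~15% consistently hit the new code base.
upstream current   { server 127.0.0.1:8000; }
upstream candidate { server 127.0.0.1:8001; }

split_clients "${remote_addr}" $backend {
    15%  candidate;
    *    current;
}

server {
    listen 80;
    location / {
        proxy_pass http://$backend;
    }
}
```

Hashing on the client address keeps each user on one version across requests, which matters if the two code bases render things differently.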
Gating would probably work better in the long term as you get more devs and users, and need to do things like A/B/C testing.
Whether or not you agree with the entire article, for me, the statement above was the most important one. There is no excuse why any software development team does not have an automated build and deploy process.
There may be valid business reasons to deploy in the middle of the night, but there's no business reason to deploy by hand.
Also, I am assuming that the posting only reflects application source code and not database changes, which is a different beast in terms of deployments.
I worked somewhere that only deployed during business hours because it was all hands on deck. We just kept the users primed with information and reminders prior to go-live, and when it happened everyone knew about it.
But there's a more fundamental point. We're dealing with IT, and for internet businesses and global enterprises the issue is the same: business is 24 hours a day 7 days a week. You cannot avoid deploying somewhere in the world that's awake.
All you have to do is plan a rollback and just get on with it.
My experience of companies that are timid about rolling out changes is that their state of advancement is completely unstable. They are too scared to change, which means some technologies become out of sync with the current state of play. You only have to introduce a new piece of whiz-bang software and issues come pouring out, and it's a frantic race to upgrade random components in the hope you might plug the leak. This is far, far worse in every way than doing regular "risky" upgrades.
Small incremental changes can be easily rolled back, because you know what you just changed.
At my previous job, we were still in process to get session migration to work so servers within a cluster could update while others were running.
At my current job, we don't have a cluster (we have a single over-provisioned server), and that method simply isn't in the budget for the foreseeable future. However, 3:00 AM PDT is 3:30 PM IST (India), which is where most of my team is, as well as the support staff they need (most staff there are on swing shift, as well). Thus, early morning is actually a very convenient, as well as lower-cost, time to do the release. As the site is mostly US customers in both a B2C and B2B use case, this timing (around 6:00 AM EDT) works very well, and I don't see it changing.
I realize this doesn't scale for a global app, but it explains the lack of pressure to spend much consideration on higher tech, more expensive, deployment methodologies.
Because you don't hire a system admin to do that job for you. :}
"Release early, release often" is great, but that same call to action can be what motivates people to push code live at 3am when they're tired, just finished it, and perhaps skipping QA.
Otherwise, I think the article as written is empty and seems more like a straw man. It assumes you have no confidence in your code, assumes deployment is complicated, and assumes you can't roll back.
Simply put, deploy small stuff often, make plans for big stuff, and don't be stupid.
If it only takes 5 minutes to deploy your app why not do it at 2AM when few users are online? Deployments should be simple, easy, and NOT require any problem solving by the deployer. Midday deployments with all hands on deck to handle issues seems like a bigger problem to me.
We're not doing 2am deployments, but we're certainly doing them "off-hours".
Nothing in your SLA says you have to have servers 1, 2 and 3 available. It says you have to have service X available. You can do that with out taking a downtime hit.
It's a "SERVICE level agreement", not a "SERVER level agreement".
Sounds like your experience is the only one that matters, then.