"Problem 1: You presume there will be problems that impact availability. You have no confidence in your code quality; or (or maybe, and), you have no confidence in your infrastructure and deployment process."
Or you're simply playing it safe. You cannot guarantee that every update you deploy will have zero problems, and if your business relies on users making payments online, or anything of that ilk, downtime could cost a lot of money.
"Imagine, for a moment, that your team is rolling out an update to a service that monitors life-support systems in hospitals."
What? Why? Throwing out hypothetical "what if" scenarios that would affect 0.1% of your readership isn't a very useful thing to do.
You're exactly right. This whole article sounds like someone pretending to understand risk analysis. You can make as many technological and human-process improvements as you want leading up to the deploy, but even after doing everything else possible to reduce potential impact, you'll still reduce it further by pulling the trigger when your service is at minimum load. And there is always a trigger to pull: the article argues for gradual rollout (which is good), but you still have to introduce new code to replace, or run side by side with, the old code at some point. What if v2 worked alongside v1 just fine in testing and staging, but something in production makes it explode?
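The gradual-rollout idea mentioned above can be sketched as a weighted router that sends a small fraction of traffic to the new version while the rest stays on the known-good one. Everything here (the `make_router` helper, the 10% starting fraction) is illustrative, not something from the article:

```python
import random

def make_router(canary_fraction):
    """Return a router that sends `canary_fraction` of requests to v2
    and the rest to the known-good v1."""
    def route_request(handle_v1, handle_v2):
        if random.random() < canary_fraction:
            return handle_v2()
        return handle_v1()
    return route_request

# Start with 10% of traffic on the new code; raise the fraction
# only while error rates stay flat.
route = make_router(0.10)
```

Even this doesn't remove the "something in production makes it explode" moment, but it does shrink the blast radius to a fraction of users while you watch for it.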
Assume that, all else being equal, A is better than B. Also assume that, all else being equal, C is better than D. This article says "You're doing A?!?! WTF, dude! Just do C instead of D, then forget about A and go back to B."
If he wants to argue that the benefits of A over B outweigh the costs of D over C, then he should do that, instead of writing what comes across as claiming A is a magic bullet that makes C and D equivalent. Not to mention that the value of A over B and the cost of C over D differ from organization to organization.
Besides, in the unlikely event of deployment-caused downtime and problems, who is to say you'll have restored service by the time your customers come online? By taking advantage of the clock you've only bought yourself a few more hours to deal with any problems, rather than finding ways to not have problems in the first place (canary-in-the-coal-mine-style deployments, etc.).
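A canary-style gate like the one alluded to above could be as simple as comparing the canary's error rate against the fleet's baseline before promoting it. The function name, the 2x tolerance, and the promote/rollback verdicts are all assumptions for illustration:

```python
def canary_verdict(canary_errors, canary_requests,
                   baseline_error_rate, tolerance=2.0):
    """Decide whether to promote a canary deployment.

    Promote only if the canary's observed error rate stays within
    `tolerance` times the baseline; otherwise roll back before
    the change reaches the whole fleet.
    """
    if canary_requests == 0:
        return "rollback"  # no traffic means no signal; play it safe
    rate = canary_errors / canary_requests
    return "promote" if rate <= baseline_error_rate * tolerance else "rollback"
```

In practice you'd evaluate this repeatedly as the canary's traffic share grows, not once, so a bad deploy gets caught while it's still serving a sliver of users.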
Right, and that's where I strongly disagree with the author. Sometimes finding and taking care of that last 0.1% risk of failure just isn't worth it. Sometimes you're better off babysitting the deployment for 30 minutes once or twice a month rather than spending valuable development hours updating your deployment scripts to be fully automatic and handle every possible contingency.
If your deployment scripts are fully automatic, you can enjoy the (many) benefits of deploying more often than once or twice a month.
Shouldn't I decide where I want to spend my energy?
The people who use my production systems are logging transactions they've already made in the real world. It's not a hospital; if the system is down, you just come back later.
We do a lot to make sure our deployments go smoothly, downtime is minimized, and as few users as possible are affected. But the effort required for my team to deliver "five nines" would be insane. It's much easier to have one guy take the application server down for 10 minutes (at midnight) once a month.
For the projects I've worked on lately, the ideal of "zero-downtime deployments, fully automated, during the daytime, as non-events" isn't about hitting a particular number of nines at all; it's about deploying more often than once a month.
When the deployment you've been working on for a whole month goes wrong, which of the many hundreds of changes are problematic?
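Even with a git-bisect-style binary search, the cost of isolating the bad change grows with batch size. A sketch of that search (the `is_broken` probe is hypothetical, standing in for redeploying and testing at a given change):

```python
def first_bad_change(changes, is_broken):
    """Binary search for the first bad change, assuming every change
    before it is good and every change from it onward is bad
    (the same invariant git bisect relies on)."""
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(changes[mid]):
            hi = mid        # first bad change is at mid or earlier
        else:
            lo = mid + 1    # first bad change is after mid
    return changes[lo]
```

With a few hundred changes in a month-long batch that's still eight or nine probes, each of which may be a full test deployment; ship a handful of changes at a time and a single look usually suffices.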
I'd rather have a guy spend a whole day making sure everything is working, rechecking, etc.
I also hope that my colleagues will have the foresight to test an update/deployment on as fresh a mirror of the production environment (or a representative subset) as possible.
And I'd say that this is ESPECIALLY important for NoSQL environments.