Hacker News

While there is some truth in there, I feel like some of the advice is pretty reckless:

"Problem 1: You presume there will be problems that impact availability. You have no confidence in your code quality; or (or maybe, and), you have no confidence in your infrastructure and deployment process."

Or you're playing it safe. You absolutely cannot guarantee that every update you are deploying will have zero problems. If your business absolutely relies on users making payments online or anything of that ilk, you could lose a lot of money.

"Imagine, for a moment, that your team is rolling out an update to a service that monitors life-support systems in hospitals."

What? Why? Throwing out hypothetical "what if" scenarios that would affect 0.1% of your readership isn't a very useful thing to do.




>Or you're playing it safe.

You're exactly right. This whole article reads like someone pretending to understand risk analysis. You can make as many technological and human-process improvements as you want leading up to the deploy, but even after doing everything possible to reduce potential impact, you'll still further reduce it by pulling the trigger when your service is at minimum load. And there is always a trigger to pull: the article argues for gradual rollout (which is good), but one still has to introduce new code to replace or run side-by-side with old code at some point. What if v2 worked alongside v1 just fine in testing and staging, but something in production makes it explode?
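The "gradual rollout" being discussed usually amounts to routing a small, ramping fraction of traffic to v2 while v1 keeps serving the rest. A minimal sketch in Python (the pool names and the ramp schedule are purely illustrative, not anything from the article):

```python
import random

def pick_backend(canary_fraction, old_pool, new_pool):
    """Route one request: send a small fraction of traffic to the new version.

    canary_fraction is ramped up over time (e.g. 0.01 -> 0.05 -> 0.25 -> 1.0)
    as the new version proves healthy; an error spike drops it back to 0.0.
    """
    if random.random() < canary_fraction:
        return random.choice(new_pool)
    return random.choice(old_pool)

# 1% canary: roughly one request in a hundred hits a v2 host.
backend = pick_backend(0.01, ["v1-a", "v1-b"], ["v2-a"])
```

In practice this routing usually lives in a load balancer or service mesh rather than application code, but the exposure math is the same: even a broken v2 only sees the canary fraction of traffic.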

Assume that, all else being equal, A is better than B. Also assume that, all else being equal, C is better than D. This article says, "You're doing A?!?!?! WTF DUDE!? Just do C instead of D, then forget about A and go back to B."

If he wants to argue that the benefits of A over B outweigh the costs of D over C, then he should do that, instead of writing what comes across as saying A is a magic bullet that makes C and D equivalent. Not to mention that the value of A over B and the cost of C over D differ from organization to organization.


I work on medical software. Surprisingly, most hospital systems require a fair bit of downtime. Hospitals have downtime procedures that they use during these periods (basically switching to manual, paper-based systems).


But I think the author's point is that if you find yourself worrying that a deployment has even a sliver of a chance of causing downtime, you should be spending your energy on finding ways to eradicate that risk rather than proceeding in the middle of the night.

Besides, in the unlikely event of deployment-caused downtime and problems, who is to say you'll have enough time to restore service before your customers come online? By taking advantage of the clock you've only given yourself a few more hours to deal with any problems, rather than finding ways to not have any problems in the first place (canary-in-the-coalmine-style deployments, etc.).


>But I think the author's point is that if you find yourself worrying that a deployment has even a sliver of a chance of causing downtime, you should be spending your energy on finding ways to eradicate that risk rather than proceeding in the middle of the night.

Right, and that's where I strongly disagree with the author. Sometimes finding and taking care of that last 0.1% risk of failure just isn't worth it. Sometimes you're better off babysitting the deployment for 30 minutes once or twice a month rather than spending valuable development hours updating your deployment scripts to be fully automatic and handle every possible contingency.


> Sometimes you're better off babysitting the deployment for 30 minutes once or twice a month rather than spending valuable development hours updating your deployment scripts to be fully automatic

If your deployment scripts are fully automatic, you can enjoy the (many) benefits of deploying more often than once or twice a month.
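For what it's worth, "fully automatic" doesn't have to mean handling every contingency up front. Even a script that updates one host at a time and halts on a failed health check covers most of the babysitting. A rough sketch, where `deploy_one` and `is_healthy` are hypothetical placeholders for whatever your infrastructure provides:

```python
import time

def rolling_deploy(hosts, deploy_one, is_healthy, retries=5, wait=2.0):
    """Update hosts one at a time, verifying health before moving on.

    deploy_one(host) pushes the new build to a single host and
    is_healthy(host) probes it (both are placeholder callables). If a host
    never comes back healthy, halt so the remaining hosts keep serving
    the old version.
    """
    done = []
    for host in hosts:
        deploy_one(host)
        for _ in range(retries):
            if is_healthy(host):
                break
            time.sleep(wait)
        else:
            raise RuntimeError(host + " unhealthy after deploy; halting rollout")
        done.append(host)
    return done
```

A failed rollout with this shape degrades capacity by one host at worst, which is what makes deploying more often than monthly tolerable.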


Given no other dependencies, sure.


If you're not worrying, you either don't care or you're fooling yourself. If it takes 30 min to fix a potential problem in production, I'd rather upset 10 people in the middle of the night than 1000 people during working hours.


> But I think the author's point is that if you find yourself worrying that a deployment has even a sliver of a chance of causing downtime, you should be spending your energy on finding ways to eradicate that risk rather than proceeding in the middle of the night.

Shouldn't I decide where I want to spend my energy?


The author is giving advice, not telling you what to do.


> Throwing out hypothetical "what if" scenarios that would affect 0.1% of your readership isn't a very useful thing to do.

Indeed.

The people who use my production systems are people who are logging transactions which they have made (in the real world) into my system. It's not a hospital. If the system is down, you just come back later.

We do a lot to make sure our deployments will go smoothly, downtime is minimized, and they affect as few users as possible. But the effort required for my team to deliver "five nines" would be insane. It's much easier for one guy to take the application server down for 10 minutes (at midnight) once a month.


> ...the effort required for my team to deliver "five nines" would be insane. It's much easier for one guy to take the application server down for 10 minutes (at midnight) once a month.

For the projects I've worked on lately, the ideal of "zero downtime deployments, fully automated, during the daytime, as non-events" isn't at all about getting a particular number of nines, it's about deploying more often than once every month.

When the deployment you've been working on for a whole month goes wrong, which of the many hundreds of changes are problematic?


I would sincerely hope that my colleagues always assume that there WILL be problems that impact availability when dabbling in a production environment.

I'd rather have a guy spend a whole day making sure everything is working, rechecking, etc.

I also hope that my colleagues will have the foresight to test an update/deployment on as fresh a mirror of the production environment (or a representative subset) as possible.

And I'd say that this is ESPECIALLY important for NoSQL environments.



