

Why Deployment Freezes Don't Prevent Outages - jsnell
http://www.xaprb.com/blog/2014/11/29/code-freezes-dont-prevent-outages/

======
objectified
Whether or not this theory is true can be measured. A few important numbers
come to mind.

- do you get more operations tickets right after a production deployment?

- do your call centers get an increased calls/hour rate after a production
deployment?

- are there often noticeable anomalies in system resource usage that seem to
be directly related to your deployment cycle?

- do your monitoring tools show a higher rate of warnings/criticals right
after a production deployment?
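These could all be measured directly. As a rough sketch of the first and last
points, assuming you can export deployment timestamps and alert timestamps
from your own tooling (the data and the two-hour window below are made up),
you could compare alerts that fire shortly after a deploy against the rest:

```python
from datetime import datetime, timedelta

def alerts_near_deploys(deploy_times, alert_times, window=timedelta(hours=2)):
    """Count alerts firing within `window` after any deployment,
    versus alerts firing outside all such windows."""
    inside = outside = 0
    for alert in alert_times:
        if any(d <= alert <= d + window for d in deploy_times):
            inside += 1
        else:
            outside += 1
    return inside, outside

# Toy data: two deploys, four alerts.
deploys = [datetime(2014, 11, 1, 14, 0), datetime(2014, 11, 8, 14, 0)]
alerts = [
    datetime(2014, 11, 1, 14, 30),  # 30 min after first deploy
    datetime(2014, 11, 1, 15, 45),  # 105 min after first deploy
    datetime(2014, 11, 4, 9, 0),    # no deploy nearby
    datetime(2014, 11, 8, 15, 59),  # just inside second window
]

inside, outside = alerts_near_deploys(deploys, alerts)
print(inside, outside)  # 3 1
```

If `inside` is consistently and disproportionately larger than `outside`
(after normalizing for how much of the calendar the windows cover), that's a
number you can put in front of people instead of a feeling.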

Whether or not production deployments introduce more risk probably depends
largely on the situation: how well do you test, how many changes are inside a
deployment, how collaborative are your operations/development teams, are your
technical teams understaffed, and so on.

The advantages of a freeze (defined as "no production deployments during a
certain time window") that I see are at least the following.

- from my own experience while working in operations, I remember very well
the many sleepless nights that somehow always seemed to be the immediate
consequence of new (byte)code running in production

- it gives operations a break; they often work at ungodly hours the whole
year, and getting some rest is very much needed

- it gives a moment to stand still and reflect; work on internal tooling,
put some structure into ad hoc things that sneaked in, et cetera.

Furthermore, I don't agree with the philosophy that "everything is always
broken". Sure: disks and power supplies break all the time, even load
balancers break, security patches need to be applied, and so on. But these are
things that are part of day-to-day operations, and most operations engineers
know how to do them. It's usually controllable. Unlike a bug in some newly
introduced code by one developer that causes a stack overflow every time a
certain application flow is hit. That requires a different kind of
discipline to solve.

I think it's a little dangerous to generalize these things without actual
numbers to back them up; before you know it, your operations team won't
have any excuse to just sit and play Quake for a week. (That last part is,
in most cases, a joke.)

~~~
dasil003
I agree the OA is a bit disingenuous about acknowledging the risk of
deployments, but the four metrics you came up with also don't tell the whole
story. _Of course_ there will be more breakage after a deployment; the real
question isn't whether that's true, it's whether subsequent deployments
become _even more_ risky when earlier deployments are withheld.

~~~
Retric
Not all risks are equivelent. There is a reason planned outages are scheduled
for ~2AM local time not ~2PM local time. Plenty of companies have dealt with
2h windows where if the site is down they fail. Some have even been down
during that time period and ended up laying off everyone.

------
tezza
Change freezes have more uses than the OP has highlighted.

We've just had Christmas and Boxing Day. There may be fewer support staff
available on those days, and the devs may be on holiday.

By having a change freeze beforehand, the set of things that have changed is
reduced, so any issue that arises will be easier to diagnose.

Fewer changes allow a firm to justify the lower support overhead... not
eliminate it

~~~
sargun
I'd say that's a straight up work freeze, not just a change freeze.

~~~
minot
Not quite a work freeze. There is a subtle difference in that only the most
critical code fixes will go in, we'd just document everything else and come up
with solutions for the rest.

So this isn't exactly a 100% code freeze. There are still critical fixes that
might go in with managerial understanding and approval.

The downside is people then start saying things like "my fix doesn't require
any code changes, only SQL changes." I am not trying to be pedantic and say
SQL is code (which it is). Rather, if the change could be done better in C#
but we do a workaround in the database layer instead, that isn't exactly
ideal.

------
tomohawk
The idea is that if you have a deployment process, you should use it.
Otherwise, you end up with "deployment by emergency", which is a process that
only proceeds when an exceptional condition occurs.

A process should normally flow and make progress. If it can only make progress
by exception, it is not a process and is probably greatly increasing risk.

An assembly line is a good example. By default, it moves forward. It is only
stopped if something exceptional occurs.

------
greenleafjacob
A corollary might be that one constraint to be optimized for is how much
maintenance is required. That is:

* A positive amount of resources should be allocated towards deciding what to
do when the disk becomes full, when memory runs out, etc., and the response
should be automated.

* When deciding between two ways to solve a problem, one factor in that
decision should be whether it injects a dependency into some other process /
function.
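On the first point, "automated" can start very small: a periodic check that
acts before the disk is actually full. A minimal sketch (the path, threshold,
and what "run cleanup" means are all hypothetical and site-specific):

```python
import shutil

def disk_nearly_full(path="/", threshold=0.90):
    """Return True if usage at `path` exceeds `threshold`
    (expressed as a fraction of total capacity)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > threshold

# In a real setup this would run from cron or a systemd timer and
# trigger a pre-decided action (rotate logs, purge caches, page
# someone) instead of waiting for writes to start failing.
if disk_nearly_full("/", threshold=0.90):
    print("disk nearly full: run cleanup")
```

The decision about *what* to do lives in the pre-decided action, which is
exactly the resource allocation the parent comment is asking for.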

------
beejiu
Nothing wrong with 'feature freezes' rather than complete code freezes. It's a
great opportunity to get work done that you've been putting off all year. If
management isn't expecting new features, it's a perfect time to fix bugs,
improve reliability, and experiment.

------
nailer
The author hits it on the head: financial institutions don't understand that
stasis also has risk. I spent a couple of years at an investment bank that
happily paid 5000 USD per server per year to run an out-of-date,
critical-bugs-only copy of Solaris 8 as their production OS.

------
lostcolony
The author seems to be mixing his messages.

His second sentence ("What really happens, of course, is that the system in
question becomes booby-trapped with extra risk. As a result, problems are more
likely, and when there is even a slight issue, it has the potential to
escalate into a major crisis.") I -completely- disagree with, as
phrased/contextualized, both from a theoretical perspective and from
experience (week-long change freezes prior to important events have led to
fairly straightforward operation cycles, whereas deployments sometimes lead
to unexpected new features, new UI elements, new bugs, etc., which people on
a trade show floor probably don't want to have to deal with when demoing).

However, his conclusion (the big long paragraph at the end that I won't
bother to quote) is sufficiently vague and abstract, and filled with enough
good advice (while ignoring subtleties and specifics, such as when he admits
'Frozen systems can run as-is briefly' but then immediately goes on to
describe the issues with leaving them running as-is for long periods), that
I can't directly disagree with it.

In short, I'm not really sure what to do with this. "Improve your deployment
processes!" Well, sure, that's always a good thing. But that's mostly
orthogonal to whether we have change/deployment freezes (they don't happen
because of fear of the deployment process, but because of fear of the new code
and the changes it brings). And the entire argument seems to posit that all
change freezes are bad, yet he both throws a bone that they at least are
tenable for brief periods in his conclusion, -and- ignores the body of
evidence pretty much everyone has that short change freezes -do- grant
comparative stability, which may make for sound business decisions (I
personally have sat in a demo where key functionality was broken because of a
last minute check-in of some library code from someone on another team that
they had not properly tested. That stuff -happens-).

Had the tone been changed to, sure, acknowledge all the risks inherent in a
change freeze, but rephrased the arguments to show that as time goes on an
inflection point is reached where the cons outweigh the benefits, and
posited that that point comes much faster than you might think, I'd be
completely on board. As it is... meh.

EDIT: Maybe the author is specifically targeting web applications with
frequent pushes to prod, i.e., the code you're freezing hasn't had time to
shake out any of its issues, as compared to a versioned release that has been
in production, and patched as necessary, for a week or two prior to the
freeze, and the code freeze is just applicable to new versions, not patches
considered critical.

