
Why are you still deploying overnight? - gnubardt
http://briancrescimanno.com/2011/09/29/why-are-you-still-deploying-overnight/
======
ekidd
Sometimes, it's cheaper to live with 10 minutes of scheduled maintenance at a
low-traffic hour than to re-engineer your application to have 99.999% uptime.

Even if you have unit tests, continuous integration, a staging environment,
and fast rollbacks, deploying to production can still be tricky:

1) Do you want to run MySQL, PostgreSQL, Oracle or SQL Server? Watch out for
ALTER TABLE; it may sometimes hold table locks or invalidate open cursors. Of
course, you could always rewrite your application to use a schema-free NoSQL
database and migrate your records at read time. But that has a lot of
consequences. Maybe 10 minutes of scheduled maintenance is an acceptable
tradeoff?

2) Does your staging environment exactly match your production environment?
Can you generate a full-sized, real-world load on your staging servers?
Alternatively, have you built all the tools required for a phased deployment
of your back end?

Now, I'm all in favor of deploying to production 5 times a day. That's pretty
easy for any group with good unit test coverage and a fast rollback mechanism.
But it's _expensive_ to eliminate the occasional 10 minute maintenance window.
And if you're going to take the site down for maintenance, it makes sense to
do it at an off-peak hour.

~~~
InclinedPlane
I don't think the author's point is that everyone should be doing middle-of-
the day seamless deployments from day one. I think the point is that if you
continue to do deployments at 3am indefinitely you are hurting yourself.

There are lots of things that make sense at the appropriate scale. Not having
any tests. Completely manual processes for builds and deployments. Having only
one server. Making backups every day or maybe even just every week. Etc.
However, as your business grows you need to recognize that it's important to
change these things and move to better processes. Because if you don't then
those things could become a very serious drag on development velocity and even
business capability. You could find yourself spending all your time drowning
in process when you should be spending your efforts more efficiently.

You stay up until 3am to perform a risky manual deployment, and sometimes you
spend time fixing problems afterward. As a consequence, you either spend a
day in zombie mode with limited productivity or you start work much later.
In either case you've taken away a fairly decent chunk of time that could have
been used for productive work. Similarly, perhaps you spend too much of every
day fighting your build system, or recovering from your build always being on
the floor, or excessive operational support because your platform doesn't have
sufficient redundancy at every tier. Etc, etc, etc.

~~~
MartinCron
_middle-of-the day seamless deployments from day one_

From my experience, it's easier to start there from day one than to try to get
there after years of dysfunctional deployment processes.

~~~
wisty
Is there any way to do fast database schema migrations? This is not a nitpick;
I've posted a question on SO about this
([http://stackoverflow.com/questions/6740856/how-do-big-
compan...](http://stackoverflow.com/questions/6740856/how-do-big-companies-
like-say-facebook-do-migrations-without-having-downtime)).

As suggested above, you could use NoSQL. But NoSQL still has schemas, in a
sense - if you decide to restructure anything you may still need to knock the
db out for a while (or write _really_ hairy code to work around two different
versions of your data, but be unable to restructure stuff because you're too
scared to change anything on the db).

~~~
arohner
I'm being serious, but also snarky: Don't do schema migrations.

I saw a talk by Brett Durrett, VP Dev & Ops at IMVU last week. They don't make
backwards incompatible schema changes. Ever. Also, on big tables, they don't
use alter statements ever, because MySQL sometimes takes an hour to ALTER.

Instead, they create a new table that is the "updated" version of the old
table, and then their code does copy-on-read. i.e. they look for the row in
the new table, if it's not there, copy it out of the old table and insert into
the new table. Later, a background job will come through and migrate all old
records into the new table. Eventually, the old table will be deleted and the
copy-on-read code will be removed.

It's a lot of extra work, but they think it's worth the effort.

I need to finish my blog post on the rest of the talk...
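
The copy-on-read pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not IMVU's actual code: the "tables" are plain dicts standing in for the old and new MySQL tables, and `upgrade_row` stands in for whatever the schema change does.

```python
def upgrade_row(row):
    # Whatever transformation the "schema change" represents; here we
    # just add a field with a default value (illustrative only).
    migrated = dict(row)
    migrated.setdefault("status", "active")
    return migrated

def read_user(user_id, new_table, old_table):
    # 1. Prefer the new table.
    if user_id in new_table:
        return new_table[user_id]
    # 2. Fall back to the old table, migrating the row on the way through.
    if user_id in old_table:
        new_table[user_id] = upgrade_row(old_table[user_id])
        return new_table[user_id]
    return None

def backfill(new_table, old_table):
    # The background job: migrate anything reads haven't touched yet.
    for user_id, row in old_table.items():
        if user_id not in new_table:
            new_table[user_id] = upgrade_row(row)
```

Once `backfill` has run to completion, every read hits the new table, and both the old table and the fallback branch can be deleted.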

~~~
InclinedPlane
This x infinity. Any time you have a thought to do something that will lock an
entire table that is active in production you should think about another way
to do it. The read-through "cache" with rolling back-compatibility method is a
great way to make such breaking changes without causing significant downtime.

No matter what you end up doing it'll be extra work. It's best to find a way
of working that provides enough flexibility to continue developing at a good
pace without being an operational nightmare. Generally that means you'll have
to take things a bit slower, but that'd be true almost regardless of the
technology you're using; schema changes rarely have a trivial impact.

------
untog
While there is some truth in there, I feel like some of the advice is pretty
reckless:

"Problem 1: You presume there will be problems that impact availability. You
have no confidence in your code quality; or (or maybe, and), you have no
confidence in your infrastructure and deployment process."

Or you're playing it safe. You absolutely cannot guarantee that every update
you are deploying will have zero problems. If your business absolutely relies
on users making payments online or anything of that ilk, you could lose a lot
of money.

"Imagine, for a moment, that your team is rolling out an update to a service
that monitors life-support systems in hospitals."

What? Why? Throwing out hypothetical "what if" scenarios that would affect
0.1% of your readership isn't a very useful thing to do.

~~~
brown9-2
But I think the author's point is that if you find yourself worrying that a
deployment has even a sliver of a chance of causing downtime, you should be
spending your energy on finding ways to eradicate that risk rather than
proceeding in the middle of the night.

Besides, in the unlikely event of deployment-caused downtime and problems, who
is to say you have enough time to restore service before your customers come
online? By taking advantage of the clock you've only given yourself a few more
hours to deal with any problems, rather than finding ways to not have any
problems in the first place (canary-in-the-coalmine-style deployments, etc).
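
A canary-style deployment of the sort mentioned above can be sketched with a simple router: a small, stable fraction of users is sent to the new version while everyone else stays on the old one. This is an illustrative sketch, not from the article; the hash makes each user's assignment stable, so one user sees a consistent version across requests.

```python
import hashlib

def canary_bucket(user_id: str) -> int:
    # Map a user to a stable bucket in [0, 100).
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def route(user_id: str, canary_percent: int) -> str:
    # Send users in the first `canary_percent` buckets to the new version.
    return "new" if canary_bucket(user_id) < canary_percent else "old"
```

Ramping the rollout is then just raising `canary_percent` from, say, 1 to 100 while watching error rates.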

~~~
quanticle
>But I think the author's point is that if you find yourself worrying that a
deployment has even a sliver of a chance of causing downtime, you should be
spending your energy on finding ways to eradicate that risk rather than
proceeding in the middle of the night.

Right, and that's where I strongly disagree with the author. Sometimes finding
and taking care of that last 0.1% risk of failure just isn't worth it.
Sometimes you're better off babysitting the deployment for 30 minutes once or
twice a month rather than spending valuable development hours updating your
deployment scripts to be fully automatic and handle every possible
contingency.

~~~
MartinCron
_Sometimes you're better off babysitting the deployment for 30 minutes once or
twice a month rather than spending valuable development hours updating your
deployment scripts to be fully automatic_

If your deployment scripts are fully automatic, you can enjoy the (many)
benefits of deploying more often than once or twice a month.

~~~
rhizome
Given no other dependencies, sure.

------
unreal37
I was hoping the author would have some deeply insightful answer to the
argument, "because the deployment requires the site to go down for a few
minutes and we want to avoid inconveniencing customers" but he avoids that
topic entirely.

That's the number 1 reason for an overnight deployment, by far! If you can
deploy code without bringing the site down, then of course, do it during the
day!

~~~
maratd
Bingo. Deployment with us is automated. I don't actually stay up to 3 AM.
However, during the deployment the machine is automatically updated with new
packages and the new version of the software, caches are cleared, it's
restarted, etc. All of this probably takes 5 minutes at most, but still ...
why do that in the middle of the day?

Also, we do a daily backup around midnight. If the deployment is botched, I
can come in early in the morning, notice it and simply do a roll-back using
the backup. Very easy.

Now, if you're actually staying up until 3 AM and doing things manually ...
you need to automate things.

~~~
jonstjohn
Depends on the type of site/application. If your application is managing
customer data that can be updated 24 hours a day, a few minutes of down-time
might be okay, whereas rolling back 6 hours of data (and losing it) is
definitely not okay.

------
kstenerud
This is incredibly reckless and naive advice.

You roll out when it least impacts customers because:

1\. Impacted customers affect your bottom line. Minimize that and you
minimize losses if something screws up.

2\. Murphy's law. I don't care how much testing and QA you do; stuff will
ALWAYS creep through, and sometimes it'll be nasty. QA works to minimize this,
but it can't eliminate it entirely due to diminishing returns. Show me your
guaranteed bug free deployed code and then I'll consider changing my view.

3\. If you've designed and tested your rollback procedure before you actually
need it, the chances of being unable to roll back during the real deployment
are orders of magnitude lower than the chances of a failed deployment
requiring a rollback, which in turn, if you've done your testing and QA, is
orders of magnitude less likely than a successful rollout (but not zero, thus
the midnight rollout).

If you're worth your salt, you have a tested rollback procedure, laid out in
simple to follow instructions (or better yet, an automated rollback mechanism
with a simple-to-follow manual process when the automated method inevitably
fucks up).

You rollout, and if it fails, you roll back. And if that fails, you use the
manual procedure. You should have the entire process time boxed to the worst
case scenario (assuming successful manual rollback) so that you know
beforehand what the impact is, and won't need to go around waking people up
asking what to do.

~~~
lusis
Rollbacks are a myth. You can never rollback. Always be rolling forward.
Enabling a culture and environment that allows for small frequent changes
solves that problem.

The way to not impact a customer is to make deploys trivial, automated and
tolerant to failure because everything fails.

~~~
MartinCron
_Always be rolling forward_

I basically agree with this idea, but when I'm selling people on the idea of
making deployments trivial non-events that happen in the daytime, having the
notion of "if something goes wrong, you can very easily jump back" gives
people a sense of security.

In practice, when things go wrong, I've found it easier to roll forward than
to roll backwards.

------
xbryanx
I'm befuddled by these sorts of posts that are big on theory and light on
details. I run tests, via a continuous integration server, and have high
confidence in our code when we deploy. However, we do updates at night because
we often need to take our sites offline for a while to merge our changes
into the live codebase. If we didn't, changes to the database structure would
bork the db as visitors clicked around. Nighttime is our lowest volume of
visits, so that's the best time to deploy.

I'm sure this is a noob reaction, but can anyone point me to some good
technical walkthroughs about how to deploy to live without taking your site
down or interrupting your database connection? Is there a tool/term/practice
I'm missing?

~~~
mechanical_fish
There won't be general-purpose walkthroughs, because the general answer is
"don't change the database structure of a database that's serving live
traffic: You'll never enumerate all the things that could possibly go wrong."

In most _specific_ cases, there are alternatives to stopping the show during
an upgrade. First, you check your backups! Then, there are various strategies.
You can make many simple schema changes while the system is running. Or put
the DB in read-only mode and still mostly manage to serve pages -- perhaps
with an occasional HTTP 503, so hopefully that's okay -- during your upgrade.
Or phase in changes over several releases, carefully architecting for backward
compatibility. Or bring up a parallel system and gradually migrate active
users to it. Depending on what you're doing, you may find yourself having to
write a special upgrade script that migrates old formats to new formats, or
even using database triggers to keep "old" and "new"-format tables in sync
during the transition. A well-constrained database may help keep things
sensible during the transition -- or, you may have to drop half of your
supposedly-sensible constraints just to make the transition work.

Moral: Even if you know exactly what you're doing, live updates are more work:
More planning, more code, more infrastructure, and/or more stress. In many
real-world cases, you should just take the system down for a minute. Focus
your engineering effort on making sure that "minute" is as short as possible,
and in making sure that you can detect problems and roll back as quickly as
possible.
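
The trigger-based strategy mentioned above (keeping "old" and "new"-format tables in sync during a transition) can be sketched in a few lines of SQLite. The table and column names here are invented for illustration: old application code keeps writing to `users_v1`, a trigger mirrors each insert into `users_v2` (which adds a column with a default), and new code can read the new format immediately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_v1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users_v2 (id INTEGER PRIMARY KEY, name TEXT,
                           status TEXT NOT NULL DEFAULT 'active');

    -- Mirror writes from the old table into the new one during the
    -- transition window.
    CREATE TRIGGER sync_v1_to_v2 AFTER INSERT ON users_v1
    BEGIN
        INSERT INTO users_v2 (id, name) VALUES (NEW.id, NEW.name);
    END;
""")

# Old application code, unchanged, writes only to users_v1 ...
conn.execute("INSERT INTO users_v1 (name) VALUES ('alice')")

# ... and the row is already visible in the new format.
print(conn.execute("SELECT name, status FROM users_v2").fetchall())
```

Once all readers and writers are on `users_v2`, the trigger and the old table can be dropped.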

~~~
rorrr

        > the general answer is "don't change the database
        > structure of a database that's serving live traffic:
        > You'll never enumerate all the things that could
        > possibly go wrong."

And what if you need to? Even adding an index on a large table can slow
things down tremendously, let alone adding or deleting columns with indexes.

"Don't do it" is not really an answer.

        > Or put the DB in read-only mode and still mostly
        > manage to serve pages

Oh yeah, let's put our credit card processing app in read-only mode during
the day. What could possibly go wrong, just those silly 503s.

~~~
mechanical_fish
Perhaps I need to reiterate:

The reason there aren't a lot of general-purpose walkthroughs is that _there
is no general case_.

The closest thing we have to a general-purpose solution is: Take the system
down, do the upgrade, put the system back up. But, yes, this is often not a
very good answer. It is often a lousy and expensive answer. In which case you
hire engineers to build a better strategy. And I can't tell you, dear reader,
what that better strategy is going to be, because it's different from case to
case, and I don't know what your problem is.

And, yes, of course you shouldn't use the sloppy strategies on your financial-
transaction processing app. Just as you probably shouldn't spend three
engineer-weeks designing a complex zero-uptime rollout strategy for your blog
comment system.

------
amalcon
The only good point is #2: you want your people to be awake during the
deployment so they can deal with any problems. The others basically amount to
"You should have other processes in place to avoid the same problems you're
trying to minimize."

Three words: Defense in depth. You don't always need every advantage you can
get, and this is a pretty costly one. Still, it's simply wrong to assume that
a particular precaution is always needless.

Interestingly, you hit a bit of a sweet-spot if your primary customer base is
located halfway around the world. You can roll out updates when your people
are most wide awake, but when you'll only break things for a few users if
something goes wrong.

------
dpcan
I deploy and update certain things at night because they REQUIRE down-time. I
can't have people making certain updates while something is being moved, for
example, so things have to be shut down for a few minutes.

That whole line about trusting your infrastructure, etc, is hog-wash. I say,
don't trust your infrastructure. Don't trust your code. Be safe, be smart, do
things at a time when the least number of people could be affected by a
problem.

Humans are responsible for everything that is happening in a deployment, and
humans make mistakes. So, no, I don't trust my servers or my code at a time of
deployment.

It would be irresponsible of me to do so even if I had the most talented
developers, and the most solid and secure platform in the world.

~~~
jonstjohn
Others would argue that you can architect your infrastructure/application so
that you don't require down-time for changes, ever.

------
hopeless
It's also interesting how U.S.-centric this thinking is: your sneaky 3am
deploy in the valley could mean a mid-morning outage in London.

Actually, an 8am deployment in Europe has the advantages of a rested brain and
a few hours before America hammers the Internet ;-)

~~~
virmundi
Many US-based software companies either A) have a mostly American user base or
B) have a regional distribution setup for their app. So if you are a non-
American user in camp A, that is a problem for you. However, remember the goal
is to be the least disruptive. If you are a non-American in a region that the
server is deployed in, there is probably a night-based takedown of the system
in your region.

If the software company is neither A nor B, then it needs to notify its non-US
users that they may experience a disruption.

------
jakejake
I do think 3AM is a terrible time to do a release. I'm personally a fan of
doing releases in the early morning instead of late night. I'd rather get in
the office at 6AM & push the release. Plan for a light day and let everybody
leave early if things went smoothly. If things do go wrong, it gives you about
2-3 hours to roll back or fix the situation before the rest of the office
starts showing up.

Choosing the time should be a balance of off-peak usage time and a reasonable
time for the developers. If you're a global business then you may not have a
perfect time, but you can usually still choose a day and time that is less
likely to cause inconvenience.

------
schleyfox
There are deploys and then there are deploys. 99% of your deploys should be
fine to go live whenever the code hits the repo. I definitely still prefer
scheduled overnight deploys whenever significant infrastructure changes need
to go live. Downtime should still be minimized, but sometimes a maintenance
window is necessary.

------
whazzmaster
I attended a git workshop by Scott Chacon and was surprised to learn that
github's production site can be deployed by many folks on the development
team, and at any time during the day. In fact, he mentioned that github.com
can be updated as many as a dozen or more times per day as small features or
fixes are pushed live.

There is a great description of their web server stack[1] that allows them to
seamlessly push new code live without downtime (see the second "Slow Deploys"
section). It's a great illustration of how to continue serving during a
deployment, and I'm looking into using Unicorn like this for some projects I'm
working on.

[1] <https://github.com/blog/517-unicorn>

------
coreycollins
The article discusses deploying to a percentage of your user base at a time. I
have an idea on how to do this, but it is probably the wrong way.

Anybody know of any articles explaining this process for a rails app?

~~~
amackera
I am also interested in this concept. What if a change requires a migration of
my database?

~~~
GlennS
A near-instantaneous switchover is possible if you're migrating a database to
a new schema, but it's pretty hard work.

For this to work, every action that will result in a database write needs to
go into a log, and that action log needs to be replayable by the updated
version.

In the context of a web application, you have:

1\. Take your database backup, start recording action log.

2\. Perform your database migration against your database backup.

3\. Install your updated web application (not available to users yet).

4\. Test your updated web application.

5\. Disable your existing web application.

6\. Replay the contents of your action log using the updated web application.

7\. Make your updated web application available to your users.
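
The log-and-replay steps above can be sketched as a toy in-memory version. All names here are illustrative: while the migration runs against the backup, every write is appended to an action log; after cutover, the log is replayed through the updated application so no writes are lost.

```python
class App:
    def __init__(self, version, log=None):
        self.version = version
        self.db = {}       # stand-in for the real database
        self.log = log     # list to append actions to, or None

    def apply(self, action):
        # Apply a write, and record it if we are in the logging window
        # (steps 1-5 above).
        key, value = action
        self.db[key] = value
        if self.log is not None:
            self.log.append(action)

def replay(log, app):
    # Step 6: re-apply everything recorded since the backup was taken.
    for action in log:
        app.apply(action)
```

For example, writes that arrive while the migration is running land in the old app and the log; replaying the log into the updated app brings it up to date before it is made available to users.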

It's definitely not a trivial matter to sort out: it means a lot of time
working on your deployment process. For example, you have to be set up to have
both versions deployed simultaneously, which likely means a lot of mucking
about with URL rewriting. The application also has to be built to support it.

Maybe there is a better way?

~~~
lusis
There is a better way. It's to stop deleting and start deprecating. Write your
code in a backwards compatible way. If your storage engine, whatever it is,
can't handle it then switch storage engines.

The author makes some amazing points that I'm surprised intelligent people are
missing.

Let me give an example. We are currently migrating from storing blobs of data
in Voldemort (a key/value store) to storing them in S3. They should have never
been in there in the first place but whatever. We're going to do it with
"zero" migration time. In fact we're already doing it.

\- Set up a job that copies existing data in Voldemort to S3.

\- Deploy a minor release of our code that multiplexes the current writes to
both Voldemort and S3.

\- Continue migrating existing data

\- When existing data is finished migrating, deploy a new release that forces
all traffic to S3 instead of Voldemort.

\- Profit
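
The multiplexed-write step above can be sketched like this. This is a hedged stand-in, not the actual code: the two stores are plain dicts rather than real Voldemort and S3 clients, and the names are invented.

```python
class BlobStore:
    def __init__(self, old_store, new_store, read_from_new=False):
        self.old = old_store
        self.new = new_store
        # Flipped to True in the final release, once migration is done.
        self.read_from_new = read_from_new

    def put(self, key, blob):
        # Multiplex: keep both stores in sync during the migration.
        self.old[key] = blob
        self.new[key] = blob

    def get(self, key):
        store = self.new if self.read_from_new else self.old
        return store.get(key)
```

Because every new write lands in both stores, the background copy job only has to deal with pre-existing data, and flipping `read_from_new` is a safe, instant cutover.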

People need to learn to do things like dark launching and feature flags. Dark
launching lets you exercise new code paths with no impact to the user.
Feature flags give you the ability to enable features for some or all of
your users. Feature flags are awesome ways to A/B test as well.
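
A minimal feature-flag helper of the kind described above might look like this. The flag names and storage are illustrative (real systems usually keep flags in a database or config service): a flag can be off, on for everyone, or on for a stable percentage of users, which is also what makes A/B tests easy.

```python
import zlib

FLAGS = {
    "new_checkout": {"enabled": True, "percent": 10},
    "dark_launch_search": {"enabled": True, "percent": 100},
}

def flag_enabled(flag_name, user_id):
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Hash the (flag, user) pair so each user gets a stable answer, and
    # different flags slice the user base differently.
    bucket = zlib.crc32(f"{flag_name}:{user_id}".encode()) % 100
    return bucket < flag["percent"]
```

Call sites then become `if flag_enabled("new_checkout", user.id): ...`, and turning a feature off is a config change rather than a deploy.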

People need to stop doing stupid shit like redefining what some property means
mid-release and instead define NEW properties and deprecate old ones in
subsequent releases.

Same goes for schema changes. If some part of your code base cannot tolerate
an additional column that it doesn't need, that's a bug.

You can also adopt some configuration management that allows you to provision
duplicates of the components you're deploying so you can swap back and forth
between them in the case where you might have a breaking release. That's what
we do and it's one of the upsides of using EC2.

All of this requires discipline and dedication but the benefit is so worth it.
You have to stay on top of dependencies. You can't let bitrot take hold by
going 4 major revisions without upgrading some package. We did and it bit us
in the ass.

This is why we adopted the swinging strategy of duplicating our entire stack
(takes about 30 minutes depending on Amazon allocation latency) on major
upgrades.

------
127001brewer
_"...I sincerely hope you’re not living in the Stone Age..."_

Whether or not you agree with the entire article, for me, the statement above
was the most important one. There is no excuse for any software development
team not to have an automated build and deploy process.

There may be valid business reasons to deploy in the middle of the night, but
there's no business reason to deploy by hand.

Also, I am assuming that the posting only reflects application source code and
not database changes, which is a different beast in terms of deployments.

------
chris_dcosta
Couldn't agree more.

I worked somewhere that only deployed during business hours because it was all
hands on deck. We just kept the users primed with information and reminders
prior to go-live, and when it happened everyone knew about it.

But there's a more fundamental point. We're dealing with IT, and for internet
businesses and global enterprises the issue is the same: business is 24 hours
a day 7 days a week. You cannot avoid deploying somewhere in the world that's
awake.

All you have to do is plan a rollback and just get on with it.

My experience of companies that are timid about rolling out changes is that
their state of advancement is completely unstable. They are too scared to
change, meaning that some technologies fall out of sync with the current
state of play. You only have to introduce a new piece of whiz-bang software
and issues come pouring out, and it's a frantic race to upgrade random
components in the hope you might plug the leak. This is far, far worse in
every way than doing regular "risky" upgrades.

Small incremental changes can be easily rolled back, because you know what you
just changed.

------
Roboprog
Why in the middle of the night? The economics of things, mostly.

At my previous job, we were still in the process of getting session migration
to work so servers within a cluster could update while others were running.

At my current job, we don't have a cluster (we have a single over-provisioned
server), and that method simply isn't in the budget for the foreseeable
future. However, 3:00 AM PDT is 3:30 PM IST (India), which is where most of my
team is, as well as the support staff they need (most staff there are on swing
shift, as well). Thus, an early-morning release is actually a convenient,
lower-cost time to do the release. As the site serves mostly US customers in
both B2C and B2B use cases, this timing (around 6:00 AM EDT) works very well,
and I don't see it changing.

I realize this doesn't scale for a global app, but it explains the lack of
pressure to spend much consideration on higher tech, more expensive,
deployment methodologies.

------
Vitaly
Or maybe a client simply insists that customers absolutely can't tolerate a
delay of several seconds during the deployment ;(

------
FollowSteph3
It's not just because your developers are tired, which is a big issue; it's
also because you may not have access to the people you need to make decisions
in the middle of the night if something needs to be escalated. Good luck
trying to get an executive or VP up at 4am to make a decision...

------
rilindo
"Why are you still deploying overnight?"

Because you don't hire a system admin to do that job for you. :}

------
krobertson
I think a point also missed in the article is the startup notion of burning
the midnight oil and pushing stuff live.

"Release early, release often" is great, but that same call to action can be
what motivates people to push code live at 3am when they're tired, just
finished it, and perhaps skipping QA.

Otherwise, I think the article as written is empty and reads like a straw man:
it assumes you have no confidence in your code, assumes deployment is
complicated, assumes you can't roll back.

Simply put, deploy small stuff often, make plans for big stuff, and don't be
stupid.

------
Hominem
I rarely have issues, but there is always a chance that something is going to
go wrong, so everyone sits there waiting for the all clear. I was able to
make the case that having developers, DBAs, and sysadmins sitting around in
the middle of the night was counterproductive. It was fraught with tension,
coming in the middle of the night after a long day at work. Now we do it at
1pm on Saturday, and it is almost leisurely: just a quick call to check in.
Some people even call the voice bridge from the beach!

------
nimblegorilla
Let's be realistic. Many applications have 24 hour availability, but almost
every application I've worked on was for people in the US and core usage was
sometime between 7AM and 9PM.

If it only takes 5 minutes to deploy your app why not do it at 2AM when few
users are online? Deployments should be simple, easy, and NOT require any
problem solving by the deployer. Midday deployments with all hands on deck to
handle issues seems like a bigger problem to me.

------
RyanMcGreal
The best reason not to do overnight deployments is that if something goes
wrong, you want your developers to be alert and sharp, not dazed and sleep-
deprived.

------
jroseattle
Because our environment is non-trivial. It has nothing to do with confidence
in application code, and everything to do with service-level agreements,
impact to existing customers, and our own psyche. Ask SalesForce if they do
their maintenance updates on weekends because of a lack of confidence in their
codebase.

We're not doing 2am deployments, but we're certainly doing them "off-hours".

~~~
lusis
Triviality has nothing to do with it.

Nothing in your SLA says you have to have servers 1, 2 and 3 available. It
says you have to have service X available. You can do that with out taking a
downtime hit.

~~~
jroseattle
Wow, thanks for having better insight into our production environment than
myself.

~~~
lusis
Feel free to correct me then. I have YET to work anywhere in any industry
(financial primarily) that stated we had to have specific hardware up and
running as part of an SLA.

It's an "SERVICE LEVEL AGREEMENT" not a "SERVER LEVEL AGREEMENT".

~~~
jroseattle
> I have YET to work anywhere in any industry

Sounds like your experience is the only one that matters, then.

------
casenelson
I started writing a reply but it turned into a blog post, here:
([http://casestatement.tumblr.com/post/11598425762/why-we-
are-...](http://casestatement.tumblr.com/post/11598425762/why-we-are-still-
deploying-overnight))

