

Weekend deployments are for chumps - rockhymas
http://blog.bitquabit.com/2011/03/08/midnight-deploys-are-for-idiots/

======
viraptor
I can't believe how misguided this post is...

1\. They provide a service to people around the world, yet they don't ensure
that someone is available as an emergency contact on Sunday evening, when the
first post-deployment usage happens.

2\. They don't have a universal list of "this breaks, contact that guy".

3\. They don't have a known instant rollback procedure for a release.

4\. They don't have cross-component integration tests and they don't do them
manually either.

5\. They decide that since they can't do a release that doesn't break stuff
and can't organise themselves to resolve it quickly during the weekend when it
affects only a small number of people, they'll do releases in the middle of
the day now, so that they hear customers complaining right away.

Is that for real? Is he serious? Here's what I would get out of that issue
(even if it's basically reiterating the "wrong" things above):

They need to do more integration testing before a release. They need to know
who to contact and make sure that person is on call and ready for action. The
person handling the issue needs a simple, quick way to reverse the release
without manual intervention (tweaking the code). Again, this specific issue
should get regression tests right away. And the most important thing - NEVER
treat your customers as a test suite.

Of course I'm aware that not everyone can afford to operate like that. But at
least this could be their goal. "Let's make breakage affect more people, so we
know about it earlier and when we're at work" is a really silly conclusion.

~~~
gecko
__EDIT__: I posted a rundown of our deployment process, including where and
how tests happen, and why they failed to catch this bug, at
<http://news.ycombinator.com/item?id=2301680>.

While I'm sure there's a lot of stuff we could improve, the situation's not
exactly as you describe.

Responding to a few points:

1 & 2\. We do have a list of "if this breaks, contact this guy." What we don't
have (in response to your first point) is a demand that those people be
available Sunday night.

3\. We have a known rollback procedure. It does not work if we do an
irreversible schema change _and_ the problem's not caught until 20 hours
later. We couldn't just throw out 20 hours of data.

4\. We actually do a lot of testing. Beginning on Wednesday, we deploy to our
early leak accounts. We steadily increase that through the week. The problem
with this particular bug is that you could use Kiln lightly (most of our test
accounts are not large accounts) without hitting this problem at all. Even the
full QA test suite did not trigger the problem. That happened because Kiln
was designed to keep working through a FogBugz communication failure until it
couldn't, and how soon it hit that point was directly proportional to how
much you used Kiln. The real problem here, which has been fixed, is that Kiln
should _not_ attempt to hide a problem communicating with FogBugz.

5\. We don't do releases in the middle of the day. We do them at 10 PM. I
have no idea where you got that.

There's a lot we can improve. We need to make sure that Kiln hard-fails when
it can't talk to FogBugz, instead of trying to continue until the failure
brings Kiln down. We need to make sure that all hands are on deck when people
are going to work, as you noted, which is vastly easier to do midweek than
Sunday night. And we probably ought to add more automated testing at the
integration points. But I think you're painting a somewhat unfair picture of
the current situation.
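
To make that hard-fail concrete: below is a minimal sketch of the difference,
with hypothetical names rather than Kiln's actual code. The old behaviour
quietly fell back to stale cache on failure; the new behaviour raises the
moment a refresh fails, so monitoring sees the outage immediately.

    import time

    class FogBugzUnavailable(Exception):
        """Raised as soon as a FogBugz call fails, instead of hiding it."""

    class FogBugzClient:
        def __init__(self, fetch, cache_ttl=3600):
            self._fetch = fetch      # callable performing the real API call
            self._cache = {}         # key -> (value, fetched_at)
            self._cache_ttl = cache_ttl

        def get(self, key):
            # Serving fresh cache is fine; caching isn't the problem.
            cached = self._cache.get(key)
            if cached and time.time() - cached[1] < self._cache_ttl:
                return cached[0]
            try:
                value = self._fetch(key)
            except Exception as exc:
                # Old behaviour (roughly): fall back to any stale cached
                # value and carry on, hiding the outage until the cache
                # finally ran dry. New behaviour: fail loudly, right now.
                raise FogBugzUnavailable(f"FogBugz call {key!r} failed") from exc
            self._cache[key] = (value, time.time())
            return value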

~~~
jacques_chester
> What we don't have (in response to your first point) is a demand that those
> people be available Sunday night.

Boring, un-agile places have concepts like 24/7 rosters of operations staff
and the ability to rotate "on call" duty amongst developers.

However, I agree with your conclusion that irreversible rollouts are best
performed during (your) daylight hours.

> The real problem here, which has been fixed, is that Kiln should not attempt
> to hide a problem communicating with FogBugz.

Question: why wasn't an alert raised immediately?

------
swombat
Wait, what?

You have large numbers of paying customers to whom you're delivering a
mission-critical system (source control isn't exactly optional), and your
releases involve neither automated production monitoring/continuous deployment
nor formal release procedures?

I think your problem is more than just weekend deployments!

My full comments here: <http://swombat.com/2011/3/8/fog-creek-dont-do-cowboy-deployments>

~~~
gecko
The releases are both automated (except for one component, as noted, which we
are now automating) and fully vetted.

Here is the old release process:

1\. Monday morning, the version to be used for the next release is
automatically built for the QA team, who begins running their test suites on
it and doing soft checks.

2\. By no later than Wednesday, the new version is leaked to testing and
alpha accounts on Fog Creek On Demand. Tests are re-run at this point.

3\. The leak is increased later in the week if the QA results look good, or
the weekend release is canceled, depending on how testing goes.

4\. Provided everything has been good, on Saturday night, the leak is
increased to 100% of customers. This step does not have a full QA rundown,
because the code has already been vetted several times by QA at this point.
The sanity checks are truly sanity checks.

5\. At the same time, we check that our monitoring system (Nagios) agrees
that all accounts are online and that there are no major problems, such as
massive CPU spikes.
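
(For anyone curious how such a "leak" is typically gated: one common
mechanism, sketched below with the assumption of per-account gating -- not
necessarily Fog Creek's actual implementation -- is to hash each account into
a fixed bucket and raise the percentage over the week.)

    import hashlib

    def in_leak(account_id: str, leak_percent: int) -> bool:
        """Deterministically map each account to a bucket from 0-99;
        raising leak_percent only ever adds accounts, never removes
        any, so early-leak accounts stay on the new version."""
        digest = hashlib.sha1(account_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % 100 < leak_percent

    # Midweek: in_leak(acct, 5); later in the week: in_leak(acct, 25);
    # Saturday night: in_leak(acct, 100) -- everyone.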

So far, so good. The issue with this release is we had a bug that did not
manifest for a while, because Kiln had been deliberately designed to ignore the
failure condition "as long as possible", which ended up just being too damn
long. Once we started having failures, we noticed--that's why our sysadmin
called us in--but those failures started happening 20 hours after the 100%
release, and several _days_ after testing and alpha accounts were upgraded.

I am not arguing our system is perfect, but I'm nonplussed as to where the
your-deployment-system-totally-sucks stuff is coming from. I'll ask our build
manager to post an even more detailed rundown.

~~~
bluesnowmonkey
Sincere question: how do you leak irreversible schema changes to a subset of
accounts? Isn't the point of the leak that you're not confident and might need
to reverse it? Or are you willing to let those accounts get hosed?

~~~
tedunangst
Fix it by hand. If it's ten accounts, that's pretty easy. If it's ten
thousand, more of a problem.

When you read irreversible, think "very difficult to reverse and not worth the
cost of writing and validating code we don't ever expect to run."

------
TamDenholm
I work for web dev agencies and it surprises me just how often they launch on
a Friday afternoon despite every single developer pleading with them that
it's an absolutely awful idea.

Golden rule: never launch on a Friday.

Personally, I've found it easy to persuade clients to skip it once you say
it'll cost an extra ten grand just for the privilege of a Friday launch.

------
patio11
I feel for you. On the plus side, process improvements to prevent it from
happening next time are _exactly_ how you should respond to things like this.

One which has saved my bacon numerous times is investing a few hours into
tweaking monitoring and alert systems. I hear PagerDuty exists to help with
this. I use a bunch of scripts and bubblegum, and even that caught 10 of the
last 12 big problems. Queuing systems dying has hosed me many times over the
years, for example, and a borked deploy which causes that would have my phone
ringing before I got my laptop closed.
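
A bubblegum-grade sketch of that kind of check for the dead-queue case; the
heartbeat file, addresses, and email-to-SMS alerting are all placeholder
assumptions -- swap in PagerDuty or whatever you already have:

    #!/usr/bin/env python
    """Cron this every few minutes: page a human if the queue looks dead."""
    import smtplib
    import time
    from email.message import EmailMessage

    HEARTBEAT_FILE = "/var/run/worker.heartbeat"  # workers touch per job
    MAX_SILENCE = 10 * 60                         # ten quiet minutes = dead

    def queue_is_dead() -> bool:
        try:
            with open(HEARTBEAT_FILE) as f:
                last_beat = float(f.read().strip())
        except (OSError, ValueError):
            return True              # missing/garbled heartbeat counts too
        return time.time() - last_beat > MAX_SILENCE

    def page(subject: str) -> None:
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = "alerts@example.com"
        msg["To"] = "5551234567@txt.example.com"  # SMS gateway, PagerDuty...
        msg.set_content(subject)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        if queue_is_dead():
            page("ALERT: work queue has processed nothing for 10 minutes")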

------
ww520
Deploying on a weekend or at night is a terrible idea disguised as a good
one. What we used to do:

\- No deployment on Weekend

\- No deployment on Friday

\- No deployment after 4pm on Monday to Thursday

\- Deployment is rolled out in stages: one server, then 5%, 10%, 50%, and
100% of servers (sketched after this list).

\- Rollback steps must accompany deployment steps.

\- Verification steps must be specified in the deployment ticket. Verification
is done by QA or Ops, not by Dev.

\- Common deployment and rollback steps are automated.

\- Emergency deployment is an exception to the above, but requires extra
precautions and someone babysitting the deployment process.

Stress levels went down a lot and problems got resolved much faster once we
had the above in place.
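
A minimal sketch of how the staged-rollout and rollback-ships-with-deploy
rules compose; the --fraction flag and the step commands are hypothetical
stand-ins for whatever your tooling actually provides:

    import subprocess

    STAGES = [0.01, 0.05, 0.10, 0.50, 1.00]  # one server, then widening sets

    def run(step: str) -> None:
        subprocess.run(step, shell=True, check=True)

    def staged_deploy(deploy_step: str, verify_step: str,
                      rollback_step: str) -> None:
        """Widen the rollout stage by stage; assumes stages are cumulative
        (the 10% set contains the 5% set), so rolling back the current
        fraction undoes everything deployed so far."""
        for fraction in STAGES:
            try:
                run(f"{deploy_step} --fraction {fraction}")
                run(verify_step)  # owned by QA/Ops, not the developer
            except subprocess.CalledProcessError:
                run(f"{rollback_step} --fraction {fraction}")
                raise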

------
agentultra
Amen.

Over the years, I've tried convincing many companies I've worked for that
weekend deployments are a bad idea.

Even with continuous integration tests, rolling deployments, and all the
precautions in the world, things can still go wrong.

You need live people available to handle a deployment.

Personally, I don't like working on weekends. I've worked for companies that
refused to believe that this was a bad idea. I learned pretty fast that life
is too short to work on a weekend.

If something does go wrong, it's better to have people on hand to correct the
error and get back on track. It's much easier to schedule those people during
the work week. It's not rocket science.

------
andrewvc
I agree that weekend deploys are a shitty idea, but isn't the real issue here
not being able to roll back?

~~~
gecko
Probably? I'm happy to be schooled here. 90% of the time, we can roll back
instantly, because there were no database changes. 5% of the time, we can roll
back with slightly more pain, because the database migrations were reversible.
In this case, the database migration was not reversible. If we'd noticed
immediately, we could still have just activated snapshots, but we didn't
notice until 20 hours later. What do others do in this situation?
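
(The reversible 5% case usually comes from pairing every migration with its
inverse. A generic sketch with a made-up schema -- and, as the comment notes,
exactly the pattern that stops working once the forward change destroys
data:)

    import sqlite3

    # Every schema change ships with its inverse, so rolling back the
    # release can roll back the database too. Names here are made up.

    def upgrade(conn: sqlite3.Connection) -> None:
        conn.execute("ALTER TABLE repos ADD COLUMN fogbugz_id INTEGER")

    def downgrade(conn: sqlite3.Connection) -> None:
        # Cheap here because no data is lost (DROP COLUMN needs
        # SQLite >= 3.35). Once the forward migration destroys or
        # rewrites data, there is no honest downgrade() -- that is the
        # irreversible case, where snapshots are the only way back.
        conn.execute("ALTER TABLE repos DROP COLUMN fogbugz_id")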

~~~
peterwwillis
Wait. What blew up that it took someone 20 hours to realize? The first thing
you take from that is, don't do _anything_ without double-checking your change
to make sure it worked.

In terms of rollback, just don't do anything which isn't reversible. Taking
chances with your changes is taking chances with your business. If you don't
know how to rollback whatever you're doing, ask someone who does (there is
_always_ a way to roll back or add redundancy).

~~~
tedunangst
"The failed API call turns out to be one that’s trivially cached for a very
long time, and so is one that Kiln would allow to fail without actually
dying."

------
adamzochowski
I was taught that Thursdays are best for deployment because you get Friday to
fix the stupid things, and then the weekend to fix the terrible things. By
Monday everything is working anyway.

And best of all, on Friday people are generally happy (it's the last day of
the week), whereas on Monday you should expect grumpy users.

~~~
_pius
Isn't that a Monday launch? :)

------
zeruch
Users seem more comfortable with predictable maintenance than arbitrary
outages. Weekend deploys are just bad all around.

When I began in my current role (managing QA/DBAs and app deploys) one of the
first things I killed was the late Friday/weekend deploys. They are spirit-
crushing and if they go south, they usually go south in a terminal-velocity
nose dive.

We set up early Fridays for maintenance, to give us enough time in case
something goes south. Aggressive Change Control Requests mean the people
impacted get a heads-up (including Account Managers, who in turn inform
clients) _if_ there are any user-facing impacts, and we avoid trying to pack
too much in at once.

Having QA, Engineering and the SOC team on hand is...helpful. Maybe it's
paranoid, but it's been very solid so far. When things have gone south, I
think the events, since everyone is "on deck", have actually helped build
some camaraderie in the teams themselves.

~~~
vacri
I seem to accrete job roles, and one time I was so far behind on my testing
that I went to the boss to ask for a Wed-Sunday work week so I could get some
actual work done without interruptions.

The first fortnight was great; I got through a lot of backlog on Sat/Sun.

The second fortnight sucked as I was blocked on stupidly trivial issues. That
ended the experiment.

I guess the moral of the story is to pick and choose your 'out of hours' work
wisely.

------
badmash69
In my experience, there are two kinds of deployment -- ones without DB
changes and ones that are accompanied by DB changes.

The deployments that do not require DB changes are easy -- mirror the prod
box (non-DB) onto a smaller box, then deploy upgrades/updates to the prod
box. If things go wrong, put the mirror box online with a DNS/proxy
configuration while apologizing to your customers who complain about slower
performance.

When DB changes are involved, you need to have your DBAs do a dry run of
backing out the changes -- after all, practice makes perfect. Communicate the
scheduled outage to customers, back up the DB, and mirror your production
box. Roll out the update -- if things go wrong, restore the DB and bring the
mirror box online.
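
A runbook sketch of that DB-change path, assuming Postgres tooling and
placeholder script paths rather than any specific shop's setup:

    #!/usr/bin/env python
    """Back up, deploy, verify; on any failure, restore and fail over."""
    import subprocess
    import sys

    def sh(cmd: str) -> None:
        subprocess.run(cmd, shell=True, check=True)

    def deploy_with_db_change() -> None:
        sh("pg_dump -Fc appdb > /backups/appdb.pre_deploy.dump")  # backup
        try:
            sh("/opt/deploy/run_migrations.sh")   # placeholder scripts
            sh("/opt/deploy/smoke_test.sh")
        except subprocess.CalledProcessError:
            # The dry-run practice pays off here: restore the DB and
            # point traffic at the mirror box.
            sh("pg_restore -c -d appdb /backups/appdb.pre_deploy.dump")
            sh("/opt/deploy/switch_to_mirror.sh")
            sys.exit(1)

    if __name__ == "__main__":
        deploy_with_db_change()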

I have always focused more on the DB aspect -- loss of data integrity can
cause customers to look for your replacement.

But I am not sure that weekly upgrades of a production environment with
paying customers are advisable.

------
mgrouchy
I'm lucky to run a system that is small enough that an entire deploy consists
of around 2 seconds of downtime for the server to restart and start the new
instance of the application.

We deploy new versions side by side, and then the webserver points at the new
application on restart.

The only time it takes any longer is when there are sweeping database changes
(schedule the downtime, inspect snapshots in case of issues, etc.).
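
(One common way to implement that side-by-side flip is an atomic symlink
swap; a sketch with made-up paths and a placeholder restart command:)

    import os
    import subprocess

    RELEASES = "/srv/app/releases"
    CURRENT = "/srv/app/current"   # the webserver serves from this path

    def activate(version: str) -> None:
        """Atomically repoint 'current' at the new release, then restart."""
        target = os.path.join(RELEASES, version)
        tmp_link = CURRENT + ".tmp"
        os.symlink(target, tmp_link)
        os.replace(tmp_link, CURRENT)  # rename() over the old link is atomic
        subprocess.run(["service", "app", "restart"], check=True)

    # Rollback is the same call with the previous version string.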

------
powdahound
We use PagerDuty (<http://pagerduty.com>) at HipChat and while I absolutely
loathe being woken up by it, it's helped us identify issues during off-peak
hours much more quickly.

But no matter what systems you have in place or how many hundreds of deploys
you've done, there's always a new way for things to break.

~~~
tomjen3
I am guessing that HipChat is a startup?

Because as an employee any pager I had would be left at work.

------
Hominem
Oh god. I release at 5 PM Pacific every week because users "can't have a
single second" of downtime. We manually test an ever-growing checklist of
functionality. There is always, always an issue. The angry emails start to
roll in around 5:15 Pacific.

------
krobertson
Their problem isn't their deployment process, it's their monitoring.

Blindly ignoring errors is a recipe for failure. You should always look at a
situation like that asking "how can we monitor this weak point?" Logging plus
a service like Splunk works great.

You should always have a solid on-call rotation. We have two rotations: an
ops one, which is first line, and a dev one in case deeper code changes or
more eyes on it are needed.

