Why are you still deploying overnight? (briancrescimanno.com)
113 points by gnubardt 2045 days ago | 77 comments

Sometimes, it's cheaper to live with 10 minutes of scheduled maintenance at a low-traffic hour than to re-engineer your application to have 99.999% uptime.

Even if you have unit tests, continuous integration, a staging environment, and fast rollbacks, deploying to production can still be tricky:

1) Do you want to run MySQL, PostgreSQL, Oracle or SQL Server? Watch out for ALTER TABLE; it may sometimes hold table locks or invalidate open cursors. Of course, you could always rewrite your application to use a schema-free NoSQL database and migrate your records at read time. But that has a lot of consequences. Maybe 10 minutes of scheduled maintenance is an acceptable tradeoff?

2) Does your staging environment exactly match your production environment? Can you generate a full-sized, real-world load on your staging servers? Alternatively, have you built all the tools required for a phased deployment of your back end?

Now, I'm all in favor of deploying to production 5 times a day. That's pretty easy for any group with good unit test coverage and a fast rollback mechanism. But it's expensive to eliminate the occasional 10 minute maintenance window. And if you're going to take the site down for maintenance, it makes sense to do it at an off-peak hour.

I don't think the author's point is that everyone should be doing middle-of-the-day seamless deployments from day one. I think the point is that if you continue to do deployments at 3am indefinitely you are hurting yourself.

There are lots of things that make sense at the appropriate scale. Not having any tests. Completely manual processes for builds and deployments. Having only one server. Making backups every day or maybe even just every week. Etc. However, as your business grows you need to recognize that it's important to change these things and move to better processes. Because if you don't then those things could become a very serious drag on development velocity and even business capability. You could find yourself spending all your time drowning in process when you should be spending your efforts more efficiently.

You stay up until 3am to perform a risky manual deployment, sometimes you spend some time fixing problems afterward. As a consequence you either spend a day in zombie mode with limited productivity or you start working much later. In either case you've taken away a fairly decent chunk of time that could have been used for productive work. Similarly, perhaps you spend too much of every day fighting your build system, or recovering from your build always being on the floor, or excessive operational support because your platform doesn't have sufficient redundancy at every tier. Etc, etc, etc.

middle-of-the-day seamless deployments from day one

From my experience, it's easier to start there from day one than to try to get there after years of dysfunctional deployment processes.

Some things are generally better and easier to do a bit before you really "need" them. Deployment, build, branching, and backup procedures almost certainly fall into that category. Performance as well, as long as it's done smartly.

Interestingly, scaling is probably the least valuable thing to "overbuild" early and yet that's by far the most common thing teams tend to do.

I would just add "test automation" and "restore from backups" to that list.

Indeed! Also, pretty much everything from the "Joel test" goes without saying.

Is there any way to do fast database schema migrations? This is not a nitpick, I've posted a question on SO about this (http://stackoverflow.com/questions/6740856/how-do-big-compan...).

As is suggested above, you could use NoSQL. But NoSQL still has schemas, in a sense - if you decide to restructure anything you may still need to knock the db out for a while (or write really hairy code to work around two different versions of your data, but be unable to restructure stuff because you're too scared to change anything on the db).

I'm being serious, but also snarky: Don't do schema migrations.

I saw a talk by Brett Durrett, VP Dev & Ops at IMVU last week. They don't make backwards incompatible schema changes. Ever. Also, on big tables, they don't use alter statements ever, because MySQL sometimes takes an hour to ALTER.

Instead, they create a new table that is the "updated" version of the old table, and then their code does copy-on-read. i.e. they look for the row in the new table, if it's not there, copy it out of the old table and insert into the new table. Later, a background job will come through and migrate all old records into the new table. Eventually, the old table will be deleted and the copy-on-read code will be removed.
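
To make the copy-on-read idea concrete, here is a minimal sketch in Python with SQLite. It is not IMVU's code: the table names (`users_old`, `users_new`), the columns, and the helpers `create_tables`, `convert`, `get_user`, and `backfill` are all hypothetical, chosen only to illustrate the pattern described in the talk.

```python
import sqlite3

def create_tables(conn):
    # Hypothetical schemas: the new table splits the old "name" column in two.
    conn.execute("CREATE TABLE users_old (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE users_new (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )

def convert(old_row):
    """Translate one old-format row into the new format."""
    user_id, name = old_row
    first, _, last = name.partition(" ")
    return (user_id, first, last)

def get_user(conn, user_id):
    """Copy-on-read: prefer the new table, else copy the row over from the old one."""
    row = conn.execute(
        "SELECT id, first_name, last_name FROM users_new WHERE id = ?", (user_id,)
    ).fetchone()
    if row is not None:
        return row
    old = conn.execute(
        "SELECT id, name FROM users_old WHERE id = ?", (user_id,)
    ).fetchone()
    if old is None:
        return None
    new_row = convert(old)
    # OR IGNORE keeps this safe if another reader copied the row first.
    conn.execute(
        "INSERT OR IGNORE INTO users_new (id, first_name, last_name) VALUES (?, ?, ?)",
        new_row,
    )
    return new_row

def backfill(conn, batch_size=100):
    """Background job: copy the rows nobody has read yet, in small batches."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM users_old "
            "WHERE id NOT IN (SELECT id FROM users_new) LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        conn.executemany(
            "INSERT OR IGNORE INTO users_new (id, first_name, last_name) VALUES (?, ?, ?)",
            [convert(r) for r in rows],
        )
        total += len(rows)
```

The sketch only covers reads and the background copy. Writes would presumably go to the new table (after a copy-on-read), and once `backfill` returns 0 the old table and the fallback branch in `get_user` can be retired.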

It's a lot of extra work, but they think it's worth the effort.

I need to finish my blog post on the rest of the talk...

This x infinity. Any time you have a thought to do something that will lock an entire table that is active in production you should think about another way to do it. The read-through "cache" with rolling back-compatibility method is a great way to make such breaking changes without causing significant downtime.

No matter what you end up doing it'll be extra work. It's best to find a way of working that provides enough flexibility to continue developing at a good pace without being an operational nightmare. Generally that means you'll have to take things a bit slower, but that'd be true almost regardless of the technology you're using, schema changes rarely have a trivial impact.

Would love to see that blog post since a lot of people struggle with schema changes, particularly 'alter table'.

I tend to think in terms of "augmentations" instead of migrations. Backwards-compatible data changes are the way to go. Adding tables/columns is better than removing tables/columns.

My app, on startup, ensures that it has all of the tables (with all of the columns) and any seed data that it may need (by issuing CREATE TABLE and/or ALTER TABLE). This allows me to simply roll out new (tiny, incremental) database changes and code and be sure that it works.
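
A rough sketch of that startup check, in Python with SQLite. The `SCHEMA` dictionary, the `accounts` table, and the `ensure_schema` helper are made-up names for illustration, and other databases would need a different way to list existing columns than `PRAGMA table_info`.

```python
import sqlite3

# Hypothetical desired schema: table name -> {column name: column definition}.
SCHEMA = {
    "accounts": {
        "id": "INTEGER PRIMARY KEY",
        "email": "TEXT",
        "plan": "TEXT DEFAULT 'free'",
    },
}

def ensure_schema(conn, schema):
    """Create missing tables and add missing columns. Never drops or renames."""
    for table, columns in schema.items():
        # Names come from a trusted constant, so string formatting is acceptable here.
        col_defs = ", ".join(f"{name} {ddl}" for name, ddl in columns.items())
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_defs})")
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        for name, ddl in columns.items():
            if name not in existing:
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {ddl}")
```

Calling `ensure_schema(conn, SCHEMA)` on every startup is idempotent, so the same code path builds a fresh test database and upgrades an existing one. Note that SQLite refuses to add PRIMARY KEY or UNIQUE columns through ALTER TABLE, so this only handles plain additive columns.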

This also makes testing easier, in that my integration tests can start with a new database from scratch, build what needs to be built, run the tests that talk to the database, and then remove the database once it has finished.

If you must make non-backwards compatible changes (renames, whatever), I would suggest doing them one at a time.

Here is a paper on that topic: http://pmg.csail.mit.edu/~ajmani/papers/ecoop06-upgrades.pdf

Their examples show that it is not possible to do all application upgrades without downtime, and they show how to keep downtime to a minimum (e.g. by only making parts of your application inaccessible).

There's an easy way to do 3am US deployments: do them from Europe. I used to work at a consultancy in Ireland that did this, and it worked quite well. Deploy first thing in the morning, and you have up until some time in the afternoon to fix things if it goes pear-shaped.

"You stay up until 3am... spend a day in zombie mode with limited productivity or you start working much later."

Another option is to just wake up much later: my office is dead until 2pm, and the "key players" often don't wake up until 6pm.

I guess it makes some sense if you are trying to deploy in the late hours, but keeping a whole team so far out of sync with everyone else doesn't sound sustainable. I'd imagine those with families, or who do things like sports of a night, would eventually quit.

While there is some truth in there, I feel like some of the advice is pretty reckless:

"Problem 1: You presume there will be problems that impact availability. You have no confidence in your code quality; or (or maybe, and), you have no confidence in your infrastructure and deployment process."

Or you're playing it safe. You absolutely cannot guarantee that every update you are deploying will have zero problems. If your business absolutely relies on users making payments online or anything of that ilk, you could lose a lot of money.

"Imagine, for a moment, that your team is rolling out an update to a service that monitors life-support systems in hospitals."

What? Why? Throwing out hypothetical "what if" scenarios that would affect 0.1% of your readership isn't a very useful thing to do.

>Or you're playing it safe.

You're exactly right. This whole article sounds like someone pretending to understand risk analysis. You can make as many technological and human process improvements as you want leading up to the deploy, but even after doing everything else possible to reduce potential impact you'll still further reduce potential impact by pulling triggers when your service is at minimum load. And there is always a trigger to pull, the article argues for gradual rollout (which is good), but one still has to introduce new code to replace or run side-by-side with old code sometime. What if v2 worked alongside v1 in testing and staging just fine but something in production makes it explode?

Assume that if everything else is equal, A is better than B. Also assume that if everything else is equal, C is better than D. This article says "You're doing A?!?!?! WTF DUDE!? Just do C instead of D, then forget about A and go back to B".

If he wants to argue that the benefits of A over B outweigh the costs of D over C, then he should do that instead of writing what comes across as saying A is the magic bullet that makes C and D equivalent. Not to mention that the value of A over B and the cost of D over C are different from organization to organization.

I work on medical software. Surprisingly, most hospital systems require a fair bit of downtime. Hospitals have downtime procedures that they use during these periods (basically switching to manual, paper-based systems).

But I think the author's point is that if you find yourself worrying that a deployment has even a sliver of a chance of causing downtime, you should be spending your energy on finding ways to eradicate that risk rather than proceeding in the middle of the night.

Besides, in the small event of deployment-caused downtime and problems, who is to say you have enough time to restore service by the time your customers come online? By taking advantage of the clock you've only given yourself a few more hours to deal with any problems, rather than finding ways to not have any problems in the first place (canary-in-the-coalmine-style deployments, etc).

>But I think the author's point is that if you find yourself worrying that a deployment has even a sliver of a chance of causing downtime, you should be spending your energy on finding ways to eradicate that risk rather than proceeding in the middle of the night.

Right, and that's where I strongly disagree with the author. Sometimes finding and taking care of that last 0.1% risk of failure just isn't worth it. Sometimes you're better off babysitting the deployment for 30 minutes once or twice a month rather than spending valuable development hours updating your deployment scripts to be fully automatic and handle every possible contingency.

Sometimes you're better off babysitting the deployment for 30 minutes once or twice a month rather than spending valuable development hours updating your deployment scripts to be fully automatic

If your deployment scripts are fully automatic, you can enjoy the (many) benefits of deploying more often than once or twice a month.

Given no other dependencies, sure.

If you're not worrying you either don't care or you're fooling yourself. If it takes 30 min to fix a potential problem in production, I'd rather upset 10 people in the middle of the night than 1000 people during working hours.

> But I think the author's point is that if you find yourself worrying that a deployment has even a sliver of a chance of causing downtime, you should be spending your energy on finding ways to eradicate that risk rather than proceeding in the middle of the night.

Shouldn't I decide where I want to spend my energy?

The author is giving advice, not telling you what to do.

Throwing out hypothetical "what if" scenarios that would affect 0.1% of your readership isn't a very useful thing to do.


The people who use my production systems are people who are logging transactions which they have made (in the real world) into my system. It's not a hospital. If the system is down, you just come back later.

We do a lot to make sure our deployments will go smoothly, downtime is minimized, and they affect as few users as possible. But the effort required for my team to deliver "five nines" would be insane. It's much easier for one guy to take the application server down for 10 minutes (at midnight) once a month.

...the effort required for my team to deliver "five nines" would be insane. It's much easier for one guy to take the application server down for 10 minutes (at midnight) once a month.

For the projects I've worked on lately, the ideal of "zero downtime deployments, fully automated, during the daytime, as non-events" isn't at all about getting a particular number of nines, it's about deploying more often than once every month.

When the deployment you've been working on for a whole month goes wrong, which of the many hundreds of changes are problematic?

I would sincerely hope that my colleagues always assume that there WILL be problems that impact availability when dabbling in a production environment.

I'd rather have a guy spend whole day making sure everything is working and rechecking, etc.

I also hope that my colleagues will have the foresight to test an update/deployment on as fresh a mirror of the production environment (or a representative subset) as possible.

And I'd say that this is ESPECIALLY important for NoSQL environments.

I was hoping the author would have some deeply insightful answer to the argument, "because the deployment requires the site to go down for a few minutes and we want to avoid inconveniencing customers" but he avoids that topic entirely.

That's the number 1 reason for an overnight deployment, by far! If you can deploy code without bringing the site down, then of course, do it during the day!

Bingo. Deployment with us is automated. I don't actually stay up to 3 AM. However, during the deployment the machine is automatically updated with new packages and the new version of the software, caches are cleared, it's restarted, etc. All of this probably takes 5 minutes at most, but still ... why do that in the middle of the day?

Also, we do a daily backup around midnight. If the deployment is botched, I can come in early in the morning, notice it and simply do a roll-back using the backup. Very easy.

Now, if you're actually staying up until 3 AM and doing things manually ... you need to automate things.

Depends on the type of site/application. If your application is managing customer data that can be updated 24 hours a day, a few minutes of down-time might be okay, whereas rolling back 6 hours of data (and losing it) is definitely not okay.

He did: you do a rolling update, take a few systems out of active duty, upgrade them, and then begin directing some percent of traffic to them. That, of course, only works if you are of a scale to have that many production systems, and loose enough interdependencies to allow for new and old code to run at the same time. I think some of his point was that if you aren't able to do that sort of update, it is worth your time and effort to get there. I think he's probably right, but the work to get from where you are (it requires site outage to upgrade) to where you want to be (rolling upgrades, ability to run multiple versions in parallel, etc) is going to be so site specific in most cases that it's hard to discuss that move in any depth.

This is incredibly reckless and naive advice.

You roll out when it least impacts customers because:

1. Impacted customers affects your bottom line. Minimize that and you minimize losses if something screws up.

2. Murphy's law. I don't care how much testing and QA you do; stuff will ALWAYS creep through, and sometimes it'll be nasty. QA works to minimize this, but it can't eliminate it entirely due to diminishing returns. Show me your guaranteed bug free deployed code and then I'll consider changing my view.

3. If you've designed and tested your rollback procedure prior to actually doing it, the chances of not being able to roll back in the real deployment is orders of magnitude lower than the chances of a failed deployment requiring a rollback, which in turn, if you've done your testing and QA, is orders of magnitude less likely than a successful rollout (but not 0, thus the midnight rollout).

If you're worth your salt, you have a tested rollback procedure, laid out in simple to follow instructions (or better yet, an automated rollback mechanism with a simple-to-follow manual process when the automated method inevitably fucks up).

You roll out, and if it fails, you roll back. And if that fails, you use the manual procedure. You should have the entire process time-boxed to the worst case scenario (assuming successful manual rollback) so that you know beforehand what the impact is, and won't need to go around waking people up asking what to do.

Rollbacks are a myth. You can never rollback. Always be rolling forward. Enabling a culture and environment that allows for small frequent changes solves that problem.

The way to not impact a customer is to make deploys trivial, automated and tolerant to failure because everything fails.

Always be rolling forward

I basically agree with this idea, but when I'm selling people on the idea of making deployments trivial non-events that happen in the daytime, having the notion of "if something goes wrong, you can very easily jump back" gives people a sense of security.

In practice, when things go wrong, I've found it easier to roll forward than to roll backwards.

I've done a number of rollbacks in my time (for enterprise banking systems). They work so long as you do a few dry runs first and have an audit system in place.

A part of this is looking at the risk assessment and the notion of "guaranteed bug free deployed code" in a different way.

For a stable system that's in production, you don't need "guaranteed bug free deployed code" you need code that is not any worse than what's currently running out there. Doing frequent (daytime) deployments makes it easier to make a change, test (both with humans and robots) that change, and get it out there. You don't have to manually test everything in order to change anything when you're changing just one thing at a time.

I've come to believe that the far riskier approach is to make a bunch of changes at once, introduce a bunch of bugs, test fix bugs until you feel confident, and then release this huge change all at once in the middle of the night.

I'm befuddled by these sorts of posts that are big on theory and light on details. I run tests, via a continuous integration server, and have high confidence in our code when we deploy. However, we do updates at night because we often need to take our sites off line for a while, to merge our changes into the live codebase. If we didn't, changes to the database structure would bork the db as visitors clicked around. Nighttime is our lowest volume of visits, so that's the best time to deploy.

I'm sure this is a noob reaction, but can anyone point me to some good technical walkthroughs about how to deploy to live without taking your site down or interrupting your database connection? Is there a tool/term/practice I'm missing?

I don't have a good technical walkthrough, but we're providing a high availability service, and zero-downtime upgrades are a requirement. Our setup basically consists of load balancers and application servers. When we need to upgrade, we require that application version x always be backwards compatible with application version x - 1, both on a protocol and data model level. Basically, our upgrade procedure is as follows:

1. distribute new package to all servers;

2. run an additional application service on all servers and run some quick tests on each of them to verify proper working;

3. add new application servers to load balancers;

4. upgrade data model to new version (we use PostgreSQL, and this happens in a single db transaction, and remember that our new version x is compatible with both the current and the previous data model);

5. remove old application services from load balancers;

6. upgrade successful.

If anything goes wrong, we can roll back each of these steps. Note that this whole process, perhaps needless to say, is fully automated.

By running multiple versions inside the load balancer at the same time, and having the requirement that version x + 1 is always compatible with version x, this procedure allows us to seamlessly upgrade to a new version without any downtime.
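
The "we can roll back each of these steps" part can be sketched as a list of steps that each carry their own undo action. This is only an illustration in Python, not the poster's tooling: `run_with_rollback` and the step names are hypothetical, and the real steps would call a package manager, a load balancer API, and a PostgreSQL migration.

```python
def run_with_rollback(steps):
    """Run (name, do, undo) steps in order.

    If a step raises, undo the steps that already completed, newest first,
    and report which step failed.
    """
    completed = []
    for name, do, undo in steps:
        try:
            do()
        except Exception:
            for _, undo_previous in reversed(completed):
                undo_previous()
            return False, name
        completed.append((name, undo))
    return True, None

# Usage with stand-in state: a set plays the role of the load balancer pool.
pool = {"app-v1"}
steps = [
    ("add new servers to load balancer",
     lambda: pool.add("app-v2"), lambda: pool.discard("app-v2")),
    ("remove old servers from load balancer",
     lambda: pool.discard("app-v1"), lambda: pool.add("app-v1")),
]
ok, failed_step = run_with_rollback(steps)
```

Keeping each undo next to its step makes it harder to add an upgrade step without thinking about how to reverse it.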

There won't be general-purpose walkthroughs, because the general answer is "don't change the database structure of a database that's serving live traffic: You'll never enumerate all the things that could possibly go wrong."

In most specific cases, there are alternatives to stopping the show during an upgrade. First, you check your backups! Then, there are various strategies. You can make many simple schema changes while the system is running. Or put the DB in read-only mode and still mostly manage to serve pages -- perhaps with an occasional HTTP 503, so hopefully that's okay -- during your upgrade. Or phase in changes over several releases, carefully architecting for backward compatibility. Or bring up a parallel system and gradually migrate active users to it. Depending on what you're doing, you may find yourself having to write a special upgrade script that migrates old formats to new formats, or even using database triggers to keep "old" and "new"-format tables in sync during the transition. A well-constrained database may help keep things sensible during the transition -- or, you may have to drop half of your supposedly-sensible constraints just to make the transition work.
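
As one illustration of the trigger approach, here is a small sketch in Python with SQLite. The tables `customers_old` and `customers_new`, the renamed column, and the `install_sync_triggers` helper are invented for the example, and it assumes row ids never change. Other databases have their own trigger syntax.

```python
import sqlite3

def install_sync_triggers(conn):
    """Create old/new tables plus triggers that mirror every write to the old one."""
    conn.executescript("""
        CREATE TABLE customers_old (id INTEGER PRIMARY KEY, mail TEXT);
        CREATE TABLE customers_new (id INTEGER PRIMARY KEY, email TEXT);

        -- Mirror inserts into the new-format table.
        CREATE TRIGGER sync_insert AFTER INSERT ON customers_old
        BEGIN
            INSERT OR REPLACE INTO customers_new (id, email)
            VALUES (NEW.id, NEW.mail);
        END;

        -- Mirror updates (assumes the id itself is never updated).
        CREATE TRIGGER sync_update AFTER UPDATE ON customers_old
        BEGIN
            INSERT OR REPLACE INTO customers_new (id, email)
            VALUES (NEW.id, NEW.mail);
        END;

        -- Mirror deletes.
        CREATE TRIGGER sync_delete AFTER DELETE ON customers_old
        BEGIN
            DELETE FROM customers_new WHERE id = OLD.id;
        END;
    """)
```

Old code keeps writing to `customers_old` while new code reads `customers_new`. Rows that existed before the triggers were installed still need a one-off copy.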

Moral: Even if you know exactly what you're doing, live updates are more work: More planning, more code, more infrastructure, and/or more stress. In many real-world cases, you should just take the system down for a minute. Focus your engineering effort on making sure that "minute" is as short as possible, and in making sure that you can detect problems and roll back as quickly as possible.

    > the general answer is "don't change the database 
    > structure of a database that's serving live traffic: 
    > You'll never enumerate all the things that could 
    > possibly go wrong."
And what if you need to? Even adding an index on a large table can slow things down tremendously. Let alone adding/deleting columns with indexes.

"Don't do it" is not really an answer.

    > Or put the DB in read-only mode and still mostly 
    > manage to serve pages
Oh yeah, let's put our credit card processing app in the read-only mode during the day. What can possibly go wrong, just those silly 503s.

Perhaps I need to reiterate:

The reason there aren't a lot of general-purpose walkthroughs is that there is no general case.

The closest thing we have to a general-purpose solution is: Take the system down, do the upgrade, put the system back up. But, yes, this is often not a very good answer. It is often a lousy and expensive answer. In which case you hire engineers to build a better strategy. And I can't tell you, dear reader, what that better strategy is going to be, because it's different from case to case, and I don't know what your problem is.

And, yes, of course you shouldn't use the sloppy strategies on your financial-transaction processing app. Just as you probably shouldn't spend three engineer-weeks designing a complex zero-downtime rollout strategy for your blog comment system.

I think of them as "speculative management" posts. Like the post here, they're usually written by someone with no direct experience in what they're talking about, and no responsibility for dealing with the problems that arise from following their advice. The posts are aspirational for group credibility (social proof), necessarily among those who also don't have direct experience in the topic. It's Monday Morning Quarterbacking every day of the week, or nerd watercooler b.s.'ing. "Why does NASA wait for certain weather in order to launch, don't they have confidence in their equipment/systems/pilots?"

As for your second question, if you have a load balancer, you can always take nodes out of it in order to update them, before re-enabling them in the LB and moving on to the next one. It's called a rolling upgrade, and the ease and details of doing such depend on the actual pieces involved.
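
In outline, a rolling upgrade loop looks something like the Python sketch below. Everything here is a stand-in: `in_rotation` is a plain set playing the load balancer, and `upgrade` and `is_healthy` are callbacks you would replace with whatever your LB and deployment tooling actually provide.

```python
def rolling_upgrade(nodes, in_rotation, upgrade, is_healthy):
    """Upgrade nodes one at a time, keeping the others in the load balancer.

    Returns (upgraded_nodes, failed_node). failed_node is None on success.
    """
    upgraded = []
    for node in nodes:
        in_rotation.discard(node)   # take the node out of the LB
        upgrade(node)               # install the new version on it
        if not is_healthy(node):
            # Stop here: the bad node stays out of rotation, the rest keep serving.
            return upgraded, node
        in_rotation.add(node)       # re-enable it before touching the next node
        upgraded.append(node)
    return upgraded, None
```

Because only one node is out of rotation at any moment, a bad build costs you one node's worth of capacity rather than the whole site, provided old and new versions can serve traffic side by side.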

You need to make sure all of your DB changes are backwards compatible. For example, adding new tables, adding columns (with defaults), and adding indexes can all be done without breaking existing code. The code does need to do the proper thing to make this work, such as INSERTs with the column list.
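
A quick way to see why the column list matters, sketched in Python with an in-memory SQLite database. The `orders` table and the `old_inserts_after_add_column` function are made up for the demonstration.

```python
import sqlite3

def old_inserts_after_add_column():
    """Show which style of 'old code' INSERT survives an added column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")

    # The backwards-compatible schema change: a new column with a default.
    conn.execute("ALTER TABLE orders ADD COLUMN status TEXT DEFAULT 'new'")

    # Old code that names its columns keeps working and gets the default.
    conn.execute("INSERT INTO orders (id, item) VALUES (1, 'book')")

    # Old code that relies on column positions now supplies too few values.
    try:
        conn.execute("INSERT INTO orders VALUES (2, 'pen')")
        positional_insert_ok = True
    except sqlite3.OperationalError:
        positional_insert_ok = False

    rows = conn.execute("SELECT id, item, status FROM orders").fetchall()
    return rows, positional_insert_ok
```

The same reasoning applies to `SELECT *` combined with positional unpacking on the read side.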

We normally disallow non-backwards compatible changes, such as renaming columns. We only drop tables after renaming and waiting a while (so we can quickly rename back).
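
The rename-then-wait habit might look like this in Python with SQLite. The helper names and the `_retired` suffix are hypothetical conventions for the sketch, not anything prescribed above.

```python
import sqlite3

def retire_table(conn, table):
    """Step 1: rename instead of dropping. Any code still using it fails loudly."""
    conn.execute(f'ALTER TABLE "{table}" RENAME TO "{table}_retired"')

def restore_table(conn, table):
    """Escape hatch: something still needed the table, so rename it back."""
    conn.execute(f'ALTER TABLE "{table}_retired" RENAME TO "{table}"')

def drop_retired(conn, table):
    """Step 2, after the waiting period: actually drop the data."""
    conn.execute(f'DROP TABLE "{table}_retired"')
```

The rename is cheap and reversible, while the drop is neither, which is why the two are separated by a waiting period.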

When you have a lot of database servers this is pretty important since trying to keep them all exactly in-sync with the same schema at all times in the process is pretty much impossible. While doing the change, you are always going to end up with some finishing earlier than others.

I haven't done a routine deployment in the middle of the night for years. Here's a post (of mine) that's a little longer on details.


I boil it down to this: The safest change to deploy to a stable system is the smallest change possible. Most of my changes don't require schema changes at all.

The application itself, on startup, verifies that the database has the tables/columns it needs in order to work. If it doesn't, it will CREATE TABLE or ALTER TABLE appropriately.

I try to avoid backwards-incompatible schema changes whenever possible, so I can rollback. It's always safer and easier to add a new table or column than to delete/rename an existing one. Something wrong with the code? Rolling back won't send you back to an incompatible state.

I use an ORM instead of stored procedures because I find it a lot friendlier to this general process: you don't have procs that expect a particular parameter signature.

You may need to decouple your db-changes deployments from your non-db-changes deployments. Doing that can at least make the non-db-changes deployments a lot less painful.

The only good point is #2: you want your people to be awake during the deployment so they can deal with any problems. The others basically amount to "You should have other processes in-place to avoid the same problems you're trying to minimize."

Three words: Defense in depth. You don't always need every advantage you can get, and this is a pretty costly one. Still, it's simply wrong to assume that a particular precaution is always needless.

Interestingly, you hit a bit of a sweet-spot if your primary customer base is located halfway around the world. You can roll out updates when your people are most wide awake, but when you'll only break things for a few users if something goes wrong.

I deploy and update certain things at night because they REQUIRE down-time. I can't have people making certain updates while something is being moved, for example, so things have to be shut down for a few minutes.

That whole line about trusting your infrastructure, etc, is hog-wash. I say, don't trust your infrastructure. Don't trust your code. Be safe, be smart, do things at a time when the least number of people could be affected by a problem.

Humans are responsible for everything that is happening in a deployment, and humans make mistakes. So, no, I don't trust my servers or my code at a time of deployment.

It would be irresponsible of me to do so even if I had the most talented developers, and the most solid and secure platform in the world.

Others would argue that you can architect your infrastructure/application so that you don't require down-time for changes, ever.

It's also interesting how U.S.-centric this thinking is: your sneaky 3am deploy in the valley could mean a mid-morning outage in London.

Actually, an 8am deployment in Europe has the advantages of a rested brain and a few hours before America hammers the Internet ;-)

Many US based software companies either A) have a mostly American user base or B) have a regional distribution setup for their app. So if you are a non-American in camp A, that is a problem for you. However, remember the goal is to be the least disruptive. If you are a non-American in a region that the server is deployed in, there is probably a night-based take down of the system in your region.

If the software company is neither A nor B, then it needs to notify its non-US users that they may experience a disruption.

> Actually, an 8am deployment in Europe has the advantages of a rested brain and a few hours before America hammers the Internet ;-)

And the disadvantage of happening in the middle of mainland Europe morning rush. I think the point is that you take the downtime when it's least disruptive to your users, who/wherever they might be.

I do think 3AM is a terrible time to do a release. I'm personally a fan of doing releases in the early morning instead of late night. I'd rather get in the office at 6AM & push the release. Plan for a light day and let everybody leave early if things went smoothly. If things do go wrong it gives you about 2-3 hours before the rest of the office starts showing up to roll back or fix the situation.

Choosing the time should be a balance of off-peak usage time and a reasonable time for the developers. If you're a global business then you may not have a perfect time, but you can usually still choose a day and time that is less likely to cause inconvenience.

There are deploys and then there are deploys. 99% of your deploys should be fine to go live whenever the code hits the repo. I definitely still prefer scheduled overnight deploys whenever significant infrastructure changes need to go live. Downtime should still be minimized, but sometimes a maintenance window is necessary.

I attended a git workshop by Scott Chacon and was surprised to learn that github's production site can be deployed by many folks on the development team, and at any time during the day. In fact, he mentioned that github.com can be updated as many as a dozen or more times per day as small features or fixes are pushed live.

There is a great description of their web server stack[1] that allows them to seamlessly push new code live without downtime (see the second Slow Deploys section.) It's a great idea about how to continue serving during a deployment, and I'm looking into using Unicorn like this for some projects I'm working on.

[1] https://github.com/blog/517-unicorn

The article discusses deploying to a percentage of your user base at a time. I have an idea on how to do this, but it is probably the wrong way.

Anybody know of any articles explaining this process for a rails app?

I am also interested in this concept. What if a change requires a migration of my database?

A near-instantaneous switchover is possible if you're migrating a database to a new schema, but it's pretty hard work.

For this to work, every action that will result in a database write needs to go in a log, and that action log needs to be replayable by the updated version.

In the context of a web application, you have:

1. Take your database backup, start recording action log.

2. Perform your database migration against your database backup.

3. Install your updated web application (not available to users yet).

4. Test your updated web application.

5. Disable your existing web application.

6. Replay the contents of your action log using the updated web application.

7. Make your updated web application available to your users.

It's definitely not a trivial matter to sort out: it means lots of time working on your deployment process. For example, you have to be set up to have both versions deployed simultaneously, which likely means a lot of mucking about with URL rewriting. The application also has to be built to support it.
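A minimal sketch of the record-and-replay steps above, using in-memory lists in place of real databases. Every name here (`Action`, `ActionLog`, `migrate`, the `create_user` handlers) is hypothetical and only illustrates the shape of the idea; a real system would also need a durable log and careful ordering guarantees.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """One write request, recorded independently of any schema."""
    name: str
    payload: dict


@dataclass
class ActionLog:
    entries: list = field(default_factory=list)

    def record(self, action: Action) -> None:
        self.entries.append(action)

    def replay(self, handlers: dict) -> None:
        # Feed every recorded action, in order, to the new application.
        for action in self.entries:
            handlers[action.name](action.payload)


def migrate(old_rows: list) -> list:
    """Hypothetical migration: split 'name' into 'first' and 'last'."""
    migrated = []
    for row in old_rows:
        first, _, last = row["name"].partition(" ")
        migrated.append({"id": row["id"], "first": first, "last": last})
    return migrated


# Step 1: take a backup and start recording the action log.
live_db = [{"id": 1, "name": "Ada Lovelace"}]
backup = [dict(row) for row in live_db]
log = ActionLog()


def old_create_user(payload: dict) -> None:
    live_db.append({"id": payload["id"], "name": payload["name"]})


# A write arrives while the migration is in progress: the old
# application serves it, and the action is also logged.
action = Action("create_user", {"id": 2, "name": "Grace Hopper"})
old_create_user(action.payload)
log.record(action)

# Steps 2-4: migrate the backup and bring up the new application on it.
new_db = migrate(backup)


def new_create_user(payload: dict) -> None:
    first, _, last = payload["name"].partition(" ")
    new_db.append({"id": payload["id"], "first": first, "last": last})


# Steps 5-6: disable the old application, then replay the log.
log.replay({"create_user": new_create_user})

print(new_db)
```

The key design constraint is that the log stores the user's intent ("create this user") rather than the SQL that was run, which is what lets the new version replay it against a different schema.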

Maybe there is a better way?

There is a better way. It's to stop deleting and start deprecating. Write your code in a backwards compatible way. If your storage engine, whatever it is, can't handle it then switch storage engines.

The author makes some amazing points that I'm surprised intelligent people are missing.

Let me give an example. We are currently migrating from storing blobs of data in Voldemort (a key/value store) to storing them in S3. They should have never been in there in the first place but whatever. We're going to do it with "zero" migration time. In fact we're already doing it.

- Set up a job that copies existing data in Voldemort to S3.

- Deploy a minor release of our code that multiplexes the current writes to both Voldemort and S3.

- Continue migrating existing data

- When existing data is finished migrating, deploy a new release that forces all traffic to S3 instead of Voldemort.

- Profit
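Roughly, the steps above look like the sketch below. This is not our actual code: `DictStore` is an in-memory stand-in for both Voldemort and S3, and `MigratingStore`, `backfill`, and the `read_from_new` flag are hypothetical names used only for illustration.

```python
class DictStore:
    """In-memory stand-in for a blob store (hypothetical)."""

    def __init__(self):
        self.data = {}

    def put(self, key, blob):
        self.data[key] = blob

    def get(self, key):
        return self.data.get(key)


class MigratingStore:
    """Writes go to both stores; reads come from whichever is primary."""

    def __init__(self, old, new, read_from_new=False):
        self.old = old
        self.new = new
        self.read_from_new = read_from_new

    def put(self, key, blob):
        # Multiplex every write so the new store never falls behind.
        self.old.put(key, blob)
        self.new.put(key, blob)

    def get(self, key):
        primary = self.new if self.read_from_new else self.old
        return primary.get(key)


def backfill(old, new):
    """Copy pre-existing blobs, skipping keys the dual writes already set."""
    copied = 0
    for key, blob in list(old.data.items()):
        if new.get(key) is None:
            new.put(key, blob)
            copied += 1
    return copied


old_store, new_store = DictStore(), DictStore()
old_store.put("avatar:1", b"legacy blob")       # data from before the migration

store = MigratingStore(old_store, new_store)
store.put("avatar:2", b"fresh blob")            # dual-written while migrating

copied = backfill(old_store, new_store)         # background copy job
store.read_from_new = True                      # final release: read from new
print(copied, store.get("avatar:1"), store.get("avatar:2"))
```

The backfill skips keys that already exist in the new store so that it never overwrites a fresher dual-written value with a stale copy.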

People need to learn to do things like dark launching and feature flags. Dark launching lets you exercise new code paths with no impact to the user. Feature flags give you the ability to enable features for some or all of your users. Feature flags are awesome ways to A/B test as well.
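For anyone who hasn't seen dark launching, here is a minimal sketch of one way to do it. The `dark_launch` wrapper and the two `*_total` functions are hypothetical; the point is only that the new path runs on real traffic while users keep getting the old path's answer.

```python
import logging

logger = logging.getLogger("dark_launch")


def dark_launch(old_fn, new_fn):
    """Serve old_fn's result; run new_fn on the side and log differences."""

    def wrapper(*args, **kwargs):
        result = old_fn(*args, **kwargs)
        try:
            candidate = new_fn(*args, **kwargs)
            if candidate != result:
                logger.warning("mismatch: old=%r new=%r", result, candidate)
        except Exception:
            # A failure in the new path must never reach the user.
            logger.exception("new code path failed")
        return result

    return wrapper


def old_total(prices):
    return sum(prices)


def new_total(prices):
    # Deliberately buggy candidate: divides by zero on an empty cart.
    return sum(prices) / len(prices) * len(prices)


total = dark_launch(old_total, new_total)
print(total([5, 10]))
print(total([]))
```

This only works as-is when the new path has no user-visible side effects; if it writes anything, it should write to a shadow location.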

People need to stop doing stupid shit like redefining what some property means mid-release and instead define NEW properties and deprecate old ones in subsequent releases.

Same goes for schema changes. If some part of your code base cannot tolerate an additional column that it doesn't need, that's a bug.
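A small illustration of what "tolerating an additional column" means in practice, using SQLite from the Python standard library. The `users` table and `load_names` helper are made up for the example; the idea is that code which names its columns explicitly keeps working when a column is added underneath it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")


def load_names(conn):
    # Name the columns you need; never depend on column count or order.
    rows = conn.execute("SELECT id, name FROM users ORDER BY id")
    return [(row["id"], row["name"]) for row in rows]


before = load_names(conn)

# The next release adds a column. The old reader and the old writer
# (which lists its columns explicitly) are both unaffected.
conn.execute("ALTER TABLE users ADD COLUMN nickname TEXT")
conn.execute("INSERT INTO users (name) VALUES ('grace')")

after = load_names(conn)
print(before, after)
```

By contrast, an insert written as `INSERT INTO users VALUES (?, ?)` depends on the table having exactly two columns and would start failing after the `ALTER TABLE`.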

You can also adopt some configuration management that allows you to provision duplicates of the components you're deploying so you can swap back and forth between them in the case where you might have a breaking release. That's what we do and it's one of the upsides of using EC2.

All of this requires discipline and dedication but the benefit is so worth it. You have to stay on top of dependencies. You can't let bitrot take hold by going 4 major revisions without upgrading some package. We did and it bit us in the ass.

This is why we adopted the swinging strategy of duplicating our entire stack (takes about 30 minutes depending on Amazon allocation latency) on major upgrades.

Sometimes it absolutely can't be avoided, but usually the last version and the next version can be coded to work off of either schema.

As for deploying to a segment of users, put a "version" field in the user and then gate features based on the version. If you name your versions or use a unique numbering system it should be fairly easy to remove the old gating.
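A minimal sketch of that kind of gate, assuming an ordered list of named releases. `RELEASES`, `User`, `has_release`, and `checkout_page` are all hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical release names, oldest first. Unique names make it easy
# to search the code base for a gate once everyone is past it.
RELEASES = ["v1-baseline", "v2-new-checkout"]


@dataclass
class User:
    name: str
    version: str = RELEASES[0]


def has_release(user: User, release: str) -> bool:
    """True if the user's version is at or beyond the given release."""
    return RELEASES.index(user.version) >= RELEASES.index(release)


def checkout_page(user: User) -> str:
    if has_release(user, "v2-new-checkout"):
        return "new checkout"
    return "old checkout"


alice = User("alice")
bob = User("bob", version="v2-new-checkout")
print(checkout_page(alice), "|", checkout_page(bob))
```

Rolling a feature out to more users is then a data change (updating the `version` field) rather than a deploy.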

Yes, none of these things are foolproof but then again your existing system probably isn't either. The more you deploy the better you get at it and the more automated it becomes. Force the issue by forcing a deploy every day to begin with, then ramp up to twice a day, etc. After a month or two you'll learn solutions to almost all your deployment problems and there will be a well-known solution in the organization for solving the problem whether it be gating features or schema migrations.

I cringe at the thought of putting versioning gates all over my code, but maybe that's the easiest solution.

What about having two code bases on your production server and having your web server (nginx) route the traffic accordingly? 85% goes to the current code base and 15% goes to the new one?

If that's possible it would seem pretty easy.
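The routing decision itself is simple if you bucket on something stable per user. Below is a sketch of the logic in Python; `bucket` and `pick_backend` are hypothetical names, and in practice this would live in the load balancer rather than the app (nginx has a `split_clients` directive for this kind of percentage split).

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user id to a stable bucket in the range [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def pick_backend(user_id: str, new_percent: int = 15) -> str:
    """Send roughly new_percent% of users to the new code base."""
    return "new" if bucket(user_id) < new_percent else "current"


# The same user always lands on the same side of the split.
print(pick_backend("user-42"), pick_backend("user-42"))

# Across many users the share approaches new_percent.
ids = [f"user-{i}" for i in range(10_000)]
share = sum(pick_backend(uid) == "new" for uid in ids) / len(ids)
print(f"{share:.1%} of users routed to the new code base")
```

Hashing a user id instead of picking randomly per request matters: otherwise a single user would bounce between the two code bases from one page load to the next.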

I think that makes it less testable but would probably work, it's probably a good stop gap as long as you don't want to test more than one "feature" at a time.

Gating would probably work better in the long term as you get more devs and users, and need to do things like A/B/C testing.

"...I sincerely hope you’re not living in the Stone Age..."

Whether or not you agree with the entire article, for me, the statement above was the most important one. There is no excuse why any software development team does not have an automated build and deploy process.

There may be valid business reasons to deploy in the middle of the night, but there's no business reason to deploy by hand.

Also, I am assuming that the posting only reflects application source code and not database changes, which is a different beast in terms of deployments.

Couldn't agree more.

I worked somewhere that only deployed during business hours because it was all hands on deck. We just kept the users primed with information and reminders prior to go-live, and when it happened everyone knew about it.

But there's a more fundamental point. We're dealing with IT, and for internet businesses and global enterprises the issue is the same: business is 24 hours a day 7 days a week. You cannot avoid deploying somewhere in the world that's awake.

All you have to do is plan a rollback and just get on with it.

My experience of companies that are timid about rolling out changes is that their state of advancement is completely unstable. They are too scared to change, meaning that some technologies become out-of-sync with the current state of play. You only have to introduce a new piece of whiz-bang software and issues come pouring out, and it's a frantic race to upgrade random components in the hope you might plug the leak. This is far, far worse in every way than doing regular "risky" upgrades.

Small incremental changes can be easily rolled back, because you know what you just changed.

Why in the middle of the night? The economics of things, mostly.

At my previous job, we were still in the process of getting session migration to work so servers within a cluster could update while others were running.

At my current job, we don't have a cluster (we have a single over-provisioned server), and that method simply isn't in the budget for the foreseeable future. However, 3:00 AM PDT is 3:30 PM IST (India), which is where most of my team is, as well as the support staff they need (most staff there are on swing shift, as well). Thus, an early morning release is actually a very convenient, as well as lower cost, time to do the release. As the site is mostly US customers in both a B2C and B2B use case, this timing (around 6:00 AM EDT) works very well, and I don't see it changing.
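The time zone arithmetic above can be checked with fixed UTC offsets (PDT is UTC-7, EDT is UTC-4, IST is UTC+5:30). The date in the sketch is arbitrary and only needs to fall within US daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, so no time zone database is needed.
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Arbitrary example date; only the 3:00 AM PDT time of day matters.
release = datetime(2011, 10, 17, 3, 0, tzinfo=PDT)

for zone in (PDT, EDT, IST):
    local = release.astimezone(zone)
    print(local.strftime("%I:%M %p %Z"))
```

So a 3:00 AM Pacific release is mid-afternoon for the team in India and 6:00 AM on the US East Coast, before most US customers are online.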

I realize this doesn't scale for a global app, but it explains the lack of pressure to spend much consideration on higher tech, more expensive, deployment methodologies.

Or maybe a client simply insists that customers absolutely can't tolerate a delay of several seconds during the deployment ;(

It's not just because your developers are tired, which is a big issue; it's also because you may not have access to the people you need to make decisions in the middle of the night if something needs to be escalated. Good luck trying to get an executive or VP up at 4am to make a decision...

"Why are you still deploying overnight?"

Because you don't hire a system admin to do that job for you. :}

I think a point also missed in the article is the startup notion of burning the midnight oil and pushing stuff live.

"Release early, release often" is great, but that same call to action can be what motivates people to push code live at 3am when they're tired, just finished it, and perhaps skipping QA.

Otherwise, I think the article as put is empty and seems more straw man. Assumes you have no confidence in your code, assumes it is complicated, assumes you can't roll back.

Simply put, deploy small stuff often, make plans for big stuff, and don't be stupid.

I rarely have issues, but there is always a chance that something is going to go wrong, so everyone is sitting there waiting for the all clear. I was able to make the case that having developers, DBAs, and sysadmins sitting around in the middle of the night was counterproductive. It was fraught with tension, coming in the middle of the night after a long day at work. Now we do it at 1pm on Saturday, and it is almost leisurely. Just a quick call to check in. Some people even call the voice bridge from the beach!

Let's be realistic. Many applications have 24 hour availability, but almost every application I've worked on was for people in the US and core usage was sometime between 7AM and 9PM.

If it only takes 5 minutes to deploy your app, why not do it at 2AM when few users are online? Deployments should be simple, easy, and NOT require any problem solving by the deployer. Midday deployments with all hands on deck to handle issues seem like a bigger problem to me.

The best reason not to do overnight deployments is that if something goes wrong, you want your developers to be alert and sharp, not dazed and sleep-deprived.

Because our environment is non-trivial. It has nothing to do with confidence in application code, and everything to do with service-level agreements, impact to existing customers, and our own psyche. Ask SalesForce if they do their maintenance updates on weekends because of a lack of confidence in their codebase.

We're not doing 2am deployments, but we're certainly doing them "off-hours".

Triviality has nothing to do with it.

Nothing in your SLA says you have to have servers 1, 2 and 3 available. It says you have to have service X available. You can do that without taking a downtime hit.

Wow, thanks for having better insight into our production environment than myself.

Feel free to correct me then. I have YET to work anywhere in any industry (financial primarily) that stated we had to have specific hardware up and running as part of an SLA.


> I have YET to work anywhere in any industry

Sounds like your experience is the only one that matters, then.

I started writing a reply but it turned into a blog post, here: (http://casestatement.tumblr.com/post/11598425762/why-we-are-...)
