
I'm befuddled by these sorts of posts that are big on theory and light on details. I run tests via a continuous integration server and have high confidence in our code when we deploy. However, we do updates at night because we often need to take our sites offline for a while to merge our changes into the live codebase. If we didn't, changes to the database structure would bork the db as visitors clicked around. Nighttime is our lowest volume of visits, so that's the best time to deploy.

I'm sure this is a noob reaction, but can anyone point me to some good technical walkthroughs about how to deploy to live without taking your site down or interrupting your database connection? Is there a tool/term/practice I'm missing?

I don't have a good technical walkthrough, but we're providing a high availability service, and zero-downtime upgrades are a requirement. Our setup basically consists of load balancers and application servers. When we need to upgrade, we require that application version x always be backwards compatible with version x - 1, both at the protocol and the data model level. Basically, our upgrade procedure is as follows:

1. distribute new package to all servers;

2. run an additional application service on all servers and run some quick tests on each of them to verify it's working properly;

3. add new application servers to load balancers;

4. upgrade the data model to the new version (we use PostgreSQL, and this happens in a single db transaction; remember that our new version x is compatible with both the current and the previous data model);

5. remove old application services from load balancers;

6. upgrade successful.

If anything goes wrong, we can roll back each of these steps. Note that this whole process, perhaps needless to say, is fully automated.

By running multiple versions inside the load balancer at the same time, and having the requirement that version x + 1 is always compatible with version x, this procedure allows us to seamlessly upgrade to a new version without any downtime.
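The steps above can be sketched as a toy model, where the "load balancer" is just the set of (server, version) instances receiving traffic. All names here are illustrative, not a real deployment API:

```python
# Toy model of the zero-downtime upgrade procedure above. The "load
# balancer" is a set of (server, version) pairs receiving traffic.

def rolling_upgrade(servers, lb, old, new, migrate, smoke_test):
    """Move every server from version `old` to `new` without the
    load balancer ever being empty."""
    started = []
    # steps 1-2: start the new version next to the old one and
    # smoke-test it before it ever sees real traffic
    for s in servers:
        inst = (s, new)
        if not smoke_test(inst):
            raise RuntimeError("smoke test failed on %s" % s)
        started.append(inst)
    # step 3: add the new instances to the load balancer; both
    # versions now serve traffic side by side
    lb.update(started)
    # step 4: migrate the data model in one transaction -- the new
    # schema must work for versions `new` and `old` alike
    migrate()
    # step 5: only now remove the old instances from rotation
    for s in servers:
        lb.discard((s, old))
    return lb
```

Because each step has an inverse (remove the new instances, roll back the migration transaction), a failure at any point can be unwound without downtime.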

There won't be general-purpose walkthroughs, because the general answer is "don't change the database structure of a database that's serving live traffic: You'll never enumerate all the things that could possibly go wrong."

In most specific cases, there are alternatives to stopping the show during an upgrade. First, you check your backups! Then, there are various strategies. You can make many simple schema changes while the system is running. Or put the DB in read-only mode and still mostly manage to serve pages -- perhaps with an occasional HTTP 503, so hopefully that's okay -- during your upgrade. Or phase in changes over several releases, carefully architecting for backward compatibility. Or bring up a parallel system and gradually migrate active users to it. Depending on what you're doing, you may find yourself having to write a special upgrade script that migrates old formats to new formats, or even using database triggers to keep "old" and "new"-format tables in sync during the transition. A well-constrained database may help keep things sensible during the transition -- or, you may have to drop half of your supposedly-sensible constraints just to make the transition work.
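As a concrete instance of the "phase in changes over several releases" strategy, here is a minimal sketch of the expand/contract pattern for renaming a column. The `name`/`full_name` columns are made up, and a dict stands in for a table row:

```python
# Expand/contract sketch: the transition release writes both the old
# and the new column, so the previous release keeps working; a later
# release drops the old column once nothing reads it anymore.

def write_user(row, full_name):
    # transition release: write old and new columns so code at both
    # versions sees a value
    row["name"] = full_name        # old column, still read by version x
    row["full_name"] = full_name   # new column, read by version x + 1
    return row

def read_user(row):
    # prefer the new column, falling back for rows written before
    # the backfill migration ran
    return row.get("full_name") or row["name"]
```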

Moral: Even if you know exactly what you're doing, live updates are more work: More planning, more code, more infrastructure, and/or more stress. In many real-world cases, you should just take the system down for a minute. Focus your engineering effort on making sure that "minute" is as short as possible, and in making sure that you can detect problems and roll back as quickly as possible.

    > the general answer is "don't change the database 
    > structure of a database that's serving live traffic: 
    > You'll never enumerate all the things that could 
    > possibly go wrong."
And what if you need to? Even adding an index on a large table can slow things down tremendously, let alone adding/deleting columns with indexes.

"Don't do it" is not really an answer.

    > Or put the DB in read-only mode and still mostly 
    > manage to serve pages
Oh yeah, let's put our credit card processing app in read-only mode during the day. What could possibly go wrong, just those silly 503s.
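For the index case specifically, PostgreSQL (mentioned upthread) has at least a partial answer: CREATE INDEX CONCURRENTLY builds the index without taking the lock that blocks writes, at the cost of a slower build and the restriction that it cannot run inside a transaction block. A sketch with made-up table and column names:

```python
# Build a Postgres index without blocking writers. Table/column names
# are hypothetical; the usage sketch below assumes the psycopg2 driver.

def concurrent_index_sql(table, column):
    # CONCURRENTLY trades a longer, two-pass build for not holding
    # the write-blocking lock that a plain CREATE INDEX would take
    return ("CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_%s_%s "
            "ON %s (%s)" % (table, column, table, column))

# usage sketch (needs a live connection; CONCURRENTLY must run
# outside a transaction, hence autocommit):
#   conn = psycopg2.connect(dsn)
#   conn.autocommit = True
#   conn.cursor().execute(concurrent_index_sql("orders", "user_id"))
```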

Perhaps I need to reiterate:

The reason there aren't a lot of general-purpose walkthroughs is that there is no general case.

The closest thing we have to a general-purpose solution is: Take the system down, do the upgrade, put the system back up. But, yes, this is often not a very good answer. It is often a lousy and expensive answer. In which case you hire engineers to build a better strategy. And I can't tell you, dear reader, what that better strategy is going to be, because it's different from case to case, and I don't know what your problem is.

And, yes, of course you shouldn't use the sloppy strategies on your financial-transaction processing app. Just as you probably shouldn't spend three engineer-weeks designing a complex zero-downtime rollout strategy for your blog comment system.

I think of them as "speculative management" posts. Like the post here, they're usually written by someone with no direct experience in what they're talking about, and no responsibility for dealing with the problems that arise from following their advice. The posts are aspirational for group credibility (social proof), necessarily among those who also don't have direct experience in the topic. It's Monday Morning Quarterbacking every day of the week, or nerd watercooler b.s.'ing. "Why does NASA wait for certain weather in order to launch, don't they have confidence in their equipment/systems/pilots?"

As for your second question, if you have a load balancer, you can always take nodes out of it in order to update them, before re-enabling them in the LB and moving on to the next one. It's called a rolling upgrade, and the ease and details of doing so depend on the actual pieces involved.

You need to make sure all of your DB changes are backwards compatible. For example, adding new tables, adding columns (with defaults), and adding indexes can all be done without breaking existing code. The code does need to do the proper thing to make this work, such as writing INSERTs with an explicit column list.

We normally disallow non-backwards compatible changes, such as renaming columns. We only drop tables after renaming and waiting a while (so we can quickly rename back).

When you have a lot of database servers this is pretty important, since trying to keep them all exactly in sync with the same schema at every point in the process is pretty much impossible. While rolling out the change, some servers are always going to finish earlier than others.
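The "INSERTs with an explicit column list" point can be made concrete: if application code always names its columns, a newly added column with a DEFAULT is simply filled in by the database and old code keeps working. A sketch with illustrative names:

```python
# Build an INSERT that names its columns explicitly, so adding a new
# column with a DEFAULT never breaks it. All names are illustrative.

def insert_sql(table, values):
    cols = sorted(values)  # deterministic column order
    placeholders = ", ".join(["%s"] * len(cols))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(cols), placeholders)
    return sql, [values[c] for c in cols]
```

An unqualified `INSERT INTO users VALUES (...)`, by contrast, breaks the moment the column count changes.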

I haven't done a routine deployment in the middle of the night for years. Here's a post (of mine) that's a little longer on details.


I boil it down to this: The safest change to deploy to a stable system is the smallest change possible. Most of my changes don't require schema changes at all.

The application itself, on startup, verifies that the database has the tables/columns it needs in order to work. If it doesn't, it will CREATE TABLE or ALTER TABLE appropriately.
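A minimal sketch of that startup check, using sqlite3 so it is self-contained (a PostgreSQL version would query information_schema instead); the required tables and columns here are made up:

```python
import sqlite3

# Tables and columns the application needs; purely illustrative.
REQUIRED = {"users": {"id", "name", "email"}}

def ensure_schema(conn):
    """CREATE TABLE or ALTER TABLE until REQUIRED is satisfied."""
    for table, cols in REQUIRED.items():
        existing = {row[1] for row in
                    conn.execute("PRAGMA table_info(%s)" % table)}
        if not existing:
            conn.execute("CREATE TABLE %s (%s)" %
                         (table, ", ".join(sorted(cols))))
        else:
            # additive only: never drop or rename, so rollback stays safe
            for col in sorted(cols - existing):
                conn.execute("ALTER TABLE %s ADD COLUMN %s" % (table, col))
```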

I try to avoid backwards-incompatible schema changes whenever possible, so I can rollback. It's always safer and easier to add a new table or column than to delete/rename an existing one. Something wrong with the code? Rolling back won't send you back to an incompatible state.

I use an ORM instead of stored procedures because I find it a lot friendlier to this general process: you don't have procs that expect a particular parameter signature.

You may need to decouple your db-changes deployments from your non-db-changes deployments. Doing that can at least make the non-db-changes deployments a lot less painful.
