I'm sure this is a noob reaction, but can anyone point me to some good technical walkthroughs about how to deploy to live without taking your site down or interrupting your database connection? Is there a tool/term/practice I'm missing?
1. distribute new package to all servers;
2. run an additional application service on all servers and run some quick tests on each of them to verify proper working;
3. add new application servers to load balancers;
4. upgrade the data model to the new version (we use PostgreSQL, so this happens in a single DB transaction; and remember that our new version x is compatible with both the current and the previous data model);
5. remove old application services from load balancers;
6. upgrade successful.
If anything goes wrong, we can roll back each of these steps. Note that this whole process, perhaps needless to say, is fully automated.
By running multiple versions inside the load balancer at the same time, and having the requirement that version x + 1 is always compatible with version x, this procedure allows us to seamlessly upgrade to a new version without any downtime.
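That "version x + 1 must be compatible with version x" invariant is the whole trick: while old and new application services are both behind the load balancer, the code has to tolerate either data model. A minimal sketch of what that tolerance can look like, using sqlite3 as a stand-in for PostgreSQL (table and column names are made up for illustration):

```python
import sqlite3

def fetch_display_name(conn, user_id):
    # Discover which columns exist, so the same code runs correctly
    # against either the old schema or the new one.
    cols = {row[1] for row in conn.execute("PRAGMA table_info(users)")}
    if "display_name" in cols:  # new schema
        row = conn.execute(
            "SELECT COALESCE(display_name, name) FROM users WHERE id = ?",
            (user_id,)).fetchone()
    else:  # old schema: only the original column is present
        row = conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0] if row else None

# The same function works against the old data model...
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
old.execute("INSERT INTO users VALUES (1, 'alice')")
print(fetch_display_name(old, 1))  # alice

# ...and against the upgraded one, without a redeploy in between.
new = sqlite3.connect(":memory:")
new.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, display_name TEXT)")
new.execute("INSERT INTO users VALUES (1, 'alice', 'Alice A.')")
print(fetch_display_name(new, 1))  # Alice A.
```

In practice you'd bake this into the data-access layer rather than probe the schema per query, but the shape of the requirement is the same.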
In most specific cases, there are alternatives to stopping the show during an upgrade. First, you check your backups! Then, there are various strategies. You can make many simple schema changes while the system is running. Or put the DB in read-only mode and still mostly manage to serve pages -- perhaps with an occasional HTTP 503, so hopefully that's okay -- during your upgrade. Or phase in changes over several releases, carefully architecting for backward compatibility. Or bring up a parallel system and gradually migrate active users to it. Depending on what you're doing, you may find yourself having to write a special upgrade script that migrates old formats to new formats, or even using database triggers to keep "old" and "new"-format tables in sync during the transition. A well-constrained database may help keep things sensible during the transition -- or, you may have to drop half of your supposedly-sensible constraints just to make the transition work.
Moral: Even if you know exactly what you're doing, live updates are more work: More planning, more code, more infrastructure, and/or more stress. In many real-world cases, you should just take the system down for a minute. Focus your engineering effort on making sure that "minute" is as short as possible, and in making sure that you can detect problems and roll back as quickly as possible.
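To make one of those transition tricks concrete: here's what "database triggers to keep old- and new-format tables in sync" can look like, sketched with sqlite3 standing in for a real database. All the table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders_old (id INTEGER PRIMARY KEY, amount_cents INTEGER);
CREATE TABLE orders_new (id INTEGER PRIMARY KEY, amount_cents INTEGER,
                         currency TEXT DEFAULT 'USD');

-- While both app versions are live, writes to the old table are
-- mirrored into the new one, so neither version sees stale data.
CREATE TRIGGER mirror_insert AFTER INSERT ON orders_old
BEGIN
    INSERT INTO orders_new (id, amount_cents)
    VALUES (NEW.id, NEW.amount_cents);
END;
""")

conn.execute("INSERT INTO orders_old (id, amount_cents) VALUES (1, 500)")
row = conn.execute("SELECT amount_cents, currency FROM orders_new").fetchone()
print(row)  # (500, 'USD')
```

A real migration would need triggers for UPDATE and DELETE too, plus a backfill of pre-existing rows, which is exactly the "more work" the moral above is warning about.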
> the general answer is "don't change the database
> structure of a database that's serving live traffic:
> You'll never enumerate all the things that could
> possibly go wrong."
"Don't do it" is not really an answer.
> Or put the DB in read-only mode and still mostly
> manage to serve pages
The reason there aren't a lot of general-purpose walkthroughs is that there is no general case.
The closest thing we have to a general-purpose solution is: Take the system down, do the upgrade, put the system back up. But, yes, this is often not a very good answer. It is often a lousy and expensive answer. In which case you hire engineers to build a better strategy. And I can't tell you, dear reader, what that better strategy is going to be, because it's different from case to case, and I don't know what your problem is.
And, yes, of course you shouldn't use the sloppy strategies on your financial-transaction processing app. Just as you probably shouldn't spend three engineer-weeks designing a complex zero-uptime rollout strategy for your blog comment system.
As for your second question, if you have a load balancer, you can always take nodes out of it in order to update them, then re-enable them in the LB and move on to the next one. It's called a rolling upgrade, and the ease and details of doing so depend on the actual pieces involved.
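The rolling-upgrade loop itself is simple; the work is in the pieces it calls. A toy sketch, where the load balancer and health check are stand-ins for what would really be API calls to your LB and monitoring:

```python
class FakeLB:
    """Stand-in for a real load balancer's API."""
    def __init__(self, nodes):
        self.active = set(nodes)
    def remove(self, node):
        self.active.discard(node)
    def add(self, node):
        self.active.add(node)

def rolling_upgrade(nodes, lb, upgrade, healthy):
    for node in nodes:
        lb.remove(node)        # stop routing traffic to this node
        upgrade(node)          # deploy the new package to it
        if not healthy(node):  # verify before putting it back
            raise RuntimeError(f"{node} failed health check; halting rollout")
        lb.add(node)           # back into rotation; on to the next

versions = {}
lb = FakeLB(["app1", "app2", "app3"])
rolling_upgrade(sorted(lb.active), lb,
                upgrade=lambda n: versions.__setitem__(n, "v2"),
                healthy=lambda n: versions.get(n) == "v2")
print(sorted(lb.active), versions)
```

Note the loop halts on the first unhealthy node, which leaves the rest of the fleet on the old version and still serving -- that's the rollback-friendly property you want.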
We normally disallow non-backwards compatible changes, such as renaming columns. We only drop tables after renaming and waiting a while (so we can quickly rename back).
When you have a lot of database servers this is pretty important, since keeping them all exactly in sync with the same schema at every point in the process is pretty much impossible. While the change is rolling out, some servers will always finish earlier than others.
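That "no renames, add then drop later" policy is often called expand/contract. A sketch of a backwards-compatible "rename" under those rules, again using sqlite3 with illustrative names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

# Release N: additive change only. Old code reading `login` keeps
# working on every server, regardless of rollout order.
conn.execute("ALTER TABLE users ADD COLUMN username TEXT")
conn.execute("UPDATE users SET username = login")

# Release N+1 reads `username`. The old `login` column is only
# dropped after a waiting period, so rolling back is still cheap.
print(conn.execute("SELECT username FROM users WHERE id = 1").fetchone()[0])
```

The contract step (actually dropping `login`) ships in its own later release, once nothing reads the old column.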
I boil it down to this: The safest change to deploy to a stable system is the smallest change possible. Most of my changes don't require schema changes at all.
The application itself, on startup, verifies that the database has the tables/columns it needs in order to work. If it doesn't, it will CREATE TABLE or ALTER TABLE appropriately.
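A minimal version of that startup check, assuming a simple declarative map of required tables and columns (all names here are made up; sqlite3 stands in for the real database):

```python
import sqlite3

# What the application expects the schema to contain.
REQUIRED = {"events": {"id": "INTEGER PRIMARY KEY",
                       "payload": "TEXT",
                       "created_at": "TEXT"}}

def ensure_schema(conn):
    for table, columns in REQUIRED.items():
        # Create the table if it's missing entirely.
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} "
                     f"(id INTEGER PRIMARY KEY)")
        existing = {r[1] for r in conn.execute(f"PRAGMA table_info({table})")}
        # Add any columns the running code needs but the DB lacks.
        for name, coltype in columns.items():
            if name not in existing:
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {coltype}")

conn = sqlite3.connect(":memory:")
ensure_schema(conn)
print([r[1] for r in conn.execute("PRAGMA table_info(events)")])
# ['id', 'payload', 'created_at']
```

Because every operation is additive and idempotent, it's safe for several app servers to run the same check concurrently-ish against the same database -- which fits the no-backwards-incompatible-changes rule in the next paragraph.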
I try to avoid backwards-incompatible schema changes whenever possible, so I can rollback. It's always safer and easier to add a new table or column than to delete/rename an existing one. Something wrong with the code? Rolling back won't send you back to an incompatible state.
I use an ORM instead of stored procedures, because I find it a lot friendlier to this general process: you don't have procs that expect a particular parameter signature.
You may need to decouple your db-changes deployments from your non-db-changes deployments. Doing that can at least make the non-db-changes deployments a lot less painful.