Race-condition-free deployment with the "symlink replacement" trick (gist.github.com)
63 points by Deejahll on Oct 22, 2012 | 37 comments


Ok, one problem solved. Now what's left: schema changes, making sure ongoing process flows can automatically migrate from the previous version to the new one, that resources referenced from the previous version are still valid, and that nothing reads the code files through the link (it can change mid-request).

I got the strange feeling from that article that changing the code files was the hardest thing about the upgrade.


you are certainly right. db schema updates are the hard part.

for db schema updates: have a look at sqitch by postgres' david wheeler. it should also support mysql (or will in the future).


This is a good system at face value, but can present other problems. Any code that has a stat cache (I know PHP does, and I'd be surprised if other common languages don't) suddenly doesn't realize that your paths are going to a different place. Because /var/www as your "base" directory is symlinked to /var/www.a, www.a is cached and when you swap your symlink to www.b to deploy the next version, anything relying on that stat cache (include directives, autoloaders, etc) suddenly starts pulling in the wrong version of the file.

Solving this in a way that doesn't require restarting any services and doesn't introduce any more race conditions is nontrivial, although it tends to work pretty reliably so long as you have a front controller and it very rarely changes. Basically it comes down to a symlink change detector: after your deploy script runs, you hit a special (internal) URL which kills the stat cache. If people are interested I can post a more concrete example.
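To give the flavor of it, the deploy-side half is roughly this (endpoint name illustrative; the handler is assumed to clear PHP's stat/realpath and opcode caches):

  # final step of the deploy script, right after the symlink swap
  curl -fsS http://127.0.0.1/_internal/flush-stat-cache \
      || echo "warning: stat cache not flushed" >&2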


I work for a pretty big PHP shop. Our opcode cache has some issues with this, but our stat cache is ok. Inside the code we resolve the symlink, so internally we point to files like ~/releases/<id>/codebase/HTML/index.php instead of ~/codebase/HTML/index.php

Because of the opcode cache issues we still do a rolling restart of the Apaches, but we don't run into them often.

We use `rename` instead of `mv`. Don't know why :)

edit: this also handles the problem of suddenly referencing different files mid-request, because you don't go through the symlink again once it has been resolved.


I've been using this pattern for almost 2 years, and indeed it presented a problem at first with the bytecode cache. We added a simple post-deploy script that flushes APC, which fixed it.


I wonder what problem this is really solving. I mean, the delete+create in a script happen in quick succession, so the window of inconsistency is really very short. If this is an actual problem for you, chances are that your setup is rather large and you have multiple nodes behind a load balancer. In that case, you have bigger issues, such as making sure the individual nodes are updated at the same time. Usually this would be solved by taking them out of rotation while updating, in which case the atomic symlink switch becomes moot.


The problem is that the issues created by out-of-sync code that happens to get loaded in the wrong order can be an absolute nightmare to debug. You'll have transient, unreproducible, potentially data-destroying bugs which vary with each release. Some releases you might get lucky, some not. If you don't think about atomicity of the deployment, you can chase your tail for days trying to figure out what went wrong.

That being said, this strikes me as more of a pain under the traditional PHP model, where reloading code from disc per request is normal, than for something like Rails which loads everything into memory once at launch.


With APC (and stat disabled) PHP behaves in exactly the same way. Bytecode is kept in memory between requests and you can then safely push a new version of your directory tree. All that is required is to flush the cache to have the new version go live.


At my old job our webapp handled 6+ million requests per day, and during the 8hr peak period it handled just over 100 requests per second. When you're dealing with request rates like that, even really brief periods of inconsistency will cause problems for someone.

We handled deployments a different way though. Each release went into a different directory and was made available under a different url, and users got redirected to the newest release when they logged in. Once in a session they stayed with the same version until they logged out. This also allowed us to do limited deployments; we could choose which version each customer group was sent to.


Yeah, mv is atomic but I believe that a crash can still leave your data inconsistent due to write reordering, see:

http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/2...

I would do a fsync() before switching the symlink to the new dir.
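Roughly, as a sketch (paths illustrative; plain sync is the coarse hammer, fsync()ing just the new tree would need a small helper):

  rsync -a ./build/ /var/www.b/    # write the new tree alongside the old one
  sync                             # flush pending writes before the switch
  ln -s /var/www.b /var/www.tmp && mv -T /var/www.tmp /var/www   # atomic swap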


yes. that's a good idea if you care about it. the good part about this solution is that even if the newly deployed file tree becomes corrupted, you can simply point the symlink to the old (certainly not corrupt) tree. that is easy and fast error recovery. an fsync in between makes the window very small, so that only the symlink might get broken.


This system ensures the server is always in a consistent state, but client race conditions are still possible if the "old" index.html references an asset that isn't available after the deployment has occurred. Is there any good way of dealing with this? (I just ignore it...)


There is a simple way to solve the original problem and the one you mentioned: a load balancer. One that you have a good amount of control over and that has an API (such as Zeus, F5...)

You basically take the nodes off one at a time, wait for connections to finish, sync over code, then bring it back up. This does make some assumptions about assets -- that they are in a different location, such as a CDN or static server. If you are removing assets, you need to do this at the very end of this node-syncing process, so that any live "old nodes" aren't linking to deleted assets.

As for newly updated assets, you should be doing versioning for those anyway (even this 'symlink trick' fails when multiple application servers are involved and there is no shared code space).


Version your client assets. E.g. <img src="/img/v2.12.35/logo.gif" /> <link rel="stylesheet" type="text/css" href="/css/v2.12.35/frontpage.css" />

As someone mentioned before, hard-linking old and new deploy files means duplicated content doesn't cost any disk space. Rotate out old deploys past X days, and use strong cache controls to expire the content quickly.


In the past I've used <base> tags to "version" a deploy. The index.html was in the root directory and all other assets (including API endpoints, as it was a single-page app) were in a timestamped subdirectory, with a <base> tag pointing to that subdirectory.

It seemed to work well, but there are issues with <base> tags you should read up on first.


Two phases. Leave possibly referenced assets for a while until use of cached old files has become rare enough, then remove them.


Sure, but this approach is going to be much more complicated and less predictable than the "switch the docroot to a completely different directory" model. (e.g. you need to distinguish between asset and non-asset directories, and make a decision about how long to keep "old" assets around.) With some effort these problems can be solved, I'm just pointing out that even with a completely static site, and perfectly atomic docroot switching, you can still end up with clients in an inconsistent state.


We use this technique at Grinding Gear Games for our deployments, but here are a few assorted details about how we are set up.

The first is that we find it's a good idea to have your release directories on the server named after tags from your VCS. Each time we want to do a deploy we just make a tag, and the deployment script takes the name of the tag to deploy as its argument. It's very easy to see what version is deployed on a server by just looking at where the symlink points.

The second is that you should use rsync with the --link-dest option. --link-dest lets you specify a previous directory from which rsync can hard-link files that haven't changed. For example, if you have a new version to deploy in a directory called "0.9.10/2" and on the remote server you have "0.9.10/1" currently deployed, you can "rsync 0.9.10/2 server:0.9.10/2 --link-dest 0.9.10/1". This creates a new directory tree in /2 with all the files that didn't change from /1 hard-linked, but with new copies for the files that did. It saves a lot of disk space and it means you can keep versions around on the server for as long as you feel the need to.
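Concretely, the invocation looks something like this (paths illustrative; note that --link-dest wants an absolute path, or one relative to the destination directory):

  rsync -a --link-dest=/srv/releases/0.9.10/1/ \
      ./0.9.10/2/  server:/srv/releases/0.9.10/2/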

As our deployment is ~8GB this is quite important for us. It means we can keep releases sitting on the server going back quite a while.

The third thing is setting something up so you can have simple versioning of your deployment scripts.

We have a script that drives this whole process called "./realmctl". Deployment is split into a four-step process, with scripts like these in each release dir:

./0.9.10/1/prepare (create/upload new release)

./0.9.10/1/stop (stop existing servers)

./0.9.10/1/deploy (change symlinks over to this release)

./0.9.10/1/start (start servers)

Each release contains its own version of these scripts. That means if you issue a command like "./realmctl restart --release=0.9.10/2", the driver can find the stop script for the current version, then run the deploy and start scripts for the new version. This way, if your deployment process changes between versions you can still freely move between them without needing to worry about the version of your deployment scripts.
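A rough sketch of how that dispatch works (names illustrative, not our actual script; the symlink name and the --release parsing are simplified away):

  current=$(readlink current)    # e.g. ./0.9.10/1
  target="./$RELEASE"            # from --release=0.9.10/2
  "$current/stop"                # the current release knows how to stop itself
  "$target/deploy"               # the new release flips the symlink to itself
  "$target/start"                # the new release starts its servers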

The last thing is that if you're writing something similar, it's really nice for your scripts to have some idea of the different parts of your infrastructure so that they can be controlled independently. It's really useful to be able to say something like "./realmctl restart all poe_webserver" (restart webserver processes on all servers) or "./realmctl stop ggg4 poe_instance" (stop the game instance servers on ggg4). Those kinds of commands are really useful during an emergency.


Nice. It's rare these days to see a shop manage its deploys like application releases.

Do you do staged production deploys of new code for small groups of users? I found it was beneficial to be able to test a change on a random subset of users so if there's a production-only bug it doesn't hit everyone at once.

This also allows you to not have to "stop" the app servers because you're starting up the new version's instance in parallel with the old. The frontend just passes user-specific requests to the new instance and the old instance keeps chugging along with no downtime. Of course this usually requires no schema changes (unless you have lots of spare infrastructure handy).


We don't do that on the production realm, but it's kind of because we are a game and patches are a big deal for our community. It's not like most websites where you often don't know when patches are coming or what they changed. We keep full change logs here: http://www.pathofexile.com/forum/view-forum/366

It's worth bearing in mind that we are actually deploying an application that they play on their desktop machines, it's just that our website is tightly integrated with the live realm so they are deployed together in the same deployment system.

What we do have as a game though is the ability to have a separate alpha realm that we can deploy to for testing a release and we have a trusted set of our player base that is allowed access to it.

So here is the list of realms we have:

Testing (Local continuously integrated deploy of trunk. Updated every commit)

Staging1 (Local staging for the next major patch)

Staging2 (Local staged copy of whatever is on production. This is used for when we want to test bugfixes to production)

Alpha (Deploy of the next major patch for some community members to play and test in advance. This is deployed alongside the production realm on the live servers.)

Production

All of that said though, we are adding the ability very soon for the backend to be able to spawn game instance servers for multiple versions of the realm. This would mean that we can deploy a game patch without a restart (assuming the backend didn't change). Old clients would get old game instance servers but as players restart their game client and patch, they will get on new game instance servers.


I'd prefer to just bring the machine out of rotation, redeploy, bring it back.


Rex [1] already supports deployment (and rollbacks) using symlink replacement.

[1] http://rexify.org/modules/application_deployment.html


Isn't this guy reinventing a simplified wheel? We already have tools like Capistrano and Fabric or Rex (which does quite a bit more than just application deployments).


That's like saying: why understand how writing a file to disk works when we have things like Notepad and Emacs? What do you think these tools do under the hood?


After my experience of Capistrano and Fabric, I think it's a wheel that needs to be reinvented. Capistrano: hardcodes way too many things, underdocumented. Fabric: reimplements ssh (poorly). Moving from Capistrano to a custom script I ended up with less code and higher maintainability.


This seems a bit oversimplified. Sure, there are no race conditions if there are no interactions between files, but if there are then swapping the symlink mid-request on requests that are already in progress will cause all sorts of race conditions.


Not necessarily. On a Linux-based system, you're actually going to be operating on inodes rather than path names, so once you have a handle to a file, changing the path to it isn't going to affect your ability to interact with it. Just make sure that your new directory is set up so that the same request for the same target file ends up requesting the same inode and you shouldn't notice any problems.
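You can see this from a shell: a descriptor opened before the swap keeps pointing at the old inode (paths illustrative):

  exec 3< /var/www/current/index.html              # fd 3 now references the old inode
  ln -sfn /var/www/releases/new /var/www/current   # repoint the symlink
  cat <&3                                          # still prints the old file's contents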


"Just make sure that your new directory is set up so that the same request for the same target file ends up requesting the same inode"

If that is your goal, why do the mv at all?

...until your program needs to work with more than one file and the two are related in some way or opens 'the same file' twice.

For example, your compiler could fetch file X from the 'old' directory and file Y from the 'new' one, or your web server could log that it fetched file Z (which does not exist in the 'new' directory) to a log file in the 'new' directory.

It may be possible to mitigate this by requiring everybody to access all files through a directory handle you opened atomically for them, but even if it is: good luck enforcing that rule, especially when using third party libraries.


But what if you don't have a handle to a file? Say you have index.php, which includes stuff.php. Order of operations:

  1. Server opens (old) index.php and begins executing
  2. Symlink swap
  3. Interpreter gets to the line that includes stuff.php, and then opens the *new* file


That's precisely why the classical PHP model is kinda stuffed under this situation.

If you want to stick with the traditional disc-read-per-request model, I'd be interested to see if something like a blue/green code deployment could work. You'd have two separate html roots - say, with /var/www/blue being current. You deploy your new code to /var/www/green, update httpd.conf to point there, then SIGHUP apache. The next deployment would switch back from /var/www/green to /var/www/blue. That way every request sees a consistent deployment from start to end.
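A back-of-the-envelope version, assuming the DocumentRoot lives in a single conf file (paths illustrative):

  rsync -a build/ /var/www/green/                  # stage the new code
  sed -i 's|DocumentRoot /var/www/blue|DocumentRoot /var/www/green|' \
      /etc/apache2/sites-enabled/site.conf
  apachectl graceful                               # finish in-flight requests, then pick up the new config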


If you use relative paths, step 3 will include the old file. I'm not sure if that's a problem in PHP, but everything I've used handled that fine.


mv relies on an operation "similar" to rename() as defined by POSIX, which specifies that it must be atomic.

So the assumption "on Unix, mv is an atomic operation" is not quite true: mv is only atomic if your underlying FS is fully POSIX-compliant.

I think it's important to stress this because there are distributed filesystems that try to be POSIX-compliant but do not guarantee atomic renames, and on those this trick would not work well.
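For reference, on Linux with GNU coreutils you can watch the symlink swap boil down to a single rename() call, roughly (output abbreviated, paths illustrative):

  $ strace -e trace=rename mv -T /var/www.tmp /var/www
  rename("/var/www.tmp", "/var/www") = 0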


Any deployment tool (fab, Capistrano, etc.) should do this.

My preferred layout is

./releases/<datetime, rev, or whatever stamp makes sense for you>

./current symlink to ./releases/<foo>

Keeping releases in a directory by themselves makes it easy to list them, archive old ones, etc.


Telling the server about the new root instead (which needs the relevant permissions, as the author points out) would remove the symlink traversal on every access?


If you are on a shared host and can't restart your webserver, you probably don't have the ability to set the document root, either.


Is it just me, or does this seem like a lot of work just to avoid having assets be inconsistent for "some number of milliseconds"?


Especially when browsers are stateless and request things at different times anyways. If you have a lot of requests, you can expect a few of them to get the HTML from one version and the JavaScript and CSS from another, no matter your strategy.

I still think this is a valuable deployment strategy, just because you can roll back and switch deployed versions easily, which is always useful. And it's certainly better than rsyncing to a live directory, at any rate.




