
Wait, what?

You have large numbers of paying customers to whom you're delivering a mission-critical system (source control isn't exactly optional), and your releases involve neither automated production monitoring/continuous deployment nor formal release procedures?

I think your problem is more than just weekend deployments!

My full comments here: http://swombat.com/2011/3/8/fog-creek-dont-do-cowboy-deploym...

Maybe in the future we can all be IMVU:

Back to the deploy process, nine minutes have elapsed and a commit has been greenlit for the website. The programmer runs the imvu_push script. The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a baseline. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts.
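For anyone who wants to see the shape of that push logic, here is a minimal sketch in Python. It is not IMVU's script: the function names, the 5% canary fraction, and the "three standard deviations above the baseline mean" rule are all assumptions standing in for whatever statistical test they actually run, and the one-minute and five-minute waits are left out.

```python
from statistics import mean, stdev

def regressed(baseline, canary, sigmas=3.0):
    """Return True if the canary mean is more than `sigmas` baseline
    standard deviations above the baseline mean (hypothetical rule)."""
    spread = stdev(baseline) or 1e-9  # avoid a zero threshold on flat data
    return mean(canary) > mean(baseline) + sigmas * spread

def push(sample, go_live, rollback, canary_fraction=0.05):
    """Sample a baseline, go live on a small subset, re-sample, then either
    roll back or push to the whole cluster.

    `sample`, `go_live` and `rollback` are caller-supplied callables
    (assumed names); `sample` returns a list of metric readings, e.g.
    error counts per machine.
    """
    baseline = sample()           # metrics before any machine runs new code
    go_live(canary_fraction)      # switch the symlink on a small subset
    if regressed(baseline, sample()):
        rollback()
        return "rolled back"
    go_live(1.0)                  # push to 100% of the cluster
    return "live"
```

A real version would sample several metrics (load, CPU, PHP errors) and keep monitoring after the full push, as the quote describes.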


In fairness, that's a description of what a routine and successful build "should" go like. I bet if IMVU were to post a blow-by-blow account of their hairiest deployment screwup ever, it would be a good bit more colorful than that.

There are some headscratchers in the description of the Fogbugz problem, but kudos to them for explaining how and why things broke.

Tim Fitz's blog is a great source on continuous deployment done right and on finding useful information in a sea of chaos.

I remember talking to him before and after he wrote some of these blog posts and it was fascinating seeing how his attitude regarding failure changed.

The releases are both automated (except for one component, as noted, which we are now automating), and are fully vetted.

Here is the old release process:

1. Monday morning, the version to be used for the next release is automatically built for the QA team, who begins running their test suites on it and doing soft checks.

2. By no later than Wednesday, the new version is leaked to testing and alpha accounts on Fog Creek On Demand. Tests are re-run at this point.

3. The leak is increased later in the week if the QA results look good, or the weekend release is canceled, depending on how testing goes.

4. Provided everything has been good, on Saturday night, the leak is increased to 100% of customers. This step does not have a full QA rundown, because the code has already been vetted several times by QA at this point. The sanity checks are truly sanity checks.

5. At the same time, we monitor that our monitoring system (Nagios) agrees that all accounts are online and that there are no major problems, such as massive CPU spikes.
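To make the "leak" idea in the steps above concrete, here is one way a percentage leak could be decided per account. This is a sketch under assumptions, not Fog Creek's implementation: the hash-bucket scheme, the function names, and the `alpha_accounts` parameter are all hypothetical.

```python
import hashlib

def bucket(account_id: str) -> int:
    """Map an account to a stable bucket in 0..99 (assumed scheme)."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100

def gets_new_version(account_id, leak_percent, alpha_accounts=frozenset()):
    """Decide whether an account is served the new release.

    Testing/alpha accounts get it as soon as any leak is open; everyone
    else gets it once the leak percentage passes their bucket.
    """
    if account_id in alpha_accounts:
        return leak_percent > 0
    return bucket(account_id) < leak_percent
```

Because the bucket is derived from the account ID, raising the leak from 10% to 50% only adds accounts; nobody who already has the new version flips back.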

So far, so good. The issue with this release is that we had a bug that did not manifest for a while, because Kiln had been deliberately designed to ignore the failure condition "as long as possible", which ended up just being too damn long. Once we started having failures, we noticed--that's why our sysadmin called us in--but those failures started happening 20 hours after the 100% release, and several days after testing and alpha accounts were upgraded.

I am not arguing our system is perfect, but I'm a bit nonplussed about where the your-deployment-system-totally-sucks stuff is coming from. I'll ask our build manager to post an even more detailed rundown.

Sincere question: how do you leak irreversible schema changes to a subset of accounts? Isn't the point of the leak that you're not confident and might need to reverse it? Or are you willing to let those accounts get hosed?

Fix it by hand. If it's ten accounts, that's pretty easy. If it's ten thousand, more of a problem.

When you read irreversible, think "very difficult to reverse and not worth the cost of writing and validating code we don't ever expect to run."

Perhaps have Kiln send notifications on the failure conditions even if it doesn't throw an error? Better a few false positives than no indication at all.
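Roughly what I have in mind, as a Python sketch. I have no idea what Kiln's internals look like, so `call_with_alerts`, `notify`, and the retry count are made-up names and values: the point is only that a failure can still be tolerated and retried while every occurrence gets reported.

```python
def call_with_alerts(operation, notify, max_attempts=3):
    """Run `operation`, retrying on failure, but report every failure.

    `operation` and `notify` are caller-supplied callables (hypothetical):
    `notify` receives a short message each time an attempt fails, even if
    a later attempt succeeds and the caller never sees an error.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            notify(f"attempt {attempt}/{max_attempts} failed: {exc!r}")
    # Every attempt failed: stop hiding the problem and surface it.
    raise last_error
```

Wire `notify` to whatever the monitoring already watches, and a backend that is limping along shows up as a stream of alerts instead of silence.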

I agree. My other thought was 'isn't there a staging server in there somewhere?' Something that is near identical to production, with fake production data, etc, that could surface the problem before a customer sees it.

btw, props to Fog Creek and OP for airing their dirty laundry. They take some heat, but in the end we all learn from it.

We have more than staging servers: we have staging accounts. I documented our full release process at http://news.ycombinator.com/item?id=2301680.

It's stunning how easy it is to spot a specific lack of "automated production monitoring" after something fails. Hey idiot, you should've been testing that thing!

I've seen all of Fog Creek's automated production monitoring courtesy of their sysadmins and devs as it was months ago, and it was very solid. I'm sure it's only gotten better.

This is a case of a specific deployment failure slipping through the cracks and being honestly explained, apologized for, and rectified. I'm obviously biased due to my history (and probably-justified guilt for this particular failure), but shotgun criticism about formal release procedures is very misguided.

Two better approaches come to mind to resolve this:

2. Full-on, properly managed releases like they do in large IT corporations, such as banks, where a "release" is not something you kick off from home via SSH on a Saturday night, but a properly planned effort that involves critical members of the dev team as well as the QA team being present and ready to both test the production system thoroughly and fix any issues that may occur.

What you describe in #2 here sounds like a complete anti-pattern when compared with the idea of continuous deployment and automated verification. This 2nd approach sounds like a huge manual effort.

It absolutely is, and I'd be surprised to see this kind of effort from any but the most paranoid corporations (like, as I mentioned, banks). Automation and continuous deployment are definitely the way forward.

But even this gargantuan effort is a better option than just "let's deploy and wait for our users to tell us if anything has gone wrong".

> But even this gargantuan effort is a better option than just "let's deploy and wait for our users to tell us if anything has gone wrong".

To be fair it sounds like in the original article that they did do some verification that things were working after the deployment. However for some reason their verification tests didn't reveal the presence of a real bug.

Even in a more gargantuan system, it's possible to have tests that give false positive results.

Everyone will screw up releases at some point; the key is to be able to learn from them and get better.

If you're making a big change you first cut a CR and get approval of any teams involved. At change time, everyone knows they need to be on-call if something breaks, preferably in a live chatroom.

The rest of the time devs should just deploy when they think the code is ready and have tested it on a production-like box. They then manually verify the change worked. Automated monitoring is there so that when something does break, you are notified immediately.

Their release procedures didn't cover the case, and they're fixing it ("modifying the communication ... [to] fail early and loudly during our initial tests", according to the "with details" post on their status blog[1]).

But I still find their lack of monitors... disturbing.

[1] http://status.fogcreek.com/2011/03/sunday-night-kiln-outage-...

I agree except I don't think continuous deployment necessarily means automatic deployment. Every deploy should be done by a person and tested right after; none of this "push out all commits at X time" or "push as soon as it's committed" as both are risky.

During the day is usually preferred and never at 4:59PM on a Friday or right before everyone goes to lunch (ever had to clean up a downed cluster when some jerk pushed bad code and the whole team went to Sweet Tomatoes? yeah).

To help troubleshoot breaks, have a mailing list with changelogs showing who made a change, time/date, files touched. Also have your deploy tools mail it when there's a code push, rollback, server restart, etc. Have a simple tool someone can run to revert changes back to a time of day so if something breaks just "revert back to 6 hours ago" and debug while your old app is running (nice to take one broken box offline first to test on).
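The "revert back to 6 hours ago" tool can be very small if the deploy tools already log every push. A sketch in Python, assuming a deploy log of (timestamp, revision) pairs; the log format and the name `revision_at` are mine, not from any particular tool:

```python
from datetime import datetime, timedelta

def revision_at(deploy_log, when):
    """Return the revision that was live at time `when`.

    `deploy_log` is an iterable of (datetime, revision) pairs, one per
    code push (assumed format). The live revision is the most recent
    push at or before `when`.
    """
    candidates = [(ts, rev) for ts, rev in deploy_log if ts <= when]
    if not candidates:
        raise LookupError(f"no deploy at or before {when}")
    return max(candidates, key=lambda entry: entry[0])[1]
```

Usage would be something like `revision_at(log, datetime.now() - timedelta(hours=6))`, with the result handed to whatever actually redeploys a given revision.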
