Here is the old release process:
1. Monday morning, the version to be used for the next release is automatically built for the QA team, who begins running their test suites on it and doing soft checks.
2. By no later than Wednesday, the new version is leaked to testing and alpha accounts on Fog Creek On Demand. Tests are re-run at this point.
3. Depending on how testing goes, the leak is increased later in the week if the QA results look good, or the weekend release is canceled.
4. Provided everything has gone well, on Saturday night the leak is increased to 100% of customers. This step does not get a full QA rundown, because the code has already been vetted several times by QA; the sanity checks at this point are truly sanity checks.
5. At the same time, we verify that our monitoring system (Nagios) agrees that all accounts are online and that there are no major problems, such as massive CPU spikes.
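The process above amounts to a rollout gate: each increase in the leak happens only if the health checks pass, and a failure cancels the release. A minimal sketch of that idea in Python (stage fractions, function names, and metric keys are all illustrative assumptions, not Fog Creek's actual tooling):

```python
# Illustrative staged-rollout gate: advance the "leak" through increasing
# fractions of accounts, stopping and rolling back if health checks fail.

STAGES = [0.01, 0.10, 1.00]  # fraction of accounts on the new version (assumed values)

def healthy(metrics):
    """Sanity checks: all accounts online, no major CPU spikes."""
    return metrics["accounts_offline"] == 0 and not metrics["cpu_spike"]

def advance_rollout(get_metrics, set_leak):
    """Walk the release through each stage; cancel on any failed check."""
    for fraction in STAGES:
        set_leak(fraction)
        if not healthy(get_metrics()):
            set_leak(0.0)  # cancel the release: no accounts on the new version
            return False
    return True
```

The point of structuring it this way is that the expensive vetting happens at the small leak fractions, so the 100% step only needs the cheap sanity checks.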
So far, so good. The issue with this release was a bug that did not manifest for a while, because Kiln had been deliberately designed to ignore the failure condition "as long as possible", which ended up being too damn long. Once we started having failures, we noticed--that's why our sysadmin called us in--but those failures started happening 20 hours after the 100% release, and several days after the testing and alpha accounts were upgraded.
I am not arguing our system is perfect, but I'm nonplussed as to where the your-deployment-system-totally-sucks stuff is coming from. I'll ask our build manager to post an even more detailed rundown.
When you read "irreversible", think "very difficult to reverse, and not worth the cost of writing and validating code we never expect to run."