(Thomas J. Watson, IBM CEO)
We launched a product with a wonderful new element in it, which had the capacity to clean clothes better than anything that had ever been seen. Unfortunately, if it wasn’t used in exactly the right spec, it cleaned so well that it cleaned away the clothes as well...I was personally responsible for Unilever’s largest marketing disaster.
He quoted his Chairman’s response to this at the time:
We’ve just invested £350m in your education. If you think you’re going to take that somewhere else, forget it.
He managed to keep his job... just with no more prod access. :)
You have to expect people to make mistakes. I’m not saying he didn’t fuck up, but if a company is down a billion dollars the story should be of multiple people making multiple mistakes.
For what it's worth, a billion dollars for this company, while still a lot, isn't world ending. It was probably insured, too.
Would love to hear more about that :)
Another story from the same place (won't confirm nor deny if it was the same sysadmin) - someone accidentally pushed a puppet config out that changed all hosts' timezones to CDT. A bit later, I received a random and kinda joking text from a buddy from a past job - who couldn't get an answer from my company's tech support - asking why all of their clearing reports had the wrong times.
He was the only person outside of three other sysadmins who ever mentioned anything was wrong. Not even my boss brought it up! It was odd.
Inflation-adjusted, that's like $10 million.
Then blame is irrelevant. It happened, they learned from it, they now have that experience under their belt.
"What happens if we don't spend a million training him, and he stays?"
So yeah, be nice to Ops too, because they actually have experience in stuff like this and one weekend of downtime is not an appropriate price to pay for every developer to learn a lesson.
Also, if something goes down in a way that requires a human to work on the weekend, it should result in a postmortem, and all of the components in the deployment chain related to the failure should be evaluated, with new tasks to fix their causes. If it happens multiple times, all project work should stop until it's fixed.
This of course is balanced against how much failure your business can tolerate. If the service goes down and nobody loses money, do you really need your engineers working overtime to fix it?
Or being afraid to deploy last thing on a Friday is an admission that maybe... just maybe... you're not infallible.
Even with tests, I've seen startups double-charge accounts on deployments, and it didn't show up until the next day. I've also seen ops people updating OES where the storage service would segfault a day later. How do DevOps and OES go together in one sentence, you ask? They don't, but it means not all ops people are pure wisdom either. That guy caused others to waste 72 hours of compute resources. So it's not limited to dev. And yes, the first company did learn from the double-charging bug, but why learn on a Saturday?
So even if your DevOps practices are amazing and you have 70% test coverage on all your components, that doesn't mean you can't deploy faulty components where the deployment itself appears successful. Now what? Things aren't failing, they appear just fine. Someone has to go in and debug the problem, it may affect multiple components, it may be critical, and a simple rollback may not cut it.
Friday deployments are fine for certain components, but surely not as a general rule for everything? Friday deployments are like Monday morning or Friday meetings. You can do them, and most of the time they'll be fine, but maybe out of respect for your colleagues you shouldn't anyway.
In that type of "event"-style deployment, weeknight deploys are probably safe, but anything scheduled for a Friday is trouble.
1) common in agriculture during harvest season
Our philosophy is that if nothing ever breaks in production, you are being too conservative with your controls and development. Put another way: you allocate resources between stability and new features, and (near) 100% testing/verification/auto-healing/rollback coverage means too much of your resources are going to stability and not enough toward new features. Running a service too close to 100% uptime also causes pathologies in downstream services, and if you never have to fix anything manually, the skills you need to fix things manually will atrophy.
Or, for our service,
- There should be a pager with 24 hour coverage, because our service is critical,
- That pager should receive some pages but not too many, so operations stays sharp but not burdened,
- Automation and service improvements should eliminate the sources of most pages, and new development should create entirely new problems to solve,
- If the service uptime is too high, it should be periodically taken down manually to simulate production failures, and development controls should be reevaluated to see if they are too restrictive.
Eliminating all production errors takes a long time and a lot of effort. Yes, we are spending that effort, but the only way this process will actually “finish” is if the product is dead and no more development is being done. The operations and development teams can then be disbanded and reallocated to more profitable work. A healthy product lifecycle, in general (though not in every case), should see production errors until the team is downsized to just a couple of engineers doing maintenance.
Google calls this an "error budget". We have something similar where I work. https://landing.google.com/sre/book/chapters/embracing-risk....
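The error-budget idea mentioned above boils down to simple arithmetic: the gap between your SLO and 100% is downtime you're allowed to "spend". A minimal sketch, with the 99.9% SLO and 30-day window being invented example numbers, not figures from the thread:

```python
# Hypothetical numbers: a 99.9% availability SLO over a 30-day month.
slo = 0.999
minutes_per_month = 30 * 24 * 60  # 43,200 minutes

# The error budget is everything the SLO does NOT promise.
error_budget_minutes = (1 - slo) * minutes_per_month
print(f"Allowed downtime this month: {error_budget_minutes:.1f} minutes")
# Riskier deploys are fine while budget remains; once it's exhausted,
# the team slows down and prioritizes stability work.
```

With 43.2 minutes to spend each month, a team that never touches its budget is arguably shipping too slowly, which is exactly the pathology the comment above describes.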
You can phrase it as “afraid to deploy on Friday”, but I think “afraid to cause outages in production” indicates that the blast radius of your errors is too large or that you’re being too conservative.
I prefer to keep my midnight emergencies to a minimum.
The product has 24/7 pager coverage, but that does not mean that one person has the pager the whole time! At any given time the pager is covered by two or three people in different time zones. The way my team is structured, I will only get paged after midnight if someone else drops the page. And I only have a rotation for one week every couple months or so.
There are definitely employees who don’t enjoy having the pager, but we get compensated for holding the pager with comp time or cash (our choice). The comp time adds up to something like 3 weeks per year, and yes, there are people who take it all as vacation. No, these people are not passed over for promotions. No, this is not Europe.
So the trade off is that seven weeks a year you carry your laptop with you everywhere you go, maybe do one or two extra hours of work those weeks, and don’t go to movies or plays, and then you get three extra weeks off. Yes, it's popular. People like pager duty because they get to spend extra time with their families, because they like to go camping, or because they want the extra cash.
I have once been paged after midnight.
Adequately compensating on-call is, of course, the right way to do it. All sorts of considerations that were otherwise problems, such as how to ensure a "fair" rotation, magically go away.
Unfortunately, it's vanishingly rare, at least among "silicon valley" startups (and maybe all tech companies). I suspect it's one of those pieces of Ops wisdom that's vanished from the startup ecosystem because Ops, in general, is viewed as obsolete, especially by CTOs who are really Chief Software Development Officers.
Insofar as it's a prerequisite to all your other suggestions, it makes them non-starters in such companies.
Although I suppose if the compensation is too generous, there may still end up being complaints about unfairness in allocation.
I worked in a shop like that -- they had such great testing policies that they did continuous deployment, code went from commit to production as soon as the tests passed.
Until the holiday weekend when two code changes had an unexpected interaction and ended up black-holing all new customer activity that weekend (existing customers were fine; they only lost data for new customers).
They could have recovered the data from a log on the front-end servers, but one of the admins noticed an unusual amount of disk space used on the front ends Monday morning and just replaced them all (since their auto-healing allowed this without any interruption of service)... and since those logs were only used for debugging problems, they weren't persisted anywhere.
It turns out that tests aren't perfect - they only test what you think you need to test.
If the service goes down and nobody loses money, do you really need your engineers working overtime to fix it?
Money is not the only way to value a service.... but if the service goes down and no one cares, why run the service at all?
I usually don’t like a blanket don’t deploy on Friday rule.
We can usually roll back with one command, and we have good monitoring and health checks, so even if something makes it into prod, it's super easy to go back.
Unless you have changes like you mentioned: weird side effects, DB schema changes, config changes that affect machine configuration. Those are unknown-unknown changes. Good practice to hold them back.
As for web and asset changes (a change to a CSS file, or a self-contained JS change), those should only re-deploy the files that changed and are generally low risk.
It takes a good deal of design work to get a high level of resiliency, but it's completely within the realm of possibility. Most shops just don't dedicate the effort to it, because they're more worried about shipping new features, and this is understandable. Just different priorities.
Code freezes (and that's what blocking deploys are) are a great tool, but primarily for managing your on-call more effectively.
It's one of the central policies for keeping clients happy and operations running smoothly, and it gets regular review and questions from clients when they are in a hurry. But when it was implemented, stress levels across the board plummeted, and only a small amount of client education was needed before they agreed it was a good policy.
There are emergency circumstances that override this rule of course.
Error discovered -> Call person responsible -> Roll back
If you deploy on Friday, run out the door, and soon find out that your contribution to a deployment caused an outage, wouldn't you immediately return to work to at least give the appearance of personal responsibility? (on any day of the week, even)
Also, wouldn't you just do a rollback to the last viable build?
- Brian recalls a time he was treated with kindness, and remembers to be kind to others.
- Boz recalls a time he was almost fired due to not being kind, and remembers to be kind to others.
So remember, be kind!
- 2475 points, 2 years ago: https://news.ycombinator.com/item?id=12707606 (Brian)
- 1198 points, 3 years ago: https://news.ycombinator.com/item?id=9534310 (Boz)
I still laugh thinking about how the president, who had never showed any emotion before and was as serious as they come, brought in a Burger King meal for me while I was up late working on the restore.
- Messed up the Software Engineering dept.'s Jira when I tried to customize my own dept.'s Jira too much. Took hours to fix, during which time the Software Engineering dept. couldn't do any ticketing actions.
- `sudo shutdown -h now` on a remote battery control PC because my terminal was still logged into it via SSH from 5 hours earlier. I was trying to shut down my laptop at the end of the day, and I don't like using the buttons when I have Tilda hotkeyed to 'F1'. We had to send a technician to the site 2 days later at a cost of several hundred bucks, and the battery was not operational for those 2 days during the busiest operational window we'd had in recent months, so we lost money there too. I've done several more things of this sort that required a tech visit.
- Forecast that a certain day would be the best day to do a specific operational test, but then fucked up inputting the ISO-format date/time string so it started at 7PM on a Saturday rather than 7AM (I know the format pretty well now: `2018-08-24T23:15:00-07:00`).
- Forecast that a certain day would be the best day to do a specific operational test, but fucked up the script, so it mis-calculated everything and the forecast turned out to be worse than useless (lost the company $30k over 4 hours).
Luckily, my company was fine with all this and we learned a ton from it (other people made similar mistakes, too), so it was useful in some way. I am also way more careful and deliberate about anything now, no more "iterative-keyboard-banging" (my original programming style) in Python while connecting directly to the production database!
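The 7 PM versus 7 AM mixup above is exactly the kind of mistake a tiny sanity check can catch: parse the ISO-8601 timestamp back and fail loudly if the hour isn't what you intended. A sketch (the function name and the example timestamps are invented for illustration):

```python
from datetime import datetime

def check_start_hour(stamp: str, expected_hour: int) -> None:
    """Parse an ISO-8601 timestamp and raise if the hour is wrong."""
    parsed = datetime.fromisoformat(stamp)  # handles the -07:00 offset
    if parsed.hour != expected_hour:
        raise ValueError(
            f"expected start hour {expected_hour:02d}, "
            f"got {parsed.hour:02d} ({stamp})"
        )

check_start_hour("2018-08-24T07:00:00-07:00", 7)   # OK: 7 AM as intended
# check_start_hour("2018-08-24T19:00:00-07:00", 7) # would raise: 7 PM typo
```

Two lines of validation before kicking off a scheduled operational test is a lot cheaper than running it twelve hours late.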
It's come to the point where I've acquired a nagging suspicion that this is how it needs to be. That 'to be kind' will always be icing on the cake so to speak, no more. Maybe I've grown too cynical.
This article reflects the difference in how some people have treated me when I have told them I have made a mistake. When I make a mistake now, those who treated me nicely I will tell without hesitation. But when they have been less than nice to me about past failures, I will consider not telling them I made a mistake.
Was a wake-up call. Taught me to be far more careful when deploying and testing software.
The admin taught me about find and xargs, a lesson I have remembered for the last 15 years.
1. Be more careful next time so that you're less likely to make the same mistake again.
2. Fix the system so that nobody can make that kind of mistake again.
Learning lesson #1 means that you're less likely to make the same mistake, but learning lesson #2 will prevent you and your whole team from ever making it again.
For something that's as easy to test as "is the site working", the real lesson there is that you set up your deployment system so that the website needs to respond to a health check before the deploy finishes.
(I realize this is nitpicking and isn't the point of the article, just thought I'd mention it :P)
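A deploy gate of the kind described above can be quite small. A minimal sketch, with the health URL, retry counts, and the `rollback()` call all being invented placeholders rather than anything from the comment:

```python
import time
import urllib.error
import urllib.request

def wait_until_healthy(url: str, attempts: int = 10, delay: float = 3.0) -> bool:
    """Poll `url` until it returns HTTP 200, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service not up yet; retry after a short pause
        time.sleep(delay)
    return False

# A deploy script would refuse to finish (and roll back) on failure:
# if not wait_until_healthy("https://example.com/health"):
#     rollback()
```

The point isn't the polling loop itself but where it sits: the deploy doesn't count as done until the site has answered.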
Though somewhat beside the point: if you have a dedicated test team, know that I don't trust my test infrastructure to catch all your screw-ups any more than you're confident there won't be any. If it passes the tests, I'm pretty confident we didn't break anything, but it can wait until Monday, right?
On to the point, everyone has to do this once. We've all got a story (I have an anthology). If you're confident that a lesson has been learned, no reason to belabor the point.
I have also been frighteningly, unforgivably unkind and I can tell you, being kind is better. I still shudder when I think about how unkind I was.
Good thing IT kept excellent backups! And despite this I didn't "learn" to fire on first fuckup.