The follow-up doesn't bullshit with "extra training to make sure no one does this again", it says (effectively) "we're going to make this impossible to happen again, even if someone makes a mistake".
Enter Frans Plugge. Whenever a customer would get into that mode we'd fire Frans. This was easy, simply because he didn't exist in the first place (his name was pulled from a skit by two Dutch comedians, bonus points if you know who and which skit).
This usually caused the customer to backpedal and insist that he/she never meant for anybody to get fired...
It was a funny solution and we got away with it for years, for one because it was pretty rare to get customers that mad to begin with and for another because Frans never wrote any blog posts about it ;)
But I was always waiting for that call from the labor board asking why we fired someone for whom there was no record of employment.
It irks me that businesses fire people because of pressure from clients or social media. But having never been the boss, I may be missing something.
Internal repercussions notwithstanding, externally the company is a united front. It cannot excuse mistakes as luck, accident, or happenstance, because the world includes luck, accidents, and happenstance, so any user-visible error is ipso facto a failure of management.
It's still mind blowing and very amusing that this is a thing in our world!
Cause that sounds pretty great.
Look at the recent GitLab incident - one guy messed up and nuked a server. Okay, that happens sometimes, go to backups. Uh oh, all the backups are broken. Minor momentary problem just turned into a major multi-day one.
That's a problem, and one which might be preventable with training (or, arguably, firing and hiring). Maintaining your backups properly should be someone's duty, and so should designing and testing systems to minimize the impact of user error.
If someone doesn't test their backups, you train them to test backups. If someone lies about testing the backups, maybe you fire them. But if someone trips and shatters the only backup disk, you don't yell at them - you create backups that an instant of clumsiness can't ruin.
I did overstate; training is perfectly reasonable, but I often see it cited exactly when it shouldn't be: as a solution to errors like typos or forgetfulness.
Instead, you make a machine verify the backups simply by using the backups all the time. For example, at work I feed part of our data pipeline with backups: Those processes have no access to the live data. If the backups break, those processes would provide bad information to the users, and people would come complaining in a matter of minutes.
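The shape of it is roughly this (a toy sketch; the paths, table, and SQLite format are made-up stand-ins for whatever the real pipeline uses):

    # Toy sketch: a reporting job that reads from the latest backup instead of
    # the live database, so a broken or stale backup shows up as obviously
    # wrong numbers within minutes. All names here are invented.
    import glob
    import os
    import sqlite3

    BACKUP_DIR = "/backups/orders"          # assumed location of nightly dumps

    def latest_backup():
        dumps = sorted(glob.glob(os.path.join(BACKUP_DIR, "*.sqlite")))
        if not dumps:
            raise RuntimeError("no backups found at all")
        return dumps[-1]

    def orders_yesterday():
        # Reading the backup *is* the ongoing test: if the dump is corrupt,
        # this query fails; if it's stale, the numbers look wrong to users.
        conn = sqlite3.connect(latest_backup())
        try:
            (count,) = conn.execute(
                "SELECT COUNT(*) FROM orders WHERE created >= date('now', '-1 day')"
            ).fetchone()
            return count
        finally:
            conn.close()

    if __name__ == "__main__":
        print("orders yesterday (computed from backup):", orders_yesterday())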
Just like when you have a set of backup servers, you don't leave them collecting dust, or tell someone to go look at them every once in a while: you just route 1% of the traffic through them. They are still extra capacity, you can still do all kinds of things to them without too much trouble, but you know they are always working.
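The routing side of that can be as dumb as this (a sketch; pool names invented, and a real setup would do it in the load balancer rather than application code):

    # Sketch: send roughly 1% of requests to the standby pool so the standbys
    # are continuously exercised instead of gathering dust. Names are invented.
    import random

    PRIMARY_POOL = ["app1.internal", "app2.internal"]
    STANDBY_POOL = ["standby1.internal", "standby2.internal"]

    def pick_backend():
        if random.random() < 0.01:          # ~1% of traffic
            return random.choice(STANDBY_POOL)
        return random.choice(PRIMARY_POOL)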
Never, ever force people to do things they gain nothing from. Their discipline will fade, just like it fades when you force them to use a project management tool they get no value from.
One would only actually test the backups about twice a year, just to be damn sure they still result in restorable data. The rest of the year it's enough to keep an automated process reporting whether or not the backups are being made, and to keep an eye on change management so that no change to the known-to-be-working process can break it without an explicit vetting cycle. GitLab apparently wasn't testing, or even monitoring, what was supposed to be an automated process. That's where they got burned.
Process monitoring may be boring as hell, but it's seldom wasted effort, and will prevent massive, compounded headaches from bringing operations to a chaotic halt.
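At its simplest, that automated "are the backups actually being made" report is a cron job along these lines (paths, sizes, and thresholds are all invented placeholders):

    # Sketch of a cron job that complains if the newest backup is missing,
    # older than expected, or suspiciously small. Everything here is a
    # made-up placeholder for the real paths and thresholds.
    import glob
    import os
    import sys
    import time

    BACKUP_GLOB = "/backups/db/*.dump"
    MAX_AGE_HOURS = 26                       # nightly backups, plus some slack
    MIN_SIZE_BYTES = 10 * 1024 * 1024        # a full dump should never be tiny

    def problem():
        dumps = sorted(glob.glob(BACKUP_GLOB), key=os.path.getmtime)
        if not dumps:
            return "no backups found at all"
        newest = dumps[-1]
        age_hours = (time.time() - os.path.getmtime(newest)) / 3600
        if age_hours > MAX_AGE_HOURS:
            return "newest backup is %.0f hours old" % age_hours
        if os.path.getsize(newest) < MIN_SIZE_BYTES:
            return "newest backup is suspiciously small"
        return None

    if __name__ == "__main__":
        failure = problem()
        if failure:
            print("BACKUP CHECK FAILED:", failure)   # hook up paging/email here
            sys.exit(1)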
Nope. Nope. Nope.
You test every backup by automatically restoring from it in a sandbox and verifying its integrity and functionality in the restored state.
Backups are worthless unless verified for their intended use of recovering a functioning system.
And constant "this succeeded" messages don't scale well.
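Concretely, that kind of test can be a nightly job that restores the latest dump into a throwaway database and runs the queries the application actually depends on (a sketch assuming PostgreSQL-style dumps; names and checks are invented):

    # Sketch: restore the most recent dump into a scratch database and verify
    # that it actually works, not merely that the file exists. Assumes the
    # standard PostgreSQL client tools; all names are placeholders.
    import subprocess

    DUMP = "/backups/db/latest.dump"
    SCRATCH_DB = "restore_test"              # recreated from scratch every run

    def run(cmd):
        subprocess.run(cmd, check=True)

    def restore_and_verify():
        run(["dropdb", "--if-exists", SCRATCH_DB])
        run(["createdb", SCRATCH_DB])
        run(["pg_restore", "--dbname", SCRATCH_DB, DUMP])
        # The "functional" part: a query the application relies on must return
        # a plausible answer from the restored data.
        out = subprocess.run(
            ["psql", "-At", "-d", SCRATCH_DB, "-c", "SELECT count(*) FROM users"],
            check=True, capture_output=True, text=True)
        assert int(out.stdout.strip()) > 0, "restored database looks empty"

    if __name__ == "__main__":
        restore_and_verify()
        print("restore test passed")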
Someone rm -rf / ing the server will happen eventually with near 100% certainty in any company and can be mitigated by tested, regular, multiply redundant backups.
Cosmic rays flipping bits will happen with near 100% probability at the scale someone like Amazon works at, and can be mitigated by redundant copies and filesystems with checksum-style checks. Similar with hard drive failure.
Earthquakes will happen in some areas with near certainty over the time periods companies like Amazon presumably hope to be in business, and could be mitigated by having multiple datacenters and well-constructed buildings. Similar for 'normal'-scale volcanoes.
Fires will happen but they can be mitigated (with appropriate buildings and redundancy).
Small meteorite strikes are unlikely but can be mitigated by redundancy.
Solar activity causing an electromagnetic storm - yeah, one can shield one's datacenter in a Faraday cage, but in that situation the whole world is probably in chaos and one's datacenter will be the least of one's concerns (unless shielding becomes standard, in which case you'd better be doing it). Similar applies for nuclear war, super volcanoes, massive meteorite strikes or other global events at the interesting end of the scale.
But yeah, there are going to be things that get missed. The key is having an organization that (1) learns from its mistakes and (2) learns from others' mistakes, and continually keeps its risk modeling and mitigation measures up to date. And note that many of the hazards worth mitigating have the same mitigation, i.e. redundancy (at different scales).
That's a great line. How should I attribute it?
Based on Amazon's decision to improve the tooling such that this category of error would be (hopefully) impossible to reproduce, I would lean more towards that being the case.
He won't make the same mistake because no one makes the same big mistake twice? I wouldn't bank on that alone.
The problem of user error can be mitigated by an appropriate level of OCD.
But OCD can't be trained; you either have it or you don't.
Tests and configuration scripts don't prevent all breakage. But when you have them, you can say, "We missed that, let's add it," or "That failed, but it's a false positive. Let's add this edge case to this test."
If you have no automation, tests or auditing systems around running deployments, you can't do any of this.
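As a toy illustration of that "we missed that, let's add it" loop (everything here is invented, loosely inspired by the capacity-removal flavour of the S3 incident):

    # Toy example: a guard that refuses to remove too much capacity at once,
    # plus the test that grows by one line after each near-miss or false
    # positive. All names and thresholds are invented.
    def is_safe_capacity_change(current, requested):
        # Refuse any single step that removes more than 10% of capacity.
        return requested >= current * 0.9

    def test_capacity_guard():
        assert is_safe_capacity_change(100, 95)       # small reduction: allowed
        assert not is_safe_capacity_change(100, 10)   # fat-fingered input: blocked
        # Edge case added after a postmortem: removing everything must still be
        # blocked when the starting capacity is already tiny.
        assert not is_safe_capacity_change(1, 0)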
By the way - this is not just Amazon's problem now. We know the internet has a single point of failure. So does a lot of IoT.
When will we experience the first Suicide DevOps?
(Specifically https://www.youtube.com/watch?v=6OalIW1yL-k#t=3m but it's worth watching the whole clip (or even the whole movie) if you haven't seen it before. It's from Terry Gilliam's "Brazil".)
It has? I have yet to see the day where I can neither reach my email provider nor Google nor Hackernews. My local provider might screw up occasionally, or some number of websites go unreachable for whatever reason. But I fail to come up with anything short of cutting multiple sea cables that would cause more than 50% of servers to be unreachable to more than 50% of users.
Jeff Bezos once said: "Good intentions never work, you need good mechanisms to make anything happen"
Amazon is taking the right approach here. The fact that a system as complex and important as S3 can be taken down is a failure of the system, not the person who took it down accidentally.
The certification is more for the organization/unit, and the people doing the work do not realize what it is for. Another thing that usually becomes a problem is the rigidity of the certification. Saying you need X, Y and Z documented is easy, but it doesn't work for projects that maybe don't have Y. So people make up documentation and process just to be compliant, and this soon becomes a hindrance to the work.
At this point people either abandon the process or follow it and the work suffers.
(I lied about the "insta" part)
I've had the privilege of either working for myself, the company that acquired mine and let me run the dev, or at Google. From that perspective, and what I understand about ops, the rarity is not having the attitude mentioned in the parent.
Plus, managing humans in a 'rat out' system would be incredibly inefficient. Now you need lots of employees just to listen to the ratting!
I seem to recall an EC2 or S3 outage a few years ago that boiled down to an engineer pushing out a patch that broke an entire region when it was supposed to be a phased deployment.
I could be mis-remembering that but it's important that these lessons be applied across the whole company (at least AWS) so it would be a bigger mark against AWS if this is a result of similar tooling to what caused a previous outage.
(Source: am a self-identified post-mortems connoisseur. :)
Is mere extra training the right solution here?
Maybe they need something like the procedure that's used in missile silos:
Not allowing the shutdown system to function at all without the explicit authorization of at least two people.
That's a lot more than just extra training, and a lot better than a two-key system.
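In software, the same idea is just a gate that refuses to run a destructive command until a second, different person has signed off on that exact request (a rough sketch, all names invented):

    # Sketch of a two-person rule for a destructive operation: it only runs if
    # someone other than the requester has approved this specific request.
    approvals = {}   # request_id -> set of usernames who approved it

    def approve(request_id, user):
        approvals.setdefault(request_id, set()).add(user)

    def shutdown_subsystem(request_id, requested_by):
        other_approvers = approvals.get(request_id, set()) - {requested_by}
        if not other_approvers:
            raise PermissionError(
                "refusing to run: needs sign-off from a second operator")
        print("shutting down; requested by %s, approved by %s"
              % (requested_by, sorted(other_approvers)))

    # approve("decom-42", "alice")
    # shutdown_subsystem("decom-42", "bob")    # runs: two distinct people involved
    # shutdown_subsystem("decom-99", "bob")    # refuses: nobody else signed off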
Probably a bad example. The system was a pain in the ass, so they went and circumvented some of its restrictions.
> Those in the U.S. that had been fitted with the devices, such as ones in the Minuteman Silos, were installed under the close scrutiny of Robert McNamara, JFK's Secretary of Defence. However, The Strategic Air Command greatly resented McNamara's presence and almost as soon as he left, the code to launch the missiles, all 50 of them, was set to 00000000.
> Oh, and in case you actually did forget the code, it was handily written down on a checklist handed out to the soldiers.