
Look at the language used, though. This is saying very loudly, "Look, this isn't the engineer's fault here." It's one thing I miss about Amazon's culture: not blaming people when systems fail.

The follow-up doesn't bullshit with "extra training to make sure no one does this again", it says (effectively) "we're going to make this impossible to happen again, even if someone makes a mistake".




Any time I see "we're going to train everyone better" or "we're going to fire the guy who did it", all I can read is "this will happen again". You can't actually solve the problem of user error with training, and it's good to see Amazon not playing that game.


What bothered me about running TrueTech is that customers would sometimes demand repercussions against employees for making mistakes.

Enter Frans Plugge. Whenever a customer would get into that mode we'd fire Frans. This was easy, simply because he didn't exist in the first place (his name was pulled from a skit by two Dutch comedians, bonus points if you know who and which skit).

This usually then caused the customer to recant on how he/she never meant for anybody to get fired...

It was a funny solution and we got away with it for years, for one because it was pretty rare to get customers that mad to begin with and for another because Frans never wrote any blog posts about it ;)

But I was always waiting for that call from the labor board asking why we fired someone for whom there was no record of employment.


Is it unreasonable for me to think that company owners should have the spine to say, "We take the decision to fire someone very seriously. We'll take your comments under consideration, but we retain sole discretion over such decisions"?

It irks me that businesses fire people because of pressure from clients or social media. But having never been the boss, I may be missing something.


One reason to like a facet of Japanese management culture: if a customer wants someone to be raked over the coals, you offer up management, not employees.

Internal repercussions notwithstanding, externally the company is a united front. It cannot excuse mistakes as luck, accident, or happenstance, because the world includes luck, accidents, and happenstance, so any user-visible error is ipso facto a failure of management.


Apparently this is (or was) a job in Japan: companies would hire what amounts to an actor to get screamed at by the angry customer and pretend to get fired on the spot. Rinse, repeat whenever such appeasement is required.


I know one person who does this for real estate developers. He gets involved in contentious projects early on, goes to community meetings, offers testimony before the city council, etc. When construction gets going and people inevitably get pissed about some aspect of the project, he gets publicly fired to deflect the blame while the project moves on. Have seen it happen on three different projects in two cities now and, somehow, nobody catches on.


I don't know how to describe this in a single word or phrase appropriately, but I think it is a "genius problem" to exist. Not a genius solution. I feel that the problem itself is impressive and rich in layers of human nature, local culture etc - but once you have such a problem any average person could come up with a similar solution, because it is obvious.

It's still mind blowing and very amusing that this is a thing in our world!


Do you have a citation for that? I am curious; it's something I've never heard of and goes against my intuitions/experience regarding what traditionally managed Japanese companies would do. (Entirely possible it has happened! Hence the cite request.)


Imagine if the customer saw the same actor getting fired in different companies! Is the customer going to catch on? More likely, they will think "Yeah, no wonder there was a problem. This same incompetent dude wormed his way into this company too" :-)


There's a movie sketch in here somewhere. A guy has the worst day of his life, every single thing goes wrong, and at every single company the same person is "responsible" for the issue.


Awesome! The first movie ever made based on an anonymous comment on HN. Wait. So I can't get a cut in the profits then?


Well, well, well... *scribbles note into Trello*: "Today I got fired, it was all my fault, our poor customers suffered." Blog posts as a service.


Inspired by Daniel Pennac's novels? https://fr.wikipedia.org/wiki/Saga_Malauss%C3%A8ne


The BBC staff have a term for the way the corporation almost does this: "deputy heads will roll"


I always thought that was more a cynical take on the fact that the top guy was protected, rather than underlings.


Yes, exactly. Hence almost.


Historically, some cultures practiced mock firing as a way to appease an angry customer. This was back in the day when most business transactions occurred face to face, so the owner would demand that the employee pack their belongings and leave the premises in full view of the customer. Of course this was all for show, but this kind of public humiliation seemed to satisfy even the most difficult customers.


Even when the customer knew it was for show, I see it as a way of saying, "yes, we acknowledge that we screwed up, and we make a public, highly visible note of it that will be recorded in the annals of people's gossip in this area".


On the contrary, I think it's quite reasonable.


Did you ever insist that Frans must be fired and refuse to accept the "we didn't mean it?"

Cause that sounds pretty great.


Well, by then he was fired... :)


and mercilessly beaten?



That's one deserved upvote :)


To what lengths did you keep that up? Did you just tell the client informally, state it formally in a meeting, or did you actually put it in writing or even fake some "firing" paperwork?


This is true in some cases, but not when mitigations aren't practiced properly - it's not the fat-fingered user who should be fired or retrained, but the designer or maintainer of the system that allowed the mistake to become a serious issue.

Look at the recent GitLab incident - one guy messed up and nuked a server. Okay, that happens sometimes, go to backups. Uh oh, all the backups are broken. Minor momentary problem just turned into a major multi-day one.

That's a problem, and one which could be prevented with training (or, arguably, firing and hiring). Maintaining your backups properly should be someone's duty; designing and testing systems to minimize the impact of user error should be too.


Fair enough. I guess what I meant was specifically using training or punishment to combat "momentary lapse" issues.

If someone doesn't test their backups, you train them to test backups. If someone lies about testing the backups, maybe you fire them. But if someone trips and shatters the only backup disk, you don't yell at them - you create backups that an instant of clumsiness can't ruin.

I did overstate; training is perfectly reasonable, but I often see it cited exactly when it shouldn't be, as a solution to errors like typos or forgetfulness.


Training about testing backups is still a bad idea: why make someone do a job that is purely verification? Those jobs eventually stop getting done, and it's hard to keep people doing them.

Instead, you make a machine verify the backups simply by using the backups all the time. For example, at work I feed part of our data pipeline with backups: Those processes have no access to the live data. If the backups break, those processes would provide bad information to the users, and people would come complaining in a matter of minutes.

Just like when you have a set of backup servers, you don't leave them collecting dust, or tell someone to go look at them every once in a while: you just route 1% of the traffic through them. They are still extra capacity, you can still do all kinds of things to them without too much trouble, but you know they are always working.

Never, ever, force people to do things they gain nothing from. Their discipline will fade, just like it fades when you force them onto a project management tool they get no value from.
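To make that concrete, here's a minimal Python sketch of the idea, assuming a hypothetical nightly job that restores the latest dump into a staging database the reporting pipeline actually reads from (the path, DSN, and pg_restore usage are illustrative, not anyone's real setup):

    import subprocess
    import sys

    LATEST_BACKUP = "/backups/db-latest.dump"            # hypothetical path
    STAGING_DSN = "postgresql://staging-db/reporting"    # hypothetical staging DB

    def restore_into_staging():
        """Restore last night's backup into the staging database that the
        reporting pipeline reads from. If the backup is broken, tomorrow's
        reports are visibly wrong and people complain within minutes."""
        result = subprocess.run(
            ["pg_restore", "--clean", "--dbname", STAGING_DSN, LATEST_BACKUP],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Fail loudly so the pipeline run aborts instead of silently
            # serving data from a stale or missing restore.
            sys.exit("backup restore failed: " + result.stderr)

    if __name__ == "__main__":
        restore_into_staging()

The specific commands don't matter; the point is that a broken backup breaks something people look at every day.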


No. You don't make a daily task of testing backups. That would be wrong for precisely the reasons you cite. It's a waste of effort and time, and it ignores the point of testing them: ensuring that the procedure still works.

One would only actually test the backups about twice a year, just to be damn sure they are still resulting in restorable data. The rest of the year it's only worth keeping an automated process reporting whether or not the backups are being made, and having people keep an eye on change management to be sure no changes are made to the known-to-be-working process without an explicit vetting cycle. GitLab apparently wasn't testing or monitoring what was supposed to be an automated process. That's where they got burned.

Process monitoring may be boring as hell, but it's seldom wasted effort, and will prevent massive, compounded headaches from bringing operations to a chaotic halt.


> One would only actually test the backups about twice a year just to be damn sure they are still resulting in restorable data.

Nope. Nope. Nope.

You test every backup by automatically restoring from it in a sandbox and verifying its integrity and functionality in the restored state.

Backups are worthless unless verified for their intended use of recovering a functioning system.
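As an illustration, here is a tiny Python smoke test of the kind that could run against the sandbox after each automated restore; the SQLite file path and table names are made up for the example:

    import sqlite3
    import sys

    RESTORED_DB = "/sandbox/restored.db"   # hypothetical restore target

    def verify_restored_copy():
        """Sanity-check a backup after restoring it into a sandbox: the file
        must open, the core tables must exist, and they must contain rows."""
        conn = sqlite3.connect(RESTORED_DB)
        try:
            for table in ("users", "orders"):   # illustrative table names
                (count,) = conn.execute(
                    "SELECT COUNT(*) FROM {}".format(table)).fetchone()
                if count == 0:
                    sys.exit("restored backup looks empty: {} has 0 rows".format(table))
        finally:
            conn.close()

    if __name__ == "__main__":
        verify_restored_copy()

A real check would also exercise the application against the restored data, but even a crude row-count assertion catches the "backup job has been silently producing empty files" failure mode.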


You're imagining an automated test system. But GitLab's problem was that the automated system was not communicating failures properly.

And constant "this succeeded" messages don't scale well.


I agree. I have the same policy when setting up servers: don't have a "primary" and a "backup" server, make both servers production servers and have the code that uses them alternate between them, pick a random one, whatever. (I don't always get to implement this policy, of course.)
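A trivial sketch of that "no primary" idea, with made-up hostnames:

    import random

    SERVERS = ["db-a.internal", "db-b.internal"]   # hypothetical hostnames

    def pick_server():
        """Both servers carry production traffic, so neither one can quietly
        rot; a random pick keeps the load roughly even without any notion of
        'primary' versus 'backup'."""
        return random.choice(SERVERS)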


This makes some sense, but I don't think it negates testing backups? For duplicated live data, yeah, you can just use both. But most businesses have at least some things backed up to cold storage, and that still needs to be popped in a tape deck (or whatever's relevant) and verified.


I don't buy the "If we just plan ENOUGH, disasters will never occur" argument. The universe is just too darn interesting for us to be able to plan enough to prevent it from being interesting.


It is all about hazard/risk modeling and mitigation. E.g.

Someone rm -rf /-ing the server will happen eventually, with near 100% certainty, in any company, and can be mitigated by tested, regular, multiply redundant backups.

Cosmic rays flipping bits will happen with near 100% probability at the scale someone like Amazon works at, and can be mitigated by redundant copies and filesystems with checksum-style checks. Similar with hard drive failure.

Earthquakes will happen in some areas with near certainty over the time periods companies like Amazon presumably hope to be in business and could be mitigated by having multiple datacenters and well constructed buildings. Similar for 'normal' scale volcanoes.

Fires will happen but they can be mitigated (with appropriate buildings and redundancy).

Small meteorite strikes are unlikely but can be mitigated by redundancy.

Solar activity causing an electromagnetic storm - yeah, one can shield one's datacenter in a Faraday cage, but in this situation the whole world is probably in chaos and one's datacenter will be the least of one's concerns (unless shielding becomes standard, in which case you'd better be doing it). Similar applies for nuclear war, supervolcanoes, massive meteorite strikes, or other global events at the interesting end of the scale.

But yeah, there are going to be things that get missed. The key is having an organization that (1) learns from its mistakes and (2) learns from others' mistakes, and that continually keeps its risk modeling and mitigation measures up to date. And note that many of the hazards worth mitigating have the same mitigation, i.e. redundancy (at different scales).


"[T]he universe is just too darn interesting for us to be able to plan enough to prevent it from being interesting." -- Beat

That's a great line. How should I attribute it?


If you're actually quoting me somewhere, either use "Dave Stagner", or "Some asshole on the internet". Same diff.


Giving people training in response to things like this always seemed a little strange to me - that particular person just got the most effective training the world has ever seen. If you look at it that way, you could say that everything Amazon spent responding to this was actually a training expense for this particular person and team. After you've already done that, it seems silly to make them sit through some online quiz or PowerPoint by a supposed guru and think you're accomplishing anything.


Yeah indeed. You know who the one person at Amazon is that I'd expect to never fat finger a sensitive command ever ever again? The guy who managed to fat finger S3 on Tuesday. Firing him over this mistake is worse than pointless, it offers absolution to every other developer and system that helped cause this event.


I'm guessing your comment was inspired by this: http://www.squawkpoint.com/2014/01/criticism/


Not directly, but maybe that was running around in the back of my mind while I responded to it.


I wouldn't merely call this a fat finger or typo - it's quite possible that the tool itself was so unwieldy that mistakes were impossible to avoid, given the complexity of its inputs.

Based on Amazon's decision to improve the tooling such that this category of error would be (hopefully) impossible to reproduce, I would lean more towards that being the case.


I think the value of making these mistakes is in learning from them and then making sure they can't happen again. Leaving this process in place and just making this guy run the command forever because he screwed it up once would be a much less effective solution than fixing the tooling so it's impossible to do this in the first place. Telling this guy "don't do it again" also offers absolution to everyone else on the team. In a healthy culture, only "we" can fail.


That old chestnut. Is it true?


Is it not true for you? I know that I'm personally good at avoiding the same mistake. I'm also extraordinarily good at avoiding repeating catastrophic mistakes. I generally change my processes in the same way that Amazon is changing their processes to avoid this mistake.


I am not talking about what Amazon is doing, but about the notion that the individual won't make the same mistake again, which is what the grandparent is getting at.

He won't make the same mistake because no one makes the same big mistake twice? I wouldn't bank on that alone.


Years ago I read a story about a fat-fingered ops person getting called into the CEO's office after an outage. "I thought you were calling me in to fire me." "I can't afford to fire you, today I spent a million dollars training you."


> You can't actually solve the problem of user error with training, and it's good to see Amazon not playing that game.

The problem of user error can be mitigated by an appropriate level of OCD.

But OCD can't be trained, you either have it or you don't.


Which is really the point of automation and configuration management. When a manager asks you, "How are you going to prevent this in the future?" you can say, "We added a check so n must be less than x% of the total number of cluster members," "We added additional unit tests for the missing area of coverage," or "We added new integration tests that will pick up on this."

Tests and configuration scripts don't prevent all breakage. But when you have them, you can say, "We missed that, let's add it," or "That failed, but it's a false positive. Let's add this edge case to this test."

If you have no automation, tests or auditing systems around running deployments, you can't do any of this.
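A hedged sketch of what such a guardrail check might look like in Python; the function name and threshold numbers are invented for illustration, not anything from Amazon's post-mortem:

    def plan_capacity_removal(requested, cluster, max_fraction=0.10, min_remaining=3):
        """Refuse a removal request that exceeds a fixed fraction of the
        cluster or that would leave fewer than min_remaining members.
        The thresholds here are illustrative, not real values."""
        limit = int(len(cluster) * max_fraction)
        if len(requested) > limit:
            raise ValueError(
                "refusing to remove {} hosts; limit is {} ({:.0%} of {})".format(
                    len(requested), limit, max_fraction, len(cluster)))
        if len(cluster) - len(requested) < min_remaining:
            raise ValueError(
                "removal would leave {} hosts, below the floor of {}".format(
                    len(cluster) - len(requested), min_remaining))
        return requested

The value of a check like this is exactly what the parent describes: when it misses a case, the answer is "add another check," not "be more careful next time."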


I agree testing and automation are good. I think they need to go beyond this to formal verification, for something on this scale and reliability. NASA doesn't make these sorts of mistakes.

By the way - this is not just Amazon's problem now. We know the internet has a single point of failure. So does a lot of IoT.

When will we experience the first Suicide DevOps?


> NASA doesn't make these sorts of mistakes

https://www.wired.com/2010/11/1110mars-climate-observer-repo...


https://www.youtube.com/watch?v=6OalIW1yL-k

(Specifically https://www.youtube.com/watch?v=6OalIW1yL-k#t=3m but it's worth watching the whole clip (or even the whole movie) if you haven't seen it before. It's from Terry Gilliam's "Brazil".)


Almost twenty years ago, though.


Well, they've had plenty of opportunities to learn from their mistakes; Amazon hasn't had this long.


>We know the internet has a single point of failure.

Does it? I have yet to see the day where I can neither reach my email provider nor Google nor Hacker News. My local provider might screw up occasionally, or some number of websites go unreachable for whatever reason. But I fail to come up with anything, short of cutting multiple sea cables, that would cause more than 50% of servers to be unreachable to more than 50% of users.



Amazon do formally verify AWS (they use TLA+), which is probably why this failure is a human error. Of course, you could expand the formal analysis of the system to include all possible operator interactions, but you'll need to draw the line at some point. NASA certainly makes human errors that result in catastrophic failures. The Challenger disaster was also a result of human error to a large degree[1]; to quote Wikipedia: "The Rogers Commission found NASA's organizational culture and decision-making processes had been key contributing factors to the accident, with the agency violating its own safety rules."

[1]: https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disas...


I presume this is well entrenched in the Amazon culture.

Jeff Bezos once said: "Good intentions never work, you need good mechanisms to make anything happen"


That's exactly it. Amazon doesn't like sharing all that much, but I wish they'd publicly release that video.


This is the major basis of the CMM Levels [1]. At higher levels of maturity and necessity, systems and processes are designed to increasingly prevent errors from reaching a production environment.

Amazon is taking the right approach here. The fact that a system as complex and important as S3 can be taken down is a failure of the system, not the person who took it down accidentally.

1. https://en.wikipedia.org/wiki/Capability_Maturity_Model#Leve...


A lot of the IT vendors I have worked with were CMM/CMMi level 5. But the crappiness of their development, process, and deployment work makes me wonder whether all their effort goes into attaining those certifications as opposed to doing something better.


As someone who worked for an IT vendor with certification and as someone who was part of the certification team at another place, I can assure you that you're right.

The certification is more for the organization/unit, and the people doing the work don't realize what it's for. Another thing that usually becomes a problem is the rigidity of the certification. Saying you need X, Y and Z documented is easy, but it doesn't work for projects that maybe don't have Y. So people make up documentation and process just to be compliant, and this soon becomes a hindrance to the work. At this point people either abandon the process, or follow it and the work suffers.


Thank you for adding this comment. I am glad there are more people out there who aren't afraid to be honest about some of the nonsensical 'follow the process no matter what' stuff I have experienced over the years.


CMM level 5 ==> You have a well-documented, repeatable, and still horrible process that declares all errors statistically uncommon by "augmenting" the root cause with random factors. Insta-certification.

(I lied about the "insta" part)


How laudable is this, really?

I've had the privilege of either working for myself, the company that acquired mine and let me run the dev, or at Google. From that perspective, and what I understand about ops, the rarity is not having the attitude mentioned in the parent.


Are you suggesting we take their behavior for granted? Positive behavior needs to be praised – it's part of how society influences its members.


No


This is good. And for the software engineers, great. But I've heard from people doing the grunt work at Amazon -- warehouse staff -- that Amazon incentivises employees to rat each other out for mishandling, being late, etc., fostering intense competition.


I spent time in the fulfillment centers, writing software for them. I definitely didn't see that sort of thing. There's no need - the software tracked everything they did. Low performers would be found and retained or 'promoted to customer' without the need for anyone to 'rat out'.

Plus, managing humans in a 'rat out' system would be incredibly inefficient. Now you need lots of employees just to listen to the ratting!


Yup. And for that mercy the engineer is going to be that much more careful, and loyal. I would be, that's for sure.


Agreed, especially regarding the culture, but isn't this pretty much the same explanation they gave a few years ago when something similar happened?

I seem to recall an EC2 or S3 outage a few years ago that boiled down to an engineer pushing out a patch that broke an entire region when it was supposed to be a phased deployment.

I could be mis-remembering that but it's important that these lessons be applied across the whole company (at least AWS) so it would be a bigger mark against AWS if this is a result of similar tooling to what caused a previous outage.


Pretty sure that one was a Microsoft Azure outage.

(Source: am a self-identified post-mortems connoisseur. :)


Not a bad plan. If you don't make enough mistakes on your own, ya gotta learn from the mistakes of others as a preventative.


Do you by chance keep a public log of your postmortem collection :)?


I don't, but danluu does! https://github.com/danluu/post-mortems


Yeah an EC2 engineer switched over traffic to a backup network connection that had significantly less bandwidth, triggering cascading failures.


Yeah, it makes sense to make changes to the system rather than do nothing and just blame someone. Errors happen; they're something you can't avoid.


That's just a public statement. How do you know whether the individual was reprimanded?


Because in my five years as an Amazon dev, that's exactly the attitude I witnessed. People are trying their best, so firing them won't help.


I believe Jeff once said something along the lines of "why would I fire an employee that made an honest mistake? I just spent a bunch of money teaching him a lesson"


lol what part of "these things should be done proactively and tested over and over in CI" does not make sense to management?


Putting the capability to take down S3 into the hands of a single engineer seems a bit much.

Is mere extra training the right solution here?

Maybe they need something like the procedure that's used in missile silos:

Not allowing the shutdown system to function at all without the explicit authorization of at least two people.


The linked article also says the tools they use were changed to limit the amount of resources that can be taken down at a single time and the speed at which they can be taken down, and to enforce a hard floor on the capacity that must remain running.

That's a lot more than just extra training, and a lot better than a two-key system.
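For a rough sense of how "limit the amount and the speed" might look in tooling, here's a Python sketch; the batch size, pause, and floor are invented numbers, and the decommission call is a stand-in:

    import time

    def remove_capacity(hosts_to_remove, cluster, batch_size=2,
                        pause_seconds=300, min_remaining=10):
        """Take capacity out slowly: small batches, a pause between batches,
        and a hard floor re-checked before every batch."""
        for i in range(0, len(hosts_to_remove), batch_size):
            batch = hosts_to_remove[i:i + batch_size]
            if len(cluster) - len(batch) < min_remaining:
                raise RuntimeError("hard floor reached; refusing further removals")
            for host in batch:
                cluster.remove(host)   # stand-in for the real decommission call
            time.sleep(pause_seconds)  # give health checks time to notice problems

The pacing matters as much as the floor: a slow removal gives monitoring a chance to catch a bad request before most of the fleet is gone.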


> Maybe they need something like the procedure that's used in missile silos...

Probably a bad example. The system was a pain in the ass, so they went and circumvented some of its restrictions.

http://gizmodo.com/for-20-years-the-nuclear-launch-code-at-u...

> Those in the U.S. that had been fitted with the devices, such as ones in the Minuteman Silos, were installed under the close scrutiny of Robert McNamara, JFK's Secretary of Defence. However, The Strategic Air Command greatly resented McNamara's presence and almost as soon as he left, the code to launch the missiles, all 50 of them, was set to 00000000.

> Oh, and in case you actually did forget the code, it was handily written down on a checklist handed out to the soldiers.


I think you have it backwards. The post does not say they will simply be training the problem away. They are putting safeguards into their tooling to prevent the case of a fat finger.


The article leaves little doubt that they didn't know such an event would be so hard to recover from. They knew it wouldn't be easy, but they were surprised by how bad it was.



