Summary of the Amazon S3 Service Disruption (amazon.com)



> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

It remains amazing to me that even with all the layers of automation, the root cause of most serious deployment problems remains some variant of a fat-fingered user.


Look at the language used though. This is saying very loudly "Look, this isn't the engineer's fault here". It's one thing I miss about Amazon's culture: not blaming people when systems fail.

The follow-up doesn't bullshit with "extra training to make sure no one does this again", it says (effectively) "we're going to make this impossible to happen again, even if someone makes a mistake".


Any time I see "we're going to train everyone better" or "we're going to fire the guy who did it", all I can read is "this will happen again". You can't actually solve the problem of user error with training, and it's good to see Amazon not playing that game.


What bothered me about running TrueTech is that customers would sometimes demand repercussions against employees for making mistakes.

Enter Frans Plugge. Whenever a customer would get into that mode we'd fire Frans. This was easy, simply because he didn't exist in the first place (his name was pulled from a skit by two Dutch comedians, bonus points if you know who and which skit).

This usually then caused the customer to recant on how he/she never meant for anybody to get fired...

It was a funny solution and we got away with it for years, for one because it was pretty rare to get customers that mad to begin with and for another because Frans never wrote any blog posts about it ;)

But I was always waiting for that call from the labor board to ask why we fired someone for who there was no record of employment.


Perhaps it's unreasonable of me to think that company owners should have the spine to say, "We take the decision to fire someone very seriously. We'll take your comments under consideration, but we retain sole discretion over such decisions."

It irks me that businesses fire people because of pressure from clients or social media. But having never been the boss, I may be missing something.


One facet of Japanese management culture to like: if a customer wants to rake someone over the coals, you offer up management, not employees.

Internal repercussions notwithstanding, externally the company is a united front. It cannot excuse mistakes by appealing to luck, accident, or happenstance, because the world includes luck, accidents, and happenstance, so any user-visible error is ipso facto a failure of management.


Apparently this is (or was) a job in Japan: companies would hire what amounts to an actor to get screamed at by the angry customer and pretend to get fired on the spot. Rinse and repeat whenever such appeasement is required.


I know one person who does this for real estate developers. He gets involved in contentious projects early on, goes to community meetings, offers testimony before the city council, etc. When construction gets going and people inevitably get pissed about some aspect of the project, he gets publicly fired to deflect the blame while the project moves on. Have seen it happen on three different projects in two cities now and, somehow, nobody catches on.


I don't know how to describe this in a single word or phrase, but I think of it as a "genius problem" rather than a genius solution. The problem itself is impressive and rich in layers of human nature, local culture, etc., but once you have such a problem, any average person could come up with a similar solution, because it is obvious.

It's still mind blowing and very amusing that this is a thing in our world!


Do you have a citation for that? I am curious; it's something I've never heard of and goes against my intuitions/experience regarding what traditionally managed Japanese companies would do. (Entirely possible it has happened! Hence the cite request.)


Imagine if the customer saw the same actor getting fired in different companies! Is the customer going to catch on? More likely, they will think "Yeah, no wonder there was a problem. This same incompetent dude wormed his way into this company too" :-)


There's a movie sketch in here somewhere. A guy has the worst day of his life, every single thing goes wrong, and at every single company the same person is "responsible" for the issue.


Awesome! The first movie ever made based on an anonymous comment on HN. Wait. So I can't get a cut in the profits then?


Well, well, well...scribbles note into Trello: "Today I got fired, it was all my fault, our poor customers suffered" Blog posts as a service.


Inspired by Daniel Pennac's novels? https://fr.wikipedia.org/wiki/Saga_Malauss%C3%A8ne


The BBC staff have a term for the way the corporation almost does this: "deputy heads will roll"


I always thought that was more a cynical take on the fact that the top guy was protected, rather than underlings.


Yes, exactly. Hence almost.


Historically some cultures practiced mock firing as a way to appease an angry customer. This was back in the day when most business transactions occurred face to face, so the owner would demand that the employee pack their belongings and leave the premises in full view of the customer. Of course this is all for show, but this kind of public humiliation seems to satisfy even the most difficult customers.


Even when the customer knew it was for show, I see it as a way of saying, "yes, we acknowledge that we screwed up, and we make a public, highly visible note of it that will be recorded in the annals of people's gossip in this area".


On the contrary, I think it's quite reasonable.


Did any customer ever insist that Frans must be fired and refuse to accept the "we didn't mean it"?

Cause that sounds pretty great.


Well, by then he was fired... :)


and mercilessly beaten?



That's one deserved upvote :)


To what lengths did you keep that up? Did you just tell the client informally, formally in a meeting, or did you actually put it in writing or even fake some "firing" paperwork?


This is true in some cases, but not when mitigations aren't practiced properly - it's not the fat fingered user who should be fired or retrained, but the designer or maintainer of the system that allowed it to become a serious issue.

Look at the recent GitLab incident - one guy messed up and nuked a server. Okay, that happens sometimes, go to backups. Uh oh, all the backups are broken. Minor momentary problem just turned into a major multi-day one.

That's a problem, and one which could be prevented with training (or, arguably, firing and hiring). Maintaining your backups properly should be someone's duty; designing and testing systems to minimize the impact of user error should be too.


Fair enough. I guess what I meant was specifically using training or punishment to combat "momentary lapse" issues.

If someone doesn't test their backups, you train them to test backups. If someone lies about testing the backups, maybe you fire them. But if someone trips and shatters the only backup disk, you don't yell at them - you create backups that an instant of clumsiness can't ruin.

I did overstate; training is perfectly reasonable, but I often see it cited exactly when it shouldn't be, as a solution to errors like typos or forgetfulness.


Training about testing backups is still a bad idea: why make someone do a job that is purely verification? Those jobs eventually stop getting done, and it's hard to keep people doing them.

Instead, you make a machine verify the backups simply by using the backups all the time. For example, at work I feed part of our data pipeline with backups: Those processes have no access to the live data. If the backups break, those processes would provide bad information to the users, and people would come complaining in a matter of minutes.

Just like when you have a set of backup servers, you don't leave them collecting dust, or tell someone to go look at them every once in a while: you just route 1% of the traffic through them. They are still extra capacity, you can still do all kinds of things to them without too much trouble, but you know they are always working.

Never, ever force people to do things they don't gain anything from. Their discipline will fade, just like it fades when you force them to use a project management tool they get no value from.


No. You don't make a daily task of testing backups. That would be wrong for precisely the reasons you cite. It's a waste of effort and time, and ignores what the point of testing them is for: ensuring that the procedure still works.

One would only actually test the backups about twice a year just to be damn sure they are still resulting in restorable data. The rest of the year it's only worth keeping an automated process reporting whether or not the things are being made, and having people keep an eye on change management to be sure no changes that could break the known-to-be-working process are made without an explicit vetting cycle. GitLab apparently wasn't testing or monitoring what was supposed to be an automated process. That's where they got burned.

Process monitoring may be boring as hell, but it's seldom wasted effort, and will prevent massive, compounded headaches from bringing operations to a chaotic halt.


> One would only actually test the backups about twice a year just to be damn sure they are still resulting in restorable data.

Nope. Nope. Nope.

You test every backup by automatically restoring from it in a sandbox and verifying its integrity and functionality in the restored state.

Backups are worthless unless verified for their intended use of recovering a functioning system.
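
A rough sketch of what that can look like as a nightly job; this is not Amazon's or GitLab's actual process, and the paths, database name, and sanity query are all assumptions (PostgreSQL-style dumps assumed):

  #!/bin/sh
  # verify_backup.sh - restore the newest dump into a scratch database and
  # run a sanity query against it, failing loudly if anything looks off.
  set -eu
  latest=$(ls -t /backups/*.dump | head -n 1)   # hypothetical backup directory
  dropdb --if-exists restore_test
  createdb restore_test
  pg_restore --dbname=restore_test "$latest"
  # Functional check: the restored data must answer a real query, not just load.
  rows=$(psql -tAc "SELECT count(*) FROM users" restore_test)
  [ "$rows" -gt 0 ] || { echo "Backup $latest restored but looks empty" >&2; exit 1; }
  dropdb restore_test
  echo "Backup $latest verified: $rows rows in users"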


You're imagining an automated test system. But GitLab's problem was that the automated system was not communicating failures properly.

And constant "this succeeded" messages don't scale well.


I agree. I have the same policy when setting up servers: don't have a "primary" and a "backup" server, make both servers production servers and have the code that uses them alternate between them, pick a random one, whatever. (I don't always get to implement this policy, of course.)
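
A trivial sketch of the "pick a random one" idea for a pair of database hosts; the host names and health query are made up, and shuf is assumed to be available:

  # Both hosts stay in rotation, so a dead "backup" gets noticed immediately.
  hosts="db-a.internal db-b.internal"           # hypothetical host names
  host=$(printf '%s\n' $hosts | shuf -n 1)
  psql -h "$host" -c 'SELECT 1' || echo "$host is unhealthy" >&2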


This makes some sense, but I don't think it negates testing backups? For duplicate live data, yeah, you can't just use both. But most businesses have at least some things backed up to cold storage, and that still needs to be popped in a tape deck (or whatever's relevant) and verified.


I don't buy the "If we just plan ENOUGH, disasters will never occur" argument. The universe is just too darn interesting for us to be able to plan enough to prevent it from being interesting.


It is all about hazard/risk modeling and mitigation. E.g.

Someone rm -rf / ing the server will happen eventually with near 100% certainty in any company and can be mitigated by tested, regular, multiply redundant backups.

Cosmic rays flipping bits will happen with near 100% probability at the scale someone like Amazon works at, and can be mitigated by redundant copies and filesystems with checksum-style checks. Similar for hard drive failure.

Earthquakes will happen in some areas with near certainty over the time periods companies like Amazon presumably hope to be in business and could be mitigated by having multiple datacenters and well constructed buildings. Similar for 'normal' scale volcanoes.

Fires will happen but they can be mitigated (with appropriate buildings and redundancy).

Small meteorite strikes are unlikely but can be mitigated by redundancy.

Solar activity causing an electromagnetic storm - yeah, one can shield one's datacenter in a Faraday cage, but in this situation the whole world is probably in chaos and one's datacenter will be the least of one's concerns (unless shielding becomes standard, in which case you'd better be doing it). Similar applies for nuclear war, super volcanoes, massive meteorite strikes, or other global events at the interesting end of the scale.

But yeah, there are going to be things that get missed. The key is having an organization that (1) learns from its mistakes and (2) learns from others' mistakes, and continually keeps its risk modeling and mitigation measures up to date. And note that many of the hazards that are worth mitigating have the same mitigation, i.e. redundancy (at different scales).


"[T]he universe is just too darn interesting for us to be able to plan enough to prevent it from being interesting." -- Beat

That's a great line. How should I attribute it?


If you're actually quoting me somewhere, either use "Dave Stagner", or "Some asshole on the internet". Same diff.


Giving people training in response to things like this always seemed a little strange to me - that particular person just got the most effective training the world has ever seen. If you look at it that way, you could say that everything Amazon spent responding to this was actually a training expense for this particular person and team. After you've already done that, it seems silly to make them sit through some online quiz or PowerPoint by a supposed guru and think you're accomplishing anything.


Yeah indeed. You know who the one person at Amazon is that I'd expect to never fat finger a sensitive command ever ever again? The guy who managed to fat finger S3 on Tuesday. Firing him over this mistake is worse than pointless, it offers absolution to every other developer and system that helped cause this event.


I'm guessing your comment was inspired by this: http://www.squawkpoint.com/2014/01/criticism/


Not directly, but maybe that was running around in the back of my mind while I responded to it.


I wouldn't merely call this a fat finger or typo - it's quite possible that the tool itself was so error-prone that mistakes were impossible to avoid, given the complexity of its inputs.

Based on Amazon's decision to improve the tooling such that this category of error would be (hopefully) impossible to reproduce, I would lean more towards that being the case.


I think the value of making these mistakes is in learning from them and then making sure they can't happen again. Leaving this process in place and just making this guy run the command forever because he screwed it up once would be a much less effective solution than fixing the tooling so it's impossible to do this in the first place. Telling this guy "don't do it again" also offers absolution to everyone else on the team. In a healthy culture, only "we" can fail.


That old chestnut. Is it true?


Is it not true for you? I know that I'm personally good at avoiding the same mistake. I'm also extraordinarily good at avoiding repeating catastrophic mistakes. I generally change my processes in the same way that Amazon is changing their processes to avoid this mistake.


I am not talking about what Amazon is doing, but the concept that the individual won't make the same mistake again, which is what the grandparent is getting at.

He won't make the same mistake because no one makes the same big mistake twice? I wouldn't bank on that alone.


Years ago I read a story about a fat-fingered ops person getting called into the CEO's office after an outage. "I thought you were calling me in to fire me." "I can't afford to fire you, today I spent a million dollars training you."


> You can't actually solve the problem of user error with training, and it's good to see Amazon not playing that game.

The problem of user error can be mitigated by an appropriate level of OCD.

But OCD can't be trained, you either have it or you don't.


Which is really the point of automation and configuration management. When a manager asks you, "How are you going to prevent this in the future?", you can say, "We added a check so n must be less than x% of the total number of cluster members," or "We added additional unit tests for the missing area of coverage," or "We added new integration tests that will pick up on this."

Tests and configuration scripts don't prevent all breakage. But when you have them, you can say, "We missed that, let's add it," or "That failed, but it's a false positive. Let's add this edge case to this test."

If you have no automation, tests or auditing systems around running deployments, you can't do any of this.
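
For illustration, a minimal sketch of that first kind of check as a wrapper script; the inventory file, threshold, and removal command are all hypothetical:

  #!/bin/sh
  # remove_hosts.sh HOST... - refuse to act on too large a slice of the fleet.
  set -eu
  MAX_PCT=10                                    # assumed safety threshold
  total=$(wc -l < /etc/cluster/hosts.txt)       # hypothetical fleet inventory
  requested=$#
  if [ "$requested" -gt $(( total * MAX_PCT / 100 )) ]; then
      echo "Refusing: removing $requested of $total hosts exceeds ${MAX_PCT}%" >&2
      exit 1
  fi
  for h in "$@"; do
      remove-from-rotation "$h"                 # hypothetical removal command
  done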


I agree testing and automation are good. I think they need to go beyond this to formal verification, for something on this scale and reliability. NASA doesn't make these sorts of mistakes.

By the way - this is not just Amazon's problem now. We know the internet has a single point of failure. So does a lot of IoT.

When will we experience the first Suicide DevOps?


> NASA doesn't make these sorts of mistakes

https://www.wired.com/2010/11/1110mars-climate-observer-repo...


https://www.youtube.com/watch?v=6OalIW1yL-k

(Specifically https://www.youtube.com/watch?v=6OalIW1yL-k#t=3m but it's worth watching the whole clip (or even the whole movie) if you haven't seen it before. It's from Terry Gilliam's "Brazil".)


Almost twenty years ago, though.


Well, they've had plenty of opportunities to learn from their mistakes; Amazon hasn't had this long.


> We know the internet has a single point of failure.

Does it? I have yet to see the day when I can reach neither my email provider nor Google nor Hacker News. My local provider might screw up occasionally, or some number of websites go unreachable for whatever reason. But I fail to come up with anything short of cutting multiple sea cables that would cause more than 50% of servers to be unreachable to more than 50% of users.



Amazon do formally verify AWS (they use TLA+), which is probably why this failure is a human error. Of course, you could expand the formal analysis of the system to include all possible operator interactions, but you'll need to draw the line at some point. NASA certainly makes human errors that result in catastrophic failures. The Challenger disaster was also a result of human error to a large degree[1]; to quote Wikipedia: "The Rogers Commission found NASA's organizational culture and decision-making processes had been key contributing factors to the accident, with the agency violating its own safety rules."

[1]: https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disas...


I presume this is well entrenched in the Amazon culture.

Jeff Bezos once said: "Good intentions never work, you need good mechanisms to make anything happen"


That's exactly it. Amazon doesn't like sharing all that much, but I wish they'd publicly release that video.


This is the major basis of the CMM Levels [1]. At higher levels of maturity and necessity, systems and processes are designed to increasingly prevent errors from reaching a production environment.

Amazon is taking the right approach here. The fact that a system as complex and important as S3 can be taken down is a failure of the system, not the person who took it down accidentally.

1. https://en.wikipedia.org/wiki/Capability_Maturity_Model#Leve...


A lot of the IT vendors I have worked with were CMM/CMMi level 5. But the crappiness of their development, process, deployment, etc. makes me wonder whether all their effort goes into attaining those certifications as opposed to doing better work.


As someone who worked for an IT vendor with certification and as someone who was part of the certification team at another place, I can assure you that you're right.

The certification is more for the organization/unit, and the people doing the work don't realize what it's for. Another thing that usually becomes a problem is the rigidity of the certification. Saying you need X, Y, and Z documented is easy, but it doesn't work for projects that maybe don't have Y. So people make up documentation and process just to be compliant, and this soon becomes a hindrance to the work. At this point people either abandon the process or follow it and the work suffers.


Thank you for adding this comment. I am glad there are more people out there that aren't afraid to be honest about some of the nonsense 'follow the process no matter what' stuff that I have experienced over the years.


CMM level 5 ==> You have a well-documented, repeatable, and still horrible process that declares all errors statistically uncommon by "augmenting" the root cause with random factors. Insta-certification.

(I lied about the "insta" part)


How laudable is this, really?

I've had the privilege of either working for myself, the company that acquired mine and let me run the dev, or at Google. From that perspective, and what I understand about ops, the rarity is not having the attitude mentioned in the parent.


Are you suggesting we take their behavior for granted? Positive behavior needs to be praised – it's part of how society influences its members.


No


This is good. And for the software engineers, great. But I've heard from people doing the grunt work at Amazon -- warehouse staff -- that Amazon incentivises employees to rat each other out for mishandling, lateness, etc., fostering intense competition.


I spent time in the fulfillment centers, writing software for them. I definitely didn't see that sort of thing. There's no need: the software tracked everything they did. Low performers would be found and retained or 'promoted to customer' without the need for anyone to 'rat out'.

Plus, managing humans in a 'rat out' system would be incredibly inefficient. Now you need lots of employees just to listen to the ratting!


Yup. And for that mercy the engineer is going to be that much more careful, and loyal. I would be, that's for sure.


Agreed, especially regarding the culture, but isn't this pretty much the same explanation they gave a few years ago when something similar happened?

I seem to recall an EC2 or S3 outage a few years ago that boiled down to an engineer pushing out a patch that broke an entire region when it was supposed to be a phased deployment.

I could be mis-remembering that but it's important that these lessons be applied across the whole company (at least AWS) so it would be a bigger mark against AWS if this is a result of similar tooling to what caused a previous outage.


Pretty sure that one was a Microsoft Azure outage.

(Source: am a self-identified post-mortems connoisseur. :)


Not a bad plan. If you don't make enough mistakes on your own, ya gotta learn from the mistakes of others as a preventative.


Do you by chance keep a public log of your postmortem collection :)?


I don't, but danluu does! https://github.com/danluu/post-mortems


Yeah an EC2 engineer switched over traffic to a backup network connection that had significantly less bandwidth, triggering cascading failures.


Yeah, it makes sense to make changes to the system rather than do nothing and just blame someone. Errors happen; they're something you can't avoid.


That's just a public statement. How do you know whether the individual was reprimanded?


Because in my five years as an Amazon dev, that's exactly the attitude I witnessed. People are trying their best, so firing them won't help.


I believe Jeff once said something along the lines of "why would I fire an employee that made an honest mistake? I just spent a bunch of money teaching him a lesson"


lol what part of "these things should be done proactively and tested over and over in CI" does not make sense to management?


Putting the capability to take down S3 in to the hands of a single engineer seems a bit much.

Is mere extra training the right solution here?

Maybe they need something like the procedure that's used in missile silos:

Not allowing the shutdown system to function at all without the explicit authorization of at least two people.


The linked article also says the tools they use were changed to limit the amount of resources that could be taken down at a single time, the speed they could be taken down at, and a hard floor was put on the number of instances that could be stopped.

That's a lot more than just extra training, and a lot better than a two-key system.


> Maybe they need something like the procedure that's used in missile silos...

Probably a bad example. The system was a pain in the ass, so they went and circumvented some of its restrictions.

http://gizmodo.com/for-20-years-the-nuclear-launch-code-at-u...

> Those in the U.S. that had been fitted with the devices, such as ones in the Minuteman Silos, were installed under the close scrutiny of Robert McNamara, JFK's Secretary of Defence. However, The Strategic Air Command greatly resented McNamara's presence and almost as soon as he left, the code to launch the missile's, all 50 of them, was set to 00000000.

> Oh, and in case you actually did forget the code, it was handily written down on a checklist handed out to the soldiers.


I think you have it backwards. The post does not say they will simply be training the problem away. They are putting safeguards into their tooling to prevent the case of a fat finger.


The article leaves little doubt that they didn't know such an event would be so hard to recover from. They knew it wouldn't be easy, but they were surprised by how bad it was.


To make error is human. To propagate error to all server in automatic way is #devops - DevOps Borat


I've long said something like "To err is human. To fuck up a million times in a second you need a computer."

I may have to upgrade that to take the mighty power of Cloud (TM) into account, though. Billions and trillions of fuck ups per second are now well within reach!


I can't wait until quantum computing lets us add a degree of simultaneity to fucking up. Fuck up in many ways... AT ONCE!


Quantum computing: giving humans the unprecedented ability to make every possible error at once


It will be fucked, not fucked, neither, and both... until we look. I feel bad for the poor bastard that has to look...


Internship in the future just got a whole lot bleaker.


Schrödinger's buttocks


It will not be certain if you have fucked up or not until you actually go to check.


But checking affects the outcome! https://en.wikipedia.org/wiki/Heisenbug


> I've long said something like "To err is human. To fuck up a million times in a second you need a computer."

This quote (paraphrased) actually dates all the way back to 1969:

> To err is human; to really foul things up requires a computer.

-- http://quoteinvestigator.com/2010/12/07/foul-computer/


Yes, it is. I believe I added the concept of "fuckups per second", but my memory being what it is and the general creativity of the internet being what it is, I would not be surprised if it either wasn't original or I wasn't the first.


> "To err is human. To fuck up a million times in a second you need a computer."

If you made that up, I tip my hat off to you as payment for all my future uses of the phrase.


"A computer lets you make more mistakes faster than any other invention with the possible exception of handguns and tequila." -- Mitch Ratcliffe


I would go with: "To err is human; to cascade, DevOps."


In #devops is turtle all way down but at bottom is perl script - DevOps Borat


Automation doesn't just allow you to create/fix things faster. It also allows you to break things faster.


We may think that an automated system requires less understanding in order to operate it. But from the other point of view, you have to know what you are doing; the consequences of even a small change are big.

This is one of the things that happens with Windows: getting a server up is so easy that people believe they don't have to understand what's under the hood, and then we get a lot of misconfiguration and operational issues.


It's one of the reasons that silly guarantees like "twelve 9s of reliability" are meaningless. There are humans here. "Accidental human mishap" is gonna happen sometimes, and when it happens it's probably gonna affect a lot of data. Heck, at around 7 or 8 nines you have to account for the possibility that your operations team will decide that all your data is a vicious pack of timberwolves and needs to be defeated.


Note that's durability not reliability. You might not be able to get at it with every request (I think 99.99% is the target) but it'll still be there if you try again later.


The point is that at eleven nines, you're entering the realm of very rare/unlikely events that will also affect durability.

In other words, there's a lack of humility about "unknown unknowns".


But Amazon doesn't offer eleven 9s of availability. I don't think anybody serious does, so arguing about how silly eleven 9s of availability is is kind of pointless. The SLA is only four 9s of availability.


Not even four 9's - they only trigger SLA credits when they dip below 3 9's.


Note: they say "S3 is DESIGNED for 11 9s of durability". It's PR-speak to say that they don't give you any guarantee, but in theory the system is designed in a magnificent way.


11 9s of durability is about the likelihood of AWS loosing your data. It doesn't cover the likelihood of you being able to access your data; that's called availability.

For example, on GCS (Google's S3), a storage class specifies in how many locations the data is made available. All storage classes share the same durability (chance of google loosing your data) of 99.999999999%, but have different availability (chance of being able to retrieve data).


> chance of google not loosing your data of 99.999999999%

git commit -m 'typo'


I think it's a little better than that, actually.

It says that their ideal-case failure rate is 11 nines; that's how much you should lose to known, lasting issues like machines failing and cutting over.

Amazon's actual SLA offers 2 nines and 3 nines as the credit thresholds. So they're stating the reliability of their known system, and the rest is for events like this.


Durability and uptime are not the same thing. Durability is about the chance of losing your data and has nothing to do with service disruptions. Their uptime SLA is much lower. Looking at [1], it looks like the SLA says 3 9s (discounts given for anything lower) of uptime.

[1] https://aws.amazon.com/s3/sla/


As I understand it, those guarantees don't mean that the service will actually stay up for the given number of 9s; it's that you'll be reimbursed monetarily if and when they go down.


I don't think it even means that; their policy says that the reimbursement only happens when your reliability dips all the way down to three 9s:

https://aws.amazon.com/s3/sla/


The SLA (as you linked) says three nines. The 12 nines quoted by others is durability, not uptime.


Kinda the same thing, though. I mean, from my perspective there's no substantive difference between me saying "this service will stay up 99.xx% of the time" and me buying insurance to pay you for the 0.xx% of the time I might fail.

The alternative is that I use the insurance to pay my legal fees when you sue me for not meeting my uptime guarantees.


It's not the same thing. The Amazon service might only be costing you $100/mo, but if it goes down the cost to your business might be millions. They'll reimburse you the $100, not the millions.


Yeah, as soon as I read this I felt bad for the employee. I remember writing an update statement without a where clause and having to restore the table from backup. But that was at a company not as advanced as Amazon. Fat-fingering a key like that is just crazy (but comforting that even at Amazon it happens), and I'm sure they've made sure it can't happen again.


FWIW: Setting "safe-updates=1" in ~/.my.cnf will require UPDATE and DELETE statements in the client to have a WHERE clause which references a key. It's not perfect protection, but it will save you from a lot of mistakes.
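
For reference, a sketch of the one-time setup (assumes the MySQL/MariaDB command-line client; this affects only interactive sessions, not application code):

  # Turn on --safe-updates for every interactive mysql session.
  printf '[mysql]\nsafe-updates = 1\n' >> ~/.my.cnf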


That's awesome,

My worst DELETE fail however was:

  DELETE FROM table WHERE [long condition that resolves to true for all records]
Now I write SELECT or SELECT COUNT(*) over and over again until I see the data I expect and then change it to a DELETE/UPDATE.

It's not my personal habit, but some folks I know turn off auto-commit and BEGIN a transaction every time they enter an interactive SQL session. They then default to ROLLBACK at least once before COMMITting.

That, and having a user with read-only permissions or a read replica.


Is there a collection of data safety tips like this somewhere? I never knew this existed. What else am I missing?


Hmm, that's kinda cool. I'm in an MS shop and I don't know if SSMS has the same feature. My manager just looked at me and said "welp, go restore the table and be more careful next time." I was a new DBA at the time; still kinda new.


Use BEGIN TRAN, as mentioned above.


I once brought down our entire production XenServer cluster group by issuing a "shutdown now" in the wrong SSH window. Needless to say it was a bad feeling watching Nagios go crazy and realizing what had just happened.


    root@baz # shutdown now
    W: molly-guard: SSH session detected!
    Please type in hostname of the machine to shutdown: foo
    Good thing I asked; I won't shutdown baz ...
Surprising to see such a simple protection neglected.


I don't know how well-known molly-guard is, but I've never heard of it before. Definitely enabling it on my servers next week.


Interesting. Up until now I've considered it well-known to the point of ubiquity :)


Oh crap! Yeah I bet you were pretty panicked. My update statement destroyed data that my team used all the time so I was worried I'd get fired. Luckily that wasn't the case.


Fat fingers are just nature's way of making sure you test your back-up & restore procedures periodically :-)


When I first started using Linux and wanted to do some housecleaning, I did "rm -r *" in a folder. Cleaned up everything, no prob. Then went to some more folders, hit the up arrow on my keyboard fast to get to a command I had used before. Hit 'enter' before my brain realized I had landed on "rm -r *" and not the right command. Never used that command again.


Automation tends to make those kinds of errors worse rather than better. Perhaps less frequent and of a different nature than before, but screwing up an automated action cascades much, much faster than a human-initiated one. As a result, you have to watch things a good deal closer and build in more and tighter safeguards.

For instance: https://thenextweb.com/shareables/2014/05/16/emory-universit...

Note: Automation is great, you just can't be sloppy with it. EVER.

edit:fix minor typo


> established playbook

A playbook actually represents a lack of automation for a particular task.

The playbook itself should be automated, with automated tests that validate its correctness.


I've heard (and sometimes pushed) this rhetoric before, but something should be well understood before it's automated. Things that happen very rarely should be backed with a playbook + well-exercised general monitoring and tools. This puts human discretion in front of the tools' use and makes sure ops is watching for any secondary effects. Ops grimoires can gather disparate one-offs into common and tested tools, but they don't do anything to consolidate the reasons the tools might be needed.


To me that sounds like development and testing (i.e. figuring out what the steps are). Once you have that it should be automated fully.

Too often people will put up with the "well, we only do this once a month so it's not worth automating." Literally, I script everything now, just in simple bash... if I type a command, I stick it into a script, and then run the script. Over time you go back and modify said script to be better, and eventually this turns into a more substantive application. At a certain point, around the time that you have more than one loop or are trying to do things based on different error scenarios, it's probably time to rewrite it in another language.

The simplest thing this does for me, is guarantee that all the parameters needed are valid and present before continuing.


I've been doing it this way for years and it really, really works. Some places have reservations about it, since its lack of formality is considered "risky" by some.

Though, an alternative to switching to another language is using xargs well. Writing bash with some immutability has been pretty invaluable for my workflows lately. For example:

  seq 1 10 | xargs -P10 -I{} ssh $host-{} hostname


It's probably their name for an automated admin task. The post does not imply that this was merely a checklist of things to do. Ansible calls its automation recipes playbooks as well.


It's probably a page on the internal Wiki that the S3 team follows for that particular task. Most of the actual steps are probably automated, but it sounds more like a checklist.

I used to follow runbooks/playbooks written on the internal wiki when I worked at Amazon.


I don't think it means "playbook" in the Ansible sense. The dictionary (i.e. Wikipedia) definition of "playbook" is "a document defining one or more business process workflows aimed at ensuring a consistent response to situations commonly encountered during the operation of the business", and that's how I know it.

At $work, certain types of frequently-occurring alerts have playbooks that document how the alert in question can be diagnosed and how known causes can be remedied. Something like "Look at Grafana dashboard X. If metric Y is doing this and that thing, the cause is Z. Log on to box 16 and systemctl restart the foo.service."


Hm, based on the description, I would be surprised if they could fat-finger that.


playbooks can take arguments: http://stackoverflow.com/questions/30662069/how-can-i-pass-v...

So, fat-fingering something is eminently possible.


To be fair, the real problem isn't that someone screwed up a playbook or command. The real problem is that a tiny mistake in a command can cause an entire service to be disrupted for hours. That's the problem that needs to be fixed.


"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future."


I wonder why this wasn't thought of while creating the system. Of course, I don't have experience at that scale, so I'm just wondering.


It seems like sometimes this is just how iteratively automating things works, especially on an internal-facing tool.

You have some process that starts out being "deploy this app with this java code". You deploy once in a while, so it's not a big deal. But then those changes get a bit more frequent, and so you pull out the common bits and the process becomes "make this YAML change in git and redeploy the app".

That works until you find yourself deploying 5 times a day, so you turn it into a MySQL table, and the process becomes "write a ROLL plan that executes this UPDATE x=y WHERE t=u; command"

After a while you get super annoyed at some quirk of the commands and figure, "Ok, fine, I'll just add an endpoint and some logic that just does this for the command case."

Then you wanna go on vacation and the new guy messed up the API request last week, so you figure, "I'll just add a little JS interface with a little red warning if the request is messed up in this way or that before I go".

You get back from vacation and some original interested party (whoever has wanted all these changes deployed) watched the intern make the change and thinks they could just do it themselves if they had access to the interface. You're wary, but you make the changes together a few times and maybe even add a little "wait-for-approval" node in the state machine.

Life is good. You've basically de-looped yourself, aside from a quick sanity check and button press, instead of what was a ~2 hour code + build + PR + PR approved + deploy process.

Then that interested party goes to work for Uber and the rest of your team adds a few functionalities on top of the interface you built and it all goes pretty well, until you realize that now that this thing that used to be 20 YAML objects is now 50k database records, and a bunch of them don't even apply anymore. So you build a button to disable some group of them, but after getting it deployed you realize it's actually possible to issue a "disable all" request accidentally if you click a button in your janky JS front-end before the 50k records download and get parsed and displayed. Oops! This mistake that you and the original interested party would have never made (because you spent the last 2 years thinking about all this crap) is probably a single impatient anxious mouse-click away from happening. So you make a patch and deploy that.

Congrats! You found that particular failure mode and added some protections for it, and maybe added some other protections like rate-limiting the deletions or updates or whatever. That's cool, but is that every failure mode? I bet it isn't. What happens when someone else thinks you have too many endpoints and just drops to SQL for the update?

Basically, yeah, of course you think of this stuff while iterating on it. But you figure "only power users are on the ACL" or "my teammates will understand the data model before making changes, or ask me first" or "that's what ROLL plans are for" or "I'll show a warning in the UI" or whatever. Fundamentally, you're thinking about a way to do a thing, if you're even thinking about it at all.

So yeah, that's what I've spent the last year or two doing. :-)


I have been doing this too, though not quite at this scale; it's mostly Python scripts to automate something, but because of the low scale and the fact that I am the sole owner + user, I am good to go :-D


To be fair, it looks like you agree with AWS on this point.


Or the root cause is a UI that allows mistakes like these.


Is it possible to build a UI that will not allow you to make a mistake?

If the computer knows exactly what actions would be a mistake, why can't it just do the correct actions (those that aren't a mistake) automatically?


You don't have to know what would be a mistake. E.g. if the tool is used most of the time to operate on a small set of servers, you have some extra confirmation or command-line option for removing a large set.

That's good UI design in tools with powerful destructive capabilities. You make the UI for doing a lot of things vs. the few things you do routinely different enough that there's no mistaking them.


You can also have the program tell the user what's going to happen (if it can be computed beforehand), e.g. "This will affect 138 server(s)."


Yes, but be careful. UIs like that tend to accumulate "--yes" options, because you don't feel like being asked every time for 1 server. Then one day you screw up the wildcard and it's 1000 servers, but you used the --yes template.

Which is why I'm pointing out that to design UIs like these you should fall back on slightly different UIs depending on the severity of the operation.


This is a good pattern to use. The more pre-feedback I get, the less likely I am to make a horrible mistake. However one problem I often see with this pattern is the numbers are not formatted for humans to read. Suppose it prompts:

  "1382345166 agents will be affected. Proceed? (y/n)"
Was that ~100M or ~1B agents? I can't tell unless I count the number of digits, which itself is slow and error-prone. It's worse if I'm in the middle of some high-pressure operation, because this verification detour will break my concentration and maybe I'll forget some important detail.

Now if the number is formatted for a human to consume, I don't have to break flow and am much less likely to make an "order-of-magnitude error":

  "1,382,345,166 (1.4M) agents will be affected. Proceed? (y/n)"
I always attempt to build tooling & automation and use it during a project, rather than running lots of one-off commands. I find this usually saves me & my team a lot of time over the course of a project, and helps reduce the number of magical incantations I need to keep stored in my limited mental rolodex. I seem to have better outcomes than when I build automation as an afterthought.


This doesn't work. Users learn to ignore the message.


I think it depends on the quality of the feedback. Most tooling sucks, so the messages are very literal trace statements peppered through the code, rather than descriptions of what the user-facing impact will be. When the thing is just spitting raw information at me, I'm probably going to train myself to ignore it. But if it can tell me what is going to happen, in terms that I care about, then I'll pay attention.

Imagine I just entered a command to remove too many servers that will cause an outage:

  "Finished removing servers" 
  (better than no message, I suppose)
vs

  "Finished removing 8 servers"
  (better, it's still too late to prevent my mistake 
    but at least I can figure out the scale of my mistake)
vs

  "8 servers will be removed. Press `y` to continue"
  (better, no indication of impact but if I'm paying
     attention I might catch the mistake)
vs

  "40% capacity (8 servers) will be removed. 
    Load will increase by 66% on the remaining 12 servers. 
    This is above the safety threshold of a 20% increase. 
    You can override by entering `live dangerously`."
  (preemptive safety check--imagine the text is also red so it stands out)


Obviously some UIs make some errors less likely. You don't have the "launch the nukes" button right next to the "make coffee" button, because humans are clumsy and don't pay attention.


Fat-finger implies you made your mistake once. A UI can't stop you from setting out to do the wrong thing, but it can make it astronomically unlikely to do a different action than the one you intended.

Simple example: I have a git hook which complains at me if I push to master. If I decide "screw you, I want to push to master", it can't assess my decision, but it easily fixes "oops, I thought I was on my branch".
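
A minimal sketch of such a pre-push hook; this isn't the author's actual script, and the branch name is an assumption:

  #!/bin/sh
  # .git/hooks/pre-push (must be executable) - refuse accidental pushes to master.
  # git feeds one "<local ref> <local sha> <remote ref> <remote sha>" line per ref.
  while read -r local_ref local_sha remote_ref remote_sha; do
      if [ "$remote_ref" = "refs/heads/master" ]; then
          echo "Refusing to push to master; use --no-verify if you really mean it." >&2
          exit 1
      fi
  done
  exit 0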


A good UI should be able to help, especially in critical situations. I imagine Amazon will consider something like this:

> The dosage you ordered is an order of magnitude greater than the dosage most commonly ordered for this medicine. Continue? y/n


But that still allows you to make a mistake - by pressing y when that's the wrong thing to do.


There's a balance to be struck. I'd say number of hoops you have to jump through to do something should scale with the potential impact of an operation.

That said, the only way to completely prevent mistakes is to make the tool unable to do anything at all.

(Or to encode every possible meaning of the word "mistake" in your software. If you could do that, you would probably get a Nobel prize for it.)


In a program I wrote I make the user manually type "I AGREE" (case-sensitive) in a prompt before continuing, just to avoid situations where people just tap "y" a bunch of times.


Habituation is a powerful thing: a safety-critical program used in the 90s had a similar, hard-coded safety prompt (<10 uppercase ASCII characters). Within a few weeks, all elevated permission users had the combination committed to muscle memory and would bang it out without hesitation, just by reflex: "Warning: please confirm these potentially unsaf-" "IAGREE!"


It's indeed a real problem. Hell, I myself am habituated to logins and passwords for frequently used dialog boxes, and so just two days ago I tried to log in on my work's JIRA account using test credentials for an app we're developing...

For securing very dangerous commands, I'd recommend asking the user to retype a phrase composed of random words, or maybe a random 8-character hexadecimal number - something that's different every time, so can't be memorized.
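
A sketch of what that could look like in a shell tool; the wording, the affected resources, and the token length are arbitrary:

  # A random token defeats muscle memory: it is different on every invocation.
  token=$(od -An -N4 -tx1 /dev/urandom | tr -d ' \n')
  echo "This will remove 40% of production capacity."
  printf 'Type %s to continue: ' "$token"
  read -r answer
  [ "$answer" = "$token" ] || { echo "Aborted." >&2; exit 1; }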


I think that even if someone can't memorize the exact characters, they'll memorize the task of having to type over the characters. Better would be to never ask for confirmation except in the worst of worst cases.


That's what I meant in my original comment when I wrote that "number of hoops you have to jump through to do something should scale with the potential impact of an operation". Harmless operations - no confirmation. Something that could mess up your work - y-or-n-p confirmation. Something that could fuck up the whole infrastructure - you'd better get ready to retype a mix of "I DO UNDERSTAND WHAT I'M JUST ABOUT TO DO" and some random hashes.


Not sure if even that would work.

I've almost deleted my heroku production server even though you need to type (or copy paste....ahem...) the full server name (e.g. thawing-temple-23345).

I think the reason was that, because in my mind I was 100% sure this was the right server, when the confirmation came up I didn't stop to look whether this was indeed the correct one. I mechanically started to type the name of the server, and just a second before I clicked OK, I had this genius idea to double check.... Oh boy... My heart dropped to the floor when I realized what I was about to do.

You could say that indeed Heroku's system of avoiding errors worked correctly....

However, the confirmation dialog wasn't what made me stop... Instead it was my past self's experience screaming at me and reminding me of that ONE time when I did fuck up a production server years ago (it cost the company a full day of customers' bids... Imagine the shame of calling all the winning bidders and asking them what price they ended up bidding to win....)

My point is, maybe no number of confirmation dialogs, however complex they are, will stop mistakes if the operator is fixed on doing X. If you are working in semi-autopilot mode because you obviously are very smart and careful (ahem..), you will just do whatever the dialog asks you to do without actually thinking about what you are doing.

What, then, will make you stop and verify? My only guess is that experience is the only way. I.e., only when you seriously fuck up do you learn that, no matter how many safety systems or complex confirmation dialogs there are, you still need to double- and triple-check each character you typed, lest you go through that bad experience again....


A well-designed confirmation doesn't give you the same prompt for deleting some random test server as it does for deleting a production server. That helps with the "autopilot mode" issue.


I agree that it should help reduce the amount of mistakes.

But I still believe auto-pilot mode is a real thing (and a danger!) .

My point is that I'm not sure if it's even possible to design one that actually cuts errors to 0.

And if that's indeed the case, even if it's close to 0, it's still non-zero, thus at the scale Amazon operates at, it's very probable that it will happen at least one time.

Maybe sometime in the future AI systems will help here?


I totally agree that it's a real issue, a danger, and that it's impossible to cut errors to zero.

I've also built complex systems that have been run in production for years with relatively few typo-related problems. The way I do it is with the design patterns like the one I just mentioned, which is also what TeMPOraL was talking about (and I guess you missed it.)

If you have the same kind of confirmation whenever you delete a thing, whether it's an important thing or not, you're designing a system which encourages bad auto-pilot habits.

You'll also note that Amazon's description of the way that they plan on changing their system is intended to fire extra confirmation only when it looks like the operator is about to make a massive mistake. That follows the design pattern I'm suggesting.


> My point is that I'm not sure if it's even possible to design one that actually cuts errors to 0.

Personally, I don't believe it is without making the tool impotent. But you can try and push down the error probability down to arbitrarily low value.


Still no help against "whoops, took down a different production instance than intended."


We're assuming that the software in question is even aware of the potential impact. It might not have that information.


This prevents fat-finger mistakes.

You could go further and try to prevent cat-on-the-keyboard mistakes, which is maybe what you're describing (solve this math equation to prove you are a human who is sufficiently not inebriated). Or even further and prevent malicious, trench-coat wearing, pointy-nosed trouble-makers.

The point is, yes, it is possible. That's what good design does.


It's not possible to be perfect, but you can certainly do better than taking down S3 because of a single command gone wrong.

One thing I have been doing for my own command-line tools is adding a preview of what a command will do and making the preview state the default. It's simple, but if the S3 engineer had first seen a readout of the huge list of servers that were going to be taken offline instead of the small expected list, we probably would not be talking about this. There's obviously a ton more you can do here (have the tool throw up "are you sure" messages for unusual inputs, etc.).
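
A bare-bones sketch of that dry-run-by-default pattern; the script name and the actual removal command are placeholders:

  #!/bin/sh
  # remove_servers.sh [--apply] HOST... - previewing is the default; --apply acts.
  set -eu
  apply=no
  if [ "${1:-}" = "--apply" ]; then apply=yes; shift; fi
  [ "$#" -ge 1 ] || { echo "usage: $0 [--apply] HOST..." >&2; exit 2; }
  echo "The following $# server(s) would be removed:"
  printf '  %s\n' "$@"
  if [ "$apply" != "yes" ]; then
      echo "Dry run only. Re-run with --apply to actually remove them."
      exit 0
  fi
  for h in "$@"; do
      take-out-of-service "$h"                  # hypothetical removal command
  done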


If the computer knows exactly what actions would be a mistake - how? The difference between correct and incorrect (not to mention legal and illegal) is usually inferred from a much wider context than what is accessible to a script. Mind you, in this specific case, Amazon even implies that such a command could have been correct under other circumstances.

So, this means a) strong superhuman AI (good luck), b) deciding from an ambiguous input to one of possibly mistaken actions (good luck mapping all possible correct states), or c) a drool-proof interface ("It looks like you're trying to shut down S3, would you like some help with that?").

TL;DR: yes, but it's a cure worse than the disease.


> If the computer knows exactly what actions would be a mistake - how?

I don't know. I was suggesting it wasn't realistic to do that, and therefore it wasn't realistic to implement a UI that prevents you making mistakes.


That's what they claim they will do to ameliorate this. They will build limits into their tools.


Why weren't they there already?


Maybe they were, but they missed this one thing?


Hindsight is 20/20.


Because human error isn't foreseeable? Or a disgruntled employee?


Because there are limits to engineering resources even at Amazon.


It can validate inputs and not let you enter out of range data. You can know that an answer is wrong without knowing what the right answer is.


Possibly the values are all within range. It was just that this operation only worked on elements that were a subset. No amount of validation will catch that error.

You could feedback a clarification, but if that happens too often nobody will double check it after they have seen it over and over.


While you can't prevent user error without preventing user capability, you can (as others have observed) follow some common heuristics to avoid common failure modes.

A confirm step in something as sensitive as this operation is important. It won't stop all user error, but it gives a user about to accidentally turn off the lights on US-EAST-1 an opportunity to realize that's what their command will do.


No. But a good UI can help you see the mistake you're about to make.


> Is it possible to build a UI that will not allow you to make a mistake?

no, because developers still produce code with bugs.


Good UX is important, even for things like scripts. Unfortunately a lot of tech people take pride in working with hard-to-use and error-prone tools.


If you have a UI that allows you to undeploy 10 servers, it will also allow you to undeploy 100 servers, unless you specifically thought about the possibility that there might be a lower bound on the number of servers, which they obviously hadn't before this. It's easy to talk about it after the fact, but nobody is able to predict all such scenarios in advance; there are just too many ways to mess up to have special code for all of them in advance.


It's not really a UI issue.

The tool as a whole should incorporate a model of S3. Any action you take through the UI should first be applied to this model, and then the resulting impact analyzed. If the impact is "service goes down", then don't apply the action without raising red flags.

Where I work we use PCS for high availability, and it bugs the heck out of me that a fat-fingered command can bring down a service. PCS knows what the effect of any given command will be, but there's no way (that I know of) to do a "dry run" to see whether your services would remain up afterward.
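
A rough sketch of that idea, with an obviously simplified model (the Cluster class and the MIN_LIVE threshold are made up for illustration): apply the action to a copy of the model first, and refuse if the simulated result violates an invariant.

    import copy

    MIN_LIVE = 3  # assumed minimum number of servers for the service to stay up

    class Cluster:
        def __init__(self, servers):
            self.servers = set(servers)

        def remove(self, targets):
            self.servers -= set(targets)

        def is_healthy(self):
            return len(self.servers) >= MIN_LIVE

    def apply_with_check(cluster, targets):
        # Dry-run the action against a copy of the model before touching the real thing.
        simulated = copy.deepcopy(cluster)
        simulated.remove(targets)
        if not simulated.is_healthy():
            raise RuntimeError(
                "Refusing: removing %d server(s) would leave only %d live"
                % (len(targets), len(simulated.servers)))
        cluster.remove(targets)  # real action only after the model says it's safe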


Interesting.

In practice, it would likely be very hard to make a model of your infrastructure to test against, but I can imagine a tool that would run each query against a set of heuristics, and if any flags pop up, it would make you jump through some hoops to confirm. Such a tool should NEVER have an option to silently confirm, and the only way to adjust a heuristic if it becomes invalid should be formally getting someone from an appropriate department to change it and sign off on it.

By the way, this is how companies acquire red tape. It's like scar tissue.


For many systems, the rule is simply "X of Y servers must be up". Something like that isn't too hard to enforce.


They probably didn't know the service would go down. For that, you need to identify the minimal requirements for the service to stay up, and code those requirements into the UI, upfront. Most tools don't do that. File managers don't check that the file you're deleting isn't needed by any installed software package. Shells don't check that the file you're overwriting isn't a vital config file. Firewall UIs don't check that the port you're closing isn't vital for some infrastructural service. It would be nice to have a benevolent, omniscient, God-like UI with the foresight to check such things - but usually you only learn about them after the first (if you're lucky) time it breaks.


Or it makes you re-enter the quantity of affected targets as a confirmation, similar to the way GitHub requires a second entry of a repo name for deletion.
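
A minimal sketch of that confirmation pattern (the prompt wording and the example target list are invented for illustration): the operator has to type back the number of affected servers before anything happens.

    def confirm_count(targets):
        # Require the operator to re-type the exact count, GitHub-delete-repo style.
        print("This will take %d server(s) offline." % len(targets))
        typed = input("Type the number of servers to confirm: ").strip()
        if typed != str(len(targets)):
            print("Confirmation did not match; aborting.")
            return False
        return True

    if __name__ == "__main__":
        targets = ["server-a", "server-b", "server-c"]  # hypothetical
        if confirm_count(targets):
            print("Proceeding...")  # the destructive work would go here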


Agreed -- a postmortem should cite that deployment goof as the immediate cause, with a contributory cause of "you can goof like this without getting a warning etc".


Computers are devices built to amplify human error.


Bicycles for the bumbling mind.


I don't understand how this is even possible in a company operating on that scale. Granted, I'm a lowly scientific programmer with no clue about running a cloud infrastructure, but I would have imagined that there would be at least a pretense of oversight for destructive commands run in such an environment. A scheme as simple as "any destructive command run on S3 subsystems is automatically run in a dry run form, and requires independent confirmation by 2-3 other engineers to actually come into effect" would have prevented this altogether. Given the overall prominence of S3, this incident seems to demonstrate a rather callous attitude on the part of the organization.
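
For illustration only, a toy sketch of the kind of gate described above (the ChangeRequest class, the two-approval policy, and the reviewer handling are all invented, not anything Amazon has described): the dry-run output is attached to a change request, and execution stays blocked until other engineers sign off.

    class ChangeRequest:
        REQUIRED_APPROVALS = 2  # assumed policy for destructive commands

        def __init__(self, author, dry_run_output):
            self.author = author
            self.dry_run_output = dry_run_output  # what the command *would* do
            self.approvals = set()

        def approve(self, reviewer):
            if reviewer == self.author:
                raise ValueError("Author cannot approve their own change")
            self.approvals.add(reviewer)

        def can_execute(self):
            return len(self.approvals) >= self.REQUIRED_APPROVALS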


I thought the same thing before I went into the industry but now that I've been in it for a few years (including two at Amazon), it doesn't surprise me.

I suspect locking everyone down in the way you suggest would cost more in lost productivity (and costs for the infrastructure that would be required for greater auditing, etc.) than is lost in outages like this.


A number of lawyers must have drafted those lines, and five people including Bezos must have approved them.

Those lines are not reflective of what Amazon is, but of the picture Amazon wants to paint now. They have clarified that it was their error and not some hacking attempt. Secondly, they have not vilified the engineer in question, because Amazon's culture is already a bit of a ??? in the public mind.

But they have got it right. Shit happens, and this is not the first time it has happened, nor the last time it will happen. It will also happen at Microsoft, Google and everyone else.

Maybe we will build even better technologies that rely on two different cloud providers instead of one.


It is always going to be like that. If you write software with a rule "do not remove more than 5% of capacity at once", it will always be enforced, whereas if you tell a systems engineer "please do not remove more than 5% of capacity at once", it will fail some 0.0x% of the time. The solution is to move the execution of a change into a system that spits out steps which are then executed automatically by the system itself, removing the human factor entirely.
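
A minimal sketch of encoding that rule in the tool itself (the 5% figure, the function, and the capacity numbers are just placeholders): the check runs on every generated plan, so it never depends on an operator remembering it.

    MAX_REMOVAL_FRACTION = 0.05  # "do not remove more than 5% of capacity at once"

    def plan_removal(total_capacity, requested):
        # Reject the plan instead of trusting the human-entered number.
        limit = int(total_capacity * MAX_REMOVAL_FRACTION)
        if requested > limit:
            raise ValueError(
                "Requested removal of %d exceeds the %d-server limit (5%% of %d)"
                % (requested, limit, total_capacity))
        return {"remove": requested, "remaining": total_capacity - requested}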


Every communication channel has its flaws. CLI is fast and that's why it is a favorite. It is also noisy. If you have to worry about a fat finger, you are using the wrong communication channel or could afford to be a bit more verbose within that channel. That's why rm has safety nets.


GUIs are really great. They're a recent development in the computing industry that help mitigate this sort of problem. You can even put prompts in that get you to confirm Yes/No to continue.

I think Borland do some RAD systems, and Microsoft have an IDE of sorts on the way too.

EDIT: Please note that this is humour.


20 years ago I read a postmortem of Tandem and their Non-Stop Unix. A core take-away for me was: "Computer hardware has gotten way more reliable than it was." combined with "The leading cause of outages has become operators making mistakes."


Somehow, it's reassuring to see how 45 years later, the rm -rf class of problem still persists.


Sounds like an opportunity for machine learning. Anyone want to write an AI BOFH?


Basically what Netflix's Chaos Monkey is: https://github.com/netflix/chaosmonkey


Heaven forbid an AI whose primary object is Chaos Monkey gain sentience. That might just be worse than paperclips.


Most of the BOFH I've known have seemed artificially intelligent already.


Someone forgot a --limit on their Ansible playbook? I've done that too.


Serious question - why has no one ever accidentally launched and nuked a city, with thousands of nuclear warheads able to do so on short notice? AWS presumably puts in far more redundancy, and yet with all that effort still comes up this far short. Why? The military has a huge amount of brainpower set up so that this never, ever happens. Whatever works for them, can't AWS adopt those actual best practices?


When I think about questions like these, I recall the Anthropic Principle. Perhaps on lots of planets, intelligent life ceased at the beginning of the Atomic Age. Here we are seven decades (several generations!) in, and we're still alive! The numerator on the odds almost doesn't matter, when you never get to see the denominator. Now that we're finding all these planets, perhaps we ought to start looking for nuclear extinction events? They probably wouldn't leave lasting evidence, but if they're common enough they wouldn't need to...

Actually the accounts I've read seem to indicate that most missile operators simply decided they would never launch no matter what. God bless them, for that.


In 2016, the UK accidentally fired a Trident missile at the US mainland: http://www.dw.com/en/uk-government-covered-up-disastrous-fai...

This one did not carry a warhead. Others do...


I highly recommend reading https://www.amazon.com/Command-Control-Damascus-Accident-Ill...

Turns out the answer to your question is simply: luck.


Sounds like AWS needs Spinnaker for easy rollbacks! https://news.ycombinator.com/item?id=13776456


Why? People make mistakes.


feel sorry for that user, I'd want to hide in a corner


So they're going to build a complex system to correct possible user command line errors. That new system itself will introduce possible errors. Wouldn't an administrative GUI have been much simpler to implement overall?


Do you have an assumption that GUIs are safer than command lines? Or have fewer programming bugs in them? I don't think either is true.


No, they're introducing safeguards to an already established system. It's not even that complex, for that matter.


Wonder what happened to the poor slob who did that. He/She was unauthorized AND caused a pretty serious outage...

EDIT: derp, my bad, I read as "unauthorized" which was "authorized".


One of the positive things about Amazon's culture is that they heavily emphasize blaming broken processes, not blaming people. I doubt the person involved will have any negative consequences beyond embarrassment.


I would be horrified if I learned that Amazon or any other company of such size in any way castigates employees for such very human errors. The guilt (don't beat yourself up) he or she likely feels is bad enough.

Anyway, to me this firstly sounds like a "tool" or command that was too powerful with not enough safeguards. Who knows, the command might even have been ambiguous.


Can you imagine the feeling? I once sent a debug email to our mailing list and felt terrible for days. Imagine bringing down the internet...


> At 9:37AM PST, an >>> authorized <<< S3 team member [...]

typos happen, if the system didn't stop them that's a design issue or accepted risk.


The article specifically said "authorized".


How do you figure? From the blog post:

" At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."


Haha it's funny that a simple misread of "authorized" as "unauthorized" got this comment downvoted to oblivion...


The post never once blames human error, but always specifies that the tools were the problem.


I think the article says he was authorized? Either way, at worst, he got fired.


I assume if they were unauthorized they would be fired, but since they were authorized shit happens.

"Fire you? I just spent $10 million training you!"


Exactly - this person is now the expert on the error, and they can now impart this valuable knowledge to the organization.


Did the parent comment change? It currently says "an authorized..."


This part is also interesting:

> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.

These sorts of things make me understand why the Netflix "Chaos Gorilla" style of operating is so important. As they say in this post:

> We build our systems with the assumption that things will occasionally fail

Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.


> Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.

Exactly. It seems likely that Amazon tests the restart operation, but it would be hard to test it at full us-east-1 scale. Running a full S3 test cluster at that scale would likely be a prohibitive expense. Perhaps the "index subsystem" and "placement subsystem" are small enough for full-scale tests to be tractable, but certainly not cheap, and how often do you run it? Also, hindsight is 20/20, but before this incident it might have been hard to identify "full-scale restart of the index subsystem" as rising to the top of the list of things to test.

One approach is to try to extrapolate from smaller-scale tests. It would be interesting to know what kinds of disaster testing Amazon does do, and at what scale, and whether a careful reading could have predicted this outcome.


> Failure at every level has to be simulated pretty often to understand how to handle it

Keep in mind, S3 "fails" all the time. We regularly make millions of S3 requests at my work. Usually we see about a 1-in-240K failure rate (mostly on GETs), returning 500 errors. However, if you're really hammering an S3 node in the hash ring (e.g. a Spark job), we see failures in the 1-in-10K range, including SocketExceptions where the routed IP is dead.

Your code always needs to expect such services to die, with proper timeouts, backoffs, retries, queues, and dead-letter queues.
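
As a rough illustration of that advice (the bucket/key names, retry count, and backoff values are arbitrary, and boto3/botocore also ship their own configurable retry behaviour, which is often the better place to handle this):

    import time
    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    s3 = boto3.client("s3")

    def get_with_retries(bucket, key, attempts=5, base_delay=0.5):
        # Treat 404s-on-fresh-writes, 5xx errors and connection failures as
        # transient, and back off exponentially between attempts.
        for attempt in range(attempts):
            try:
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except (ClientError, EndpointConnectionError):
                if attempt == attempts - 1:
                    raise  # out of retries; let the caller or a dead-letter queue handle it
                time.sleep(base_delay * (2 ** attempt))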


If you retry the get does it succeed usually?


Yes, retry has always worked (*us-west-2).

Sometimes it's a 404 for an object written 1 sec prior, other times it's an S3 node that died mid request. Retry gets you to a different node.


> Perhaps the "index subsystem" and "placement subsystem" are small enough for full-scale tests to be tractable, but certainly not cheap, and how often do you run it?

Rough guide:

CT = cost of 1 full scale test with necessary infrastructure and labor costs added up

CF = amount of money paid out in SLA claims + subjective estimate of business lost due to reputation damage etc

PF = estimate of probability of this event happening in a given year

if PF * CF > CT, then you run such a test at least once a year. Think of such an expense as an insurance premium.
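
A quick hypothetical with entirely made-up numbers: if a full-scale test costs $2M (CT), a failure would cost roughly $50M in SLA credits plus lost business (CF), and you estimate a 10% chance of the failure in a given year (PF), then PF * CF = $5M > $2M, so the test pays for itself as insurance.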

What Netflix does with their simian army is amortize the cost of doing the test across millions of tests per year and the extra design complications arising from having to deal with failures that often.


This is precisely why cells (alluded to in the write-up) are beneficial. If the size of a cell is bounded and you scale by adding more cells, testing the breaking point of the largest cell becomes an easier problem. There is still usually a layer that spans across all cell boundaries, which is what then becomes hard to test at prod scale (so you make that as simple as possible)


Testing a full zone is only possible when they have a new, unused zone available. I bet they do run these tests, and now they have a new scenario to add to them.


They also probably have one or more test regions where they could perform a test like this. But it's presumably not at nearly the same scale as us-east-1, the region affected by this incident. And to a considerable extent the problem was one of scale. The writeup makes the recovery sound fairly straightforward; but due to the sheer size of S3 in this region, it took hours for the system to come back up, which was apparently unexpected.

(Nit: this incident affected a region, not a zone. us-east-1 is a region, which is divided into zones us-east-1a, us-east-1b, etc. S3 operates on regions.)


I think you meant "Chaos Monkey" [1].

[1] https://github.com/Netflix/chaosmonkey


No, Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone

http://techblog.netflix.com/2011/07/netflix-simian-army.html


I recently saw a talk where they referred to Chaos Monkey (kills instances), Chaos Gorilla (kills many instances for a single service in a single region) and Chaos Kong (takes an entire region offline)


They have an entire simian army of chaos for the purposes of simulated destruction of their network.


http://techblog.netflix.com/2011/07/netflix-simian-army.html

> Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.


Chaos gorilla is a thing as well, simulates outage of an entire AZ. http://techblog.netflix.com/2011/07/netflix-simian-army.html


Yep. There's a transition point past which you can't rely on redundancy alone, because there are so many components that it's basically inevitable that at any given time something, somewhere will be in a degraded state. So you design for that case, the degraded-normalcy case: you make something failing somewhere a non-emergency. It takes a lot of work, but once things work that way you can guarantee you stay in that state by testing it routinely in production.


Totally agree. I'd also point out that if you have systems that have been up for many years, they likely haven't been updated in all that time... shouldn't people find that alarming?
