Summary of the Amazon S3 Service Disruption (amazon.com)
265 points by oscarwao 1 hour ago | 122 comments





> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

It remains amazing to me that even with all the layers of automation, the root cause of most serious deployment problems remains some variant of a fat-fingered user.

To make error is human. To propagate error to all server in automatic way is #devops - DevOps Borat

I've long said something like "To err is human. To fuck up a million times in a second you need a computer."

I may have to upgrade that to take the mighty power of Cloud (TM) into account, though. Billions and trillions of fuck ups per second are now well within reach!

I can't wait until quantum computing lets us add a degree of simultaneity to fucking up. Fuck up in many ways... AT ONCE!

Automation doesn't just allow you to create/fix things faster. It also allows you to break things faster.

We may think that an automated system requires less understanding to operate. But from the other point of view, you have to know what you are doing, because the consequences of even a small change are big.

This is one of the things that happens with Windows: getting a server up is so easy that people believe they don't have to understand what's under the hood, and then we get a lot of misconfiguration and operational issues.

Computers are devices built to amplify human error.

Automation tends to make those kinds of errors worse rather than better. Perhaps more infrequent and of a different nature than before, but screwing up an automated action cascades much, much faster than a human initiated one. As a result, you have to watch things a good deal closer and build in more and tighter safe guards.

For instance: https://thenextweb.com/shareables/2014/05/16/emory-universit...

Note: Automation is great, you just can't be sloppy with it. EVER.

edit:fix minor typo

Yeah, as soon as I read this I felt bad for the employee. I remember writing an update statement without a where clause and having to restore the table from backup. But that was at a company not as advanced as Amazon. Fat-fingering a key like that is just crazy, and I'm sure they've put safeguards in place so it can't happen again.

Or the root cause is a UI that allows mistakes like these.

Good UX is important, even for things like scripts. Unfortunately a lot of tech people take pride in working with hard-to-use and error-prone tools.

Is it possible to build a UI that will not allow you to make a mistake?

If the computer knows exactly what actions would be a mistake, why can't it just do the correct actions (those that aren't a mistake) automatically?

You don't have to know what would be a mistake. E.g. if the tool is used most of the time to operate on a small set of servers, you have some extra confirmation or command-line option for removing a large set.

That's good UI design in tools with powerful destructive capabilities. You make the UI for operating on lots of things different enough from the one for the few things you do routinely that there's no mistaking them.

You can also have the program tell the user what's going to happen (if it can be computed beforehand), e.g. "This will affect 138 server(s)."
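A minimal sketch of both ideas (an extra opt-in for large removals, plus an up-front "this will affect N servers" prompt). The threshold, the `allow_large` flag, and the function name are all made up for illustration; nothing here is Amazon's actual tooling.

```python
from typing import Callable, List, Sequence

# Hypothetical threshold: removals above this size need an explicit opt-in.
LARGE_REMOVAL_THRESHOLD = 10


def _ask_on_terminal(prompt: str) -> bool:
    """Default confirmation: anything other than 'y' counts as no."""
    return input(prompt).strip().lower() == "y"


def plan_removal(
    targets: Sequence[str],
    allow_large: bool = False,
    confirm: Callable[[str], bool] = _ask_on_terminal,
) -> List[str]:
    """Return the servers to remove, or raise if a safeguard trips."""
    count = len(targets)
    # Routine small removals pass; large ones need the extra option.
    if count > LARGE_REMOVAL_THRESHOLD and not allow_large:
        raise ValueError(
            f"Refusing to remove {count} servers without allow_large=True "
            f"(threshold is {LARGE_REMOVAL_THRESHOLD})."
        )
    # Always state the blast radius before acting.
    if not confirm(f"This will affect {count} server(s). Continue? (y/N) "):
        raise RuntimeError("Aborted by operator.")
    return list(targets)
```

The point of the design is that the common case stays cheap (one "y") while the rare, dangerous case looks and feels different, so a typo in the count can't silently turn one into the other.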


> Is it possible to build a UI that will not allow you to make a mistake?

no, because developers still produce code with bugs.

Obviously some UIs make some errors less likely. You don't have the "launch the nukes" button right next to the "make coffee" button, because humans are clumsy and don't pay attention.

It can validate inputs and not let you enter out of range data. You can know that an answer is wrong without knowing what the right answer is.

Possibly the values are all within range. It was just that this operation only worked on elements that were a subset. No amount of validation will catch that error.

You could feedback a clarification, but if that happens too often nobody will double check it after they have seen it over and over.

That's what they claim they will do to ameliorate this. They will build limits into their tools.

Sounds like an opportunity for machine learning. Anyone want to write an AI BOFH?

Basically what Netflix's Chaos Monkey is: https://github.com/netflix/chaosmonkey

Heaven forbid an AI whose primary object is Chaos Monkey gain sentience. That might just be worse than paperclips.

Most of the BOFH I've known have seemed artificially intelligent already.

Wonder what happened to the poor slob who did that. He/She was unauthorized AND caused a pretty serious outage...

EDIT: derp, my bad, I read as "unauthorized" which was "authorized".

One of the positive things about Amazon's culture is that they heavily emphasize blaming broken processes, not blaming people. I doubt the person involved will have any negative consequences beyond embarrassment.

I would be horrified if I learned that Amazon or any other company of such size in any way castigates employees for such very human errors. The guilt (don't beat yourself up) he or she likely feels is bad enough.

Anyway, to me this firstly sounds like a "tool" or command that was too powerful with not enough safeguards. Who knows, the command might even be ambiguous.

Can you imagine the feeling? I once sent a debug email to our mailing list and felt terrible for days. Imagine bringing down the internet...

> At 9:37AM PST, an >>> authorized <<< S3 team member [...]

typos happen, if the system didn't stop them that's a design issue or accepted risk.

The article specifically said "authorized".

How do you figure? From the blog post:

" At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."

Did the parent comment change? It currently says "an authorized..."

I think the article says he was authorized? Either way, at worst, he got fired.

I assume if they were unauthorized they would be fired, but since they were authorized shit happens.

"Fire you? I just spent $10 million training you!"


This part is also interesting:

> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.

These sorts of things make me understand why the Netflix "Chaos Gorilla" style of operating is so important. As they say in this post:

> We build our systems with the assumption that things will occasionally fail

Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.

I think you meant "Chaos Monkey" [1].

[1] https://github.com/Netflix/chaosmonkey

No, Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone

http://techblog.netflix.com/2011/07/netflix-simian-army.html

I recently saw a talk where they referred to Chaos Monkey (kills instances), Chaos Gorilla (kills many instances for a single service in a single region) and Chaos Kong (takes an entire region offline)

They have an entire simian army of chaos for the purposes of simulated destruction of their network.

Chaos gorilla is a thing as well, simulates outage of an entire AZ. http://techblog.netflix.com/2011/07/netflix-simian-army.html

http://techblog.netflix.com/2011/07/netflix-simian-army.html

> Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.

> From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3.

Ensuring that your status dashboard doesn't depend on the thing it's monitoring is probably the first thing you should think about when designing your status system. This doesn't fill me with confidence about how the rest of the system is designed, frankly...
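For what it's worth, this kind of dependency can be checked mechanically if you keep a service dependency map. A rough sketch, assuming a hand-maintained dict of direct dependencies (the service names below are invented, not AWS's real topology):

```python
from typing import Dict, Iterable, List, Set


def transitive_deps(service: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Everything `service` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(deps.get(service, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(deps.get(current, []))
    return seen


def monitored_dependencies(
    dashboard: str, monitored: Iterable[str], deps: Dict[str, List[str]]
) -> Set[str]:
    """Monitored services the dashboard itself cannot live without."""
    return transitive_deps(dashboard, deps) & set(monitored)


# Hypothetical map: the console reaches the object store through one hop.
DEPS = {
    "status-console": ["asset-store"],
    "asset-store": ["object-storage"],
    "object-storage": [],
}
```

A non-empty result from `monitored_dependencies("status-console", ["object-storage", "compute"], DEPS)` is the red flag: the dashboard goes dark exactly when that service does.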

Anyone who works on large complex systems will read this and go "this was all preventable, but also completely understandable".

And apparently they had never tried rebooting some of the most important parts of that system. Just when you start to think that someone's really gotten it right you come to learn they're just fumbling around in the dark like everyone else.

My interpretation of this is that the indexing system was resilient to loss of a certain amount of capacity (probably around ⅓ + 1 host). As a guess, the indexing system probably used some form of consensus (e.g. Paxos) which has had an active leader for years. Deployments stay within that capacity constraint, so while hosts have been restarted and replaced (data center migrations, hardware lease expiration, failures, upgrades, etc.), they may not have recently run into a situation where quorum wasn't available for a partition, especially at the scale of restarting the entire fleet.

Since restarting the entire fleet would incur downtime of all relevant S3 operations, it's unlikely that it was something ever intentionally done in production (and they may or may not have run that scenario in other environments).

It sounds like the weakness in the process is that the tool they were using permitted destructive operations like that. The passage that stuck out to me: "in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
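The two safeguards in that quote (remove capacity more slowly, and never go below a subsystem's minimum) could look roughly like this. It's a sketch under my own assumptions: the `Subsystem` fields, the per-step cap, and the numbers are hypothetical, and the post doesn't say how Amazon actually implemented it.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    active_hosts: int
    min_required_hosts: int  # hypothetical per-subsystem capacity floor


def allowed_removal(subsystem: Subsystem, requested: int, max_per_step: int = 5) -> int:
    """Return how many hosts may be removed in this step.

    Raises ValueError if the full request would breach the capacity floor.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    remaining = subsystem.active_hosts - requested
    if remaining < subsystem.min_required_hosts:
        raise ValueError(
            f"Removing {requested} hosts would leave {subsystem.name} with "
            f"{remaining}, below its minimum of {subsystem.min_required_hosts}."
        )
    # Throttle: even a valid request is carried out a few hosts at a time.
    return min(requested, max_per_step)
```

A caller would loop: remove the returned number of hosts, update `active_hosts`, wait, and ask again with the remainder, so a bad input gets caught by the floor check before anything is touched and a merely large one plays out slowly enough to notice.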

At the organizational level, I guess it wasn't rated as all that likely that someone would try to remove capacity that would take a subsystem below its minimum. Building in a safeguard now makes sense as this new data point probably indicates that the likelihood of accidental deletion is higher than they had estimated.

I always wonder about unintended consequences of this sort of thing. Like someday there will be a worm about to rampage through their servers and someone says, "take them all offline now!" and the answer is, "we can't because of the throttle safeguard we put in place after incident XYZ, it will be about 17 hours..."

By safeguard I meant (and I think Amazon means too) an extra step that is required by the user before they can do the action so they don't do it by accident. Not something that prevents it entirely. Like how an MMO requires you before you delete a character to type the character's name in a box that pops up before you can delete it. That's far outside the realm of usual user interface, but that's so if you are just trying to edit a character it's impossible to accidentally hit that delete key. An analogous system for Amazon that would have prevented this outage: delete 10 nodes, ok. Delete 100 nodes, box pops up saying 'To delete this many nodes you must type the following in to a message box: "I want to take down a dangerously large amount of nodes."'

Which is why there needs to be an override switch, but it needs to be very very explicit that you are going past the safeguards. And only a limited number of people who can use that override.

then you need to "sudo take them all offline now!"

"We can't because we put a safeguard in for that last week!"

"Just pull the power!"

I'm remembering tools I've worked with where the cheeky dev required things like type the sentence "I know what I am doing and wish to proceed." in order to perform unsafe operations.
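The typed-sentence confirmation is only a few lines. A sketch, with the threshold taken from the "10 nodes ok, 100 nodes prompt" example upthread and everything else (function name, injectable `read_line`) invented for illustration:

```python
from typing import Callable

CONFIRMATION_PHRASE = "I know what I am doing and wish to proceed."


def confirm_unsafe(
    node_count: int,
    threshold: int = 100,
    read_line: Callable[[str], str] = input,
) -> bool:
    """Small operations pass; large ones require the exact phrase."""
    if node_count < threshold:
        return True
    typed = read_line(
        f'To delete {node_count} nodes, type exactly: "{CONFIRMATION_PHRASE}"\n> '
    )
    # Exact match on purpose: a reflexive "y" or Enter must not get through.
    return typed == CONFIRMATION_PHRASE
```

This doesn't block the "take them all offline now" scenario; it just makes sure nobody gets there by muscle memory.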

From a software development perspective, it makes sense to reuse S3 and rely on it internally if you need object storage, but from an ops perspective, it means that S3 is now a single point of failure and that SES's reliability will always be capped by S3's reliability. From a customer perspective, the hard dependency between SES and S3 is not obvious and is disappointing.

The whole internet was talking about S3 when the AWS status dashboard did not show any outage, but very few people mentioned other services such as SES. Next time we encounter errors with SES, should we check for hints of S3 outage before everything else? Should we also check for EC2 outage?

Really pleased to see this. It's good to see an organisation that's being transparent (and maybe giving us a little peek under the hood of how S3 is architected), and most importantly they seem quite humbled.

It would be easy for an arrogant organisation to fire or negatively impact the person that made the mistake. I hope Amazon don't fall into that trap and focus instead on learning from what happened, closing the book and moving on.

TLDR; Someone on the team ran a command by mistake that took everything down. Good, detailed description. It happens. Out of all of Amazon's offerings, I still love S3 the most.

"It happens" is the only reasonable takeaway you can get from a postmortem like this. My worry is that people read it and go "I am aghast that such a command can be run!" without knowing that little commands like that are run numerous times a day without incident.

The only thing I read in there and go "hmmm" is that it took quite that long for the S3 service to recover, and that the status page wasn't hosted on something that doesn't have an S3 dependency. That's just a plain "doh" moment :)

People need to realize when they go to the cloud it's not that 'it happens', it's that it will happen, and you have no ability to do anything about it. Fact of life and risk management.

Cleversafe is much cheaper, higher quality, more flexible, etc.

"I did."

That was CEO Robert Allen's response when the AT&T network collapsed [1] on January 15, 1990

He was asked who made the mistake.

I can't imagine any CEO nowadays making a similar statement.

[1] http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collaps...

Oh, that interview question. “Tell me about something you broke in your last job"

It's my favorite. =D

" we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected"

> Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

If I had to guess who could prevent mistakes like this from propagating, it would be AWS. It points to just how easy it is to make these errors. I am sure that the SRE who made this mistake is amazing and competent and just had one bad moment.

While I hope that AWS would be as understanding as Gitlab, I doubt the outcome is the same.

I keep being reminded of something I read recently that made me feel uneasy about Google's Cloud Spanner [1]:

> the most important one is that Spanner runs on Google’s private network. Unlike most wide-area networks, and especially the public internet, Google controls the entire network and thus can ensure redundancy of hardware and paths, and can also control upgrades and operations in general. Fibers will still be cut, and equipment will fail, but the overall system remains quite robust. It also took years of operational improvements to get to this point. For much of the last decade, Google has improved its redundancy, its fault containment and, above all, its processes for evolution. We found that the network contributed less than 10% of Spanner’s already rare outages.

But when it fails it's going to be epic!

[1] https://cloudplatform.googleblog.com/2017/02/inside-Cloud-Sp...

Kinda reassuring to hear how everything is hacked together with shoestrings - even at Amazon.

"People make mistakes all the time...the problem was that our systems that were designed to recognize and correct human error failed us." [1]

[1] http://articles.latimes.com/1999/oct/01/news/mn-17288

This reminds me of Asimov's characteristically tiny story "Fault-Intolerant" https://unotices.com/book.php?id=38686&page=15 (You can ignore the story at the top about Feghoot, the real story is below.)

tl;dr: Engineer fat-fingered a command and shut everything down. Booting it back up took a long time. Then the backlog was huge, so getting back to normal took even longer. We made the command safer, and are gonna make stuff boot faster. Finally, we couldn’t report any of this on the service status dashboard, because we’re idiots, and the dashboard runs on AWS.

This is a bit off topic. The use of the word "playbook" suggests to me that they use Ansible to help manage S3. I wonder if that is the case, or if it's just internal lingo that means "a script". Unless there is some other configuration management system that uses the word playbook that I'm not aware of.

"playbook" is a relatively common term for "documented step-by-step procedure for specific tasks". Effectively, a script with #!/bin/human at the top.

It's a document with step-by-step instructions for the operator to follow.

> Removing a significant portion of the capacity caused each of these systems to require a full restart.

I'd be interested to understand why a cold restart was needed in the first place. That seems like kind of a big deal. I can understand many reasons why it might be necessary, but that seems like one of the issues that's important to address.

I'm surprised how transparent this is; I often find Amazon a bit opaque when dealing with issues.

I've found them to be very opaque in most contexts, but for major outages (which have been rare), they do have a history of solid postmortems.

They have no choice in this situation.

CEO's all over the world just realized that they can't only depend on S3, and they might have to double up on their infrastructure and have a parallel env. on Azure or Google as well.

I wouldn't want to be the person who wrote the wrong command! Sheesh.

Meh, that's a process problem, not a people problem. Playbooks that have you retype commands with complex options, with no confirmation, etc, are inviting that sort of thing.

They just spent millions of dollars training that person to never make this mistake again, should definitely keep them around.

The wording of the article implies Amazon is shifting the blame entirely on the individual who typo'd: they indemnify themselves with "an authorized S3 team member using an established playbook..." ("don't blame us, our process is perfect!")

There are process-fixes for this, such as requiring a two-person rule when at a production shell and modifying tooling to detect potentially unintentional commands (e.g. a SQL UPDATE without a WHERE) - but given what I know about Amazon's internal practices (i.e. the brutality) it wouldn't surprise me if they did terminate the unfortunate operator - not because they want to, but because AWS simply has too many large-scale customers who would demand immediate action like that.
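The "SQL UPDATE without a WHERE" check is easy to prototype. A deliberately naive sketch: it only looks for the keywords with regular expressions, so it is a tripwire rather than a parser (a `WHERE` inside a string literal or a comment would fool it, and it only inspects a single statement).

```python
import re

# Statements that modify or delete rows in bulk when left unfiltered.
_RISKY = re.compile(r"^\s*(UPDATE|DELETE)\b", re.IGNORECASE)
_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def is_unbounded_write(statement: str) -> bool:
    """True for an UPDATE or DELETE that contains no WHERE keyword at all."""
    return bool(_RISKY.match(statement)) and not _WHERE.search(statement)
```

A shell wrapper could call this before sending the statement and demand an explicit confirmation when it returns `True`; MySQL's client has a comparable built-in mode (`--safe-updates`), if I remember right.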

At these scales it's the fault of the system, not the individual, so hopefully they don't come down hard on them.

Agreed. They also seemed to acknowledge that in the post, as they mentioned improving the tool to not allow such destructive options.

You mean like range checking the input parameters to the command? =)

    SHUT DOWN S3?  ARE YOU SURE? (y/N) :

More like:

I brought down our production system after a typo in a command once... the dev team took the blame for allowing an illegal parameter to bring down the system.

Every mistake was used as a learning opportunity to ensure that the same and similar mistakes can't be repeated.

On a more serious note, if you've never done something like this, you haven't had enough interesting projects.

I've had a decent career and I still managed to:

* re-deploy the current application version in all our data centers, instead of the new version, in a period when our deployment wasn't a 0-downtime one

* rename all the Jenkins jobs on the server to the same name, thus deleting hundreds of Jenkins jobs in one fell swoop

"Let him who is without sin cast the first stone" and all that :)

As I keep saying, there are people who screw up big time and people who are too scared to touch the system. I managed to gun down three production clusters by deploying NTP by accident along with a tiny change. Kerpow, 12 minutes of downtime, full network outage due to DHCP and such. Great fun.

There's two things I've come to believe in my IT career regarding operations:

1. No organization anywhere is a paragon of excellence, and everyone can benefit from improvement.

2. Every organization is made up of humans just like you. With all that entails.

Some things which seem blatantly obvious after the fact are easily overlooked when the pressure to deliver is high and other issues are taking precedence.

I completely agree with your statement. In fact, when I do interviews, one of my favorite and most insightful questions to ask is, basically, "tell me about a time you screwed the pooch." If they don't have a story and they worked in ops, then it can suggest they didn't really do much. The really sharp ones I've interviewed have a good story or two (and can tell it in excruciating detail. =)

* At a prior company I once tried appending to the list of NFS exports, but dropped the "no-root-squash" option, and instantly denied write permissions to our entire VMware farm. You can imagine what then happened to all of the VMs for this mission critical customer. =P

Reminds me of the story about Bob Hoover.[1]

He took off in his piston engine plane, only to lose power during the climb and was forced to make a crash landing. It turned out the airplane was fueled with jet fuel instead of regular gasoline (the ground crewman mistakenly thought the plane was a turbo prop).

Instead of yelling at or firing the ground crewman, Hoover had this to say[2]:

    "There isn't a man alive who hasn't made a mistake.
    But I'm positive you'll never make this mistake again.
    That's why I want to make sure that you’re the only one
    to refuel my plane tomorrow. I won't let anyone else
    on the field touch it."
[1] https://www.aopa.org/news-and-media/all-news/2014/july/pilot...

[2] http://www.squawkpoint.com/2014/01/criticism/

Something that wasn't addressed -- there seems to be an architectural issue with ELB where ELBs with S3 access logs enabled had instances fail ELB health checks, presumably while the S3 API was returning 5XX. My load balancers in us-east-1 without access logs enabled were fine throughout this event. Has there been any word on this?

I think it comes down to how important your ELB logs are -- if they are important enough that you don't want to allow traffic without logs (i.e. if you're using them for some sort of auditing/compliance), then failing when it can't write the logs seems like the right choice.

We have geo-distributed systems. Load balancing and automatic failover. We agonize over edge cases that might cause issues. We build robust systems.

At the end of the day reliability -- a lot like security -- is most affected by the human factor.

Seems like something like Chaos Monkey should have been able to predict and mitigate an issue like this. I'm actually curious whether anyone uses it at all. Does anyone here at a large company (over 500 employees) have it deployed?

> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.

This is the bit that'd worry me most; you'd think they'd be testing this.

A complete restart of the index subsystem would require downtime. Note: they are not saying those servers have never been restarted - it's highly likely they get restarted regularly. But, a complete restart of the index subsystem implies that you shut everything down first and restart it all at once, which is what was forced to happen two days ago.

Same as that db1 / db2 for GitLab: naming things is pretty important (e.g. production / staging, production-us-east-1-db-560, etc.)

You shouldn't have to make a claim - this should be handled by them automatically.

This makes me want to write a program that would ask users to confirm commands if it thinks they are running a known playbook and deviating from it. Does anyone know if a tool like that exists?
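I don't know of an existing tool, but the core check is small enough to sketch. Assume the playbook is encoded as one regular expression per command name, describing what the playbook allows; the `remove-capacity` command and its flags below are entirely made up.

```python
import re
from typing import Dict

# Hypothetical playbook: command name -> regex for the full allowed command.
# Here the playbook only ever removes between 1 and 10 billing hosts.
PLAYBOOK: Dict[str, str] = {
    "remove-capacity": r"remove-capacity --subsystem billing --count ([1-9]|10)",
}


def needs_confirmation(command: str, playbook: Dict[str, str] = PLAYBOOK) -> bool:
    """True when a playbook command is being run in a non-playbook way."""
    tokens = command.split()
    if not tokens or tokens[0] not in playbook:
        return False  # not a playbook command; out of scope for this check
    normalized = " ".join(tokens)  # collapse stray whitespace before matching
    return re.fullmatch(playbook[tokens[0]], normalized) is None
```

With this, `remove-capacity --subsystem billing --count 4` goes straight through, while `--count 400` (or a different subsystem) gets flagged for a second look. The hard part in practice is keeping the patterns in sync with the written playbook, not the matching.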

Remember folks, automate your systems but never forget to add sanity checks.

I find this refreshingly candid; human, even, for AWS.

I'm just going to call this "PEBKAC at scale"

Sounds so Chernobyl.

What's missing is addressing the problems with their status page system, and how we all had to use Hacker News and other sources to confirm that US East was borked.

No, this is addressed:

>We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.

Which is fine until those regions go down. A status page, in my mind, should have a fallback on a completely different service provider.

"From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions."

For the many of us who have built businesses dependent on S3, is anyone else surprised at a few assumptions embedded here?

* "authorized S3 team member" -- how did this team member acquire these elevated privs?

* Running playbooks is done by one member without a second set of eyes or approval?

* "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years"

The good news:

* "The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately."

The truly embarrassing thing, which everyone has known about for years, is the status page:

* "we were unable to update the individual services’ status on the AWS Service Health Dashboard "

When there is a wildly-popular Chrome plugin to fix your page ("Real AWS Status") you would think a company as responsive as AWS would have fixed this years ago.

1) An authorized, not unauthorized. They sound almost identical, English language is... Yea :/

2) If it's a playbook for something with minimal intended impact sure. The issue is that the tooling had larger capabilities than should be.

3) Yes that seems like a major, major problem.

>Unauthorized S3 team member
