
Summary of the Amazon S3 Service Disruption - oscarwao
https://aws.amazon.com/message/41926/
======
ajross
> _At 9:37AM PST, an authorized S3 team member using an established playbook
> executed a command which was intended to remove a small number of servers
> for one of the S3 subsystems that is used by the S3 billing process.
> Unfortunately, one of the inputs to the command was entered incorrectly and
> a larger set of servers was removed than intended._

It remains amazing to me that even with all the layers of automation, the root
cause of most serious deployment problems remains some variant of a
fat-fingered user.

~~~
mabbo
Look at the language used though. This is saying very loudly "Look, this isn't
the engineer's fault here". It's one thing I miss about Amazon's culture: not
blaming people when systems fail.

The follow-up doesn't bullshit with "extra training to make sure no one does
this again", it says (effectively) "we're going to make this impossible to
happen again, even if someone makes a mistake".

~~~
Bartweiss
Any time I see "we're going to train everyone better" or "we're going to fire
the guy who did it", all I can read is "this will happen again". You can't
actually solve the problem of user error with training, and it's good to see
Amazon not playing that game.

~~~
jacquesm
What bothered me about running TrueTech is that customers would sometimes
demand repercussions against employees for making mistakes.

Enter Frans Plugge. Whenever a customer would get into that mode we'd fire
Frans. This was easy, simply because he didn't exist in the first place (his
name was pulled from a skit by two Dutch comedians, bonus points if you know
who and which skit).

This usually caused the customer to backtrack, insisting he/she never meant for
anybody to get fired...

It was a funny solution and we got away with it for years, for one because it
was pretty rare to get customers that mad to begin with and for another
because Frans never wrote any blog posts about it ;)

But I was always waiting for that call from the labor board asking why we
fired someone for whom there was no record of employment.

~~~
justin_oaks
Is it unreasonable for me to think that company owners should have the spine
to say, "We take the decision to fire someone very seriously. We'll take your
comments under consideration, but we retain sole discretion over such
decisions"?

It irks me that businesses fire people because of pressure from clients or
social media. But having never been the boss, I may be missing something.

~~~
patio11
One facet of Japanese management culture worth liking: if a customer wants to
rake someone over the coals, you offer up management, not employees.

Internal repercussions notwithstanding, externally the company is a united
front. Mistakes cannot be written off to luck, accident, or happenstance,
because the world includes luck, accidents, and happenstance, so any
user-visible error is ipso facto a failure of management.

~~~
mml
Apparently this is (or was) a job in Japan. Companies would hire what amounts
to an actor to get screamed at by the angry customer and pretend to get fired
on the spot. Rinse, repeat whenever such appeasement is required.

~~~
marklyon
I know one person who does this for real estate developers. He gets involved
in contentious projects early on, goes to community meetings, offers testimony
before the city council, etc. When construction gets going and people
inevitably get pissed about some aspect of the project, he gets publicly fired
to deflect the blame while the project moves on. Have seen it happen on three
different projects in two cities now and, somehow, nobody catches on.

~~~
codeisawesome
I don't know how to describe this in a single word or phrase, but I think of
it as a "genius problem" rather than a genius solution. The problem itself is
impressive and rich in layers of human nature, local culture, etc., but once
you have such a problem, any average person could come up with a similar
solution, because it is obvious.

It's still mind blowing and very amusing that this is a thing in our world!

------
conorh
This part is also interesting:

 _> While this is an operation that we have relied on to maintain our systems
since the launch of S3, we have not completely restarted the index subsystem
or the placement subsystem in our larger regions for many years._

These sorts of things make me understand why the Netflix "Chaos Gorilla" style
of operating is so important. As they say in this post:

 _> We build our systems with the assumption that things will occasionally
fail_

Failure at every level has to be simulated pretty often to understand how to
handle it, and it is a really difficult problem to solve well.

~~~
urda
I think you meant "Chaos Monkey" [1].

[1]
[https://github.com/Netflix/chaosmonkey](https://github.com/Netflix/chaosmonkey)

~~~
tantalor
No, _Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an
entire Amazon availability zone_

[http://techblog.netflix.com/2011/07/netflix-simian-
army.html](http://techblog.netflix.com/2011/07/netflix-simian-army.html)

------
jph00
> _From the beginning of this event until 11:37AM PST, we were unable to
> update the individual services’ status on the AWS Service Health Dashboard
> (SHD) because of a dependency the SHD administration console has on Amazon
> S3._

Ensuring that your status dashboard doesn't depend on the thing it's
monitoring is probably the first thing you should think about when designing
your status system. This doesn't fill me with confidence about how the rest of
the system is designed, frankly...

~~~
kevinchen
I'm curious as to why their fix was to host the Service Health Dashboard on
more AWS regions. It seems like the responsible thing to do is to host it
entirely on a competitor's service. That way, it's very simple to know that
the status page will work no matter what happens to you.

~~~
pmoriarty
If they did host their status page on a competitor's service, then they'd be
reliant on that service, which might backfire if the competitor's service goes
down while Amazon's own systems stay up.

What they really need is failover capability, which can fire up the status
page on a competitor's service (or maybe on a completely separate disaster
recovery site owned by Amazon) in case Amazon's own services go down.

I'm sure Amazon's architects and engineers are more than capable of designing
and implementing such a robust system and recognizing its importance. So it
puzzles me as to why it wasn't done.

~~~
angry-hacker
You can use several hosts and have two subdomains, so if one is not
responding, engineers and managers know there is a second status page. Heck,
have two different domains for them as well in case of DNS issues:
amzstatus1.com and amzstatus2.com. Then you're not dependent on the Amazon
domain anymore either.

------
losteverything
"I did."

That was CEO Robert Allen's response when the AT&T network collapsed [1] on
January 15, 1990.

He was asked who made the mistake.

I can't imagine any CEO nowadays making a similar statement.

[1]
[http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collaps...](http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse.html)

~~~
mcherm
Fascinating! Do you know of any source that documents the claim that Robert
Allen made this statement?

~~~
losteverything
I'll try. It was a dark day.

We all watched the news and I recall him saying that. I don't remember the
specific quote, but it was something like "you can consider that I did." I
think he was asked what would happen to the person who caused it and who that
person was.

Everyone knew right away this had to be human error. Right away. Switches
simply had too much redundancy.

It was big then and not sure if I can locate a video.

~~~
pcthrowaway
Not sure if there's a video, but I found an article with the exact quote:

> As far as our customers are concerned, I did it.

[http://www.upi.com/Archives/1990/01/16/ATT-pinpoints-
cause-o...](http://www.upi.com/Archives/1990/01/16/ATT-pinpoints-cause-of-
long-distance-line-crash/2432632466000/)

~~~
losteverything
Thanks. I got the gist at least.

I admired him and his answer at the time. The culture was quite professional
and blame really never existed.

------
savanaly
Not as interesting an explanation as I was hoping for. Someone accidentally
typed "delete 100 nodes" instead of "delete 10 nodes" or something.

It sounds like the weakness in the process is that the tool they were using
permitted destructive operations like that. The passage that stuck out to me:
"in this instance, the tool used allowed too much capacity to be removed too
quickly. We have modified this tool to remove capacity more slowly and added
safeguards to prevent capacity from being removed when it will take any
subsystem below its minimum required capacity level."

At the organizational level, I guess it wasn't rated as all that likely that
someone would try to remove capacity that would take a subsystem below its
minimum. Building in a safeguard now makes sense as this new data point
probably indicates that the likelihood of accidental deletion is higher than
they had estimated.

~~~
pinko
I always wonder about unintended consequences of this sort of thing. Like
someday there will be a worm about to rampage through their servers and
someone says, "take them all offline now!" and the answer is, "we can't
because of the throttle safeguard we put in place after incident XYZ, it will
be about 17 hours..."

~~~
savanaly
By safeguard I meant (and I think Amazon means too) an extra step the user has
to take before the action goes through, so they don't do it by accident. Not
something that prevents it entirely. Like how an MMO makes you type the
character's name into a pop-up box before you can delete the character. That's
far outside the realm of the usual user interface, but it's there so that if
you are just trying to edit a character it's impossible to delete it by
accident. An analogous system for Amazon that would have prevented this
outage: delete 10 nodes, OK. Delete 100 nodes, a box pops up saying 'To delete
this many nodes you must type the following into a message box: "I want to
take down a dangerously large amount of nodes."'
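
Roughly something like this, as a minimal sketch (the threshold, phrase, and
tool name are made up for illustration, not anything from the postmortem):

    import sys

    DANGER_THRESHOLD = 20  # hypothetical: above this, demand a typed confirmation
    CONFIRM_PHRASE = "I want to take down a dangerously large amount of nodes."

    def confirm_removal(count):
        """Require an exact typed phrase before unusually large removals."""
        if count <= DANGER_THRESHOLD:
            return True
        print("You are about to remove {} nodes.".format(count))
        typed = input('Type exactly: "{}"\n> '.format(CONFIRM_PHRASE))
        return typed.strip() == CONFIRM_PHRASE

    if __name__ == "__main__":
        n = int(sys.argv[1])
        if not confirm_removal(n):
            sys.exit("Aborted: confirmation phrase did not match.")
        print("Removing {} nodes...".format(n))  # placeholder for the real removal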

~~~
hinkley
It occurs to me that having to type the English version of the numbers would
probably work in this scenario.

    s3-shutdown -c "one hundred fifty"

But something simpler, like an --emergency flag, or the more whimsical
--shutitdownshutitalldown, would also do the job.
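
A rough sketch of how that check might work, with a small word-to-number
parser so the spelled-out count has to match the number the tool is actually
about to act on (the CLI above is hypothetical, and so is this):

    UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
             "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
             "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
             "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
             "nineteen": 19}
    TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
            "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

    def words_to_int(phrase):
        """Parse a spelled-out count like 'one hundred fifty' (covers 0-999)."""
        value = 0
        for word in phrase.lower().replace("-", " ").split():
            if word in UNITS:
                value += UNITS[word]
            elif word in TENS:
                value += TENS[word]
            elif word == "hundred":
                value *= 100
            else:
                raise ValueError("unrecognized word: " + word)
        return value

    # The tool would only proceed if the spelled-out count matches the number
    # of servers it is about to remove:
    assert words_to_int("one hundred fifty") == 150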

~~~
rickcnagy
I think the biggest problem with flags like --emergency is if they end up in
daily use, such as git push --force. Then they are both sudo-level AND used
without a lot of thought.

~~~
hinkley
I was in fact thinking of the --force problem which is why I went another way.

------
westernmostcoy
Take a moment to look at the construction of this report.

There is no easily readable timeline. It is not discoverable from anywhere
outside of social media or directly searching for it. As far as I know,
customers were not emailed about this - I certainly wasn't.

You're an important business, AWS. Burying outage retrospectives and live
service health data is what I expect from a much smaller shop, not the leader
in cloud computing. We should all demand better.

~~~
perlgeek
Also notably missing is the "we will automatically refund all affected
customers" line that we'd expect from somebody who wants to provide excellent
service.

A graphical illustration of the service dependencies they were talking about
would have been nice as well.

~~~
CephalopodMD
I mean, it's in the SLA that they have to refund 10% for the billing period
IIRC.

~~~
el_benhameen
If you request it and provide evidence that they find compelling.

 _To receive a Service Credit, you must submit a claim by opening a case in
the AWS Support Center. To be eligible, the credit request must be received by
us by the end of the second billing cycle after which the incident occurred
and must include:

the words “SLA Credit Request” in the subject line; the dates and times of
each incident of non-zero Error Rates that you are claiming; and your request
logs that document the errors and corroborate your claimed outage (any
confidential or sensitive information in these logs should be removed or
replaced with asterisks). If the Monthly Uptime Percentage applicable to the
month of such request is confirmed by us and is less than the applicable
Service Commitment, then we will issue the Service Credit to you within one
billing cycle following the month in which your request is confirmed by us.
Your failure to provide the request and other information as required above
will disqualify you from receiving a Service Credit."_

~~~
dsl
> provide evidence that they find compelling

A link to their tweet about the status page not working because the building
was burning down around it seems compelling.

------
seanwilson
> At 9:37AM PST, an authorized S3 team member using an established playbook
> executed a command which was intended to remove a small number of servers
> for one of the S3 subsystems that is used by the S3 billing process.
> Unfortunately, one of the inputs to the command was entered incorrectly and
> a larger set of servers was removed than intended.

I find that making changes on production when you think you're on staging is a
big cause of errors like this. One of the best things I ever did at one job
was to change the deployment script so that when you deployed you would get a
prompt saying "Are you sure you want to deploy to production? Type
'production' to confirm". This stopped several "oh my god, no!" situations
where you repeated previous commands without thinking. For cases where you
need to use SSH as well (best avoided but not always practical), it helps to
use different colours, login banners and prompts for the terminals.
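
A minimal sketch of that kind of prompt (the environment names and the deploy
step are placeholders, not from any particular deploy tool):

    import sys

    def deploy(environment):
        """Make the operator re-type 'production' before a prod deploy proceeds."""
        if environment == "production":
            answer = input("Are you sure you want to deploy to production? "
                           "Type 'production' to confirm: ")
            if answer.strip() != "production":
                sys.exit("Aborted: confirmation did not match.")
        print("Deploying to {}...".format(environment))  # real deploy steps go here

    if __name__ == "__main__":
        deploy(sys.argv[1] if len(sys.argv) > 1 else "staging")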

~~~
fixermark
"All teams have a test server. Some teams are fortunate enough to also have a
separate production server."

~~~
seanwilson
> "All teams have a test server. Some teams are fortunate enough to also have
> a separate production server."

That's what I meant about the SSH comment. Not every team has the automation
or infrastructure that allows you to avoid SSH.

------
dsr_
" we have not completely restarted the index subsystem or the placement
subsystem in our larger regions for many years. S3 has experienced massive
growth over the last several years and the process of restarting these
services and running the necessary safety checks to validate the integrity of
the metadata took longer than expected"

This is analogous to "we needed to fsck, and nobody realized how long that
would take".

~~~
Rapzid
I feel they are lucky it was back up so fast.

------
magd
Oh, that interview question: "Tell me about something you broke in your last
job."

~~~
rconti
Paged for "$HOSTNAME is down!"

Usual checks to access $HOSTNAME failed

Rushed to office at 6am before important process was about to run that needed
that host.

Plugged in keyboard+monitor, dead screen, nothing.

Physically power-cycled server.

Stood in front of monitor+keyboard. It occurred to me it was taking longer
than expected to show POST screen. About that time, I got a page saying
$ACTUALHOSTNAME is down.

Walk around to the back of the racks. The monitor cable had come detached from
the cable extender that I plugged into the server. I had never plugged the
monitor in at all, just the extension.

The server wasn't down in the first place, it just lost a virtual interface,
which I was paged for, and stupidly tested that virtual interface instead of
the REAL name/IP.

And then I raced to the office just so that _I_ could cause an outage.

------
orthecreedence
TLDR; Someone on the team ran a command by mistake that took everything down.
Good, detailed description. It happens. Out of all of Amazon's offerings, I
still love S3 the most.

~~~
cflewis
"It happens" is the only reasonable takeaway you can get from a postmortem
like this. My worry is that people read it and go "I am aghast that such a
command can be run!" without knowing that little commands like that are run
numerous times a day without incident.

The only thing I read in there that makes me go "hmmm" is that it took quite
that long for the S3 service to recover, and that the status page wasn't
hosted somewhere without an S3 dependency. That's just a plain "doh" moment :)

~~~
ProAm
People need to realize when they go to the cloud it's not that 'it happens',
it's that it will happen, and you have no ability to do anything about it.
Fact of life and risk management.

~~~
fixermark
... and it's a different risk from self-hosting, but self-hosting provides all
sorts of similar issues (such as when you do this to yourself, the cost is now
coming out of your pocket, not Amazon's, to employ software engineers to
harden your scripts against making the same mistake twice).

~~~
thr0waway1239
Not to mention, Amazon is catching the long tail of cloud failure like Google
is catching the long tail of search keywords. They can now say with a somewhat
straight face - "You know all those scripts you run to keep everything up? We
have figured out many, many more possible ways for them to fail than you
probably ever will, and we have added more layers of safeguards than you can
even imagine."

------
idlewords
"we have changed the SHD administration console to run across multiple AWS
regions."

Dear Amazon: please lease a $25/month dedicated server to host your status
page on.

~~~
TeMPOraL
If big cloud companies hosted their status pages on each other's servers, that
would be... actually pretty cool.

------
mleonhard
AWS partitions its services into isolated regions. This is great for reducing
blast radius. Unfortunately, us-east-1 has many times more load than any other
region. This means that scaling problems hit us-east-1 before any other
region, and affect the largest slice of customers.

The lesson is that partitioning your service into isolated regions is not
enough. You need to partition your load evenly, too. I can think of several
ways to accomplish this:

1. Adjust pricing to incentivize customers to move load away from overloaded
regions. Amazon has historically done the opposite of this by offering cheaper
prices in us-east-1.

2. Calculate a good default region for each customer and show that in all
documentation, in the AWS console, and in code examples.

3. Provide tools to help customers choose the right region for their service.
Example: [http://www.cloudping.info/](http://www.cloudping.info/) (shameless
plug).

4. Split the large regions into isolated partitions and allocate customers
evenly across them. For example, split us-east-1 into 10 different isolated
partitions. Each customer is assigned to a particular partition when they
create their account. When they use services, they will use the instances of
the services from their assigned partition.
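
A minimal sketch of option 4, assigning each customer to a fixed cell with a
stable hash so that one cell's outage only touches a fraction of customers
(the partition count and cell naming here are hypothetical):

    import hashlib

    NUM_PARTITIONS = 10  # hypothetical: split a big region into 10 isolated cells

    def partition_for(customer_id):
        """Deterministically map a customer to one cell; their traffic always lands there."""
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        return "us-east-1-cell-{}".format(int(digest, 16) % NUM_PARTITIONS)

    # Always returns the same cell name for the same customer:
    print(partition_for("123456789012"))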

------
Dangeranger
So this is the second high profile outage in the last month caused by a simple
command line mistake.

> Unfortunately, one of the inputs to the command was entered incorrectly and
> a larger set of servers was removed than intended.

If I had to guess which organization could prevent mistakes like this from
propagating, it would have been AWS. It points to just how easy it is to make
these errors. I am sure that the SRE who made this mistake is amazing and
competent and just had one bad moment.

While I hope that AWS would be as understanding as Gitlab, I doubt the outcome
is the same.

~~~
imsofuture
Amazon has the wherewithal to not freaking publicly name their actual, human
employee, so I'd imagine their culture around outages is probably a lot more
healthy.

~~~
Dangeranger
Well to be fair, he named himself within his notes and did not object to the
public nature of the disclosure. I agree with your sentiment though that names
should not be included within postmortems in the general case.

------
rkuykendall-com
tl;dr: Engineer fat-fingered a command and shut everything down. Booting it
back up took a long time. Then the backlog was huge, so getting back to normal
took even longer. We made the command safer, and are gonna make stuff boot
faster. Finally, we couldn’t report any of this on the service status
dashboard, because we’re idiots, and the dashboard runs on AWS.

~~~
fixermark
Everything except the "We're idiots" part I'd agree with.

Self-hosting your diagnostic tools is an easy mistake to make, and I've seen
both startups and large, multi-decade-experienced companies make it.

~~~
rkuykendall-com
Of course. Meant it as more of an "I just spent 10 minutes searching for my
car keys while holding them, because I'm an idiot." No disrespect to the
engineers.

------
all_usernames
Overall, it's pretty amazing that the recovery was as fast as it was. Given
the throughput of S3 API calls you can imagine the kind of capacity that's
needed to do a full stop followed by a full start. Cold-starting a service
when it has heavy traffic immediately pouring into it can be a nightmare.

It'd be very interesting to know what kind of tech they use at AWS to throttle
or do circuit breaking to allow back-end services like the indexer to come up
in a manageable way.

------
hyperanthony
Something that wasn't addressed -- there seems to be an architectural issue
with ELB where ELBs with S3 access logs enabled had instances fail ELB health
checks, presumably while the S3 API was returning 5XX. My load balancers in
us-east-1 without access logs enabled were fine throughout this event. Has
there been any word on this?

~~~
Johnny555
I think it comes down to how important your ELB logs are -- if they are
important enough that you don't want to allow traffic without logs (i.e. if
you're using them for some sort of auditing/compliance), then failing when it
can't write the logs seems like the right choice.

~~~
hyperanthony
Thanks, that is a fair perspective. In our case we're using ELB logs as a
redundant trace and it isn't critical that our traffic stops if the access
logs fail. It would be nice if this behavior became a toggle in ELB settings,
but we think we can set something up to disable access logs programmatically
if we start seeing S3 issues.

~~~
ctrlrsf
Good luck with this. We tried to make changes yesterday to mitigate the
impact, but the AWS console was also affected. We were hesitant to make API
calls for the changes since we weren't sure they would complete successfully,
given all the services we found actually depended on S3 internally.

------
djhworld
Really pleased to see this, it's good to see an organisation that's being
transparent (and maybe given us a little peek under the hood of how S3 is
architected) and most importantly they seem quite humbled.

It would be easy for an arrogant organisation to fire or negatively impact the
person that made the mistake. I hope Amazon don't fall into that trap and
focus instead on learning from what happened, closing the book and moving on.

------
mmanfrin
There are quite a few comments here ignoring the clarity that hindsight is
giving them. Apparently the devops engineers commenting here have never fucked
up.

~~~
scott_karana
On the contrary: I feel like Amazon is taking some flak _because_ everyone
here has messed up before, and is surprised that engineers (seemingly lacking
failure experience) were able to do what they did.

I wouldn't task a junior sysadmin with a server deletion, would you? Nor could
I ever consider someone _without_ a fuckup a senior ;)

------
certifiedloud
This is a bit off topic. The use of the word "playbook" suggests to me that
they use Ansible to help manage S3. I wonder if that is the case, or if it's
just internal lingo that means "a script". Unless there is some other
configuration management system that uses the word playbook that I'm not aware
of.

~~~
JoshTriplett
"playbook" is a relatively common term for "documented step-by-step procedure
for specific tasks". Effectively, a script with #!/bin/human at the top.

~~~
p4lindromica
Also known as a runbook

~~~
hehheh
Or if you want to get real old school a checklist.

------
erikbye
What does everyone use S3 for?

I'm genuinely curious. My experiments with it have left me disappointed with
its performance, so I'm just not sure what I could use it for. Store massive
amounts of data that is infrequently accessed? Well, unfortunately the upload
speed I got to the standard tier was so abysmal it would take too much time to
move the data there, and I suspect the inverse would be pretty bad as well.

~~~
castis
One scenario: if you run a website that has a lot of static content (multiple
GB of images, CSS, JS, etc.) and you don't want your HTTP server to be
responsible for serving that content, then you give it all to S3 and let them
serve it for you.

~~~
erikbye
What about the performance of serving it? Sounds like I would need to cache it
myself, anyway.

~~~
ckozlowski
Performance out of S3 is generally really good. However, if you're looking to,
say, serve up a global website and your content is in a single S3 region, then
you can leverage the CloudFront CDN to serve up those objects. CloudFront
integrates seamlessly with S3, and you don't pay transfer charges between
CloudFront and S3.

------
i336_
> (...) [W]e have not completely restarted the index subsystem or the
> placement subsystem in our larger regions for many years. S3 has experienced
> massive growth over the last several years and the process of restarting
> these services and running the necessary safety checks to validate the
> integrity of the metadata took longer than expected.

All those tweets saying "turn it off and back on again"?

"We accidentally turned it off, but it hasn't been turned it off for so long
it took us hours to figure out how to turn it back on."

Poorly-presented jokes aside, this is rather concerning. The indexer and
placement systems are SPOFs!! I mean, I'd _presume_ these subsystems had
ultra-low-latency hot failover, but this says _they never restarted_ , and I
wonder if AWS didn't simply invest a ton of magic pixie dust in making
Absolutely Totally Sure™ the subsystems physically, literally never crashed in
years. Impressive engineering but also very scary.

At least they've restarted it now.

And I'm guessing the current hires now know a _lot_ about the indexer and
placer, which won't do any harm to the sharding effort (I presume this'll be
being sharded quicksmart).

I wonder if all the approval guys just photocopied their signatures onto a run
of blank forms, heheh.

~~~
richardwhiuk
I don't think you understand the architecture of the system if you are
describing the indexer as a SPOF.

The system is a collection of shards. If you replicate it to create a second
shard, then you'll just have as large a system, which is still a single point
of failure.

The index, by necessity, has to be able to answer the question 'this object
exists' or 'this object doesn't exist' - so it needs to have consensus.

~~~
i336_
Hrm.

My speculative presumption was going off the sole datapoint of " _we have not
completely restarted the index subsystem or the placement subsystem in our
larger regions for many years_ ". I'm not quite sure how to interpret "
_restart_ " in this context, mostly due to lack of exposure or experience.

The report also says " _Unfortunately, one of the inputs to the command was
entered incorrectly and a larger set of servers was removed than intended. The
servers that were inadvertently removed supported two other S3 subsystems._ "
So you're right, it looks like multiple servers were supporting these systems,
which does make sense (especially considering the load they would have seen).
Okay.

I guess I didn't quite think through the load requirements and thought these
were single machines - which is certainly ludicrous thinking :) - and that's
where I got the SPOF reasoning from.

You're very right though, these consensus systems must be built as bottlenecks
in order to see everything.

And there aren't really any alternatives: "build extra indexers and placement
systems!" just gives you "but what if _all_ of them get taken offline?" and
"it can't leave the datacenter, it sees 100GB/s of throughput" (number taken
out of thin air).

Good points.

------
cnorthwood
I'm surprised how transparent this is; I often find Amazon a bit opaque when
dealing with issues.

~~~
snewman
I've found them to be very opaque in most contexts, but for major outages
(which have been rare), they do have a history of solid postmortems.

~~~
DigitalBison
The public postmortem from the big DynamoDB outage in 2015(?) is a great
example I think:
[https://aws.amazon.com/message/5467D2/](https://aws.amazon.com/message/5467D2/)

------
nissimk
I keep being reminded of something I read recently that made me feel uneasy
about google's cloud spanner [1]:

 _the most important one is that Spanner runs on Google’s private network.
Unlike most wide-area networks, and especially the public internet, Google
controls the entire network and thus can ensure redundancy of hardware and
paths, and can also control upgrades and operations in general. Fibers will
still be cut, and equipment will fail, but the overall system remains quite
robust. It also took years of operational improvements to get to this point.
For much of the last decade, Google has improved its redundancy, its fault
containment and, above all, its processes for evolution. We found that the
network contributed less than 10% of Spanner’s already rare outages._

But when it fails it's going to be epic!

[1] [https://cloudplatform.googleblog.com/2017/02/inside-Cloud-
Sp...](https://cloudplatform.googleblog.com/2017/02/inside-Cloud-Spanner-and-
the-CAP-Theorem.html)

~~~
fixermark
Google takes active steps to confirm that their fallback systems for
mitigating failures work.

[http://queue.acm.org/detail.cfm?id=2371516](http://queue.acm.org/detail.cfm?id=2371516)

------
St-Clock
I am unpleasantly surprised that they do not mention why services that should
be unrelated to S3 such as SES were impacted as well and what they are doing
to reduce such dependencies.

From a software development perspective, it makes sense to reuse S3 and rely
on it internally if you need object storage, but from an ops perspective, it
means that S3 is now a single point of failure and that SES's reliability will
always be capped by S3's reliability. From a customer perspective, the hard
dependency between SES and S3 is not obvious and is disappointing.

The whole internet was talking about S3 when the AWS status dashboard did not
show any outage, but very few people mentioned other services such as SES.
Next time we encounter errors with SES, should we check for hints of S3 outage
before everything else? Should we also check for EC2 outage?

~~~
CWuestefeld
_services that should be unrelated to S3 such as SES were impacted_

I don't think this is particularly surprising. I'd already pretty much assumed
that, e.g., a package of code for a Lambda function would be housed in an S3
bucket somewhere.

What's really surprising to me is how many of those buckets appear to live in
US-EAST-1, and aren't able to keep functioning in a catastrophe by failing
over to a different region.

~~~
kondro
You specify the run region for each of those services and all the other
components get restricted to those regions too.

We're in ap-southeast-2 (Sydney) and none of our services were impacted
yesterday.

------
tifa2up
Do you know if Amazon is giving any refunds/credits for the service outbreak?

~~~
officelineback
I don't think it's automatic. I just helped my former boss with his decision
about whether to go for a refund (he asked me for help drafting a request, but
I reminded him that something like 99.99% of their S3 storage is Standard-IA
backups, so it may not be worth it).

------
ct0
I wouldn't want to be the person who wrote the wrong command! Sheesh.

~~~
Johnny555
I brought down our production system after a typo in a command once... the dev
team took the blame for allowing an illegal parameter to bring down the
system.

~~~
alexvy86
Kudos to the dev team for that, I think most people wouldn't own that kind of
issue

~~~
Johnny555
It was a very well-run engineering department where taking blame was not a
career ending decision. I took full blame for the typo (at 3am trying to
resolve a customer issue), but the dev team accepted full responsibility for
letting it take down the system.

Every mistake was used as a learning opportunity to ensure that the same and
similar mistakes can't be repeated.

------
EdSharkey
It's curious they needed to "remove capacity" to cure a slow billing problem.

Is that code for a "did you try to reboot the system?" kind of
troubleshooting?

It sounds to me like the authorized engineer sent a command to reboot/reimage
a large swath of the S3 infrastructure.

------
sebringj
If Amazon were a guy, he'd be a standup guy. This is a very detailed and
responsible explanation. S3 has revolutionized my businesses and I love that
service to no end. These problems happen very rarely, but I may set up backups
just in case, using an nginx proxy approach at some point; because S3 is so
good, everyone seems to adopt its API, so it's just a matter of a switch
statement. Werner can sweat less. Props.

I would add that it would be awesome if there were a simulation environment,
beyond just a test environment, that simulated outside servers making requests
before a command was allowed to run on production, with something like a robot
making the call. That could mitigate this sort of thing, kind of like TDD on
steroids, if they don't have that already.

------
Exuma
Imagine being THAT guy.......... in that exact moment...... after hitting
enter and realizing what he did. RIP

~~~
usernametbd
I can imagine being that guy in that exact moment. But I can't imagine being
that guy after the event. There would be constant fear and doubt in my mind,
and a constant fear of whether others trust me anymore. I couldn't quit
because that might make me look bad, and I couldn't continue because that
might make me look bad.

~~~
hanspeter
This guy may actually need therapy to not suffer from some light degree of
trauma.

------
lasermike026
Ops and Engineering here.

My guts hurt just reading this.

With big failures it's never just one thing. There is a series of mistakes,
bad choices, and ignorance that leads to a big system-wide failure.

------
spullara
Twitter once had 2 hours of downtime because an operations engineer
accidentally asked a tool to restart all memcached servers instead of a
certain server. The tool was then changed to make sure that you couldn't
restart more than a few servers without additional confirmation. Sounds very
similar to this situation. Something to think about when you are building your
tools to be more error proof.

~~~
idlewords
There are very few things that Twitter hasn't had two hours of downtime
because of.

------
matt_wulfeck
> _Unfortunately, one of the inputs to the command was entered incorrectly and
> a larger set of servers was removed than intended._

We have geo-distributed systems. Load balancing and automatic failover. We
agonize over edge cases that might cause issues. We build robust systems.

At the end of the day, reliability -- a lot like security -- is most affected
by the human factor.

------
dap
> Removing a significant portion of the capacity caused each of these systems
> to require a full restart.

I'd be interested to understand why a cold restart was needed in the first
place. That seems like kind of a big deal. I can understand many reasons why
it might be necessary, but that seems like one of the issues that's important
to address.

~~~
perlgeek
Possibly a consensus algorithm that refuses writes when it detects itself in a
minority, because it thinks it's in the smaller part of a split-brain scenario.

In this case, throwing away and then re-provisioning the split-off nodes is a
viable approach.
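
A minimal sketch of the majority check such a system might make before
accepting writes (purely illustrative; the postmortem doesn't say how S3's
index subsystem actually handles this):

    def has_quorum(reachable_peers, cluster_size):
        """Only accept writes if this node can see a strict majority of the cluster."""
        return reachable_peers + 1 > cluster_size // 2  # +1 counts this node itself

    # With 9 nodes, losing 5 leaves 4 survivors; each sees only 3 peers and refuses writes:
    print(has_quorum(reachable_peers=3, cluster_size=9))  # False
    print(has_quorum(reachable_peers=4, cluster_size=9))  # True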

------
OhHeyItsE
I find this refreshingly candid; human, even, for AWS.

------
throwtotheway
I hope I never have to write a post-mortem that includes the phrase "blast
radius"

------
bandrami
"we have not completely restarted the index subsystem or the placement
subsystem in our larger regions for many years."

Yeah... nothing says "resilience" quite like that...

------
aestetix
It sounds like this can be mitigated by making sure everything is run in dry
run mode first, and for something mission critical, getting it double-checked
by someone before removing the dry run constraint.

It's good practice in general, and I'm kind of astonished it's not part of the
operational procedures in AWS, as this would have quickly been caught and
fixed before ever going out to production.
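
A minimal sketch of the kind of dry-run gate being suggested (the flag name
and the removal step are illustrative, not AWS's actual tooling):

    import argparse

    def remove_servers(hostnames, dry_run=True):
        """Print what would be removed; only act when --execute is explicitly passed."""
        for host in hostnames:
            if dry_run:
                print("[dry-run] would remove {}".format(host))
            else:
                print("removing {}".format(host))  # placeholder for the real removal call

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("hosts", nargs="+")
        parser.add_argument("--execute", action="store_true",
                            help="actually perform the removal (default is a dry run)")
        args = parser.parse_args()
        remove_servers(args.hosts, dry_run=not args.execute)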

~~~
user15672
I'm not sure you've understood what the problem is. They were removing some
servers from a group. This isn't something that gets a dry run, not without
spinning up the _entire_ AWS infrastructure. It also wouldn't have helped a
jot, since the issue came about after an employee executing a playbook made a
typo.

There's no way this could have been mitigated with a dry run. They're
mitigating it in future by putting more aggressive safeguards in their
tooling, which is the correct way to mitigate this sort of issue.

------
carlsborg
"As a result, (personal experience and anecdotal evidence suggest that) for
complex continuously available systems, Operations Error tends to be the
weakest link in the uptime chain."

[https://zvzzt.wordpress.com/2012/08/16/a-note-on-
uptime/](https://zvzzt.wordpress.com/2012/08/16/a-note-on-uptime/)

------
dorianm
Same as db1 / db2 for GitLab; naming things is pretty important (e.g.
production / staging, production-us-east-1-db-560, etc.).

------
pfortuny
I guess it is time to define commands whose inputs have a great distance in,
say, the Damerau-Levenshtein metric.

For numerical inputs, one might use both the digits and the textual
expression. This would make them quite cumbersome but much less prone to
errors. Or devise some shorthand for them...

156 (on fi six). 35 (zer th fi). 170 (on se zer). 28 (two eig). Evens have
three letters, odds have two.

This is just my 2 cents.
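
As a rough illustration of the idea, the plain Levenshtein distance below
shows how close two easily-confused inputs are (the Damerau variant also
counts transpositions); a tool could demand extra confirmation whenever a
typed value is within distance 1 of another plausible, more dangerous value:

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance (insert/delete/substitute)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    # "10" and "100" differ by a single keystroke, which is why they are easy to mix up:
    print(levenshtein("10", "100"))  # 1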

------
atrudeau
Are customers going to receive any kind of future credit because of this?
Would be a nice band-aid after such a hard smack on the head.

~~~
oxguy3
You can make a claim to receive credit based on their SLA:
[https://aws.amazon.com/s3/sla/](https://aws.amazon.com/s3/sla/)

~~~
jdalgetty
You shouldn't have to make a claim - this should be handled by them
automatically.

~~~
atrudeau
Agreed

------
mulmen
If this boils down to an engineer incorrectly entering a command can we please
refer to this outage as "Fat Finger Tuesday"?

------
jasonhoyt
"People make mistakes all the time...the problem was that our systems that
were designed to recognize and correct human error failed us." [1]

[1]
[http://articles.latimes.com/1999/oct/01/news/mn-17288](http://articles.latimes.com/1999/oct/01/news/mn-17288)

~~~
sp332
This reminds me of Asimov's characteristically tiny story "Fault-Intolerant"
[https://unotices.com/book.php?id=38686&page=15](https://unotices.com/book.php?id=38686&page=15)
(You can ignore the story at the top about Feghoot, the real story is below.)

------
hemant19cse
Amazon's AWS Outage Was More of a Design(UX) Failure and Less of Human Error.
[https://www.linkedin.com/pulse/how-small-typo-caused-
massive...](https://www.linkedin.com/pulse/how-small-typo-caused-massive-
downtime-s3aws-hemant-kumar-singh)

------
bsaul
Wonder if every number for critical command lines shouldn't be spelled out as
well. If you think about how checks work, you're supposed to write the number
as well as the words for the number. -nbs two_hundreds instead of twenty is
much less likely to happen..

just like rm -rf / should really be rm -rf `root`

------
eplanit
So this week the poor soul at Amazon, along with the Price-Waterhouse guy, are
the poster children of Human Error.

------
rsynnott
> While this is an operation that we have relied on to maintain our systems
> since the launch of S3, we have not completely restarted the index subsystem
> or the placement subsystem in our larger regions for many years.

This is the bit that'd worry me most; you'd think they'd be testing this.

~~~
illumin8
A complete restart of the index subsystem would require downtime. Note: they
are not saying those servers have never been restarted - it's highly likely
they get restarted regularly. But, a complete restart of the index subsystem
implies that you shut everything down first and restart it all at once, which
is what was forced to happen two days ago.

~~~
kwisatzh
Why can't the index subsystem itself have a backup then? When the primary
subsystem is being restarted/rebuilt, the secondary takes over.

~~~
grogenaut
thats kind of like asking why git doesn't have a single backup... it's a
distributed system, there's not just one backup, there are lots of little
partial backups.

------
EternalData
This caused panic and chaos for a bit among my team, which I imagine was
replicated across the web.

Moments like these always remind me that a particularly clever or nefarious
set of individuals could shut down essential parts of the Internet with a few
surgical incisions.

------
DanBlake
Seems like something like Chaos Monkey should have been able to predict and
mitigate an issue like this. I'm actually curious if anyone uses it at all -
has anyone here at a large company (over 500 employees) deployed it?

------
matt_wulfeck
Remember folks, automate your systems but never forget to add sanity checks.

------
tuxninja
I think they should have led with insensitivity about it and maybe a white
lie. Such as... We took our main region us-east-1 down for X hours because we
wanted to remind people they need to design for failure of a region :-)

Shameless plugs (authored months ago):
[http://tuxlabs.com/?p=380](http://tuxlabs.com/?p=380) - How To: Maximize
Availability Effeciently Using AWS Availability Zones (note: read it, it's not
just about AZs; it is very clear about multi-region and, better yet,
multi-cloud, which segues into the second article)
[http://tuxlabs.com/?p=430](http://tuxlabs.com/?p=430) - AWS, Google Cloud,
Azure and the singularity of the future Internet

------
nsgoetz
This makes me want to write a program that would ask users to confirm commands
if it thinks they are running a known playbook and deviating from it. Does
anyone know if a tool like that exists?

~~~
mrep
Not sure, but my company's fleet-wide root scripts first confirm the exact
command you want to run, then run it on one host first and output the full
logs for you to inspect/confirm, and then finally start the full fleet-wide
run after you have confirmed the expected result of your output. They also
output the full logs from across the entire fleet once your fleet-wide script
has run.
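
A minimal sketch of that canary-then-fleet pattern (the host list and the
remote-execution call are placeholders, not the actual internal tooling):

    def run_on_host(host, command):
        """Placeholder for whatever remote-execution mechanism the fleet tooling uses."""
        return "[{}] ran: {}".format(host, command)

    def fleet_run(hosts, command):
        """Echo the command, run it on one canary host, then ask before the full fleet."""
        print("About to run {!r} on {} hosts".format(command, len(hosts)))
        if input("Proceed with canary host? [y/N] ").lower() != "y":
            return
        print(run_on_host(hosts[0], command))  # canary: one host, logs shown in full
        if input("Canary output OK? Run on remaining hosts? [y/N] ").lower() != "y":
            return
        for host in hosts[1:]:
            print(run_on_host(host, command))

    # fleet_run(["host-001", "host-002", "host-003"], "systemctl restart some-agent")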

------
prh8
For as much as people jumped all over Gitlab last month, this seems remarkably
similar in terms of preparedness for accidental and unanticipated failure.

~~~
user15672
Beyond a typing mistake, it's not really very similar. The Gitlab incident was
one avoidable problem after another, ending with a giant WTF when they found
out that no-one had even tested the backups were working.

This is a case of someone slipping on the keyboard, removing more capacity
than intended, and the recovery process taking longer than expected. The
process actually seems to be working (for some value of "working"), but the
amount of downtime was way above acceptable. They've already put more
safeguards into the tooling to prevent the situation from happening again.

S3 is also orders of magnitude more complex than GitLab's infrastructure, so
while the amount of time the outage lasted is not acceptable, it does show
that they at least have working processes for critical situations that allow
them to get back in service within a day, which is pretty impressive.

------
chirau
Deletions, shutdowns and replications should always either contain a SELECT to
show affected entities or a confirmation (Y/n) of the step.

------
pwthornton
This is the risk you run into by doing everything through the command line.
This would be really hard to do through a good GUI.

~~~
djhworld
I highly doubt this claim, humans make mistakes regardless of the control
method.

In this particular case the scripts didn't have adequate protections in place,
but that's the benefit of hindsight

~~~
pwthornton
Or the benefit of testing.

------
fulafel
I guess this means it's much better for your app to fail over between regions
than between availability zones.

------
asow92
Man, I'd really hate to be that guy.

------
tn13
Can Amazon take responsibility and offer, say, a 10% discount to all the
customers who are spending >$X?

~~~
kinkrtyavimoodh
It's already part of their SLA.

[https://aws.amazon.com/s3/sla/](https://aws.amazon.com/s3/sla/)

------
sumobob
Wow, I'm sure that person who mis-entered the command will never, ever, ever
do it again.

------
dootdootskeltal
I don't know if it's a C thing, but those code comments are art!

~~~
reitanqild
posting in wrong thread?

~~~
dootdootskeltal
oops yeah, I had so many HN tabs.

------
CodeWriter23
I'm just going to call this "PEBKAC at scale"

------
feisky
Greatly surprised by the quick recovery of such a big system.

------
cagataygurturk
Sounds so Chernobyl.

~~~
Piskvorrr
Because it was resolved in a few hours without any further fallout? (Pun
definitely intended)

------
davidf18
There should be some sort of GUI interface that does appropriate checks
instead of allowing someone to mistakenly type the wrong information.

------
njharman
Did you read the fucking article?

That is EXACTLY what they are doing (among other things).

~~~
dang
Please stop posting like this so we don't have to ban you.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

We detached this comment from
[https://news.ycombinator.com/item?id=13776335](https://news.ycombinator.com/item?id=13776335)
and marked it off-topic.

------
skrowl
TLDR:

Never type 'EXEC DeleteStuff ALL'

When you actually mean 'EXEC DeleteStuff SOME'

------
thraway2016
Something doesn't pass the smell test. Over two hours to reboot the index
hosts?

~~~
canadaduane
I assume "reboot" in this instance means more than turning it off and on again
--it must return to a working state, with many volumes of data requiring log
processing to find the last (and best) "good state".

------
romanovcode
They could've just not post anything. People already forgot about this
disruption.

------
machbio
No one on HN is questioning this: "The Amazon Simple Storage Service (S3) team
was debugging an issue causing the S3 billing system to progress more slowly
than expected." - they were debugging on the production system..

------
fr4egy8e9
What most AWS customers don't realize is that AWS is poorly automated. Their
reliability relies on exploiting the employees to manually operate the
systems. The technical bar at Amazon is incredibly low and they can't retain
any good engineers.

------
aorloff
What's missing is addressing the problems with their status page system, and
how we all had to use Hacker News and other sources to confirm that US East
was borked.

~~~
Scramblejams
No, this is addressed:

>We understand that the SHD provides important visibility to our customers
during operational events and we have changed the SHD administration console
to run across multiple AWS regions.

~~~
yellow_postit
Which is fine until those regions go down. A status page, in my mind, should
have a fallback on a completely different service provider.

~~~
aorloff
Exactly. What they should do is have the status system be independent of AWS
so it can report issues regardless of AWS service status.

~~~
Piskvorrr
Awsnap!
[http://dilbert.com/strip/2008-02-15](http://dilbert.com/strip/2008-02-15)

------
edutechnion
For the many of us who have built businesses dependent on S3, is anyone else
surprised at a few assumptions embedded here?

* "authorized S3 team member" \-- how did this team member acquire these elevated privs?

* Running playbooks is done by one member without a second set of eyes or approval?

* "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years"

The good news:

* "The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately."

The truly embarrassing part, which everyone has known about for years, is the
status page:

* "we were unable to update the individual services’ status on the AWS Service Health Dashboard "

When there is a wildly-popular Chrome plugin to _fix_ your page ("Real AWS
Status") you would think a company as responsive as AWS would have fixed this
years ago.

~~~
icelancer
>Unauthorized S3 team member

This is not in the post.

