
Summary of the Amazon DynamoDB Service Disruption
https://aws.amazon.com/message/5467D2/
======
felixgallo
"It was impacting a relatively small number of customers, but we should have
posted the green-i to the dashboard sooner than we did on Monday."

Amazon. No. Please listen. You should never post a green-i. Green-i means
nothing to anyone. It's a minimization of a problem. You should change the
indicator to show that there is a problem. If there is a problem, that is what
the indicator is for. Customers don't care about whatever weird internal
political pressure causes you to want to show as little information as
possible on that dashboard. Customers want to know that there is a problem,
and we will totally be able to figure out if it is small or large on our own.

~~~
deanCommie
What's with this all-or-nothing attitude? What is so wrong about different
severity levels?

Green-i: Some small percentage of customers are affected, but most of you have
nothing to worry about. So if you are a customer having problems and see this
notice, you are reassured that you are not crazy. And if you are NOT seeing
problems, you probably won't. Amazon has not always been the best at posting
this information quickly enough, hence the commitment to be better at
messaging for those customers.

Yellow: A significant number of customers are affected. Potentially serious
problems, and if you haven't seen them yet, don't be surprised if you do.
Start monitoring your service health and preparing for a failover to another
region.

Red: FUBAR. Happens rarely, but when it does you'll probably know even before
your application alarms - you'll notice when half the internet shuts down.

~~~
felixgallo
different severity levels are fine. That's what the indicator is for.
Decorating the indicator, normally completely incorrectly, with a tiny _extra_
indicator, is incomprehensible. That's why you've never seen a tiny red light
in the upper corner of your green traffic light. Your 'check engine' light is
not green with a tiny extra indicator in the corner. The metaphors go on and
on.

Customers are totally not interested in the fact that some other random
customer may not be experiencing problems and that amazon has a wide variety
of services, some of which may not be affected in certain geographies.
Customers go to that panel with one express goal: to discover whether the
problems they are seeing with their servers could be Amazon's fault. It is a
triage check. The question is not, is Amazon super great and are the
availability engineers the bestest on average over time? The question is, what
the fuck is going on?

Amazon has suffered catastrophic unavailability, and the green-i has appeared
belatedly, an hour or more into the problem. Because as revealed in this post,
it is manually put up by engineers. Manually!

Manually!

~~~
deanCommie
The question of timing of the green-i is valid, but different from your
original complaint.

In your hypothetical scenario, the customer that is going to the panel to
"find out what the fuck is going on" is not going to miss the indicator i, and
will read the description of the message.

~~~
toomuchtodo
I don't care what that "i" indicator is; if you're showing green while you're
impacted, your status board is useless and I'm going to Twitter and HN to see
if you're broken (which is what I already have to do for AWS).

~~~
mgkimsal
When we've seen problems, some of my clients go there and see green and say
"Amazon shows they're fine - it must be your code or something you did or
didn't do - fix it". Well... it's green but it's not fine. And there's little
I can do to demonstrate to someone that it's not code but infrastructure.

------
arturhoo
_We did not have detailed enough monitoring for this dimension (membership
size), and didn’t have enough capacity allocated to the metadata service to
handle these much heavier requests._

As much as I admire and rely on AWS's scale to build fault-tolerant
architectures and applications, it can't be ignored that the marketing push
towards going "full cloud" doesn't take into account how hard it is to build
resilient architectures in the cloud.

I see those disruption events as stop signs: when the cloud itself fails to
scale, I rethink a few of the decisions we all make when surfing those trends.

[http://yourdatafitsinram.com/](http://yourdatafitsinram.com/) also comes to
mind.

~~~
chillydawg
How is it any harder to do in the cloud than on a rack in a warehouse? At
least you don't have to muck about with cables or phone up the power company.

~~~
nickpsecurity
You call up IBM. You ask for a mainframe solution for two sites. You get
experts to set it up for you with your application and such. You don't worry
about downtime again for at least 30 years.

You call up Bull, Fujitsu, or Unisys for the same thing.

You call up HP. You ask for a NonStop solution. You get the same thing for at
least 20 years.

You call up VMS Software. You ask for an OpenVMS cluster. You get the same
thing for at least 17 years.

Well-designed OS's, software, and hardware were doing cloud-style stuff,
without the downtime, long before the cloud existed. Cloud certainly brought
prices down and flexibility up. Yet these clouds still haven't matched
70's-80's technology in uptime, despite all the brains and money thrown at
them. That's a fact.

So, they shouldn't be used for anything mission-critical where downtime costs
lots of money.

~~~
inopinatus
This is absolute cobblers. I worked in an IT team that had a pair of IBM
mainframes that were fed and watered at crushing expense and for which even
the tiniest software change required a colossal waterfall project.

One day, they failed. One went offline - for reasons never revealed, at least
to me - and the secondary didn't come up. Radio silence, kaput. But an airline
that housed mainframes in the same DCs had their booking system fail at
exactly the same times (with national headlines to match).

The myth of mainframe uptime is exactly that. La-la-land for hardware &
services salesmen.

~~~
nickpsecurity
Appreciate the counterpoint. It could in fact be a myth or legend; there's
lots of money to justify spreading disinformation, too. Maybe an anonymous
survey by a reputable organization is in order, one that breaks down what
issues people do and don't have, with specific metrics, and then compiles that
into a big picture.

Meanwhile, the companies I've worked at all had mainframes without the trouble
people describe. Problems were virtually always the app developers, or the
pain of doing 21st-century stuff with 60's-80's architectures and legacy code.

------
yandie
As this post is up, DynamoDB is experiencing errors again:

6:33 AM PDT We are investigating increased error rates for API requests in the
US-EAST-1 Region.

Source: [http://status.aws.amazon.com/](http://status.aws.amazon.com/)

~~~
colinbartlett
Some services are reporting problems now - starting with the same set as last
time, such as Heroku and the many services that depend on it:

[https://statusgator.io/events](https://statusgator.io/events)

~~~
toomuchtodo
DynamoDB, CloudWatch, EC2, Scaling, and Elastic Beanstalk are all currently
impacted.

------
ghshephard
A pretty good post mortem, but one worrisome final comment:

 _" For us, availability is the most important feature of DynamoDB"_

I would think that durability should be the most important feature of
DynamoDB; better to have an intermittent outage or reduced capacity than to
have data loss - but perhaps there is something about DynamoDB which suggests
availability is rated more highly than durability?

~~~
EwanToo
I suspect they're simply trying to reassure customers that this size of outage
won't happen again, not make some kind of deep mission statement.

~~~
jMyles
If this is true (and I think you're right), then pronouncements about "the
most important feature" are probably not timely.

------
deskamess
> but we should have posted the green-i to the dashboard sooner than we did

Use a full orange or even a half-orange circle, or something else, if you want
to convey 'a subset of customers may have issues'. If possible, move 'having
issues' items to the top - that way people don't have to scroll down to see
what's broken.

~~~
deanCommie
Not enough granularity with a yellow icon. I feel like if only <1% of
customers are affected, a yellow circle isn't warranted - that's for the 5-20%
impact case.

~~~
takeda
Unless you're one of the <1% of customers.

The whole point of a status page is to help determine whether the issue you're
experiencing is on your side or the SaaS's, and not to show as much green as
possible.

The way the AWS status page currently works, it simply fails to provide any
useful functionality and might as well be shut down.

The colors on it only start to change once you already see tons of articles
about AWS being down.

------
userbinator
_Unavailable servers continued to retry requests for membership data,
maintaining high load on the metadata service._

It sounds like the retries exacerbated the situation. I wonder if they use
exponential backoff?
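
For reference, the usual mitigation is capped exponential backoff with full
jitter, so a fleet of failing clients spreads its retries out instead of
hammering the dependency in lockstep. A rough sketch of the idea only - the
names are illustrative, not AWS's actual client code:

    import random
    import time

    class TransientError(Exception):
        """Placeholder for whatever retryable error the client raises."""

    def call_with_backoff(request, max_attempts=5, base=0.1, cap=10.0):
        # Retry with capped exponential backoff plus full jitter, so retries
        # from many clients spread out instead of arriving in synchronized waves.
        for attempt in range(max_attempts):
            try:
                return request()
            except TransientError:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

Without the jitter, every stalled server retries on the same schedule and the
overloaded metadata service never gets a chance to recover.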

~~~
BillinghamJ
It's amazing how frequently this seems to happen. A lot of the huge downtime
events I've read about have occurred because of failures in the systems used
to recover from failures.

It seems that perhaps more time/effort needs to be spent testing the systems
(and the use of those systems) which are critical for handling failure. While
building them, it's easy to dismiss their scaling requirements on the basis
that they shouldn't ever be under significant load.

~~~
antirez
This is a classic, partly because those parts of the software are usually the
least exercised in a variety of conditions, even when they are relatively well
tested. It's very hard to simulate real failures.

------
beambot
Looks like a fair number of key AWS systems rely on DynamoDB -- and further,
on the same system used by customers. I wonder: do they have any inclination
to decouple these dependencies to prevent correlated outages?

~~~
Jgrubb
I don't think they rely on Dynamo, they rely on an internal metadata service
that Dynamo just happened to overwhelm with too many large requests.

~~~
beambot
From the 2nd sentence:

> ... subsequent impact to other AWS services that depend on DynamoDB ...

And the metadata service is part of DynamoDB:

> The membership of a set of table/partitions within a server is managed by
> DynamoDB’s internal metadata service.

~~~
ckozlowski
I just asked the question in a separate post before seeing yours, but I
questioned this as well. Would it have been better to have a separate instance
of DynamoDB independent from the one customers are writing to? I understand
this is added cost/overhead, but it would have resulted in isolating the
failure to the customer side only, I'd think.

------
ckozlowski
I noticed in the "Impact on other services" bit that CloudWatch and the
Console were affected because they were dependent on DynamoDB. Now, I don't
pretend to know DynamoDB too well, but it seems to me that having your
monitoring application dependent on one of the things you would be monitoring
is a strange circular dependency.

Would it have been wiser for Amazon to implement a completely separate
instance of DynamoDB for service offerings that depended on it? Or is this
just simply cost ineffective? Help me understand, thanks. =)

~~~
_alex_
CloudWatch is a monitoring service that AWS offers publicly. It is different
from the monitoring service that AWS services use internally.

source: was on an AWS service team for several years

~~~
berkay
It's still a problem even if it's not the internal monitoring system.
Customers use CloudWatch to detect problems with DynamoDB and get notified.
This dependency means customers may not get notified if CloudWatch does not
work as expected due to a DynamoDB problem.

~~~
sharpy
There are different forces at play.

On the one hand, one can design a service with minimal dependencies so that it
survives other services' outages, but that means duplication of effort,
increased cost, and slower delivery of features. On the other, one can focus
on adding value.

It seems, though, from the update, that CloudWatch is going to get some sort
of caching for the most recent data.
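
If so, a plausible shape for that is a read-through cache that falls back to
slightly stale datapoints when the backing store is unreachable. A rough
sketch of that idea only - the class and method names are made up, not
CloudWatch internals:

    import time

    class MetricReader:
        def __init__(self, store, max_staleness=300):
            self.store = store            # hypothetical DynamoDB-backed metric store
            self.max_staleness = max_staleness
            self._cache = {}              # metric name -> (timestamp, value)

        def latest(self, metric):
            try:
                value = self.store.get_latest(metric)
                self._cache[metric] = (time.time(), value)
                return value
            except Exception:
                # Backing store unavailable: fall back to the most recent cached
                # datapoint if it is still fresh enough to be useful.
                ts, value = self._cache.get(metric, (0, None))
                if time.time() - ts <= self.max_staleness:
                    return value
                raise

Stale-but-recent data is usually enough to keep alarms evaluating through a
short outage of the dependency.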

------
bboreham
> We did not have detailed enough monitoring for this dimension (membership
> size)

It is often the case that you aren't monitoring the metric that would have
told you the world was about to end. However,

> their processing time was near the time limit for retrieval

This is a specific response time with a known, fatal, limit. Not seeing that
trend seems unfortunate.
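
When a hard limit like that is known, the alert more or less writes itself:
page when the observed processing time eats into some margin of the deadline,
rather than waiting for requests to start failing. A toy version - the names
and thresholds here are illustrative, not the actual DynamoDB limits:

    RETRIEVAL_DEADLINE_S = 1.0   # hypothetical hard limit on membership retrieval
    WARN_FRACTION = 0.8          # page once 80% of the latency budget is used

    def alert(message):
        print("PAGE:", message)  # stand-in for whatever paging system is in use

    def check_retrieval_latency(p99_latency_s):
        # Fire while there is still headroom, not after requests start timing out.
        if p99_latency_s >= WARN_FRACTION * RETRIEVAL_DEADLINE_S:
            alert("membership retrieval p99 %.2fs is approaching the %.1fs limit"
                  % (p99_latency_s, RETRIEVAL_DEADLINE_S))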

------
astral303
"We did not realize soon enough that this low overall error rate was giving
some customers disproportionately high error rates."

Critical lesson for running multitenant SaaS services at scale: you need to
monitor not just overall error rates, but individual tenant/customer error
rates.
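
Concretely, that means aggregating errors per tenant and alerting on the
worst-affected tenant, not just the fleet-wide average. A minimal sketch of
the idea - the field names here are invented:

    from collections import defaultdict

    def per_tenant_error_rates(requests):
        # `requests` is an iterable of records like {"tenant": "acme", "ok": True}.
        totals, errors = defaultdict(int), defaultdict(int)
        for r in requests:
            totals[r["tenant"]] += 1
            if not r["ok"]:
                errors[r["tenant"]] += 1
        return {t: errors[t] / totals[t] for t in totals}

    request_log = [
        {"tenant": "acme", "ok": True},
        {"tenant": "acme", "ok": True},
        {"tenant": "initech", "ok": False},
        {"tenant": "initech", "ok": False},
    ]
    rates = per_tenant_error_rates(request_log)
    # The overall error rate here is 50%, but initech's is 100% -- alert on the
    # per-tenant maximum as well as the fleet-wide average.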

------
jgalt212
Does anyone keep stats on service outages for AWS, Azure, Google, et al?

~~~
colinbartlett
I currently track a lot of this through my project StatusGator which alerts
you on outages: [https://statusgator.io](https://statusgator.io).

Most of the historical data isn't shown publicly but I could probably do
something with that.

~~~
jgalt212
The historical data would be quite useful for companies when deciding which
cloud providers to contract with.

------
djb_hackernews
Hmm, we are seeing very high request latencies with DynamoDB in multiple
accounts and can't even load the DynamoDB console... Anyone else?

~~~
lostcolony
I can load the console, but it's very, very slow. And creating a new table
takes forever, then spews error text over the interface in a way that makes it
clear they never intended for an error to be displayed this way.

