
AWS US East is experiencing high error rates on several services - oliverfriedmann
CloudWatch, SES, SNS, SQS, SWF, Auto Scaling, CloudFormation, Directory Service, KMS, and Lambda have been experiencing very high error rates for about three hours now.

DynamoDB is throttling API access and seems to be having issues with the management of metadata.
======
tcas
We were in the middle of a large infrastructure change starting at 4:30am this
morning, including taking our application offline. I'm very thankful that we
did dry runs, timed how long certain operations like RDS restores should
take, and planned abort steps in case something went wrong.

We noticed that RDS and ElastiCache backup and restores were taking much
longer than expected, and once the first set of errors about DynamoDB came in
we decided to abort and try it again at a different date. An hour later we got
notifications that RDS was having issues as well. I'm disappointed that it
takes so long to update the AWS status page when things aren't working
properly.

~~~
confiq
Similar story here... The status.aws page has serious delays.

------
taf2
The main issue appears to be DynamoDB

Here's a copy from the status page.

3:00 AM PDT We are investigating increased error rates for API requests in the
US-EAST-1 Region.

3:26 AM PDT We are continuing to see increased error rates for all API calls
in DynamoDB in US-East-1. We are actively working on resolving the issue.

4:05 AM PDT We have identified the source of the issue. We are working on the
recovery.

4:41 AM PDT We continue to work towards recovery of the issue causing
increased error rates for the DynamoDB APIs in the US-EAST-1 Region.

4:52 AM PDT We want to give you more information about what is happening. The
root cause began with a portion of our metadata service within DynamoDB. This
is an internal sub-service which manages table and partition information. Our
recovery efforts are now focused on restoring metadata operations. We will be
throttling APIs as we work on recovery.

5:22 AM PDT We can confirm that we have now throttled APIs as we continue to
work on recovery.

5:42 AM PDT We are seeing increasing stability in the metadata service and
continue to work towards a point where we can begin removing throttles.

~~~
virtuallynathan
6:19 AM PDT The metadata service is now stable and we are actively working on
removing throttles.

7:12 AM PDT We continue to work on removing throttles and restoring API
availability but are proceeding cautiously.

7:22 AM PDT We are continuing to remove throttles and enable traffic
progressively.

7:40 AM PDT We continue to remove throttles and are starting to see recovery.

7:50 AM PDT We continue to see recovery of read and write operations and
continue to work on restoring all other operations.

8:16 AM PDT We are seeing significant recovery of read and write operations
and continue to work on restoring all other operations.

~~~
justinholmes
9:12 AM PDT Between 2:13 AM and 8:15 AM PDT we experienced increased error
rates for API requests in the US-EAST-1 Region. The issue has been resolved
and the service is operating normally.

------
colinbartlett
This is manifesting itself as downtime for a lot of companies, including
Heroku: [https://status.heroku.com](https://status.heroku.com)

If you want alerts on this sort of thing, my side project StatusGator
[https://statusgator.io](https://statusgator.io) will alert you when services
post downtime on their status pages. My dashboard blew up this morning with a
ton of red and yellow as soon as Amazon started flaking.

Edit: I suppose it's time to invest in a multi-region setup. Since StatusGator
is hosted on Heroku in the US-East region, it is in theory affected by this
problem, though so far it is still up.

~~~
fidz
From Heroku Status Page:

> Our service provider is still working towards resolution of this issue. We
> will update when we have news, or in 1 hour.

I wonder why they don't say that AWS is their service provider. Would it be
wrong to make the information less obscure?

~~~
nzadrozny
> I wonder why they don't tell that AWS is their service provider.

It's because Heroku's choice of vendors shouldn't matter to their customers.
They see it as an implementation detail, and their responsibility to manage.

So I don't think that's an obfuscation. The people I know at Heroku all have
an attitude of, "The buck stops here."

~~~
cracell
That's just stupid. Heroku rarely gives a proper technical explanation of
their outages and drastically underreports their length and severity.

I assume this is to maintain their SLA. We really need independent third
parties to record uptime for SLAs instead of trusting hosts to do it
themselves.

This outage may be the last straw with Heroku for me. Years ago they stated
that they would end their dependence on AWS East, and today shows that
obviously hasn't happened.

------
nasalgoat
This seems like another reason to not rely on Amazon-specific services, other
than the obvious vendor lock-in.

At least in the event of an instance outage you could conceivably migrate off
Amazon to another VPS provider. No one using DynamoDB has an alternative.

~~~
acdha
I don't disagree that you would want a rough idea of what migrating off of
DynamoDB would require, but wouldn't the easier step be using redundancy
across regions first? Most of the sites which have suffered downtime due to
AWS outages have been operating in only a single region (or even a single
AZ!), and adding that extra level of isolation is usually going to be a lot
easier than dealing with multiple vendors or maintaining more of your
infrastructure directly.

~~~
zurn
There are many high-profile companies down (Airbnb, IMDb, Tinder, ...), so
apparently this is not so straightforward.

~~~
peterjancelis
I don't see Airbnb or IMDb being down. If there's advice on what small-time
apps (like mine) can do to get things up again sooner, please let me know.

------
brianpetro_
Could be why my address is now "incorrect"
[https://twitter.com/search?f=tweets&vertical=default&q=incor...](https://twitter.com/search?f=tweets&vertical=default&q=incorrect%20address%20Amazon&src=typd)

~~~
colinbartlett
Well at least it's nice to know Amazon uses Amazon. Could still be unrelated,
but awfully coincidental.

------
xenoclast
As an AWS customer, you need to be aware that the health of all AWS
services matters, not just the ones you use directly.

You say you don't use SQS or SNS? When they go down, you might not be able
to get logs or even log in to the web console.

Same goes for things like AutoScaling, OpsWorks, etc.

~~~
gfosco
That's the beauty of micro-service architectures. You don't have a single
monolithic point of failure, you have dozens of smaller ones.

------
zkhalique
This is why 99.99999999% uptime is a fallacy.

It is not really measuring the time you're going to be up; that
interpretation rests on faulty assumptions. It's like the claim that "the
sun will burn out before one bit is flipped": it's quite likely that by
that time, all the bits will be gone.

[https://signalvnoise.com/posts/3067-lets-get-honest-about-uptime](https://signalvnoise.com/posts/3067-lets-get-honest-about-uptime)
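
To make that concrete: here's what each count of nines budgets per year,
under the (already shaky) assumption that downtime is spread evenly. A
multi-hour outage like today's consumes millions of years of a ten-nines
budget.

    # Downtime budget per year at N nines of availability.
    for nines in range(3, 11):
        unavail = 10 ** -nines               # fraction of time allowed down
        secs = 365.25 * 24 * 3600 * unavail  # seconds of downtime per year
        print(f"{nines} nines -> {secs:.6f} seconds of downtime per year")
    # 3 nines -> ~8.8 hours/year; 10 nines -> ~3 milliseconds/year.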

~~~
rdtsc
Yeah, it is bullshit. They just had a failure, so now they can claim it's
still nine 9s, just averaged over 400 billion years rather than the 10
billion you assumed. So legally still cool, though...

------
mdnormy
I hate the fact that most people (including me, apparently) still assume
AWS is not in their "downtime" equation. I just spent the last 30 minutes
troubleshooting an SMTP auth problem.

Not funny when it's Sunday.

~~~
interesting_att
You're not alone buddy. Been wasting hours of my life looking at this stuff
too :)

------
sauere
Tinder is down due to this, now my life is pointless.

------
rsynnott
Not often you see a red status symbol on Amazon's status page (yellow is
normally considered more than enough to indicate that the product is totally
broken). Don't think I've ever seen _ten_ of them before.

------
samstave
The last time there were API outages in AWS, our autoscaling logic could not
determine the number of running instances, so it felt it had too few. It kept
launching instances, and due to the API outage we couldn't manually kill the
instances either...

So we wound up with over 1,000 of these machines running. Because of our
fan-out design, each one needed to load its DB into memory from other
machines, so our whole environment crashed until we could kill off the
erroneously launched instances.

This meant an effective full reboot of our entire platform...

It was not a fun weekend.
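
The general lesson: scaling logic should fail closed when it can't read the
world. A minimal sketch of that kind of guard, assuming boto3 (hypothetical,
not our actual code):

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def running_instance_count():
        """Count running instances, or return None when the API is down."""
        try:
            pages = ec2.get_paginator("describe_instances").paginate(
                Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
            )
            return sum(len(r["Instances"])
                       for page in pages
                       for r in page["Reservations"])
        except (BotoCoreError, ClientError):
            return None  # "unknown" must not be treated as "zero"

    count = running_instance_count()
    if count is None:
        print("EC2 API unavailable; holding all scaling actions")
    else:
        print(f"{count} instances running; safe to evaluate scaling policy")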

------
drendorx39
Amazon CTO: We designed DynamoDB to operate with at least 99.999% availability
:D

~~~
ryanfitz
I've been using DynamoDB since it was released. In over 3 and 1/2 years of
use, this is the first time I've experienced DynamoDB being down.

~~~
awscat
If it was down for longer than 18 minutes, then they missed "5 9s"
availability (0.3 hours over 3.5 years). Not that it is supposed to work
that way.
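
Back-of-envelope check of that 18-minute figure:

    # 3.5 years of minutes, times the 0.001% downtime allowed at "5 9s".
    minutes = 3.5 * 365.25 * 24 * 60    # ~1,840,860 minutes in 3.5 years
    budget = minutes * (1 - 0.99999)    # five-nines downtime allowance
    print(round(budget, 1))             # -> 18.4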

~~~
yeukhon
To me, downtime caused by a software bug and infrastructure/hardware
availability are different guarantees. I am pretty sure someone did
something recently to DynamoDB.

~~~
toomuchtodo
Infrastructure guy here, doing this for 14 years. Downtime is downtime. You
get a pass if it's "scheduled maintenance" you've notified your customers
about so they can be prepared, but if you silently perform maintenance and it
goes to shit, it counts against your metrics.

~~~
yeukhon
Nope, I still disagree. No service can guarantee 99.999999% unless you
discount software upgrades. You just cannot. If you think those nines include
software upgrades, you are probably overoptimistic.

~~~
toomuchtodo
> No service can guarantee 99.999999%

Don't advertise it if you can't offer it then.

> If you think those nines include software upgrades, you are probably over
> optimistic.

If you advertise a product with a specific SLA, and you can't meet that SLA,
you're a liar. Don't try to blame the victim because of inaccurate/untruthful
marketing or engineering.

~~~
antod
SLAs are just contractual thresholds for getting some specified redress if not
met. They are not promises.

Not meeting an SLA is not lying.

~~~
toomuchtodo
I used "SLA" to mean an advertised/marketed level of service. In this case, I
agree that SLA is the wrong term, as there is no contractual agreement.

------
driverdan
Why is us-east-1 so terrible? All of the downtime this year has been Virginia.

~~~
toomuchtodo
It's the primary AWS region. You spin up your resources there by default
unless you explicitly select another region in the console.

~~~
raverbashing
It's probably a good idea to pick other regions, especially the ones closest
to you.

However, us-east is usually the cheapest one as well

~~~
scott_karana
Is it cheaper than the downtime?

------
crb
"4:52 AM PDT We want to give you more information about what is happening. The
root cause began with a portion of our metadata service within DynamoDB. This
is an internal sub-service which manages table and partition information. Our
recovery efforts are now focused on restoring metadata operations. We will be
throttling APIs as we work on recovery."

([http://status.aws.amazon.com/](http://status.aws.amazon.com/))

------
archimedespi
Reddit is down right now with a 503 - they're on AWS.

~~~
zenonu
That's the reason I'm reading here right now instead of time-wasting on
Reddit.

------
dbarlett
S3 and VPC themselves appear to be fine, as noted on the dashboard, but the S3
VPC endpoints in EC2 are not ("we are also experiencing increased error rates
accessing VPC endpoints for S3"). I was able to restore my sites by removing
the endpoints from the routing tables.
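
For anyone needing to do the same from code, a minimal boto3 sketch of that
workaround (the endpoint and route-table IDs are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Detach the S3 endpoint from a route table so S3 traffic takes the
    # normal internet path again instead of the broken VPC endpoint.
    ec2.modify_vpc_endpoint(
        VpcEndpointId="vpce-0123456789abcdef0",         # placeholder ID
        RemoveRouteTableIds=["rtb-0123456789abcdef0"],  # placeholder ID
    )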

------
airza
well, time to find out who has failure tolerance built into their
architecture 8^)

~~~
drendorx39
failure tolerance is an alien technology for amazon...

~~~
divideby0
that's completely untrue. there are many ways to do fault-tolerance in AWS.
it's expensive, but it's possible. netflix even goes as far as simulating the
failure of entire aws regions in their simian army testing suite:

[http://techblog.netflix.com/2011/07/netflix-simian-army.html](http://techblog.netflix.com/2011/07/netflix-simian-army.html)

That's why Netflix stays up when us-east or us-west are down.

~~~
beagledude
Netflix is down.

~~~
veverkap
Works fine for me

------
fapjacks
You know, this is interesting. There were no symptoms on our side at all
that something was wrong with Amazon itself, and their status page was not
updated in a timely fashion. I spent a few hours (in the middle of the
night, working on my laptop in bed next to my wife) trying to figure out
what in the hell was wrong, only to find out through the grapevine that it
was Amazon. It is _extremely_ frustrating when providers are having
problems and _actively working on a solution_ yet their status page still
has glowing recommendations of their service.

------
JOnAgain
Le sigh. This is impacting AirBnB and I need to check in somewhere in LA later
today. Good thing all the details are in the AirBnB messaging history with the
host. Time for them to just go back to email.

~~~
neals
I'm somewhat in the same situation. I want to check in a movie I just watched
and give it an appropriate rating, but IMDb is down. I guess we'll just have
to wait, right?

~~~
clebio
Any knowledge or evidence that IMDB runs on AWS, and that the two are thus
correlated?

~~~
nhumrich
Well, Amazon owns IMDB, so it's probably a reasonable assumption.

------
aaawow
Amazon Echo hasn't been working since 4am PST.

------
lxfontes
If you can't get into the console, use the awscli. It is responding fine!

------
janson0
SQS is the specific service giving me a ton of trouble right now. Hope they
resolve this quickly. Had Raygun alerts about SQS all night, heh.

So are they saying they are throttling SQS because of the DynamoDB issue?

~~~
oliverfriedmann
I'm not sure. I think many of the other services mentioned probably rely
internally on SQS, so resolving the SQS issues might resolve most of the other
issues as well.

I'm not completely sure, though, whether DynamoDB would benefit from relying
internally on SQS.

~~~
janson0
Yeah, good point. I sometimes forget that AWS uses AWS... and that even if I
don't rely on a particular service directly, a service I rely on may, in
fact, rely on that service.

Hopefully there is a relatively fast recovery on this.

Can anyone even log into their aws console right now?

~~~
grhmc
I can log in.

I wonder if SQS uses DynamoDB, not the other way around.

------
crypt1d
Audible seems affected by this as well. I've made some purchases with my
credits, but the books are still not showing up in my library... and the
checkout process is very slow.

~~~
edanm
FYI, it happened to me as well, but it's resolved now.

------
neoecos
AWS KMS is not working. Critical payment application down =S.

~~~
rational-future
Why would you run a "critical payment application" in US East? This datacenter
has 10x the downtime of West or Ireland.

------
Tinyyy
It's pretty interesting to see how much of the internet relies on cloud
services like AWS, and how much gets brought down by issues like this.

------
geertj
Address verification on amazon.com doesn't work for me at the moment, blocking
me from making any orders.

Not sure if this is related.

------
kureikain
Whoever uses autoscaling, and especially lifecycle notifications with SQS,
will be in trouble now (I am).

The morning is about to start and traffic will ramp up, and I'm not sure
new servers will be launched because CloudWatch is failing. Polling SQS for
lifecycle notification messages fails too.
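
For reference, the polling loop in question looks roughly like this (a
sketch assuming boto3; the queue URL is a placeholder). On days like today
it needs the except branch so it backs off instead of hot-looping:

    import time
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    sqs = boto3.client("sqs", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/lifecycle"

    while True:
        try:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,  # long polling
            )
        except (BotoCoreError, ClientError):
            time.sleep(30)  # SQS is erroring; back off, don't spin
            continue
        for msg in resp.get("Messages", []):
            print("lifecycle notification:", msg["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])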

------
leesalminen
Not related to the AWS outage, but Rackspace CDN customers are in for a world
of hurt today as well.

[https://status.rackspace.com/index/viewincidents?group=28](https://status.rackspace.com/index/viewincidents?group=28)

------
divideby0
Sign-ins to AWS console also appear to be timing out:

[https://www.evernote.com/l/ABkKLgp3RjRDe5uV4pMlyVg1uzkW41DG4...](https://www.evernote.com/l/ABkKLgp3RjRDe5uV4pMlyVg1uzkW41DG4SEB/image.png)

------
chetanahuja
The AWS services stack is deep and deeply intertwined. I've always viewed
depending on such stacks in production with skepticism, and I'd recommend
everybody else do the same.

This might come across as tooting our own horn a bit, but it's more about
sounding a warning to other startups providing SaaS built on public clouds.
My own misgivings about relying on a cloud-provider-specific stack (both
for reasons of visibility/debuggability and of vendor lock-in) meant that
PacketZoom services were not affected by this failure at all, because we
only use AWS as one of many providers of raw machines. We also use our own
techniques to load-balance/fail over among multiple cloud providers (so
even if the raw compute/network went away, our service would take a perf
hit but not be completely down).
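
A stripped-down sketch of the client-side failover idea (hypothetical
endpoints, plain HTTP health checks; our real protocol does more than
this):

    import urllib.request
    import urllib.error

    # Ordered by preference; hypothetical hosts, not our real ones.
    ENDPOINTS = [
        "https://east.aws.example.com/health",  # AWS us-east
        "https://east.gce.example.com/health",  # same geography, other provider
        "https://west.aws.example.com/health",  # last resort: farther away
    ]

    def pick_server(endpoints, timeout=2):
        """Return the first endpoint that answers its health check."""
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.getcode() == 200:
                        return url
            except (urllib.error.URLError, OSError):
                continue  # provider unreachable; try the next one
        return None  # total outage across every provider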

~~~
not_kurt_godel
Or you could just run in multiple regions. Using multiple cloud providers
limits your ability to take advantage of provider-specific features - why
waste time writing your own load balancer when you could use ELB + multiple
regions?

~~~
chetanahuja
_" Or you could just run in multiple regions."_

Not when the original goal of the very service is to have presence in all
geographical regions. If aws us-east is hit, I want the users to transparently
failover to a server on east coast (perhaps one hosted by google or softlayer)
rather than be directed all the way to us-west or eu.

And as for ELB, one doesn't use ELB for a custom protocol that load-
balances/fails-over itself from the client :-)

------
williamcotton
Free Rugby World Cup!

[http://universalsports.com/](http://universalsports.com/)

"RWC2015ppv.com has been affected by an internet outage. Watch here. Not all
mobile devices are compatible"

------
Hughlon
Amazon Video and Alexa are also down.

~~~
eatonphil
This appears to have completely blown away all Alexa data. Even searching for
google.com returns nothing.

------
dankohn1
I noticed this because I was unable to checkout on Amazon Prime Now just now.

~~~
pgrote
I noticed it when I couldn't stream something. The player blamed it on a
Silverlight issue even when the Flash option was chosen. lol

------
frequent
Nothing like finishing a short movie in 48 hours using only web services, and
then WeVideo goes stale just before I can download it... two hours before the
submission deadline :(

------
vreauobere
Other sites down: medium.com, getpocket.com, idonethis.com

------
kiallmacinnes
Great! Almost every takeaway in Dublin has moved to Zuppler for their online
ordering... Zuppler is hosted out of AWS.

------
sidcool
Wow, very interesting to see how much of the infrastructure directly or
indirectly depends on AWS.

------
tnolet
Docker, Wercker, and Travis CI are also affected. Can't log in, or stuff is
really sluggish.

------
SoulMan
Nothing should be affected in non-US-East regions, as per the status page.

------
rocky1138
Better to have this happen on a Sunday than a Monday.

------
qaqy
clouds are so great

------
drendorx39
DynamoDB is literally garbage. That's why Amazon does not provide any SLA
for the service... Even cheap Azure Storage provides cross-region failover.

~~~
Ixiaus
Garbage, you say? DynamoDB blazed a trail for many open-source, eventually
consistent KV databases. Certainly not garbage.

~~~
drendorx39
The only thing DynamoDB does well is simplicity. Apart from that, even
MongoDB has tons more features than DynamoDB, and the new version resolved
performance problems that existed in previous versions.

~~~
icefall
Apples and oranges. People who call tools like this garbage fail to evaluate
the trade-offs at their required point in the complexity space. Distributed
systems involve extreme trade-offs. More features => more bugs.

------
crablar
?

------
heapcity
What is happening to the stock price? Oh, it's Sunday; forgot we can't trade.

~~~
bdcravens
I don't recall any outages having a material effect on Amazon's stock price.

~~~
varelse
Amazon can apparently go up or down 50% depending on whether a butterfly
sneezes in Australia (it has danced wildly between 284 and 580 in the past 12
months alone).
months alone).

Against such high volatility, it would be hard to pinpoint a material
effect from such a small disruption (in the big picture, of course; I'm
betting there are some pretty angry customers today over the few sigmas of
reliability lost to this outage alone).

Now, if a study were published indicating customers were switching
providers over incidents like this, then I think you'd have some material
evidence. But is anyone else better? Azure was apparently out for 12 hours
last year...

[http://www.datacenterknowledge.com/archives/2015/01/23/cloud-reliability-aws-had-fewer-errors-than-azure-google-cloud-in-2014/](http://www.datacenterknowledge.com/archives/2015/01/23/cloud-reliability-aws-had-fewer-errors-than-azure-google-cloud-in-2014/)

