
AWS EC2/RDS Outage in us-east-1 - jacobwg
https://status.aws.amazon.com/?date=2019-08-31
======
matt2000
Just wanted to add a quick note before we get the usual deluge of "you should
be running in multiple AZs and regions" posts: These outages are relatively
rare and your best decision might just be to accept the tiny amount of
downtime and keep your app simple and inexpensive to run.

I of course don't know the tradeoffs involved in running your system, but I
know for a lot of my situations the simplicity of a single AZ with a
straightforward failover option is usually the right tradeoff.

~~~
bob1029
For us this is exactly the correct approach. We could have spent millions of
dollars and thousands of man hours hardening things to be resilient to single
region outages. But for what? We aren't GE or Google. If our conference line
goes down for 2-3 hours per year because we don't have apocalypse-proof
infrastructure, literally nothing bad happens to our business. In this exact
outage we are discussing, all of my coworkers are at home having breakfast
with their families and doing various weekend activities. No one but me will
know there was even a problem until I log into the AWS console and find the
alerts. Worst case, I have to reboot or restore a few affected instances on
Tuesday morning.

It seems like a lot of businesses are chasing this ideal of perfection and end up
much worse off than if they had just stuck whatever application on a single
server in a semi-reliable part of the world.

~~~
hhw
2-3 hours per year is a lot of downtime. Most competent bare metal providers
see maybe one major outage of less than an hour every 3-5 years. With all the
proper redundancies in place, nothing should result in a major outage other
than a facility-wide power outage (where the load somehow gets dropped because
the generators don't start right away as they should) or a misbehaving, only
partially failing, core network infrastructure device.

Specific providers aside, there's more complexity involved in a large cloud
provider's infrastructure and much more that can go wrong as a result. Having
a code update or some orchestration issue from your infrastructure provider be
a potential source of major outages is a huge and unnecessary risk. And you
don't need that much scale when you're just utilizing enough resources to fill
up a few whole physical machines for a few hundred dollars a month. Add some globally
distributed BGP Anycast DNS and database replication and you have enough
redundancy to withstand most of the worst major infrastructure failures.

I would understand if AWS were super simple and convenient, but these days the
learning curve seems far greater than setting up the above-described bare
metal solution, while being almost an order of magnitude more expensive for
the equivalent amount of resources.

How did we end up here? Does brand recognition just trump all technical and
economic factors, or what am I missing?

Disclaimer: I run a bare metal hosting provider

~~~
tilolebo
3 hours of downtime per year works out to roughly 99.97% uptime.

In what world is that a lot of downtime?

~~~
rosser
In a world where you have SLAs with your customers, in which you commit to
something better?

~~~
tilolebo
Damn, these ships must really be run tightly.

In every company I have worked for, the downtime caused by bugs and other
post-deployment issues already exceeded that number.

------
btown
[https://status.heroku.com/incidents/1892](https://status.heroku.com/incidents/1892)
- it appears Heroku is being particularly affected. We've had multiple sites
on multiple accounts go down in the past few minutes.

EDIT T16:31Z: It appears Heroku has failed over their dashboard, but dynos are
still failing to come online. We had assumed that they had multi-region
failovers for their customers. Incredibly disappointing.

~~~
bkovacev
We cannot even restart/turn off the dynos or get into the dashboard to turn
off and kill our background tasks for some of our clients.

~~~
brianwawok
Is it best practice to run the dashboard and the cloud service in the same
region of the same cloud provider?

~~~
bkovacev
I believe it's not.

As a PaaS, I would think they would run a high-availability cluster across at
least two regions so that they would have a mechanism in place for events like
these. I know it's expensive, but if you charge $250 for 2.5 GB of RAM, I
believe you would have enough money to cover it. I also think, as you hinted,
that they should separate services across different regions.

------
bombtrack
Looks to have been caused by a loss of utility power and subsequent backup
generator failure at one datacenter.

> 10:47 AM PDT We want to give you more information on progress at this point,
> and what we know about the event. At 4:33 AM PDT one of 10 datacenters in
> one of the 6 Availability Zones in the US-EAST-1 Region saw a failure of
> utility power. Backup generators came online immediately, but for reasons we
> are still investigating, began quickly failing at around 6:00 AM PDT. This
> resulted in 7.5% of all instances in that Availability Zone failing by 6:10
> AM PDT. Over the last few hours we have recovered most instances but still
> have 1.5% of the instances in that Availability Zone remaining to be
> recovered. Similar impact existed to EBS and we continue to recover volumes
> within EBS. New instance launches in this zone continue to work without
> issue.

[https://status.aws.amazon.com/rss/ec2-us-east-1.rss](https://status.aws.amazon.com/rss/ec2-us-east-1.rss)

------
bdcravens
I've noticed both Twitter and Reddit were having issues this morning, so this
makes sense.

~~~
mrunseen
[https://reddit.statuspage.io/](https://reddit.statuspage.io/)

------
scott113341
I got paged 50 minutes before AWS updated their status page. We are running on
AWS's managed Kubernetes offering (EKS), and about one third of our nodes were
running in the affected availability zone. We were then able to move all of our
traffic out of that AZ, which solved our issues. The main symptom was HTTP
requests made by our backend to 3rd party APIs failing, but only on requests
originating from that AZ.
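
For anyone curious what that kind of AZ evacuation can look like, here's a
minimal sketch (not our actual runbook) that cordons every node in the
affected zone so Kubernetes stops scheduling pods there. The zone name is a
placeholder, and it assumes the standard zone label EKS applied to nodes at
the time:

    # Cordon all nodes in one AZ so the scheduler moves work elsewhere.
    from kubernetes import client, config

    AFFECTED_ZONE = "us-east-1x"  # hypothetical placeholder for the bad AZ

    config.load_kube_config()
    v1 = client.CoreV1Api()

    nodes = v1.list_node(
        label_selector=f"failure-domain.beta.kubernetes.io/zone={AFFECTED_ZONE}"
    )
    for node in nodes.items:
        # Equivalent to `kubectl cordon <node>`: mark it unschedulable.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        print(f"cordoned {node.metadata.name}")

Existing pods still have to be drained or rescheduled after that, but at least
no new work lands in the broken zone.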

------
groundlogic
Reddit has been quite dysfunctional for me the past hour or so.

~~~
JoshGlazebrook
Same. I thought it was my WiFi at first.

------
sdrothrock
Amazon JUST had an ec2/RDS failure in one AZ in Tokyo last week; the cause was
a bug in their HVAC that led to overheating. I wonder if this is similar or
just coincidental.

[https://aws.amazon.com/jp/message/56489/](https://aws.amazon.com/jp/message/56489/)

~~~
jacques_chester
US-East-1 was the first, IIRC. Lots of early adopters have grown significantly
but have various hardcoded assumptions about running there.

------
xyst
The Spinnaker project is looking more appealing with every outage. Outage
detected in X provider in Y region? Deploy infrastructure to Z provider in Y
region.

~~~
bjterry
This outage is only affecting a single availability zone, so taking on the
complexity of multiple cloud providers would not be necessary to be resilient
against it. AWS best practices would already have covered you.
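
For the stateless tier, the "best practice" in question mostly amounts to
spreading an Auto Scaling group across subnets in different AZs. A hedged
boto3 sketch (group name, launch template, and subnet IDs are placeholders,
and it assumes the launch template already exists):

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",  # hypothetical
        LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
        MinSize=3,
        MaxSize=9,
        # One subnet per AZ; instances are balanced across them, so losing
        # a single AZ leaves the others still serving.
        VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    )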

~~~
herostratus101
Where does it say it's a single AZ?

~~~
judge2020
Every "more" drawer/dropdown

> We are investigating connectivity issues affecting some single-AZ RDS
> instances in a single Availability Zone in the US-EAST-1 Region.

------
somehowadev
I’m surprised by how much of the "internet" seems to be affected by a single AZ
going down.

~~~
neuromantik8086
We wouldn't have this problem if people just used application-layer protocols
and federated services like the early internet.

~~~
sjwright
Wait, why wouldn’t we have these problems? Back in the 1980s, if a university
campus connection went down, you couldn’t telnet in or read your university
POP2 email remotely. It was down.

The only difference between then and now is that we’re online (seemingly) at
every waking minute expecting a hundred different services to be functional at
any given moment.

~~~
neuromantik8086
Modern services such as reddit and Twitter effectively usurp the role that
Usenet/NNTP and similar distributed protocols used to fulfill, but without the
advantage of decentralization / lack of large single points of failure that
such protocols embraced. That's what I was getting at, and maybe I'm full of
shit.

In the 80s if a university campus internet connection went down, only that
university was affected. Now, when a single AWS availability zone goes down, a
much wider swath of users is impacted. Such consolidation / centralization
shows a disregard for the spirit of the early internet and design
considerations that went into it.

Again, maybe I'm full of shit. Lots of people here seem to think so.

------
nemothekid
us-east-1 continues to have worse uptime than other regions (likely for good
reason, too: it continues to be the default region).

I've avoided that region and I can't remember the last time I had downtime
caused by Amazon.

~~~
bdcravens
Also, it is one of the regions that gets new features first, which makes me
wonder if it contributes to lower stability.

~~~
Matthias247
This is not true. The region where new software is deployed first differs from
team to team (or service to service).

~~~
bdcravens
Perhaps, but I can't recall a product launch that wasn't available in us-
east-1 from day 1.

~~~
jbourne
I believe us-east-1 is one of the regions included in the minimal set of
regions for a new AWS service to be considered 'available'. If I recall, eu-
west-1 is another such region.

------
JacobJans
Leaseweb Virginia is having a major outage as well. Maybe it is related?

[https://www.leasewebstatus.com/incidents/updated-connectivit...](https://www.leasewebstatus.com/incidents/updated-connectivity-issues-in-part-of-our-network/ci25t2jr)

------
ihaveajob
Copy that. Happy Labor Day weekend everyone.

~~~
ihaveajob
It's been 2 hours and they still don't have a red flag on
[https://status.aws.amazon.com/](https://status.aws.amazon.com/)

~~~
mathieuh
Cognito went down completely a couple of months ago (it started returning
rate-limit errors for every request), and despite our contacting AWS to see if
there was anything going on (and their confirming that there was), they never
updated the status page. The only way we got updates was by calling our AWS
contact.

------
colinbartlett
This seems to affect a broad swath of the internet, perhaps because the us-
east-1 region is so popular? My side project StatusGator shows approximately
15% of the status pages we monitor (including our own) with a warn or down
notice right now, a sizable spike over the baseline.

------
riffic
> We are investigating connectivity issues affecting some instances in a single
> Availability Zone in the US-EAST-1 Region.

Well there’s your problem, people. Use multiple AZs.

~~~
doiwin
Easier said than done if that means synchronizing a database and a filesystem
that are heavily written to.

~~~
holykin
Depends on what you use. RDS can span AZs and failover in events like this.
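
Concretely, that's a one-call change for an existing instance. A minimal boto3
sketch (the instance identifier is a hypothetical placeholder); Multi-AZ keeps
a synchronous standby in another AZ and fails over to it automatically:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")
    rds.modify_db_instance(
        DBInstanceIdentifier="my-app-db",  # hypothetical instance name
        MultiAZ=True,                      # provision a standby in another AZ
        ApplyImmediately=True,             # don't wait for the maintenance window
    )

It roughly doubles the instance cost, but that's a lot cheaper than building
the synchronization yourself.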

------
crb002
Curious. Lambda not affected. EC2 being physically tied to a box does
introduce extra risk I hadn't thought of.

~~~
scarface74
Lambda would be just as affected if you were running inside of a VPC [1] and
you ignored the multiple warnings against running your Lambda in only one AZ.

[1] technically your Lambda never runs “inside your VPC” but it’s a
colloquialism that everyone understands.
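
In other words, give the function subnets in more than one AZ and this failure
mode mostly goes away. A hedged boto3 sketch (function name, subnet and
security group IDs are placeholders):

    import boto3

    lam = boto3.client("lambda", region_name="us-east-1")
    lam.update_function_configuration(
        FunctionName="my-function",  # hypothetical
        VpcConfig={
            # One subnet per AZ so ENIs aren't concentrated in a single zone.
            "SubnetIds": ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"],
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
        },
    )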

------
jgalt212
This is a pretty good common-sense post on not having your failure modes
correlate with your client's failure modes.

[https://trackjs.com/blog/separate-monitoring/](https://trackjs.com/blog/separate-monitoring/)

I don't work for any of the entities mentioned.

------
abathur
Had an app doing fine until about 12 minutes ago, when Heroku tried to move it
to a new server. Alas.

~~~
abathur
Setting dynos to 0 and back got us up, though.

------
whalesalad
For folks here, my RDS instances in us-east-1f are doing okay (knock on wood!)
Not sure which AZ is suffering most.

My client's Heroku instances are online, thankfully.

Can anyone here speak to their experience with the Ohio region? I'm
considering leaning on that more and more.

~~~
deusex_
Your us-east-1f is not the same one as on other accounts. The letter is
randomly assigned to the AZ to spread load.

~~~
whalesalad
Interesting, never knew that. I guess that is why the announcements never
explicitly pointed to a single AZ by name.

~~~
zifnab06
Essentially everyone picks "a" because it's the first az. There's some
internal mapping to "your az a is actually datacenter q". You can _kind of_
figure out which AZs match across accounts if you've got enough accounts you
can send traffic between.

I've been told that the "a" AZ you get was the least populated at object
creation time (ie the first time you make an object that lives in an az), but
I don't know how valid that is.

~~~
whalesalad
Aside from spinning up an EC2 node in each AZ and doing ping or tracing tests,
I wonder if there is a quick-n-dirty way to map AZs between different AWS
accounts. I’ve never had to approach that scenario (cross-account, low-latency
requirements) but in the future I’ll keep this in mind.

~~~
cthalupa
Go to your Subnets tab in the EC2 console. You'll see the actual AZ numbers
there, vs. the 'random' lettering.
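
The same data is available from the API if you'd rather script it:
DescribeAvailabilityZones returns both the per-account name (us-east-1a, ...)
and the canonical zone ID (use1-az1, ...), and the zone ID is what's stable
across accounts. A quick boto3 sketch:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(az["ZoneName"], "->", az["ZoneId"])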

------
doiwin
Is there no way at all to reach Amazon EC2 instances in us-east-1, or is it
just the default route to the internet that's broken?

Is there any way for the owners of the instances to reach them?

------
shamalinga
Is this why Reddit and Duolingo weren't working properly? I've had issues
since 9pm Sydney time, so about 4 hours now.

------
karmakaze
I remember reading about how not all AWS regions are similarly operated and
that one was a snowflake. Is it US-East-1?

~~~
Dunedan
Yes. us-east-1 is the first AWS region Amazon made publicly available. It's
also historically been used by a lot of customers as the "default region"
where they launch all workloads that don't have a special need to run
somewhere else.

That has led to us-east-1 being the largest AWS region by far, and also the
one comprising the largest number of Availability Zones (6) of all AWS
regions.

~~~
karmakaze
Ok, so it is this one. I was talking more about how the region itself has
features and exceptions/quirks that are different from other AWS regions.
Basically a quirks-mode region with differences that may or may not impact you
at some point in time. Or you do have special needs and US-East-1 is the only
region that has the special non-standard ability you want to use.

------
nrxr
Has anyone else noticed that there never seem to be outages in us-east-2, yet
somehow everyone keeps putting instances in -1?

Why?

~~~
bob1029
For us, it's mostly a matter of historical convention. Our entire stack
currently lives in -1 (we've had instances there for ~5 years now), and moving
to a different region over something like this is a bit of a pain in the ass
for us, considering how transient the impact of these things has typically
been to our business.

If we move anywhere, it's going to be completely out of AWS and onto on-prem
hardware or some bare metal provider. Hopping regions hoping to win at some
reliability metric game is not a good way to run a business, IMO.

------
odiroot
Funnily enough, Heroku in Europe also seems to be malfunctioning. I haven't
been able to deploy my app for at least an hour now.

------
bjornsteffanson
I'm in Australia and Reddit/Twitter ground to a standstill - request timeout
after request timeout. I presumed it was an outage somewhere but was surprised
to learn it was with AWS us-east-1? I would have thought that surely my
connection would have been routed to a different region based on my location.

~~~
dekhn
Usually, DB servers will live in a small number of locations with good
connectivity between clusters, while the frontends (which terminate the user's
TCP connection) live much closer to the user (likely Sydney). Good design
means that there are few round trips between the FEs and the backend, but they
are not entirely avoidable.

Designing truly resilient and available applications with DB servers that
replicate across continents is hard.
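
A back-of-the-envelope sketch of why, assuming something like a 150 ms round
trip between continents (the figure is an assumption, not a measurement):
every synchronously replicated commit has to wait out at least one of those
round trips.

    # Rough upper bound on synchronous commits per connection across continents.
    RTT_SECONDS = 0.150  # assumed intercontinental round-trip time
    print(f"{1 / RTT_SECONDS:.0f} commits/sec per connection, at best")
    # ~7 commits/sec before the database does any actual work

Asynchronous replication avoids that wait, but then a regional failover can
lose the most recent writes.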

~~~
juliusmusseau
Is true master-master replication across continents even possible?

I guess partitioning can help, but then isn't it just turning the DB servers
into pizzas of master-slave where the Hawaiian slice is master only in Hawaii,
and slave everywhere else?

~~~
yjftsjthsd-h
Yeah, you're gonna hit CAP _hard_ at that distance.

------
patrickaljord
That must be why reddit and twitter are failing on me.

~~~
ryanSrich
This leads me to believe it’s more than a single AZ failure, despite what AWS
is reporting. Not having multi-AZ, auto failover or replication doesn’t seem
like a thing Reddit or Twitter would skip out on.

------
holykin
It looks like it was localized to zone D.

~~~
tdurden
Zone designations are account-specific; zone D for you is not zone D for me.

~~~
riffic
The affected AZ appears to be use1-az6. You can map "your" AZ name (us-
east-1c, us-east-1d, etc.) to the actual, canonical name of the AZ in the
'Subnets' tab on the VPC console.

~~~
y0y
What makes you say use1-az6 is the culprit? I only ask because none of our
workloads in az6 have experienced any issues. ....yet. We run critical
workloads across 3 AZs thankfully, but still.

------
beardedman
Aha. Experienced some NPM lag too.

------
fibers
Is that why XDA Developers doesn't work?

------
smitty1e
My little instance died and I had to bring it back from the image.

Glad to know that it wasn't anything personal over any Hacker News gags I've
done.

------
rvz
Well, this outage says something about the companies that religiously depend
on it.

If your entire service went down as soon as this happened, congratulations!
You didn't deploy in multiple regions or think about a failsafe/fallback
option that redirects traffic away from the affected service or instance.

~~~
meddlepal
Very few companies or systems need near-perfect uptime. Multi-region cloud
engineering, especially once data is involved, is incredibly expensive. If you
do need that kind of resiliency, you usually engineer it for just a very
specific component rather than the entire system.

An outage like this happens how often?

Edit: Looks like this is only affecting a single AZ... so it's a bit of a
different situation, but I would agree that if you're not capable of surviving
a single-AZ outage in 2019, your engineering team should be replaced.

~~~
sgustard
> your engineering team should be replaced

My engineers are all React and CSS web developers. They don't know anything
about multi-tenant data resiliency. But they can make a real pretty "system
down" page.

