
AWS Service Interruptions - codingninja
AWS is currently having issues with a bunch of our instances and the console reports the following error:

"An error occurred fetching instance data: The service is unavailable. Please try again shortly."

Anyone else experiencing a problem?
======
romanr
It would take a direct hit from a nuclear weapon on the datacenter for Amazon
to change the icon to red on the service status page.

~~~
mrmondo
Yeah, we monitor lots of Amazon & Microsoft 'cloud' services, and we observe
much, much higher downtime / number of outages than they ever report, on the
order of 50 to 1 or more. What do you expect though? Both companies are known
for lying through their teeth to convince the IT community (or more likely the
IT managers) that their services are reliable for everyone, have amazing
uptime, and that they're not only a good option but the only option.

~~~
mbesto
> _What do you expect though? Both companies are known for lying through their
> teeth to convince the IT community (or more likely the IT managers) that
> their services are reliable for everyone and have amazing uptime_

Their uptime is _much_ higher on average than any IT team I've ever been
involved in.

~~~
mrmondo
Oh wow, really? That's really bad - you must have worked with some really poor
ops teams in the past. Last year we measured less than 97% uptime on AWS
Sydney, and a shocking 96% uptime for Office 365 Exchange Online. Most of the
problems, when we investigated them out of interest, were due either to
network routing issues within their networks (or the first ISP hop), or to
hosts outright failing. The 'cloud' is just outsourced hardware with a
provided toolset (APIs etc...); Amazon itself claims that you must have your
hosts across various zones to get decent uptime - that's like saying "oh yes -
the Toyota Corolla is really reliable, it works 99.99% of the time... as long
as you buy a second one for when it's not available".

Our internal uptime is 99.985% in production. We are fast moving and roll out
changes every day, we run mainline kernels, and all of our 350-odd servers and
~800 containers are running on completely vendor-independent, open source
software.
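For anyone wanting to sanity-check the uptime figures being thrown around in this thread, a quick sketch of converting an uptime percentage into downtime per year (the numbers below are just the ones quoted above, used for illustration):

```python
# Rough downtime-per-year implied by an uptime percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes(uptime_pct: float) -> float:
    """Minutes of downtime per year implied by a given uptime percentage."""
    return (1 - uptime_pct / 100) * MINUTES_PER_YEAR

# 97% uptime (the AWS Sydney figure measured above) is nearly 11 days a year.
print(f"97%     -> {downtime_minutes(97):,.0f} min/yr "
      f"(~{downtime_minutes(97) / 60 / 24:.1f} days)")
# 99.985% (the internal figure above) is under an hour and a half a year.
print(f"99.985% -> {downtime_minutes(99.985):,.0f} min/yr")
```

The gap between 97% and 99.985% is the difference between roughly 11 days and roughly 79 minutes of downtime per year, which is why the two sides of this argument feel so far apart.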

I'm not saying it's easy, but the middle man is there either to help you if
you can't find or afford good operational engineers up front, or to take your
money because their advertising has made you believe they are always the best
decision.

We perform a detailed yearly cost comparison between AWS and our own operated
datacentre: the cost to run and maintain the same uptime, processing power
(and yes, we take into account spinning down instances at night etc...),
bandwidth between zones, backups and customers, and it really hasn't improved
at all over the past 3 years. This year the review came back that our yearly
expenditure on operational expenses would increase from approximately $500,000
(including human resources) to well over $3,000,000 a year (not kidding); the
margin of error was approximated at between 10-20%.

~~~
mbesto
> _Oh wow really? That's really bad - you must have worked with some really
> poor ops teams in the past._

You sound genuinely very smart and knowledgeable in this area. But the other
90% of the workers in this sector are not.

> _Amazon itself claims that you must have your hosts across various zones to
> get decent uptime - that's like saying "oh yes - the Toyota Corolla is
> really reliable, it works 99.99% of the time... As long as you buy a second
> one for when it's not available"._

Wait, you don't have a second data center for your mission critical systems in
case your primary fails?

> _We perform a detailed yearly cost comparison between AWS and our operated
> datacentre...bandwidth between zones, backups and customers and it really
> hasn't improved at all over the past 3 years_

I totally agree. If you have the right resources, a good data center partner
and well defined process, then "the cloud" isn't for you. For the other 90% of
the people out there that simply don't have the know-how, knowledge, or
resources to find talented IT operational excellence, then AWS totally makes
sense.

~~~
mrmondo
Yes, we have two datacentres, and we do have a few VPSes, mostly for
triangulation of monitoring. But honestly, in four years we haven't had to
fail over once, although we practise it with our applications almost every
single day.

Thank you for the kind words. I think one major thing for us is that we've
hired a small number of just the right people, each with quite different
backgrounds, and we work VERY closely with our developers. Every bit of
configuration is kept in Git and we CI/CD whatever we can.

~~~
mbesto
> _Yes we have two datacentres_

That's all that Multi-AZ is mate ;)

------
lobe
Not sure if relevant to this issue, but Sydney is currently being hit with one
of the biggest storms I can remember in the past few years. Probably not crazy
enough to take down a DC, but might be a contributing factor in this outage.

------
origami777
I realize that some systems may need to have all of their servers located
close together in a single AZ. But barring that, if this took you offline, you
should really consider spreading your instances across AZs. It's so easy
there's no excuse not to do it.

Another thing to look into is EC2 Auto Recovery [1]. I don't know if this
would've kicked in with today's event, but it's worth setting up as an extra
safety net.

[1] [https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)

edit: I'm basing this off the status page which indicated that only one AZ was
impacted.
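For reference, EC2 Auto Recovery is driven by a CloudWatch alarm on the system status check, with a special "automate" recover action. A minimal sketch of the alarm parameters is below; the instance ID is a placeholder, and the actual `put_metric_alarm` call (commented out) would need boto3 plus real credentials:

```python
def auto_recovery_alarm_params(instance_id: str, region: str) -> dict:
    """Parameters for a CloudWatch alarm that triggers EC2 auto recovery
    when the system status check fails for several consecutive minutes."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,              # check every minute...
        "EvaluationPeriods": 5,    # ...and fire after 5 failed minutes
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # The "automate" recover action migrates the instance to healthy
        # hardware while keeping its instance ID, IPs and EBS volumes.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }

# "i-0123456789abcdef0" is a hypothetical instance ID.
params = auto_recovery_alarm_params("i-0123456789abcdef0", "ap-southeast-2")

# With boto3 and credentials configured, you would then apply it like so:
# cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")
# cloudwatch.put_metric_alarm(**params)
```

Whether recovery would have helped during an AZ-wide power event is another question (there may be no healthy hardware in the zone to recover onto), but as noted above it is a cheap extra safety net.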

~~~
PebblesHD
Sadly our use case (private data etc.) prevents us from leaving the local
availability zone, meaning when it went down today we were left totally
unavailable. The recovery itself is ongoing but our applications are resilient
enough to detect the restored connections and automatically add themselves
back into the cluster.

~~~
NeutronBoy
Availability zones are different from regions. You can still be in multiple
AZs within the Sydney region.

------
sidcool
From AWS status page for Asia Pacific:

10:47 PM PDT We are investigating increased connectivity issues for EC2
instances in the AP-SOUTHEAST-2 Region.

11:08 PM PDT We continue to investigate connectivity issues for some instances
in a single Availability Zone and increased API error rates for the EC2 APIs
in the AP-SOUTHEAST-2 Region.

11:49 PM PDT We can confirm that instances have experienced a power event
within a single Availability Zone in the AP-SOUTHEAST-2 Region. Error rates
for the EC2 APIs have improved and launches of new EC2 instances are
succeeding within the other Availability Zones in the Region.

Jun 5, 12:31 AM PDT We have restored power to the affected Availability Zone
and are working to restore connectivity to the affected instances.

~~~
tvmalsv
It took them an hour to figure out that their connectivity issues were caused
by losing power to an entire Availability Zone? Maybe they should add an alert
for "AZ has no power" or put it on a dashboard...

I'm joking of course, but that's what ran through my mind while reading that
timeline.

~~~
AdamJacobMuller
Wasn't quite that simple. I lost connectivity to instances that did not reboot
so I'm guessing it took out some network elements.

------
karmacondon
I recently switched to Google Compute Engine. It's cheaper and so far more
reliable than AWS. Might be another option for some people here.

~~~
sidcool
I am trying to convince people at my work to move from AWS to GCP, but AWS
truly has become the Microsoft of cloud computing. Many people have no idea
there are other providers like GCP, Azure, DigitalOcean etc.

~~~
jdc0589
Azure might be great in a year or so, but it makes me uneasy as is. Some of
the services are great, but a lot of them are pretty fragmented. I've had so
many instances where our billing/usage data has just "disappeared" for a few
days, undocumented changes have been made to the formats of
reports/exports/APIs, and official documentation is plain wrong, that I just
can't recommend Azure to anyone. Not to mention they have the most expensive
infrastructure costs of the major players (even with an EA and a decent
monetary commitment); their premium for Windows licensing is the lowest by
far though (not surprising), so it does end up being a cheaper option for
super Windows-heavy shops.

~~~
sidcool
Although I am quite optimistic about Azure, GCP seems like the best bet at the
moment when I think of factors like reliability, performance, availability,
cost and longevity.

------
WDCDev
Our Sydney EC2 DB instance is stuck spinning in the "stopping" state, so we
are basically offline right now. The team is working on getting a new DB
instance set up, but I read that our payment provider, Westpac, is also having
issues. So even if we do get back online, users might not be able to purchase.

What a mess.

------
MasterNayru
For all intents and purposes, we're completely offline at the moment. It's
clearly some serious issue because the icon for EC2 in Sydney on AWS' status
page is yellow, rather than the usual green tick with the small 'i'.

------
25thhour
Still down 5 hours later. ELB won't register instances. Ugh

~~~
inopinatus
The ELB control plane woke up for us about 90 minutes ago; back to flying on
all engines again now.

~~~
25thhour
Multiple ELBs came alive around that timeframe, but our primary ELB has
remained unable to re-register instances. Creating a new ELB as a test and
trying to register new instances from the affected ASG has also failed.

------
PebblesHD
Another confirmation here, all services in our Sydney AZ are down. AWS Support
last mentioned a power failure or similar in AZ1, but some of ours are coming
back online now.

------
DenisM
I see lots of people using ELB for load balancing. Anyone tried using DNS on
top of ELB to spread the load? That might just save you from the extended
downtime.
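To make the suggestion concrete: a weighted DNS record set (e.g. Route 53) in front of multiple ELBs effectively does a weighted pick among whichever balancers are currently healthy. A minimal client-side sketch of that behaviour, with entirely hypothetical ELB hostnames:

```python
import random

# Hypothetical ELB DNS names with traffic weights; weights let you shift
# load away from a balancer that is misbehaving.
ENDPOINTS = {
    "my-elb-sydney.example.com": 3,
    "my-elb-singapore.example.com": 1,
}

def pick_endpoint(endpoints: dict, healthy: set) -> str:
    """Weighted random choice among the endpoints currently marked healthy,
    mimicking what a weighted DNS record set with health checks does."""
    candidates = {host: w for host, w in endpoints.items() if host in healthy}
    if not candidates:
        raise RuntimeError("no healthy endpoints")
    hosts = list(candidates)
    weights = [candidates[h] for h in hosts]
    return random.choices(hosts, weights=weights, k=1)[0]

# If the Sydney ELB stops registering instances, drop it from the healthy
# set and all traffic flows to the surviving endpoint.
healthy = {"my-elb-singapore.example.com"}
print(pick_endpoint(ENDPOINTS, healthy))
```

The usual caveat is DNS TTLs: clients that cache the old record keep hitting the dead balancer until the TTL expires, so this softens rather than eliminates the outage window.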

~~~
dsmithatx
Generally the ELB should have instances in different availability zones which
are data centers miles apart. If your ELB went down creating a new one should
be simple if you can access the region. The problem with high availability and
spreading load is how to deal with your database and recovery.

------
aaratn
Works for me - Sydney Region

~~~
ParadisoShlee
I cannot tell if this is an amazing joke or not.

------
rmdoss
Yes, AWS EC2 (Sydney) is completely offline from what we see. We have almost
10 servers there that have been inaccessible for over an hour.

------
mhealy
Yes, it seems that zone A is completely down, and load balancers seem to be
affected as well.

~~~
nnx
Your zone A might be another customer's zone B. AWS maps availability zones
per account.

See [http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones)

------
schappim
Still having issues with accessing apps hosted on Elastic Beanstalk on AP-
SOUTHEAST-2 Region. Restarting app servers / rebuilding the environment
doesn't make a difference.

------
theathea
Curious if you are all based in Australia, or if the Sydney outage is
affecting other regions?

------
vfulco
Anyone having problems with BJ servers? My site is not running,
[https://www.weisisheng.cn](https://www.weisisheng.cn), and I cannot SSH into
the machine or access the login page to the AWS dashboard.

------
mysteriousmango
We appear to be back online, however all machines have rebooted.

------
25thhour
ap-southeast-2 EC2 appears to be completely offline for us

