
Amazon EC2 Issues in Frankfurt AZ - mrmattyboy
https://status.aws.amazon.com/#Europe
======
intsunny
It is baffling that the Europe tab for AWS' Status page posts timestamps in
PST. What the fuck is PST?

We have the ability to do everything and anything in JavaScript EXCEPT for
fucking UTC or localized time zones.

Is PST the new UTC? Is there an RFC for this? Did I miss the memo?

Un-fucking-real.
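
For anyone stuck doing the conversion by hand, here is a minimal Python sketch. It assumes fixed offsets (PST as UTC-8, CET as UTC+1 for Frankfurt in winter) and a hypothetical date of 2019-11-12, since the status page entries only give a time of day. The helper name `from_status_page` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, stated as assumptions: PST is UTC-8, CET (Frankfurt in winter) is UTC+1.
PST = timezone(timedelta(hours=-8), "PST")
CET = timezone(timedelta(hours=1), "CET")


def from_status_page(stamp: str, day: str) -> datetime:
    """Parse a status-page time such as '12:08 AM' on an ISO date, read as PST."""
    naive = datetime.strptime(f"{day} {stamp}", "%Y-%m-%d %I:%M %p")
    return naive.replace(tzinfo=PST)


# The date is an assumption; the status page only shows the time of day.
posted = from_status_page("12:08 AM", "2019-11-12")
as_utc = posted.astimezone(timezone.utc)
as_cet = posted.astimezone(CET)

print(as_utc.isoformat())
print(as_cet.isoformat())
```

A fixed offset keeps the sketch free of any time zone database dependency, at the cost of ignoring daylight saving time.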

~~~
sm4rk0
That's due to the general ignorance about the world outside the US.

~~~
lxgr
Outside the Bay Area, really.

~~~
sargun
AWS is based in Seattle.

~~~
lxgr
Sorry, Bay Area plus Seattle ;)

------
carlsborg
This is a non-issue.

"We are experiencing elevated API error rates and network connectivity errors
in a single Availability Zone."

Key fact: "A SINGLE AZ". Availability Zones are isolated from each other, with
redundant power supplies and internet connectivity, and most often sit in
physically separate datacenter locations. Well-architected applications are
designed to tolerate a single AZ becoming unavailable. This is precisely why the cloud
is useful: you can bring up new capacity in the other Availability Zones
behind the same load balancer with zero effort - the autoscaling does that for
you automatically.
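
One way to put numbers on "designed to allow for a single AZ to become unavailable" is to size each AZ so the survivors can carry the full load on their own. This is a rough sketch of that arithmetic, not anything AWS does for you; the function name and the example figures are assumptions.

```python
import math


def instances_per_az(required: int, az_count: int, az_failures: int = 1) -> int:
    """Instances to run in each AZ so `required` remain after losing `az_failures` AZs."""
    surviving = az_count - az_failures
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    return math.ceil(required / surviving)


# Hypothetical fleet: 12 instances needed at peak, spread over 3 AZs.
per_az = instances_per_az(required=12, az_count=3)
print(per_az, per_az * 3)
```

The trade-off is paying for idle headroom up front instead of relying on new capacity launching during the outage.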

~~~
gjtempleton
Except it's not currently handling this nicely. We've got an ASG behind an
ELB, spread across multiple AZs including the affected one, and as the ASG
scales up it never sees the new instances coming into service; they just
stick in "Not yet in service".

~~~
carlsborg
I spun up a new t2.small instance in eu-central-1 and it took less than 90
seconds, which is more than it usually takes, but not that bad.

[https://pastebin.com/WC1hkh0c](https://pastebin.com/WC1hkh0c) (see ts on
uptime command, ignore timezone differences)

------
Roritharr
One of our marketing pages was affected; we had migrated it to ECS in a
non-HA configuration. Our main applications are set up in Multi-AZ HA configs
and weren't affected by this.

Not a big deal to remediate as all the other AZs are working.

------
nrki
Looks like this took down TransferWise (debit card/forex/payments processor):
[https://status.transferwise.com](https://status.transferwise.com)

------
fyfy18
I wonder if this is related to the TransferWise issues; their app and even
card payments have been down all morning.

[https://twitter.com/TransferWise/status/1194168200210124800?...](https://twitter.com/TransferWise/status/1194168200210124800?s=20)

------
mangatmodi
This year has been really bad for the cloud vendors in terms of stability.

~~~
StreamBright
This is a single AZ. Your application should tolerate single AZ outages. That
is the first rule of building reliable, highly available services on AWS.

    
    
       12:08 AM PST We are investigating increased network connectivity errors to instances in a single Availability Zone in the EU-CENTRAL-1 Region.

~~~
mvanbaak
True, but autoscaling at the moment is unable to scale in/out because of this
issue. If your autoscaling group includes the affected AZ, you are out of luck
because those instances are being terminated since their health checks fail.
But because of the failures, autoscaling is unable to complete the
termination, and it cannot launch new instances in the other AZs because it
is stuck on the terminating step.

~~~
StreamBright
That is an interesting detail. Would you consider having 3 separate
autoscaling groups (one per AZ), or is this not feasible for some reason? One
of the interesting aspects of running services on AWS was being able to
remove an AZ from "rotation" while the outage lasts, meaning a DNS change
that excludes its public endpoint from taking any traffic. If you have 3
separate groups, each doing 1/3 of the load and each with its own DNS entry,
autoscaling group, etc., then moving traffic from one AZ to another is
probably easier. Not sure about the details of your setup though; you might
have reasons not to do this.
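
To make the "remove the AZ from rotation" idea concrete, here is a rough Python sketch of the weighting logic only. It does not call any AWS API; the `AzEndpoint` type, the record names, and the equal-weight policy are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class AzEndpoint:
    az: str         # e.g. "eu-central-1a"
    dns_name: str   # hypothetical per-AZ DNS record
    healthy: bool


def dns_weights(endpoints: list[AzEndpoint]) -> dict[str, int]:
    """Weight 1 for each healthy AZ, weight 0 for each unhealthy one."""
    if not any(e.healthy for e in endpoints):
        # Nothing to fail over to: keep everything in rotation
        # rather than pointing DNS at no endpoint at all.
        return {e.dns_name: 1 for e in endpoints}
    return {e.dns_name: (1 if e.healthy else 0) for e in endpoints}


def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of traffic each record receives under weighted routing."""
    total = sum(weights.values())
    if total == 0:
        return {}
    return {name: w / total for name, w in weights.items()}


# Hypothetical records, with the first AZ marked as the affected one.
endpoints = [
    AzEndpoint("eu-central-1a", "a.example.internal", healthy=False),
    AzEndpoint("eu-central-1b", "b.example.internal", healthy=True),
    AzEndpoint("eu-central-1c", "c.example.internal", healthy=True),
]
weights = dns_weights(endpoints)
print(traffic_share(weights))
```

With one of three AZs out, the remaining two each take half the traffic, so each per-AZ group needs room to grow from 1/3 to 1/2 of the load.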

~~~
mvanbaak
A setup with an ASG per AZ makes it a lot harder to do real autoscaling based
on load/mem/connections. If it were purely for running a fixed number of
instances equally spread across AZs this would probably work, but not in our
setup where we have unpredictable traffic and load patterns.

[edit] I know it can be done with combined metrics etc, but it would make it a
lot more complicated ;-)
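
A rough sketch of what "combined metrics" could look like with one group per AZ: average CPU over the whole fleet, derive one desired total, then split it across the AZs still considered healthy. The proportional scaling rule and all names here are assumptions, not how AWS autoscaling computes anything.

```python
import math


def fleet_average_cpu(groups: dict[str, tuple[int, float]]) -> float:
    """Instance-weighted average CPU over all groups: az -> (instances, avg CPU %)."""
    total = sum(n for n, _ in groups.values())
    if total == 0:
        return 0.0
    return sum(n * cpu for n, cpu in groups.values()) / total


def desired_per_az(groups: dict[str, tuple[int, float]],
                   target_cpu: float,
                   healthy_azs: set[str]) -> dict[str, int]:
    """Scale the fleet proportionally to target_cpu and spread it over healthy AZs."""
    if not healthy_azs:
        raise ValueError("no healthy AZ to place instances in")
    total = sum(n for n, _ in groups.values())
    desired_total = max(1, math.ceil(total * fleet_average_cpu(groups) / target_cpu))
    azs = sorted(healthy_azs)
    base, extra = divmod(desired_total, len(azs))
    # The first `extra` AZs (alphabetically) take one more instance each.
    return {az: base + (1 if i < extra else 0) for i, az in enumerate(azs)}


# Hypothetical readings: 4 instances per AZ at 90%, 60% and 30% CPU.
groups = {"a": (4, 90.0), "b": (4, 60.0), "c": (4, 30.0)}
print(desired_per_az(groups, target_cpu=50.0, healthy_azs={"a", "b", "c"}))
print(desired_per_az(groups, target_cpu=50.0, healthy_azs={"b", "c"}))
```

Something still has to run this logic and push the per-group capacities, which is exactly the extra moving part being complained about.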

~~~
StreamBright
It is most certainly more complicated. We were OK with putting up with that
additional complication because service reliability (especially tolerating
single AZ outages with ease) ranked higher in our requirements than avoiding
complication. :)

