
4x Hour EC2 Outage for the Entire AWS Sydney Region - LilBytes
https://phd.aws.amazon.com/phd/home#/dashboard/open-issues?eventID=arn:aws:health:ap-southeast-2::event/EC2/AWS_EC2_OPERATIONAL_ISSUE/AWS_EC2_OPERATIONAL_ISSUE_RFMRO_1579740075&eventTab=details&layout=vertical

Not yet resolved, currently impacting:

RDS
WorkSpaces
ELB
AppStream 2.0
Lambda
ElastiCache
EC2
======
gelatocar
Latest update from AWS:

[08:49 PM PST] We wanted to provide you with more details on the issue causing
increased API error rates and latencies in the AP-SOUTHEAST-2 Region. A data
store used by a subsystem responsible for the configuration of Virtual Private
Cloud (VPC) networks is currently offline and the engineering team are working
to restore it. While the investigation into the issue was started immediately,
it took us longer to understand the full extent of the issue and determine a
path to recovery. We determined that the data store needed to be restored to a
point before the issue began. In order to do this restore, we needed to
disable writes. Error rates and latencies for the networking-related APIs will
continue until the restore has been completed and writes re-enabled. We are
working through the recovery process now. With issues like this, it is always
difficult to provide an accurate ETA, but we expect to complete the restore
process within the next 2 hours and begin to allow API requests to proceed
once again. We will continue to keep you updated if that ETA changes.
Connectivity to existing instances is not impacted. Also, launch requests that
refer to regional objects like subnets that already exist will succeed at this
stage, as they do not depend on the affected subsystem. If you know the subnet
ID, you can use that to launch instances within the region. We apologize for
the impact and continue to work towards full resolution.
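
For anyone wanting to try the workaround AWS describes (launching against a subnet ID you already know), here is a minimal sketch using boto3's `run_instances` with an explicit `SubnetId`. The helper names, the default instance type, and the placeholder IDs are my own assumptions for illustration, not anything from the AWS update, and I have not verified that it succeeds during this incident.

```python
def build_launch_request(subnet_id, image_id, instance_type="t3.micro"):
    """Build keyword arguments for EC2 RunInstances pinned to an existing subnet.

    Hypothetical helper: passing SubnetId explicitly means the launch refers
    to a regional object that already exists, as the AWS update suggests.
    """
    if not subnet_id.startswith("subnet-"):
        raise ValueError(f"not a subnet ID: {subnet_id!r}")
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "SubnetId": subnet_id,
    }


def launch(subnet_id, image_id, region="ap-southeast-2"):
    """Launch one instance into the given subnet (requires boto3 and credentials)."""
    import boto3  # imported lazily so build_launch_request works without the SDK

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.run_instances(**build_launch_request(subnet_id, image_id))
```

Usage would be something like `launch("subnet-...", "ami-...")` with your own IDs filled in.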

~~~
LilBytes
Thank you for this. Our TAM confirmed the same a few hours ago.

Is it me, or is AWS outage messaging always really evasive?

Error rates and API latency? I mean, 100% is technically a 'rate' between 1
and 100.

------
wryun
The frustrating thing is this is also taking out other services (ECS Fargate,
CodeBuild) that presumably can't allocate capacity, and they don't admit it on
the status page (hey, it's still a 200!).

My favourite error of today:

{
  'ResponseMetadata': {
    'HTTPHeaders': {
      'content-length': '135',
      'content-type': 'application/x-amz-json-1.1',
      'date': 'Thu, 23 Jan 2020 04:16:02 GMT',
      'x-amzn-requestid': '...',
    },
    'HTTPStatusCode': 200,
    'RequestId': '...',
    'RetryAttempts': 0,
  },
  'failures': [
    {'reason': 'Capacity is unavailable at this time. Please try again later or in a different availability zone'},
  ],
  'tasks': [],
}
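
Since the HTTP status is 200 here, anything that only checks the status code will treat this as a success. A rough sketch of checking the response body instead, assuming a response dict shaped like the one above; `TaskLaunchError` and `tasks_or_raise` are names I made up for illustration:

```python
class TaskLaunchError(RuntimeError):
    """Raised when a run-task style response reports failures or starts nothing."""


def tasks_or_raise(response):
    """Return the started tasks, or raise if the body reports a failure.

    The HTTP status alone is not enough: the response above is a 200 with
    a non-empty 'failures' list and an empty 'tasks' list.
    """
    failures = response.get("failures", [])
    tasks = response.get("tasks", [])
    if failures or not tasks:
        reasons = "; ".join(f.get("reason", "unknown") for f in failures)
        raise TaskLaunchError(reasons or "no tasks started")
    return tasks
```

Wrapping the call site in this turns a silent "successful" empty result into an exception you can alert or retry on.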

------
gelatocar
This is impacting us as our application relies heavily on Lambdas connecting
to RDS via VPC. It has been down on and off since about 12pm AEST today. There
doesn't seem to be any easy way for us to resolve or work around it without
just waiting for AWS to fix it.

------
quixquaxqux
Better link? [https://status.aws.amazon.com/](https://status.aws.amazon.com/)

~~~
LilBytes
That has zero detail. I thought the URL I provided had more, albeit you need
to log in.

------
borplk
Obligatory reminder that if this kind of "X hour downtime per year" is such a
big problem for your business you can go multi-AZ and multi-region.

If you choose not to, that's a valid and often the right choice, but then don't
panic and scream when something fails. It's a trade-off decision that you have
to make. You can't have your cake and eat it too.

Either you care enough to design, build, and test that kind of fault tolerance
or you don't. If you do you get its benefits during these times. If you don't
then just own it and put up with the downtimes a few times per year.
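
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes failures in each deployment are independent, which is optimistic (a region-wide event like this one takes out every AZ in the region at once), and the 4 hours per year figure is just an example input, not a measured value.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(downtime_hours_per_year):
    """Fraction of the year a single deployment is up."""
    return 1 - downtime_hours_per_year / HOURS_PER_YEAR


def combined_downtime_hours(downtime_hours_per_year, copies):
    """Expected hours per year when all `copies` deployments are down at once.

    Assumes each copy fails independently with the same probability,
    which real correlated outages violate.
    """
    p_down = downtime_hours_per_year / HOURS_PER_YEAR
    return HOURS_PER_YEAR * p_down ** copies


if __name__ == "__main__":
    print(f"single deployment availability: {availability(4):.5%}")
    print(f"two independent copies, hours down/year: {combined_downtime_hours(4, 2):.6f}")
```

Under those assumptions a second independent copy turns hours of downtime per year into seconds, which is the benefit you are paying for with the extra design, build, and test effort.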

