Ask HN: Were you able to mitigate the impact of the AWS us-east-1 incident? How? - melor
======
deftnerd
I run my infrastructure on three different providers and use GeoIP-assigned
AnyCast DNS servers from another provider.

Asia/Australia runs on Digital Ocean, Europe is on OVH, and the Americas are
on AWS.

When someone requests the IP address of my site's front-end domain or
static-asset CDN domain, my nameserver determines their geographic location and
returns the IP addresses of the resources closest to them.

I also run health checks, so when S3 (which I use to host my static assets
for the Americas) went down, my nameservers stopped handing out the IP
addresses for the Americas systems and started handing out the IP addresses
for the Europe systems.

When the health checks started succeeding again, everything restored itself.

Due to low DNS TTL values, users in the Americas were only impacted for a few
minutes, and only if the old IP was still cached by their system.
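
Roughly, the answer-selection logic works like this. This is just a minimal
sketch in Python of the idea, not my actual nameserver setup; the region
names, IPs, and health-check URLs are placeholders:

```python
# Sketch of GeoIP-routed DNS answers with health-check failover.
import urllib.request

# Preference order per client region: local pool first, then fall back.
FALLBACK_ORDER = {
    "americas": ["americas", "europe", "apac"],
    "europe":   ["europe", "americas", "apac"],
    "apac":     ["apac", "europe", "americas"],
}

# Hypothetical front-end IPs and health-check endpoints per region.
REGION_POOLS = {
    "americas": {"ips": ["198.51.100.10"], "health_url": "https://us.example.com/healthz"},
    "europe":   {"ips": ["203.0.113.20"],  "health_url": "https://eu.example.com/healthz"},
    "apac":     {"ips": ["192.0.2.30"],    "health_url": "https://ap.example.com/healthz"},
}

def region_is_healthy(region, timeout=2.0):
    """True if the region's health endpoint answers 200 quickly."""
    try:
        with urllib.request.urlopen(REGION_POOLS[region]["health_url"], timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def answer_for(client_region):
    """Return the IPs of the nearest healthy region (served with a low TTL)."""
    for region in FALLBACK_ORDER[client_region]:
        if region_is_healthy(region):
            return REGION_POOLS[region]["ips"]
    return []  # every region failed its health check

if __name__ == "__main__":
    # During the S3 outage the Americas check fails, so Europe IPs get served.
    print(answer_for("americas"))
```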

~~~
mydpy
Which AWS services do you use?

~~~
deftnerd
S3 for static file hosting, EC2 for Caddy web server front ends (2+, depending
on traffic needs), EC2 for a MySQL master (replicated to the Digital Ocean and
OVH MySQL master VPSs), and Elastic Load Balancer
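
For the cross-provider MySQL replication, a lag check along these lines is the
kind of thing that catches a broken or lagging replica (a sketch only, not my
actual tooling; hostnames and credentials are made up, and it assumes PyMySQL):

```python
# Sketch of a replica-lag check for the Digital Ocean / OVH MySQL replicas.
from typing import Optional
import pymysql

REPLICAS = {
    "digitalocean": "mysql-do.example.com",   # placeholder hostnames
    "ovh": "mysql-ovh.example.com",
}
MAX_LAG_SECONDS = 30  # arbitrary alerting threshold

def replica_lag(host) -> Optional[int]:
    """Return Seconds_Behind_Master for one replica, or None if replication is broken."""
    conn = pymysql.connect(host=host, user="monitor", password="***",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row["Seconds_Behind_Master"] if row else None
    finally:
        conn.close()

for name, host in REPLICAS.items():
    lag = replica_lag(host)
    if lag is None or lag > MAX_LAG_SECONDS:
        print(f"ALERT: {name} replica unhealthy (lag={lag})")
```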

------
melor
We host a number of our customers' database systems in us-east-1.

What worked well for us ([https://aiven.io](https://aiven.io)):

\- Architecturally relying on only a few cloud provider services (we only need
VMs, disks, and object storage)

\- Upfront investment in being able to move services from one region to
another without downtime

\- Pre-existing tooling for easily (manually) reconfiguring backup
destinations on the fly (see the sketch after this list)

\- Not running everything on just AWS
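
To illustrate the backup-destination point: the idea is that the uploader
re-reads its destination before every run, so an operator can repoint it
without restarting anything. A rough sketch only, not our actual tooling; the
file name and keys are invented:

```python
# Sketch: backup destination is read from a small config file on every run,
# so an operator can repoint backups at another region/provider on the fly.
import json
from pathlib import Path

CONFIG_PATH = Path("backup_destination.json")  # placeholder config file

def current_destination():
    """Read the destination (provider, region, bucket) chosen by the operator."""
    return json.loads(CONFIG_PATH.read_text())

def upload_backup(archive):
    dest = current_destination()  # re-evaluated for every backup
    print(f"uploading {archive} to {dest['provider']}:{dest['bucket']} ({dest['region']})")
    # the object-storage client call would go here

if __name__ == "__main__":
    # Operator flips us-east-1 -> eu-west-1 by rewriting the config file.
    CONFIG_PATH.write_text(json.dumps(
        {"provider": "aws", "region": "eu-west-1", "bucket": "backups-secondary"}))
    upload_backup(Path("pg_backup.tar.gz"))
```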

What did not work so well:

\- Backups should automatically reroute to a secondary backup site after N
consecutive failures (see the sketch below)

\- Alert spam; we need better aggregation

\- New failure mode: extremely slow EBS access. Some affected VMs were still
kind of working, but very slowly; we need a separate alert trigger for this
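
For the first point, the rule we were missing is simple to state: after N
consecutive failures against the primary site, flip to the secondary. A rough
sketch of that logic (destinations and threshold are placeholders, not our
real configuration):

```python
# Sketch: reroute backups to a secondary site after N consecutive failures.
PRIMARY, SECONDARY = "s3://backups-us-east-1", "s3://backups-eu-west-1"
FAILURE_THRESHOLD = 3  # N consecutive failures before rerouting

class BackupRouter:
    def __init__(self):
        self.destination = PRIMARY
        self.consecutive_failures = 0

    def record_result(self, succeeded):
        """Track upload results and switch destination once the threshold is hit."""
        if succeeded:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD and self.destination == PRIMARY:
            self.destination = SECONDARY
            print(f"rerouting backups to {SECONDARY} after "
                  f"{self.consecutive_failures} consecutive failures")

if __name__ == "__main__":
    router = BackupRouter()
    for ok in (False, False, False):   # e.g. S3 uploads timing out
        router.record_result(ok)
    print("current destination:", router.destination)
```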

