Unless a bug was in all regions it's why it's good to consider multiple cloud se...

tracker1 · on April 12, 2016

And where do you put your system that directs traffic to one cloud and/or another... and what happens when that goes down?

DanielDent · on April 12, 2016

You get an AS number, and announce your own IP space. DNS failover only sort-of works.

Or you subscribe to a "GSLB" service where they do this for you for a significant fee. Or you use a "man-in-the-middle as a service" system like Cloudflare, who do this at an extremely reasonable and/or free cost.

Of course, you still have to deal with the risk of route leaks, BGP route flapping/dampening, and other things which can take your IP addresses offline despite the fact you are multihoming with different carriers in different locations.

So perhaps you set up IP addresses on different ASNs and use both DNS & IP based failover.

But then you find a bug somewhere in your software stack which makes all of this redundancy completely ineffective. So you just take your ball, go home and cry.

tracker1 · on April 12, 2016

Kind of the point... adding more possibilities for failure, at increased complexity and expense, isn't always worth it... and I'd say usually isn't.

jon-wood · on April 12, 2016

You put it in all your clouds, with low TTL DNS entries pointing at all those instances (or the closest one geographically maybe). Then if you're really paranoid you use redundant DNS providers as well.
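For illustration, here is roughly what a client can do when DNS hands back one address per cloud: try each in turn and use the first that answers. This is only a sketch. The addresses are hypothetical (taken from the RFC 5737 documentation ranges), and `first_healthy` and `fake_probe` are made-up names, with the probe standing in for a real TCP or HTTP health check.

```python
from typing import Callable, Iterable, Optional


def first_healthy(endpoints: Iterable[str],
                  probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint whose probe succeeds, or None if all fail."""
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except OSError:
            # Connection errors and timeouts count as "unhealthy"; try the next one.
            continue
    return None


# Hypothetical addresses, one per cloud (RFC 5737 documentation ranges).
ENDPOINTS = ["192.0.2.10", "198.51.100.10", "203.0.113.10"]


def fake_probe(endpoint: str) -> bool:
    # Stand-in for a real health check: pretend the first cloud is down.
    if endpoint == "192.0.2.10":
        raise ConnectionRefusedError("cloud A is down")
    return True


chosen = first_healthy(ENDPOINTS, fake_probe)
print(chosen)
```

Note that this only helps clients that actually retry across the returned addresses; anything that pins the first answer gets no benefit from the extra records.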

packetslave · on April 12, 2016

And then you discover that there are a LOT of craptastic DNS resolvers, middle boxes, AND ISP DNS servers out there that happily ignore or rewrite TTLs. With a high-volume web service, you can have a 1 minute TTL, change your A records, and still see a lovely long tail of traffic hitting the old IP for HOURS.
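A toy model of that long tail, under loudly stated assumptions: each group of resolvers holds the old record for some effective TTL (which may be far longer than the one you published), and cache ages are spread evenly at the moment you change the record. The shares and TTLs below are invented for illustration, not measurements, and `stale_fraction` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def stale_fraction(t: float, resolvers: Sequence[Tuple[float, float]]) -> float:
    """Estimate the share of traffic still sent to the old IP t seconds after a change.

    resolvers is a list of (traffic_share, effective_ttl_seconds) pairs.
    With cache ages spread evenly, a group with effective TTL T still has
    a fraction max(0, 1 - t/T) of its caches holding the old record.
    """
    total = 0.0
    for share, effective_ttl in resolvers:
        remaining = max(0.0, 1.0 - t / effective_ttl)
        total += share * remaining
    return total


# Invented population: 90% of traffic honors a 60 s TTL,
# 10% sits behind resolvers that effectively cache for an hour.
POPULATION = [(0.9, 60.0), (0.1, 3600.0)]

for seconds in (30, 60, 600, 3600):
    print(seconds, round(stale_fraction(seconds, POPULATION), 4))
```

Even in this simplified picture, the well-behaved majority drains within the published TTL while the remaining slice keeps hitting the old address for as long as its resolvers choose to cache.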

tracker1 · on April 12, 2016

The point was that adding another point for potential failure still won't reduce the chance of failure... it's just something else that can and will break.

In any case, failures happen, and most systems are better off being as simple as possible and accepting the unforeseen failures than trying to add complexity to overcome them.
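One way to put rough numbers on the extra-component argument is textbook series/parallel availability arithmetic. The sketch below assumes failures are independent (which the "bug somewhere in your software stack" scenario upthread violates), and the availability figures are made up; `with_director` is a hypothetical helper, not anything from a real library.

```python
def with_director(director: float, cloud: float, n_clouds: int = 2) -> float:
    """Availability of a traffic director in series with n redundant clouds.

    Assumes independent failures: the clouds are in parallel (the service is
    up if any one is up), and the director sits in front of all of them.
    """
    all_clouds_down = (1.0 - cloud) ** n_clouds
    return director * (1.0 - all_clouds_down)


SINGLE_CLOUD = 0.999  # made-up figure for one cloud on its own

# A director no more reliable than a single cloud.
print(with_director(director=0.999, cloud=SINGLE_CLOUD))

# A director ten times more reliable than a single cloud.
print(with_director(director=0.9999, cloud=SINGLE_CLOUD))
```

Under these assumptions the combined system can never be more available than the director itself, so the redundancy only pays off if the thing directing traffic is markedly more reliable than the clouds behind it.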