
S3 and High Availability - mnutt
https://blog.movableink.com/s3-and-high-availability/
======
curun1r
This post seems to perpetuate a common misconception that AWS users make when
it comes to regions. Namely, that the purpose of regions is fault tolerance.

But the guidance we've been given by Amazon is that this is the purpose of
availability zones, not regions. Regions are more appropriate for fighting the
speed of light (i.e. locating your site closer to your users). As an
illustration of this, Amazon told us that the US version of amazon.com runs in
a single region.

Incidentally, the other interesting takeaway from that meeting was to avoid
using autoscaling to respond to failures. This is because provisioning
instances can fail when there's heavy demand and that's frequently the case
when Amazon is experiencing outages in other regions and availability zones.
Instead, we've been urged to provision 150% of what we need (50% in each of
three AZs) so that if any one AZ goes down, we can still handle all our
traffic. Where autoscaling works well is in responding to spikes in our own
need rather than situations where many Amazon customers will have need.
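
The 150% figure generalizes: with N AZs, each AZ has to carry 1/(N-1) of the
required capacity so that the survivors still add up to 100%. A rough sketch
of that arithmetic, assuming identical instances spread evenly across AZs
(the function names are made up for illustration):

```python
import math


def instances_per_az(needed: int, az_count: int) -> int:
    """Instances to run in each AZ so that losing any one AZ still leaves `needed`."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    # The remaining (az_count - 1) AZs must cover the full requirement.
    return math.ceil(needed / (az_count - 1))


def overprovision_ratio(needed: int, az_count: int) -> float:
    """Total provisioned capacity as a multiple of what is actually needed."""
    return instances_per_az(needed, az_count) * az_count / needed
```

With 12 instances needed and 3 AZs this gives 6 per AZ, 18 in total, which is
the 150% above. With only 2 AZs the same rule means running 200%.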

Sorry for the digression, but I found that consultation interesting and it's
clear that others have the same misconceptions that I had before learning the
thought process behind the building blocks that AWS gives us.

~~~
pyre
But Amazon has had regional failure before. How do availability zones within a
region help, if Amazon has a catastrophic failure that is local to a region?

~~~
namecast
At that point, you'll need to have multiple regions in play, and some sort of
mechanism to direct traffic to regions when one becomes unreachable (for most
people on AWS, this will be Route53).

Devil's advocate, though: if you're concerned about what to do in the event of
an AWS regional failure, given how rare an event that is, then you've probably
outgrown AWS.

(For most small-to-medium sized startups, I'd advocate setting up
statuspage.io and keeping your users informed if you're single homed to an AWS
region and that region experiences a catastrophic failure. The math on "how
much money you'll lose from the outage" vs. "how much you need to spend
implementing proper DR, better than what AWS has in place to keep a region up
and running" isn't even close, assuming, say, one 8-hour regional outage
every ~2 years.)
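
To make that comparison concrete, here is a back-of-the-envelope sketch. The
outage frequency is the assumption stated above (one 8-hour outage every ~2
years). The dollar parameters are hypothetical inputs you would fill in for
your own business, not figures from this thread:

```python
HOURS_PER_YEAR = 365 * 24


def expected_annual_outage_hours(outage_hours: float, years_between: float) -> float:
    """Average outage hours per year for one outage every `years_between` years."""
    return outage_hours / years_between


def availability(outage_hours: float = 8, years_between: float = 2) -> float:
    """Fraction of the year the region is up under the assumed outage rate."""
    return 1 - expected_annual_outage_hours(outage_hours, years_between) / HOURS_PER_YEAR


def dr_pays_off(loss_per_outage_hour: float, annual_dr_cost: float,
                outage_hours: float = 8, years_between: float = 2) -> bool:
    """True if the expected yearly outage loss exceeds the yearly cost of multi-region DR."""
    expected_loss = loss_per_outage_hour * expected_annual_outage_hours(
        outage_hours, years_between)
    return expected_loss > annual_dr_cost
```

Under that assumption the region is up roughly 99.95% of the time (4 expected
outage hours per year), so DR only wins on paper when an hour of downtime
costs more than a quarter of the annual DR budget. This ignores harder-to-price
effects such as churn after a bad day.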

~~~
jewel
A B2B startup of any size will have a very hard time if those 8 hours are
during business hours. Depending on the nature of the service, of course.

Where I work now (a small web-based software-as-a-service) an 8-hour outage
would be catastrophic, and could easily kill the business. The switching cost
for our niche is small, so one bad day could cost us 20% of our clients. We're
not running with a 20% profit yet, so at best it'd mean an immediate layoff or
across-the-board temporary pay cut.

Luckily, because we're small we can run on a single LAMP server. We're working
on making it so that we can migrate that to any region in EC2 with a single
command, as well as making sure we can switch to a different dedicated hosting
provider with minimal time.

------
mnutt
By the way, turning on S3 bucket versioning is safe in that your objects will
get served exactly the same over HTTP. However, with many of the AWS SDKs you
will start receiving an S3::VersionedObject rather than an S3::Object. From
there you can get the S3::Object but the VersionedObject does not have all of
the object's properties.

------
helper
I really hope we get a postmortem for the outages on Monday. S3 has
historically been one of the most reliable AWS offerings so it will be
interesting to hear what happened.

~~~
badmadrad
Me too. We have noticed a general increase in error rates to S3 over the last
month. I wouldn't be surprised if they were battling some ongoing issue that
reached a tipping point.

~~~
ak217
It seems more likely that this fell out of recent advertised updates that
offer read-after-write consistency in US STANDARD.

------
anh79
I'm thinking of putting S3 behind a Cloudflare setup and using Cloudflare's
"Always Online" feature.

Sound good? (Woh, as long as Cloudflare doesn't have any SSL issues :D)

~~~
gphil
I think this would be a pretty good approach, especially when combined with
the author's strategy.

~~~
Gigablah
I imagine you'd have to implement some sort of cache priming as well?

------
jedberg
FYI you can target the datacenter you want for S3's "standard" region and
force it to always use Virginia by targeting s3-external-1.amazonaws.com
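
A minimal sketch of what targeting that hostname looks like when building
object URLs by hand. The hostname is the one above; the helper name, the
bucket and key, and the use of path-style addressing (bucket in the path
rather than the hostname) are assumptions for illustration:

```python
from urllib.parse import quote

# Hostname taken from the comment above; pins requests to Virginia.
VIRGINIA_ENDPOINT = "s3-external-1.amazonaws.com"


def virginia_url(bucket: str, key: str) -> str:
    """Build a path-style HTTPS URL for an object via the Virginia endpoint."""
    # quote() leaves "/" alone by default, so nested keys keep their structure.
    return f"https://{VIRGINIA_ENDPOINT}/{quote(bucket)}/{quote(key)}"
```

If you use an SDK instead, most let you override the endpoint in the client
configuration rather than constructing URLs yourself.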

------
mmaedler
As a bloody AWS newbie I wanted to clarify one thing: You're talking about a
Source S3 Bucket and a Destination Bucket. So in case the Source Bucket fails
you also do your Writes against the Destination Bucket and they will get
replicated once the former Source Bucket comes back online (two way sync)? Or
am I mistaken about something here? Thanks for the clarification!

~~~
mnutt
It actually looks like it may be possible to cross-sync back and forth between
two buckets, but I haven't tried it. In our case we're ok with going into
read-only mode for a bit.

------
lexicalscope
It's amusing that they relied on a write followed by an immediate read to see
if the updates were immediately consistent, since S3 is only eventually
consistent even if you're only using one region (with the exception of
certain utilities like the Import/Export tool), unless I'm missing something?

~~~
revertts
There's read after write consistency for new objects, eventual consistency for
overwriting existing objects.

Historically this applied to all regions except US Standard, but now that too
supports it if you go through the VA endpoint instead of the global endpoint.

~~~
lexicalscope
That makes sense. In this article it looked like they were using a
modification to test the consistency, which should always be eventually
consistent. Maybe I'm misunderstanding? Regardless, interesting.

~~~
mnutt
Yeah, I was operating under the assumption that it was eventually consistent
and just found it curious that it converged way faster than I expected. (until
I read the explanation)

~~~
lexicalscope
That makes sense - thanks for clarifying.

------
ilkkao
Is there a rough estimate of how much more (%) you need to pay if you
duplicate all data but almost never access the copy?

~~~
mnutt
It depends on how much your data gets accessed. Your data storage costs
(~$0.03/GB) double, but your transfer out costs (~$0.05-$0.09/GB) stay the
same. The replication cost is often pretty negligible compared to regular
traffic.
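
As a rough sketch of that estimate: the extra cost is one more copy of the
storage bill, measured against storage plus transfer. The default prices are
the approximate per-GB figures above. The function name is made up, and it
deliberately ignores replication traffic and request fees, which are assumed
to be negligible here:

```python
def replication_cost_increase(stored_gb: float, transfer_out_gb: float,
                              storage_price: float = 0.03,
                              transfer_price: float = 0.09) -> float:
    """Fractional bill increase from keeping a second, rarely read copy."""
    storage = stored_gb * storage_price          # doubles with a second copy
    transfer = transfer_out_gb * transfer_price  # unchanged by the copy
    return storage / (storage + transfer)
```

For example, 1000 GB stored and 1000 GB transferred out in the same billing
period works out to a 25% increase. If you barely serve any traffic, the
increase approaches 100%.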

------
aftbit
Wait, doesn't putting this all behind CloudFront make that a SPOF for your
system?

