
AWS Server Issues Take Down Instagram, Vine, Airbnb And IFTTT - WestCoastJustin
http://techcrunch.com/2013/08/25/instagram-vine-and-ifttt-went-dark-thanks-to-amazon-web-services-issues/
======
thezilch
How many times do we have to hear about EBS issues and companies like Reddit
reporting EBS being a disaster before we stop seeing outages from it? Either
Amazon looks into it publicly, or people wise up to the facts and stop
using EBS-backed services. Those include Elastic Load Balancer (ELB),
Relational Database Service (RDS), Elastic Beanstalk, and others. It's just
part of the AWS fabric; you "need" to be in multi-region, multi-AZ, and
architect to be on ephemeral disks that can disappear at any instant.

~~~
mattlong
I agree with you 100%. For those of us needing to run MySQL, what would be the
best ways to minimize the dependency on EBS and maximize HA?

~~~
nknighthb
I haven't implemented this on EC2 myself, but Galera Cluster is a godsend for
MariaDB/MySQL HA. Multi-AZ is probably viable, but multi-region may hurt.

A multi-AZ cluster with ordinary binary log replication to another region
(with preparations to launch a new cluster in that region based on that slave
on short notice) might be a good solution.

~~~
falcolas
You would not want to implement PXC across regions - the round trip time would
be too hard on individual transactions, and the ISTs would be murder on your
bandwidth. Between AZs should be better, but remember with PXC that your
transaction time is limited by your network round trip to the slowest node.
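
To make the round-trip point concrete, here is a rough sketch of the latency model, assuming a commit waits for an acknowledgement from every peer, so the slowest link sets the floor. The function name, node names, and RTT figures are made up for illustration, not measurements.

```python
def commit_latency_ms(local_work_ms, rtt_ms_by_node):
    """Estimate commit latency for a synchronously replicated write.

    Assumption: the commit cannot finish until the slowest peer has
    acknowledged, so the largest round trip is added to the local work.
    """
    slowest_rtt = max(rtt_ms_by_node.values(), default=0.0)
    return local_work_ms + slowest_rtt


# Illustrative round-trip times in milliseconds (hypothetical values).
topologies = {
    "same AZ": {"node-b": 0.5, "node-c": 0.6},
    "multi-AZ": {"node-b": 1.5, "node-c": 2.0},
    "multi-region": {"node-b": 1.5, "node-c": 80.0},
}

for name, rtts in topologies.items():
    print(name, commit_latency_ms(1.0, rtts))
```

One distant node is enough to drag every commit up to its round trip, which is why the cross-region case looks so much worse than the others.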

Regular asynchronous replication is your best bet, with a regular run of
pt-table-checksum scheduled to ensure your data is consistent.
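
The idea behind a checksum pass can be sketched in a few lines: hash the rows chunk by chunk on both sides and flag the chunks that differ. This is only an illustration of the concept on in-memory rows, not how pt-table-checksum itself is implemented; the function names and the tuple row format are assumptions.

```python
import hashlib


def chunk_checksums(rows, chunk_size):
    """Return one SHA-256 hex digest per chunk of rows, in sorted order."""
    ordered = sorted(rows)
    digests = []
    for start in range(0, len(ordered), chunk_size):
        digest = hashlib.sha256()
        for row in ordered[start:start + chunk_size]:
            digest.update(repr(row).encode("utf-8"))
        digests.append(digest.hexdigest())
    return digests


def differing_chunks(primary_rows, replica_rows, chunk_size=2):
    """Return the indexes of chunks whose checksums do not match."""
    primary = chunk_checksums(primary_rows, chunk_size)
    replica = chunk_checksums(replica_rows, chunk_size)
    length = max(len(primary), len(replica))
    # Pad the shorter list so a missing chunk counts as a mismatch.
    primary += [None] * (length - len(primary))
    replica += [None] * (length - len(replica))
    return [i for i in range(length) if primary[i] != replica[i]]


primary = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
replica = [(1, "a"), (2, "b"), (3, "x"), (4, "d")]
print(differing_chunks(primary, replica))
```

Chunking by position is the simplest thing that works here; a real tool would chunk by key range so that one missing row does not shift every later chunk.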

~~~
nknighthb
For those confused (as I was at first), falcolas is referencing Percona XtraDB
Cluster, a particular distribution of MySQL + XtraDB + Galera Cluster.

> _Regular asynchronous replication is your best bet_

That doesn't get you HA in any practical form. I've done the heartbeat thing,
it created more outages by itself than would have occurred with no HA solution
at all, usually failed to kick in during a real failure, and we could never
fix all the split-brain scenarios. Galera is absolute magic by comparison.

Hence my suggestion that a cluster be deployed within one datacenter, with
asynchronous replication to a standby.

~~~
falcolas
Galera is magic, I'll agree, but there are just too many shortcomings for me
to recommend it for most people.

Heartbeat is very problematic, I agree, but it's hardly the only solution out
there (and far from the best solution). That said, I have learned that it's
remarkably configurable, so many of the problems you encountered could
probably be addressed, if you're willing to learn about Pacemaker and really
dig into its configuration.

~~~
nknighthb
I wish you'd elaborate on those shortcomings, because individual transaction
latency is the only real one I've found, and for most workloads, it's not
nearly enough to overcome the HA and CPU scaling benefits.

------
ericmsimons
Interesting that Instagram is still running on Amazon - I wonder when they'll
be hosted on FB's in-house servers.

------
abalone
Amazon claims[1] the problem was isolated to a single AZ. Yet it took down
Instagram, Vine, Airbnb and more.

Simple question: Are none of these major apps properly architected to fail
over to other AZs, or is Amazon lying?

[1] http://status.aws.amazon.com

~~~
mblakele
Well, I spent the afternoon performance-testing EBS-backed instances in
us-west-2. Performance was no worse than usual. So I credit the idea that the
problems were limited to us-east-1, and the companies mentioned should do more
work on cross-AZ resiliency.

~~~
xxpor
us-east-1 is a region, which has multiple Availability Zones.

------
retr0h
This is where things like Chaos Monkey really come in handy. If you want to
run EBS or ELB, I suggest randomly breaking things, until your architecture is
resilient.
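
As a toy version of that kind of drill, here is a sketch that knocks out one randomly chosen AZ and checks whether enough instances are left. It fails a whole AZ rather than a single instance, since that is the failure discussed in this thread. The function name, instance IDs, and the `min_healthy` threshold are all hypothetical, and nothing here touches a real AWS API.

```python
import random


def chaos_round(instances_by_az, min_healthy, rng):
    """Knock out one randomly chosen AZ and report whether we survived.

    Returns (failed_az, survived), where survived is True when at least
    min_healthy instances remain in the other AZs.
    """
    failed_az = rng.choice(sorted(instances_by_az))
    healthy = sum(
        len(ids) for az, ids in instances_by_az.items() if az != failed_az
    )
    return failed_az, healthy >= min_healthy


# Hypothetical deployments: one packed into a single AZ, one spread out.
single_az = {"us-east-1a": ["i-1", "i-2", "i-3", "i-4"]}
multi_az = {"us-east-1a": ["i-1", "i-2"], "us-east-1b": ["i-3", "i-4"]}

rng = random.Random(0)
print(chaos_round(single_az, 2, rng))
print(chaos_round(multi_az, 2, rng))
```

Run it enough times against an honest description of your deployment and the single-AZ layouts fail every round.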

This is one reason I like Linode. They don't offer EBS, so you don't design it
into your system. Also, they are just a really great company to do business
with.

------
skidoo
The paranoid part of me sees an untimely coincidence between this and the
recent outages at Google and Amazon.

~~~
AsymetricCom
Are you saying the Amazon outages may be related to the Amazon outages?

~~~
skidoo
These involve subsidiary hosted services. I was referring to the online
storefront itself. Sorry for not making the clarification! :)

------
zura
Interesting. When you're quite a successful company, why should you be
critically dependent on such third-party services?

I mean, why not have in-house hosting with the appropriate staff? Is this
so much trouble, even for rich companies?

Another option is having several in-house servers in different locations.

~~~
bowlofpetunias
Because when you start designing the architecture of your in-house hosting to
be as flexible and reliable as possible, you end up with something very
similar to AWS, only with a fraction of the resources and experience to manage
it.

It's not like you suddenly stop having technical issues because it's your own
staff and hardware.

~~~
zura
When it comes to your own company, you are aware of more details and plans, so
I believe you can get away with a less flexible but more tailored system.

You won't suddenly stop having technical issues, but at least you'll have
much more control.

------
benologist
Thanks AOL for letting us know that AWS issues affect lots of sites!

