
What to do when your Amazon EBS fails - brettbowman
https://www.lucidchart.com/techblog/2013/03/08/lesson-learned-amazon-went-down/
======
old-gregg
Can I ask: why do people choose to waste their very precious time writing
software to deal with EBS failures (and other AWS deficiencies) instead of
just not using AWS?

The world is full of wonderful alternatives like SoftLayer, for example (no
affiliation). We've been running a very high-traffic system there with
individual box uptimes of 700 days before we had to reboot them during a
scheduled OS upgrade/maintenance window.

The "AWS scaling stories" I hear at dev. meetups make absolutely no sense.
People, AWS original sales pitch was to take care of scaling _for you_ , not
the other way around.

The developers' obsession with AWS amazes me. It's like paying for a BMW and
being proud of carrying around a self-made toolbox in case of an engine
failure in the middle of nowhere, and everyone is proud to "know someone
inside Amazon". Since when is establishing a personal relationship with a car
mechanic seen as "cool"? Why not just pick a quality brand?

~~~
JOnAgain
>> why do people choose to waste their very precious time writing software to
deal with EBS failures (and other AWS deficiencies) instead of just not using
AWS?

1) Proven scale. Netflix. They're bigger than I will be until such time as my
company can afford to buy its own data centers. At that point, I can
re-evaluate.

2) Operational overhead. Startup, running hundreds of hosts, thousands of
requests per second, 5 developers, 0 sys-engineers.

3) Lock-in. We use CloudWatch, Dynamo, RDS, Beanstalk, S3, EMR and more. We
also use Heroku. No one else out there has the full suite of offerings. Not
using AWS would require us to stitch together offerings from several vendors,
and some of the technology wouldn't be feasible if it had to hop across the
internet (do you want the lag of reaching your database in another data
center?).

4) Billing. We get one bill, which we can predict. AWS takes credit cards,
will do Net 30 billing, and is generally pretty flexible.

5) Customer service. Don't get me wrong, support is... not AWS's strong suit.
We also don't spend enough to be a "big fish". However, if we used 10
different providers to put our infrastructure together, we'd be a minnow to
each one. By consolidating, we're a small-to-medium-sized customer, which is
enough to get into betas, get info on their roadmap ahead of time, and get
some personal service.

6) Excuses for downtime. So many high-profile companies are on AWS that if we
have an outage directly attributable to AWS, the odds are a lot of the rest
of the internet is down, too. Seriously. It sounds stupid, but it's a huge
reason. Two years ago we were raising money. Right in the middle of our final
week (when all our investors were visiting our site and making sure they
really wanted to sign on the dotted line), we had an outage that lasted
several days. This was the AWS outage that took out Heroku for a few days,
and ours was one of the last databases to get restored. If we had been using
almost anyone else, investors might have thought "oh, this team doesn't
really know what they're doing, they're not really that technically
competent, I'm not going to invest". But since Netflix, Foursquare, and
dozens of other companies were all offline at the same time, the investors
just thought "they're using AWS; if Netflix can't stay online with hundreds
of engineers, I'm not going to expect this little startup to do better".

7) Could I really do it better? Is anyone else doing better? AWS is the
leader, so we all see the warts... but at least they keep their certs updated
( _ahem_ Microsoft).

~~~
luser001
Serious question: you have hundreds of hosts, but you don't think the savings
from colocating your own hardware would pay for an engineer to take care of
the machines?

~~~
JOnAgain
It would probably be possible to do it cheaper, but there is a tremendous cost
advantage at scale in this business. Then there's the question of focus. Do I
want to have to hire the ops folks? Their managers? Create new pay scales and
review standards? Am I going to be good at attracting or judging talent in a
field that is only tangentially related to the software I do know? Will the
best people in that field want to work for me?

------
lelandbatey
This is a total side question, but I noticed this line:

> The process was stuck in the kernel waiting for an IOCTL, so killing the
> process did nothing – even kill -9. It was going to stay that way until EBS
> was back to normal.

I did not know that could happen! Can anyone provide further explanation or
material on why this is the case?

~~~
ajross
Processes blocked in system calls can be unkillable. In the Linux kernel, what
typically happens is that the process is blocked on a wait queue (what
userspace would usually call a "condition variable" or "monitor") in one of
the wait_event() family of functions. Because the kernel doesn't implement any
kind of exception handling, there is no way to automatically break out of
this; you have to wait for some other task to call wake_up().

It's possible to wait in an "interruptible" way, and many subsystems do. But
that's hard, because it means that the resources you would otherwise be
responsible for cleaning up need to be detected and cleaned up by someone else
(or alternatively someone else needs to pick up and finish what you started,
or you need to be able to roll back a partially completed action, etc...).
Userspace usually ignores this problem (because the task got killed, right?),
but in the kernel a leak is a huge bug.

So the default is that waiting in the kernel means that the kernel can't allow
the process to be terminated (or otherwise be delivered a signal that would
invalidate the stack). And the side effect is that userspace sees an
unkillable process in the "D" state.
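
To make the distinction concrete, here's a minimal sketch in kernel-style C.
The device structure and the "done" condition are invented for illustration;
wait_event() and wait_event_interruptible() are the real primitives described
above:

    #include <linux/wait.h>
    #include <linux/errno.h>

    struct my_dev {
        wait_queue_head_t wq;
        int done;       /* set by the completion path before wake_up() */
    };

    static int wait_for_io(struct my_dev *dev)
    {
        /* Uninterruptible: the task sits in the "D" state until another
         * context sets dev->done and calls wake_up(&dev->wq). Signals,
         * including SIGKILL, are not delivered -- the unkillable case. */
        wait_event(dev->wq, dev->done);
        return 0;
    }

    static int wait_for_io_intr(struct my_dev *dev)
    {
        /* Interruptible: a signal wakes the task early, and the driver
         * must unwind whatever it started before returning. */
        if (wait_event_interruptible(dev->wq, dev->done))
            return -ERESTARTSYS;  /* partial work must be cleaned up */
        return 0;
    }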

------
agwa
I use the iptables reject trick all the time. It's particularly useful when
you have a bunch of servers in a DNS round robin. When one of them adds the
reject rule, clients start failing over to the next server in the DNS round
robin essentially immediately.
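
For reference, the trick is roughly this (port 80 and the exact rule are my
assumption of what's meant; rejecting with a TCP reset makes clients fail fast
instead of hanging on a timeout):

    # Take the box out of rotation: refuse new connections immediately
    iptables -I INPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset

    # Put it back in service by deleting the same rule
    iptables -D INPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset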

------
contingencies
How about using a proven high availability solution such as Corosync/Pacemaker
instead of rewriting a limited implementation from scratch?
<http://corosync.org/>

~~~
druiid
Up-voting you because, well, everyone always forgets about Corosync. About the
only criticism one could have had about it was that when it only supported
multicast, you couldn't use it with EC2 and other providers that (for good
reason) blocked multicast. It can now do UDPU, so it works on EC2.
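
For the curious, a corosync 2.x-style fragment of what that looks like (the
addresses are placeholders, and exact option names vary across corosync
versions, so treat this as a sketch):

    totem {
        version: 2
        # udpu = unicast UDP, usable on EC2 where multicast is blocked
        transport: udpu
    }

    nodelist {
        node {
            ring0_addr: 10.0.0.1
            nodeid: 1
        }
        node {
            ring0_addr: 10.0.0.2
            nodeid: 2
        }
    }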

The other great thing about corosync+pacemaker is that it's super easy to
write your own HA scripts. Need MongoDB? Then write your own script (not sure
why you'd use corosync for this, but you could...) -- see the sketch below.
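
To illustrate how little is involved: Pacemaker resource agents are typically
shell scripts that implement a handful of actions and return OCF exit codes.
A stripped-down sketch (the mongod flags are real, but the paths are
placeholders, and a production agent also needs meta-data and validate-all
actions):

    #!/bin/sh
    # Skeletal OCF-style resource agent; paths are placeholders.
    OCF_SUCCESS=0; OCF_ERR_GENERIC=1; OCF_NOT_RUNNING=7
    PIDFILE=/var/run/mongod.pid

    case "$1" in
        start)
            mongod --fork --pidfilepath "$PIDFILE" \
                   --logpath /var/log/mongod.log || exit $OCF_ERR_GENERIC
            exit $OCF_SUCCESS ;;
        stop)
            [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")"
            exit $OCF_SUCCESS ;;
        monitor)
            # Pacemaker polls this to decide when to fail the resource over
            kill -0 "$(cat "$PIDFILE" 2>/dev/null)" 2>/dev/null \
                && exit $OCF_SUCCESS || exit $OCF_NOT_RUNNING ;;
        *)
            exit 3 ;;  # OCF_ERR_UNIMPLEMENTED
    esac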

------
jperras
Maybe I'm missing something, but why not create a software RAID 1+0 array of
disks (easily doable with EBS volumes and mdadm) instead of relying on a
heartbeat-style solution as described in this post?
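
For concreteness, the mdadm side of that looks something like this (device
names are EC2-style placeholders for four attached EBS volumes):

    # RAID 10 across four EBS volumes (example device names)
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
          /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt/data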

