

20+ hour outage due to EC2/EBS on BitBucket - zx
http://www.bitbucket.org/

======
gfodor
Unfortunately this doesn't sound like an EBS issue but a systems architecture
flaw. I apologize in advance if my analysis is wrong here, but I think it's
important to understand.

We use EBS extensively in our infrastructure. Occasionally an EBS volume
will fail. We once ran into an issue where a volume had a spike in IO load,
which I think was also the case here.

EBS volumes are not magic, they are just chunks of physical disks. Disks fail.
You should have the system architected so that you can handle such failures.

Here are a few things you can do (we don't do all of these, but enough to
ensure we won't lose data or have downtime in the case of a failure); a
rough sketch of the last two is shown after the list.

- Mount several EBS volumes and use RAID.

- If it's a database, set up a failover node on a separate EBS volume.

- Take regular snapshot backups.

- Take regular full backups to S3.
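
For the last two, something along these lines does the job; a minimal
sketch using boto (just one way to script AWS, and the volume ID, bucket
and paths here are made up):

    import boto

    ec2 = boto.connect_ec2()   # picks up AWS credentials from the environment
    s3 = boto.connect_s3()

    # Snapshot the EBS volume holding the data (hypothetical volume ID).
    snapshot = ec2.create_snapshot('vol-12345678', 'nightly backup')
    print('started snapshot ' + snapshot.id)

    # Also push a full dump to S3, so recovery doesn't depend on EBS at all.
    bucket = s3.get_bucket('example-backups')
    key = bucket.new_key('db/backup-latest.dump.gz')
    key.set_contents_from_filename('/backups/backup-latest.dump.gz')

Cron that nightly and the worst case is losing whatever changed since the
last run.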

It's also very important to have everything highly automated. If an EBS
volume fails for us, it's one command to switch to the failover node. If
that doesn't work, it's another command to spin up a new machine off the
last available snapshot, with a few hours of data loss in the worst case.
Everything is closely monitored with Nagios + Ganglia so we know when bad
stuff happens.
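
That restore command is essentially a short script like the following; a
rough sketch, not our actual tooling, and the size, zone, instance ID and
device name are made up:

    import boto

    ec2 = boto.connect_ec2()

    # Find the most recent completed snapshot we own.
    snaps = [s for s in ec2.get_all_snapshots(owner='self')
             if s.status == 'completed']
    latest = max(snaps, key=lambda s: s.start_time)

    # Build a fresh volume from it in the replacement instance's zone and
    # attach it so the instance can mount the data and take over.
    vol = ec2.create_volume(100, 'us-east-1a', snapshot=latest.id)
    ec2.attach_volume(vol.id, 'i-0abc1234', '/dev/sdf')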

The two or three times we've had issues with EBS we were able to either switch
to a failover node or take a snapshot and mount a new volume from there. I
haven't set up RAID on EC2, but I'd imagine this is also a very good route to
protect your data.

Remember, the cloud isn't magic. The only advantage you get with the cloud is
rapid provisioning and unlimited capacity if you need it. You still have to
build a shared nothing, reliable architecture within the framework the cloud
gives you. We've found EC2 and EBS to work out very well, but of course
there were growing pains as we learned very quickly where our single points
of failure were! I get the sense that the reliability of individual
resources such as instances or volumes is, on the whole, lower than what
you'd expect from a standard hosting provider, whatever the reason may be.

Edit: Of course, you could also load your data into a distributed data
store like Cassandra, which handles some of this failover and replication
magic automatically.

~~~
jespern
You're absolutely right that the cloud is not magic, but you do get some
guarantees with EBS. From their website:

"Each storage volume is automatically replicated within the same Availability
Zone. This prevents data loss due to failure of any single hardware
component."

We don't keep the database on the same EBS, and we have segmented database
traffic out to several EBS volumes (for WAL, etc.) That's not the issue.

We take regular snapshot backups. We didn't lose any data. We have everything,
we just can't get to it.

Regardless of what might make sense in this situation, it's not working for
us. We've moved both our instances _and_ the volumes to different availability
zones, to no avail.

I just received a call from AWS engineering, assuring us that we are currently
their top priority, and a team of engineers are working to fix the problem.
They're seeing the issue on their end, and fortunately for them, it seems
rather isolated to our instance.

Could we have taken precautions to prevent this problem? Maybe. We hadn't,
because we didn't anticipate a problem as exotic as this one. The only way to
keep persistent data on EC2 is using EBS, and right now, it doesn't work for
us, at all. This is not a common problem that could've been solved with
backups or snapshots, or whatever.

~~~
keefe
>The only way to keep persistent data on EC2 is using EBS, and right now, it
doesn't work for us, at all.

S3 should work too? Unless it was a global EBS failure, you should be able
to restore from any backup to a new set of instances and stores. Why doesn't
that work?

~~~
jespern
...As data you can access as a filesystem. S3 is great, but pretending it's a
filesystem is going to get you awful performance.

As I said, our data is not lost, we have snapshots and backups, it's sitting
right there on the mount, we're just not getting any sort of acceptable
throughput. New instances do not fix the problem.

Ironically, we were looking into having S3 as the backend for our data, for
scalability/redundancy purposes, but this pretty much puts a stop to that.

~~~
keefe
Oh, I wasn't suggesting pretending it's a file system - I had been thinking of
a place to dump the data for backups, thinking fresh instances + fresh EBS
would solve the problem. I think you answered this already in the other post -
that you booted a new instance and a new EBS with some backup and the problem
remained?? This seems like such a horrendous failure on AWS' part, unless it
has something to do with how you are accessing the EBS (too many connections
or something). I could understand if a given EBS fails, but if you can restore
the data from an independent backup and spin back up with new instances and
new EBS this indicates a very concerning systemic problem in EBS!

------
jespern
I'm here to answer questions if there are any (I run Bitbucket.)

~~~
jespern
It was fixed around 4am (GMT+2) last night, with the assistance of Amazon. I'm
just going to summarize what happened here:

We were attacked. Massive UDP DDOS. The flood of traffic prevented us from
accessing our EBS store with any acceptable speeds, which is what caused
everyone to think the problem was between our EC2 and the EBS. Of course this
also explains why booting up a new instance and EBS didn't help anything.

Also, it's happening again now, and we're working with Amazon to remedy it
once more.

~~~
tlrobinson
Is there anything Amazon could have done to prevent this (or at least made
diagnosing it easier), or is it a problem with your particular application?

~~~
jespern
We're talking UDP flood here, saturating our bandwidth. It never reached our
servers; it just ate all the bandwidth on our connection. I guess what
Amazon could have done is spot the DDOS more quickly and take measures to
prevent it.

~~~
spudlyo
So you never saw any evidence of this DDOS yourself? I'm somewhat skeptical of
this explanation. It seems to me with shared infrastructure it'd be difficult
to saturate just one customer's connection. It also doesn't make sense to me
that this could be done without the traffic ever reaching your server. You
used the phrases "our bandwidth" and "our connection"; do things really
work this way on the AWS cloud?

Anyway, I'm really sorry you guys had to go through all of this, and I hope
whatever it is that caused it is fixed.

------
shizcakes
Ugh. I am working on the next gen architecture for our site, and I wanted to
focus on hosting it in a cloud - but all these outages give me no confidence
that cloud-based hosting is really all that ready for primetime yet.

~~~
neurotech1
IMHO AWS Windows instances may not be ready for prime time. This issue
doesn't seem to affect Linux-based instances.

There are other cloud providers like the YC favorite, SliceHost - I have no
direct experience with them but may soon try them out.

~~~
jrockway
Slicehost is not really a "cloud provider". You buy a VPS and use it
"forever".

~~~
jbellis
Well, it's both. Rackspace Cloud Servers (basically, Slicehost post-
acquisition) has an EC2-like api to spin up and down servers on demand, billed
by the hour.

------
durana
Everything fails. Design systems that minimize the impact failures have on
your customers. Moving to another provider isn't going to fix the problem.
Data like this should be stored in more than one place.

~~~
wmf
In reality, replicating across two clouds is difficult and expensive. It's
quite possible that Bitbucket wouldn't exist at all if they had used such an
architecture.

~~~
joevandyk
I'm looking at cassandra as a possible solution for things like this.

~~~
jbellis
Seems like a popular theme right now -- one of the github guys is working on a
cassandra git backend at <http://github.com/schacon/agitmemnon>, and paul
querna of ASF infrastructure is looking at doing the same for svn.

------
rogerthat
Despite this, Bitbucket is great. Private repository with a free account
option - you don't get that on GitHub.

~~~
jrockway
OTOH, since Github has some of my money, they have a bit more of an obligation
(and incentive) to keep their servers up. Free services come and go as the
owner pleases. (Hello, ma.gnol.ia.)

~~~
tomjen2
Not really - Bitbucket has gotten quite a bit of my money.

And yes, I am considering changing that, but mostly because I can't find a
good Mercurial client for Windows.

------
tve
From what I can piece together, it seems the real problem isn't EBS, it's
that the security groups are implemented at the host level (the machine on
which your instance runs). This means that the UDP flood reached your host,
where it got dropped by the security group rules, but it still had a
performance impact, on EBS in your case, just because of the sheer volume
of packets. The trouble was that nobody could see these packets and
diagnose the problem correctly. If you had temporarily allowed all traffic
into your security group and done a tcpdump, you'd have gone "whoa!" and
headed in the right direction to fix the problem. Interesting...
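
To make that concrete, the temporary allow-all could look roughly like
this with boto (a sketch; the 'web' group name is hypothetical, and the
rule should be revoked as soon as the capture is done):

    import boto

    ec2 = boto.connect_ec2()

    # Temporarily allow all UDP into the (hypothetical) 'web' security group
    # so the flood traffic actually reaches the instance and is visible.
    ec2.authorize_security_group(group_name='web', ip_protocol='udp',
                                 from_port=0, to_port=65535,
                                 cidr_ip='0.0.0.0/0')

    # On the instance itself, "tcpdump -n -i eth0 udp" would then make a
    # flood obvious. Close the hole again afterwards:
    ec2.revoke_security_group(group_name='web', ip_protocol='udp',
                              from_port=0, to_port=65535,
                              cidr_ip='0.0.0.0/0')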

------
zaph0d
Dear HN readers: In your view, what would be one way of architecting the
storage to avoid similar AWS EBS outages in the future?

~~~
moe
Money. You buy a second set of instances in a different availability zone and
failover to it in case of problems. You buy a second datacenter at a different
ISP, keep it in sync and failover to it when your primary fails.

Eventually you architect your application to distribute load over multiple
facilities and to become resilient against component failure.

Until then: You do nothing, grab a beer, relax and wait as the amazon guys
sweat their asses off to fix it. You pat yourself on the back because it is
not _your_ ass in the trenches right now.

On top of that you have the perfect excuse for the followup blog-post.

------
Kayem
An unfortunate event, but one that reiterates to system admins why the
cloud should only be used as a low tier of storage -- for now.

~~~
moe
What you are saying makes no sense.

Everything fails occasionally. Amazon probably has a team of highly
specialized engineers on the task right now, working under the pressure of a
few dozen disgruntled customers and under the eyes of worldwide press.

Could _your_ company respond with an equal intensity if this was your own
hardware? Will _your_ SAN supplier whip his staff on-site as fast as they will
for amazon?

~~~
Kayem
"Everything fails occassionally. Amazon probably ..."

I wish I could use that line to explain to my company why our business has
come to a complete halt.

Granted, your business is completely web-based and in the cloud, which is
why I specifically addressed system admins about why the cloud is not
reliable enough to be a high-level tier of storage yet. Why doesn't that
make sense? I wasn't trying to offend you or question your decisions.

Also, yes, my company and my SAN supplier would have staff on-site. But we
have control over our own hardware, so there's really no comparison.

~~~
moe
_I wish I could use that line to explain to my company why our business has
come to a complete halt._

It's a completely valid and reasonable business decision.

For most companies the risk of amazon downtime is simply not a deciding factor
when held against what it would cost to maintain their own datacenter with
remotely similar properties.

 _a high-level tier of storage yet. Why doesn't that make sense?_

I guess I didn't get what you mean by "high-level tier storage"? Most
companies have at most two tiers: Live and Snapshot-Backup. If you're a bank
or fortune XXX with truly multi-tiered storage then yes, your inhouse staff
_might_ be able to do it better. But it will probably cost quite a bit more
than ec2, and the business case for that is imho the exception rather than
the rule.

 _But we have control over our own hardware, so there's really no comparison._

Well, I think you overestimate your capabilities there (unless you _are_ a
fortune 500). Amazon doesn't face downtimes over disk or server failures - and
neither would you. The real question is who can debug and resolve complicated
failure modes faster (you know, nasty stuff, heisenbugs).

Not meaning to offend you either, but my money would be on amazon. That's
why I questioned your broad statement of "not a high-level tier of storage".
How much higher level than being backed by a 50,000-server operation can it
get?

~~~
Kayem
I completely agree with your points. Again, I was speaking to people who
manage their own SANs, and may be looking to use the Cloud as an additional
tier of storage, with the same reliability as a local array. Reliable in the
sense that they would never have to worry about Internet latency, network
nodes going down, or anything else that they have absolutely no knowledge
of or control over, and that could potentially affect the performance or
operation of an application, ultimately preventing them from meeting
business requirements.

If there are no business requirements to meet, I have no arguments, the Cloud
is where I'm at!

------
datums
Did you guys try mounting a new empty volume on a new instance, to see
whether it was all of EBS or just that one volume? Have you thought of not
using EBS and using S3 as the data store?

