
Bitbucket: On our extended downtime with Amazon - brodie
http://blog.bitbucket.org/2009/10/04/on-our-extended-downtime-amazon-and-whats-coming/
======
shughes
The only thing that kind of bothered me was that Bitbucket focused too much on
blaming Amazon. Here's a sports example. Let's say a team loses because the
left tackle messes up and the quarterback gets sacked. After the game, during
a press conference, does the quarterback say that they lost, but it's the left
tackle's fault? No. Even if the quarterback believes he himself performed
great, he'll still take part of the blame. That's because it's the
professional thing to do, and people respect him more for it.

You could say that Amazon and Bitbucket aren't on the same team, so it's
different. But Bitbucket is paying Amazon to perform better just like a
football team pays a player to make the team perform better. In other words,
based on the concept of a team recruiting players for better team performance,
Bitbucket recruits Amazon for better company performance, so they're basically
on the same team.

So Bitbucket should focus less on singling out Amazon, and more on
figuring out what they themselves could have done better. It's the
professional thing to do, and I'll respect them more for it.

~~~
ubernostrum
I certainly can understand why they did, though; for a _very_ long time the
only thing they knew was "Amazon's EBS seems to have failed". Once it was
known that they were being DoS'd they quickly got that information out.

------
ErrantX
Downtime like this sucks - I've been there beating my head against a brick
wall trying to get someone else to admit it really is something they need to
look at :( I'm impressed with the effort they put in in the end.

BitBucket has had quite a lot of downtime this year - and I also had a few
problems with payments a short time back.

 _but_

It doesn't piss me off like it would with certain other providers.

The status update pages are diligently maintained when they go down, updates
are prompt, and contacting support got a sensibly fast response (it was a
Sunday when I emailed :D). At no point have I thought about moving on. So they
do deserve a lot of kudos for that.

------
moe
Good to hear you're back on track.

16h resolution time is actually not too bad when you look at your monthly bill
- but the initial auto-denial (typical for large service providers of all
kinds) is of course annoying. Perhaps you'll get faster action next time, now
that you're paying for the premium plan.

It is indeed strange, though, that it took them so long to discover what must
have been a huge traffic spike on your instance. Their administration panel
needs work if it doesn't point out such anomalies at a glance. Also your own
monitoring needs work if you didn't notice the spike of inbound traffic.

Anyways, good luck for the future and if you decide to switch or extend I'd be
curious to read about it.

------
bcl
Why weren't they running any monitoring on their instances? A simple graph of
incoming traffic should have revealed the massive influx of packets and helped
track things down more quickly.

Just because a system is 'in the cloud' is no reason not to set up normal
monitoring on it.

I wrote <http://www.brianlane.com/software/systemhealth/> to try and help with
this situation; it's simple to install and doesn't require anything other than
rrdtool and a web server to serve up the graphs.
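In that spirit, the core of such a traffic graph is just two counter samples
and a subtraction. A minimal sketch (the interface name and the 10x-baseline
cutoff are illustrative assumptions, not anything from systemhealth):

```python
# Minimal sketch of inbound-traffic monitoring on Linux: parse the
# received-bytes counter for an interface out of /proc/net/dev text,
# turn two samples into a rate, and flag an abnormal spike.

def parse_rx_bytes(proc_net_dev: str, iface: str) -> int:
    """Return the received-bytes counter for iface from /proc/net/dev text."""
    for line in proc_net_dev.splitlines():
        if line.strip().startswith(iface + ":"):
            fields = line.split(":", 1)[1].split()
            return int(fields[0])  # first field after the colon is rx bytes
    raise ValueError("interface %r not found" % iface)

def inbound_rate(rx_then: int, rx_now: int, seconds: float) -> float:
    """Bytes per second between two counter samples."""
    return (rx_now - rx_then) / seconds

def looks_like_flood(rate_bps: float, baseline_bps: float,
                     factor: float = 10.0) -> bool:
    """Crude anomaly check: current rate far above the usual baseline."""
    return rate_bps > baseline_bps * factor
```

In practice you'd read /proc/net/dev a few seconds apart, feed the rates into
rrdtool for the graphs, and alert on the flood check - which would have made a
maxed-out pipe obvious at a glance.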

~~~
rbranson
They may have been, but I imagine it's difficult to discern the difference
between legitimate traffic and a DoS attack if you've got NAS volumes mounted
over the same interface. You'd need to do a more thorough network analysis.

~~~
drusenko
They claimed in many places that their "pipe" was "maxed out". If the attack
was large enough to use up all of their available bandwidth, that should be
_easily_ noticeable in any basic bandwidth graphs.

------
rbranson
This is surprising to me. It seems like Amazon should have dedicated ethernet
(or at least a separate VLAN + higher QoS) for EBS.

~~~
tc
Wait! What about Network Neutrality? Using QoS would _de_ prioritize other
traffic. In particular it would _unfairly_ advantage EBS over other cloud
storage solutions you could be trying to access from EC2.

In all seriousness, I agree that Amazon should probably be using QoS here so
outside traffic can't affect your EC2 to EBS link. I just want to point out
that it is difficult to be simultaneously in favor of Amazon doing this while
being against ISPs prioritizing voice traffic, which provides similar
resilience for voice against outside influences [1].

[1] I'm not suggesting that the parent poster suffers this potential cognitive
dissonance.

~~~
tsuraan
QOS and net neutrality are completely unrelated. What net neutrality is trying
to prevent isn't prioritization by service, it's prioritization by server.
Comcast wants to be able to hold Google's (or ycombinator's, or twitter's)
traffic hostage, degrading their bandwidth unless they pay Comcast a fee.
They'd also like to be able to ruin skype's voice services and netflix's video
services so that their captive audiences can only get phone and video services
through Comcast. That's what net neutrality is fighting, not perfectly
reasonable QOS.

~~~
tc
That may be the intention [2], but it isn't the effect.

Prioritizing EBS as was suggested above _is_ prioritizing by _server_. But
let's say that there was some reasonable way to priority-queue EBS-like
traffic regardless of the provider (there really isn't [1]). The DDOS attacker
would then certainly use this priority channel, which would magnify the effect
of their attack tenfold.

[1] Let me explain this in greater detail:

The best way to do QoS is to have end-devices tag their own IP traffic with
the appropriate DSCP bits. After all, only the end devices can be really
certain about the class of some packet. In your network, you then establish
policies for how the various classes of packets should be prioritized and
limited.
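That first step can be a one-line socket option on the sending host. An
illustrative sketch (the choice of the EF class here is my assumption, not
anything from the thread or from Amazon):

```python
import socket

# Sketch of end-device DSCP tagging. DSCP occupies the top six bits of
# the IP TOS byte, so a class value is shifted left by two before being
# written via IP_TOS. EF (Expedited Forwarding, 46) is the class
# conventionally used for latency-sensitive traffic.

DSCP_EF = 46  # illustrative class choice

def tag_socket(sock: socket.socket, dscp: int = DSCP_EF) -> None:
    """Mark all packets sent on this socket with the given DSCP class."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tag_socket(sock)
tos = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
sock.close()
```

The network gear then only has to match on that one byte to queue the packet,
which is what makes this cheap inside a private network - and worthless as
soon as you can't trust who set the bits.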

This works great inside a private network (like AWS, or an ISP). For one
thing, you can establish clear standards on how different types of packets
should be marked, and you can order them appropriately. You can trust the
classifications to the same degree you trust your own systems and
(potentially) your customers (depending on checks you do at your service
edge). Also, your network gear can prioritize efficiently because only one bit
field needs to be checked to determine priority.

Once you step outside of a private network, this all breaks down. First off,
most carriers strip DSCP bits at the network edge, so these don't propagate
end to end on the internet. Even if you receive a DSCP tag from the outside
though, why should you trust it, and how could you trust that it is tagged
according to your policies? If Youtube sets all their traffic _network
control_ priority, and you trust it, your network would grind to a halt under
load. More problematically, though, how can you meaningfully (and generically)
differentiate Youtube packets from DDOS packets?

I don't expect you to have answers to these questions; they are hard problems
that the best internet engineers in the world haven't solved yet. But this is
the background on why an enforced government Net Neutrality mandate is going
to throw all QoS policies into question.

[2] And trust me, I'm very sympathetic to that intention. On net, Net
Neutrality would be a win for my business. But it is still bad policy. What
makes people so sure we should give politicians like GWB or Pelosi final say
over network engineering [3]...

[3] Once the interest groups realize that centralized national network
engineering policy is up for grabs in Washington every 2 years, have you
thought about all the unpleasant directions this could go?

~~~
tsuraan
To be honest, I'm not actually sure what QoS even has to do with the initial
post in this thread; wouldn't a dedicated private ethernet network between the
EC2 machines and the EBS machines (assuming it's possible within Amazon's
network structure) have worked, assuming that their switches are configured to
only allow communication between EC2 and EBS (forbidding EC2 - EC2 traffic)?
Your commentary about how this would deprioritize other traffic wouldn't then
be quite right; it would be a dedicated network, with no cost to any external
sites. It would obviously give EBS an "unfair" advantage over anybody who
wants to provide an EC2 block storage solution to compete with EBS, but
honestly, is that a concern? Nobody's demanding that all internet hosts (end-
user included) have the same bandwidth to all other hosts. It's the artificial
crippling of services that's bad, not some things having advantages over
others.

QoS on the general web doesn't seem likely, as you point out. Prioritization
based on which services can afford to pay off all the nation's ISPs seems like
a really bad idea though. As a service provider, I don't want to have to pay
kickbacks to every ISP in the country (world?) to ensure that no unfortunate
accidents befall my traffic. Much more personally, though, as a human, I'm
sick of being sold as a product. I don't watch commercial TV because I'm not a
product to be sold to advertisers. Similarly, I'm not a product for Comcast to
sell to google; that's how they see me, but it's not how I see myself. And, I
get pissed off when people see me that way :)

~~~
tc
_wouldn't a dedicated private ethernet network_

With appropriate capacity, a VLAN/QoS configuration is functionally identical
to having physically separated networks. On an abstract level, if you object
to provider prioritization, you should probably object to them running
separate physical networks for their own services.

As to your point about feeling screwed by the (mostly theoretical) possibility
of intentional service degradation, I sympathize. I'll point out though that
1) I'm quite certain this isn't terribly prevalent in the US, 2) in most cases
where people think this is happening it is the result of simple under-capacity
or generally bad network design or management, and 3) the correct answer to
this general issue is to enable provider competition, not to impose ham-handed
mandates.

------
zaidf
DDoSes are a sucker punch the first time round.

We went through the same symptoms these guys did for about a week(!) before
figuring it out and putting hardware DDoS protection in place with our
dedicated provider.

The lesson was well worth it though. Next time we have these symptoms, we
won't be reinstalling the OS, hardware etc.

~~~
mtw
what did you get for hardware protection? Do you have your own data center?

~~~
zaidf
I believe we have a Cisco Guard DDoS Mitigation Appliance. It was provided to
us by our data center for an additional $59/month. We certainly don't have our
own data center.

------
gojomo
I wonder what someone has against BitBucket... but I also understand that
BitBucket is unlikely to share all of whatever they suspect.

In such cases, the target of a DDoS often only has hunches and pseudonymous
threats from which to reason about attacker motivations. Airing such
speculative info risks (1) unfairly implicating innocents who may have been
framed; (2) encouraging the guilty with more attention for their grievances or
destructive skills.

~~~
pjhyett
We (GitHub) were DDoS'd the other week when members of an open source project
hosted on the site got upset at the other members and wanted to make the
entire ecosystem around the project suffer. Not saying that's what happened to
BitBucket, but it could be a similar reason.

Talking with Sourceforge, it sounds like stuff like this happens fairly regularly.
It's one of the disadvantages of running a site that has members capable of
doing this stuff.

------
durana
Yikes, always remember to check interface utilization when troubleshooting
performance issues with anything network related!

~~~
tghw
<http://news.ycombinator.com/item?id=859941>

~~~
durana
Yeah, I saw that too. Too bad it seems like it was ~15 hours until Amazon
checked an interface and saw the high utilization.

------
anApple
After all, Amazon was right to tell them that everything was in order. As it
was.

He should fire his sysadmin for not checking the in/out network traffic...

~~~
spudlyo
Let's not be too hasty to play armchair sysadmin. Someone who claimed to be
involved said the traffic never reached their servers.

<http://news.ycombinator.com/item?id=859941>

~~~
mingdingo
Is that possible? How could they be the only ones affected then?

~~~
gojomo
Anything's possible. What if there was enough traffic targeted at just
BitBucket that one of the 'last hops' to BitBucket's machines, which may just
be a virtual hop in Amazon's own infrastructure, was the only one saturated? I
suppose it's even possible that the affected machines could only see high-
packet loss (and EBS sluggishness), not the arriving packets themselves.

~~~
jespern
Correct. Our machines don't even allow UDP. Either the physical machine our
VM runs on or the segment it's on was flooded, which meant we couldn't talk
outside it.

~~~
jacquesm
Switch statistics should be able to rule that one out for you.

~~~
gojomo
Does Amazon make such network-equipment statistics available?

~~~
jacquesm
I do not know, but any competent hosting facility will have those stats on
hand; it's what you base your billing on, so you'd better have them.

For the sites I operate this is my 'general health' indicator: bandwidth says
a lot more than my alarms. If there is a problem it usually shows up in the
bandwidth graphs before the alarms trigger (unless it is a power failure, but
those are extremely rare).

Our providers make them available to us, and this has been the case with every
provider we've had to date (The Planet, VXS, LeaseWeb, and a couple of smaller
ones); I'd imagine Amazon has them too.

According to the Amazon FAQ you have to use 'cloudwatch' to get at this data:

"An Amazon VPC router enables Amazon EC2 instances within subnets to
communicate with Amazon EC2 instances in other subnets within the same VPC.
They also enable subnets and VPN gateways to communicate with each other. You
can create and delete subnets attached to your router. Network usage data is
not available from the router; however, you can obtain network usage
statistics from your instances using Amazon CloudWatch."

You may have to do some arithmetic to see if a link got overloaded. One
telltale on the bandwidth graphs is 'flat caps': even though the machine's
inbound limit hasn't been reached, you see a fairly flat top on the in- or
outbound bandwidth graph on several machines at the same time (if they're on
the same segment, which on Amazon's infrastructure could be quite hard to
figure out).
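A sketch of that arithmetic, assuming the CloudWatch route the FAQ mentions
(the period, the 5%-of-peak tolerance, and the run length are illustrative;
the fetch itself needs boto3 and AWS credentials to actually run):

```python
from datetime import datetime, timedelta, timezone

def fetch_network_in(instance_id, hours=3):
    """Pull per-5-minute NetworkIn byte sums for one instance via CloudWatch."""
    import boto3  # requires boto3 installed and AWS credentials configured
    cw = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkIn",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return [p["Sum"] for p in points]

def has_flat_cap(samples, tolerance=0.05, run=6):
    """True if `run` consecutive samples sit within `tolerance` of the peak -
    the 'flat top' suggesting a saturated link rather than organic traffic."""
    if not samples:
        return False
    peak = max(samples)
    streak = 0
    for s in samples:
        if s >= peak * (1 - tolerance):
            streak += 1
            if streak >= run:
                return True
        else:
            streak = 0
    return False
```

Running the check over each of your instances and seeing the same flat top at
the same time is the correlated signal described above.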

