Bitbucket: On our extended downtime with Amazon (bitbucket.org)
86 points by brodie on Oct 4, 2009 | 36 comments

The only thing that kind of bothered me was that Bitbucket focused too much on blaming Amazon. Here's a sports example. Let's say a team loses because the left tackle messes up and the quarterback gets sacked. After the game, during a press conference, does the quarterback say that they lost, but it's the left tackle's fault? No. Even if the quarterback believes he himself performed great, he'll still take part of the blame. That's because it's the professional thing to do, and people respect him more for it.

You could say that Amazon and Bitbucket aren't on the same team, so it's different. But Bitbucket is paying Amazon to perform better just like a football team pays a player to make the team perform better. In other words, based on the concept of a team recruiting players for better team performance, Bitbucket recruits Amazon for better company performance, so they're basically on the same team.

That said, Bitbucket should focus less on singling out Amazon, and focus more on trying to figure out what they could have done better. It's the professional thing to do, and I'll respect them more for it.

I certainly can understand why they did, though; for a very long time the only thing they knew was "Amazon's EBS seems to have failed". Once it was known that they were being DoS'd they quickly got that information out.

Downtime like this sucks - I've been there beating my head against a brick wall trying to get someone else to admit it really is something they need to look at :( I'm impressed with the effort they put in in the end.

BitBucket has had quite a lot of downtime this year, and I also had a few problems with payments a short time back.


It doesn't piss me off like it would with certain other providers.

The status update pages are amusingly maintained when they go down, updates are prompt, and contacting support got a sensibly fast response (it was a Sunday when I emailed :D). At no point have I thought about moving on. So they do deserve a lot of kudos for that.

Good to hear you're back on track.

16h resolution time is actually not too bad when you look at your monthly bill - but the initial auto-denial (typical for large service providers of all kinds) is of course annoying. Perhaps you'll get faster action next time, now that you're paying for the premium plan.

It is indeed strange, though, that it took them so long to discover what must have been a huge traffic spike on your instance. Their administration panel needs work if it doesn't point out such anomalies at a glance. Also your own monitoring needs work if you didn't notice the spike of inbound traffic.

Anyways, good luck for the future and if you decide to switch or extend I'd be curious to read about it.

Why weren't they running any monitoring on their instances? A simple graph of incoming traffic should have revealed the massive influx of packets and helped track things down more quickly.

Just because a system is 'in the cloud' is no reason not to set up normal monitoring on the system.

I wrote http://www.brianlane.com/software/systemhealth/ to try and help with this situation. It is simple to install and doesn't require anything other than rrdtool and a web server to serve up the graphs.
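A simple incoming-traffic check of the kind described above can be sketched roughly as follows. It turns cumulative interface byte counters into per-interval rates and flags intervals far above the recent baseline. The sample values, the `window` and `factor` thresholds, and the function names are all made up for illustration; this is not how systemhealth or any particular tool works, and it can only see traffic that actually reaches the interface being measured.

```python
from statistics import mean

def rates_from_counters(samples):
    """Convert (timestamp_seconds, cumulative_bytes) samples into bytes/second rates."""
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or b1 < b0:
            # Skip bad timestamps and counter resets/wraps rather than guess.
            continue
        rates.append((t1, (b1 - b0) / elapsed))
    return rates

def find_spikes(rates, window=3, factor=5.0):
    """Return timestamps whose rate exceeds `factor` times the mean of the previous `window` rates."""
    spikes = []
    for i in range(window, len(rates)):
        baseline = mean(r for _, r in rates[i - window:i])
        t, r = rates[i]
        if baseline > 0 and r > factor * baseline:
            spikes.append(t)
    return spikes

# Hypothetical samples taken every 60 s: steady traffic, then a flood.
samples = [(0, 0), (60, 1_000_000), (120, 2_000_000), (180, 3_000_000),
           (240, 4_000_000), (300, 64_000_000)]
rates = rates_from_counters(samples)
print(find_spikes(rates))  # timestamps of intervals flagged as spikes
```

A fixed multiple of a short moving average is about the crudest baseline possible; it is only meant to show that the check itself is cheap to run.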

Jesper addressed this in the IRC channel. Basically, the traffic was not actually reaching the box, it was just saturating the available bandwidth to the EBS store. Moreover, since the attack was against EBS and not EC2, I'm not sure that you could actually get any monitoring in place on the volume.

I'm kind of surprised your first assumption is that they have no monitoring in place. I like BitBucket, so I'm biased, but I can't really see anything about this instance that would make me assume it was caused by some sort of incompetence on their part.

Sorry, I'm not familiar with how EBS connects to the EC2 instances so I assumed the incoming traffic was hitting their EC2 instance.

Are you saying that EBS is accessible from outside and that the attackers were able to discover its IP and attack it directly? This doesn't seem like a very safe way to configure things.

I also wonder why Amazon didn't notice this with any of their monitoring of either traffic or system performance and latency. Assuming they have such monitoring in place of course.

They may have been, but I imagine it's difficult to discern the difference between legitimate traffic and a DoS attack if you've got NAS volumes mounted over the same interface. You'd need to do a more thorough network analysis.

They claimed in many places that their "pipe" was "maxed out". If the attack was large enough to use up all of their available bandwidth, that should be easily noticeable in any basic bandwidth graphs.

This is surprising to me. It seems like Amazon should have dedicated ethernet (or at least a separate VLAN + higher QoS) for EBS.

Wait! What about Network Neutrality? Using QoS would deprioritize other traffic. In particular it would unfairly advantage EBS over other cloud storage solutions you could be trying to access from EC2.

In all seriousness, I agree that Amazon should probably be using QoS here so outside traffic can't affect your EC2 to EBS link. I just want to point out that it is difficult to be simultaneously in favor of Amazon doing this while being against ISPs prioritizing voice traffic, which provides similar resilience for voice against outside influences [1].

[1] I'm not suggesting that the parent poster suffers this potential cognitive dissonance.

QOS and net neutrality are completely unrelated. What net neutrality is trying to prevent isn't prioritization by service, it's prioritization by server. Comcast wants to be able to hold Google's (or ycombinator's, or twitter's) traffic hostage, degrading their bandwidth unless they pay Comcast a fee. They'd also like to be able to ruin skype's voice services and netflix's video services so that their captive audiences can only get phone and video services through Comcast. That's what net neutrality is fighting, not perfectly reasonable QOS.

That may be the intention [2], but it isn't the effect.

Prioritizing EBS as was suggested above is prioritizing by server. But let's say that there was some reasonable way to priority-queue EBS-like traffic regardless of the provider (there really isn't [1]). The DDOS attacker would then certainly use this priority channel, which would magnify the effect of their attack tenfold.

[1] Let me explain this in greater detail:

The best way to do QoS is to have end-devices tag their own IP traffic with the appropriate DSCP bits. After all, only the end devices can be really certain about the class of some packet. In your network, you then establish policies for how the various classes of packets should be prioritized and limited.

This works great inside a private network (like AWS, or an ISP). For one thing, you can establish clear standards on how different types of packets should be marked, and you can order them appropriately. You can trust the classifications to the same degree you trust your own systems and (potentially) your customers (depending on checks you do at your service edge). Also, your network gear can prioritize efficiently because only one bit field needs to be checked to determine priority.

Once you step outside of a private network, this all breaks down. First off, most carriers strip DSCP bits at the network edge, so these don't propagate end to end on the internet. Even if you receive a DSCP tag from the outside though, why should you trust it, and how could you trust that it is tagged according to your policies? If Youtube sets all their traffic network control priority, and you trust it, your network would grind to a halt under load. More problematically, though, how can you meaningfully (and generically) differentiate Youtube packets from DDOS packets?
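To make the marking scheme above concrete, here is a minimal sketch of DSCP handling at a trust boundary. The DSCP code points and their position in the upper six bits of the DS byte are standard (RFC 2474, RFC 3246); everything else, including the `Packet` class, the `TRUSTED_SOURCES` set, and the "sort by DSCP" scheduler, is a hypothetical simplification and not a description of how AWS or any carrier configures its network.

```python
from dataclasses import dataclass

# Standard DSCP code points; the value occupies the upper 6 bits of the DS byte.
DSCP_BEST_EFFORT = 0   # default forwarding
DSCP_EF = 46           # Expedited Forwarding, commonly used for voice
DSCP_CS6 = 48          # class selector 6, network control

def dscp_to_ds_byte(dscp):
    """Place a 6-bit DSCP value into the DS (former TOS) byte, leaving the 2 ECN bits zero."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2

def ds_byte_to_dscp(ds_byte):
    """Extract the 6-bit DSCP value from a DS byte."""
    return (ds_byte >> 2) & 0x3F

@dataclass
class Packet:
    source: str   # hypothetical label for where the packet entered the network
    ds_byte: int

# Hypothetical policy: only these ingress points may mark their own traffic.
TRUSTED_SOURCES = {"internal-storage", "internal-voice"}

def remark_at_edge(packet):
    """Keep markings from trusted sources; reset everything else to best effort."""
    if packet.source in TRUSTED_SOURCES:
        return packet
    return Packet(packet.source, dscp_to_ds_byte(DSCP_BEST_EFFORT))

def queue_order(packets):
    """Serve higher DSCP first; a real scheduler maps classes to queues by policy."""
    return sorted(packets, key=lambda p: ds_byte_to_dscp(p.ds_byte), reverse=True)

packets = [
    Packet("internet", dscp_to_ds_byte(DSCP_CS6)),         # outsider claiming top priority
    Packet("internal-storage", dscp_to_ds_byte(DSCP_EF)),  # trusted internal marking
    Packet("internet", dscp_to_ds_byte(DSCP_BEST_EFFORT)),
]
at_edge = [remark_at_edge(p) for p in packets]
print([p.source for p in queue_order(at_edge)])
```

Without the `remark_at_edge` step, the outside packet marked CS6 would be served ahead of the internal storage traffic, which is the trust problem described above.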

I don't expect you to have answers to these questions; they are hard problems that the best internet engineers in the world haven't solved yet. But this is the background on why an enforced government Net Neutrality mandate is going to throw all QoS policies into question.

[2] And trust me, I'm very sympathetic to that intention. On net, Net Neutrality would be a win for my business. But it is still bad policy. What makes people so sure we should give politicians like GWB or Pelosi final say about network engineering [3]...

[3] Once the interest groups realize that centralized national network engineering policy is up for grabs in Washington every 2 years, have you thought about all the unpleasant directions this could go?

To be honest, I'm not actually sure what QOS even has to do with the initial post in this thread; wouldn't a dedicated private ethernet network between the EC2 machines and the EBS machines (assuming it's possible within Amazon's network structure) have worked, assuming that their switches are configured to only allow communication between EC2 and EBS (forbid EC2-to-EC2 traffic)? Your commentary about how this would deprioritize other traffic wouldn't then be quite right; it would be a dedicated network, with no cost to any external sites. It would obviously give EBS an "unfair" advantage over anybody who wants to provide an EC2 block storage solution to compete with EBS, but honestly, is that a concern? Nobody's demanding that all internet hosts (end-user included) have the same bandwidth to all other hosts. It's just artificial crippling of services that is bad, not some things having advantages over others.

QoS on the general web doesn't seem likely, as you point out. Prioritization based on which services can afford to pay off all the nation's ISPs seems like a really bad idea though. As a service provider, I don't want to have to pay kickbacks to every ISP in the country (world?) to ensure that no unfortunate accidents befall my traffic. Much more personally, though, as a human, I'm sick of being sold as a product. I don't watch commercial TV because I'm not a product to be sold to advertisers. Similarly, I'm not a product for Comcast to sell to google; that's how they see me, but it's not how I see myself. And, I get pissed off when people see me that way :)

wouldn't a dedicated private ethernet network

With appropriate capacity, a VLAN/QoS configuration is functionally identical to having physically separated networks. On an abstract level, if you object to provider prioritization, you should probably object to them running separate physical networks for their own services.

As to your point about feeling screwed by the (mostly theoretical) possibility of intentional service degradation, I sympathize. I'll point out though that 1) I'm quite certain this isn't terribly prevalent in the US, 2) in most cases where people think this is happening it is the result of simple under-capacity or generally bad network design or management, and 3) the correct answer to this general issue is to enable provider competition, not to impose ham-handed mandates.

I can't reply to tc's reply to me (I think that "feature" should probably be disabled, btw; it tends to lead to broken threads like this one, and inhibits discussion), so I'm replying to myself, like a monster.

Service provider competition is definitely the answer. Where I live, we finally have a competitor to Comcast (Qwest), and t-mobile is also gearing up for a bit of wireless service. I've heard that some municipalities assert ownership of their power grids, phone lines, pretty much everything that's built on public land, and they force open competition for providing services over those infrastructure pieces. That would definitely be preferable to net neutrality, but it seems that wresting ownership of infrastructure from the ISPs is even more offensive than the creation of neutrality provisions. At least new communities can learn from existing ones and have sane starting points for competition, I guess.

[2] and [3], I think may have been written after I posted, or maybe I'm just blind. Either way...

I'm definitely uncertain about what bad politicians would do with control over the net's structure. I do know exactly what the CEOs of major ISPs will do though: they will maximize short-term profit. I've seen a lot of short-term profit maximization in my (rather short) life, and I haven't seen any of it turn out well. We used to have companies that had long-term goals, fundamental research divisions, the ability to see past next quarter's profits. I'm not sure where those companies are now, but if they're gone, it seems that the government is the only entity left that can have a long-term goal. Of course, long-term for our (US) government tends to be the election cycle, which is 2, 4, or 6 years. Not good, but perhaps not as bad as it could be. It's also easier to change the government than it is to change a huge company, although neither of those things is an easy thing to do.

I'm not sure that net neutrality is quite as sweeping as you see it either. If congress is getting into peering agreements, protocol design, things like that, then we're definitely all doomed. If they're just slapping down anti-competitive measures like charging companies for access to ISP subscribers, then I can definitely get behind that. It's a matter of degrees, I guess.

the CEOs of major ISPs will...maximize short-term profit

I think you may be unfairly maligning a bunch of folks who do actually care about their customers, but in any case, the correct answer is to enable competition, not to add regulation.

As for how sweeping it is, knowledgeable people in the industry feel like this is going to impair their ability to make reasonable network design choices, and in some cases force them to lay physically separate networks to do an identical job. Once it is established that Washington has that level of control (effectively making a public resource out of private networks), it seems reasonable judging from historical precedent that they will expand their jurisdiction (the special interests will smell blood).

(On a meta-note, I wrote 2&3 before seeing your response. We were likely writing in parallel.)

DDOSes are a sucker the first time round.

We went through the same symptoms these guys did for about a week(!) before figuring it out and putting hardware DDOS protection in place with our dedicated provider.

The lesson was well worth it though. Next time we have these symptoms, we won't be reinstalling the OS, hardware etc.

What did you get for hardware protection? Do you have your own data center?

I believe we have a Cisco Guard DDoS Mitigation Appliance. It was provided to us by our data center for an additional $59/month. We certainly don't have our own data center.

I wonder what someone has against BitBucket... but I also understand that BitBucket is unlikely to share all of whatever they suspect.

In such cases, the target of a DDoS often only has hunches and pseudonymous threats from which to reason about attacker motivations. Airing such speculative info risks (1) unfairly implicating innocents who may have been framed; (2) encouraging the guilty with more attention for their grievances or destructive skills.

We (GitHub) were DDoS'd the other week when members of an open source project hosted on the site got upset at the other members and wanted to make the entire ecosystem around the project suffer. Not saying that's what happened to BitBucket, but it could be a similar reason.

Talking with Sourceforge, it sounds like stuff like this happens fairly regularly. It's one of the disadvantages of running a site that has members capable of doing this stuff.

Going on nothing other than my overactive imagination, I speculate that somebody hosted exploit code on BitBucket that somebody else didn't want made public. There seems to be a growing trend of black hat security types who are very much against disclosure of exploits and holes.

Yikes, always remember to check interface utilization when troubleshooting performance issues with anything network related!

Yeah, I saw that too. Too bad it seems like it was ~15 hours until Amazon checked an interface and saw the high utilization.

After all, Amazon was right to tell them that everything was in order, as it was.

He should fire his sysadmin for not checking the in/out network traffic...

Let's not be too hasty to play armchair sysadmin. Someone who claimed to be involved said the traffic never reached their servers.


They said they moved their servers to 3 availability zones and still had the same problems. What's the probability of having the same attack in 3 zones on a subnet you are randomly assigned to?
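As a back-of-the-envelope illustration of that question: if the flood were hitting fixed subnets, and each move landed the servers on an affected subnet independently with some probability `p`, the chance of being hit after all three moves would be `p` cubed. The values of `p` below are purely hypothetical; nothing in the thread says what fraction of subnets, if any, was affected.

```python
def chance_all_moves_hit(p_affected, moves):
    """Probability that `moves` independent random placements all land on an affected subnet."""
    return p_affected ** moves

# Hypothetical fractions of affected subnets.
for p in (0.5, 0.1, 0.01):
    print(f"p={p}: {chance_all_moves_hit(p, 3):.6f}")
```

Unless `p` is large, three hits in a row by chance is unlikely, which is consistent with the traffic following BitBucket's addresses rather than sitting on one unlucky segment.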

Besides, firing would definitely be a little harsh. Everybody deserves a 2nd chance.

Is that possible? How could they be the only ones affected then?

Anything's possible. What if there was enough traffic targeted at just BitBucket that one of the 'last hops' to BitBucket's machines, which may just be a virtual hop in Amazon's own infrastructure, was the only one saturated? I suppose it's even possible that the affected machines could only see high packet loss (and EBS sluggishness), not the arriving packets themselves.

Correct. Our machines don't allow UDP, even. Either the physical machine our VM runs on, or the segment was flooded, which means we couldn't talk outside it.

Switch statistics should be able to rule that one out for you.

Does Amazon make such network-equipment statistics available?

I do not know, but any competent hosting facility will have those stats on call, it's what you base your billing on, so you'd better have them.

For the sites I operate this is my 'general health' indicator, bandwidth says a lot more than my alarms, if there is a problem it usually shows up in the bandwidth graphs before the alarms trigger (unless it is a power failure, but those are extremely rare).

Our providers make them available to us, and this has been the case with any provider that we've had to date (the planet, vxs, leaseweb and a couple of smaller ones). I'd imagine Amazon has them too.

According to the Amazon FAQ you have to use CloudWatch to get at this data:

"An Amazon VPC router enables Amazon EC2 instances within subnets to communicate with Amazon EC2 instances in other subnets within the same VPC. They also enable subnets and VPN gateways to communicate with each other. You can create and delete subnets attached to your router. Network usage data is not available from the router; however, you can obtain network usage statistics from your instances using Amazon CloudWatch."

You may have to do some arithmetic to see whether a link got overloaded. One telltale on the bandwidth graphs is 'flat caps': a fairly flat top on the inbound or outbound graph of several machines at the same time, even though no machine has reached its own inbound limit (if they're on the same segment, which on Amazon's infrastructure could be quite hard to figure out).
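The 'flat caps' check can be sketched in a few lines. This is only an illustration under stated assumptions: the series are hypothetical per-machine bandwidth samples in Mbit/s, and the `tolerance`, `min_run`, and `per_machine_limit` values are invented, not Amazon or CloudWatch settings.

```python
def flat_cap_intervals(series, tolerance=0.02, min_run=4):
    """Return (start, end) index ranges where the series stays within
    `tolerance` of its own peak for at least `min_run` consecutive samples."""
    peak = max(series)
    if peak <= 0:
        return []
    threshold = peak * (1 - tolerance)
    runs, start = [], None
    for i, value in enumerate(series):
        if value >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(series) - start >= min_run:
        runs.append((start, len(series) - 1))
    return runs

def capped_below_limit(series_by_machine, per_machine_limit, **kwargs):
    """Machines that plateau even though their peak is under their own limit."""
    suspects = {}
    for name, series in series_by_machine.items():
        runs = flat_cap_intervals(series, **kwargs)
        if runs and max(series) < per_machine_limit:
            suspects[name] = runs
    return suspects

# Hypothetical inbound bandwidth in Mbit/s, one sample per interval.
graphs = {
    "a": [20, 35, 60, 60, 60, 60, 60, 30],
    "b": [10, 25, 40, 40, 40, 40, 40, 15],
    "c": [5, 8, 12, 9, 14, 7, 10, 6],
}
print(capped_below_limit(graphs, per_machine_limit=100))
```

Machines that show a plateau over the same sample range, while each is well under its own limit, are the ones worth suspecting of sharing a saturated upstream link.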
