David from the Atlassian SRE team here. AWS Direct Connect is experiencing an outage in their US East Region: https://status.aws.amazon.com, which is causing connectivity issues for most Atlassian products and services. We're working hard to get everything back up and running. Please check http://status.atlassian.com for the latest updates. We're posting regularly and will continue to provide updates there.
Yeah, this makes no sense to me. I have a pretty heavy workload in AWS (us-east-1), don't use DC AT ALL, and nothing is down for me today (except Atlassian Jira/Confluence Cloud); we self-host BB. Why their 'cloud'-based application relies on DC is very odd.
I don't know but my guess would be anything that isn't core storage - we know they run their own SAN on their own hardware because that was the cause of another outage a month or two ago.
At a guess:
- Bitbucket Pipelines
- Webhook workers
- Front-end web servers
- SSH push/pull workers
Basically anything that's elastic to demand. Presumably the cost of AWS storage makes it not worth it for the Bitbucket team.
I'm not really a networking guy, so perhaps this is an obvious question, but why don't you have a failover configuration to send traffic over VPN or the public Internet? I would expect the latency to increase but otherwise still work.
Is it a cost concern, is DC reliable enough that it's just an accepted risk, or is there some other reason?
Hello, I'm Irena from the Networking Engineering team at Atlassian. I have been directly involved with this incident and wanted to provide some answers to the questions.
We’ve built our architectures based on the AWS Direct Connect service because it’s the most reliable and scalable solution based on our customer and network needs. The AWS Direct Connect service we use in the US East Region has multiple redundant links (4x 10Gbps) optimized for data throughput requirements and availability, and to our knowledge the AWS Direct Connect transit facilities have power backups that would help contribute to its reliability. But, as we saw from today’s event, something still failed.
I should note that we have both publicly and privately reachable resources in AWS. The publicly reachable resources have fail-overs built in for situations like these (it happens automatically), but the privately reachable resources with our architecture depend solely on AWS Direct Connect. For example, our Bitbucket failure today was due to the fact that we rely on AWS Direct Connect to link between the Bitbucket Cloud components that we host in our data centers and others that we host on AWS. Bitbucket could continue connecting to services in our own data centers and the public Internet/AWS, but could not talk to the privately reachable resources in the Atlassian infrastructure hosted on AWS.
We understand the importance and the impact for our customers, and dedicated several teams to this issue as soon as it was reported. AWS has resolved the issue, but we will look into ways to help prevent and better mitigate these types of issues in the future as part of our incident review and improvement processes.
StatusGator is neat, thanks for linking that. Do you have graphs anywhere to track the number of outages/problems over time? It would be nice to see if there's been an uptick in problems generally, across certain services, etc.
No, but that's a great idea! I have 3 years of data now from hundreds of services including severity of reported outage and text about why it went down. So I could show graphs over time for sure.
You would think Amazon would have sufficient backup power to make that a non-issue, as would any significant third party data centers involved in routing data. However, a downed data line might make sense as a root cause.
And it's only one AZ, so it sounds like Atlassian services aren't spread out over AWS properly. The recent S3 incident really highlighted the importance of this.
Amazon's description of S3: "Data is automatically distributed across a minimum of three physical facilities that are geographically separated within an AWS Region".
What are people doing wrong with how they use S3? Until AWS provides a cross-region S3 that is master-master or self-healing, the suggestion that people are using AWS improperly seems incorrect.
That’s currently showing an issue with Direct Connect in northern Virginia. That seems like a bit of a stretch and it certainly wouldn’t say anything good about their DR planning if one region can take the whole thing down.
Ah. I wonder if that's why Capital One's login is down. I'm a little surprised their app isn't more resilient, but this is the second multi-hour outage I've noticed in the past couple months.
XaaS has many benefits, but uptime is not one of them anymore. I self-host my repos; I've had a few downtimes, but thanks to this DDoS my local services now have better uptime. (Disclaimer: I know it's not an apples-to-apples comparison, as the scale is massively different.)
Distributed source code management theoretically makes doing this in a robust and replicated manner quite a bit easier. If you ignore partitions, it seems pretty straightforward to implement a git push-all, plus a recovery process for stale nodes coming back.
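For what it's worth, plain git already supports a crude push-all: a single remote can carry multiple push URLs, so one `git push` fans out to every mirror. A minimal sketch, assuming an existing clone (the URLs are placeholders):

```shell
# Add a second (and third, ...) push URL to the existing origin remote.
# Note: once any explicit push URL is set, the fetch URL is no longer
# used for pushes, so re-add it first if you still want to push there.
git remote set-url --add --push origin git@bitbucket.org:team/repo.git
git remote set-url --add --push origin git@github.com:team/repo.git

# A single push now updates both mirrors:
git push origin main
```

This only covers the write path; recovery for a stale mirror coming back still has to be scripted separately (e.g. a periodic fetch from whichever mirror is reachable).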
This is the last straw for me. I'm gonna stop using them to host my code. They have been down way too many times in the last year; it's been six failures from them in the last two weeks alone. I'm gonna self-host Gitea to fix my issues. I cannot believe that they fail so hard. Why does a failure mean that I cannot read AND write from Bitbucket? Why are those two things even related?
Relying solely on one platform for your organization's code repository needs is a bad thing in general. I have a workflow where I host my code both on my own Gogs instance on a Raspberry Pi and on GitHub. Maybe this would be useful to you?
I've configured my push so that it deploys to both GitHub and my Gogs instance, meaning I always have an up-to-date repository in two places.
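In case it helps anyone replicate that setup: the same dual push can be wired up with two remotes and a one-line alias. A sketch with placeholder URLs and remote names:

```shell
# Keep GitHub and the self-hosted Gogs box as separate remotes.
git remote add github git@github.com:me/project.git
git remote add gogs   git@raspberrypi.local:me/project.git

# Alias that pushes all branches to both; ';' rather than '&&' so the
# second push still runs even when the first host is down.
git config alias.pushboth '!git push --all github; git push --all gogs'

git pushboth
```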
But integrating those two repositories with all the automated workflows is quite a headache. Jenkins won't automatically switch over to another git backend if one fails, and other automated tools like code review mostly rely on a centralized repository. I don't see any simple solution to these issues: Bitbucket, GitHub, and GitLab don't offer any easy fallback when you rely on their services for core workflows.
Self-hosting is maybe one solution (we do this with Bitbucket), but it requires major administrative effort to keep it running reliably (always available, with no data loss on hardware/software failures).
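One cheap mitigation for the CI side: make the checkout step itself fall back to a mirror when the primary host is down. A sketch, assuming the CI job runs a shell step and that a mirror exists (both URLs are placeholders):

```shell
#!/bin/sh
# Clone from the primary host; if that fails (e.g. Bitbucket is down),
# fall back to the self-hosted mirror.
PRIMARY=https://bitbucket.org/team/repo.git
MIRROR=https://git.internal.example/team/repo.git

git clone --depth 1 "$PRIMARY" src || git clone --depth 1 "$MIRROR" src
```

Webhooks won't fire from the dead host, of course, so the job still needs a polling trigger (or a manual kick) to start at all.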
I agree. Having CI with multiple git repositories is still a painful thing. (I continue to rely primarily on GitHub here. My backup repos don't have CI enabled.)
I think if your primary repo that's hooked into CI goes down, you're still SOL. Having your own repo just enables you to continue local development among your team.
I don't see any great solution, other than making your app distributed to begin with and doing your build/deploy manually.
As a complete aside, I've fantasized about deploying to all cloud vendors (Azure/GCE/AWS/Heroku/DigitalOcean/misc.) each with their own specific build/deploy and having persistent state shared with something like CockroachDB. Having some load balancer managing state between all instances. Taking advantage of the free/basic tiers provided by all of the vendors.
Two replicated instances across two VPS providers (say, Linode and DigitalOcean), on different continents, $5/mo each. And you'll have spare cycles to run anything from email to a monitoring solution to a build server. How is this not economical?
How do you access these VPSes? A load balancer? You'll need a globally distributed load balancer as well. DO load balancers only tolerate failures across AZs, not across regions, and cost $20/mo. You might need a GCP Global Load Balancer.
How are your git repos stored? Attached block storage? Seriously? Those aren't highly available at all, and it wouldn't even work in your proposed setup because each instance will have its own block storage.
No, that won't do. You'll have to implement some form of cross-region replication, possibly backed by S3 or something. Good luck; it's not easy.
Github costs $7/mo. Is it more economical than that? Can you seriously say that you can achieve higher cross-region redundancy, for both your instances and storage, than Github can, at a price less than $7/mo?
The only way I can think of accomplishing it is to make use of GCP's free f1.micro instance, so you can spin up two of those for pretty cheap in different regions ($3.88/mo). Have DNS hosted somewhere that can resolve to each of your two instances at random ($12/year? Let's say $1/mo). Then you have instance storage: good luck finding globally redundant block storage for practically unlimited repositories, with backups, for $2.12/mo. Let's just leave network egress charges out of it, since those would be marginal.
And let's go ahead and say I value my free time at a conservative $30/hr. It takes me an hour to set this thing up and maintain it every year; horribly conservative. That's an additional $2.50/mo.
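Tallying the figures from the last two comments in one place (nothing new here, just the arithmetic):

```shell
# Budget math for the DIY setup, using only the numbers quoted above.
awk 'BEGIN {
  instances = 3.88                # two f1.micro instances per month
  dns       = 1.00                # ~$12/year of DNS hosting
  storage   = 7.00 - instances - dns
  printf "storage budget left to match GitHub: $%.2f/mo\n", storage
  upkeep    = 30.00 / 12          # one hour per year of your time at $30/hr
  printf "real total including upkeep:         $%.2f/mo\n", instances + dns + storage + upkeep
}'
```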
Maybe you can do it. It isn't laughably economical.
You use the internet. Their problems are already yours.
The only situation in which self-hosting or ditching BitBucket/similarly large providers will help protect you from the fallout of historically-large, catastrophic attacks is if you do all of your development on the same local network as your hosted server (and don't rely on any internet services to access it).
And even if you diligently self-host every part of your own services (not as easy as just plop a gitlab/gitea install on a host you own and start it up), you have to deal with the fallout from internet-breaking DDoS attacks and other malicious activity if you want to use the internet to run or use your code: from congestion caused by compromised devices in your network "neighborhood" (same/similar ISPs or last-mile providers) to DNS outages to BGP hacks, we have seen time and time again that, if not exactly centralized, the systems that comprise the usable internet are certainly highly interdependent. Large-scale attacks of many kinds compromise them.
Instead of tantrums, it might behoove users to understand what kind of SLAs they can promise in order to operate their self-hosted services in such an interdependent environment. Some examples:
- Do you need local power? A local ISP to be up? More than one?
- If you have more than one internet link, how do you pair the connections? If it's via BGP, what happens if the central authority on that has issues?
- If local power is down, does your local ISP's connection stay up? How long does it stay up (is there a node/amp somewhere on the line that cut to battery)?
- Do you need to access internet services by hostname? If so, do you do local DNS caching? If so, how stale can it get in the event of a loss of external DNS?
- Most importantly: how much dependence on external services are you comfortable with (it's a nonzero number unless you're developing for yourself, by yourself, on your LAN), and how much time are you willing to spend eliminating the long tail of such dependencies?
Well, if you can't read, then you can't see what you're writing to, right?
But I agree, it's not ideal that Bitbucket has been experiencing issues. I am also looking at alternatives, most likely using AWS CodeCommit and Upsource. It's been at the back of my mind to move away for a while.
AWS has an easy nudge available to get people to begin using us-east-2 over us-east-1: introduce new features, new EC2 instance types, etc in us-east-2 first.
I think that’s the problem with us-east-1: it’s the guinea pig region, so of course it’s going to have the most problems. If you don’t need cutting edge features, you shouldn’t be there.
us-east-1 feels like the region where Amazon rolls things out first. This has two primary effects:
* sometimes new things break in unexpected ways
* sometimes things get changed for the Rev. B and those revisions don’t get done in Virginia because they’ve already completed the Rev. A rollout.
Also, there’s a secondary effect: because it’s the “default” region, it has a LOT more tenants, which means it probably has scaling and HA problems that none of the other regions do.
> Some component services are currently unreachable due to an upstream incident on a cloud provider. We're attempting to route as much traffic as possible away from the affected components, and are working with our vendor now.