David from the Atlassian SRE team here. AWS Direct Connect is experiencing an outage in their US East Region: https://status.aws.amazon.com, which is causing connectivity issues for most Atlassian products and services. We're working hard to get everything back up and running. Please check http://status.atlassian.com for the latest updates. We're posting regularly and will continue to provide updates there.
Yeah, this makes no sense to me. I have a pretty heavy workload in AWS (us-east-1), don't use DC AT ALL, and nothing is down for me today (except Atlassian Jira/Confluence Cloud); we self-host BB. Why their 'cloud'-based application relies on DC is very odd.
I don't know but my guess would be anything that isn't core storage - we know they run their own SAN on their own hardware because that was the cause of another outage a month or two ago.
At a guess:
- Bitbucket Pipelines
- Webhook workers
- Front-end web servers
- SSH push/pull workers
Basically anything that's elastic to demand. Presumably the cost of AWS storage makes it not worth it for the Bitbucket team.
I'm not really a networking guy, so perhaps this is an obvious question, but why don't you have a failover configuration to send traffic over VPN or the public Internet? I would expect the latency to increase but otherwise still work.
Is it a cost concern, is DC reliable enough that it's just an accepted risk, or is there some other reason?
Hello, I'm Irena from the Networking Engineering team at Atlassian. I have been directly involved with this incident and wanted to provide some answers to the questions.
We’ve built our architectures based on the AWS Direct Connect service because it’s the most reliable and scalable solution based on our customer and network needs. The AWS Direct Connect service we use in the US East Region has multiple redundant links (4x 10Gbps) optimized for data throughput requirements and availability, and to our knowledge the AWS Direct Connect transit facilities have power backups that would help contribute to its reliability. But, as we saw from today’s event, something still failed.
I should note that we have both publicly and privately reachable resources in AWS. The publicly reachable resources have fail-overs built in for situations like these (it happens automatically), but the privately reachable resources with our architecture depend solely on AWS Direct Connect. For example, our Bitbucket failure today was due to the fact that we rely on AWS Direct Connect to link between the Bitbucket Cloud components that we host in our data centers and others that we host on AWS. Bitbucket could continue connecting to services in our own data centers and the public Internet/AWS, but could not talk to the privately reachable resources in the Atlassian infrastructure hosted on AWS.
We understand the importance and the impact for our customers, and dedicated several teams to this issue as soon as it was reported. AWS has resolved the issue, but we will look into ways to help prevent and better mitigate these types of issues in the future as part of our incident review and improvement processes.
StatusGator is neat, thanks for linking that. Do you have graphs anywhere to track the number of outages/problems over time? It would be nice to see if there's been an uptick in problems generally, across certain services, etc.
No, but that's a great idea! I have 3 years of data now from hundreds of services including severity of reported outage and text about why it went down. So I could show graphs over time for sure.
You would think Amazon would have sufficient backup power to make that a non-issue, as would any significant third party data centers involved in routing data. However, a downed data line might make sense as a root cause.
And it's only one AZ, so it sounds like Atlassian services aren't spread out over AWS properly. The recent S3 incident really highlighted the importance of this.
Amazon's description of S3: "Data is automatically distributed across a minimum of three physical facilities that are geographically separated within an AWS Region".
What are people doing wrong with how they use S3? Until AWS provides a cross-region S3 that is master-master or self-healing, the suggestion that people are using AWS improperly seems incorrect.
That’s currently showing an issue with Direct Connect in northern Virginia. That seems like a bit of a stretch and it certainly wouldn’t say anything good about their DR planning if one region can take the whole thing down.
Ah. I wonder if that's why Capital One's login is down. I'm a little surprised their app isn't more resilient, but this is the second multi-hour outage I've noticed in the past couple months.
XaaS has many benefits, but uptime is not one of them anymore. I self-host my repos; I've had a few downtimes, but thanks to this DDoS my local services now have better uptime. (Disclaimer: I know it's not an apples-to-apples comparison, as the scale is massively different.)
Distributed source code management theoretically makes doing this in a robust and replicated manner quite a bit easier. If you ignore partitions, it seems pretty straightforward to implement a git push-all, plus a recovery process for stale nodes coming back.
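For what it's worth, plain git already supports a crude push-all: a single remote can carry multiple push URLs, so one `git push` fans out to every mirror. A minimal sketch, assuming an existing clone (the URLs are placeholders):

```shell
# Add a second (and third, ...) push URL to the existing origin remote.
# Note: once any explicit push URL is set, the fetch URL is no longer
# used for pushes, so re-add it first if you still want to push there.
git remote set-url --add --push origin git@bitbucket.org:team/repo.git
git remote set-url --add --push origin git@github.com:team/repo.git

# A single push now updates both mirrors:
git push origin main
```

This only covers the write path; recovery for a stale mirror coming back still has to be scripted separately (e.g. a periodic fetch from whichever mirror is reachable).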
This is the last straw for me. I'm gonna stop using them to host my code. They have been down way too many times in the last year; it's been six failures from them in the last two weeks alone. I'm gonna self-host Gitea to fix my issues. I cannot believe that they fail so hard. Why does a failure mean that I cannot read AND write from Bitbucket? Why are those two things even related?
Relying solely on one platform for your organization's code repository needs is a bad thing in general. I have a workflow where I host my code both on my own Gogs instance on a Raspberry Pi and on GitHub. Maybe this would be useful to you?
I've configured my push so that it deploys to both GitHub and my Gogs instance, meaning I always have an up-to-date repository in two places.
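In case it helps anyone replicate that setup: the same dual push can be wired up with two remotes and a one-line alias. A sketch with placeholder URLs and remote names:

```shell
# Keep GitHub and the self-hosted Gogs box as separate remotes.
git remote add github git@github.com:me/project.git
git remote add gogs   git@raspberrypi.local:me/project.git

# Alias that pushes all branches to both; ';' rather than '&&' so the
# second push still runs even when the first host is down.
git config alias.pushboth '!git push --all github; git push --all gogs'

git pushboth
```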
But integrating those two repositories with all the automated workflows is quite a headache. Jenkins won't automatically switch over to another git backend if one fails, and other automated tools like code review mostly rely on a centralized repository. I don't see any simple solution to these issues: Bitbucket, GitHub, and GitLab don't offer any easy fallback when you rely on their services for core workflows.
Self-hosting is maybe one solution (we do this with Bitbucket), but it requires major administrative effort to keep it running reliably (always available, with no data loss on hardware/software failures).
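One cheap mitigation for the CI side: make the checkout step itself fall back to a mirror when the primary host is down. A sketch, assuming the CI job runs a shell step and that a mirror exists (both URLs are placeholders):

```shell
#!/bin/sh
# Clone from the primary host; if that fails (e.g. Bitbucket is down),
# fall back to the self-hosted mirror.
PRIMARY=https://bitbucket.org/team/repo.git
MIRROR=https://git.internal.example/team/repo.git

git clone --depth 1 "$PRIMARY" src || git clone --depth 1 "$MIRROR" src
```

Webhooks won't fire from the dead host, of course, so the job still needs a polling trigger (or a manual kick) to start at all.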
I agree. Having CI with multiple git repositories is still a painful thing. (I continue to rely primarily on GitHub here. My backup repos don't have CI enabled.)
I think if your primary repo that's hooked into CI goes down, you're still SOL. Having your own repo just enables you to continue local development among your team.
I don't see any great solution, other than making your app distributed to begin with and doing your build/deploy manually.
As a complete aside, I've fantasized about deploying to all cloud vendors (Azure/GCE/AWS/Heroku/DigitalOcean/misc.) each with their own specific build/deploy and having persistent state shared with something like CockroachDB. Having some load balancer managing state between all instances. Taking advantage of the free/basic tiers provided by all of the vendors.
Two replicated instances across two VPS providers (say, Linode and DigitalOcean), on different continents, $5/mo each. And you'll have spare cycles to run anything from email to a monitoring solution to a build server. How is this not economical?
How do you access these VPSes? A load balancer? You'll need a globally distributed load balancer as well. DO load balancers only tolerate failures across AZs, not across regions, and cost $20/mo. You might need a GCP Global Load Balancer.
How are your git repos stored? Attached block storage? Seriously? Those aren't highly available at all, and it wouldn't even work in your proposed setup because each instance will have its own block storage.
No, that won't do. You'll have to implement some form of cross-region replication, possibly backed by S3 or something. Good luck; it's not easy.
Github costs $7/mo. Is it more economical than that? Can you seriously say that you can achieve higher cross-region redundancy, for both your instances and storage, than Github can, at a price less than $7/mo?
The only way I can think of accomplishing it is to make use of GCP's free f1.micro instance, so you can spin up two of those for pretty cheap in different regions ($3.88/mo). Have DNS hosted somewhere that can resolve to each of your two instances at random ($12/year? Let's say $1/mo). Then you have instance storage: good luck finding globally redundant block storage for practically unlimited repositories, with backups, for $2.12/mo. Let's just leave network egress charges out of it, since those would be marginal.
And let's go ahead and say I value my free time at a conservative $30/hr. It takes me an hour to set this thing up and maintain it every year; horribly conservative. That's an additional $2.50/mo.
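Tallying the figures from the last two comments in one place (nothing new here, just the arithmetic):

```shell
# Budget math for the DIY setup, using only the numbers quoted above.
awk 'BEGIN {
  instances = 3.88                # two f1.micro instances per month
  dns       = 1.00                # ~$12/year of DNS hosting
  storage   = 7.00 - instances - dns
  printf "storage budget left to match GitHub: $%.2f/mo\n", storage
  upkeep    = 30.00 / 12          # one hour per year of your time at $30/hr
  printf "real total including upkeep:         $%.2f/mo\n", instances + dns + storage + upkeep
}'
```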
Maybe you can do it. It isn't laughably economical.
You use the internet. Their problems are already yours.
The only situation in which self-hosting or ditching BitBucket/similarly large providers will help protect you from the fallout of historically-large, catastrophic attacks is if you do all of your development on the same local network as your hosted server (and don't rely on any internet services to access it).
And even if you diligently self-host every part of your own services (not as easy as just plop a gitlab/gitea install on a host you own and start it up), you have to deal with the fallout from internet-breaking DDoS attacks and other malicious activity if you want to use the internet to run or use your code: from congestion caused by compromised devices in your network "neighborhood" (same/similar ISPs or last-mile providers) to DNS outages to BGP hacks, we have seen time and time again that, if not exactly centralized, the systems that comprise the usable internet are certainly highly interdependent. Large-scale attacks of many kinds compromise them.
Instead of tantrums, it might behoove users to understand what kind of SLAs they can promise in order to operate their self-hosted services in such an interdependent environment. Some examples:
- Do you need local power? A local ISP to be up? More than one?
- If you have more than one internet link, how do you pair the connections? If it's via BGP, what happens if the central authority on that has issues?
- If local power is down, does your local ISP's connection stay up? How long does it stay up (is there a node/amp somewhere on the line that cut to battery)?
- Do you need to access internet services by hostname? If so, do you do local DNS caching? If so, how stale can it get in the event of a loss of external DNS?
- Most importantly: how much dependence on external services are you comfortable with (it's a nonzero number unless you're developing for yourself, by yourself, on your LAN), and how much time are you willing to spend eliminating the long tail of such dependencies?
Well, if you can't read, then you can't see what you're writing to, right?
But I agree, it's not ideal that Bitbucket has been experiencing issues. I am also looking at alternatives, most likely using AWS CodeCommit and Upsource. It's been at the back of my mind to move away for a while.
AWS has an easy nudge available to get people to begin using us-east-2 over us-east-1: introduce new features, new EC2 instance types, etc in us-east-2 first.
I think that’s the problem with us-east-1: it’s the guinea pig region, so of course it’s going to have the most problems. If you don’t need cutting edge features, you shouldn’t be there.
us-east-1 feels like the region where Amazon rolls things out first. This has two primary effects:
* sometimes new things break in unexpected ways
* sometimes things get changed for the Rev. B and those revisions don’t get done in Virginia because they’ve already completed the Rev. A rollout.
Also, there’s a secondary effect: because it’s the “default” region, it has a LOT more tenants, which means it probably has scaling and HA problems that none of the other regions do.
> Some component services are currently unreachable due to an upstream incident on a cloud provider. We're attempting to route as much traffic as possible away from the affected components, and are working with our vendor now.