Hacker News
Incident Report for Confluence (atlassian.com)
170 points by taspeotis on April 11, 2022 | 133 comments



Throwaway account, ex-Atlassian employee.

This does not surprise me at all. I came by acquisition years ago, and was wondering when something like this would happen. They've deleted internal Slack and the internal wiki before. Nobody cares about stability or scalability at Atlassian; their incident process and monitoring are a joke. More than half of the incidents are customer-detected.

Most engineering practices at Atlassian focus only on the happy path; almost no one considers what can go wrong. Every system is so interconnected, and there are more SPOFs than employees. So there are a million ways this could've happened anyway.

Even a minor feature requires 6 months of planning across 3 timezones, with architects and middle management in Sydney overriding most of the decisions, and there is constant reorganization. There is a general sentiment of "We've built JIRA and our stock increased >10x, so everything we say and do is true," and all other ideas are invalid to them. Also, every product other than Jira is second-class and unimportant.


It’s just like JIRA the product. If you gave a team a year to come up with a feature list of the “perfect” project management tool it would probably look like JIRA. It ticks all the feature boxes you’d want.

Actually using JIRA on the other hand…


It is annoying to use, because every action takes seconds to complete. Nothing in Jira ever feels snappy. Everything feels delayed. Things are hard to find in the jungle of clickable things in the UI. Things seem to be put in places where I would not intuitively look for them. That is only the UI. There are also bugs lurking in the depths, which will probably never be fixed. Recently I saw a coworker edit a story description and suddenly the browser showed a raw API JSON response, with the whole Jira page gone and only the browser's native JSON display remaining. Recently I tried to sort tasks on my to-do list on the sprint board, but could not arrange them how I wanted, because one of the tasks was a subtask of a task that was not assigned to me and thus not visible in my "only my tasks" filter. You have to make all tasks visible, then search for the parent task and move that higher in the list, then navigate back to only your own tasks, and the subtask will have moved up. Things that take a second with a reasonable UI take 10 or more seconds in Jira. It is a catastrophe of UI design.


The title of this post is incorrect. It's not just Confluence that is down, but JIRA and all Atlassian Cloud services for impacted customers.

Some customers have also been told to expect to be down for up to another two weeks, making the outage a total of 3 weeks or so [1]. 400 companies are impacted by this outage, and at least one of them is a YC company, Bitrise, who are still down; Atlassian is not telling them anything specific, and data loss is a possibility for them [2].

It is strange for the outage to last this long, as it contradicts Atlassian's own account of how they plan for resilience and their statements about how quickly they can restore data for customers. They specifically say that they utilize multiple DCs, have zero-data-loss scenarios, and that their recovery time objective (RTO) is 6 hours for their Tier 1 products (like JIRA). They are failing this objective badly, which raises questions about what is happening behind the scenes. [3]

The outage is especially ironic given that Atlassian have discontinued their self-hosted Server product and are forcing all customers to move to the Atlassian Cloud, the very cloud that is now down for many customers. They stopped selling Server licenses in Feb 2021, disallowed changing tiers in Server products this February, and are stopping support for Server products in Feb 2024.

An outage this long for software that companies depend on is mind-boggling. The only similarly long outage I can recall is the 2011 PlayStation Network outage, which lasted 23 days.

Good luck with the restoration, which sounds like it's manual work. Hopefully we get a public postmortem from Atlassian, though they are not a company known for publishing these.

[1] https://twitter.com/kjartanmuller/status/1513462616030683138...

[2] https://twitter.com/gabornadai/status/1513481270411636738?s=...

[3] https://www.atlassian.com/trust/security/data-management


You should include a link to their recovery time objective for their SaaS offerings: 6 hours.

https://www.atlassian.com/trust/security/data-management


They don’t even back up that often. Why bother having a six-hour target for your daily backups, I mean really… I’d rather you lose less of my data, thanks.


There's no tier with an RPO of 6 hours on that page, unless it has changed. It's 1h, 1h, 8h, 24h for RPOs.


The RTO for products is listed at 6 hours. RPO is the maximum data loss. RTO is how long the recovery process will take. This is saying they should be able to recover the products in less than 6 hours, to a point in time less than one hour before the incident.
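
To make the distinction concrete, here is a rough sketch with made-up timestamps (nothing to do with Atlassian's actual incident timeline):

    # Rough illustration of RPO vs RTO with invented timestamps.
    from datetime import datetime

    last_good_backup = datetime(2022, 4, 5, 7, 0)    # most recent restorable snapshot
    incident_start   = datetime(2022, 4, 5, 7, 38)   # the moment things break
    service_restored = datetime(2022, 4, 5, 13, 15)  # site usable again

    data_at_risk = incident_start - last_good_backup  # bounded by the RPO (here < 1h)
    downtime     = service_restored - incident_start  # bounded by the RTO (here < 6h)

    print(f"worst-case data loss window: {data_at_risk}")  # 0:38:00
    print(f"time to recover:             {downtime}")      # 5:37:00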


You are correct, that was my mistake.


Thanks, updated the post. It's a good callout. They do have an RTO and they clearly struggle to meet it right now.


I know it sounds pedantic, but this bit me in the ass once. You are talking about a retrospective, not a post-mortem. I asked for a post-mortem because of a mistake with a product I was responsible for. The CEO set up a meeting where he asked pointed questions about whether I still had faith in the product and finally asked straight out if I thought it would fail. The reason being that he comes from pharma, where a post-mortem is an autopsy: it's not performed unless you have a corpse to dissect. I explained that in software engineering we have post-mortems about why the coffee ran low. He was not amused.


No, he is talking about a post-mortem. The same word can mean different things in different contexts.


I sincerely wonder how bad a cloud outage would have to be for people to wake up to the fact that any cloud solution can turn into 'data vanished into thin air' at the drop of a hat. Even the big ones have had their moments. I'm always painfully aware that these services are only as reliable as the people that run them and the hardware they run on, and there is absolutely nothing that will stop a cascade of mistakes or failures from making you feel sorry about not having backups outside of the cloud. And that's before you get into acts of malice.

I've seen supposedly very secure and extremely well run orgs lose some or even all of their data to - retrospectively - ridiculously small chances and yet, they happened.


Well, but it's not that you can lose data in the cloud _ONLY_; you can lose data anywhere. I think the main difference is who has to work through the weekend to restore it, and I am quite happy if that's not me. In my case it would be very unlikely, as I would rather quit than work 24/7 to fix a fuckup I didn't cause... but still.


I think there's a difference in customer attitude, too. Cloud providers are often seen as a silver bullet to resolve reliability issues, but in reality you still need to maintain a BCP (business continuity plan).


Yeah, I think this is really the issue here. I do not mind using SaaS at all, but I dislike how people treat it like a silver bullet. You should still have a disaster recovery plan and, if you are big enough, back up the most important things you have in the cloud.



This reminds me of the Lessig comment about the “iPatriot Act” legislation waiting in the wings for some surely unavoidable and unforeseeable “cyber event”, to be used to rationalize total and complete control and surveillance over the whole internet and all communications; in much the same way, “digital currencies” would facilitate total control and surveillance over all human actions, built on that same control of the internet and all communications.

We all know they are working on breaking encryption and anonymity that comes from it, so if you don’t think they would at least allow some major event to happen just to roll out the iPatriot Act, then I would like to talk to you about the once in a lifetime opportunity to buy a bridge in a highly desirable location.


>Sell you a bridge scam...

Which would be folly under the surveillance schemes you warn of!

Never threaten with what some may legitimately consider a good time. You'll be astonished who comes forward.

I don't consider the surveillance a good time, for the record.


I don't know why my comment would remind you of Lessig's comment, can you please explain the link?


Honestly, with how much investment you need in configuring Jira, it's going to take a heck of a lot more than a week-long outage to make the cost of a move worth it.


This is the sunk cost fallacy. Just because you spent a long time on configuring it, it doesn't mean it's worth staying.

You should only consider the current situation, and plan for the future.


Remember, kids: most of the software you have ever used is no longer functional. That browser you used 10 years ago doesn't run today. That server software you used 10 years ago is EOL today. Services come and go. Code dies.

Do yourself a favor and plan the sunsetting of every piece of software and service that you use for your business. At the moment you start using it, plan how you will eventually move away from it. It always happens, and planning ahead leads to much fewer headaches.

This applies to life in general, too; have an exit strategy.


That's funny. Grep, awk, sed, sort, uniq, even emacs. Most of these apps work and run and operate despite being older than me.

Maybe the problem isn't that code eventually decays; maybe the problem is that the vision of what something is supposed to do is lost, and then it has no place later on.


That's because they're not part of a living, breathing, evolving ecosystem. They're great for what they are, but services on the internet need to be constantly maintained, upgraded, and monitored.

It's not that our values as engineers have shifted, it's that old software was built on static platforms that rarely changed. New software has to constantly adapt to changing conditions in terms of security, stability, and performance.


This is actually kind of hilarious. In addition to the Unix examples here, this reminds me of my wife finding a 30 year-old Nintendo system complete with some old cartridges and an old controller. They were in a box since she was in college. Plugged it in and it still worked.

Cloud-based services inevitably die, sure, but this is not true of software or machines in general. There's an entire YouTube genre of people finding and restoring cars that were built 30-90 years ago and abandoned for decades and getting them to run. We just don't build things to last any more, but a lot of that is by design. Subscriptions and planned obsolescence lead to more revenue.

Atlassian could have built software that doesn't rot. Step 1 is don't get rid of the individual server licenses so people can self-host. Step 2 is probably don't use Java because eventually users will have to upgrade to a JVM that isn't compatible to avoid security problems. 64-bit ELF will probably work fine for another 60 years. Step 3 is open source it so users can build it themselves since you'll eventually hit binary incompatibility problems otherwise. Do all of that and there is no reason Jira couldn't be as durable as sed and grep.


> Step 2 is probably don't use Java because eventually users will have to upgrade to a JVM that isn't compatible to avoid security problems. 64-bit ELF will probably work fine for another 60 years.

[Citation needed]

I trust the managed language more in this regard. I don't know the specific situation on the BSD side, but in Linux userland, once a binary uses anything that is not the kernel or glibc, it's anyone's guess whether the ELF will even run if it's not on the exact same packages as the machine that compiled it.


There are very few of us that have to deal with the headaches of our own choices. Even if you are staying at the same company for a decade there is a good chance you moved on to a different team.


This is a great point.

I was working with a DevOps team a couple of years ago and they had written some projects to deploy internal web apps to AWS. It came time to remove one of the apps, but the removal date kept slipping. After much hand wringing and meetings, it was discovered that they had no mechanism to uninstall one of their apps. They simply hadn't considered that an app might not be used anymore.

Nor did they ever consider needing multiple instances of an app, so all their stuff was using hard coded names.

It wasn't a fun project.


>Remember, kids: most of the software you have ever used is no longer functional. That browser you used 10 years ago doesn't run today. That server software you used 10 years ago is EOL today. Services come and go. Code dies.

My Multics and MVS TK4- run pretty well, my DOSBox and Nintendo emulator too, and hell, have you checked out FreeDOS?

>Services come and go

Yes

>Code dies

Doesn't have to.


Speak for industry; in academia there is code still running from 20+ years ago which has been used for publications etc. Not a lot, but it exists and should be applauded.


35% of affected customers restored after a 6-day-and-counting outage. The rest are still in limbo and there’s no real timeline for when it’ll be resolved for those remaining. Atlassian is barely communicating.

“Successful” is doing an awful lot of work in the title.


I can’t even imagine what is going on from a technical level. I don’t think we’re Atlassian scale, but we’ve got a couple million users.

If you blew up our entire AWS account save our backups we could be nominally back online within an hour or two and fully within 6.

I am genuinely puzzled what could be happening with their infrastructure to cause such an outage. Are they self hosted on bare metal and there was a fire or something?


It sounds to me like they lost some sort of master reference table that maps customers to services, instances, data, and so on. Which would mean a lot of manual examination to recreate the map. And it would give you that situation where maybe the first third of restores is somewhat "fast", because it's the bigger customers with more data/logs/activity, and so easier to identify. But now they've hit the long tail.
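
Purely speculative, but the kind of thing I mean is a central registry along these lines (every name here is invented):

    # Hypothetical sketch only: a central registry mapping each customer site to the
    # services and data stores behind it. If rows like these are lost, the per-tenant
    # data may still exist, but nothing can find it until the map is rebuilt and validated.
    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class SiteRecord:
        site_id: str            # invented identifier, e.g. "site-123"
        products: List[str]     # ["jira", "confluence", ...]
        db_shard: str           # which database cluster holds the tenant rows
        blob_prefix: str        # where attachments/exports live

    registry: Dict[str, SiteRecord] = {
        "site-123": SiteRecord("site-123", ["jira", "confluence"], "shard-07", "blobs/site-123"),
    }

    def resolve(site_id: str) -> SiteRecord:
        # Every product in the suite would depend on a lookup like this.
        return registry[site_id]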

They may also be gated by limitations in-between customers and the backups. Like backups going through a transit VPC, for example. You don't see the bandwidth limitation when you're sending incrementals all day, but try and do full, parallel, multiple customer restores through it, and ... (and perhaps they are afraid of making too many changes on top of the current chaos to fix something like that)
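
Back-of-the-envelope, with invented numbers, of why a link that is fine for daily incrementals chokes on parallel full restores:

    # Invented numbers: full restores vs daily incrementals through the same pipe.
    link_gbps = 10                      # shared transit link, gigabits per second
    daily_incremental_gb = 50           # per customer, per day
    full_restore_tb = 2                 # per customer, full dataset
    customers_restoring = 100           # restores attempted in parallel

    link_gb_per_hour = link_gbps / 8 * 3600          # gigabytes the link moves per hour
    incremental_hours = customers_restoring * daily_incremental_gb / link_gb_per_hour
    full_restore_hours = customers_restoring * full_restore_tb * 1000 / link_gb_per_hour

    print(f"daily incrementals: ~{incremental_hours:.1f} h of link time")   # ~1.1 h
    print(f"full restores:      ~{full_restore_hours:.1f} h of link time")  # ~44.4 h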


> It sounds to me like they lost some sort of master reference table that maps customers to services, instances, data, and so on.

What makes you say this?


Just guessing at what would drive a situation where some customers were restored quickly while others are stuck for days; it matches somewhat with the limited statements they have made about the issue. Like this one: "The rebuild stage is particularly complex due to several steps that are required to validate sites and verify data."

Also matches with it being a cross-product outage.

But just a guess.


And not just a cross-product outage, but a cross-product outage that includes acquisitions. I doubt there is that much in common between the JIRA/Confluence tenancy model and, say, Statuspage.


All I can say here is that my experience with ransomware response is that _every_ company has some document stating they can recover from scratch with backups in x hours, and in no case have I ever seen those hours be accurate, or even within an order of magnitude.


Rebuilding the whole thing might end up being a lot easier than what Atlassian seems to be doing, which is restoring each user individually from a backup (?).

Though, going by their speed, this seems to be a weirdly manual process, not sure what is going on there.


> I can’t even imagine what is going on from a technical level.

I always wonder if most SaaS providers are fixated on building massive systems when simple shards would work for most customers. If you compare something like Gitea and GitHub you can probably see the reason. With Gitea your "shard" is siloed away from everyone else because there's no federation (yet?). GitHub is more like a huge social network where everyone can interact.

I think the vision for a lot of SaaS is to facilitate B2B collaboration by acting as a proprietary social network rather than via federation between relatively independent shards. The reason is simple IMO; vendor lock-in.


Former Gitea (Gogs) contributor here. I see your point and you're not really wrong but it's not that easy either:

- GitHub has way more features. More features means more scattered data (e.g. different micro-services databases, file storage).

- You cannot use Gitea in an organization with hundreds or thousands of engineers. It provides neither the features nor the scalability you need. I like Gitea but I'm happy that my company uses GitHub.

- Individuals and open-source use GitHub because it's free and has all the features (including stuff like free CI/CD which is expensive). This is all being paid for by companies like mine who buy expensive enterprise licenses for thousands of engineers.

- It's not uncommon to shard by customer/account, even in the cloud and GitHub is actually trying to do this right now for their primary database. Doesn't solve the problem with different storages for different micro-services though.


Chances are their code is older than yours, hence less automated, and they've since lost some of the people who would know how to do some of the most difficult manual things on failure.


I would guess outsourcing to contractors is involved in this mess somewhere.


It's also one of those companies built by acquisitions.


> If you blew up our entire AWS account save our backups we could be nominally back online within an hour or two and fully within 6.

Do you regularly exercise this, or is it just a plan?


We've brought entirely separate instances up a number of times, from scratch. We sell straight to governments and some of them have very strict hosting requirements.

Our Chinese installation for instance runs on servers controlled by the Chinese government. I believe they run an AWS compatibility layer from what I've heard, I haven't personally worked on this.

We've got a decent amount of experience bringing the whole thing up from scratch, and we build with that in mind. Most of the time in my estimate is for importing the larger detail tables, which we do regularly for our staging environment.


I feel bad for the people who are dealing with this. Quite stressful to manage a disaster like this, especially when it lasts this long.

I hope they get a bonus and some time off once this is resolved.


save your sympathy for the people who have to use their shitty products


There's humans behind every keyboard.


some of whom are psychopaths and to whom tools like jira are a wet dream for micromanaging people, so your statement is only partially true!


haha that as well, plenty of sympathy for everyone, there are no winners here


maybe competition? time to move to better services


According to their Twitter, the root cause of this was a maintenance script that was run: https://twitter.com/Atlassian/status/1511870509973090304. Must have been one hell of a maintenance script to knock out this many services and cause potential data loss.


Change is usually at fault, and a maintenance script is just one form of change. All the big boys have had this (Cloudflare, Fastly, AWS, GCP); it's always a change.


Change is not the cause of a fault.

It's almost the same as saying life is the cause of death in a human.

Change is life; software that doesn't change is dead software.


Agree. Bad (or no) change management is the cause, not the change itself.


Or stable software.


It's easier to charge a monthly fee for software that's constantly changing than it is for stable software.


Which, outside of very few categories, will shortly become obsolete software.


Must've been a script they got by email from someone.


Would be funnier if it was code hastily cranked out due to an overdue Jira ticket that some middle manager sprang on an unsuspecting code monkey an hour before the weekly standup.

Jira is emotional abuse software.


>While running a maintenance script, a small number of sites were disabled unintentionally

"sites were disabled", making it sound like they flipped a bit in the database, when they have clearly destroyed a lot of data.


A guess, but it sounds like they might have screwed up some sort of master reference table that maps customers to services and data stores. So it's possible they have all the data, but lost the map.


I guess “disabled” is a lie in this case.


Luckily (and I count my blessings often for this) our company was not affected by this outage. I can't even fathom the impact it would have on our productivity, communication and general ability to move forward with projects. It definitely does make you reconsider cloud providers for such an integral part of your business.


Especially as I've never heard of a self-hosted Jira instance being down for a week. You just restore the last backup and there you go. Quite simple if it's only one system you have to restore, not millions.

To me that shows again that self-hosting often beats cloud services. Yes, the vendor knows more about their own product, but delivering services at scale is so much more complex than running a small on-prem instance that on-prem usually shows higher resilience. I had the same experience with self-hosted Gitlab and Mattermost (as a Slack alternative).


What are "good" self-hosted alternatives to Jira? We are being forced to either go cloud or move to another product. Cloud is not option for us out of legal reasons and looking at this incident it is also not a good idea in general.


Jira used to have a self-hosted option that Atlassian is in the process of sunsetting. Stories like this highlight why it's a colossal mistake. If I can connect to my work's VPN then it is extremely likely I will also be able to connect to their Jira site. Moving to cloud adds risk without adding benefit.

https://www.atlassian.com/migration/assess/journey-to-cloud


Correction: it still has a self-hosted version; they just got rid of the cheapest tier, which didn’t support clustering. They still sell ”Data Center” licenses, which enable clustering but don’t force you to use it, as you can run it on a single node. With this change they now only have one flavour of self-hosted Jira/Confluence/Bitbucket plus the cloud. Only the classic Server license is being sunset.


The Data Center licenses start at eye-watering and go up from there.


Not denying that at all. There is however some residual misunderstanding left from the time they announced the change towards server licensing. A lot of people who don't deal with administration of Atlassian products thought they were getting rid of the entire self-hosted suite.


The Server instances are also licensed forever; you just stop receiving security updates.

Whether that's a risk people are willing to take is another question.


... for prior licensees. IIRC, they stopped selling new licenses last year.


Data Center licenses start at 500 users, so they did effectively kill self-hosting for small and mid-sized organizations that don't need 500 seats on their Jira license.


It’s not a mistake. The mistake is how they run their cloud infra.


Cloud adds complexity and dependency. What does it buy? If your work depends on the internet or servicing the internet then it's reasonable, but otherwise development tools see no benefit to being hosted offsite.

Since Jira is entirely development tools, it sees no benefit to being hosted offsite.


Cloud Jira saves time and resources for many developers. Also, after removing the self-hosted version, they can introduce true edge computing and move Jira development to another level in terms of performance and scalability. And after this outage: reliability, I guess.


Since moving to post-its on the wall I haven't looked back.

(We're a 1-dev startup so YMMV)


A list of atlassian product alternatives here

https://bye-bye-server.com/


Gitlab?


Gitlab is a double edged sword. It has bugs, some of them are near untenable (don't you dare try to diff or MR commits with binaries larger than 10 MB). It also has a much smaller feature set than full-blown Jira. I'm not in a scrum setting right now, but if I was I would not want to have to use Gitlab as the one-stop shop.


Is anyone even using the full Jira featureset? People keep telling me that X doesn't have as many features as Jira, and I'm just genuinely trying to think of what those could possibly be to justify it? Like what does Jira do that Gitlab does not do that's absolutely vital?


Probably the biggest one vs Gitlab is cross-team functionality. Jira is very configurable that way, allowing you to share issues, epics, etc, in a very granular way. Gitlab was designed to tie things to repos directly. It does have some cross-linking, but it's clunky.

That's also the basis for the cross-org reporting that makes the MBA types happy. Yes, you can debate the usefulness, but that's who is signing the checks.



Using all the features isn't uncommon for large companies with very heterogeneous team structures. And there aren't many alternatives that allow companies to tailor the software as well as Jira does.


Jira is pretty weird too, in that it is never clear whether something is intended behavior or a bug.

As a long-time user of both, I'm much happier with Gitlab, but manager types still struggle a bit, because there is too much technical clutter in the UI for them (they just want issues/boards, not CI/CD).

I got used to having fewer features, and kind of see it as a benefit. We tend to push a product to its limits, then throw it out the window when it becomes too complex.


My experience of gitlab hasn’t been completely positive. It tries to do too many things but the problem with trying to cover so much ground is that eventually you stop working with the product and start working around it. Not to mention how cluttered the UI is because of it.

It still has many positives as an alternative to GitHub that also happens to be open source, of course.


Jira is really buggy too, just with a different set of bugs, so the bugginess is mostly a matter of which bugs annoy you the least.


Fossil could be one of them.

https://fossil-scm.org/home/doc/trunk/www/index.wiki

Before the JIRA days there was https://www.redmine.org/ as well.


Shouldn't Phabricator be a viable alternative for on-prem use? AIUI, it's used in at least one FAANG-level firm, and in lots of FLOSS projects.


I thought Phabricator was not maintained anymore? Is there an active fork?



JetBrains is moving into the cloud. Look at their new products and what they are doing with their IDEs.


I imagine if Atlassian don’t bring back the self-hosted options, they are going to hemorrhage customers in JetBrains’ direction after this disaster.


You can still get Jira for on-premises - the Data Center edition is still being sold and AFAIK there are no plans on discontinuing it.

The downside is that DC is about twice as expensive.


From what I understand you need to license 500+ users. We are in the 50-100 range.


Yeah, 500+ user minimum on the datacenter license at 42K USD per year. I believe the old licenses were forever, even if support wasn't.

If you're using any paid add ons, you'll probably need to pay more for them as well.

It gets really expensive if you've only got 50 people. I can't see many businesses that size being very excited.


Got a feeling this is going to be a net positive for the world. I don't know any developers who are excited about Jira, but I know plenty of managers who are.


Wonder what the issue is where recovery is currently limited to a subset of those affected. Are there multiple issues affecting different customers or is restoration completely manual for some reason and the 2/3 still affected are due to resource (people) constraints?

Also side note, I feel bad for the people working those 24/7 shifts. Burnout starts to kick in and more mistakes can happen.


That's one way to say 70% are still unable to access their data, going on a week now.


> Atlassian successfully restores just over 1 out of 3 affected customers data

That's not the title.

Quoting from the last update on the page:

> no reported data loss


35% is just over 1 in 3…

> we have rebuilt functionality for over 35% of the users who are impacted by the service outage, with no reported data loss

The page title is bereft of more information:

> Multiple sites showing down/under maintenance


It's not clear if the "no reported data loss" concerns the 1 of 3 or 3 of 3?


The situation is appalling, but the headline does not represent the post. The post says that so far they have brought 35% of affected customers back online. The HN title implies that 2/3 of customer data is gone.

I can’t really imagine how you would design a system in this day and age that could fail in such a fashion. But anything that discourages the use of Confluence is ok by me.


I was the CPO of the largest Jira marketplace company for Atlassian. It's amazing how much ARR you can drive, and it is equally scary what it is like to run a SaaS business within the walled garden. Atlassian's DNA is not cloud-based software and it will take them time to get there. In the meantime there are so many better solutions for dev groups.


First Kronos, now Atlassian...imagine one of the big cloud ERP providers going down for a few weeks. Businesses would be halted across the planet.


Can we please change the title to reflect that this is an ongoing recovery? The title makes it sound like 2/3 of affected users certainly face data loss after a finished recovery process.


As someone who uses Jira and BitBucket at work I wish they were more reliable. Switching to something like Azure DevOps is sadly not an option for us so we’re stuck there.


Azure DevOps is a dead end since Microsoft bought GitHub. You can look at ADO’s release notes. The quantity of work released has slowed right down.

I mean the output of one three-week sprint (that’s their cadence) was “we are turning some stuff off.”

https://docs.microsoft.com/en-us/azure/devops/release-notes/...


Here’s their output from a different, recent sprint: https://docs.microsoft.com/en-us/azure/devops/integrate/conc...

Looks like most of their sprints are adding things or fixing things, not removing things.



Nice to know. ADO was floating around the office as some large customers of ours use it.

To be honest, at this point I don’t see us ever switching away from Bitbucket, as our environment (non-tech employees use SourceTree, Bitbucket Pipelines, etc.) is centered around it.


Another reminder to not trust them with any data you have not stored elsewhere. Next time they might delete your confluence wiki content or whatever else.


Another reason to run things locally that 'just work for you', especially when you are a smaller company/team.


Our experience is different. We have been using Jira Cloud for more than 5 years and I don't even remember the last time an outage affected our work.

Meanwhile we still have some self-hosted services (like SVN) which have problems relatively frequently and require manual attention from engineers. I probably spent more than a week maintaining our self-hosted services last year, compared to 0 minutes for the cloud services we use. That's not even counting that our self-hosted services are probably not very secure, since we sometimes forget to do updates on our servers for a long time.


Why is SVN still in use? You can use a tool like reposurgeon to convert your full repo history to git.


I can't talk for the parent, but in our company we made the change hg -> git (using self-hosted Bitbucket) and it was quite a costly endeavor time-wise (between testing, configuring and installing BB, converting/moving all the repos, making the changes to our in-house build system, changing our whole code-signing system (because it wasn't git-compatible), and supporting all the devs who had never used git (or hadn't in a long time and had forgotten everything)), so I can understand not wanting to change what's working.


We use git for all software projects, SVN for all hardware projects. I am not certain about the reasons why hardware department still uses SVN. Probably because every hardware engineer is familiar with SVN and it works for them.


Not the parent, but we use it to store binaries. Code is in git.


The irony is they don't let you host your own system anymore unless you run 500+ accounts. We don't want to and can't switch to a hosted solution so we are still trying to figure out what we are going to do when our licenses expire.


Isn't that because you weren't hosting local 10+ users in the first place?


Anyway, this HN story is about Atlassian screwing up cloud hosting. Let's not get away from the story.


Is Atlassian’s CTO still working at the company? If yes, what are his daily tasks?


Virtue signaling about the topic du jour (BLM, Women at Atlassian, etc.) and how they've sunk hundreds of engineering hours into changing all their repos from "master" to "main" thanks to the heroic work by their Diversity, Inclusion and Equity in Engineering team.

Bet they wished they spent those hours elsewhere now.


I guess he is trying to make jokes to lift spirits while sitting behind some senior dev who has to take a deep dive into a legacy codebase no one touched in 5 years.



The articles I read seem to imply this is less about the outage and more about the current growth numbers being unable to meet the medium term targets that have been painted, and the fact that the P/E ratio is very high.


I don’t know the actual numbers involved, but people are already jumping to the conclusion that it means 70% of customers are impacted.

“Affected” is doing a lot of heavy lifting in this headline. The linked page states that a small number of customers are affected. So it’s 70% of a small number, where “small number” is defined by Atlassian, so we don’t really know the impact here. As best as I can tell my own usage has not been impacted at all.


Our Confluence is working fine and as far as I know has been working fine all week - no one has mentioned it in slack or anything, which I would expect.

I think Atlassian are probably not helping themselves by calling the number of customers "small" rather than saying the actual number or specifying a percentage of total customers.


Every HN discussion about this was rather tame and everyone respected the scope of the outage. The only one jumping to conclusions at the moment is you.


I replied precisely because the first comment I saw on this thread was “That's one way to say 70% are still unable to access their data, going on a week now.”

I also don’t presume everyone to have read every thread on here every day.


The thread and the post are about a partial outage, and 70% of the affected users are not able to access their data. The comment was a snarky jab at the marketing speak. How you read it as the commenter meaning that 70% of all Atlassian users are affected is not comprehensible.


Hey y'all! Seems like there's a lot of frustration with Jira and Atlassian on this page, and understandably so.

This is extremely forward of me, and I hope you'll forgive my professional proactive nature here, but I'm a member of the ClickUp sales team and I'd be happy to learn about your teams and see if we can better serve your needs.

If you're interested, or you think your organization could benefit from at least TRYING to see if there's another solution out there, feel free to schedule a meeting here:

https://a.clickup.com/c/coreyelder#/select-time

Humbly,

ClickUp Cowboy



