1. Can they confirm that they have backups of our data (about a thousand stories, substantial Confluence content, Opsgenie history, and three service desks)?
2. Will our integrations, configuration, and customizations also be recovered, or will we need to rebuild those once our data is recovered?
I have received no response, and no human is even willing to acknowledge those questions. The service desk staff ignore them as if I never asked. Repeatedly.
Also, I've been asking around and haven't been able to find a single story from somebody who was down and has since had their data recovered.
I was down, my instance is fully restored right now.
1. They do, every 25 hours, via snapshot. I have spoken to their team since the incident, and the same thing is in this article.
2. Yes, they recover all of it. Some things have had issues: external mailboxes attached to service management projects, some attachment rendering slowness, and filters needing to be overlaid onto our instance again. Otherwise it is running fine again.
Not sure what to tell you other than they are fixing life saving companies first, then the rest. That is what they have told us.
Thank you for responding here. Even a single successful restoration, anecdotal as it is, makes me feel a lot better about the situation. I was honestly wondering if Atlassian was just pushing the date out to soften the eventual backlash when they had to announce data loss.
FWIW Jira and Confluence are used in life sciences firms, though they are rarely more useful than Google Docs. Self-hosting in this context was the norm, but that is changing.
There is something I don't understand from their blog/report [1]:
If the script was used in "permanently delete" mode, which is intended for compliance... how do you restore?
Is the only explanation that the deletion wasn't actually compliant?
> Second, the script we used provided both the "mark for deletion" capability used in normal day-to-day operations (where recoverability is desirable), and the "permanently delete" capability that is required to permanently remove data when required for compliance reasons.
Update: My instance was recovered over this weekend (Apr 16). We've verified that integrations (webhooks, Jira integrations) are working, and no data was lost that we can see (nobody works overnight, so our last change was at the end of the working day, which matched our recollection).
I once worked at a company that had a data-loss incident. There was nothing else we could do; we had exhausted every option we had over almost 40 hours. At the end of the second day, it was decided to restore from backup.
We had done this before, as a test. It took about 12 hours to restore the data and another 12 hours to import the data and get back up and running.
One small thing was different this time, and it had huge consequences. As a cost-saving measure, an engineer had changed the location of our backups to the cold-storage tier offered by our cloud provider. All backups, not just 'old' ones.
This added 2 additional days to our recovery time, for a total of five days. Interestingly enough, even though we offered a full month's refund to all of our customers, not even half of them took us up on it.
Hi, I'm Mike and I work in Engineering at Atlassian. Here's our approach to backup and data management: https://www.atlassian.com/trust/security/data-management - we certainly have the backups and have a restore process that we keep to. However, this incident stressed our ability to do this at scale, which has led to the very long times to restore.
Hey Mike, not dumping on you personally, but that document claims an RTO of 6 hours. I can understand that being a target, but we're at 32x that RTO target, with a communicated target date of another 12 or so days IIRC. That's literally two orders of magnitude longer than the RTO. I don't think any rational person would take that document seriously at this point.
I'll also ask (since nobody else has answered, I may as well ask you as well):
1. Are the customers actually being restored from backups (and additionally, by a standard process)?
2. Will the recovery also include our integrations, API keys, configuration and customization?
Hi Ranteki, you're right that the RTO for this incident is far longer than any of the ones listed on the doc I linked above. That's because our RPO/RTO targets are set at the service level and not at the level of a "customer". This is part of the problem and demonstrates a gap both in what the doc is meant to express and a gap in our automation. Both will be reviewed in the PIR.
Also, the answer to (1) and (2) is yes.
A friend in Atlassian engineering said the numbers on the trust site are closer to wishful thinking than actual capabilities, and that there has been an engineering-wide disaster recovery project running because things were in such bad shape. The recovery part hasn't even started. If Atlassian could actually restore full products in under six hours, they should have been able to restore a second copy of the products exclusively for the impacted customers.
Nah. The RTO/RPO assumes that only one customer has a failure big enough to require a restore.
When the entire service is hosed, that's a totally different set of circumstances, and you have to look at what the RTO/RPO are for basically restoring the entire service for all customers. And since they have more than a thousand customers, it totally makes sense that it would take orders of magnitude longer to restore the entire service.
I think this document and incident is a decent example of common DR planning failure patterns.
It is explained here that Atlassian runs regular DR planning meetings, with the engineers spending time planning out potential scenarios, as well as quarterly tests of backups and tracking of findings from them.
So, with those two things happening, I imagine the recovery time objective of <6 hours took a typical "we deleted data from a bad script run affecting a lot of customers" scenario into account, along with the metrics from the quarterly backup tests.
That doesn't even come close to the recovery time we are seeing now, however. We're coming up on two orders of magnitude more than that.
The above doc seems pretty far out of line with what is currently happening.
It's 400 tenants scattered across all their servers, so they are most likely having to build out servers to pull the data and then put it in place. That's 10x the problem that restoring a single server would be.
This is why I love GCP Cloud Storage. The "colder" tiers are cheaper and reads from them simply cost a lot more, but they aren't slowed down and don't take days to restore. You pay with dollars, not time, when restoring those GCS backups. E.g. Coldline [1] simply has reduced availability in exchange for being cheaper (99.9-99.95% availability, so 43 min/mo, way less than "two days").
Not every business can afford to go one month without income. What's the best thing for customers? Have the business go bankrupt and irremediably lose access to the service?
Fastmail gave 1 month of free service to about 2/3 of our customers after a major disk failure that led to about a week of downtime for them as we recovered from backups in... 2005ish, I think. Long time ago - it was a pretty major hit, and the wave in income is still visible all these years later as a lean month where there are no renewals from that batch! Definitely the right thing to do though.
400 clients, but how much of their revenue? Were they all small clients? Not to mention the longer-tail effects of people moving away to competitors even if they weren't directly affected?
Good faith would be to lose all of that money to people who are already your customers.
Business-wise would be to stay in their good graces and keep those customers by offering the refund, but you don't lose any money to those who either don't care or won't move to a competitor.
25 years ago the clutch in my beater truck was slipping. I was 16 years old, making $50 a week and had very little in savings. I took that truck to a shop within walking distance of my job.
2 hours later I walked back to see what they found. I figured it would be several hundred dollars for a new clutch, and I'd have to borrow money or something to get it done. I talked to the owner, who told me it was an adjustment on the cable. Just needed to be scootched up a bit and it was probably good for another 30k miles.
When I asked him how much I owed, he laughed at me and said, "For that? Not worth writing it up. No charge. You want me to show you how to do it yourself next time?"
The shop could very easily have charged me 1 hour of labor at their standard rate, maybe $75 or so. Plus a diagnostic or test drive fee. Whatever. He could have told me, "$123.98" and I would have paid it. I wouldn't even have been mad. But I sure as hell wouldn't have remembered the experience so clearly. Nor would I have told a dozen people over the years to take their cars there. And I definitely would not have driven 20 miles out of my way to return to that shop in the future years.
Being cynical about this stuff will hurt your brand. It's not obvious. It doesn't show up on the earnings report as a line item. This is service segmentation that seems like a no-brainer to a clueless MBA, but actually matters in the long run. How people view your brand is immensely important.
Not forcing customers you already screwed over to then spend more time chasing a refund is not only the right thing to do, it's also good business.
Your anecdote is nice, and sure, it can be good advertising to give stuff away for free, but it doesn't really apply here.
If you were charged $123.98 and you said, "hey, I told you where the problem was, why am I being charged a diagnostics and driving fee?" and they corrected it by telling you the whole thing is on the house, is that not good business sense?
Even by your own admission, you would have gladly paid that $123.98 with no issues and you wouldn't have been mad about it. So from a business perspective, if they can provide a service, get paid for it, and the customer has no qualms or issues with the transaction whatsoever, in what way is that hurting the brand or being cynical? I think that's a much more business-wise action to take than to give away your services.
> If you were charged $123.98 and you said, "hey, I told you where the problem was, why am I being charged a diagnostics and driving fee?" and they corrected it by telling you the whole thing is on the house, is that not good business sense?
No. I'll be happy that I saved on the money, but I won't trust them in the future. They're now "the place that tries to get away with things" in my mental Rolodex. Better to stick with the fee and know their value. (I didn't tell them where the problem was. All I knew was that the clutch wasn't grabbing anymore. I assumed it needed a whole new clutch.)
> Even by your own admission, you would have gladly paid that $123.98 with no issues and you wouldn't have been mad about it. So from a business perspective, if they can provide a service, get paid for it, and the customer has no qualms or issues with the transaction whatsoever, in what way is that hurting the brand or being cynical?
It would have been a fine decision, sure. But in that case that would likely have been the only business I did with them. Not out of spite or anger, but because I'd have no reason to pick them for future business. I would instead ask friends for recommendations, or pick some place closer to my future residences.
But what actually happened was that I was the one steering people to them. I also went out of my way to return to them for brake jobs, simple oil changes, etc. I was a loyal customer, and probably spent or caused others to spend over $5,000 there.
He had absolutely no way of knowing that would result. But if you just treat people right, the way you'd want them to treat you, you build a reputation. It pays back.
I know this story comes off a bit Pollyanna. I get it. For a cynical and non-altruistic explanation: when it takes a technician literally 5 minutes to twist an adjustment nut and verify that was all there was to it, stop and think about the bigger opportunity before you robotically mark '1.00' in the "LBR HRS" field on an invoice. Especially if you're operating in a field that's notorious for rip-offs.
> I think that's a much more business-wise action to take than to give away your services.
I'm not saying businesses should give away major services. But they should avoid the temptation to nickel-and-dime as well. That's on the other end of the optimization curve. Not good business.
> He had absolutely no way of knowing that would result.
I think he absolutely knew that building trust is key to solid, long-term, repeat business - not only from the direct customer whose trust he has earned but also the zero-effort initial positive trust-balance he will have with his future/potential customers, even before he has done anything for them, just via word-of-mouth referrals. Such a simple concept but it just doesn't compute for some people.
> But if you just treat people right, the way you'd want them to treat you, you build a reputation. It pays back.
Reducing the impact analysis within a long-running relationship to a single transaction is too narrow. People observe how other people are treated and draw their conclusions even if they aren't impacted. People may tolerate some abuse, but it moves them closer to leaving next time. Money lost in the outage may justify creating a budget to look for an alternative.
A lot of people making those decisions don’t care about a refund because it’s other people’s money anyway. In my experience only small companies care about that.
Focusing on communicating openly and honestly allows them to explain to their bosses the crap they're going through because of your mistakes, so in fact you can help them save their asses, and they'll save your ass in return. This is much more important and valuable than a refund.
So you should ALWAYS communicate openly and honestly, and offer the refund as an option for clients who do not have a boss to account to.
I've seen cases where it was actually _more_ work for a business to process a refund. That money has to go all the way back through accounting/financing, be re-added to budgets for the appropriate groups, etc. It's not something done all the time so it takes extra time for those working on it. It's not like a Visa credit card getting a refund for a wrong coffee order.
Did I ever tell you guys about the time we accidentally nuked all the mailboxes for all the million-plus users on The Global Network Navigator (GNN) site? And how the restore process failed for us?
This hasn't been written up at The Register yet, so I don't have a single URL I can share with you.
I remember finding out one of the senior managers from my company ended up as head of software at Atlassian. It was at that point I was convinced Atlassian has no idea what the hell they're doing. I think this demonstrates the point nicely.
I interviewed with them about 8 years ago. The people interviewing me, from the recruiter on the phone to the engineers, were some of the most incompetent and unprofessional people I have dealt with. They ended up poaching and hiring some equally incompetent engineers from my wife's startup. The outage and what everyone is saying in this thread is no surprise to me.
It's funny you say that, because he actually did end up coming back to work for the company, but it was a while ago now. He doesn't even list Atlassian on his LinkedIn; he just pretends he never left the company.
We use on-premises setups for almost everything (we generally avoid cloud solutions so we have full control of our data). Sometimes (approximately once a month) it goes down for a few minutes, which already feels like torture because all our processes depend on it; I can't imagine having no access to it for several weeks, all our work would grind to a halt... The office of the guy who administers the on-premises servers is literally next door; all it takes is a visit to him and everything works again after 5 minutes. Reading horror stories like this (Slack being down, Atlassian being down, no one knowing what is happening or when it will end, etc.), I wonder why many companies choose cloud solutions for critical business processes. Is it pricing? Ease of use? I can understand why very small companies would choose it, but I don't understand why a medium/large business would choose anything but an on-premises setup.
> but I don't understand why a medium/large business would choose anything but an on-premises setup.
Atlassian is in the process of killing the on-premise small/medium business option, already announced an EOL date.
My choices are to move to the cloud, buy a 500+ user solution for a much higher price, or migrate away. Of course, I use the local database and have local services that JIRA/Confluence talk to, so moving to the cloud isn't really an option.
I assume a lack of competent on-site staff 24/7, having someone else to blame, and lower costs are why people choose the cloud over on-premises, though.
I am biased but I can tell you what works best for mid-large companies: having a solution provider. Basically a partner that hosts and maintains the instance and has enough Atlassian certified people to help you with any question so that you will never have to hire people to just maintain the beasts or tell you about features, tricks or plugins that could solve problem X.
Experienced people hosting and tuning Atlassian products have a greater success rate than someone doing it alone for a large company. Almost every time I've migrated an old Atlassian installation under our wing, it's shocked me how users have been made to suffer the loading times and performance that come from underprovisioning (the DB or the actual machine) and messy configuration. I'm not blaming the former admins, but it just happens. Usually end users are happy after we clean the mess up and everything feels snappy.
Disclosure: I’ve worked in this kind of expert role.
It seems like if you are going to pay for a bunch of SaaS seats AND a team of technicians/engineers to make it work, you might as well just do the latter and roll your own solutions...
A lot of these SaaS are just glorified Rails apps with a patina of professional "security" and "reliability", and loads of extra junk that your co will never use.
Trust me, if someone could clone Jira and its functionality they would have done so already. Truth is that if you build one product for 20 years you have a giant lead in features. If all it took was having a Kanban board then Jira would have died years ago.
Isn't it one of those "No one ever got fired for buying XXXXXX" type of situations?
Maybe I'm wrong, but the impression I had of Jira is that, just like using SharePoint for file storage, the C-level people want it because they were told that's what big enterprises are using. And if it doesn't fit the needs of the company and everyone hates it, they just blame the employees or lack of training.
It’s just so flexible that tracking projects and working together is easier with it. Decades of feature requests have made it good for people who want everything made for them or people who want to customise.
I can't see the difference between a "solution provider" that hosts your Jira and just getting Atlassian to do it. What's stopping the solution provider from accidentally running a script that deletes some customer's files and struggling to do a partial backup restore?
Because you can get the best parts of self-hosted and managed services. And on that backup question: self-hosted Atlassian is vastly easier to protect against disasters. The problem these Atlassian guys had arose from multi-tenant architecture. Usually managed service providers will host your stack on individual databases and VMs, and backing up the software is just a matter of taking pg_dumps and rsyncing certain directories (pretty old school) or just taking disk-level snapshots.
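For the pg_dump-and-rsync approach described above, here is a minimal sketch of what a nightly backup driver could look like. The database name, directories, and backup host are placeholders for illustration, not anything a real provider necessarily uses:

```python
#!/usr/bin/env python3
"""Minimal nightly-backup sketch for a self-hosted, single-tenant stack.
Assumptions (not from the thread): a Postgres database called 'jira', an
application home/attachments directory, and an SSH-reachable backup host."""
import datetime
import pathlib
import subprocess

STAGING = pathlib.Path("/backups")               # assumed local staging area
DB_NAME = "jira"                                 # assumed database name
HOME_DIR = "/var/atlassian/application-data/"    # assumed home/attachments dir
REMOTE = "backup-host:/srv/backups/atlassian/"   # assumed rsync destination

def backup() -> None:
    STAGING.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_file = STAGING / f"{DB_NAME}-{stamp}.dump"

    # 1. Logical dump of the whole tenant database (custom format, compressed).
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={dump_file}", DB_NAME],
        check=True,
    )

    # 2. Ship the dump and the home directory to the backup host. rsync only
    #    transfers what changed, so repeated runs stay cheap.
    subprocess.run(["rsync", "-a", f"{STAGING}/", REMOTE], check=True)
    subprocess.run(["rsync", "-a", HOME_DIR, f"{REMOTE}home/"], check=True)

if __name__ == "__main__":
    backup()
```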
Many medium-large corporations have their own cloud environments that their IT Ops control. Solution providers can host Atlassian stacks on that cloud environment, where they are not affected by data privacy concerns (it's in the already green-lit cloud provider's data center), so they can host it behind a firewall with only VPN access allowed. They can also do all the magic you can usually do with web software, like put a frontend proxy in front of it, or use more flexible/legacy authentication methods. Not to mention that, for example, you could have a Jira Cloud instance that you need to integrate with an SCM program. Jira data could be "OK" to live in the cloud, but code would be a big no-no. These problems can be solved by having them all live behind the firewall.
A competent managed solution provider also has consultants that can train or instruct on usage. It costs but it is simpler and faster than having to go through the forums or send a support ticket for every small issue to Atlassian itself.
Correct. There are probably not a lot of MSPs that have so many customers that they need to share that much data, and their customers probably use MSPs for the strict purpose that they don't want to share things with other companies.
We migrated from Slack to self-hosted Mattermost so we'd avoid being down. (And, I guess, for the money.)
Mattermost is so much worse; the slowness and general issues are not worth it. And in the end it is down more than Slack ever was, because it has performance issues.
I am not sure if it is Mattermost's fault or our fault, but my friend at another corporation has a similar experience with it. Then again, maybe we just don't know how to host MM in general, I dunno.
Cloud solutions can work well. I've used GitHub, Azure Devops, and BitBucket (another wonderful atlassian product /s) and BitBucket frequently craps out, multiple times a week. We need to rerun builds in TeamCity because BitBucket stops talking to it.
What do you do if your on-prem setup loses data? There is an implicit assumption here that on-prem is more reliable than cloud: less downtime, fewer chances of data loss, etc. Obviously it depends on which cloud product we're talking about, but I don't think a blanket "my on-prem goes down less, and when it does go down I can get it back up sooner" is true.
There are, IIRC, 3 or 4 people in their department, and they administer the whole building (wifi, security cams, LDAP, etc.), not only the on-premises servers. From what I gathered, our internal systems usually go down due to lack of disk space or some bug in the software which merely requires a reboot; it's not rocket science. Another thing is that our IT department (for internal systems) and the SRE department (for client-facing systems) have 24/7 on-call duty, so it's unlikely that no one will respond.
The same thing that the cloud company would do. If there are other people there who share that guy's responsibilities, have them do it. If there aren't, you should have an on-call.
Cloud just outsources that problem to another business. Sure, they have better reasons to actually cover those positions and make sure they have on-calls and backup and a disaster plan, but just because you pay extra money for it doesn't actually make it work better if the company underlying it sucks.
There are always technical and economical pluses and minuses to any approach - but never underestimate the politics and character flaws that see dubious decisions pushed through by senior management regardless of those rational arguments pondered by the lower orders.
In our experience, this strongly depends on the services involved, as well as the scale.
For example, for our own service: if you have a hundred or two hundred licenses, you can drop our system on a Linux box, and usually you have to throw a yum update and one or two service restarts at it every few months and it just works. I honestly wouldn't be surprised if many of our small on-prem installations have better uptime than the SaaS clusters, or are capped in uptime by some externality, rendering the system downtime irrelevant. If their VMware cluster is down, our system is down, but no one cares.
This also mirrors a lot of our internal systems. At a small scale, you can just dump Chef, Jenkins, Sonar, Nexus, whatever on a Linux box and forget about it.
However, this changes with high license counts. We have individual customers in our SaaS offering that are more than 50-100x bigger than the small on-prem systems. At that point, our SaaS offering is better than anything the customer could do on-prem. I'm confident saying this about all of our customers, except maybe 2.
If anything, a smaller company with a smaller footprint and fewer total requirements is more likely to be able to manage a vertical slice of some SaaS product.
The reason things like github go down so often is because they are public/shared resources.
>The reason things like github go down so often is because they are public/shared resources.
Very much this. Managing shared resources at scale is pretty hard. We have a bunch of internal sites made by interns as part of their internships, and, funny enough, those sites have much greater uptime and appear more stable than our own multi-tenant SaaS solution made by seasoned devs.
I've heard this argument many times before, but is there research into this? I.e. where they would compare uptime of cloud vs. on-premises across a wide range of companies.
I mean, you're going to get biased results, no? Only companies who are confident in self-hosting will self-host it. You won't have any real data about companies who are not confident in self-hosting maintaining their on-premises version of the software.
In this case it's survivor bias in that "We did these things and we didn't fail, ergo these things must be great."
Whenever you see a talk like this, always assume that it's BS. It might not be used by any real customers, or might still be in development. There might be a bunch of fires happening all the time due to things the talk doesn't mention. And it might be shuttered the next month if it's too expensive, complicated, obscure, or hard to support. These talks should only be considered aspirational sources of ideas, but never taken as a gold-standard battle-tested model, until they tell you how it fails. Only after you know how a system fails and how to respond to it can it be said to be reliable.
Focusing on the practices of successful companies makes you overlook the millions of other companies with the same practices, yet going bankrupt.
It is only through understanding what can fail that you can figure out causation.
And since Atlassian failed here, the talk might expose some of the failure's causes, or at least cast doubt over the usefulness of the practices presented.
All I can say as an Atlassian Server products user is that the moment they said it was Cloud or nothing, I chose nothing.
I'd much rather run Gitea on a Raspberry Pi that I CONTROL than sit impotent, unable to do anything for more than a week. Plus, having worked at cloud companies and having been asked to "collect customer data" to hand over to the government, I would NEVER move critical pieces to anyone else's infra...
(Note: I am not supporting crime, but I'd rather have privacy and criminals than live under an authoritarian regime where a dictator who knows everything about everyone keeps the "peace"... Yes, I am looking at you, China!)
If mistakes are going to be made, at least I won't pay others to make them for me...
AFAIK the Datacenter pricing starts at 500 users and goes up from there. So a small org could end up paying 5-10x what they were before on the Server license.
We're a team with <20 Jira users. For us it effectively is Cloud or nothing, and we weren't very keen on going Cloud even before this current clusterf...
The on-prem offering of Atlassian was discontinued. Existing contracts are being honored but as of March 2022, that's the end of the line for it. Maybe it will be revived now.
Selectively restoring data only for certain rows is super hard. But the communications from Atlassian have been the worst I have ever seen in the industry.
I actually got an email from our Atlassian contact just the other day encouraging us to switch to their cloud service. Crazy that no one thought to pause those. (I assume it must have been scheduled.)
This article on HN is the only time I've even heard that Atlassian was having a problem. I suspect that 99% of the tech "community" has absolutely no idea this is happening.
We use Jira, but it's self-hosted for my team. Maybe other teams that have transitioned to the cloud version are aware that there's a problem, but I haven't heard about it.
If the database schema for Jira on the cloud is anything like the Datacenter version, I'm not surprised they're having a hard time restoring data. I once tried to figure out how to find duplicate / redundant project schemas by querying the database (the required APIs are cloud-only) and could not even find which tables stored half the data, never mind how they referred to each other.
As this continues, I suspect this might be one of the few times where a lack of transparency / good communication really... might not make things better or worse, because the situation is so bad that transparency would look horrible just the same.
Granted, that's how all lies start / what people sometimes assume when they're wrong, but... maybe this is that time?
Maybe it is in fact so bad that honesty would be a push or worse?
Hi, this is Mike from Atlassian Engineering. You are right the communications from us have not lived up to our standard. We will focus on this specifically once we restore service and get the post incident review out there. More details here: https://www.atlassian.com/engineering/april-2022-outage-upda...
Well, why are you writing a blog post and posting the link on HN? We're not directly your customers. Did you apologise individually to the customers you ignored? You don't have to apologise to anyone here.
> Selectively restoring data only for certain rows is super hard.
What's the right way to structure your data that would make restoring more straightforward? Is this backup/restore scenario niche, or should they have designed for it?
in theory, shard your customer databases 1:1, job done. alas, in practice, many SaaS compromise this in two ways:
a) overwhelmed by creeping featuritis, each customer's data has relationships to global tables, and
b) they backup their entire database cluster in one snapshot
and there may be other gotchas for restoration, like relying on denormalized views and caches that have to be rebuilt. they may also have erroneously assumed that data protection's main value driver is whole-of-system disaster recovery, which can lead to pathologies such as "we don't have a single-customer restoration tool".
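As a rough illustration of the clean 1:1 picture and the gotcha in (a), here is a toy sketch of per-tenant sharding behind a routing table. All names and paths are invented, and sqlite3 stands in for whatever RDBMS a real SaaS would use:

```python
"""Toy sketch of 1:1 customer sharding: each tenant lives in its own database,
found via a small routing table kept outside any tenant database. Names and
paths are invented; sqlite3 stands in for a real RDBMS."""
import pathlib
import sqlite3

SHARD_DIR = pathlib.Path("tenants")
SHARD_DIR.mkdir(exist_ok=True)

# Routing table: tenant -> its own database (in practice, a tiny control-plane store).
TENANT_DB = {name: SHARD_DIR / f"{name}.db" for name in ("acme", "globex")}

def connect_for(tenant: str) -> sqlite3.Connection:
    """All application queries go through here, so backing up, restoring, or
    deleting one customer touches exactly one database."""
    return sqlite3.connect(str(TENANT_DB[tenant]))

# The gotcha in (a): if tenant rows also reference *global* tables (billing
# plans, marketplace apps, feature flags), a per-tenant restore has to
# reconcile those cross-database references, and the clean picture breaks down.
with connect_for("acme") as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS issues (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO issues (title) VALUES (?)", ("example issue",))
```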
> not all ORM frameworks handle this case well, if at all
typically this is probably for internal reporting/metrics. But yeah, a custom script with direct SQL is in order. Personally my opinion is avoid ORM at all costs. Never seen a benefit that wasn't trivially done in SQL, and the downsides are incredibly painful.
The big downside of sharding out, per customer, is that it's a lot of databases to migrate on upgrades, or roll back if shit hits the fan.
The upside? You can have customers on different versions of your app if you really wanted to do such a thing.
In any case, proper tooling goes a long way to making it the difference between wonderfully manageable and torturous nightmare. Think idempotent backup scripts that are capable of failing at any time and resuming where they died, etc.
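A small sketch of what that kind of idempotent, resumable tooling could look like, assuming one backup unit per tenant: the driver records which tenants it has already handled, so it can die at any point and simply be re-run. dump_tenant is a placeholder for the real export step:

```python
"""Sketch of an idempotent, resumable per-tenant backup driver: progress is
persisted after every tenant, so the script can die at any point and simply
be re-run. 'dump_tenant' is a placeholder for the real export step."""
import json
import pathlib

BACKUP_DIR = pathlib.Path("backups")
STATE_FILE = pathlib.Path("backup_state.json")   # records finished tenants

def dump_tenant(tenant: str) -> None:
    # Placeholder: export this tenant's data somewhere durable.
    BACKUP_DIR.mkdir(exist_ok=True)
    (BACKUP_DIR / f"{tenant}.dump").touch()

def run_backup(tenants: list[str]) -> None:
    done = set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()
    for tenant in tenants:
        if tenant in done:
            continue          # already handled by an earlier (possibly crashed) run
        dump_tenant(tenant)
        done.add(tenant)
        # Persist progress after each tenant: a crash loses at most one unit of
        # work, and the next run resumes exactly where this one died.
        STATE_FILE.write_text(json.dumps(sorted(done)))

run_backup(["acme", "globex", "initech"])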
All of your points (minus maybe the first one) should be "easily" solved/implemented in a company the size of Atlassian, and maybe there are newer customers sharded like this already. IMO what happened in this case is basically tech debt that is now being paid with a loooot of interest.
Would it be fair to estimate that the majority of SaaS companies aren't sharding like this then? Seems like a lot of downsides that impact everything often except for backups, which you'd restore rarely.
ISTM the fairly obvious approach would be to bring up a complete copy of the affected database(s) and move the affected tenants to that "copy", while eventually deleting the non-affected tenants. I can't imagine they don't have the ability to move tenants to different shards; they've got to need that to deal with quickly growing customers etc.
The other difficulty is if you don't restore the entire state in a single transaction. Imagine you have partial data restored in Table A but haven't updated Table B correspondingly. Now some other program that consumes Table A and Table B and doesn't have error handling will crash (or worse, mutate state in other weird ways).
- Can you restore data for a single customer, and if so, what is the RTO for that operation?
A smaller SaaS could be excused for only thinking about full database restores. When you're a scrappy upstart, thinking about hypotheticals is less important than survival.
But for any decent size multi-tenanted SaaS, it's imperative that you have the ability to selectively restore individual customers.
The usual approach is to do a full database restore into a separate instance, then run your pre-prepared "restore customer" scripts to extract a single customer's data from there and pump it across to your prod instance. In Oracle, for example, you might use database links to give your restore code access to prod and the restore instance at the same time.
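A simplified sketch of that pattern (restore the full backup off to the side, then copy one tenant's rows into prod), with sqlite3 and invented table names as stand-ins; a real version would also worry about foreign keys, sequences, and a maintenance window:

```python
"""Sketch of a single-customer restore: the full backup has been restored into
a side instance, and one tenant's rows are copied across into production.
Table names, the tenant_id column, and sqlite3 are stand-ins for illustration."""
import sqlite3

TENANT_TABLES = ["projects", "issues", "comments"]   # listed in FK-safe order

def restore_customer(tenant_id: int, restore_db: str, prod_db: str) -> None:
    src = sqlite3.connect(restore_db)   # full backup, restored off to the side
    dst = sqlite3.connect(prod_db)      # live production database
    with dst:                           # one transaction: all or nothing
        for table in TENANT_TABLES:
            # Drop whatever partial/bad data the tenant still has in prod...
            dst.execute(f"DELETE FROM {table} WHERE tenant_id = ?", (tenant_id,))
            # ...then copy that tenant's rows over from the restore instance.
            rows = src.execute(
                f"SELECT * FROM {table} WHERE tenant_id = ?", (tenant_id,)
            ).fetchall()
            if rows:
                marks = ",".join("?" * len(rows[0]))
                dst.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)

# Example (assuming both databases and their tables exist):
# restore_customer(42, "restore_copy.db", "prod.db")
```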
This is a great opportunity to be an Atlassian competitor. I'm sure more than one business added Atlassian cloud services as a business risk in the meantime, even if they weren't affected.
GitLab seems to be eating their lunch already. It's Bitbucket plus Jenkins plus the best bits of Jira, and its starting price is free... I like Confluence a lot, but separate design/documentation tools with their files pushed up to the repo are good enough: LaTeX, Doxygen, or PowerPoint, whatever you want.
I still don't get why they didn't separate clients on a database level. Sure, put many clients on one database server to save resources. But why not use different databases? They cost nothing and provide perfect separation. It also drastically lowers the attack surface as you can set all permissions via database software. And if they had done that, this would've never been a multi-day outage.
If Jira was a product used by individuals I'd get it. Maybe a database is overkill for a sole developer. But pretty much all users of Jira are companies with tens or hundreds of users on average. I don't see how separating on a db level is overkill in that situation.
Using separate databases or schemas per tenant comes with the following problems
* Managing schema migrations across every DB
* You can't query across the DBs; want to know some cross-tenant thing for ops? That's now a lot harder
* Connection pooling and resource usage can be harder to manage
Most systems I've worked on use a single DB with a `tenant_id` col on every relevant table; it's easy to have your query builder slap in the auth'd tenant id. This approach does come with issues like saving and restoring an individual tenant's data
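A toy sketch of that shared-database pattern, where a thin query helper always injects the authenticated tenant's id so callers can't forget the filter; the names and schema are illustrative, not from any particular product:

```python
"""Toy sketch of the shared-database approach: every table carries a tenant_id
column and a thin query helper always injects the authenticated tenant's id,
so application code cannot forget the filter. Names are illustrative."""
import sqlite3

class TenantScopedDB:
    def __init__(self, conn: sqlite3.Connection, tenant_id: int):
        self.conn = conn
        self.tenant_id = tenant_id   # taken from the authenticated session

    def select(self, table: str, where: str = "1=1", params: tuple = ()):
        # The tenant filter is always ANDed on; callers never pass it themselves.
        sql = f"SELECT * FROM {table} WHERE tenant_id = ? AND ({where})"
        return self.conn.execute(sql, (self.tenant_id, *params)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER, tenant_id INTEGER, title TEXT)")
conn.executemany("INSERT INTO issues VALUES (?, ?, ?)",
                 [(1, 7, "ours"), (2, 8, "someone else's")])

db = TenantScopedDB(conn, tenant_id=7)
print(db.select("issues"))   # only tenant 7's rows come back
```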
> why not use different databases? They cost nothing and provide perfect separation.
I understand the sentiment, but this is a pretty simplistic take that I very much doubt will hold true for meaningful traffic. Many databases have licensing considerations that aren't amenable to it. Beyond that you get into density and resource problems as simple as IO, processes, threads, etc. But most of all there's the time and effort burden in supporting migrations, schema updates, etc.
Yes, layered logical separation is a really good idea. It's also really expensive once you start dealing with organic growth and a meaningful number of discrete customers.
Disclaimer: Principal at AWS who has helped build and run services with both multi-tenant and single-tenant architectures.
Don't you usually license based on server resources? Or do you really have to pay per database/schema? At least on-prem licenses tend to be based on resource usage, not on the number of databases or schemas. I'm not talking about different DB processes, just databases/schemas within a database.
And for migrations and schema updates I'd see this as a huge advantage. Migrating customers one by one is much easier than everyone at once. You also never have the issue that operations at one customer could cause a global lock affecting other customers.
Of course resource sharing isn't easy in this scenario, but you'd never want to connect data between customers anyway so I don't see the issue with that.
But maybe it's harder in a cloud environment where more is abstracted away.
Ah, when you said "database" I assumed you meant a dedicated single-tenant instance of an RDBMS (or similar), and not necessarily something like dedicated tables. I will admit to being a decade out of touch with the vagaries of "processor", server, and client access licensing. In my relevant past I've only worried about (RDS/EMR/Redshift/etc) instances and tables.
Very fair call out on having more granular, discrete, instances for things like DML/schema updates and expensive queries. I love fault isolation and have had many sad days oncall when we exceeded the capabilities of The Database.
I wouldn't say it's harder because it's more abstract. I think the general motivation is to desperately avoid anything that scales cost/effort with the number of users. Even if it's sublinear, a team can really drown under the cost of scaling up a service. And that's a serious consideration when a baseline expectation is to go from 0 to 10,000 or 50,000 active customers in just a few years. The care and feeding of (for example) 10 multi-tenant partitions is just simpler than having to monitor & operate 10,000 independent databases with wildly divergent usage profiles. I will grant this hyper-growth is not a common scenario for the industry, or if it is, then it's "one of them good problems."
I'd also say I have worked on a project that did have independent data tables for each customer instance. And we spent a meaningful amount of time abstracting away table creation/migration/etc, building a common DAL that abstracted away the multitude of tables, common monitoring, and so on. It has made some things around data migration & management easier, but I honestly don't know if it's more efficient than multi-tenant clusters in the long term. The only way the economics and operational effort have worked is by going "all in" on "serverless" technologies that efficiently scale to zero and have no carrying cost when idle.
Is it standard for a RFP to have a long list of questions like this? I've never been involved in an RFP from either side.
Is it standard (in addition or instead) to have something more general/forward-looking like: how do you watch other providers' postmortems and apply the lessons to your own system?
> - Can you restore data for a single customer, and if so, what is the RTO for that operation?
If I were to aim something at this specifically, it'd be: can you restore data for N customers or N% of customers, and if so, what is the RTO for that operation?
I mentioned in another comment that Gmail had a similar outage in which they had to restore from tape. https://news.ycombinator.com/item?id=31017160 They had a tool for restoring a single account but not for restoring N accounts in bulk, which would be significantly more efficient than doing the one-account process N times. (E.g., in the case of tape backups, imagine the difference between pulling data from the tape library sequentially for each user vs all N at once, particularly when one tape may hold data for many of these customers.)
Yes, pages of them. Multiple pages of security questions, ciphers used, how data is stored, when is it encrypted, etc. I filled out a 20 pager once. As the company got better and more mature, we had a bunch of canned answers to make it easier and faster....
Entire (excellent) start-ups exist to fill the role of 'RFP library' so that you don't have the whole sales team rewriting the same answers 100 times a year. Loopio saved me hours in the last role I was in that had them - even if you do have to edit some of the responses from colleagues you're not sure passed 9th grade English.
Any other startups you can recommend? I'm filling in my first RFP in a decade and answering what they mean to ask with the questions rather than answering questions literally is not something that comes easily to me.
Plus coming up with an answer to the vague question on "describe your project methodology" (I build what you want, it works - nope, they expect half a page). Or the 3 questions on project management systems and communication software choices that to my reading should have the same answer.
It's very common to have several pages of questions like this with particular customers, and though they may often come in to sales and marketing people, they'll contain highly technical or operational questions relating to a variety of things such as security algorithms, programming languages (type safety etc) all the way to disaster recovery.
Regarding bulk restore, a big customer doesn't care if you can restore all of your customers' data, they care if you can restore _their_ data, and fast, hence the question of "can you restore data for a single customer?".
> Regarding bulk restore, a big customer doesn't care if you can restore all of your customers' data, they care if you can restore _their_ data, and fast, hence the question of "can you restore data for a single customer?"
This outage should convince them to care. The problem isn't that Atlassian can't restore a single customer—there are people reporting that they've been restored. [1] It's that Atlassian can't restore 400 customers efficiently. So unless the RfP also has a question "will I be first on the list?" and the answer is yes, single customer restore is the wrong scenario.
[1] https://news.ycombinator.com/item?id=31023163 says "I was down, my instance is fully restored right now. ... Not sure what to tell you other than they are fixing life saving companies first, then the rest. That is what they have told us."
I work on a couple of reasonably small products but generally government customers have a long list of questions (mostly but not entirely about security).
I did. It was a first-principles architectural decision. A client could request any point in time within the contracted period, and it could be either a restoration or a fully operational, parallel instance of the account.
It was initially a cover-my-own-ass design, but it turned out to be an extremely popular feature that was never even used for disaster recovery. Instead, it was used for audit support, trial scenarios, projections, and all kinds of other stuff.
I wouldn’t expect them to advertise such a thing, but the question is “can they recover from their own mistakes” not “can they recover from mine.” I don’t care if this is with an “account-level restore” or whatever; it shouldn’t be my concern.
I’ve seen customer and resource level restores deprioritized more than once and the only hypothetical given serious thought is avoiding helping customers who accidentally deleted something because of the support burden/cost. No one seems to have much concern for what happens when they’re the ones that screwed up.
I know plenty of places (small/med startups) with "undelete" and "restore" account/data options built into their admin panels. Engineers shouldn't be flipping bits by hand, under duress.
I really wonder what these Atlassian restore tools look like if it takes "hundreds of engineers across the company" to restore 400 accounts. Are backups siloed across many teams?
>Which SaaS platforms provide account-level restores?
We restore deleted accounts on request sometimes. There was a client, for example, who forgot to renew the subscription and did nothing for 30 days, so their account was automatically deleted. We restored it from backups. It helps that every tenant has their own isolated database, so it's mostly a matter of restoring that one single DB. Some microservices store data without DB-level sharding, so we have a script which is able to make a partial dump for a specific account.
There's also a popular option to restore deleted data - nothing is ever hard-deleted (it's marked deleted but stays in the DB) and we have a script which can restore individual records (and related records). There's maybe 5 such requests per month.
We don't offer rolling everything back to a specific point in time, though. Technically it's possible by undoing the event queue but it's untested.
We also have a script to migrate customers from cloud to on-premises and back.
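For anyone curious what the soft-delete variant described above can look like, here is a minimal sketch: rows carry a deleted_at marker instead of being removed, and "restore" is just clearing that marker on a record and its related records. The schema is invented for illustration:

```python
"""Minimal sketch of the soft-delete pattern: rows are marked with a
deleted_at timestamp instead of being removed, so 'restore' is just clearing
the flag on a record and its related records. Invented schema."""
import sqlite3

def undelete_issue(conn: sqlite3.Connection, issue_id: int) -> None:
    with conn:
        # Bring back the record itself...
        conn.execute("UPDATE issues SET deleted_at = NULL WHERE id = ?", (issue_id,))
        # ...and its related records (comments, attachments, and so on).
        conn.execute("UPDATE comments SET deleted_at = NULL WHERE issue_id = ?", (issue_id,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER, deleted_at TEXT)")
conn.execute("CREATE TABLE comments (id INTEGER, issue_id INTEGER, deleted_at TEXT)")
conn.execute("INSERT INTO issues VALUES (1, '2022-04-05')")
conn.execute("INSERT INTO comments VALUES (10, 1, '2022-04-05')")
undelete_issue(conn, 1)   # both rows are live again
```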
I actually did this once with Dropbox, though it wasn't a feature they actually published. I clobbered my Dropbox directory accidentally, but I was able to find a script someone wrote to roll it back to a previous point in time and it worked quite well. After that I also took my own snapshots just in case.
My engineering team runs off GitHub Issues > Projects > Project Boards, but it's been hard to get anyone outside of product managers and the devs themselves to be fully immersed in it. Sales and management throw their hands up in the air and say 'GitHub Issues is too hard'. I'm a PM, so I've fought and won this battle to stay on GitHub Issues because I could quantify that a 5-10% loss in dev productivity from moving to a 'management friendly' tool would equal so many thousands of dollars of developer hours per month.
The fact that it's been so long and they still haven't revealed and explained the root cause of the outage is going to make it hard to regain trust in their buggy, slow tools. The bright side of the incident is that competitors that somewhat care about users have a unique opportunity to stand out.
> Faulty script. Second, the script we used provided both the "mark for deletion" capability used in normal day-to-day operations (where recoverability is desirable), and the "permanently delete" capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.
Ouch. I hope no one person got the blame. This is a systemic failure. Regardless, my regards to the engineers involved.
I don't want to assume too much, since the details are sparse. But I know for a fact that few of my current coworkers know a thing about writing tooling code. It's becoming a bit of a lost art.
Here's the way such a script should be done. You have a dry-run flag. Or, better yet, make the script dry-run only. What this script does is it checks the database, gathers actions, and then sends those actions to stdout. You dump this to a file. These commands are executable. They can be SQL, or additional shell scripts (e.g. "delete-recoverable <customer-id>" vs. "delete-permanent <customer-id>").
The idea is you now have something to verify. You can scan it for errors. You can even put it up on GitHub for review by stakeholders. You double/triple check the output and then you execute it.
Tooling that enhances visibility by breaking down changes into verifiable commands is incredibly powerful. Making these tools idempotent is also an art form, and important.
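A bare-bones sketch of that plan-then-execute style: the tool only ever emits reviewable commands on stdout and never touches the database itself. The command names (delete-recoverable / delete-permanent) follow the hypothetical ones above, and the request shape is invented:

```python
#!/usr/bin/env python3
"""Bare-bones sketch of a dry-run-only maintenance tool: it gathers the work,
prints one reviewable command per action, and never executes anything itself.
The command names and request shape are hypothetical."""
import sys

def plan_deletions(requests: list[dict]) -> list[str]:
    commands = []
    for req in requests:
        if req["mode"] == "permanent":
            commands.append(f"delete-permanent {req['site_id']}")
        else:
            commands.append(f"delete-recoverable {req['site_id']}")
    return commands

if __name__ == "__main__":
    # In real life this would come from the database; inlined here for brevity.
    requests = [
        {"site_id": "app-1234", "mode": "recoverable"},
        {"site_id": "app-5678", "mode": "recoverable"},
    ]
    for cmd in plan_deletions(requests):
        print(cmd)            # dump to a file, review it (or PR it), then run the file
    print(f"# planned {len(requests)} actions, executed 0", file=sys.stderr)
```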
That’s how I did one of my more impactful deduplication/deletion scripts. It had to reach across environments to do its work. But there was no way to send any flags to it to do stuff. The environment names were hard coded, so like dev-uw2 reaching out to stg-ue1. It would output a dry run result by default. And you could look and see what was going to get deleted and from what environment.
Because the names were hard coded, I had to get changes approved in GitHub. Then the script would run on Jenkins.
That script was also only for that purpose and nothing else. It made a mess because I needed a ton of functionality around creation and querying, too. I just copied the script to folders and modified them as needed but a better solution would’ve been to make a python module. I just liked the code itself being highly specific to what the script was doing to help reduce mistakes. If I’m running a script to delete repos, I need to go to the delete-repos directory.
If coding is theatrical then ops is operatic. You have to telegraph stuff so over the top that the people in the cheap seats know what’s going on.
I think what we’ve lost in the post-XP world is that just because you build something incrementally doesn’t mean it’s designed incrementally (read: myopically).
My idiot coworkers are “fixing” redundancy issues by adding caching, which recreates the same problem they’re (un?)knowingly trying to avoid, which is having to iterate over things twice to accomplish anything. They’ve just moved the conditional branches to the cache and added more.
Most of the time, and especially on a concurrent system, you are better off building a plan of action first and then executing it second. You can dedupe while assembling the plan (dynamic programming) and you don’t have to worry about weird eviction issues dropping you into a logic problem like an infinite loop.
More importantly, you can build the plan and then explain the plan. You can explain the plan without running it. You can abort the plan in the middle when you realize you’ve clicked the wrong button. And you can clean up on abort because the plan is not twelve levels deep in a recursive call, where trying to clean up will have bugs you don’t see in a Dev sandbox.
Deleting 500 users…
Versus
Permanently deleting 500 users…
Maybe with a nice 10 second pause (what’s an extra ten seconds for a task that takes five minutes?)
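Something as small as the sketch below would do it; the wording, the counts, and the 10-second window are of course just one possible choice, and the function is invented for illustration:

```python
"""Tiny sketch of that idea: the dangerous mode announces itself loudly and
gives the operator a last chance to abort. Wording, counts, and the
10-second window are illustrative."""
import time

def confirm_delete(user_count: int, permanent: bool) -> bool:
    if not permanent:
        print(f"Deleting {user_count} users (recoverable for 30 days)...")
        return True
    print(f"PERMANENTLY deleting {user_count} users. This cannot be undone.")
    print("Press Ctrl-C within 10 seconds to abort.")
    time.sleep(10)   # a cheap pause on a five-minute task
    return input("Type 'permanently delete' to continue: ") == "permanently delete"
```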
I will then write a script that calls your script with the PRNG of my choice: PRNG1 always returns "trigger 2", and PRNG2 always returns "trigger 1". This detail will be documented in Confluence.
Considering American police can't even seem to get it right when they have two distinct firearms, and are trained to holster them on specific sides so they know what they are grabbing - and still manage to f*ck it up....this might be an improvement.
This speaks to a lack of operational excellence - when you develop a platform like JIRA, Confluence, etc, the operational tools required to manage the systems are just as important as the features themselves. If all you do is pump out features, you're a feature factory and will suffer these kinds of issues. There's no reasonable explanation for needing a script to do what was described when the necessary tooling to generalize such an operation should have been in existence.
Right? The way this reads it seems like one person set a flag incorrectly, something I'm sure we've all done numerous times. And there were no checks down the line to catch it.
Hi, this is Mike from Atlassian Engineering. You are right that the checks need to improve to reduce human error, but that's only half of it. I don't see this as human error though. It's a system error. We will be doing some work to make these kind of hard deletes impossible in our system.
> Communication gap. First, there was a communication gap between the team that requested the deactivation and the team that ran the deactivation. Instead of providing the IDs of the intended app being marked for deactivation, the team provided the IDs of the entire cloud site where the apps were to be deactivated.
So what they are saying is that they are not testing scripts at some staging server before running them in production. It's wild that they've managed to scale their products so much before something like this happened.
I hope they've learnt their lesson and they set up some QA process for that stuff.
It seems that it worked as intended, so they do have a QA process. The problem was the wrong IDs being provided, and I doubt that at their scale they have a staging environment that duplicates the customer data.
> I doubt that at their scale they have a staging environment that duplicates the customer data.
If there is no feasible way of replicating their production environment somewhere else, then there should be some sanity checks in place. Something like "if an abnormally high number of customer sites go down during the script's execution, kill the script". This is a 20/20 hindsight approach though, and if Atlassian engineers can't solve it, I doubt a random HN user like me can.
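In code, that hindsight guardrail might look something like the sketch below: do the work in small batches and bail out if a health probe shows customer sites going dark faster than the task should ever cause. The probe, batch size, and threshold are all assumptions:

```python
"""Sketch of that sanity check: work in small batches and abort if a health
probe shows customer sites going dark faster than this task should ever
cause. The probe, batch size, and threshold are assumptions."""
import time

MAX_NEW_SITES_DOWN = 5      # assumed guardrail for one maintenance run
BATCH_SIZE = 10

def run_bulk_deactivation(site_ids, deactivate_one, count_sites_down):
    baseline = count_sites_down()
    for i, site_id in enumerate(site_ids, start=1):
        deactivate_one(site_id)
        if i % BATCH_SIZE == 0:          # recheck health after every small batch
            time.sleep(1)                # give monitoring a moment to catch up
            if count_sites_down() - baseline > MAX_NEW_SITES_DOWN:
                raise RuntimeError(
                    "Aborting: customer sites are going down faster than this "
                    "task should ever cause. Page a human before continuing."
                )
```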
Would it be bad practice to append values to a GUID type of ID that would help a human recognize them? For instance, in this specific case they could have had app IDs like APP-XXXXX-XXXX-blahblah and site IDs like CLOUD-XXXXX-blahblah.
I'm not looking to help their specific problems, but this is more from a general question I've thought of doing but never have done just because I'm sure I'd get laughed at for blazing my own trail
While we don't do exactly that, when pulling out lists of IDs like that for someone else, internal or external, we strive to include a description column as well.
This might be customer id and customer name, article number and article description, invoice id and invoice number etc.
Then it is usually very clear to the recipient what they've been handed.
Also, for internal autoinc-type IDs, we mostly use sequence generators with non-overlapping "series". That is, we'll start the first one at 1 million, the second at 2 million, or similar. Not perfect, but it can be useful.
This is recommended in my experience, but you do have some potential issues when a UUID gets reused or repurposed.
WHENEVER a human is involved in the chain, UUIDs can be suspicious because there's no easy way to verify what it is, whereas a human has a good chance of realizing that $1,342.34 is probably not a valid date.
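To make the typed-ID idea from the question above concrete, here is a small sketch (prefix names are made up): the prefix tells a human, and the script, what kind of thing an ID refers to, so a list of app IDs can't silently be used where site IDs were expected.

```python
"""Sketch of typed, prefixed IDs: the wrong kind of ID fails loudly instead
of being silently fed to a deactivation script. Prefix names are made up."""
import uuid

def new_id(kind: str) -> str:
    return f"{kind.upper()}-{uuid.uuid4()}"      # e.g. "APP-5e3f...", "SITE-9c41..."

def require_kind(identifier: str, kind: str) -> str:
    if not identifier.startswith(kind.upper() + "-"):
        raise ValueError(f"Expected a {kind} ID, got: {identifier}")
    return identifier

app_ids = [new_id("app"), new_id("app")]
site_id = new_id("site")

require_kind(app_ids[0], "app")                  # fine
try:
    require_kind(site_id, "app")                 # the wrong list of IDs fails loudly
except ValueError as err:
    print(err)
```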
What's a good Jira replacement? Redmine? Phabricator? OpenProject? Just leaving the jira server alone and hoping there's no new and exciting zero-days? One thing is clear, these guys are a bunch of cowboys who can't be trusted with any amount of data.
Linear has offered free services to users impacted by Atlassian's outage through the end of the year. I took a look at it (we aren't impacted) and noticed it can import tickets from Jira, and it also has a "Jira Link" where you can use Linear as a kind of front end to Jira if you aren't ready to go all in on Linear.
When we chose Jira, one of the points that was made was: If we decide to leave Jira, there will almost certainly be an importer from Jira to the new system. Which does seem to be true. We came to Jira from Fogbugz, and I spent the better part of a month writing tools to import our tickets and wikis. Jira had a Fogbugz importer, but it was horribly broken.
Looking at Linear, there is no such escape hatch, or indeed, searching the docs I saw no "export" or "backup" capability at all.
Thanks for the reply. Linear looks pretty slick, I'll probably give it a try with the Jira Link, and get some experience with it without having to do a whole conversion plus get buyin from the rest of the team. We weren't impacted by the Atlassian outage, but Linear does seem to have a pretty compelling feature-set.
Linear is phenomenal. Probably built for a different audience than Jira (it's like Superhuman for tickets), but if you want something that works well and is opinionated I highly highly recommend it.
I've used Request Tracker for years. It's not pretty, it's written in Perl, but I can fairly easily make it do all the ticket tracking flows I care about and it just runs and runs and runs. My scale is admittedly small, but I put tens of thousands of tickets per year through my instance, and I basically never have to touch it unless I'm setting up a new queue or different flow for something.
Wow, I’ve never seen anyone mention RT here. I used it for years when I was working IT for my university while in undergrad. It worked pretty well. It didn’t have a lot of features but it allowed clients/customers to respond to tickets via email which was pretty cool at the time (late 00s). It also ran pretty fast on the terrible servers we had it on.
We still run it today; they had a major release last year, I think. Its key feature is that it remains email-first. Customers never interface with the website, for them it's all just like they're emailing a human, with some extra tooling and tracking on top.
I've used Linear and Shortcut (formerly Clubhouse). I was a huge Shortcut proponent, but there were a couple of concepts that weren't fully fleshed out.
Linear has none of these issues. I've been super impressed with it.
We switched from JIRA to Shortcut https://shortcut.com/ (formerly Clubhouse), and I'd highly recommend them. It's much better than JIRA ever was, both from a UX perspective and an implementation/performance perspective.
For pure engineering teams it’s either Gitlab or Azure Devops. Those are the most common competitors I hear about. If you have non-engineers the choice gets trickier.
Wow would you look at that, a complete Atlassian puff piece got published in the WSJ just hours ago.
How peculiar that the biggest active outage in the history of this company is not mentioned once in this "article".
I'm left to assume that PR teams can plant whatever they see fit in the WSJ at a moment's notice. I guess that's what passes for journalism these days.
> Whatever else changes for CIOs, troubleshooting may never be far behind. Ms. Rao’s remarks came as Atlassian races to restore cloud-based software applications to roughly 400 companies, after a service outage last week caused by a routine maintenance glitch.
> As of Wednesday, she said, services were back online for just under half of the companies hit by the outage, which may take up to two weeks to fully repair.
SLI: Some metric you use to measure a thing (e.g. uptime, latency, etc.)
SLO: Some objective you try to hit, as measured by the SLI (e.g. "99.99% of requests are processed within 3 seconds")
SLA: A promise to a customer that they will meet some SLO, and consequences if they don't. If there aren't consequences for not meeting the SLO, then measuring and tracking the metrics is a pointless exercise.
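A tiny sketch of how those fit together, using the latency SLO quoted above and made-up request data (in practice the SLI comes from monitoring, not a hardcoded list):

    # Made-up request durations, measured in seconds.
    request_durations_s = [0.2, 0.4, 1.1, 0.3, 4.2, 0.5, 0.9]

    SLO_THRESHOLD_S = 3.0     # "processed within 3 seconds"
    SLO_TARGET = 0.9999       # "99.99% of requests"

    good = sum(1 for d in request_durations_s if d <= SLO_THRESHOLD_S)
    sli = good / len(request_durations_s)   # the SLI: fraction of good requests

    print(f"SLI = {sli:.4%}, SLO target = {SLO_TARGET:.2%}")
    if sli < SLO_TARGET:
        # In an SLA, this is where the consequences (credits, refunds) kick in.
        print("SLO missed")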
The SLA is "real" to the extent Atlassian is adhering to any listed consequences.
Most SLAs say "if we miss this, you get time for free" which means that these companies will hopefully get a refund ... for the time they can't use the service.
Car warranties are also aspirational/virtue signaling, to a point.
If the maintenance costs exceed the margins on the cars, you lose money. Do that on too many product lines too often and you're looking at bankruptcy. But some makers clearly are more risk averse than others, so a 6 year warranty from maker X does not translate to a 7 year warranty from maker Y.
But Atlassian's (published*) SLA offers a credit of at most 50% of the month, which is not really the same as a manufacturer warranty on a car, where the cost of servicing could easily exceed the price paid for the car.
* - their larger customers will have negotiated SLAs.
edit: to be clear, I expect Atlassian will offer concessions beyond their SLA obligations. I'm only responding to the comparison.
And these consequences usually just amount to getting some percentage of your service fees back. I'm sure the affected customers will get their entire monthly Atlassian Cloud fees back. Since this is so severe maybe Atlassian will even give them credits for some # of free months.
But there's no way the amount they'll get from Atlassian is going to come close to what they're losing in productivity by not having access to Jira & Confluence. At my company, getting an entire free year of Jira wouldn't be worth Jira being inaccessible for a week.
Does that indicate it would be preferable to pay more for a more reliable solution, if such a thing were to exist? Although, it definitely would be hard to quantify 'more reliable' there.
A typical SLA precludes that by specifying the remedy for noncompliance with the performance measure. Only if they fail to apply the remedy is there a material breach. For a month-to-month SLA, this limits liability to one month's subscription, as agreed in black-and-white.
Customers that demand service level agreements often fail to recognise that they cut both ways.
Tommy: Here's the way I see it, Ted. Guy puts a fancy guarantee on a box 'cause he wants you to feel all warm and toasty inside.
Ted Nelson: Yeah, makes a man feel good.
Ted Nelson: But why do they put a guarantee on the box?
Tommy: Because they know all they sold ya was a guaranteed piece of shit. That's all it is, isn't it? Hey, if you want me to take a dump in a box and mark it guaranteed, I will.
The typical SLA has no teeth because even if the customer gets their money back, the real harm to the customer may be orders of magnitude greater than what they paid for the service. Some services are contractual or tightly embedded and you know you're not gonna lose the customer if your service goes down frequently. If the service provider doesn't lose money or face, they aren't motivated to prevent the downtime.
One alternative I thought of is the Charity SLA. The service provider pledges to give $5,000 to charity for every minute of downtime. Now everyone within the company knows "if we're down, we're losing thousands of dollars a minute!" and thus will be motivated to ensure the services stay up. But even if the services go down, the company's making tax-free donations, which isn't really bad for anybody. The company could even have a specific downtime goal every year, to make sure their monitoring/alerting/runbooks actually work, and to ensure they donate every year.
Hi, this is Mike from Atlassian Engineering. For the customers impacted by this incident covered by an SLA, we will adhere to our contractual terms. However, given the long duration of this outage, we are planning to go above and beyond for our impacted customers. We are currently focused on restoring service, but after that will be discussing how we can make it right for each impacted customer.
Lawyers are involved, so I'd assume some text about "excluding acts of god, sabotage, etc." to weasel their way out of things. They might even be able to get away with "acts of incompetence", however a lawyer might phrase that to allow their client to weasel.
SLA credits are a thing that actually happen in the industry. I wouldn't automatically assume that they will be able to weasel out of it.
They are typically limited to the amount that you actually paid, though, so basically they don't charge you for the time when you couldn't use the product. You usually won't get more than that.
That's a good way to get executive approval to replace a system. Google or Apple can get away with this kind of behavior, I doubt Atlassian can.
This outage alone has spurred conversations in slack about how terrible JIRA is and why we should replace it. If this kind of shit was pulled, I can guarantee we'd be on shortcut, linear, or something else in short order.
> Google or Apple can get away with this kind of behavior, I doubt Atlassian can
Atlassian absolutely can in enterprise settings. In my company (a large cloud company), if JIRA goes down, large swathes of the business will also stall, including code deployment (deployments are tracked through change management JIRA tickets). We also use the DC version of Atlassian products, so presumably we aren't at the mercy of Atlassian cloud engineers.
In some industries, three nines isn't exactly stellar. Every service I've worked on recently has demanded five nines of uptime and tons of reporting on latency and even seconds-long outages.
I've been on-call during a total infrastructure outage whose root cause was a service my team owned [1]. Our CEO was aware of it. Customers and business partners were aware of it. Other CEOs were aware of it. The media, you name it.
Some outages can be "business ending" or "business damaging". That's why we made a practice and process of performing regular disaster recovery exercises, had exceptionally well documented runbooks, had monitoring attached to everything, and engineered for resilience.
Though I'm not familiar with how Atlassian runs, I think this is an "engineering culture" thing or can be mitigated with a proper approach.
[1] The company has only had a few of these in total, and no member of our team was culpable for the complicated failure.
I think of SLAs as an input to how we design the thing. Ask for a system without an SLA and I will give you a system that is well designed and almost never goes down. As soon as you ask for an SLA, I will give you an over-engineered system that costs more, takes longer to implement, and is slower to iterate, but it will almost never go down either.
Per the article, if you experience < 95% uptime in any 30 day window you qualify for a 50% discount. On a month or your next year or ... ? it doesn't say.
we need a better, more human default way to communicate SLOs than "number of 9s". how the status quo has stayed this way can only be attributed to intentional dark patterns, imho.
… honestly, even the "number of 9s" concept is a struggle for some companies. I've seen a number of SLAs that fail to correctly state a unit: it's %/<unit of time>, and I see the "unit of time" get dropped every now and then, and the resulting thing is meaningless absurdity.
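A quick illustration (just arithmetic, in Python) of why dropping the time unit makes the figure meaningless: the same number of 9s allows very different amounts of downtime depending on the window it's measured over.

    WINDOWS_HOURS = {"per day": 24, "per 30 days": 24 * 30, "per year": 24 * 365}

    def allowed_downtime_minutes(availability: float, window_hours: float) -> float:
        return (1 - availability) * window_hours * 60

    for label, availability in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
        budgets = ", ".join(
            f"{name}: {allowed_downtime_minutes(availability, hours):.1f} min"
            for name, hours in WINDOWS_HOURS.items()
        )
        print(f"{label}: {budgets}")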
I was just thinking that there's a hysteresis function here: the service is worth much more to your team after you've wired your whole process into it than before you joined.
Offering you a free month or whatever doesn't acknowledge all the person-hours lost.
There are certainly circumstances where you might have grounds to sue for damages if an SLA is breached. I'm not sure how often this happens but the losses from something like Jira being down could be quite a lot more than anybody pays for it. It's quite likely that defenses against exactly this are written into the contracts you agree to signing up for the service though.
I've yet to work at an office that paid sufficient attention to regular backup & restore validation, to scalable design, or proper unit testing, or to basic security updates. Upper management is repeatedly incentivized to produce vaporware, not reliable service.
Suits think a crummy Flash quiz on PII is enough to stop leaks. The automotive industry couldn't stop airbags from acting as claymores. It's even harder to get good code approved in tech.
"Shit happens" is a universal when it comes to computing. SLAs describe what is a normal background level of shit happening vs. what demands immediate attention and action from the team.
The lesson learned is that outsourcing at the level of containers or machines and raw compute in the cloud is one thing. It's a pretty fungible open market.
But outsourcing one's whole engineering environment to a SaaS on a cloud is just freakin lunacy. Not only do you have things like this outage, but what about simple things like features and versions of the apps changing all the time with no ability to control that. What if they remove or change a feature you use?
And expensive vendor-locked-in closed tools have no place in a modern software workflow anyway, on-prem let alone SaaS. Look at the rug-pull for the on-prem Atlassian Server product.
>"The outage is its 9th day, having started on Monday, 4th of April."
>"It took until Day 9 for executives at the company to acknowledge the outage."
Just to put this in perspective. These executives would have left on a Friday afternoon to start their weekends without bothering to publicly address an ongoing outage that was by then 5 days old.
This is mind boggling. Like did some C-level exec say something like "Let's just park this whole outage communication discussion until Monday, have a good weekend everyone."?
>Most of them said they won’t leave the Atlassian stack, as long as they don’t lose data. This is because moving is complex and they don’t see a move would mitigate a risk of a cloud provider going down.
I still don't understand the stranglehold JIRA has on some clients. I can't quickly think of another SaaS product that could be down for almost 2 weeks and not have most customers leave.
A lot of companies have integrations to atlassian suite which might not be easy to shift from.
Secondly, there are a lot of individual competitors to Jira, Confluence and Bitbucket but which competitor can offer all three under a single invoice? May be Microsoft, can't think of anyone else.
Also for such an extended downtime the customers are entitled to a discount or a credit note which a lot of CXOs consider in their decision making.
We are in a similar place with Slack. We moved from HipChat to Slack and that was painful enough. Then the company noticed we get Teams for "free" and they tried to push us over to it. But folks have so much automation (because "ChatOps" is that new new) that is pushing things into Slack the company eventually gave up.
It's been a self-hosted product for over a decade in the form of Visual SourceSafe and then TFS (wonky TFVC notwithstanding; Git support was added a while ago as well), now living on as Azure DevOps Server.
visual studio online is what it was called internally, the marketing may have changed. It's okay, and is what was/probably still is used at MS internally to develop windows.
>I still don't understand the stranglehold JIRA has on some clients.
- Integrations with things like the source code repos, incident management systems, confluence or other wikis, Slack, etc. Moving away from Jira creates a bunch of dead links.
- Internal dependence on complex workflows and state transition rules that are implemented in Jira.
- Various very customized reports that leaders depend on to make decisions, despite the often dubious value and/or accuracy.
When we migrated away from JIRA, we scripted it such that the JIRA issue numbers were recorded in the newly migrated issues exactly because of things like this.
Having migrated bug systems for very large, very old code bases before, it's pretty easy to make the URLs and links like this still go to the right place.
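A minimal sketch of what that can look like (issue keys and URLs are hypothetical): the migration records the old JIRA key on each new issue, and a trivial redirect maps old links onto the new tracker.

    # Hypothetical mapping built by the migration script: old JIRA key -> new URL.
    OLD_TO_NEW = {
        "PROJ-123": "https://tracker.example.com/issues/4567",
        "PROJ-124": "https://tracker.example.com/issues/4568",
    }

    def redirect_target(old_jira_key: str) -> str:
        # Fall back to a search page so even unmapped keys land somewhere useful.
        return OLD_TO_NEW.get(
            old_jira_key,
            f"https://tracker.example.com/search?q={old_jira_key}",
        )

    print(redirect_target("PROJ-123"))  # old links keep resolving after migration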
This is actually the least difficult thing, i would say ;)
If they don't lose data, two weeks of downtime every few years might be cheaper than the cost of switching. Plus, it's not like you know the thing you switch to will be any better, if it's another SaaS.
Let's say we have an announced release scheduled for May 1st.
With the tools down, there is no way to meet that date. For a 4 billion dollar company, this can make a huge difference in revenue. For a public company, the stock will definitely drop when it's announced the revenue goals were missed because the tools were down.
For companies of size, the cost of tools being down for 3 weeks can easily be in the multi-millions of dollars.
Again, part of the trouble is it's hard to gain enough certainty that the thing you switch to—self-hosted, or another service—won't be at least as bad. You can look at their past record, but then, when's the last time Atlassian had this happen? (or maybe they've been having similar issues every year or two and I've just not noticed, in which case, yeah, it's probably a safe bet that switching to almost anything else would be an improvement)
Atlassian sells to execs and gives kickbacks. You don't want to burn the company that gave you money and that you pushed through although you knew they sucked.
Even if they don't, I imagine they will have conversations internally to see what's feasible. It's just really difficult for an organization to move away from a product that everyone has learnt how to use. The company I work for is struggling to move away from something as simple as a collaborative editor, when I feel like I find no difference between the two products.
A few years ago we didn't renew our subscription on time because we got the email over Christmas break, and iirc they deleted all of our data in less than two weeks. They were eventually able to manually restore it from backups, but they restored it incorrectly so there was a bunch of stuff broken. This whole thing isn't even remotely surprising to me.
You can sleep soundly: it seems like they back _everything_ up:
> Second, the script we used provided both the "mark for deletion" capability ... (where recoverability is desirable), and the "permanently delete" capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.
> To recover from this incident, our global engineering team has implemented a methodical process for restoring our impacted customers.
Anyone else find it disturbing that they are able to restore data that they deleted permanently for "compliance" reasons? If this is true, how were they ever compliant? I guess data is only permanently deleted when the engineering team is following their typical, non-methodical process...
No, I don't think that's disturbing. That's the point of backups - even when something is permanently and completely erased in the production database, it's still in the backup. Eventually it will get rotated out as the backups expire.
Going back and purging things from the backups as part of the delete process would be overdoing it to a ridiculous degree.
I think that depends on what you mean by compliance. Some regulations require you to irreversibly destroy data when they prescribe the destruction of that data.
That can mean as much as "you have to encrypt everything with a separate key, so that you can destroy the key for the given (say, personally identifiable) dataset, making it irrecoverable"
I'm not saying that's the particular compliance reason they had here, or that the analysis you're giving is wrong, either. There is an interpretation where either of these ideas could be the correct one.
"permanently delete" strongly suggests to me that it was the "medical and financial data" kind of compliance. If data can be restored, it's not permanently deleted. But this was a statement from the CEO, so words can have arbitrary meaning :)
"permanently delete" does not mean the same thing as "immediately delete". deleting from the live database is the first step of a permanent deletion, as long as the data exists somewhere the deletion process is still in-progress.
there's a whole lot of people in here who are way too quick to assume that just because one part of a permanent deletion process was inadvertently triggered and then caught while they still had backups, their whole permanent deletion process is a lie.
You seem to be right-ish. While the GDPR in certain circumstances allows you to keep backups of data that should have been deleted, it seems like they are trying to discourage it in the future.
> ...It is, however, important to note that where data put beyond use is still held it might need to be provided in response to a court order. Therefore data controllers should work towards technical solutions to prevent deletion problems recurring in the future.
A better way to do this sort of thing is not an actual "delete", but a "cryptographic delete". The data should be encrypted, and you just delete the key. The data is then unrecoverable everywhere, including backups. Of course you probably don't want to just nuke the key, but disable it for some period of time, and then nuke it.
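A minimal sketch of that crypto-delete idea, using the third-party cryptography package's Fernet as a stand-in for a real KMS (customer names and storage are made up):

    from cryptography.fernet import Fernet  # third-party package

    keys = {}  # in reality a KMS/HSM, not an in-memory dict

    def store(customer_id: str, plaintext: bytes) -> bytes:
        # Each customer gets their own key; the ciphertext can safely go to backups.
        key = keys.setdefault(customer_id, Fernet.generate_key())
        return Fernet(key).encrypt(plaintext)

    def crypto_delete(customer_id: str) -> None:
        # Ideally: disable the key first, destroy it only after a grace period.
        keys.pop(customer_id, None)

    blob = store("cust-42", b"ticket history")
    crypto_delete("cust-42")
    # Without the key, this ciphertext (and every backup of it) is unreadable.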
i don't see how that really changes anything - your keys should be backed up just as much as, if not more than your data. and any process for deleting the encryption keys should allow for restoring from backups for some period of time just the same as your process for deleting data should allow restoring from backups for some period of time. either way, permanently rendering data as unrecoverable takes time.
As an example, if you are using Amazon's KMS for key management and you destroy a key it gives you 7 days to undo before permanently destroying the key. Or you can disable they key and destroy it later as your retention policy permits. Surely they have some kind of key backup, but KMS users have no access to those backups.
Delitio> If you are only allowed to store data for x month that's it.
Exactly. I'm not aware of any laws saying "you must delete this data immediately". More like "within X days or months". The permanently delete thing presumably skips some cooling-off period in the online database but not the backup, which seems perfectly appropriate, provided your backup retention is compliant.
Google has a nice page describing out their deletion process. [1] It doesn't go into product-specific technical details/steps (like marked as deleted within the product, row deleted from Bigtable/Spanner, major compaction guaranteed to happen, backups guaranteed to be deleted or unusable) but it says this:
Google> We then begin a process designed to safely and completely delete the data from our storage systems. Safe deletion is important to protect our users and customers from accidental data loss. Complete deletion of data from our servers is equally important for users’ peace of mind. This process generally takes around 2 months from the time of deletion. This often includes up to a month-long recovery period in case the data was removed unintentionally.
This is a best practice.
Delitio> It's your job to use technics which allow you to do this like using encryption on your backup and deleting the keys for it, for example.
If they'd thrown away the encryption key immediately, this would have been much worse. Instead of "we're down for 2 weeks?!?" (already quite bad) it'd be "our data is gone forever?!?". You never want to delete anything too quickly for exactly this reason.
It's generally recognized that deleting data from a backup would violate the integrity of the backup, so allowances are made. Usually you have to make sure the data is deleted as part of the restore process. For example, from CCPA:
> If a business stores any personal information on archived or backup systems, it may delay compliance with the consumer's request to delete, with respect to data stored on the archived or backup system, until the archived or backup system relating to that data is restored to an active system or next accessed or used for a sale, disclosure, or commercial purpose.
Generally user data deletion happens in multiple phases for large companies that care about both compliance and user experience.
For example, if you delete an email or document on Google it moves to the "Trash" folder for 30 days.
When you manually empty the trash or the time window expires, most likely the next step would be a soft deletion for a few days where the data is still on hard drives but hidden from the application. Soft deletion is mainly protection against coding errors, since soft deletion is easy to undo if you've caused an incident but hard deletion (removing the data from disk) is not.
Then most likely a garbage collection process comes by a few days later and hard deletes the data from disk, leaving it only on tape backups
Finally, maybe a month or two later it disappears from the tape backups as they get rotated or otherwise disposed of
This addresses the needs of:
- Giving a good user experience (user "oops I made a mistake" undelete)
- Protecting against incidents due to coding errors (software engineer "oops I made a mistake" undelete)
- Making sure data disappears from both disk and backups within a certain time window, like maybe 30 or 60 days (comply with regulation and user expectations of data being cleared)
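A rough sketch of that lifecycle in Python; the retention windows are made up, and a real system would drive this from a scheduler rather than a pure function:

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    # Phases and (made-up) windows, in the order a deleted record moves through them.
    PHASES = [
        ("trash",        timedelta(days=30)),  # user-visible, user can undelete
        ("soft_deleted", timedelta(days=7)),   # hidden, engineers can still undelete
        ("hard_deleted", timedelta(days=60)),  # gone from disk, still on tape backups
        ("purged",       None),                # gone from backups too
    ]

    @dataclass
    class Record:
        deleted_at: datetime

    def phase(record: Record, now: datetime) -> str:
        elapsed = now - record.deleted_at
        cutoff = timedelta(0)
        for name, window in PHASES:
            if window is None or elapsed < cutoff + window:
                return name
            cutoff += window
        return "purged"

    r = Record(deleted_at=datetime(2022, 4, 1))
    print(phase(r, datetime(2022, 4, 20)))  # trash
    print(phase(r, datetime(2022, 6, 10)))  # hard_deleted, waiting for backups to expire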
I asked the same question yesterday, and the responses were food for thought.
If you make backups, you are, almost by definition, unable to perform a full 'Compliance Delete' before the oldest backup in the set has expired.
Compliance-based deletion, if it is offered as a service, is almost always something time-based, like "we guarantee the data will be deleted 7 years from now". And then that deliberate deletion step is baked into the backup process.
So, i.m.o. at best they misrepresented the nature of the compliance deletion process. It never did what it was designed to do.
> Anyone else find it disturbing that they are able to restore data that they deleted permanently for "compliance" reasons?
An overarching theme with these things is “legitimate business need” and “no indefinitely retained customer data”. Having backups, system event logs, etc. are all legitimate business needs. Based on the data type that business need may be days or years, with things like financial and legal requirements.
You're conflating permanent, immediate, and irrevocable. These are usually handled in different aspects. Think of accounts having multiple states like active, suspended, closed, terminated, purged. Some examples:
suspended: credentials/authnz immediately disabled, all data online, charges continue to accrue, can be restored in minutes.
Closed: credentials disabled, data online, processing stopped, charges stopped, may take manual intervention (hours) to return to active.
Terminated: creds & account irrevocably unavailable, online data deleted, offline data (backups) remains available.
Purged: all online and offline customer data irrevocably unavailable. This generally happens after a defined retention period for things like logs, backups, etc.
You can apply similar concepts to individual resources more granularly than the account.
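A rough sketch of those states as a transition table; the details are illustrative, not any provider's actual policy:

    # What is still available in each state, and which transitions are allowed.
    STATES = {
        "active":     {"login": True,  "data_online": True,  "backups": True},
        "suspended":  {"login": False, "data_online": True,  "backups": True},
        "closed":     {"login": False, "data_online": True,  "backups": True},
        "terminated": {"login": False, "data_online": False, "backups": True},
        "purged":     {"login": False, "data_online": False, "backups": False},
    }

    ALLOWED_TRANSITIONS = {
        "active":     {"suspended", "closed"},
        "suspended":  {"active", "closed"},
        "closed":     {"active", "terminated"},
        "terminated": {"purged"},   # only after the retention period elapses
        "purged":     set(),        # irreversible
    }

    def can_transition(src: str, dst: str) -> bool:
        return dst in ALLOWED_TRANSITIONS[src]

    print(can_transition("terminated", "active"))  # False: creds are irrevocably gone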
Disclaimer: principal at AWS but the above is my own opinion/observation and does not represent my employer.
Nope. I exported our data after they restored the backup and then we cancelled less than a month later. Like I obviously understand suspending our logins, but why would you ever delete someone's data when it's literally only 160 KB of text? The whole thing made zero sense.
After I met my now-fiancée on OkCupid, I deactivated my profile, turned off notifications and forgot about it for a while. A while later, I thought it'd be nice to revisit the first messages we sent to each other, only to find that... OkCupid had deleted both of our accounts. They didn't give me any advance warning, either, because I turned off notifications, remember? :^)
I'm still kinda salty about it. I understand why big services can't retain data indefinitely, but like... it's just a few KB of text, and that text happens to have a lot of sentimental value. Besides, OkCupid knows that I deactivated my account because I am a success story -- why not hold onto those profiles a bit longer? Or better yet, how about emailing an archive of those messages immediately when you click the "I'm leaving because I'm in a happy relationship now" button? /rant
I kind of agree - a company I used to work at used free Slack for years, then HipChat (until Atlassian killed it - with good reason), then converted to paid Slack, and all of our chat history was still there - even the old stuff that gets hidden as part of the free plan.
With GDPR, privacy regulations and data breach regulations sweeping the globe, holding onto unnecessary data is a huge liability. Getting rid of data you no longer have clear consent to store, or which you're unlikely to have a clear business need to continue storing, is a sign of a good company these days.
Not if the customer doesn't ask for it. As long as the user has a profile, was aware that PII is stored and doesn't request deletion, GDPR won't ever force you to delete information. Otherwise GMail would have to start deleting old emails as well.
That's true but it was stored there with your explicit consent. The GDPR is first and foremost concerned with data that is stored about you without your consent or with data that continues to be stored about you after your explicit request for deletion. Or incorrect data that you have requested to be removed. See the wikipedia page on the GDPR or a bunch of articles that I wrote about this subject.
If they had obtained the data without you supplying it freely then that would have been an entirely different matter, especially if it was used in ways that you did not consent to. But since that does not appear to be the case here the GDPR applies like it does to all data that is directly related to a data subject but continuing to store it on behalf of the user(s) that supplied it is not a problem.
Note that the user here is disappointed that their data which they consented to be kept is no longer there. This is a pretty clear indication that, as far as they are concerned, their expectation was that even with the GDPR up and running such data would continue to be preserved, as it is in almost every service that existed prior to May 2018.
It is precisely this kind of panicky thinking around the whole subject of the GDPR that gives these irrational responses, companies that suddenly no longer dare to mail you but you have to log in to their portal, which is secured by your email address and more of these totally weird constructs.
If they wanted to delete this data the better way would have been to positively contact the user (so that you know that they have received your message) to ask if their data should be deleted or not. That's good stewardship, just tossing it isn't.
I don't think people write code saying "if accountSize < 160kB { skipDelete() }" - THAT would make zero sense. So, the size is not relevant here. The process was likely to delete data after some event occurred, or lack of event occurred.
Such a decision is just as likely to have come from the legal/compliance team as an engineer. Data you no longer have clear consent or a legitimate business need to store is a liability, and if you operate in Europe, potentially illegal to continue storing.
It's amazing how much stupid shit we do to keep the legal guys happy while their bosses are busy engaging in <checks news headlines> tax evasion, graft, bribery, fraud, embezzlement, illegal dumping, sexual harassment, sexual assault, statutory rape, solicitation to commit murder, and my personal favorite and I'm sure yours too: human trafficking.
But sure, we can break all of our users to avoid the possibility of you having to write some legal briefs and us paying a small fine for keeping data 7 days instead of three.
Seems like that could be addressed with some fine print in the initial agreements. "In the event that you stop paying us, we may keep your data for up to N days unless directed otherwise by you"--or similar.
They have recently killed off on premise offerings, it's cloud only now. And this makes it harder to trust both the security and integrity of your data.
The fact that a single bad script could delete 400 of their customers should be absolute proof they do not have the processes in place to be a steward of your data in the cloud. On-prem or bust.
On-premise just means that your overworked IT person is going to spend 5% of their time keeping your service maintained, at no point gaining any more than baseline familiarity with the product.
On-premise isn’t a magic pill guaranteeing 100% uptime and 0 data loss.
While on-premise may be a good choice in many cases, it’s not like running on-premise business tools has no risk associated with that choice.
Remember that the goal of a company is to sell the most product possible (output) with the lowest cost possible (input).
Any Joe off the street starting their own business can pay Atlassian $0/month for up to 10 users. On-prem doesn't compete with that.
On Prem means you have control over spending. I calculated that if we moved to the cloud, we would pay YEARLY as much as we spent on Atlassian licenses in the last 5 years. That easily pays for the maintenance overhead on our devops team.
Gmail had a vaguely similar outage years ago. [1] tl;dr:
1. Different root cause. There was a bug in a refactoring of gmail's storage layer (iirc a missing asterisk caused a pointer to an important bool to be set to null, rather than setting the bool to false), which slipped through code review, automated testing, and early test servers dedicated to the team, so it got rolled out to some fraction of real users. Online data was lost/corrupted for 0.02% of users (a huge amount of email).
2. There were tape backups, but the tooling wasn't ready for a restore at scale. It was all hands on deck to get those accounts back to an acceptable state, and it took four days to get back to basically normal (iirc no lost mail, although some got bounced).
3. During the outage, some users could log in and see something frightening: an empty/incomplete mailbox, and no banner or anything telling them "we're fixing it".
4. Google communicated more openly, sooner, [2] which I think helped with customer trust. Wow, Atlassian really didn't say anything publicly for nine days?!?
Aside from the obvious "have backups and try hard to not need them", a big lesson is that you have to be prepared to do a mass restore, and you have to have good communication: not only traditional support and PR communication but also within the UI itself.
Even though you are no longer there...I had a friend who recently had her gmail inbox mysteriously emptied, all emails seemingly permanently deleted. She paid for Google One to be able to talk to support, and they said that the data is gone. Do you know if there's a way to recover this data? She is quite heartbroken at all the attachments that she will never get to see again.
The sad truth is that with 99.8% of customers unaffected, it was probably thought to be a minor issue. If those customers didn't have Gergely's ear we probably wouldn't have heard about it.
Hi, this is Mike from Atlassian Engineering. Not a minor issue. Once we knew the extent and severity of the incident, we had hundreds of engineers engaged and working to restore service.
I should have clarified, that I was talking about leadership's external communication on the incident, like in the article. Nobody doubted you were working around the clock, or with lots of people involved.
even in the middle of the crisis they will play with ambiguity around the words "user" and "customer" to obfuscate the real situation (presumably they don't consider users 'customers', which goes right to the heart of enterprise software in general)
i hate deleting things. prefer flags that hide things instead (like a boolean deleted flag in an rdbms table).
prevents data integrity issues in relational databases, makes debugging easier and prevents disasters.
ideally also include a timestamp, both for bookkeeping and safe tools that only remove things that have been soft deleted for some time and are safe to delete without compromising integrity of anything that is not deleted (this is especially important in relational data models)
Better still: a field that registers at what date a record was supposedly marked as deleted. Because otherwise you still can't bulk recover from an error.
yep. but at least in the rdbms case, and probably in all cases, a flag (and an index on it) tends to be essential for query performance since the state of the flag will appear in most, if not all queries.
that's okay though, queries that reference the timestamp can be slow since they're housekeeping.
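a minimal sketch of the flag + timestamp + index combination (made-up table and column names; sqlite3 used just for brevity):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE issues (
            id         INTEGER PRIMARY KEY,
            title      TEXT NOT NULL,
            deleted    INTEGER NOT NULL DEFAULT 0,  -- soft-delete flag
            deleted_at TEXT                         -- when it was soft deleted
        );
        CREATE INDEX idx_issues_deleted ON issues (deleted);
    """)

    db.execute("INSERT INTO issues (title) VALUES ('broken login page')")

    # "Delete" = flip the flag and record when, so bulk recovery stays possible.
    db.execute("UPDATE issues SET deleted = 1, deleted_at = datetime('now') WHERE id = 1")

    # A later housekeeping job hard-deletes only rows soft deleted long enough ago.
    db.execute("DELETE FROM issues WHERE deleted = 1 AND deleted_at < datetime('now', '-30 days')")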
The GDPR and various things have made companies more skittish in doing things this way, because they get scared.
Perhaps an effective measure would be to create a key that encrypts a customer's data, and give them a copy of the key, and let them know that after a certain point your copy of the key will be deleted, and if they want a restore past that point they'll need to provide the key.
You may as well just delete it, then. I guarantee a high percentage of users won't save that key and be able to find it later. GH (edit: or similarly nerdy sites) might (might!) be able to get away with that, but as soon as part of your process is "give the user a cryptographic key" you've just guaranteed yourself a support nightmare, with normal users. It's why the only cryptographic person-to-person communication systems that've been broadly successful haven't involved keeping track of anything, and don't have a setup process more complex than "point camera at QR code".
Yeah, you end up in the case where you "officially" cannot recover after X, but then you make sure that "accidentally" you might be able to recover by keeping copies around somewhere ... until someone realizes and you get sued.
that's an interesting question, i've given a little thought to this multi tenant saas stuff...
not sure if the right way forward is some sort of innovation in operating system and software design where people write and run apps that feel like single tenant apps attached to dedicated per tenant datastores where os and framework magic handle per tenant encryption and segmentation (tenant id as an os level concept)
or... if it makes more sense to encrypt at the record level with keys that only the customers hold using (assuming it's up to the task) homomorphic encryption for things like searches and other backend functions.
either way, for now, soft deleting and following up with an automatic daily hard delete of things soft deleted more than x days ago is a totally reasonable approach.
ops scripts should require typing "yes i know what i'm doing" if someone attempts to hard delete things that have not yet been soft deleted.
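a minimal sketch of that guard (hypothetical helper; in practice the set of soft-deleted ids would come from the database, not a parameter):

    import sys

    def confirm_hard_delete(ids_to_delete: set[int], soft_deleted_ids: set[int]) -> None:
        # Refuse to hard delete anything that was never soft deleted first.
        never_soft_deleted = ids_to_delete - soft_deleted_ids
        if never_soft_deleted:
            print(f"Refusing: {len(never_soft_deleted)} ids were never soft deleted.")
            sys.exit(1)
        # Make the operator spell out their intent instead of hitting enter on a default.
        answer = input(f"Hard delete {len(ids_to_delete)} records? Type 'yes i know what i'm doing': ")
        if answer.strip() != "yes i know what i'm doing":
            print("Aborted.")
            sys.exit(1)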
Yeah, soft delete is the way to go in 99.99% of the cases, with a system setup to eventually hard delete on some schedule (preferably don't hard delete until X number of backups have caught the soft deleted data safely, for example).
Hi, this is Mike from Atlassian Engineering. Strongly agree with this. I'd say that if you can afford it, don't do the hard deletes on a schedule though. You never know when there's a system out there referring to soft deleted data that fails once the data is hard deleted. Hard deletes should feel frightening because they are frightening.
i disagree for one reason. you really don't want the tooling or the process to rot. running it automatically normalizes the scary. otherwise you have bespoke tools in indeterminate states being run by people who are learning how to run them again. that's when i believe things get dangerous.
if it forces additional fail safes or backups to be able to do so safely, then that's probably a good thing to have anyway, no?
> The GDPR and various things have made companies more skittish in doing things this way, because they get scared.
They may be scared. But are they scared enough to reload every single backup they have, purge the desired records, and resave each and every single backup they have? And not also worry they will corrupt/break the backups in the process.
GDPR compliance is a mess of contradictions and unreasonable asks which all seem to amount to "depends on who you ask."
If this had hit us... we would just switch to Excel or something for a week/month?
But maybe we are a very light user of JIRA. Nothing in there can't be replaced. It's "nice" to be able to go look up a 3 year old bug and which client reported it, but not really crucial for day to day ops.
He didn't say it was sufficient; he said they could do it for a short while. I consider myself in the same situation: we depend on Jira, but for a week or so it's not a big deal to use a bunch of Post-It notes.
I don't see this as a valid comparison. There is information loss. This has happened to my team, which had about 50 people, and it was very chaotic. It took us several days just to reconstruct the state our features were in.
Today it would even be more troublesome as we have a lot of integration rules dependent upon the workflow. I'd probably just recommend everyone uses a few weeks for self improvement and only address critical production issues.
On prem is worse in some cases. If you don't have access to the source code or ability to modify it, you're still at the mercy of someone else and now there's likely additional hoops to jump through.
If your Oracle DB or Cisco router has a software bug, you can always restore/rebuild, but that doesn't guarantee you won't hit it again, and in both cases you're still at the mercy of the company producing it.
Even if you're on OSS are you able to fix a data corruption bug yourself?
You get more control over maintenance windows and backups, but it doesn't automatically guarantee better uptime.
Unless their revenue takes a long term hit over the outage, no reason for the stock market to care. There isn't news of people actually planning to stop using Atlassian products over this. The only direct consequence is going to be the one time payment of SLA credits. So I guess the part I find surprising is how little impact this looks like it will have on people using their products more so than I am that the stock market doesn't care much about this.
It was a long time ago individual stocks represented anything grounded in reality. People talk about "fundamentals" and so on, but that's not what the price is based on. I don't think anyone know why the prices move as they do anymore, as there are so many algorithms involved today, both manual and automatic ones.
Yeah, I placed a put option order yesterday. By end of day I was up over 50%, and now I'm down to 50% of what I originally purchased the put at because it went up 5% today.
in some ways it should be bolstered by this, because most of the customers appear helpless, just passively waiting for their data to be restored. It shows that Atlassian have them at their mercy, and it must therefore suggest that there is a lot of latitude to extort more out of them through abusive licensing terms / price increases in the future.
The outage did not impact the stock, most major tech stocks have taken a large hit in the past week and a half (until today).
This event is not even showing up on any financial news site. I'm still hoping it does and the stock goes down, because I placed an option order yesterday betting that it goes down by next Friday. Seems like it won't now, but the risk was worth taking in my book.
On a side note that someone else already made: it is interesting to see that many companies that use JIRA also use Slack, but the volume of noise/complaints/mentions when Slack is down is way different. I barely saw people complaining.
I dunno about everyone else, but I'm generally frustrated and feel blocked when Slack is down, and I celebrate Jira being down because I've never had a pleasant experience using it. Jira is bureaucracy that gets in the way of me getting things done, and Slack is a critical communication path.
Is it though? You can hop onto any of a constellation of other IM platforms, FOSS and not, fairly quickly for an instant comms channel, even if you're missing the history. Having all your issue tickets missing is something you can't really deal with unless you have a very recent dump, and even then you can't just fire up Bugzilla and get something working without a lot of migration and administrative effort.
You can do without JIRA for a week or two as long as managers understand and you all have a good concept of what work needed doing anyway. Then it starts getting dicey unless someone becomes a human JIRA to connect temporary manual bug tracking systems with everyone involved.
We have all sorts of slack channels set up to coordinate activity, so that internal customers can talk to engineers easily, or engineers can engage with each other. If slack goes down, we'd have to work all that out. For many days, it would be a huge drag on the process, slowing down interactions.
Other IM platforms wouldn't solve that just by existing. Sure, in principle one could set up such channels elsewhere, but that takes time, and the communication about it takes considerably more time.
Sounds like having a fallback pre-defined would be prudent if it's that important and you don't feel you could collectively extemporise something. "If Slack goes down, the plan is to use WhatsApp/Teams/Jeff's Matrix homeserver in his garage until service comes back. A list of group channels will be emailed if that happens."
Then if it does go down, you don't have to waste the first day arguing about the plan.
I recommend doing disaster recovery steps for your personal data as well, such as Gmail. At one point recently I was creating filters to delete bulk messages and - when the filter got created, it somehow missed the from:@xyz.com domain part and I ended up deleting => delete forever all emails. I noticed the issue right away but it was enough to wipe 2-3 months worth of emails (all of them, even Sent ones).
hahaha, i left atlassian when they deleted everyone's mercurial repos with no way to export either. fuck them. they've proven time and time again that they dont care about their customers
> Atlassian is a tech company, built by engineers, building products for tech professionals.
I am curious if anyone can provide any more insight on this simplification.
I've worked at companies like this. Originally a core of motivated creative individuals make a cool product. As the business grows rapidly, Pournelle's (Iron) Law (of Bureaucracy) takes over. For a variety of reasons, the very capable creators depart and are replaced by less motivated/aware individuals who are glad to have a job and easily compelled to do things to the product that probably should not be done.
My guess is that while Atlassian may have originally been one of those cool founder places, it has probably morphed into the more incompetent version that comes with scale all too often. But I don't know. Thus my question if anyone can speak to the true current tech capabilities of this company.
It kinda reads like their user's data is not separated very cleanly; I've never worked at a SaaS before, but reading this, especially given the size of some customers, I'd want each customer to have their own independent instance, with its own backup pipeline. I was thinking of "just" giving them their own database, but there's been plenty of instances where authentication got botched allowing one user to see another user's data; this should be impossible if things are running on their own instances.
Note that I'm pretty naïve and armchair on this subject, I'll see myself out.
Something to consider is that Jira can require a great deal of configuration to tailor it to your needs. If you already have a DevOps team of some capacity (not everyone does) then it may only be a small incremental increase to run things on prem. I did it myself: I'm very much not a DevOps person, mostly unfamiliar with optimizing JVM parameters for apps like this, but it still only took me about 5 hours to get things running stable, and then another 2 hours or so a few weeks later to tweak things like heap size to help things go a bit faster (though it was still somewhat slow)
To be completely open though, I don't know how much DevOps overhead is involved in maintenance or feature updates. I hated the app and used it for less than a year so I didn't have much exposure. I guess my point though is simply that you may not need to use their SaaS option if you have a decent DevOps team already. After the initial setup time I doubt I spent more than half an hour a month managing the internals and updates.
I did spend more than that on configuring the system for use, which you'll need to do regardless.
I have had to correct this too many times already. Server is the name of the deployment type of their on-prem. It means single node non-clustered. Data center is their deployment that supports clustering to multiple nodes (and used to support a few extra features). They are retiring the Server deployment type licenses and pushing everyone to data center or cloud. So no, they aren’t EOLing their on-prem.
The datacenter product also seems geared towards people reselling Atlassian stacks. For example there's a company that offers HIPAA compliant Confluence (complete with signing a BAA, so you can actual store PHI on it). It doesn't seem like a great replacement for the server version.
DevOps really is not just doing DevOps on cloud platforms and SaaS. Besides, the sysadmin aspects of self hosting should be handled by, well, sysadmins. DevOps should be handling other aspects, like developing the solutions necessary to have things (in this case Jira) work together with other systems, among other responsibilities. DevOps can be implemented in different ways, with responsibilities that differ from one organization to another, but I've never heard it defined as "we don't deal with on prem"
But I also get the impression that you may just be expressing a preference, not a rule of DevOps? If so then I definitely understand. Custom solutions to integrate or glue disparate systems together is often not the most interesting work. My area... A single word doesn't encompass what I do, I'm a generalist in my domain with one or two specialties, but glueing data together (not the same as a full integration, I know) is a big part of my job, and usually the least interesting.
Though in this case, from other comments, on prem seems a dwindling option anyway for Jira. I worked with it about 7 years ago under one of their free licensing programs and disliked it enough that I didn't bother following them after that.
Yes, I should have written a bit more. It is definitely a preference. Currently, in my team Devops is a one man show (me). So I have to be careful with how much burden I allow. There are things that make sense to self host, I agree. I am just trying to avoid becoming classical IT and having to provide lots of end user support and such.
A case for reducing complexity of software. Also, given the recent GitHub incident spree, it's almost debilitating. The entire tech industry takes a hit when companies like these fail at operations.
It might be a good short opportunity... I imagine a lot of customers are kicking off their own internal process for migrating away from JIRA. By the time they actually do, it'll be at least a couple of quarters from now, which is when the customer hit will start materializing in quarterly results for the company.
Maybe time to throw a few chips at some long term puts?
I wouldn’t short. They just slapped 400+ customers and likely hundreds of thousands of users in the face and the C-suite didn’t think it was important to even acknowledge.
That might look like incompetence, but I think it’s confidence. They know the switching costs for large orgs are so high they can treat these people like trash and few if any will leave. I wouldn’t be surprised if the total number of seats among affected customers has gone up in a few months. By failing to acknowledge the problem they’ve kept it out of the mainstream media and financial press.
They have their customers by the balls and don’t respect them. That’s a short term bullish signal to me.
Aren't most customers in 12+ month contracts? A migration seems like it would take many months to select a new vendor and migrate regardless. Be careful about the date on those puts. It's pretty hard to out-think the market on this kind of stuff. I'd just as soon bet the other way: few customers will actually churn and in 6 months this won't really matter.
They might even get some new customers after people who never used it look at their site and offerings.
Disclaimer: I have puts that expire 4/22 (purchased yesterday) so I hope they go down in the short term. Seems like a total loss now after being up 50% yesterday.
I bought puts yesterday morning. Was up 50% by the end of day but now down to 50% of what I paid.
Mine expire 4/22 but I have more calls open at the moment anyway, so if I had to choose between this going down or the market going up, I'll take a full loss on these puts (seems likely at the moment)
> Most of them said they won’t leave the Atlassian stack, as long as they don’t lose data. This is because moving is complex and they don’t see a move would mitigate a risk of a cloud provider going down. However, all customers said they will invest in having a backup plan in case a SaaS they rely on goes down.
The real key lesson here. Your business is important to you. Not so much to the service provider.
Reading this piece is kinda boring. As usual, the root cause is a design defect in their backup-restore functionality. And it's at a complexity level any senior developer could have pointed out to be posing a fatal risk to the company.
My guess is many people inside knew about the problem, but corporate taboos made it impossible to discuss. I'd bet a fortune on this being the case.
Honest question here: The companies impacted by this, are they not taking backups of their Jira/Confluence/Bitbucket instances? Or is this outage impacting the ability to import those backups?
There are some Python scripts that will back up Jira and Confluence. I whipped up a quick script that gets a list of all our bitbucket repos and then it clones those daily as well.
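For the Bitbucket part, a rough sketch of what that daily clone job can look like; the endpoint and response shape are from memory, so treat it as a starting point and check the current API docs:

    import subprocess
    from pathlib import Path

    import requests  # third-party

    WORKSPACE = "my-workspace"            # hypothetical workspace name
    BACKUP_DIR = Path("/backups/bitbucket")
    AUTH = ("username", "app-password")   # placeholder credentials

    url = f"https://api.bitbucket.org/2.0/repositories/{WORKSPACE}"
    while url:
        page = requests.get(url, auth=AUTH).json()
        for repo in page.get("values", []):
            https_clone = next(
                link["href"] for link in repo["links"]["clone"] if link["name"] == "https"
            )
            dest = BACKUP_DIR / repo["slug"]
            if dest.exists():
                subprocess.run(["git", "-C", str(dest), "fetch", "--all"], check=True)
            else:
                subprocess.run(["git", "clone", "--mirror", https_clone, str(dest)], check=True)
        url = page.get("next")  # pagination: absent on the last page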
A Jira backup blob isn't especially useful. Confluence could be if it's essentially HTML dumps that you could host internally read-only. Bitbucket clearly has backups and a migration path.
Why is it not useful? From eyeballing it it looked like the same file I built from our Fogbugz data to import our historic cases into Jira. I'll carve out some time to try doing an import into a new project to see if it loads properly.
In this particular case, if you have another account, that could work. What I meant was that Jira isn't very useful unless people can actively use it for issue tracking. It's not all that valuable when it's just a reference.
I suspect - pure speculation - that they can't restore the backups, because if they could then they could easily do this in a way that accounts affected could be restored selectively. In other words: test your backups, if you don't they won't be there for you when you need them.
Don't trust cloud providers with your core business functions. I'd go even further and say don't trust the cloud, period. I think the next big thing is going to be moving back on premise or private cloud as more businesses realize this.
My current employer uses Jira but we seem to have not been affected by this. Hopefully those customers affected are able to press Atlassian for improvements from notification time, backups, usability etc.
When doing bulk deletes like this, what safeguards do you put in place, other than testing the script up/down in another environment, turning off app servers etc. (which I'm guessing they did not do)?
Naive approach, replace delete with select and see if you're surprised at the results.
More mature approach, especially in an environment where engineers are running bulk changes against the database, you don't do bulk deletes. You change that delete into an update that marks things for later collection.
One tactic I've seen that worked, assuming you have straightforward relational tables: you add a "marked for deletion" column whose value is an identifier for the single run of the bulk job you just did. Then you can query rows with that value in that column to ensure it had the desired effect. If you're satisfied, you run another bulk job which doesn't re-run your original query.. it just deletes rows with that "marked" value.
Lots of places rely on schema-enforced foreign keys and cascading deletes though. In that case, my recommendation is: don't.
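A minimal sketch of that mark-then-collect tactic, combined with the dry-run SELECT mentioned above (table and column names are made up; sqlite3 used just for brevity, and it assumes an existing database with a sites table along these lines):

    import sqlite3
    import uuid

    run_id = str(uuid.uuid4())
    db = sqlite3.connect("app.db")  # assumes an existing `sites` table

    # 1. Dry run: turn the delete into a SELECT first and see if the count surprises you.
    count = db.execute("SELECT COUNT(*) FROM sites WHERE licence_expired = 1").fetchone()[0]
    print(f"Run {run_id} would affect {count} rows")

    # 2. Mark instead of delete, tagged with this specific run.
    db.execute("UPDATE sites SET marked_for_deletion = ? WHERE licence_expired = 1", (run_id,))

    # 3. Review exactly what got marked before anything is destroyed.
    for row in db.execute("SELECT id, name FROM sites WHERE marked_for_deletion = ?", (run_id,)):
        print(row)

    # 4. Only after sign-off: delete strictly by the mark, never by re-running the query.
    db.execute("DELETE FROM sites WHERE marked_for_deletion = ?", (run_id,))
    db.commit()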
Canary deploys i.e. start with a couple customers and do manual validations, wait a little bit of time (maybe a few days) before incrementally rolling it out to larger amounts of customers.
It's not clear if the issue affected all tenants where the script ran--which it sounds like it did. It wouldn't be as effective if it only affected certain tenants (maybe with a specific config)
yah, it's interesting: our org is in the middle of a big debate around BCP/DR where cloud migration is presented as a no-brainer solution that obviates the need for DR. Maybe not?
> However, if they [restore backups], while the impacted ~400 companies would get back all their data, everyone else would lose all data committed since that point
OK, so you restore backups to a separate system, and selectively copy the stomped accounts' data back to production. Simple concepts aren't that simple at their scale, sure, but I suspect this is skimping on details of some truly horrendous monolithic architecture choices that they're trying to hide.
Not that I ever thought using their products was a good idea; to be clear about my position... But at this point anyone continuing to rely on them for anything is asking for the suffering they'll get. Signing up for their crap for a vital business function is like offering your tonker to a snapping turtle.
I would really like to understand who makes the decision to purchase JIRA. It's like the C++ of ticketing software--it does everything because no one wanted to sit down and think critically about the use cases and instead decided it would be easier to say "yes" to every single feature request. It definitely feels like whoever is buying JIRA is not on the team who is using it (maybe IT or finance) because it ticks their boxes and it has such a huge list of features that nominally it appears to tick the product development boxes (ignoring more subjective concerns like "quality", "performance", and "usability").
I would really like to try working in an organization that uses something simpler, like Trello (although now that this is also an Atlassian property, maybe not exactly Trello?).
JIRA is a framework for making assembly lines out of knowledge workers. When you're a middle manager at a decent sized company, a major problem you face is that the mass of knowledge workers beneath you are opaque: you have no way of knowing whether they're working or not. Another problem you face is that they're uppity: people who went to college and got used to managing their own time now have all kinds of idiosyncratic ideas about how to manage their own time and arrange their own working lives. Since you are a middle manager you despise local differences. Since you are a manager you're pretty sure that only you and your lieutenants can be trusted with this kind of decision making power. Adopting JIRA is a powerful lever to put people back in their place as work item churning machines. Constraints such as only certain people can create or assign tickets, only certain people can mark them completed, only certain states are valid transitions from other states, etc. implement a level of domination over white-collar workforces that managers would be otherwise uncomfortable asserting face to face.
Other ticketing systems do not work nearly as well for this purpose because they are designed mainly as external brains or communication platforms for workers, and they assume a level of worker autonomy in moving tasks through their lifecycle. In Trello you cannot make it so that a PM has to sign off before a card is moved to the in-progress column, or that only in-progress cards can have code reviews associated with them. JIRA eats these kinds of requirements for breakfast.
EDIT: This is not to say you can't use JIRA in a workflow-neutral way, or that everyone uses it for this reason, but I would submit that it's JIRA's differentiated advantage.
It sounds like you've been hurt by some terrible management practices, I'm truly sorry that some managers think their job is to control their subordinates.
However, regarding ticketing systems: in team environments, it is very effective and helpful to have a system that manages the data about the work that has been completed, is being worked on, and is planned to be worked on.
Part of that system might be defining restrictive workflows for some teams, not for control, but to ensure the agreed upon process is followed for quality or consistency.
One of the many problems Jira has is that if you don't have a Jira admin on your team, it's impossible to build an effective and efficient workflow for your team. Coupled with Jira making many things global by default (it takes a lot of care to make a change that only affects specific Jira projects), most configurations end up being a pile of garbage automatically inherited from changes an admin (who is not part of the team) made when intending to change something for another specific team.
Caveat: this is going to be a meta comment rather than a comment about the topic proper, and so maybe not appropriate for HN, but I think it's worth discussing.
> It sounds like you've been hurt by some terrible management practices, I'm truly sorry that some managers think their job is to control their subordinates.
When we assume someone was hurt, and imply they hold an opinion only because they were hurt, we risk delegitimizing their position. The interpolated message we might be sending is "your experience is personal and not representative of the subject at hand, and so your thoughts are only applicable to your situation; so, after we express our sympathy, your thoughts can be dismissed." Or the message we might be sending can be patronizing: "you hold your opinion for emotional, rather than rational, reasons; I'm sorry that you are so unfortunate."
To be clear, though, I'm sure this wasn't your intent, and it makes me glad to see someone being compassionate (i.e. that you bothered to consider the experiences and feelings of the parent commenter).
A personal story: I was raised devoutly religious but left the church in my twenties. My family and friends assumed I left because I wanted to be free from guilt, had been hurt by a culture that belied the doctrine, and so on (and they said as much). My change of belief occurred after recovering from a few years of mental illness, and while it is true that I may not have left when I did were it not for the opportunity to reexamine my beliefs (while trying to piece back the fragments of my life into a sense of self), the reasons why I left were the result of a lot of research and thinking. It was mildly frustrating when people assumed my decision was made for emotional convenience, when in reality, the research was uncomfortable and contemplating an unfamiliar universe was scary.
I recognize the irony here – the issue I'm highlighting in this comment may be something that only I feel is an issue, born from a personal experience. But I think it's more common than that.
Beyond trivial scale, you need good processes so that individuals can do their jobs. If you have no processes, change and development becomes extremely difficult because people will be hunting for documentation all the time, stepping on each other's toes, and making mistakes that they should not be making because they forgot a trivial procedure that was a prerequisite to solving their actual problem.
I work with a variety of different environments, and depending on the environment I can either solve my problem in minutes and get it deployed in another few minutes, or solve the problem in minutes and then spend hours figuring out how to safely deploy it without breaking everything. JIRA is terrible if you just take whatever it offers by default, but when used properly it can absolutely help with this.
To add to that, and perhaps educate your downvoters a bit, it can be very hard to imagine why or when such strict processes are helpful without having direct experience with organizations of sufficient scale. It literally boggles the mind but the process truly is king when there are hundreds (or thousands) of individuals working on a single product.
Agreed. An essential part of blameless engineering culture is "the outage isn't any one person's fault, it's the fault of the tooling and processes for allowing them to do that". Good processes prevent everyone from making the same mistakes.
IMHO the "correct" or at least humane organizational design is that most things happen in local teams, which are of trivial scale and can get along just fine with informal, ad-hoc, or locally varied processes.
Obviously not all work is this way. Sometimes you need to drive a migration that touches every team, and then the technologies of bureaucracy and process become important. But most work should be done in human-scale groups that can be more towards the self-organizing and trust-based end of the spectrum.
However some middle managers take offense to the idea that their different sub-teams have different operating models internally, and lean on technologies like JIRA to try to make them all the same. Middle managers at my company have tried this, not very effectively, so it hasn't hurt me too badly. But I've seen their vision and recoiled in horror.
>However, regarding ticketing systems, in team environments, it is very effective and helpful to have a system
I think the point is that Jira is particularly granular in the way that it lets you do things with permissions, workflow rules, roles, metrics, etc. There's a fair number of places that use that granularity to create a weird digital sweatshop.
Meaning the complaint is more about really deep "micromanagement as a service" than what you might get with lighter tools.
Micro managers are everywhere, even in places that may seem culturally incompatible. I’ve yet to work for a business that prioritizes regularly evaluating managers for their management skills. It’s only addressed when shit really hits the fan. Managers are primarily evaluated by their own managers on deliverables. As long as they’re getting results and entire teams aren’t quitting simultaneously there’s no need to question anything. As long as a manager is toxic in ways that don’t break the law or violate major company policies any attempt to address this by a direct report carries the risk of termination or retribution. Does it contradict your company’s cultural values? Rules for thee.
And I wouldn’t assume you’re not one of them. The worst cases I’ve run into aren’t even the psychos that embrace micro management as part of their “management style”. It’s the ones that genuinely believe they aren’t engaging in the behavior. They’re not micro-ing, they’re “helping” their team because they are an awesome manager and their team is almost awesome; they just need to be monitored very carefully and given “suggestions” until they nail it. But they’ll never nail it. Because no one is as smart or as experienced as they are, or does a task “just so”. They view themselves as a mentor to all. All decisions must be theirs to make. Jira becomes the perfect tool since the team effectively becomes little boxes that accept tickets or stories and return work performed and delivered as specified.
For any managers reading this that don’t see a problem with this or see some of those behaviors in yourself please understand that you are sacrificing your team’s happiness and motivation at the altar of your own insecurities. No one can grow where they’re not trusted and no one can improve their skills when they’re never given latitude to make meaningful decisions. Your people will make mistakes. They will accomplish things in ways that are different from how you would do them. It might even be objectively worse. That’s ok. That’s how you grow into a strong team with confident members.
Kanban, by design, was a tool used in production control. It's one of the ways Toyota made their JIT production function.
I worked on the line (Toyoda Iron Works) and used a real-life Kanban implemented by the plant engineers. It was used for quality control, to broadcast quality control and station output, and was checked regularly against their internal estimates and baselines and used also as a gauge for employee output.
Control is what it's designed to do. The very fact that Kanban is the tool of choice should support at least some of OP's points, objectively.
Agreed. This is a problem of scale in my opinion. When we have 10 engineers, it is easy to check in with everyone and know what they are working on and get a status update. When we have 500 engineers, making sure all their tasks are aligning (organizations are one big race condition) is not just hard but impossible without some sort of tracking system. We all want to grow big. To do so, your processes need to change as you add more people. The exceptions (Valve, Netflix, etc.) that can handle being flat or semi-flat are very unique.
Are they unique because their problem domain allows it or because the leadership is uniquely ideologically driven (and competent) to implement efficient, flat systems?
I think it’s mostly the people. They are die hard about their culture. At most of my workplaces, culture is generic and the company would be willing to set whatever culture rules seemed to work.
I was told by a lifetime manager turned successful consultant, that roughly fifty percent of engineering firms govern their engineers basically using fear.
Could you elaborate? What kind of fear? “You’re fired”? I wonder how effective it actually is, because of the current job market and also because I (and others) react very poorly to these kinds of tactics: “you want me to fear getting fired? Joke’s on you, please DO fire me, I dare you”
> I wonder how effective it actually is because of the current job market
Counterpoint: software developers aren't necessarily well paid or highly regarded everywhere, since remote working for companies abroad hasn't quite gotten mainstream enough.
So it might just be effective against some people, or in cases where the hiring process itself has become increasingly unreasonable - the job being working on boring CRUD apps but the hiring process being multiple stages of Leetcode and complex interviews.
It probably applies to the industries and companies where devs are treated as a cost center and since those companies aren't all out of business, plenty of people must be working in such environments, with sometimes sub-optimal conditions.
I'm guessing it's sort of a nerd shorthand for "various means that leave people confused, without any strong rational, scientific, or technical basis"
Not in a negative way. You want to trust engineers to always have changes built and tested before they go to production, but when something egregious happens you need to go back and see what went wrong. You can choose to interpret that as control, but really the only alternative (often cited) is "Well that shouldn't ever happen, so you don't need tooling to support that situation".
And that is not a useful way of thinking when you have real engineers writing software that people depend on.
This is very likely even if engineers come up with the processes, unless all process is scrapped and done from scratch every time an engineer is hired.
I think you've got part of the answer here, but are selling it short. Jira is the most complex task-processing rule engine that is also easy enough for a small team to operate, and also has the broadest set of integrated tools of any offering.
You can use Jira as a simple Scrum board, a Kanban board, or you can build enforced-process monstrosities. You can build customer-support / internal-helpdesk workflows, or even model internal work-item-oriented business processes, etc. Now, as you point out, just because you can doesn't mean you should, and many orgs fall into the trap of making issue workflows overly-restrictive. But most companies (I believe) choose Jira before they choose those hairy task workflows. Startups with zero process use Jira.
Also, you can integrate it all together to give good-enough dashboards/roadmaps, good-enough (for some, not me) docs integrations with Confluence, Git integration with Bitbucket etc. -- while there are big issues with these systems, I think it would be myopic to ignore the real benefits of working in one integrated stack where every design doc you write has dynamically-updated labels and auto-complete for each issue you type in.
For context, I use Jira for tasks and don't love it, found Confluence to be really annoying and so I don't use it, and prefer Gitlab to Bitbucket, but I think you have to recognize these unique selling points. If all Jira had to offer was the rule engine it would not be as widely used.
I feel you here, but I've been at multiple companies that used JIRA and never once had any of those requirements. I've also never seen it come up when deciding which ticketing system to use. Teams have always been free to move tickets at-will.
One very large video game studio has tons of automation for Jira. Imagine someone deciding to add a new weapon. The automation creates 100s of tasks for concept artists, 3d artists, animators, sound artists, and software developers, with complex dependencies between those. Most importantly, automation creates multiple QA steps for each element of completed work.
The same exists for levels, enemies, quests and tons of other elements.
I would not be surprised if a lot of studios had similar workflows.
See, that is great. Automate what can logically be deduced from the information available and set up templates to provide that information. For developers, it should be automated enough that you shouldn't have to write the same info twice: once in commit messages/merges/branch names, and again in the ticket itself. If the workflow is that streamlined, all that information can be deduced and the ticket can be advanced automatically, and most information is available and documented for other parties.
However, that's just not what most people go through in companies using JIRA. Worse, they have to toggle between pages multiple times, each taking at least a few decent seconds to reload. I'd like to give JIRA the benefit of the doubt here, but it sounds like the tool is just very easy to misconfigure and abuse.
This is pretty easy with Jira. There's a GitHub integration that links PRs and commits to a ticket, and one that links ticket numbers in GitHub back to the Jira tickets.
And you generally do them both at a lower level than tickets, certainly commits, so you don't want to have too much automation between them as that starts adding constraints.
Even worse, companies with the resources to buy JIRA will probably hire consultants to set it up, and you wind up with a system 1) bought by people who don't understand how programmers work, 2) configured by people who don't know how your company works. So end users usually wind up with a terrible system that continually generates complaints (along MANY axes), and the people responsible for foisting it on them think they're just being difficult.
So I would say that this assessment is on the whole, kind of cynical, however I suppose I have the interesting position of being in an organization where I feel like I actually see both JIRAs.
One JIRA is the project that's used for development of the core product, where there are no constraints— anyone can add a comment, create links, change assignee, add new tags, push the tickets through whatever state transitions they want, and so on. It works, though it is a little chaotic sometimes as subgroups of people have different preferences for how things should go (eg, for tickets requiring test team validation, should the ticket assignee remain as the person who did the original work so it's clear who has more to do if it fails validation, or should the assignee change to the test team person, so that it's clear that that's the next person who has it as an action item?)
The second JIRA is the IT team's internal support project, which is completely locked down— no one except them can close tickets or move them around, or even edit the contents, closed tickets can't be commented on any more, and so on. This is the one that gives me the vibes you are talking about. Every time I have to interact with it, I loathe it because every inch of it is transparently a funnel, railroading me along a path toward one of either DONE or WONTFIX. This is absolutely efficient, in the sense of meeting the goal of closing all the tickets, but I feel it introduces friction for the larger business goal of actually helping people resolve their problems. To the point where eventually most of the IT support activity moved away from the JIRA project to an informal Slack channel, which is way more accessible, but worse in basically every other way: it's harder to effectively search, impossible to properly link, bad for async, bad for dealing with more than one thing at once, etc.
Oh, nonsense. People buy Atlassian because the licensing is cheap, not because it's particularly good at what it does or designed with any particular workflow in mind.
I don't see how it is cheap. Standard may be cheap but then you are missing a lot of features that are announced on the product pages with a small footnote saying "only in premium".
Free software has zero acquisition cost, but non-zero TCO, which can measure in millions USD (recurring salary of dedicated IT team), depending on the size of organization and complexity of the setup.
You will need to maintain on-premise infrastructure, automate backups and recovery, automate security, automate updates (including testing and rollbacks) etc etc, basically doing all the jobs of the people responsible for the infrastructure at the SaaS provider, but at much smaller scale and not achieving the same efficiency. You will have to do those jobs considerably better to justify the costs.
In thirty years of experience, I've seen this talking point come straight from Microsoft's anti-Open-Source days.
> Free software has zero acquisition cost, but non-zero TCO, which can measure in millions USD
Often a primary driver is exactly the opposite -- for-profit companies are accustomed to paying money for a good or service, with a billing pattern and legal obligations. The company financial deciders do not want a setup that does not have a billing pattern and clear legal obligations. Meanwhile, Open Source Software went from niche to mission-critical in the 2000s via the Internet. For-profit companies (and their publicists) scrambled to explain it, and came up with that exact line repeated again today. I do not blame any person for saying it, it was in print in some reliable place. It does not capture the reality in 2022 IMO.
> The company financial deciders do not want a setup that does not have a billing pattern and clear legal obligations.
I haven’t ever met a CTO or CIO who would make budget decisions like that, nor do I make them that way myself. The reality in 2022 is the same as it was in 2012 or in 2002: when you choose a solution, you consider all long-term costs.
In 2022 TCO for the server software includes everything that I mentioned in my comment and more. There’s a lot of use cases for OSS in corporate environment, for sure, but not every OSS solution is cheap or even affordable. Running on-premise open source collaboration tool is certainly not cheap if you do it right.
Sure, if you host it yourself you have to pay someone to admin it (usually significantly more expensive than a license), and if you use a hosted solution you have to pay the host.
I think people buy JIRA because you can set it up however you want. I've seen it almost as simple as Trello and much more complicated. It doesn't have to be terrible, it just usually is.
If JIRA didn't allow you to make it terrible, it wouldn't allow for some of the absurd things that people want it for and those companies might not buy it.
They used to say of Microsoft Word, "Nobody uses more than 5% of its features, but every company uses a different 5%."
The saying is apocryphal and unlikely to be accurate, but the shape of the thing it's describing applies to almost every piece of enterprise software, whether installed on-prem or SaaS.
And as another comment points out, at Enterprise scale you can substitute "team" or "group" for customer. Every team might use a different 5%, and unless you standardize their processes, you have to buy the product that can accommodate all of their needs.
Only if you assume the 5% of features to be a contiguous block each time.
However, if we assume there are, say, 100 features in Word (the real number is likely much higher), the number of combinations is orders of magnitude higher than 20.
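As a quick sanity check of the combinatorics (a toy calculation, taking the 100-feature figure above at face value):

```python
import math

features = 100
per_company = 5  # "a different 5%" of 100 features

# Number of distinct 5-feature subsets a company could in principle be using.
print(math.comb(features, per_company))  # 75,287,520
```

So there is no mathematical obstacle to thousands of customers each using a different subset; the real question is how much those subsets overlap in practice, which is what the following comments get at.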
> Well its mathematically impossible to be accurate as soon as you have > 20 users.
It's probably in the semantics.
Text input and editing is clearly a part of functionality that's probably used by everyone (or at least most users), so it's not possible for "different 5%" to mean what you're alluding to, maybe the phrasing needs work.
In any given 5% there might be 1-4% of overlap with what others are using and the remainder of that is specific to the company.
And the greater the degree of overlap the weaker the implicit argument.
If it's a uniform distribution of discrete features then each feature is equally "important" and worth equal resources and dev time. If 81/100 companies use the exact same 5% of features and the remaining 19 cover the remaining 95%, then all else equal you can probably drop 95% of your features and still do well.
The dynamics of the Enterprise market are such that there are features where having just one customer that will make a buy/no-buy decision based on just one feature will deliver enough incremental ARR to justify the opportunity cost of doing that feature instead of a bunch of others.
Typically you do the most popular features first, but most Enterprise vendors end up working on a long tail of niche features that nevertheless are profitable.
There's a long conversation to be had about how this ends up being a trap where Enterprise software gets bloated and shitty and eventually gets disrupted by a small vendor that does "less," but in a powerful, transformative way that obsoletes the Enterprise "standard," which leads us back to discussing Atlassian :-)
They're a good example of this dynamic, because they have a "constellation" of products to sell. So if they build a niche feature that gets a new customer to buy Jira seats, having "landed" in the account, their salespeople can "expand" by selling OpsGenie and other related products very profitably.
My relatively small team at a massive enterprise built all our report generation tools around JIRA for an entire class of offerings. It's been easier for them to justify continuing to pay for JIRA and keep it propped up than to develop (or migrate to) a new solution.
As the lone dev on the team I've been continually astounded by my leadership's willingness to commit more and more to tech debt laden paths. The notion that all software requires maintenance is anathema to them, and it's led us to be 'cornered' into decisions re: what software we can use / where we can invest our discretionary funding.
Moreover, we're constrained by the parent mega-enterprise's software purchase policies; JIRA's already approved (and run elsewhere in the enterprise), whereas off-the-shelf or SaaSy alternatives are significantly harder to get buy-in for. (No using corporate cards for SaaS, all purchases need to go through the quote/purchase-order process, etc).
I once worked for a company that did the same - but with Lotus Notes, in 2013. Modified it into a full-fledged ticketing- and time-tracking tool. Using it took a half hour out of each workday.
I really, really like Trello and am dreading the day when atlassian starts tinkering with it in any real capacity. As a content creator, it is the first workflow system I’ve ever seen that I can effectively share with my client. It’s so simple and streamlined and the fact that I’ve stuck with it despite my ADHD says a lot.
Clients add their notes to the card, I check the boxes as I hit the notes, and I move the card further right as we enter different stages of the post production process. We then have a column of every completed project, which is incredibly easy to sift through if we need to revisit something. It’s literally left to right in the workflow, it visually is telling me where we are at all times.
It’s incredibly simple and elegant. For fast turnaround, relatively stripped down content (like podcasts) there is nothing like it.
I’m just a user but totally happy with all our Atlassian apps. Confluence is a huge win across our multi-thousand person company and the best teams use it very well. I like the integration between Jira and Bitbucket. We don’t over complicate things and it works fine.
It’s like my taste in wine. I don’t want an overdeveloped sense of taste where only a $400 bottle will do. I’m fine with what we have because the work is what excites me and if people are documenting projects and managing workloads and committing code, we’re 90% of the way there.
I was part of the decision to purchase Atlassian tools at my company. We had been using a variety of self-hosted and SaaS tools which had varying abilities to integrate with each other. We’ve had very positive feedback from users since switching to them. We were also able to move some of our help desks to JIRA Service Management, and away from another self-hosted product which is still used by a good portion of our business. The self-hosted product is honestly a nightmare to maintain and keep secure. According to the vendor, the “fix” is to have 1-2 people dedicated to that product, which simply isn’t something that my team has the bandwidth or will to do.
JIRA does try to be all things to all people…and mostly succeeds. For instance, we use the same workflow and mostly the same nomenclature across our development and helpdesk teams. Some of our software projects use Kanban-style workflows, while others use sprints, but we can keep track of a project across multiple teams using the same tools. I’m sure other products also offer this, but we liked the integration and overall capability for the price.
There are definitely issues: some feature requests and bugs have languished in their backlog for years. But you can get started very quickly and we’ve had great feedback from users.
The answer is medium to large companies. Jira is a tool that can satisfy hundreds of different teams’ work management needs without having to buy dozens of different products.
The fact that it’s so feature packed and customizable is the point.
I think the complainers are not really investing the time in to change project settings to fit their needs.
My only complaint about the Atlassian suite is the performance of Jira and Confluence. The overall page load speed is too slow.
> allow any ticket to be a child of any other ticket?
I have no idea why you would want this from a work management point of view, but you can just use issue linking to describe a parent <-> child relationship.
Re 1: I'm not sure why that's a necessity beyond a notion of consistency. I find that major wiki editors are not often major ticket creators, and these are different products with different audiences at the end of the day. Also, Confluence uses a WYSIWYG editor, so it's rare to need to think about the markup.
Re 2: Set the project's issue type scheme to one that only allows tasks and subtasks. That gets you one level of nesting. (And even though task and subtasks are different issue types, changing from one to the other is trivial since they have identical fields.) Allowing epics gets you another at the top level. That's a bit limited, but wouldn't arbitrary nesting be even more complex?
I agree. I look at every JIRA killer and think we could maybe move and nope...they're missing something we use. In many ways JIRA is like Excel. On the surface it can appear easy to replicate for a single user, then you realize every user uses 10 different features.
I made the decision, unfortunately. The rationale was literally that I hated pivotal tracker -- what a garbage app that is -- and I'd heard of jira, needed something to track bugs / work items, and signed up. It crucially had a zendesk -> jira sync, so all our zendesk requests could end up in jira.
In the beginning, with me plus 2 engineers, I noticed it was slow but since I used it for 20 minutes a week, that didn't really matter. By the time I started using it for an hour a day, we had 10 engineers on 2 teams using it. I got to see a friend using linear, and I had some spare time that I was going to use to switch, but I couldn't get in the beta. By the time they let me in, the opportunity was over and I was too busy.
JIRA is generally fine software that is good enough for most folks, especially if you're willing to adapt your workflow to it. Where it goes wrong is where tools like Jenkins go wrong: folks add too much customization.
That means the tool is often the wrong one for the job, but instead of picking something that's a better match out of the box folks stick with the easy choice (extend what they have).
Trickle-down and first mover. JIRA was there first being "decently ok", enough people adopted it, and now others do the same. Then couple that with what you write: the people in charge of deciding the software are generally the ones who can justify wasting half their day on it.
To this day I still don't know what JIRA does so much better than other products that big corps are willing to waste months' worth of man-hours over it. Its biggest selling point is integration with the remainder of the Atlassian stack, not exactly known for being great either.
Personally, I like JIRA. I think it adds a ton of transparency in our org, and while I've used Trello for personal and home projects, I don't see how it's good enough for business. Trello doesn't even allow for time estimates (last I tried), which for us is part of planning. Search in JIRA is also really good, so no ticket is ever just lost to the ether.
Sure, it's not perfect, and waiting for a board to load is annoying, but for distributed work and visibility, I haven't seen something as professionally useful.
I was a C++ programmer in a past life and I sorta like it. C++ and JIRA seem to have the same philosophy with respect to choosing which features to admit: "yes". The idea is that by supporting the largest number of features possible, they'll surely build something that everyone likes because it will tick everyone's boxes. What people frequently fail to realize is that the absence of misfeatures or redundant features is an important feature in and of itself. Moreover, the more features you support, the harder it is to control for quality.
> The idea is that by supporting the largest number of features possible, they'll surely build something that everyone likes because it will tick everyone's boxes.
The idea that the C++ committee are unthinking people pleasers is patently false.
C++ does have a lot of cruft, but mostly because it aims to:
i) support new features
ii) maintain pretty strong backward compatibility guarantees
In general the new features are actually pretty well liked, but in conjunction with (ii) it creates a big language. There's a reasonably decent subset that can be carved out, but it's also clear why newcomers without legacy baggage (e.g. rust) are making inroads.
"unthinking people pleasers" isn't how I would characterize it; rather, I think of it more as a "kitchen sink" or "more is more" philosophy rather than a "less is more" philosophy. I'm sure the committee deliberated extensively, but deliberation within their particular philosophical context still produced an unpleasant result. I think the same is true of JIRA.
IBM effect. If you don't care a whole lot about your ticketing system, you just pick Jira because everyone'll nod along with the choice, you won't personally be blamed if/when it sucks, and you won't make enemies or have to argue over the choice because it can't do something that someone else in the org "needs" it to do, etc.
The way it works is, someone always says "Sure, JIRA is bad out of the box, but you can customize it to work the way you want" and there is nobody around to say "so now you have two problems: a bad system that depends on having an expert to make it work the way it should".
Then, you pay for JIRA, and that expert customizes it the way they like. It still doesn't work very well for most people. Nobody likes it except one stakeholder, and the engineering lead who acts as an admin on it. A while later, those people have left the company, and everyone else is out of luck.
Seen this exact scenario play out at two different companies now. Am witnessing it play out in real time at a third.
And yet, it actually is set up in an extremely opinionated annoying way. For example there is no way to actually assign multiple users to the same ticket, which is a big problem if your org legitimately does pair programming (mine does for juniors)
In a lot of ways, JIRA disrupted Remedy Action Request System, which had a painful transition from X to Windows client. Remedy was even more admin dependent and unwieldy.
I find it helpful to stop thinking of JIRA as a bug tracker or anything like that. In my opinion JIRA is more of a way to create and track workflows. It can be used as a blank slate for quite a lot of things (which I cannot come up with any examples for at the moment!)
That being said, because it can do anything, it doesn't take much effort to make a workflow as painful as possible. Somebody with the "right" mind might make all kinds of checkpoints in a workflow, which makes a lot of operations a pain in the ass because you wind up hopping through a bunch of steps. Pretty sure in our org we just make our workflow "you can hop from any state to any other state"--basically a free-for-all.
The reason to buy Jira is that loads of stuff integrates with it, and lots of people know it. Maybe not perfect, but that's why. And unless you're in it all the time, which some people may be, its ergonomics are not as important as, say, an IDE's.
The one place I worked that used Jira was a small-but-not-tiny company (about 15 devs at the time). The only people who actually used Jira were the managers. Developers got printed stories. These were used for planning, printed on cards, and taped to a whiteboard when ready. A developer would pull a card to work on, and return it to the manager when it was complete. The manager did all the status updates and reporting to upper management.
IDK if this was to cheap out on the licensing with a minimal number of users, or if it was to insulate the developers from the experience of using Jira. Perhaps some of both.
Clearly that usage pattern would only scale so far.
At big companies I've worked at, the justification was that JIRA was the only one that met all the regulatory/compliance requirements. I don't know if this is actually true, but smaller companies certainly don't market compliance as well.
Most likely the database tables themselves are just a mixture of everyone's data. There's no true multitenancy. So they have to load the backups into a separate database. Then just go through and individually select/insert into the old database. And then you have to worry about things like foreign key constraints complicating the bulk data loading. Are you going to disable constraint enforcement while you bulk load the data? How does that affect existing and new data from customers using the database? Just a guess. But this sounds like a nightmare honestly.
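If that guess is right, the "restore to a side system and copy the stomped tenants back" idea floated above might look something like the sketch below -- purely illustrative, assuming the restored snapshot is reachable as a backup schema (e.g. via postgres_fdw or a side-by-side load), that every table carries a tenant_id, and psycopg2-style access; none of this reflects Atlassian's real schema:

```python
import psycopg2

# Tables listed parent-first so child rows find their parents; a real run
# would derive this order from the FK graph instead of hard-coding it.
TABLES_IN_FK_ORDER = ["projects", "issues", "comments", "attachments"]

def copy_tenants_back(prod_dsn, affected_tenant_ids):
    conn = psycopg2.connect(prod_dsn)
    with conn, conn.cursor() as cur:  # one transaction; commits on success
        # Only helps for constraints declared DEFERRABLE; otherwise the
        # insert order above has to be exactly right.
        cur.execute("SET CONSTRAINTS ALL DEFERRED")
        for table in TABLES_IN_FK_ORDER:
            cur.execute(
                f"INSERT INTO {table} "
                f"SELECT * FROM backup.{table} WHERE tenant_id = ANY(%s)",
                (list(affected_tenant_ids),),
            )
    conn.close()
```

Even this toy version hints at the pain points: FK ordering, deferrable constraints, and every table that doesn't carry a tenant_id needing special handling.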
Yup. The database schema of one of our products uses a tenant_id in most tables to separate customers logically.
I eventually got a tenant exporter to work. Practically, this requires some deep and nasty digging through the information_schema to build a graph of tables and foreign key constraints. Once it has that, it generates selects with a simple where clause for tables with the tenant_id, and selects with weird joins all over the place for other tables, to dump the tenant data.
All of that sounds complex, but that part took a day or two to hammer together to 90% completion, since it's just some graph handling. The other 10% were getting some weird date formatting questions right to produce a properly importable sql dump. And interestingly enough, it's working for more than just that one product.
But that's just where the journey started. After that, it took weeks and months to sort out legacy tables, old tables, tables without indexes, tables no one knew about, tables that were supposedly important (but weren't), tables with inconsistent data, .... And it's just handling a single relational database. And compared to \copy in psql, it's slow. And at times, weird things happen if you import huge chunks of sql into a postgres with deferred foreign keys (because our schema has cyclical references).
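For a flavour of what "digging through the information_schema" looks like, here's a very reduced sketch (Postgres, psycopg2, and it assumes integer tenant ids; it only handles tables with a tenant_id column plus one-hop joins, whereas the real thing needs the full graph walk and cycle handling described above):

```python
import psycopg2

FK_QUERY = """
SELECT tc.table_name, kcu.column_name,
       ccu.table_name AS ref_table, ccu.column_name AS ref_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
"""

def tenant_dump_statements(dsn, tenant_id):
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute(FK_QUERY)
        fk_edges = cur.fetchall()  # (table, fk_column, referenced_table, referenced_column)
        cur.execute(
            "SELECT table_name FROM information_schema.columns "
            "WHERE column_name = 'tenant_id'"
        )
        direct = {row[0] for row in cur.fetchall()}  # tables we can filter directly
    conn.close()

    stmts = [f"SELECT * FROM {t} WHERE tenant_id = {int(tenant_id)};" for t in sorted(direct)]
    for table, col, ref_table, ref_col in fk_edges:
        if table not in direct and ref_table in direct:
            # One hop back to a tenant-scoped parent; deeper chains and cycles
            # need the full graph handling the parent comment mentions.
            stmts.append(
                f"SELECT t.* FROM {table} t JOIN {ref_table} p "
                f"ON t.{col} = p.{ref_col} WHERE p.tenant_id = {int(tenant_id)};"
            )
    return stmts
```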
Point is, I know how painful it can be to handle that kind of database schema, at a ridiculously smaller scale. I'm kind of happy to not work there.
I can't believe that they would intermix the data in that way... but if they did, godspeed to them, they're likely still overpromising what can be done in this time frame.
Wait. Why? This sounds like something that feels hard, if you are used to the giant DBs of old. But you can probably get many many instances of the smaller databases without much trouble.
Would still be some maintenance, don't get me wrong. But far from impossible.
Having worked at shops that used this architecture, it's really not that bad. Can you write the code to do one schema migration? Great, now you can do 1000. The app server boots, runs the schema migrations, drops privs, and launches the app. Now your scaling problem shrinks from "how to have a db large enough to hold all our customer data" to "how to have a db large enough to hold our biggest customer's data." Much easier.
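A minimal sketch of that "one migration, applied everywhere" loop, assuming Postgres, psycopg2, one database per tenant, and a made-up schema_migrations bookkeeping table:

```python
import psycopg2

# (name, statement) pairs; illustrative only.
MIGRATIONS = [
    ("0001_add_deletion_run_id", "ALTER TABLE issues ADD COLUMN deletion_run_id uuid"),
]

def migrate_tenant(dsn):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:  # one transaction per tenant
            cur.execute(
                "CREATE TABLE IF NOT EXISTS schema_migrations (name text PRIMARY KEY)"
            )
            for name, statement in MIGRATIONS:
                cur.execute("SELECT 1 FROM schema_migrations WHERE name = %s", (name,))
                if cur.fetchone() is None:  # apply only what's missing
                    cur.execute(statement)
                    cur.execute("INSERT INTO schema_migrations (name) VALUES (%s)", (name,))
    finally:
        conn.close()

def migrate_all(tenant_dsns):
    for dsn in tenant_dsns:  # same code path for 10 tenants or 10,000
        migrate_tenant(dsn)
```

The operational catch, as other comments note, is everything around this loop: partial failures, tenants on old code talking to a new schema, and the sheer time it takes across a large fleet.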
One of the many reasons to put good constraints on fields and use referential integrity! If you don't let the database enforce data validity you are gonna get fucked at some point!
source: every single place I've worked at that poo-poos referential integrity has a database that is full of bullshit that "the application code" never cleaned up
Always use referential integrity. The people who are against it almost always are against it for superstitious reasons (eg: "it makes things slow" or "only one codebase calls it so the code can enforce the integrity"). All it takes is exactly one bug in the application code to corrupt the whole damn thing. And that bug will happen over the lifetime of the product regardless of how "good" or "awesome" the programmers think they are....
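As a toy illustration of "exactly one bug is enough" (SQLite here just because it runs anywhere; the point is the same for any RDBMS):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.execute("CREATE TABLE tenants (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE issues ("
    " id INTEGER PRIMARY KEY,"
    " tenant_id INTEGER NOT NULL REFERENCES tenants(id))"
)
conn.execute("INSERT INTO tenants (id) VALUES (1)")
conn.execute("INSERT INTO issues (id, tenant_id) VALUES (10, 1)")  # fine

try:
    # A buggy code path tries to write an orphan row; the database refuses.
    conn.execute("INSERT INTO issues (id, tenant_id) VALUES (11, 999)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The application never gets the chance to leave the orphan behind for someone to discover years later.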
That's one thing yes. What if there's a transient network error, or the DB runs out of memory, and now you have some data in an old state and some in a new.
You're lecturing about table design. I'm talking about more general transactionality over any errors.
You'll quickly run into limits on how many TCP connections you can hold open. Unless you also want to run separate app servers for each customer, which will cost a lot of $$$.
Oh, and just forget about allowing your customers to share their data with each other, which most enterprises want in one way or another.
Wait. What? None of the enterprise customers want to share data with each other. And definitely not on a DB level. That should happen in the business logic.
Lots of companies have consultants, and want to be able to share their consulting-related tickets with their consultants. And the consultants want one system they can log into and see the tickets from all of the companies that are hiring them.
It would be a nightmarish scenario if you have thousands of customers. And completely unnecessary. You can create multiple databases and/or schemas in a single instance.
Don't do any of the above unless you understand the implications.
I worked at a company that architected their multi-tenancy in almost exactly this style. In their particular case, only a few of the very largest customers had their database set aside on their own dedicated instance, but every customer did have their own DB with their own set of tables. Having worked in that world (every customer had their own DB) and on a product where all customers had their data intermingled in one gigantic set of tables in one giant DB on one logical instance, I'd definitely encourage the "every customer gets their own DB".
Giving every customer their own database means you're going to need database administrators. For these folks their dedicated job was maintaining, operating, and changing their fleet of databases, but they were very technical and were amazing to work with.
This is the case. I won't comment on your "hundreds of thousands" figure because the number of Cloud customers was a closely guarded secret at least when I worked there, but yes one DB per tenant, dozens to hundreds of DBs per server, and some complicated shuffling of tenant DBs when you run into noisy neighbours.
To be honest I'm at a bit of a loss too. My speculation is that since they went all-in on microservices and utilizing various AWS services (something that was underway when I worked there) their data stores have become very much more disparate.
For example, they have the main PostgreSQL data store. Surely that's easy to restore. But the users in that DB have a "foreign key" (in a logical sense, not physical) to the Identity service. This is a real life example that occurred while I was there. So now we have a mixture of multi and single tenancy. So perhaps the identity records are also tied to this app ID and deletes were propagated to that service. And perhaps there is an SQS queue and a serverless function to handle, say, outgoing mail from Jira. Where does this data go? I dunno maybe some Go-powered microservice with its own DocumentDB store. Do deletes propagate here too? Who knows. You can see how this gets complicated and how issues multiply with more services.
Again, this is only speculation. But "decomposing the monolith" was a big deal and it was coming from the top.
If they had multi-tenant databases for SaaS it would mean either the self-hosted Jira instances also had the same multi-tenant database schema, or they'd have to maintain two almost entirely different data access layers for cloud vs. on-prem. Since their cloud offering came from a historically on-prem codebase, I would expect the easiest way to offer cloud stuff is to do a DB per tenant. Otherwise there would be a shit-ton of new code that only applies for cloud stuff....
Not quite the same but at Fandom (Wikia), every wiki has its own DB (over 300,000 wikis), and they are clustered across a bunch of servers (usually balanced by traffic). It works well - but we don't ever really need to query across databases. There's a bunch of logic around instance/db selection but that's about as complex as it gets.
Interesting architecture. From a design point of view, I like the idea of full isolation. From an infrastructure point of view I'm a little scared. I'd assume it's actually not that bad and there's a good way to manage the individual DBs and scale them individually.
Really interested if you can share any details.
Edit: I know each wiki is on a subdomain. Does each wiki also have its own server?
There are _many_ databases on each server; last I checked there were around 8 servers (or: "clusters") - and we have it so the traffic is somewhat evenly distributed across each server. There are reasonable capacity limits, and when servers get full we spin up a new one and start accepting new wikis there. I am not in OPS, and they do a lot of work behind the scenes to make this all run smoothly - but from an eng perspective we rarely have issues with this at scale.
Some of this was open source before we unified all of our wiki products, which has a lot of the selection / db logic, at https://github.com/Wikia/app.
It doesn't change often, if we do we just have large automated rollout plans - but we've done mass changes enough times there are good procedures around large DB migrations.
There are basically two options for multi-tenancy with their own tradeoffs.
1. An account/tenant_id field for each table
2. A schema for each tenant wrapping all of the tables
Option 2 gives you cleaner separation but complicates your deployment process, because now you have to run every database change across every schema every time you deploy. This gets more complicated while the code is still deploying: the code itself can get out of sync, or there can be a rollback or an error mid-deploy due to an issue with some specific data.
The benefits of that approach are the option to use different backup policies for different customers, easier moves of specific customers to specific instances, and avoiding the extra index on tenant_id in every table.
Option 1 is significantly easier to shard out horizontally and simplifies the database change process, but you lose space on the extra indexes. Plus in many databases you can partition on the tenant_id.
Most people typically end up with option 1 after dealing with or reading horror stories about the operational complexity of option 2.
The secret bomb in option 1 is that you generally have to have smarter primary keys that fully embrace multitenancy, and while Atlassian hires smart folks who I'm sure at some level know this--that's a relatively hard retrofit to work into an existing system.
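To make "primary keys that embrace multitenancy" concrete, here's what the option-1 shape often looks like -- illustrative, Postgres-flavoured DDL with invented table names, not anything resembling Atlassian's actual schema:

```python
# Illustrative DDL held in a string constant so it sits alongside the other
# Python sketches in this thread; the DDL itself is what matters.
TENANT_SCOPED_DDL = """
CREATE TABLE issues (
    tenant_id  uuid   NOT NULL,
    issue_id   bigint NOT NULL,
    summary    text,
    PRIMARY KEY (tenant_id, issue_id)             -- tenant is part of the key
);

CREATE TABLE comments (
    tenant_id  uuid   NOT NULL,
    comment_id bigint NOT NULL,
    issue_id   bigint NOT NULL,
    body       text,
    PRIMARY KEY (tenant_id, comment_id),
    FOREIGN KEY (tenant_id, issue_id)             -- children can never point
        REFERENCES issues (tenant_id, issue_id)   -- at another tenant's issue
);
"""
```

With keys shaped like this, a per-tenant export or restore is a WHERE tenant_id = ... away on every table; retrofitting it onto a schema keyed on plain auto-increment ids is the hard part the parent comment alludes to.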
The second problem is mitigated by the fact that schemas are trivially migratable between database servers. Once you grow too big for one cluster just make another.
> Is it not a good idea to spin up separate db instances for each client/company?
It depends, really. There is a trade-off in terms of software and operational complexity vs scalability/perf and isolation. And probably a bunch of other factors.
If you have separate databases for each customer, schema migrations can be staged over time. But that means your software backend needs to be able to work with different schemas concurrently. You can also benefit from resilience and isolation guarantees provided by the dbms. On the other hand, having a dbms manage lots of databases can affect perf. Linking between databases can be a minefield, especially w/r/t foreign keys and distributed transactions.
I have built multiple multi-tenancy platforms and I never create separate databases for each customer. If you have separate databases, it's almost impossible to run meaningful queries across all of them. That architectural choice creates far more headaches than it solves. Usually people end up with the split-database architecture when they want a quick retrofit for a system that wasn't designed with multiple tenants.
I've also had to restore partial data from backups on a few occasions when customers fat-fingered some data and asked pretty-please to undo. If someone on staff understands the system well, it's not hard. I suspect Atlassian suffers from a complicated schema and a post-IPO brain drain.
It's likely a mixture of all these factors, the brain drain could absolutely be responsible.
At least it would not be the first time in history that a company has lost its engineering spirit and the business people have taken over, so that details like disaster plans become less of a priority.
A business person and an engineer will always view risk differently, better disaster plans is a kind of insurance that is a lot harder to sell when too many business people run the company.
When all customer data lives in the unified database: Just wait until a bug in a query exposes the data of customers to each other, creating instant regulatory and privacy nightmares for everyone.
With an ORM and customer objects to create scoped queries, I haven't found this to be a problem. It's also very easy to check in code reviews. And it's evidently not a painful issue in practice, given how rarely this happens despite it being an extremely common app design.
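A bare-bones sketch of the scoped-query idea: handlers only ever get a handle that already carries the tenant id, so forgetting the WHERE clause isn't possible. (No ORM here and all names are invented; real ORMs like SQLAlchemy or Django give the same effect via default filters or managers.)

```python
class TenantScope:
    """All reads go through this; the tenant filter can't be forgotten."""

    def __init__(self, conn, tenant_id):
        self._conn = conn          # psycopg2-style connection (assumption)
        self._tenant_id = tenant_id

    def issues(self, **filters):
        clauses, params = ["tenant_id = %s"], [self._tenant_id]
        for column, value in filters.items():
            clauses.append(f"{column} = %s")  # column names come from trusted code only
            params.append(value)
        sql = "SELECT * FROM issues WHERE " + " AND ".join(clauses)
        with self._conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()

# Usage: request handlers receive a scope, never a raw connection.
# scope = TenantScope(conn, current_customer.tenant_id)
# open_issues = scope.issues(status="open")
```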
It is like any other architectural choice - there are pros and cons both directions. If you have separate db instances, you have to scale up the operations to manage each one - migrations, scripts, etc need to be either run against them all, or you need good tooling in place to automate it. A single instance avoids all that, but is more complex in the actual software and definitely more complex for security. A single DB also would let you share data amongst organizations fairly easily, but whether that is good or bad depends on your product. I've created and run products both ways, and I like separate DBs at small scales, single DBs at medium scale, but separate DBs again at huge scale if you also put management tooling in place.
I believe you can sign up for an account for free or incredibly cheap ($5/user). You would potentially have tens of thousands of databases. Imagine trying to do something like a database migration to add a column. I believe the day-to-day operations would be a nightmare, as probably no RDBMS has had that kind of setup stress-tested.
Answer: it depends on the application. For example, a big social app is not going to provision a new db for every user, or for every customer that runs an ad. Likewise, a lot of enterprise software fits a model where each customer getting its own db makes sense. So, really, it's just a design decision.
Separate DB instances doesn't scale as well cost wise, and generally means onboarding takes a few minutes instead of being instant. It is very common though.
The solution that satisfies everyone is having a separate schema per customer and a number of database clusters. Then each customer is assigned to a particular cluster. Always make sure you have excess capacity on your pool of clusters and onboarding is still instant.
By segregating as much as you can. Definitely not by putting everything in a single table. At the very least, separate databases/schemas with proper permissions so there's no chance of data intermixing.
The best would be multiple separate database instances, which is not even hard to manage, especially for qualified engineers like Atlassian surely has plenty of. The problem is usually business decisions to ignore the tech debt...
Now every time you run a database migration, you have to adjust N copies of every table - and in Atlassian's case, N is 200000. Is that better? It depends. There is no "best" way of doing multitenancy.
Multiple schemas? You don't need every tenant in the same schema. However I'm not a DBA by trade so there might be some issue with doing this at scale that I'm unaware of.
They don't; you're responding to speculation which is just outright wrong. Jira and Confluence use single-tenant databases, unless something fundamental has changed at Atlassian in the past 4 years.
Source: worked at Atlassian, on Jira, 4 years ago.
Personally, given the multi-day outage, I think I would just restore everything to a separate system, and then only point the affected customers to this new system. Take the hit of having two fully separated systems initially, and work on reconciling them afterwards without having to worry about the ongoing outage.
I wonder if they're not doing this due to some tech limitations, to avoid taking the financial cost of running two systems, or to avoid having to reconcile the systems.
At a big multi-tenancy company I used to work at, the problem would have been the accessory machines: we had something like 15-20 different machines around the main DB and API machines, running cron jobs, terminating SSL connections, load balancing, sending alerts to us and customer emails out, etc. And while the backing up and failing over on DB and API machines was a well documented, thoroughly tested process... the other machines were all custom jobs that were very poorly documented, with who knows what scripts running on them, that might or might not be important. Trying to replicate all of that during an emergency would have been a challenge.
For just this sort of problem, we actually had three DB servers running all the time: active, passive, and an hour-behind replica, with the ability to break the hour-behind's replay of active's write-ahead log as the DBA's secret weapon for exactly this situation. If all customers had accidentally lost an hour's worth of data it would have been embarrassing, but much less so than completely shutting out hundreds of paying customers for two weeks, I think?
Not being able to selectively restore data for a subset of users might also indicate that users are not fully isolated from each other, which is worrying for technical and nontechnical reasons.
There is nothing non-technical that matters. If we start acting like it does, we make incredibly poor decisions that in fact have nothing to do with physical reality, and quickly arrive at unworkable technology.
Non-technical reasons include "legal" and "compliance", which often matters a fair bit. I am not disagreeing that non-technical requirements occasionally lead to poor decisions, for some value of poor.
I live in a state that once tried to legislate that pi = 3.15. The results were tragic, and the attempt to legislate a ratio was a failure, much like systems created by regulation and laws often are. Math is much less forgiving than legal prose. Making database decisions based on criteria that don't make any engineering sense one way or the other is not far off from legislating the value of pi.
I don’t really understand what this has to do with “monolithic” or not.
Atlassian’s software is probably very complex and convoluted but from my experience it’s almost impossible to keep a clean architecture in a software system that has grown over many years and is used and customized by many customers so you have to avoid breaking backwards compatibility.
Perhaps they don’t have the right people on hand to do hard things like this.
They also apparently lack an incident response plan, since a critical component of that is comms to affected customers.
They also lack good practices around preventing human error. It should not have even been possible to make the initial mistake. It certainly should have involved multiple steps of “are you sure” and potentially even review.
Sounds like an operations shit show. Glad it’s not my circus.
> However, if they [restore backups], while the impacted ~400 companies would get back all their data, everyone else would lose all data committed since that point
How would they lose committed data? Even after restoring the backups can't they run the logs so that everyone is caught up?
(There's a tacit assumption here that the data across tenants is commingled in tables, and that's being disputed elsewhere in the thread, but playing along..)
You wouldn't be able to do that without forcing downtime for all customers, for the duration it takes to restore the snapshot and then replay the logs. Not to mention the risk of the process failing somehow.
You could narrow the window to just the "replay" portion, if you were able to stand up an extra database/infra, to switch over to when it was ready. But at some point you'd probably still have to go read only to checkpoint the logs and begin the replay.
It's of course possible to do something more complicated here and stream the changes, then eventually enact a failover, but this would all be too complex and error-prone to introduce in their current crisis mode. It's something I'd suggest considering when architecting their DR/BCP, but it's too late for that kind of elegance (and complexity) now.
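To make the point about losing committed data concrete: assuming a PostgreSQL-style setup with periodic base backups plus continuous WAL archiving (the thread doesn't say what Atlassian actually runs), a point-in-time restore means standing the database up from the last base backup and replaying archived WAL until just before the bad deletion. The archive path and timestamp below are made up:

    # postgresql.conf on the freshly restored server (recovery.signal present)
    restore_command = 'cp /wal-archive/%f "%p"'        # hypothetical WAL archive location
    recovery_target_time = '2022-04-05 07:30:00+00'    # stop replay just before the deletion ran
    recovery_target_action = 'promote'                 # then open the database read-write

Everything the unaffected tenants committed after that target is absent from this copy, which is exactly the trade-off being described; replaying all the way to "now" would require the bad transactions to be skipped or filtered out, and that is the hard part when tenants share tables.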
Not to mention a CEO who is more interested in activities outside the company like the green energy transition and politics.
As an Aussie I always wanted Atlassian to succeed, as we have so few tech companies at that scale or larger. Now I view them as another Oracle: they innovate little, they keep ratcheting up prices, and they push deployments to the cloud where they make more money. They nickel-and-dime you for what should be core features (SAML auth?). They aren't coming up with anything new to keep the value in the ecosystem. They buy applications in, spend a little to make some cross-integration, and then drop down to a slower development cadence.
I think that Atlassian's core issues stem from the CEO who is more interested in activities outside the company. The lack of vision starts from the top and flows down through the rest of the company.
If MCB is so interested in those things, then he should do the right thing, which is to retire from Atlassian and go wholeheartedly after those noble causes.
Right. Feels similar in a way to an ongoing conflict elsewhere... There is what happens now, and what happens over the next decade because people have lost fundamental trust in you.
How do we solve this problem? In other industries based on physical products there is a big incentive to buy goods as locally as possible because of reduced shipping costs, shorter shipping times, no import taxes, etc.
But with software it costs nothing to spin up new instances, costs nothing to deliver halfway across the world, and has no delivery time. How can you convince a manager to use a software solution provided by a local company when a company in a completely different country 600 miles away offers similar software with 5 extra features?
It seems like the internet is now perfectly set up to create, for each software type, a single company that has a global monopoly.
That's OK in principle, as long as those companies function like governments (i.e. they work to improve things rather than turn a profit, subject to constitutions, public voting, judicial review). As engineers we should embrace the efficiency of scale, but it's quite clear that it can't work under capitalism.
You probably wouldn't if you weren't in the subset of customers who were affected. This wasn't a total outage; rather, it affected a group of users who had been running a legacy standalone app called "Insight – Asset Management".
People who call for other people's firings in organizations that they have no visibility into are so weird. This post reads like a tech outage's version of cancel culture, where trying to find someone to blame and skewer for an injustice is more important than actually determining how much (if any) blame they deserve for it.
Also, posting LinkedIn event photos with people's real names and pictures in a top post on HN, along with provocative framing like "partying in Vegas" about this outage, is pretty shitty. Atlassian is a big company with a huge engineering and sales department; you have no idea whether any of the people at the Vegas event had anything to do with the outage response. For example, in your link there are literally people talking about some new mobile app their team is launching at the event, and I doubt any of them are involved in the outage response.
No one posting on HN, unless they work at Atlassian in a leadership role, is in any position to even start assigning blame, calling for firings, or publicly shaming people (the latter of which you shouldn't be doing even if you knew 100% for a fact that they were at fault).
The main article is about how Atlassian have, out of nowhere, ceased business operations for 400 of their customers. They are no longer a going concern and this could happen to anyone who uses their products which is (alas) quite a lot of us.
That’s my perspective. Reading your comment, it feels like between the lines you don’t think this is as serious as other people do. If you have a different perspective, could you come out and say it? Can you elaborate on why a “Global Head Of” isn’t a senior leader? Do you think this is an unfortunate tech outage that is to be expected from any B2B tech company of this size? If you did, I wonder how many people here would disagree with you. Implicitly, the person to whom you are responding does. “Business collapse” and “Vegas party” do not look good getting caught in bed together.
Your point would come across much better if it wasn’t mixed with moral outrage. If you have alternative opinions please share them and back them up. Then you will have earned a little more of the massive amount of social capital you need to tell someone, in public and quite rudely, to do better.
Do you realize just how disconnected sales and operations are? Look at your organization chart. They connect at the CEO.
My previous company lost customer data. Someone deleted a previous employee's account which, surprise, contained a production customer website. We didn't halt sales and outreach.
I mean RSA lost all of their encryption seeds in 2011. Every account was compromised. They still do sales today.
The original commenter (OC) did not call for anyone to be fired. They stated an opinion; they did not make a demand.
The OC is directing us to information made public by Atlassian team members themselves. The OC did not make the information public. Anyone could find that information with little effort.
I’m not really sure why you attribute such intensity to the OC’s rather benign comments.
Seems like you're still assigning blame. Incidents are rarely if ever monocausal. The fantastic and accurate point the GP made is that fingerpointing is pointless. Much better to seek to learn and understand, which is always difficult but definitely can't be done from the sidelines without speaking to those involved.
Some people have an important role to play when other people mess up. The fire brigade aren't the ones who start fires but it's their job to try to help out when it happens. If they don't show up when your town catches fire, you would be disappointed.