We should probably treat this issue as something more like a disease like high blood pressure, namely, you don't know you have it, but it is probably doing irreparable damage to your internal organs. If we had no name for the disease or understanding of it we would just die at an earlier age without an obvious cause.
Let's identify the diseases of this sort in our industry and work on prevention, diagnosis and treatment of them instead of just saying you should have been more careful.
Everybody screws up from time to time in our industry. And when that happens, there are two types of people: those who try to hide it, and those like the GitLab team who communicate as fast as they can because they respect their customers. Paradoxically, to me, that creates more trust than it damages!
I feel for you greatly here, and I commend your openness about how the data restoration caused 6 hours of data loss. I too work in a critical area where even minutes of lost DB data is bad.
We just had our own test event recently. We make sure that we can fail everything over and run on all secondaries. I found out how that went: we failed. The problem is that I found out after the fact. Due to the secrecy, not even the teams involved knew why things failed the way they did. I had to piece it together from disjointed hearsay, and now I believe I have a coherent picture.
So yes, when I read your post mortem and RCA, it reminded me greatly of what happened here as well. But we all can learn from your example. As for me, I'm posting it as a throwaway due to likely threats on my job.
Thank you for sharing.
all the love and support in the world towards the team
Thanks for the support, we've received a lot of kind reactions and are very grateful for them.
All outages are blameless. It's always a process failure or lack of a proper system or tool.
And presumably even if your notes are kept private, it'll still come out in discovery if you end up in court?
What are the risks here? I'd be interested to learn more.
Those companies have huge legal departments writing iron-clad contracts and stocking a lot of very sharp knives for any such eventuality. Smaller companies make for much easier targets.
> even if your notes are kept private, it'll still come out in discovery
Those notes might have been lost by then - that is, if their existence at any given time is even known. It is perfectly reasonable for people working quickly after an outage not to write down every step or observation they make.
> What are the risks here?
Not prison, but you might be forced to pay a bunch if someone manages to get a ruling against you for negligence or the likes.
Doesn't mean it never happens.
Sure, plenty of uber-large Incs are "in the cloud", but with mission-critical (read: high-value, worth suing over) applications and data? That's a pretty small universe.
Incidents are inevitable and it's important to have a proper RCA/Service-Disruption process in place to handle those.
As mentioned on the other thread, maybe GitLab doesn't have enough operational/SRE expertise in-house yet, but that was the case in every fast-growing company I've worked for in the last decade.
I kind of hope the CEO of Gitlab isn't reading HN right now
For every backup system you have, test restores, from scratch, periodically. The more critical the data is to your business, the more frequently and more automated you want those backup checks.
Of course you also want procedures to try to prevent tired admins from deleting production databases. To the greatest degree possible, implement systems to prevent manual tampering with production data; often there's an alternative. But databases can get broken or corrupt without admin interference, so preventing manual database removal or corruption is not an ultimate solution.
Do you have DBAs? Are they completely inept at their job? Don't answer.
That's why we need awareness days.
Things that aren't tested can pretty much be counted on to fail a non-trivial percentage of the time. If it's going to be business critical, then it needs to be tested. (Lesson hopefully learned here)
By the same token, things which aren't monitored can also be counted upon to fail a non-trivial percentage of the time and even worse, go unnoticed. (Another teachable point here)
...but people being what they are tend to learn these things the hard way unless we've got concrete, real-world incidents like this for them to learn from.
This is where automated testing helps. My mail server has a sister VM out in the wild that once a day picks up the latest backup from the offsite-online backups and restores it, sending me a copy of the last message received according to the data in that backup. If I don't get the message, restoring the backup failed. If the message looks too old then making or transferring the backup failed.
My source control services and static web servers do something similar. None of the shadow copies are available to the world, though I can see them via VPN to perform manual check occasionally and if something nasty happens they are only a few firewall and routing rules away from being used as DR sites (they are slower as in their normal just-testing-the-backups operation they don't need nearly the same resources as the live copies, but slow is better than no!).
This won't catch everything of course, but it catches many things that not doing it would not. The time spent maintaining the automation (which itself could have faults of course) is time well spent if done intelligently. For a system as large in scale as GitLabs then a full restore daily is probably not practical so a more selective heuristic will need to be chosen if you are operating at such a scale. My arrangement still needs some manual checking and sometimes I'm too busy or just forget, so again it isn't perfect, but the risk of making it more clever and inviting failure that way is at this point worse than the risk of my being lazy at exactly the wrong time.
One thing my arrangement doesn't test is point-in-time restores (because sometimes the problem happened last week, so last night's backup is no use) but there is a limit to how much you can practically do.
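The freshness check the sister VM performs can be sketched in a few lines. This is a simplified stand-in, not the actual setup: it assumes the restore lands in a directory and that "the last message received" can be approximated by the newest file's timestamp; the paths and the 26-hour threshold are made up for illustration.

```python
import os
import time
from pathlib import Path

MAX_AGE_HOURS = 26  # a bit more than a day, to allow for slow transfers

def newest_mtime(restore_dir: str) -> float:
    """Return the most recent modification time among the restored files."""
    times = [p.stat().st_mtime for p in Path(restore_dir).rglob("*") if p.is_file()]
    if not times:
        raise RuntimeError("restore produced no files - the restore itself failed")
    return max(times)

def check_restore(restore_dir: str) -> str:
    """OK when the restored data contains something recent; STALE means the
    backup was made or transferred long ago, even though the restore worked."""
    age_hours = (time.time() - newest_mtime(restore_dir)) / 3600
    if age_hours > MAX_AGE_HOURS:
        return f"STALE: newest restored file is {age_hours:.1f}h old"
    return "OK"
```

The two failure modes the comment distinguishes fall out naturally: an exception means restoring failed, a STALE result means making or transferring the backup failed.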
> The problem of restoring non-existing backups should be treated as a more serious problem in our industry
It is by people that care about it, but not enough people care, and too many people see the resources needed to get it right as an expense rather than an investment for future mental health.
It isn't just non-existent backups. Any backup arrangement could be subject to corruption either through hardware fault, process fault, or deliberate action (the old "they hacked us and took out our backups too" - I really must get around to publishing my notes on soft-offline backups...).
> Let's identify the diseases of this sort in our industry
The people who care most are either naturally paranoid (like me), have lost important data at some point in the past so know the feeling first hand (thankfully not me, though in part to having a backup strategy that worked) or have had to have the difficult conversation with another party (sorry, I can't magic your data back for you, it really is gone) and watch the pitiful expressions as they beg for the impossible.
The only way to enforce the correct due diligence is to make someone responsible for it. It is more a management problem than a technical one, because the technical approaches needed pretty much all exist and for the most part are well studied and documented.
Of course to an extent you have to accept reasonable risks. It is usually not practical to do everything that could be done, and understandable human error always needs to be accounted for as do "acts of Murphy". But someone needs to be responsible for deciding what sort of risk to take (by not doing something, or doing something less ideal) rather than them just being taken by general inaction.
This is sensible, but one problem with this is developing "blindness" for messages you receive every day. For some recurring tasks you can automate a bit further. Instead of receiving a success message every day, receive a failure message the first day there's an absence of success message. A number of tools and services exist for setting this up, one is linked in my HN profile (shameless plug ;-)
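The inverted-alerting idea above (a "dead man's switch") is simple to sketch: the backup job touches a marker file after each verified run, and an independent scheduler alerts only when the marker goes stale. The marker path and threshold here are hypothetical, and the alerting itself is left to whatever channel you already have:

```python
import os
import time

MARKER = "/var/run/backup-ok.stamp"   # hypothetical marker path
THRESHOLD_HOURS = 25

def record_success(marker: str = MARKER) -> None:
    """Called by the backup job after a verified run; refreshes the marker."""
    with open(marker, "a"):
        pass
    os.utime(marker, None)

def needs_alert(marker: str = MARKER, threshold_hours: float = THRESHOLD_HOURS) -> bool:
    """Run from an independent scheduler: True when the marker is missing
    or older than the threshold, i.e. the success signal went absent."""
    try:
        age = time.time() - os.path.getmtime(marker)
    except FileNotFoundError:
        return True  # never succeeded (or marker lost) - definitely alert
    return age > threshold_hours * 3600
```

The important design choice is that the checker runs on different infrastructure than the backup job, otherwise one failure silences both.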
This can be an issue, I do get lazy about checking them when I'm busy.
The next step will be to automate checking them a bit more, rather than stopping the positive messages. A simple script that for the mail example logs in via POP/IMAP, checks for the last notification and checks the relevant timestamps. Similar for other services (is the last commit in the repo from more than 24 hours ago?). Then I get a simple "all OK" or not. I still have to run the script, perhaps hooking it to CGI and setting it to my browser home page to further remove my laziness from the equation, but I don't want to just be told when something is wrong as I'll never really trust that something going wrong won't block warning messages - I want to make checking as easy as possible though.
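The polling script described above might look something like this. The freshness test is separated out so it can be checked on its own; the IMAP part is a sketch with hypothetical host and credentials, and it assumes the newest message in the mailbox is the notification:

```python
import email.utils
import imaplib
import time

def report_is_fresh(date_header: str, max_age_hours: float = 25) -> bool:
    """Decide whether a notification's Date: value is recent enough."""
    sent = email.utils.parsedate_to_datetime(date_header)
    return (time.time() - sent.timestamp()) <= max_age_hours * 3600

def check_inbox(host: str, user: str, password: str) -> bool:
    """Log in via IMAP and check the newest message's Date header (sketch)."""
    imap = imaplib.IMAP4_SSL(host)
    try:
        imap.login(user, password)
        imap.select("INBOX", readonly=True)
        _, data = imap.search(None, "ALL")
        ids = data[0].split()
        if not ids:
            return False                      # no notifications at all
        _, parts = imap.fetch(ids[-1], "(BODY.PEEK[HEADER.FIELDS (DATE)])")
        header = parts[0][1].decode()         # "Date: Tue, 07 Feb 2017 ..."
        return report_is_fresh(header.split(":", 1)[1].strip())
    finally:
        imap.logout()
```

The commit-age check for the repos would follow the same shape: fetch one timestamp, compare against a threshold, reduce to a single OK/not-OK.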
I may play with your service and add it as a secondary set of checks though!
Now that I'm developing email stuff I get the existential crisis all the time: "how do I make sure my testing and alerting infrastructure is working?"
Really, how can you be sure your restoration tests are running correctly and their messages get all the way to you?
I think I'll add periodic positive messages for my restoration procedures. Stuff like (if success and day % 23 == 0 then email_something). Still doesn't guarantee the test is correct.
I get a daily message, so I know if things aren't getting sent or getting through. I've been wary of setting up anything that only sends problem messages, because regularly getting the "all OK" messages proves that at least that part of the infrastructure is working.
An alert message should mostly come as a surprise. One daily message is OK for me, but in that case it may be good for it to cover more than backups. A daily digest of "everything is up and running" may indeed be something very nice to have.
And how do you ensure a team will always have someone surprised not to get the daily message? Maybe send it to many people, or require an acknowledgement and raise a warning over the warning channel if there isn't one. (And you certainly aren't going to send those positive messages over the warning channel, because they require a completely different kind of action.)
> And how do you ensure a team will always have someone surprised not to get a daily message?
For a team you have a management problem not a technical one! Somebody needs to be responsible for it in the end, plain and simple.
On a less authoritarian note a technical helper might perhaps be to have a screen, always on, showing a dashboard of statuses, where everyone can see it? Make it beep incessantly when a new warning arrives until someone acknowledges it. If you acknowledge a message either deal with it (if you can) or notify whoever needs to be told so they can take action.
People will still sometimes ignore or miss the warnings. Nothing and no one is perfect.
But I'm only dealing with my own collection of personal data and projects here, so it isn't that scale of problem. At work I used to be responsible for infrastructure as well as my actual job, but we no longer operate at a scale where that is at all practical, so I have to trust that someone else is dealing with it (though I'm not that trusting - I do sometimes take time out to check things myself where I have access).
That's awesome. I've always tested my backups by having shitty, cheap servers so bad that restoring from backup happens about once every few months.
Going to an automated restoration test is a great idea, though also a lot of work for most people.
Just checking, manually or automatically that your backups are occurring and are of a reasonable size is probably sufficient for most operations and would have caught most if not every case in the GitLab instance.
The ENTIRE Saturn V rocket had to be wheeled back to the hangar and the CM dismantled and rebuilt, resulting in a month-long delay of the launch.
When the Apollo launchpad manager was asked if he had fired the technician in question, the answer was allegedly: "Nope. He is the one guy on the next launch team that I know will NEVER make the same mistake again."
I am willing to bet that Gitlab is now a company that will never slacken off their backup checking in the future.
This same general story appears in different forms with million dollar trading errors, etc. But I have to wonder if it's really that good of a lesson. If you truly take it to heart, you could potentially keep people on the team that are legitimately incompetent which could result in another catastrophic failure.
If a surgeon keeps killing patients because he keeps forgetting to wash his hands, then he needs to be fired.
If a developer skips all company policies and deploys directly to production without a good reason, then he needs to be fired.
There are reasonable mistakes, and then there are errors resulting from reckless actions indicative of a larger problem with the person's view of their work.
Make the same mistake again, you get fired.
All commercial aircraft have checklists in place, and a two person cross checking system in the cockpit, as well as mechanical means to detect potential problems - but aircraft still every now and then land with their gear still up etc.
Does it make economic sense for an organisation that has invested possibly millions of dollars in training a pilot to immediately sack him/her for a costly mistake like this? Sure, disciplinary action and possible demotion within the ranks is a given, but will you get rid of the one person who will be almost fanatical about never landing with the gear up again?
If they consistently make technical mistake and are sloppy in their airmanship, then sure, fire away. But it is going to be pretty much impossible to design a 'system' that can prevent human errors from creeping in.
As in, they think the act of generating a backup file is the last stage of the process and they are done with it. Maybe they go the extra mile and throw it in a crontab too.
What you have to do is to consider any and all backups non-existent until you have a complete backup strategy.
In other words you have to appreciate that "having a backup" is a means to an end and another way of saying "being able to successfully recover from data loss".
So just generating a file is not sufficient.
You have to complete a successful test restore before you can call it a backup.
You have to have heart-beat measures in place to make sure you will not be impacted by a silent failure (example: check $last_successful_backup >= last X hours).
You have to periodically manually check that your automated checks work ("simulate" backups not being generated and wait for alert).
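That last item - testing the test - is the step almost everyone skips. A minimal sketch of both the heart-beat check from the list and a drill that deliberately feeds it a stale timestamp to confirm the alert path would actually fire (the 24h/48h numbers are illustrative):

```python
import time

def backup_is_recent(last_success_epoch, max_hours=24, now=None):
    """Heart-beat measure: was the last successful backup within max_hours?"""
    now = time.time() if now is None else now
    return (now - last_success_epoch) <= max_hours * 3600

def alert_drill():
    """Periodic manual drill: simulate backups not being generated and
    confirm the check would raise an alert."""
    stale = time.time() - 48 * 3600          # pretend 48h with no backup
    return backup_is_recent(stale) is False  # True means the alert would fire
```

If the drill ever returns False, your monitoring is the thing that's broken, which is exactly the silent failure the list is warning about.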
Far too often people don't appreciate the depth and the weight that a phrase like "backup plan" carries.
So when you ask them to "take care of backups", they will go run pg_dump or mysqldump and say "It's done." No goddamn it, it's not done.
Even the minimum level - an untested backup to the same disk - is better than what most people have. That untested backup is there and protects against accidental file deletion, which will eventually become the test for most people. If the disk fails it is also something to work with. If you tell a disaster recovery service that there is a backup, they can use that redundancy to their advantage: odds are the physical damage isn't to both the backup and the real data, and they only have to recover one. Even if the damage is to both, there is something to work with.
As we move up the ladder there is more and more. An untested backup has data - give a team a few years and we can recover it. We might have to recreate the restore process from scratch, but there is something.
Remember though, the farther down the ladder you are, the more expensive recovery can be. If you have tested backups on 3 continents, recovery only costs a couple of hours of downtime - an actuary can put an exact dollar cost on that. If the price is too high you can invest in redundancy on the live system. If all you have is an untested backup, it might be years before your team can recover it: millions of dollars in labor to recover the data, and several years of no or reduced business while they do. (Hint: the company will go out of business because it cannot afford to recover the data.)
An unauthenticated MongoDB on the default port? The likelihood of someone port-scanning the entire web just doesn't strike people as an adverse possibility - and then, one day, someone does exactly that. I guess this is one important reason to bring at least some experienced people on board: there is a good chance they can give a better perspective on the seriousness of such issues.
I understand we "live in a different world" is the favorite motto these days. But do we really? If anything data is bigger, more complex, not in one place, you can't just ship a truck load of tapes and three people somewhere to test.
IMHO the more we try to reinvent technology the more we realize some of the things we all felt were weighing us down were actually smart ideas brought on not by fear but by real life experience.
And the pendulum will swing once more here and back again at some point in the future.
That's not to say all companies do it (it seems GitLab didn't) but the tone of your comment doesn't reflect a lot of people's experiences.
If you worked at a responsible place decades ago you might be aghast at what happens at a random sampling of companies now, but the same would have been true at a random sampling decades ago. The difference is that unless you were a customer or the outage was especially prominent you probably never would have heard about it.
I want to feel bad for GitLab, but it's really, really hard when they hire people like that.
We use it because it works well for us. We've put a lot of work into making MySQL scale for us to the point where it's a very well supported system and one of the main choices for a lot of storage decisions.
We even use MySQL as a queue for Facebook Messenger. More details about this:
I understand that there are perfectly legitimate reasons for using a database as a queue. If you frequently need to look at and rearrange your jobs while they're in flight, chances are a pure FIFO structure probably doesn't work that well for you anyways. If the enqueue is contingent on a transaction committing, probably makes a lot of sense for the job to be in the same database. You don't need to tell me -- I've seen more than a few in production.
But an actual message queue also "works perfectly fine" based on the information provided, and I would imagine that a company like Facebook probably already has a few of those lying around. It would have been a conscious choice to use MySQL as a queue.
I swear I'm actually, seriously just curious!
Everyone is talking about backups, but why not about this? How is it even possible to delete the production database by accident? Why does he have SSH access there? Why do they test their database replication in production? Why are they fire-fighting by changing random database settings _in production_?
I know that all of this is common practice. I am questioning it in general.
Because when something goes wrong he has to be able to discover what it was.
> Why do they test their database replication in production?
Many places do that. With free databases there is simply no reason to, but the practice is inherited from the best-practices arsenal of the non-free databases.
People, if you are using Postgres and have a cluster in production, make sure you also have the script that creates this cluster in a VCS, and that it is able to create an equal cluster on virtual machines on your own computer.
> Why are they fire-fighting by changing random database settings _in production_?
Oh, that's because they are fire-fighting. The problem is that when you get a problem you know nothing about, you don't really know what to replicate in another environment to reliably test things there. The most you can do is verify that your changes aren't harmful, the real test is always in production.
It takes no more than a couple of hours most of the time, and as the old saying goes, "An ounce of prevention is worth a pound of cure."
Stop making excuses and automate testing your backups.
There is no reason not to automate your backup tests, but there is no reason not to eyeball the check is actually working from time to time.
MySQL is also pretty good about not starting if the data is corrupted for whatever reason. The process not starting is a pretty obvious alarm, too.
Checks are cheap. Better to automate your logic and invest in good monitoring and automation.
Oh, how did I delete the stack? I was using the mobile app, trying to look at the status of the CFn stack, but the app was laggy and my finger pressed the wrong button... sigh. The other interesting thing is that I checked the status because the previous night I had changed my RDS instance to provisioned IOPS (which took 8 hours and failed too). I felt sad and guilty, but at the same time I felt whatever, because the upgrade didn't go through, so perhaps this accident was all meant to be...
Doubly ouch that it seems that there's no confirmation dialog with a 5-second countdown before you can hit Yes, or whatever.
E.g. forcing someone to manually type the characters D E L E T E before allowing deletion of something potentially important.
Either that or everything should have multiple levels of undo. Everything.
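The type-the-name guard is a few lines to implement. A sketch, with the prompt injectable so it can be exercised without a terminal (the resource name below is made up):

```python
def confirm_destructive(resource, prompt=input):
    """Guard rail: the operator must retype the exact resource name
    (not just hit Yes), so a laggy tap or reflex Enter aborts safely."""
    typed = prompt(f"Type '{resource}' to confirm deletion: ")
    return typed.strip() == resource

# usage sketch:
# if confirm_destructive("prod-db-cluster"):
#     delete_stack(...)
```

AWS uses exactly this pattern for some destructive console actions; the point is that confirmation requires recalling and producing the target's identity, not just acknowledging a dialog.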
So if you've got a dud client implementation then you're going to lose the check.
One way to do stuff like this is to have separate roles for read-only and read-write access. I pay a lot more attention to what I'm doing on the rare occasions I assume permissions to change things.
Make a backup, restore to test environment, run checksums, anonymize, release test environment.
That way each and every backup is tested both for integrity and ability to rebuild a working environment from it.
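The orchestration of that pipeline can be sketched abstractly. `make_backup`, `restore`, and `read_data` below are hypothetical stand-ins for your real tooling (e.g. pg_dump/pg_restore plus a table export); the checksum comparison is the part that makes it a test rather than just a restore:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def nightly_restore_drill(make_backup, restore, read_data):
    """Back up, restore into a scratch environment, compare checksums;
    afterwards the scratch environment can be anonymized and released."""
    backup = make_backup()
    restore(backup)                          # load into the test environment
    if sha256_of(read_data("live")) != sha256_of(read_data("scratch")):
        raise RuntimeError("restored data does not match the source")
    return "backup verified"
```

Anonymization would slot in after verification, before handing the environment to anyone else.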
In my practice insufficient backup is still (unfortunately) a very common occurrence.
On another note: just having stuff stored with triple replication in the cloud is emphatically NOT a backup.
And it also helps if the same people that have access to the live environment do not have write access to the backups, but that's only feasible past a certain team size.
We had a MongoDB server (with a read-replica) in our production environment (this was an ec2 instance with MongoDB).
One day, a dev accidentally deleted the main collection in the DB during a night coding session. The next morning, when we realized it, we went straight to the daily backups we had been making. It turned out that, for some reason, the backups of the previous 2 or 3 days had not worked.
We had to get into Mongo-OpLog (which was on only because of the read-replica) and reconstruct the missing 3 days from it.
That was fucking scary.
I've realized in the past that in day-to-day situations it is more likely to lose data through momentary programmer carelessness. Examples: deleting the backup version of a folder, deleting the copy on the wrong server, etc. This seems similar to what happened at GitLab.
How to protect oneself from this failure mode? Can we design a better system than to assume that every human command is well considered? (Sort of like guard rails, when purging backups)
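One concrete guard rail is to make deletion a two-phase operation: "delete" moves the item into a trash area, and a separate, scheduled purge only removes items older than a grace period. A sketch, with a made-up one-week grace period and a timestamp-prefix naming scheme of my own invention:

```python
import time
from pathlib import Path

GRACE_SECONDS = 7 * 24 * 3600   # a week to notice and undo mistakes

def soft_delete(path: Path, trash: Path) -> Path:
    """'rm' becomes a move into a trash directory, so a careless command
    stays reversible until the grace period expires."""
    trash.mkdir(parents=True, exist_ok=True)
    target = trash / f"{int(time.time())}_{path.name}"
    path.rename(target)
    return target

def purge_expired(trash: Path, now=None) -> int:
    """Separate scheduled step: only items past the grace period go away."""
    now = time.time() if now is None else now
    removed = 0
    for item in trash.iterdir():
        stamp = int(item.name.split("_", 1)[0])
        if now - stamp > GRACE_SECONDS:
            item.unlink()
            removed += 1
    return removed
```

The key property is that no single human command causes irreversible loss; the irreversible part happens later, automatically, only after a window in which the mistake could have been noticed.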
The CEO should laugh with us a bit and be proud to inspire a day be named after their hiccup, thanks to their transparency.
I'm totally compassionate but we should never lose our sense of humour!
I have seen far too often one-man miracle teams swimming in technical debt, constantly solving problems but never having the time to play the political games needed to push for the kinds of changes they need. Obviously small operations setups are different, but, for example, I've seen a ~250-person, 6-branch business run on 1 senior and 1 part-time junior sysadmin, while the senior's requests for budget and personnel were constantly denied; he said his backup system worked, but he knew it wasn't as good as he could make it. He eventually quit in frustration. He was a great sysadmin but didn't play enough politics, and therefore he failed and his management failed him, all the while jeopardizing the business. Please don't do this to your sysadmin.
CTOs and CIOs, please take a moment to ask your sysadmin what things they need they haven't been able to convince you of yet, and see if you can compromise or otherwise try to lend their arguments importance.
In all but the leanest of SV web startup land, sysadmins are the backbone that keeps your company running. Don't neglect or forget them.
If you do, one day that backup may fail, or a cryptovariant will hit the server, and although you will scapegoat your sysadmin, it will have truly been your fault.
It's one thing to make a backup; quite another to test the backup's restorability.
Most of us back up. Very few test that the backup works as intended. I need to do better on the latter.
Look, shit happens... we don't need to make fun of people for it. We all cut corners at times... when we are lucky, nobody notices. When we aren't...
Of course, backing a whole platform up is more complex, and things like databases normally require custom scripting (dump -> backup dump, e.g. pg_dumpall | borg create ... -).
Ouch. I can't even imagine how that feels. This is why even despite monitoring and paging scripts, I still have an event to check my company's backups weekly. Now I don't feel so paranoid.
No hiding, no euphemisms - their live doc stream actually made me question what I do on my own systems. It looks like convergent evolution in some parts, like the prompt changes.
My Time Machine backups have in the past been missing gigabytes of data, without telling me anything about it. And not just volatile data like temp files or caches, but photos, music and documents.
I think easy, good server-level backup software would be great. The problem is that enterprise servers are usually highly customized, part of a large, unique architecture, and contain a lot of proprietary, confidential, and potentially personal, legally protected data. That makes it a lot harder to get a one-size-fits-all backup solution set up, which means that the onus of reliable backups will, of necessity, rest upon the company's administrators.
It is very sad when no one checks backups. This bites companies every day and it's usually easy to sympathize, but there's no excuse for it. GitLab needs to perform a serious review of its processes.
I've checked out the GitLab job listings that get published on HN Who's Hiring and other places regularly (I'm currently 100% remote with my current employer and like to track other 100% remote employment opportunities). They have a salary calculator/estimator and personally, I was really underwhelmed with the values it would put out. That calculator makes a city price index adjustment from the base salary and contains a statement that says GitLab prefers to hire people who live in less expensive cities. I also remember feeling that their interview process sounded a little overbearing.
It may be time for GitLab to consider upping the ante on its recruitment procedures and adding some more experienced people to the ranks.
You might end up disappointed if you do.
ALWAYS CHECK!