Hacker News new | past | comments | ask | show | jobs | submit login
GitLab Database Incident – Live Report (docs.google.com)
1162 points by sbuttgereit on Feb 1, 2017 | hide | past | favorite | 598 comments

The public report is nice and we can see a sequence of mishaps from it, that shouldn't have been allowed to happen but which (unfortunately) are not that uncommon. I've done my share of mistakes, I know what's like to be in emergency mode and too tired to think straight, so I'm going to refrain from criticizing individual actions.

What I'm going to criticize is the excess of transparency:

You absolutely DO NOT publish postmortems referencing actions by NAMED individuals, EVER.

From reading the whole report it's clear that the group is at fault, not a single individual. But most people won't read the whole thing, even less people will try to understand the whole picture. That's why failures are always attributed publicly to the whole team, and actions by individuals are handled internally only.

And they're making it even worse by livestreaming the thing! It's like having your boss looking over your shoulder but a million times worse...

I myself initially added my name to the document in various parts, this was later changed to just initials. I specifically told my colleagues it was OK to keep it in the document. I have no problems taking responsibility for mistakes, and making sure they don't happen ever again.

I think perhaps, you want to not do this in the future.

Incident reports are about focusing on the "what" and "when" not the "who". This is not about taking responsibility (you don't need to be published on the internet to do that) and you can always have a follow up post after the incident report has been published as a "what I learned during incident X".

While it's great you're OK with publishing your name, you've now set a precedent that says it's OK to do this to other developers. A blanket policy on keeping names out of the incident report protects others who may not be as willing to get their name of HN (as well as not having to make amendments or retractions if the initial assumptions are incorrect). It also keeps a sense of professionalism as it's clear that no blame is being assigned. I know that you guys are not assigning blame, but if I was to show this to someone outside of this discussion, they'd assume that it was a fingerpointing exercise, which does not reflect well on Gitlab.

I think you're blowing it out of proportion. If you showed it to someone and they told you they assumed it was about fingerpo...Look, it's not that big a deal. They decided to do it, not everything is a blame game.

Woah there! I think you may have misread the parent as it looks like some friendly advice to me (with actual reasons and stuff), rather than the "You shouldn't have done that! You've destroyed your company!!!!" you seem to have read it as.

Heck, they didn't even say to retract anything from the report, just maybe to leave adding names to things until a later date in future incidents.

I'm not sure impugning their professionalism qualifies as friendly advice.

Again, you've put your own tone on things.

"It also keeps a sense of professionalism as it's clear that no blame is being assigned" is not the same as "you guys acted unprofessionally!". It's letting the GitLab guys know there's a potential problem with the communication style at that point in the story.

I find it funny that in a comments section full of comments about allowing a frank learning experience you're being so down on someone giving tips to consider learning from.

That's awesome, but why publicize it? This isn't an act of contrition for you, no one outside your team really needs to see your dirty laundry, and actually comes off as unprofessional to me. The gitlab team is a team, and you take responsibility as a team. Placing names and initials in the liveblog makes it look like SOMEONE is trying to assign and pass off blame, even if that is not what is happening.

Presumably in the coming days there will be a number of team meetings where you discover what went wrong, and what the action items are for everyone moving forward. The public looking info just needs to say what went wrong, how it is being fixed, and what will be done in the future to prevent it from happening again. I don't need names to get that.

On the contrary, it comes off as very professional. All other companies would hide this, they would show off a very cleaned up post-mortem and say "problem solved" and that's it. Ok so what does that mean, does it mean the process will change for the future or that they just fixed it for today?

This is also an awesome advert to see how they work remotely all together and I'm sure they're hiring for DevOps people now ;)

Naming individuals is not professional. Even allowing it with permission does not set a good standard for operation.

I agree the transparency was probably appreciated by most users. I also think the human factor should be considered not just the technical aspect. Here's my take on the issue... https://forgetfulprogrammer.wordpress.com/2017/02/04/key-lea...

There are quite a few comments on this very thread about how this creates trust, not ruins it.

Thinking that there's a "right" and a "wrong" way (or an "unprofessional"/"dirty laundry" situation) is quite contrarian to the Gitlab transparency model and their culture. If you don't like their culture, don't work there.

IMO the idea of secrecy equating to professionalism is _the_ problem with many things. "Information wants to be free." It's also more personable, especially to those who use their product - to me, it shows they're on top of it, they care and are taking responsibility. Gives you a sense like you're part of the team (or they part of yours).

Keeping names out of incident reports isn't about secrecy. There's nothing stopping the folks at Gitlab posting up a retrospective blog post. Incident reports are formal documents published to let users and customers know what's going on. The names can come later, if all parties are OK with it.

And they initially put in only initials and the developer posted in this thread only after it was published.

You're assuming the only reason for wanting to do it would be as contrition, but it sounds like that's not the reason here. Possibly the GitLab team cares about transparency to the extent that they simply prefer to be transparent.

It's not transparent to name people; it's unnecessary and betrays a misunderstanding of how to handle an incident.

I don't know why people on hacker news are against transparency. I'm glad you guys are live streaming this, others would feel too inadequate to do so. Being this transparent only makes me want to use (and contribute to) gitlab even more.

I'm guessing they feel strongly about getting singled out if something like this would happen to them. Possibly because they have been used as a scapegoat by a employer or team mate once.


> I don't know why people on hacker news are against transparency.

> Being this transparent only makes me want to use (and contribute to) gitlab even more.

I hope you'll be there when someone doesn't hire the person responsible for their mistake so you can vouch for them.

You don't have radical transparency because the world is not the understanding meritocracy you think it is. There is no value to the employee for having radical transparency in a post mortem.

I don't know why people on hacker news are against transparency.

It appears to be only a couplefew people on HN who have a real problem with it, so maybe temper your generalizations in a way that they aren't.

That's courageous but this wasn't your mistake, it was the CEO's mistake. They owe you a vacation and apology for putting you in that terrible position!

I'm not sure if it's the CEO's mistake, or any specific individual's mistake for that matter. In this particular case many different problems coalesced, producing one giant problem. If it wasn't me, somebody else would have eventually made the same mistake; perhaps with even greater impact.

Any blame that would be generalized to the company as a whole is also specifically the CEO's fault. The buck has to stop somewhere. That is part of the deal for the big chair/title/paycheck/expense account (at any company, not just Gitlab or in SV).

Exactly. The CEO clearly didn't properly weigh the impact of a failure like this on the reputation of the company. GitLab will lose many customers over this. Unfortunately choices like calling in (or allowing) the original engineer to work on recovery, not protecting the engineer from mistakes of the organization, etc. don't signal recognition of the required corrective action. That corrective action requires the CEO to take full responsibility, protect individual employees from process-driven failure, study the cultural aspects that allowed the failure to occur, etc.

I agree with the CEO being responsible. As I mentioned 15 hours ago in this HN post in https://news.ycombinator.com/item?id=13537245 "the blame is with all of us, starting with me. Not with the person in the arena."

I think you're right that it wasn't a mistake of anyone below the CEO position, but I'm certain that it was a huge mistake on the CEO's part. The customers and employees deserve a huge apology from the CEO. I'll be shocked if without this realization on the leader's part the board doesn't replace them.

Anybody whose opinion matters understands that this type of event is a process problem, not a person problem.

GitLab has always blazed their own trail with their transparency, whether through their open run books, open source code, or in this case their open problem resolution. Kudos to them in whatever manner they want to do it in (with or without names).

To be honest, through all of the comments, yours seems the most high-strung, and you're the one complaining about high-pressure situations like having your boss looking over your shoulder. Relax buddy. :)

In a few years the guy doing the `rm -rf` is going to be on a job interview and someone will recall bits of this report. Enough bits to remember the guy, not enough bits to remember that it wasn't his (individual) fault.

Transparency doesn't mean publicly throwing people under the bus.

I'm not a GitLab customer, I'm relaxed. :)

Honestly, if I were interviewing the guy, that would almost be a bonus! Like, everyone makes mistakes, we're all human, but I can guarantee you that THAT person will never make that particular mistake ever again. And he's going to be 10 times more diligent than the average engineer in making sure there are good backup/restore procedures.

There's a probably apocryphal story like this about a guy forgetting to refuel a plane. The pilot made sure that guy was solely responsible for refuelling his plane in future, because he knew he'd never forget again.


I run backups on my computer before installing new software/fiddling with important settings/etc. because I've fucked up before.

I'll run backups of phones (or at least verify that they are present) before trying to fix issues on them after nuking my mom's phone which resulted in her losing pictures of my niece and nephew. (Luckily she had sent a lot of those pictures to us via e-mail, but still).

We learn and adjust.

I think that might have been in Chuck Yeager's (auto-?) biography. (Great read, BTW.)

I remember reading the story in "How to Win Friends and Influence People". The plane was filled with jet fuel and had to do an emergency landing, but the pilot was able to save everyone. When he got back he told the mechanic that he wanted him to fill his plane the next day because he knew that he'd never make the same mistake again.

If the pilot can forgive someone for a mistake that almost cost lives, I'm sure any good interviewer can forgive him for a mistake that cost data and will probably never be repeated.

I've heard this anecdote before and it never sat well with me. Forgetting to fuel a plane as a plane mechanic exposes a serious character flaw that could lead to something devastating if allowed to continue (perhaps next time he forgets to oil the engine? Grease the brakes?). Sensationalizing this story could actually do alot of harm. The plane mechanic should have been fired for failing such an important task. If he showed incredible remorse and was responsible enough to own up to his mistakes, he should still have been striped of all his other responsibilities and only fuel planes until he has proven himself enough to take on more responsibilities again.

When people are afraid to loose their jobs if they make an error you can be pretty sure they will do everything in their power to hide the fact that they made an error, which is the exact opposite of the behavior you want. To allow process improvements it must be absolutely clear that errors will not be punished, but used to help everyone to learn.

The JAL 2 mishap is legendary in the aviation world. Learning from mistakes is a big part of aviation safety


The Captain basically got up before the NTSB and when asked what happened, he responded "I F__ked Up!" instead of trying to deflect blame onto an unforeseen system glitch or other excuse. Its since been known as the "Asoh Defense"

They also have the NASA ASRS for reporting near misses, and incidents without fear of FAA enforcement.


It must be coupled with processes that guard against errors though. Defense in depth. I'd imagine the pilot has a tick sheet to go over before takeoff and fuel is an item on that sheet.

You imagine right. It's on every checklist: check fuel quantity and type.

I think you highly underestimate the number of mistakes like this on the flight line.

By an order of magnitude it sounds like from your comment. Even if you get 99.99% reliability (good luck with humans involved) think of the number of flight movements per day multiplied by the number of tasks that must be completed.

This is why there are redundant checks and checklists and systems in place. To catch human errors, as absolutely everyone in the business will eventually make a trivial yet critical mistake.

Demanding individual human perfection is great, but you'll find you will end up with no workforce.

The aviation industry recognizes and accepts that people make mistakes, and that this is a simple fact of being human. Firing that mechanic without fixing the process wouldn't have done any good in the long run. Someone else would just make the same mistake. Maybe not the following week, maybe not the following month, but eventually, it would've happened again. The right answer is to fix the process.

Agreed. Point in case, the recent death of nearly the entire Chapecoense football team:

> According to the preliminary report, several decisions of the flight crew were incompatible with aviation regulations and rendered the flight unsafe. Insufficient flight planning (disregarding necessary fuel stops) and not declaring an emergency when the fuel neared exhaustion caused the crash.


That guy is going to be interviewing at some company with someone who's obsessive enough about outage reports to remember a then-obscure one years later, but enough of an idiot to not understand that people aren't personally to blame for this sort of stuff?

Sounds like even in that very contrived scenario the guy involved would dodge a bullet in not being hired by a bunch of idiots.

"I worked at GitLab."

Googles name + GitLab, finds postmortem

Highly likely, and now you don't get to tell your own story and emphasize what you want to.

Awsome postmortem -is there any thing you would do differently today?

What's your most valuable lesson from that incident?

You're hired!

also, maybe some people on here are perfect, but if you've used Unix for more than half your life (as I have) you've 'rm -rf'-ed some stuff.

I think people who've been through disasters have a much better understanding of the importance and methods of not ending up there than those with a perfectly clean record.

IOW, I'd hire the "rm -rf" guy first if he owns it.

I'd rather hire someone who learned what not to do than someone who hasn't yet.

Years ago I worked for a University. We lost power in our data centre. No big deal right. Stuff comes back up, you realize which service dependencies you missed, set them to run at startup, change some VM dependency startup order and you're good.

One of the SAN arrays didn't come up, and then started rebuilding itself. Our storage was one of those multi-million dollar contracts from IBM. They flew a guy out to the University and after a lot of work, they said the array was lost and unrecoverable.

Backups for production for some VMs were on virtual tape .. on the same shelves as production. O_o

At least a lot of our clusters were split between racks, so in many cases we could just clone another one. We learned that MS BizSpark, in a cluster, only puts the private key on half the machines. We had to recreate a bunch of BizSpark jobs based off what we could still see in the database and our old notes and password vaults. We had been planning on upgrading to a newer version of BizSpark on a Server 2012 (it was on 2003), so this kinda forced us to. Shortly afterwards we learned how to make powershell scripts to backup those jobs and test the backups by redeploying them to lower environments.

The sys admin over the backups was looking for a new job. You can't really fire people from universities easily, because it's very difficult to find IT staff that will take university wages. Word was out though, if he didn't find new work, he was going to be let go. Not laid off, made redundant, or have his position removed. He would be fired.

When we interview people one of the questions we like to ask is "What's the biggest thing you've accidentally deleted?"

When people answer that question honestly and with humility it is a big plus.

Might be a plus for your organization, and it might be devised as a trap by another. It's like the biggest weakness question.

Oh, we can't hire someone who has made a mistake THAT big.

You don't want to work for a company that has that attitude anyway, honestly. That shows they have a poor attitude towards problems and probably will overreact to things like missing deadlines or pursuing a solution that ends up not working, etc.

You'll never know if it was a great company with a bad interviewer. It's better to use all the advantages you can to get through an interview and get the job to see for yourself. I don't think you can learn anything definitive from most interviews - they're mostly subjective, unscientific voodoo.

knowing exactly how a potential employee handles an error he might have caused? This guy is going to be fighting off job offers, if he hasn't already been.

Good! I'd like to talk about what the engineer learned from the experience. Certainly if trawling through someone's public repos and records turns up a pattern of repeated mistakes, that should be considered - but the mistakes we all make from time to time are chances to learn.

So what I'd be interested in seeing is if the candidate did learn. The mistake is less important than the candidate demonstrating they moved past it as a stronger developer.

On the flip side - given a choice in situation, I'd prefer not to work for a place that dredges up my old bugs and uses them in isolation as a basis for their decision. That suggests the kind of environment I wouldn't enjoy being in.

> Anybody whose opinion matters understands that this type of event is a process problem, not a person problem.

That how the world should be. Not how it is.

Yes, someone with hiring/firing ability might blame the individual, and you could claim "Oh, you shouldn't listen to them, they're an idiot". But that's not much comfort if you're out of a job and gonna be kicked out of your house. In that situtation, the idiot with hiring/firing power matters to your life a lot.

I completely agree. Trying to get low level details out to the public while in the heat of the issue is a misstep; you can still be transparent while not risking over communication that could haunt you later.

While I think most of the HN audience understands that some days you have a bad day and that sometimes very small actions, like typing a single command at a prompt, can have dramatic consequences in technology, there are nonetheless less enlightened souls in hiring positions that simply might find fragments of this in a search on the name when that time comes.

Being too transparent could also encourage legal problems, too, if someone decides that they had a material loss over this, at least for the company. Terms of service or likelihood of a challenge prevailing doesn't necessarily matter: you can be sued for any reason and since there's no loser pay provision in any U.S. jurisdiction that I know of, even a win in court could be very costly. Being overly transparent in a case like this can bolster a claim of gross negligence (justified or not) and the law/courts/judges/juries cannot be relied upon to be consistently rational or properly informed.

Part of the problem is that this isn't actually a postmortem: they're basic live blogging/streaming in real time. What would be helpful for us (users) and them (GitLab) in terms of real-time transparency:

* Acknowledge there were problem during a maintenance and data may/may not have been lost. * If some data is known to be safe: what data that is. * What stage are we at. Still figuring it out? Waiting for backups to restore? Verification? * Broad -estimated- time to recovery: make clear it's a guess. Even coarsely: days away, 10's of hours away, etc. * When to expect the next public update on the issue.

None of this needs to be very detailed and likely shouldn't include actual technical detail. It just needs to be honest, forthright, and timely. That meets the transparency test while also protecting employees and the company.

Later, when there is time for thoughtful consideration, a technical postmortem at a good level of detail is completely appropriate.

[edit for clarity]

At one company I worked for we had a saying: "You're not one of the team until you've brought down the network. "

We all mess up. Much respect to gitlab for being open about.

That's the kind of team I want! All hands on deck. No lame responsibility shifters.

Yep, I'd feel awful if I were the employee in this headline "GitLab Goes Down After Employee Deletes the Wrong Folder" [0].

It's the process and the team who are at fault.

[0] https://www.bleepingcomputer.com/news/hardware/gitlab-goes-d...

Modern companies tend to have people in roles and teams. What I've done in postmortems is to use role names and team names, not person's names. Even if the team is just one person. This helps keep it about the team and the process. We're all professionals doing our best and striving for continuous improvement.

Maybe these individuals don't mind, it could just be a cultural thing.

Many have echo'ed you, but I agree.

The person who made the error is just the straw that broke the camels back. I'm sure these folks knew that they needed to prioritize their backups but other things kept getting in the way. You don't throw people under the bus.

Am I missing something? Where in this report are any individuals actually named? My understanding was that they're using initials in place of names specifically because they want to _avoid_ naming anyone.

The original versions of the document had names. Those were later replaced with initials.

I think the issue was in part that this document didn't appear to be a public "here's what's going on doc" as much as it was a doc they seemed to be using as a focal point for their own coordination efforts.

I'm a huge Gitlab fan. But I long ago lost faith in their ability to run a production service at scale.

Nothing important of mine is allowed to live exclusively on Gitlab.com.

It seems like they are just growing too fast for their level of investment in their production environment.

One of the only reasons I was comfortable using Gitlab.com in the first place was because I knew I could migrate off it without too much disruption if I needed to (yay open source!). Which I ended up forced to do on short notice when their CI system became unusable for people who use their own runners (overloaded system + an architecture which uses a database as a queue. ouch.).

Which put an end to what seemed like constant performance issues. It was overdue, and made me sleep well about things like backups :).

A while back one of their database clusters went into split brain mode, which I could tell as an outsider pretty quickly... but for those on the inside, it took them a while before they figured it out. My tweet on the subject ended up helping document when the problem had started.

If they are going to continue offering Gitlab.com I think they need to seriously invest in their talent. Even with highly skilled folks doing things efficiently, at some point you just need more people to keep up with all the things that need to be done. I know it's a hard skillset to recruit for - us devopish types are both quite costly and quite rare - but I think operating the service as they do today seriously tarnishes the Gitlab brand.

I don't like writing things like this because I know it can be hard to hear/demoralizing. But it's genuine feedback that, taken in the kind spirit is intended, will hopefully be helpful to the Gitlab team.

Hey Daniel, I want to thank you for your candid feedback. Rest assured that this sort of thing makes it back to the team and is truly appreciated no matter how harsh it is.

You're absolutely right -- we need to do better. We're aware of several issues related to the .com service, mostly focused on reliability and speed, and have prioritized these issues this quarter. The site is down so I can't link directly, but here's a link to a cached version of the issue where we're discussing all of this if you'd like to chime in once things are back up: https://webcache.googleusercontent.com/search?q=cache:YgzBJm...

I'm running a remote-only company and we moved to GitLab.com last summer from cloud hosted trac+git/svn combo (xp-dev). The reason we picked GitLab.com was because the stack is awesome and Trac is showing its age. We also wanted a solution that could be ran on premises if needed. We spent about a month migrating stuff over to GitLab from Trac. Once we were settled the reliability issues started to show. We were hoping that these would be quickly sorted out given the fact that the pace of the development with the UI and features was quite speedy.

A sales rep reached out and I told him we would be happy to pay if that's required to be able to use the cloud hosted version reliably but I got no response. Certainly we could host GitLab EE or CE on our own but this is what we wanted to avoid and leave it to those who know it best. xp-dev never ever had any downtime longer than 10 minutes that we actively used during the last 6 years. I'm still paying them so that I can search older projects as the response time is instant while gitlab takes more than 10 seconds to search.

Besides the slow response times and frequent 500 and timeout errors that we got accustomed to, gitlab.com displays the notorious "Deploy in progress" message every other day for over 20-30 minutes preventing us from working. I really hoped that 6-7 months would be enough time to sort these problems out but it only seems to be worsening and this incident kinda makes it more apparent that there are more serious architectural issues, i.e. the whole thing running on one single postgresql instance that can't be restored in one day.

We have one gitlab issue on gitlab.com to create automated backups of all our projects so that we could migrate to our own hosted instance (or perhaps github) but afair gitlab.com does not support exporting the issues. This currently locks us into gitlab.com.

On one hand I'm grateful to you guys because of the great service as we haven't paid a penny, on the other hand I feel that it was a big mistake picking gitlab.com since we could be paying GitHub and be productive instead of watching your twitter feed for a day waiting for the postresql database to be restored. If anyone can offer a paid hosted gitlab service that we could escape to, I'd be curious to hear about.

Meant to mention this earlier: Gitlab self-hosted actually has a built-in importer to import projects from Gitlab.com - including issues.

It's mostly worked reliably in my experience (it's only failed to import one project across the various times I've used it, and I didn't bother debugging because for that import we really only needed the git data).

Ping me and we'd be happy to discuss hosted Gitlab for you.

I'm a bit curious here. Do you think that your issues with scalability and reliability have to do with your tech choice (I think it was Ruby on Rails)? Don't want to bash Rails, I'm just genuinely curious, since I come from a Rails background as well and have seen issues similar to yours in the past.

It's not just the tech stack, but a combination of the technical choices made and with the human procedures behind them. We're actively pushing towards getting everybody to focus on scalability, but there's still a lot of debt to take care of.

You can check out their codebase here: https://github.com/gitlabhq/gitlabhq

Just looking at their gemfile is rather telling: a couple hundred gems. I've always felt that if you're going above 100, you should carefully consider how much your codebase is trying to achieve.

They're probably at the point where they really want to think about splitting off of their monolith codebase and into microservices.

Yeah, given how their ops situation is, I don't think that would be a good idea.

Maybe it's because I'm familiar with almost all of the gems, but I don't see anything wrong with their Gemfile. It's a pretty complex project, and they really do have a ton of integrations and features that need those gems.

There's probably a few small libraries that they could have rewritten in a few files (never a few lines), but what's the point? The version is locked, and code can always be forked if they need to make changes (or contribute fixes).

> (never a few lines)

You'd be surprised what you can do by carefully considering what the desired outcome actually needs to be.

Maybe there is justification for all the gems in gitlab's Gemfile, I didn't go through it with a fine tooth comb - but this reaffirms my experience that complex projects outgrow monolith codebases. Having an infrastructure outage take down your entire business is kind of a symptom of that.

> I've always felt that if you're going above 100, you should carefully consider how much your codebase is trying to achieve.

This is a mindset issue. Some communities reject NIH so strongly that you get the opposite problem that everything depends on hundreds of different developers. Gitlab can start some library forks with more stuff integrated, or change communities. Microservices is something that can't help, as all the dependencies will stay just where they are (Gitlab is already uncoupled to some extent).

But, anyway, most of those are stable¹, and I doubt many of Gitlab problems come from dependencies.

1 - They are unbelievably stable for somebody coming from the Python world. When I first installed Gitlab, I couldn't believe on how easy it was to get a compatible set of versions.

I see the opposite of NIH especially in the RoR/Ruby world and I don't think it's always a good thing. Developers reach for a library for one piece of functionality in a discrete area of the codebase when they could have achieved the same functionality with a few lines of code. That's not automatically NIH, that's being pragmatic about the dependencies you're bringing in and are going to need to support moving forward.

It is fairly large, but I still find it more organized than some examples I've seen.

Also, I don't see another very common issue with big gemfiles in that they don't seem to have multiple solutions of one thing in there (ie multiple REST clients, DB mockers, etc).

I've considered setting up gitlab locally, and have a couple of students that are trying to set it up on a vps. Customizing their bundle installer is... an interesting learning experience in managing complex * nix servers.

I think it's telling that their standard offering/suggestion for self-hosters is as complex as it is. While on the one hand I applaud the poor soul that maintains the script that tries to orchestrate five(?) services on a general, random, unix/linux server without any knowledge/assumption on what other things are running there -- it unsurprisingly falls over in "interesting" ways when you try to do radical stuff like install it on a server that runs another copy of nginx with various vhosts etc.

Now, running services like gitlab at "Internet scale" is far from trivial - but running it at "office scale" should be.

I fully understand how gitlab ended up where they are - but ideally, the self-host version should just need to be pointed at a postgresql instance, and be more or less a "gem install gitlab" -- or similar away - popping up with some ruby web-server on a high port on localhost -- and come with a five-line "sites"-config for nginx and apache for setting up a proxy.

I really don't mean to complain - it's great that they try to provide an install that is "production ready" -- but if the installer reflects the spirit of how they manage nodes on the gitlab.com side -- I'm surprised they manage to do any updates at all with little down-time...

For now I'm running gogs - and it seems to be more of a "devops" developed package - where deployment/life-cycle has been part of the design/development from the start. Single binary, single configuration file. Easily slips in behind and plays well with simple http proxy setups.

At some point I'll find a day or two to migrate our small install to gitlab (we could use the end-user usability and features) -- but I know I'll need to have some time for it. Time to migrate, time to test the install, time to test disaster-recovery/reinstall from backup... all those steps are slowed down and become more complex when the stack is complex.

(I'll probably end up letting gitlab have a dedicated lxc container, although I'll probably at least try to figure out how to reliably use an external postgres db -- it pains me to "bundle" a full fledged RDBMS. These things are the original "service daemons", along with network attached storage and auth/authz (LDAP/AD etc)).

LOL. GitHub is also a RoR shop.

It might be. I'm not saying it's impossible to scale Rails. It's just very, very hard. Github can do this, because they probably get the best of the best engineers. They even used to have their own, patched Ruby version.

Not everyone can afford that.

Why do you question Rails while the entire report is about Postgres only ?

And as someone working on one of the biggest and oldest Rails codebase out there, I can tell you that in term of scaling, Rails is the least of our concerns.

Sure it's not as efficient, so it's gonna cost you more in CPU and RAM, but it's trivial to scale horizontally. The real worry are the databases, they are fundamentally harder to scale without tradeoffs.

As for the patched Ruby, we used to have one too (but our patches landed upstream so now we run vanilla). It's not about allowing to scale at all. It's simply that once you reach a certain scale, it's profitable to pay a few engineers to improve Ruby's efficiency. If you have 500 app servers, a 1 or 2% performance gain will save enough to pay those engineers salary.

Depending on hundreds of gems means you are depending on the decisions of hundreds of developers with packages which are in constant churn.

Apps like Gitlab and Discourse that depend on hundreds of gems and require end users to have complex build environment and compile software are I think operating a broken user hostile model.

The potential for compilation failures, version mismatches and Ruby oddities like RVM is so gigantic with hundreds of man hours wasted one is left to conclude they may actually want to run a hosting business and not have users deploy themselves.

Compare that to Go or even PHP where things are so orders of magnitude simper that it is not even the same thing. To deal with this complexity you now have containers but have you solved the complexity or added another layer of complexity? There are technical but I think also social factors at play here.

Regardless of wether I agree or disagree with your critique, it has absolutely no relevance in the context of the current outage.

You don't like Ruby / Rails we get it. But that's totally out of topic.

I don't think it's that. GitLab IS a complex setup and Rails is not helping making it simple. There is a ton going on in the stack and the company only has limited resources.

It's not hard to scale a Rails server, when compared to other frameworks and languages. It's exactly the same as scaling a server written in Java, Node.js, Python, or any other language. You just spin up more machines and put them behind a load balancer.

Yes, Ruby is slower than other programming languages, but this usually doesn't matter. If you are charging people to use your software, or even if you are serving ads, you will always be making money before you need a second server. Plus, Rails is super productive, so you'll be able to build your product much faster.

I'm not sure why GitHub used a patched Ruby version, but no, that's not necessary.

Having said all of that, I'm moving towards Elixir and Phoenix. Not just because of the performance, but also because I really like the language and framework.

Nah this is just about having a robust backup system

I have searched the gitlab website and repositories looking for processes and procedures addressing change management, release management, incident management or really anything. I have found work instructions but no processes or procedures. Until you develop and enforce some appropriate processes and the resulting procedures I'm afraid you will never be able to deliver and maintain an enterprise level service.

Hopefully this will be the learning experience which allows you to place an emphasis on these things going forward and don't fall into the trap of thinking formal processes and procedures are somehow incongruent with speedy time to market, technological innovation or in conflict with DevOps.

Like you, I would like to add my 2 cent, which I hope will be taken positively, as I would like to see them provide healthy competition for GitHub for years to come.

Since GitLab is so transparent about everything, from their marketing/sales/feature proposals/technical issues/etc., they make it glaringly obvious, from time to time, that they lack very fundamental core skills, to do things right/well. In my opinion, they really need to focus on recruiting top talent, with domain expertise.

They (GitLab) need to convince those that would work for Microsoft or GitHub, to work for GitLab. With their current hiring strategy, they are getting capable employees, but they are not getting employees that can help solidify their place online (gitlab.com) and in Enterprise. The fact that they were so nonchalant about running bare metal and talking about implementing features, that they have no basic understanding of, clearly shows the need for better technical guidance.

They really should focus on creating jobs that pays $200,000+ a year, regardless of living location, to attract the best talent from around the world. Getting 3-6 top talent, that can help steer the company in the right direction, can make all the difference in the long run.

GitLab right now, is building a great company to help address low hanging fruit problems, but not a team that can truly compete with GitHub, Atlassian, and Microsoft in the long run. Once the low hanging fruit problems have been addressed, people are going to expect more from Git hosting and this is where Atlassian, GitHub, Microsoft and others that have top talent/domain expertise, will have the advantage.

Let this setback be a vicious reminder that you truly get what you pay for and that it's not too late to build a better team for the future.

> They really should focus on creating jobs that pays $200,000+ a year, regardless of living location

For those who haven't been following along, Gitlab's compensation policy is pretty much intentionally designed to not pay people to live in SF. It's a somewhat reasonable strategy for an all remote company. But they seem to have some pretty ambitious plans that may not be compatible with operating a physical plant.

> pretty ambitious plans

I would point you to some very ambitious feature proposals on their issue tracker, but I can't for obvious reasons. I think GitLab is at a cross roads and this setback might be the eye opener they need. Moving forward, they really need to re-evaluate how they develop and evolve GitLab. For both online and Enterprise.

This idea of releasing early and on the 22nd works very well for low hanging fruits problems, but not for the more ambitious plans they have. If they understood the complexity for some of the more ambitious plans, they would know they are looking at, at least a year of R&D to create an MVP.

I think it makes sense to keep doing the release on the 22nd, but they also need to start building out teams that can focus on solving more complex problems that can take months or possibly a year to see fruition. Git hosting has reached a point, where differentiating factors can be easily copied and duplicated, so you are going to need something more substantive, to set yourself apart from the rest. And this is where I think Microsoft may have the upper hand in the future.

> I think GitLab is at a cross roads and this setback might be the eye opener they need. Moving forward, they really need to re-evaluate how they develop and evolve GitLab.

Judging by their about team[1] page, they are currently short an Infrastructure Director. When you read their job listings, even for DBAs and SREs, it' all "scale up and improve performance." Very little "improve uptime, fight outages." One assumes it's upper management approving the job descriptions, so the missing emphasis on uptime, and redundancy probably pervades the culture. And again, judging by the team profile, they've hired very few DBA / SRE experts, and instead appear to have assigned Ruby developers to the tasks.

Perhaps they simply have to bet the farm on scaling much larger to sustain the entire firm, which is troubling for enterprise customers, and for teams like mine running a private instance of the open source product. Should probably review the changelog podcast interview[2] with the CEO and see if any quotes have new meaning after today.

[1]: https://about.gitlab.com/team/ [2]: https://changelog.com/podcast/103

What is Microsoft doing in this space? I honestly don't know, so not trying to be a jerk.

> Gitlab's compensation policy is pretty much intentionally designed to not pay people to live in SF.

What do you mean? They pay people in SF much more than in other cities because the high cost of living. I'd consider working for Gitlab if I would live in SF, living in Berlin it's not an option.

Look, I love GitLab. Gitlab was there for me when both my son and I got cancer, and they were more than fair with me when I needed to get healthy and planned to return to work. I have nothing but high praises for Sid and the Ops team.

With that said, I'll agree that the salary realities for GitLab employees are far below the base salary that was expected for a senior level DevOps person. I've got about 10 years experience in the space, and the salary was around $60K less than what I had been making at my previous job. I took the Job at GitLab because I believe in the product, believe in the team, and believe in what Gitlab could become...

With that said, starting from Day 1, we were limited by an Azure infrastructure that didn't want to give us Disk iops, legacy code and build processes that made automation difficult at times, and a culture that proclaimed openness, but, didn't really seem to be that open. Some of the moves that they've made (Openshift, rolling their own infrastructure, etc) have been moves in the right direction, but, they still haven't solved the underlying stability issues -- and these are issues that are a marathon, not a sprint. They've been saying that the performance, stability, and reliability of gitlab.com is a priority -- and it has been since 2014 -- but, adding complexity of the application isn't helping: if I were engineering management, I'd take two or three releases and just focus on .com. Rewrite code. Focus on pages that return in longer than 4 seconds and rewrite them. When you've got all of that, work on getting that down to three seconds. Make gitlab so that you can run it on a small DO droplet for a team of one or two people. Include LE support out of the box. Work on getting rid of the omnibus insanity. Caching should be a first class citizen in the Gitlab ecosystem.

I still believe in Gitlab. I still believe in the Leadership team. Hell, if Sid came to me today and said, "Hey, we really need your technical expertise here, could you help us out," I'd do so in a heartbeat -- because I want to see GitLab succeed (because we need to have quality open source alternatives to Jira, SourceForge Enterprise Edition, and others).

Not trying to be combative, but, "You truly get what you pay for" seems a little vindictive here -- the one thing that I wish they would have done was be open with the salary from the beginning -- but, Sid made it very clear that the offer that he would give me was going to be "considerably less" than what I was making.

> They really should focus on creating jobs that pays $200,000+ a year, regardless of living location, to attract the best talent from around the world. Getting 3-6 top talent, that can help steer the company in the right direction, can make all the difference in the long run.

SIGN ME UP! That would be a freaking great opportunity!!

> SIGN ME UP! That would be a freaking great opportunity!!

I think you asking for the job, might be a signal, that you are not who they are looking for :-)

Yup - top talent is already making more. Gitlab needs to recruit with purpose (this is what we're doing and why), environment (remote first, transparency, etc), and pay (we can match 70% of what you'd get at XYZ Company). Right now, it feels like they're capped at 30-50% of what someone could make at a big org, which is just a drop in salary most people would never take, regardless of the company values/purpose.

One alternate idea would be to hire consultants on a temporary basis. You may not be able to pay $250k a year, but you could pay a one time $40k fee to review the architecture and come up with prioritized strategy for disaster recovery and scalability.

Why would they try to recruit from Microsoft? Most of the software engineers at Microsoft are not focused on developing scalable web services architectures. And the ones that do have built up all of their expertise with Microsoft technologies (.net running on Windows server talking to mssql).

>Microsoft and others that have top talent/domain expertise, will have the advantage.

Again, Microsoft isn't even in this same field (git hosting) or if they are, are effectively irrelevant due to little market/mindshare. Are you an employee there or something?

> Most of the software engineers at Microsoft are not focused on developing scalable web services architectures.

Uh, MS literally runs Azure, which may not be the biggest IAAS offering, but is certainly vastly larger and more complex than Gitlab. There are certainly numerous engineers at MS who would have experience relevant to Gitlab (though perhaps not with their particular tech stack). It may not be most of the engineers there, but in a company with literally tens of thousands of engineers, there are few things that will be true of most of them.

> Microsoft isn't even in this same field (git hosting)

How is what they're hosting at all relevant to the problem at hand? This could have happened regardless of what the end product was - it's a database issue. In fact, the git infrastructure was explicitly not involved in this issue - it was only their DB-backed features that had data loss.

Additionally, Microsoft is in the business of git hosting, if only tangentially. TFS supports git, and has since 2013: https://blogs.msdn.microsoft.com/mvpawardprogram/2013/11/13/... Your objection is both unkind and factually incorrect. The "mindshare" comment is a bit silly - even though they may not be as active on forums like HN, developers working on MS technologies are still one of the largest groups in programming (as a non-MS developer looking for work in the Pacific Northwest, this is something I'm constantly reminded of). I doubt your estimate of Microsoft's real mindshare is anything close to accurate.

> Are you an employee there or something?

This accusation is eminently not in the spirit of HN, and Microsoft was hardly the only company he mentioned. Whatever your personal vendetta against them, it's absurd to think that Microsoft is not one of the top pools of talent in tech - they're a huge company with a vast variety of offerings and divisions.

> Why would they try to recruit from Microsoft?

I'm not sure if you read my post correctly, but I never mentioned poaching from Microsoft. I said compete for programmers that would choose to work for Microsoft. I'm also not sure if you understand what Microsoft does. It's a very diverse company with R&D spending that rivals some small nations.

> Microsoft isn't even in this same field (git hosting

I guess you haven't heard of https://www.visualstudio.com/team-services/ and their on premise TFS solution that supports Git.

Microsoft understands Enterprise and it's quite obvious they want to be a major provider for Git hosting. It will be foolish to believe Microsoft is not focused on owning the Git mindshare in Enterprise.

> Are you an employee there or something

No. Just somebody that understands this problem space.

One of the main drivers of revenue for Microsoft is Office 365, with 23.1 million subscribers[0]. Along with Azure, MS runs some of the largest web services around. Most developers at MS don't necessarily work on these products, but to say that all the devs working on them use a simple .NET stack + SQL Server is discrediting a lot of work that they do.

Disclaimer: I work for Microsoft in the Office division and opinions are my own

[0] https://www.microsoft.com/en-us/Investor/earnings/FY-2016-Q4...

>I work for Microsoft in the Office division

Hey there, honest question incoming. Any chances of you chaps making Word a better documentation tool in the future? Edit history storing formatting and data changes on the same tree is making it impossible to use Word for anything serious. This really comes to light once you start working at an MS tech company on documentation, where it is obvious that you should use MS products for work. Some tech writers I know just end up using separate technology branches for their group efforts, since neither Sharepoint nor Word is a professional tool for this job.

Hotmail, MSN, Skype, msdn.com, microsoft.com, the Windows Update Servers, Azure.

Microsoft has a ton of people with experience in building cloud system, either in-house people or people from aquisitions.

Microsoft has so many employees and domains of activity that you can probably find an engineer for any domain you're looking for.

>us devopish types are both quite costly and quite rare - but I think operating the service as they do today seriously tarnishes the Gitlab brand.

The sad thing is it doesn't have to be this way. Software stacks and sysadmin is out there for the learning, but due to the incentives of moving jobs every two years, nobody wants to invest to make those people, we all know we'l find /someone/ to do it anyways.

I think they are running to catch up on the gitlab system itself, let alone running it as a production service. The bugs in the last few months have been epic. Backups not working, merge requests broken, chrome users seeing bugs, chaotic support. Basically their qa and release processes are not remotely enterprise ready.

If I understand correctly, the public Gitlab is similar to what you can get with a private Gitlab instance. That makes me wonder, instead of trying to scale the one platform up, would it be OK to spin up a second public silo? I mean yeah, it would be a different silo, but for something free I'd say "meh".

I think it's totally fine admitting when you've stopped being able to scale up, and need to start scaling out.

They could, and as a stopgap measure that might work, but..

(1) Some of the collaboration features (e.g. work on Gitlab itself) depend on having everyone on the same instance.

(2) Gitlab.com gives them a nice dogfood-esque environment for what it's like to actually operate Gitlab at scale. If they are having problems scaling it, then potentially so are their customers. Fixing the root cause is usually a good thing and is often an imperative to avoid being drowned in technical debt.

(3) It moves the problem in some respects. Modern devops techniques mostly allow the number of like servers to be largely irrelevant, but still.. the more unique instances of Gitlab, the more overhead there will be managing those instances (and figuring out which projects/people go on which instances).

It's a simple approach which I'm sure would work, but it also means a bunch of new problems are introduced which don't currently exist.

There is a free version and paid version.

The one they offer for free is the paid version.

You can run your own, but you won't have every feature unless you pay.

>1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage

>2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

>3. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.

>4. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers. The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented

>5. Our backups to S3 apparently don’t work either: the bucket is empty

>So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.

Sounds like it was only a matter of time before something like this happened. How could so many systems be not working and no one notice?

What if I told you all of society is held together by duct tape? If you're surprised that startups cut corners you're in for a rude awakening. I'm frequently amazed anything works at all.

Startups only, you say?

Everything, everywhere, is held together by ducttape!

True. I watched a 60 year old manufacturing plant shut down for 7 days once because someone saved an excel spreadsheet in the wrong format and it deleted a macro written more than a decade prior that held together all of their operations planning.

At my current project we have components which are literally called `duct-tape` and `glue`.

Reminds me of this:

> Websites that are glorified shopping carts with maybe three dynamic pages are maintained by teams of people around the clock, because the truth is everything is breaking all the time, everywhere, for everyone. Right now someone who works for Facebook is getting tens of thousands of error messages and frantically trying to find the problem before the whole charade collapses. There's a team at a Google office that hasn't slept in three days. Somewhere there's a database programmer surrounded by empty Mountain Dew bottles whose husband thinks she's dead. And if these people stop, the world burns.


And if you think that paragraphs does not apply to your applications / company, ask yourself if that's really true. My company sends around incident statistics and there's always some shit that broke. Always.

The real question is what holds together duct tape?

Duct tape is like the force. It has a light side and a dark side, and it holds the universe together.

Good intentions and heartfelt prayer.

Duct tape is fractal

The dark side.

Bailing wire.


I don't think a lot of things work at all. We who live in developed countries are the minority really. Most things are simply non-functional and downright ugly. Such complex systems the world has. And now apparently even the developed worlds are in for some troubles. The peace and progress in the last few decades just really isn't the norm in human history.

That said, one can't deny that there are indeed things that do work, and work very well, and people who make that happen, and one can always be amazed/inspired by those. There are good things as well as haphazard things. It's just that the latter generally outnumber the former in many settings. It doesn't necessarily imply a sweeping statement about everything though.

There's a big difference between holding together things with duct tape, and literally taking zero backups of it.

I'm not sure that there is.

It's called the Fundamental Failure-Mode Theorem - "Complex systems usually operate in an error mode". https://en.wikipedia.org/wiki/Systemantics has more rules from the book. It's worth the read.

if #2 is correct, holy shit did gitlab get lucky someone snapshotted 6 hours before.

Dear you: it's not a backup until you've (1) backed up, (2) pushed to external media / s3; (3) redownloaded and verified the checksum; (4) restored back into a throwaway; (5) verified whatever is supposed to be there is, in fact, there, and (6) alerted if anything went wrong. Lots of people say this, and it's because the people saying this, me included, learned the hard way. You can shortcut the really painful learning process by scripting the above.

Do you have to download the entire backup or is a test backup using the same flow acceptable? I'm thinking about my personal backups, and I don't know if I have the time or space to try the full thing.

For plain files, pick your risk tolerance.

For DB backups, until you've actually loaded it back into the DB, recovered the tables, and tested a couple rows are bit identical to the source, it's a hope of a backup not a backup. Things like weird character set encodings can cause issues here.

No way to do a partial recovery?

If time and space for the full thing are an issue, it could be really important to get going after an incident to be able to recover the most important bits first.

Need to save this comment.

I'm guessing they all worked at some point in time, but they failed to set up any sort of monitoring to verify the state of their infrastructure over time.

failing silently

I really wish all the applications I use had an option to never do that.

Systems failing one by one and nobody caring about them is not uncommon at all. See https://en.wikipedia.org/wiki/Bhopal_disaster (search for "safety devices" in the article).

everything is broken, and we are usually late to notice due to the infrequency of happy-path divergence.

If you're a sys admin long enough, it will eventually happen to you that you'll execute a destructive command on the wrong machine. I'm fortunate that it happened to me very early in my career, and I made two changes in how I work at the suggestion of a wiser SA.

1) Before executing a destructive command, pause. Take your hands off the keyboard and perform a mental check that you're executing the right command on the right machine. I was explicitly told to literally sit on my hands while doing this check, and for a long time I did so. Now I just remove my hands from the keyboard and lower them to my side while re-considering my action.

2) Make your production shells visually distinct. I setup staging machine shells with a yellow prompt and production shells with a red prompt, with full hostname in the prompt. You can also color your terminal window background. Or use a routine such as: production terminal windows are always on the right of the screen. Close/hide all windows that aren't relevant to the production task at hand. It should always be obvious what machine you're executing a commmand on and especially whether it is production. (edit: I see this is in outage the remeditation steps.)

One last thing: I try never to run 'rm -rf /some/dir' straight out. I'll almost always rename the directory and create a new directory. I don't remove the old directory till I confirm everything is working as expected. Really, 'rm -rf' should trigger red-alerts in your brain, especially if a glob is involved, no matter if you're running it in production or anywhere else. DANGER WILL ROBINSON plays in my brain every time.

Lastly, I'm sorry for your loss. I've been there, it sucks.


YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.

Also, as a safety net, sometimes you don't need to run `rm -rf` (a command which should always be prefaced with 5 minutes of contemplation on a production system). In this case, `rmdir` would have been much safer, as it errors on non-empty directories.

Or use `mv x x.bak` when `rmdir` fails

Or even instead of any kind of rm command. mv is less subtle. I tend to prefer

    `mv x $(date +%Y%m%d_%s_)x`

    %Y - 4 digit year
    %m - 2 digit month
    %d - 2 digit day
    _  - underscore literal
    %s - linux timestamp (seconds since epoch)
This ensures that the versions you're 'removing' will be lexically sorted from newest to oldest in a way that is easy to interpret and also works if you need to try more than once in a day.

In case it's not apparent to some, this command moves the directory (or file) called

to something like

Then when you're all done (i.e. production is humming and passing tests, no need to ever rush), you can delete the timestamped version, or if space is plentiful, move the old file to an archive directory as an extra redundancy (i.e. as an extra backup, not in lieu of a more thorough backup policy).

This needs to be the first thing anyone who works with stateful systems learns. NEVER rm. mv is insufficient. mv dir dir.bak.`date +%s` has prevented data loss for me several times.

I agree with what you're saying, and this is almost exactly what I do, but when disk space is limited - particularly during time-sensitive situations - this advice isn't very useful. For example, if a host or service is on the cusp of crashing because of a partition quickly filling up with rolling logs, what do you do since mv doesn't actually solve the problem?

At some level you have to run an rm, and you better hope you do it right in the middle of an emergency with people breathing over your shoulder.

In an ideal world, this wouldn't ever happen, but it does. Inherited/legacy systems suck.

You shouldn't let yourself get to that point. Your alerting system should alert you when disk is at 70% or something with a ton of margin. If it's not set up that way, go stop what you're doing and fix that. (Seriously.) If you're running your systems so they usually run at 90% disk usage, go give them more disk (or rotate logs sooner).

And even assuming all that fails and I'm in that situation where I have seconds until the disk hits 100%, I would much rather the service crashes than make a mistake and delete something critical.

If someone is breathing over your shoulder, you can even enlist them to co-pilot what you're doing. Even if they're not technical enough to understand, talking at them what you're about to do will help you spot mistakes.

If you lose a shoe while running across a highway, it's probably not worth the risk trying to get it back.

The disk usage is just one example of many. Actions that you take really depends on what field you're in and, and again, legacy/inherited systems are completely filled with this sort of shit.. You can say that I shouldn't let get it to that point, but you're kinda dismissing the point I'm trying to make that - shit can and will happen, and you need to know how to deal with shit on your toes. There are times when you have to do things that would make most people flip there shit. There are ways to mitigate the risk in emergency scenarios as you say, but when the risk is actually worth it, you tend to do Bad Things because there's no other option.

In my case, it was in HFT where I inherited the infrastructure from a JS developer who inherited it from a devops engineer who inherited it from a linux engineer who inherited it from another linux engineer. It was a complete shitshow that I was dropped into mostly on my own with little warning. To make matters worse, each maintenance period was 45 minutes at 4:15pm and weekends. Even worse, if a server went down at 5:00pm, the company immediately lost about 35k - which was the same for if the trading software went down. When I asked for additional hardware to do testing on, I was told that there wasn't a budget for it. The saving grace there was that there was 23:15h of time to plan during downtime, so an `rm -rf /` would have had nearly identical long term impact as a `kill -9` on the application server.

Mind you, the owners of the company were some of the smartest and most technical folk I've ever worked with and were surprisingly trusting in my ability to manage the infrastructure. The company no longer exists, and not without reason.

Just to show the lunacy of the infrastructure - they had their DNS servers hosted on VMs that required DNS to start. About a month after I joined, we had a power failure. You can imagine how that went..

(That all said, it was the greatest learning experience I've ever had. Burned me out a tad, though.)

I don't mean to sound flippant, but if I walked into the situation you describe, I would immediately walk right back out. That is a a situation set up from the start for failure, and no way would I care to be responsible for it.

I certainly get that we inherit less-than-ideal systems from time to time; I've been there. But I've also learned that every time I get paged in the middle of the night, it's my failure, whether for a lack of an early-warning system, or for doing a bad up-front job of building self-healing into my systems. If I inherit a system that I can tell is going to wake me up at night, I refuse to be responsible for it in an on-call capacity until I've mitigated those problems.

There seems to be this weird thing in the dev/ops world where it's somehow courageous to be woken up at 3am to heroically fix the system and save the company. I've been that guy, and I'm sick of it. It's not heroic: it's a sign of a lack of professionalism leading up to that point. Make your systems more reliable, and make them continue to chug along in the face of failure, without human interaction. If you have management that doesn't support that approach, make them support it, or walk out. Developers and operations people are in high enough demand right now in most markets that there will be another company that would love to have you, hopefully with more respect for your off-duty time.

I once tried to help a company with similar infrastructure insanity recover from a massive failure. Absolutely brutal.

When my team finally got services up and running (barely) after ~18 hours of non-stop work, the CTO demanded that we not go home and get some sleep until everything was exactly as it had been before the failure.

Not my happiest day.

> You shouldn't let yourself get to that point. Your alerting system should alert you when disk is at 70% or something with a ton of margin.

What would your recommendation be?

Unfortunately the answer is "it depends on the application". I tend to run stuff with even higher margins: I never expect more than 30-40% disk utilization. Yes, it's more expensive, but I value my (and my colleagues') sleep more.

But it's all just about measurement. Run your application with a production workload and see how large the logs are that it generates within defined time intervals. Either add disk or reduce logging volume until you're happy with your margins. (Logging is often overlooked as something you need to design, just like you design the rest of your application.)

Log rotation should be a combination of size- and time-based. You probably want to only keep X days of logs in general, but also put a cap on size. If you're on the JVM, logback, for example, lets you do this: if you tell it "keep the last 14 log files and cap each log file at 250MB", then you know what the max disk usage for logging will be.

If you can do it, use an asynchronous logging library that can fail without causing the application to fail. If your app is all CPU and network I/O, there's no reason why it needs disk space to function properly. If you can afford it, use some form of log aggregation that ships logs off-host. Yes, you've in some ways just moved the problem elsewhere, but it's easier to solve that once in one service (your log aggregation system) than in every individual service.

If your app does require disk space to function properly, then of course it's a bit harder, and protecting against disk-full failures will require you to have intimate knowledge of what it needs disk for, and what the write patterns are.

It's never going to be perfect. Just a 100% uptime is, over a longer time scale, unachievable, you're never going to eliminate every single thing that can get you paged in the middle of the night. But if you can reduce it to that one-in-a-million event, your time on-call can really be peaceful. And when you do get that page, look really hard at why you got paged, and see what you can do to ensure that particular thing doesn't require human intervention to fix in the future. You may decide to cost of doing so isn't worth the time, and that getting woken up once every X days/weeks/months/whatever is fine. But make that your choice; don't leave it up to chance.

I'm curious, too. Proper disk space management and monitoring is probably the most difficult problem I know of in the ops field. I haven't seen anybody do it in a way that prevents 3am wakeup calls or a 24/7 ops team.

For example, a 3am network blip that causes the application server (still logging in DEBUG from the last outage) to fill up its log partition while it can't communicate to some service nobody monitors anymore. Not sure how you'd solve that one.

> For example, a 3am network blip that causes the application server (still logging in DEBUG from the last outage)

Nope. Don't do that. Infra should be immutable. If you need to bring up a debug instance to gather data, that's fine, but shut it down when you're done. If you don't, and it causes an issue, you know who to blame for that.

> to fill up its log partition while it can't communicate to some service nobody monitors anymore.

Sane log rotation policies (both time- and size-based) solves this. If you tell your logging system "keep 14 old log files and never let any single log file to grow above 250MB", then you know the upper bound on the space your application will ever use to log.

Also, why are you not monitoring logs on this service. If it's spewing "ERROR Can't talk to service foo" into its log file, why aren't you being alerted on that well before the disk fills up?

> ... nobody monitors anymore.

Nope. Not allowed. Fix that problem too. Unmonitored services aren't allowed in the production environment, ever.

I've heard (and given) all the excuses for this, but no, stop that. You're a professional. Do things professionally. When management tells you to skimp on monitoring and failure handling in order to meet a ship date, you push back. If they override you, you refuse on-call duty for that service. (Or you just ignore their override and do your job properly anyway.) If they threaten to fire you, you quit and find a company that has respect for your off-duty time. Good devs & ops people are in high enough demand these days that you shouldn't be unemployed for long.

We just switched over to centralized logging two years ago. All hosts are configured to only keep small logfiles and rolling them over every few megabytes. Filling up the centralized logging is nearly impossible when good monitoring is done and used diskspace is never over 50%.

Btw. using an orchestration platform simplifies many of those aspects of "one node is going rough and I've to do accidentally something stupid".

Monitor the rate at which the disk is filling up, and extrapolate that to when it will hit 90%. If that time is outside business hours, alert early. If current time is not in business hours, alert later if possible.

How does this help in situations where something rogue starts filling the disk? The idea makes sense in theory, but in practice, it doesn't work out that well. Ops work is significantly harder than many devs think..

> Ops work is significantly harder than many devs think

No, it's not (I've done both). Ops is about process, and risk analysis and mitigation. Yes, there's always the possibility that something can go rogue and start filling your disk. That shouldn't be remotely common, though, if you've built your systems properly.

To this I would add: always pass "-i" to any interactive use of mv. This has also helped me prevent data loss a few times.

date -Im is shorter and easier to remember (though not 100% standard, IIRC). ;)

Example: 2006-08-14T02:34:56-06:00

You can "brew install coreutils" and then use "gdate " on macOS.

These days, I've been very implicit in how I run rm. To the extent that I don't do rm -rf or rmdir (edit: immediately), but in separate lines as something like:

  pushd dir ; find . -type f -ls | less ; find . -type f -exec rm '{}' \; ; popd ; rm -rf dir
It takes a lot longer to do, but I've seen and made enough mistakes over the years that the forced extra time spent feels necessary. It's worked pretty well so far -- knock knock.


  find ... -delete
avoids any potential shell escaping weirdness and saves you a fork() per file.

This seems to be the best here. As a side note: if someone does something more complicated and uses piping find output to xargs, there are very important arguments to find and xargs to delimit names with binary zero -- -print0 and -0 respectively.

Very interesting article: https://www.dwheeler.com/essays/fixing-unix-linux-filenames.....

I've been writing an `sh`-based tool to check up on my local Git repos, and it uses \0-delimited paths and a lot of `find -print0` + `xargs -0`:


I admit the code can look a little weird, but it was because I had some rather tight contrainst: 1 file, all filenames `\0` separated internally and just POSIX `sh`. I still wanted to reuse code and properly quote variables inside `xargs` invocations (because `sh` does not support `\0`-separated read's), so I ended up having to basically paste function definitions into strings and use some fairly expansive quotation sequences.

Nice plug for gitlab ;).

\0 is an insanely useful separator for this sort of thing and yeah, it definitely gets messy. I'm working on a similar project that uses clojure/chef to read proc files in a way that causes as little overhead as possible. \0 makes life so much easier used. The best example I can think of off of the top of my head is something similar to:

  bash -c "export FOO=1 ; export BAR=2 && cat /proc/self/environ | tr '\0' '\n' | egrep 'FOO|BAR'"

I was so freaked out at the news, I normally have local backups of my projects but I just happened to be in the middle of a migration where my code was just on Gitlab, and then they went down... Luckily it all turned out OK.

\0 is very useful but I really wish for an updated POSIX sh standard with first-class \0 support.

On your code, why do you replace \0's with newlines? egrep has the -z flag which makes it accept \0-separated input. A potential downside to it is that it automatically also enables the -Z flag (output with \0 separator).

I solved the "caller might use messy newline-separated data"-problem by having an off-by-default flag that makes all input and output \0-separated; this is handled with a function called 'arguments_or_stdin' (which does conversion to the internal \0-separated streams) and 'output_list' (which outputs a list either \0- or \n-separated depending on the flag).

Good advice.

I would add a step where you dump the output of find (after filtering) into a textfile, so you have a record of exactly what you deleted. Especially when deleting files recursively based on a regular expression that extra step is very worthwhile.

It's also a good practice to rename instead of delete whenever possible. Rename first, and the next day when you're fresh walk through the list of files you've renamed and only then nuke them for good.

I am actually curious what user they were logged in as and what permissions were in effect.

Unfortunately, the answer most places is that the diagnostic account (as opposed to the corrective action account) is fully privileged (or worse, root).

From a comment in the doc ("YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN"), I assume they're running as a regular user, sudo'ing as necessary.

A regular user that can use `sudo`.

Alternatively, if you have the luxury - `zfs snapshot`.

Having a snapshotting storage system (NetApp) once saved a lot of pain when I accidentally deleted the wrong virtual machine disk from an internal server (hit the system disk instead of a removed data disk) I was able to recover the root disk from a snapshot and bring up the machine in less than an hour.

Snapshots are not a backup strategy, but they make me sleep better at night regardless.

lsof -> check if a process is accessing that file/directory.

Good lesson on making command prompts on machines always tell you exactly what machine you're working on.

I like to color code my terminal. Production systems are always red. Dev are blue/green. Staging is yellow.

All of my non-production machines have emojis in PS1 somewhere. It sounds ridiculous, but I know that if I see a cheeseburger or a burrito I'm not about to completely mess everything up. Silly terminal = silly data that I can obliterate.

I think I'd rather make the production systems stand out, and add the emojis there. My prompts have a red background, but emoji prompts just tickle me, somehow.

I've been color-coding my PS1 for years, but this is seriously brilliant, thanks!

I got the idea of an emoji prompty that is happy or sad depending on the $? value from somebody here a few years ago.

I run it on any dev environment since then.

ha! I do the same thing with figlet and cowsay, it prints a big dragon saying "welcome to shell!", if I see that, then I know I'm on a box I own/have sudo/am me. it's a good visual reminder. I don't fuss with prompts much, but this is a pretty good idea!

It seems Gitlab has noticed your comment.

Recovery item 3f currently says:

> Create issue to change terminal PS1 format/colours to make it clear whether you’re using production or staging (red production, yellow staging)

I use iterm2's "badging" to set a large text badge on the terminal of the name of the system as part of my SSH-into-ec2-systems alias:

    i2-badge ()
      printf "\e]1337;SetBadgeFormat=%s\a" $(echo -n "$1" | base64)
It's not quite as good as having a separate terminal theme, but then I haven't been able to use that feature properly. :(

This is a great idea! Documentation https://www.iterm2.com/documentation-badges.html -- supports emojis too fwiw

Didn't know about this feature, gonna have to use this thanks!

In this case it looks like it has been a confusion between two different replicated Production databases. So this would not have helped.

Yep, good idea. The same thing has been suggested by team members http://imgur.com/a/TPt7O

I do this too, but in this case both machines were production, so this alone would not have sufficed. The system-default prompts on the other hand are universally garbage.

I have a user and a local admin account at work on our Windows SOE. I have made my PC's cmd.exe show green on black in large font, but intentionally did not change the local admin. When I run a command prompt in admin it's visible different. This is a great tip and it's caught me many times.

If using iterm you can set a "badge" text on a terminal window that shows up as an overlay. Super useful when you have lots of SSH sessions open to different servers.

iterm badges have saved me many times.

Is a really good idea, and is one of the improvements that are likely to be put in place as soon as possible. Its already listed on the document.

I've done this exact thing my my servers. Also like GL I prepend PRODUCTION and STAGING on PS1.

I should probably make the PRODUCTION flash just in case.

We need a <blink> tag for our PS1.

Good news! Enable ANSI in your terminal and: ESC[5m


You can do that - I actually tried it and everyone on my team agreed that it was bloody annoying, so I made it stop blinking (still red, though).

Not as serious, but I always set root terminals on my system to be this scary red.

How do you go about colour coding your terminal?

I assume he color coded the prompt. You can use ANSI color escape codes in there to e.g. color your hostname.

Here's a generator for Bash: http://bashrcgenerator.com/, the prompt's format string is stored in the $PS1 variable.

My dev is yellow, staging orange, and live red

How exactly do you color code it?

for example, in .bashrc on osx:

   RED=$(tput setaf 1)
   NORMAL=$(tput sgr0)
   PS1="\[${RED}\]PROD \[${NORMAL}\]\W\$ "
produces prompt

PROD ~$ <-- prod in red, directory ~

'db1' vs. 'db2' is still insufficiently clear, though. Even better would be e.g. to name development systems after planets and production systems after superheroes. Very few people would mistake 'superman' for either 'green-lantern' or 'pluto,' but it's really easy to mistake 'sfnypudb13' for 'sfnydudb13.'

Or skip the middleman: prod-db1, prod-db2, dev-db1, dev-db2.

Visually there's not a whole lot of difference. Ideally, you want something where the shape of each name is reasonably distinct from all the rest — otherwise folks will just ignore that odd blot before the prompt.

And then, once you've started naming things as $ENV-$TYPE, someone will want to cram in the location, and the OS, and the team which maintains it, and the customer, and and and. Then someone will reduce all of those identifiers into single characters … and you'll be in the situation I mentioned. Clearly clpudb1 is CharlesCorp's first London production database system!

I strongly argue against those types of abbreviations in naming, for exactly the points you make.

Here's how I view it:

The distinction between prod and dev is pretty clear cut. The words "dev" and "prod" have significantly different shapes and are immediately unambiguous. There's no need to remember a superhero vs astronomical object distinction.

As the number of database server instances grows, you've got another level of naming issues that arises when needing to distinguish between hosts and db server instances—and perhaps even database/schema independence if necessary. Staying consistent and easy-to-use with an arbitrary naming scheme becomes increasingly unwieldy, in my opinion.

Additionally, I have taken up a habit of not abbreviating "production" and "staging", whenever possible. The fact that len("production") > len("staging") > len("dev") is a feature when you find yourself typing it into a terminal or db shell.

At work we have an ascii-art, most-of-80x40-filling version of our logo rendered in an environment-associated color on login + matching prompt. The ascii-art logo might not have helped in this case (if it was a long sequence of console interactions), but It does catch "db1" vs "db2" typos, for instance, and also elicits a certain reverence upon connection.

This doesn't really help if there are multiple production databases. It could be sharded, replicated, multi-tenant, etc.

Why would it matter? In my last job we had user home directories synced via puppet (I am overly simplifying this) which enabled any ops guy to have same set of shell and vim configuration settings on production machines too.

I daresay - having hostname as part of prompt saves lot of trouble.

Having the hostname on the prompt is a good idea, but I don't think it would have helped with this process failure.

I work at a company where we have hundreds of database machines. Running this kind of command _anywhere_ without some kind of plan would be foolish. (It's one of the reasons why we have a datastore team that handles database administration.)

But the same lesson applies to application servers as well. Don't run deleterious commands out of curiosity. Have a peer-reviewed roll plan to act on when doing things like this. A role plan would have called for verifying the host before running the command.

But even before that, the issue should have been investigated more!

All of these things contributed to the failure. There should ideally be better ownership through dedicated roles, peer-reviewed processes for dangerous activities, and a better process for investigation that does not involve deleting things haphazardly.

Uh no! Don't rely on command prompt, there are hardcoded ones out there, and clonining scripts have duplicated them.

uname -n

Takes seconds.

colour is more noticeable than words. I do both, though.

Also a good lesson for testing your availability and disaster recovery measures for effectiveness.

Far, far too many companies get production going and then just check to see that certain things "completed successfully" or didn't throw an overt alert in terms of their safety nets.

Just because things seem to be working doesn't mean they are or that they are working in a way that is recoverable.

I'm not sure there is a "late" at night or tired in the incident report. All times are UTC and it's unclear where all the team is located but if in SF then this is at 4pm which is when the incident just occurred. It doesn't necessarily change that you shouldn't be firefighting for extremely long cycles in hero-mode for that long, but not exactly the same as exhausted and powering away for hours.

Gitlab's team is worldwide. YP seems to be someone in EU time, which right now is UTC+1.

That would mean that the incident happened around 11pm or midnight in YP's local time.

Correct, I'm based in Europe/Amsterdam so this happened mostly during the evening.

Actually, there is a brief comment:

"At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden."

I audibly gasped when I hit this part of the document.

I think I can count on one hand the number of times I've run an rm command on a production server. I'll move it at worst, and only delete anything if I'm critically low on disk space. But even then I don't even like typing those characters if I can avoid it, regardless of if I'm running as root or a normal user.

> Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.

In these situations, I always keep the following xkcd in mind: https://xkcd.com/349/

Just a thought, but is it possible to override destructive commands to confirm the hostname by typing it before running the command?

I remember when we accidentally deleted our customers' data. That was the worst feeling I ever had running our business. It was about 4% of our entire storage set and had to let our customers know and start the resyncs. Those first 12 hours of panic were physically and emotionally debilitating - more than they have the right to be. I learned an important lesson that day: Business is business and personal is personal. I remember it like it was yesterday, the momement I conciously decided I would no longer allow business operations determine my physical health (stress level, heart rate, sleep schedule).

For what it's worth, it was a lesson worth learning despite what seemed like catastrophic world-ending circumstances.

We survived, and GitLab will too. GitLab has been an extraordinary service since the beginning. Even if their repos were to get wiped (which seems not to be the case), I'd still continue supporting them (after I re-up'd from my local repos). I appreciate their transparency and hope that they can turn this situation into a positive lesson in the long run.

Best of luck to GitLab sysops and don't forget to get some sleep and relax.

I had a great manager a little while back who said they had an expression in Spain:

> "The person who washes the dishes is the one who breaks them."

Not, like, all the time. But sometimes. If you don't have one of these under your belt, you might ask yourself if you're moving too slow.

If that didn't help, he would also point out:

> "This is not a hospital."

Whatever the crisis, and there were some good ones, we weren't going to save anyone's life by running around.

Sure, data loss sucks, but nobody died today because of this.

I really appreciate the raw timeline. I feel your pain. Get some sleep. Tomorrow is a new day.

> Get some sleep.

Definitely get sleep, but it would be nice if the site were back online before that. I actually just created a new GitLab account and project a couple days ago for a project I needed to work on with a collaborator tonight. This is not a good first impression.

Paid or unpaid account and project?

I applaud their forthrightness and hope that it's recoverable so that most of the disaster is averted.

To me the most illuminating lesson is that debugging 'weird' issues is enough of a minefield; doing it in production is fraught with even more peril. Perhaps we as users (or developers with our 'user' hat on) expect so much availability as to cause companies to prioritize it so high, but (casually, without really being on the hook for any business impact) I'd say availability is nice to have, while durability is mandatory. To me, an emergency outage would've been preferable to give the system time to catch up or recover, with the added bonus of also kicking off the offending user causing spurious load.

My other observation is that troubleshooting -- the entire workflow -- is inevitably pure garbage. We engineer systems to work well -- these days often with elaborate instrumentation to spin up containers of managed services and whatnot, but once they no longer work well we have to dip down to the lowest adminable levels, tune obscure flags, restart processes to see if it's any better, muck about with temp files, and use shell commands that were designed 40 years ago for when it was a different time. This is a terrible idea. I don't have an easy solution for the 'unknown unknowns', but the collective state of 'what to do if this application is fucking up' feels like it's in the stone ages compared to what we've accomplished on the side of when things are actually working.

Be careful not to overlook the benefits of instrumentation even in the "unknown unknowns" scenario. If you implement it properly, the instruments will alert you to where the problem is, saving you time from debugging in the wrong place.

The initial goal of instrumentation should be to provide sufficient cover to a broad area of failure scenarios (database, network, CPU, etc), so that in the event of a failure, you immediately know where to look. Then, once those broad areas are covered, move onto more fine-grained instrumentation, preferably prioritized by failure rates and previous experience. A bug should never be undetectable a second time.

As a contrived example, it was "instrumentation," albeit crudely targeted, that alerted GitLab the problem was with the database. This instrumentation only pointed them to the general area of the problem, but of course that's a necessary first step. Now that they had this problem, they can improve their database-specific instrumentation and catch the error faster next time.

Engineering things to work well and the troubleshooting process at a low level are one and the same. It's just that in some cases other people found these bugs and issues before you and fixed them. But this is the cost of OSS, you get awesome stuff for free (as in beer) and are expected to be on the hook for helping with this process. If you don't like it, pay somebody.

Really everyone could benefit from learning more about the systems they rely on such as Linux and observability tools like vmstat, etc. The less lucky guesses or cargo culted solutions you use the better.

Seems like very basic mistakes were made, not at the event but way long before. If you don't test to restore your backups, you don't have a backup. How does it go unnoticed that S3 backups don't work for so long?

Helpful hint: Have a employee who regularly accidentally deletes folders. I have a couple, it's why I know my backups work. :D

Even better, have a Chaos Monkey do it ;)

Would you believe I have enough chaos already?

Yeah, the "You don't have backups unless you can restore them" stikes again.

Virtually the only way to lose data is to not have backups. We live in such fancy times that there's no reason to ever lose data that you care about.

Not "can restore them", it's "have restored them".

Best way to ensure that is to have backup restoration be a regularly scheduled event. For most apps I work on, that's either daily or (worst case) weekly, with prod being entirely rebuilt in a lower environment. Works great for creating a test lane too!

> How does it go unnoticed that S3 backups don't work for so long?

My uneducated guess (this one hit a friend of mine): expired/revoked AWS credentials combined with a backup script that doesn't exit(1) on failure and just writes the exception trace to stderr.

I bet it writes to a log file. It just doesn't alert anyone on failure so the log just grows and grows daily with the same error.

Or it alerts people, but on the same channel every other piece of infrastructure alerts them, and they have a severe case of false positives.

I've seen that many more times than I've seen the "no alert" option.

New guy: "Hey I see an alert that XYZ failed to run."

Existing team: "Yah don't worry about that. It does that every day. We'll get to it sometime soon."

Gotta love the tweet though : "We accidentally deleted production data and might have to restore from backup."


As usual, I really love the transparency they are showing in how they are taking care of the issue. Lots to learn from

As I read the report I notice a lot of PostgreSQL "backup" systems depend on snapshotting from the FS & Rsync. This may work for database write logs, but it certainly will corrupt live git repositories that use local file system locking guarantees. NFS also requires special attention (a symlink lock) as writes can be acknowledged concurrently for byte offsets unless NFSv4 locking & compatible storage software is used.

Git repo corruption from snapshotting tech (tarball, zfs, rsync, etc): http://web.archive.org/web/20130326122719/http://jefferai.or...

Prev. Hacker News submission: https://news.ycombinator.com/item?id=5431409

Gitlab, I know you are all under pressure atm but when the storm passes feel free to reach out to my HN handle at jmiller5.com and I'd be happy to let you know if any of your repository backup solutions are dangerous/prone to corruption.

I see LVM[1] mentioned in the notes. It allows you to, among other things, snapshot a filesystem atomically which you could then mount read-only to a separate location to read for backups or export to a different environment. That would give you a point in time view of the state of all the repos that should be as consistent as a "stop the world then backup" approach.

[1]: https://en.wikipedia.org/wiki/Logical_volume_management

LVM snapshots the raw block device (logical volume). The filesystem is layered on top of that, and then open and partially written files on top of that. So snapshotting an active database is really not the best idea; it might work, it should work, but it'll need to discard any dirty state from the WAL when you restart it with the snapshot. You might be in for more trouble with other data and applications, depending upon their requirement for consistency.

It's definitely not as consistent as "stop the world then backup" because the filesystem is dirty, and the database is dirty. It's equivalent to yanking the power cord from the back of the system, then running fsck, then replaying all the uncommitted transactions from the WAL.

It's for this reason that I use ZFS for snapshotting. It guarantees filesystem consistency and data consistency at a given point in time. It'll still need to deal with replaying the WAL, but you don't need to worry about the filesytem being unmountable (it does happen), and you don't need to worry about the snapshot becoming unreadable (once the snapshot LV runs out of space). LVM was neat in the early 2000s, but there are much better solutions today.

> LVM snapshots the raw block device (logical volume). The filesystem is layered on top of that, and then open and partially written files on top of that. So snapshotting an active database is really not the best idea; it might work, it should work, but it'll need to discard any dirty state from the WAL when you restart it with the snapshot. You might be in for more trouble with other data and applications, depending upon their requirement for consistency.

> It's definitely not as consistent as "stop the world then backup" because the filesystem is dirty, and the database is dirty. It's equivalent to yanking the power cord from the back of the system, then running fsck, then replaying all the uncommitted transactions from the WAL.

I was referring to using LVM to snapshot the filesystem where the git repos are hosted. It'd work for a database as well, assuming your database correctly uses fsync/fdatasync, and for git specifically it works fine.

Using LVM snapshots with a journaled filesystem (i.e. any modern/sane choice for a fs) should have no issues though there would be some journal replay at mount time to get things consistent (v.s. say ZFS which wouldn't require it). If it does have issues, you'd have the same issues with the raw device in the event of hard shutdown (ex: power failure).

Quick question for you since it seems like you are very knowledgeable.

I am the sole back end developer for a greenfield web application that is very data heavy. The application is still in alpha at the moment, but part of the development process involved prepopulating the database with about 10 mil rows or so spread out over about 15 tables. Nothing too crazy. However, once the application is launched I expect to have exponential growth in data due to the nature of the application.

Currently, this application is set up on Linode. The database server is standalone and I have the ability to spin up multiple application machines and a load balancer. Each of these application machines read from and write to the single database machine. The database machine itself has full disk image backups taken every 24 hours, 7 days, and month. I also do a manual snapshot from time to time. On top of this I also usually dump into a tar file on my own external drive every once and a while as well. I'm fairly new to devops stuff and most of my experience involves building applications and not necessarily deploying them. I'm wondering if what I'm doing as far as backups and stability is enough or if I should be incorporating other methods as well. Have any thoughts?

> Git repo corruption from snapshotting tech (tarball, zfs, rsync, etc):

The link discusses why rsync and tarballs are not good backup solutions. But, those wouldn't rightly be called "snapshots". "Snapshot" implies atomic, right? Surely an atomic snapshot would not corrupt git -- I would expect git, like any database, is designed to be recoverable after power failure, to which recovering from an atomic snapshot should be equivalent.

Or is that not the case?

That is actually not the case.

Say a git server is in the middle of a write to refs/heads/master . You atomically snapshot the FS and then a power outage kills the server. The repository state will have a small chance of lock files that are never removed. Depending on the lock, future writes to a ref or to the repository can fail.

Not the worst situation as data lose won't occur but definitely not a stable state. If you treat git repositories as a service that needs a recovery step it would be fine, unfortunately most don't.

(edit) source: http://joeyh.name/blog/entry/difficulties_in_backing_up_live...

OK, from your link, it looks like the only problem affecting atomic snapshots is that `.git/index.lock` could be left in existence which will block git from doing any operations until it is deleted. For some reason git will instruct the user to delete this file but will not do it automatically, even though it could actually check if any live processes have the file locked, and assume that, if not, it is stale.

Seems like a bug in git IMO, but still reasonably easy to recover from.

Filesystem snapshots and rsync are between the PostgreSQL backup best practices. Not really for logs, but for stored data. (For logs you don't really need snapshots.)

The other good alternative is atomic backups (like with pg_dump), but that does put some extra load on your database that may be unacceptable.

Yes, you will want something different for your git repositories. There is no backup procedure that is best for all cases.

Amazingly transparent and honest.

Unfortunately, this kind of situation, "only the ideal case ever worked at all", is not uncommon. I've seen it before ... when doing things the right way, dotting 'I's and crossing 'T's, requires an experienced employee a good week or two, it's very tempting for a lean startup to bang out something that seems to work in a couple days and move on.

Regarding making mistakes:

Tom Watson Jr., CEO of IBM between 1956 and 1971, was a key figure in the information revolution. Watson repeatedly demonstrated his abilities as a leader, never more so than in our first short story.

A young executive had made some bad decisions that cost the company several million dollars. He was summoned to Watson’s office, fully expecting to be dismissed. As he entered the office, the young executive said, “I suppose after that set of mistakes you will want to fire me.” Watson was said to have replied,

“Not at all, young man, we have just spent a couple of million dollars educating you.”

This happened to me one night late a few years back, with Oracle on a CentOS server. rm -rf /data/oradata/ on the wrong machine.

I managed to get the data back though, as Oracle was still running and had the files open. "lsof | grep '(deleted)'" and /proc/<ORACLEPIDHERE>/fd/* saved my life. I managed to stop all connections to the database, copy all the (deleted) files into a temp directory, stop Oracle, copy the files to their rightful place, and start up Oracle, with no data lost.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact