I've screwed up before, and I empathize with their ops folks, but this should make us think about a plan B in case something like this happens again.
GitLab isn't popular because of the stability of its cloud platform. It's popular because you can install your own instance for free practically anywhere with minimal effort.
I run GitLab CE on a box in my server closet for projects that involve livelihoods.
If that is the reason why you use Gitlab, then why not try Gitea or Gogs? Gogs is written in Go and provides a Docker image or a drop-in binary.
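For instance, spinning up a throwaway Gogs instance is a one-liner with Docker. A sketch from memory (the ports and volume path are assumptions; check them against the Gogs docs):

docker run -d --name gogs -p 3000:3000 -p 10022:22 -v /srv/gogs:/data gogs/gogs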
I am so happy with Gitlab that I haven't installed Gogs even once to give it a try, though I've known about it since its initial days. For me Gitlab CE just works. Unless Gitea shows a 10x feature, I am very unlikely to shift away from Gitlab CE.
Activity for the last 160 days. There were 175 commits to gogs and 720 commits to gitea.
Activity for the last 60 days. There were 109 commits to gogs and 262 to gitea.
The options and vendors directories are unique to Gitea, and they account for a lot of the changes within the last 60 days. I was told the vendors directory is used to store dependencies, but I don't know what the options directory is used for. As the following shows, they also account for a lot of the files touched in the last 60 days.
Based on what I've read on Hacker News, the developer behind Gogs tends to merge in changes in spurts, so it's hard to tell whether this recent flurry of activity is a spurt or not. In the last 365 days of activity, you can see the 3 spurts for Gogs so far.
Regardless of whether or not Gogs will continue to develop at an increased rate, it looks like Gitea will.
The only (minor) issue I've had with it was when I tried to push the OpenCV repository to it, just for the hell of it, on a heavily constrained VM (Debian with 256 MiB of RAM). The poor thing just couldn't handle it without crashing until I upped the memory to a couple gigabytes.
I've found Gitea to handle larger repos fine after running the following:
# cap git's memory use for pack access and repacking
git config --global core.packedGitWindowSize 16m
git config --global core.packedGitLimit 64m
git config --global pack.windowMemory 64m
git config --global pack.packSizeLimit 64m
git config --global pack.threads 1
git config --global pack.deltaCacheSize 1m
A little-known library that addresses the same use cases as curl might have none listed.
Which would you prefer to use?
One has had its flaws hunted down and fixed over years of scrutiny; the other hasn't had that chance.
Thus, try the first, as it is better tested in the real world.
But I don't trust the employee that was shown forgiveness for a horrible mistake. Some might learn to not repeat what they had done, while others learn that they can get away with things through the magical power of phrasing the situation in a positive light. And some might be mistake-makers for life. Not all people come out the other side stronger.
Manufacturer A: perfect product.
Manufacturer B: omits some pieces, expedites replacements.
People will love and extol the amazing service virtues of B, not A.
They are just better at hiding it, smoothing it over and lying.
Stuff like this only happens once :) haha
To be fair, I suspect gitlab will adopt a 'never-again' policy towards making sure their backups work.
That's just nonsense. In most cases, the local repo should be more than enough to continue work.
I'm not trying to skewer Gitlab, I host an instance at home. I've also nuked 30% of the ports on our openstack cluster and screwed up everyone's day. I admire the transparency. I just wanted to call out HN's reaction, vs way smaller (technical) issues involving GitHub. But someone did point out downthread that HN's not a hivemind, so there is that.
You should always have 2+ production nodes in case one goes down.
Gitlab was _one_ (last!) final step from complete data loss of everything. One. That night, there was quite a long moment when only one copy remained (and it was 6 hours old). Every other backup was missing/not working/deleted.
This is scary.
I find this to be an eye-opener. The real problem was dodged by doing the right thing at the proverbial last minute (6 hours).
Neither should be exclusively relied upon for business-critical services. Ask Github about their production backups and DR plans some time.
I think "everybody learns the hard way" is just one of those things with operations.
In the US maybe, but given that they're a remote company with people all over the world, they don't necessarily have to live by US standards (especially when it comes to spending!).
40k USD is a VERY good developer salary in Argentina, and I'd bet that in other countries a lower figure might make the cut too. I can definitely understand why they'd hesitate to pay thrice that amount.
I believe it's over 5 times the average salary.
And it's not just me; they've done this in the past as well:
Also, when you work for a remote company, they are cutting costs on offices and the like, which should be reflected back in the salaries. The whole idea of working remotely was to mutually benefit both the employer and the employee, not for GitLab to hide behind the startup argument to cop out whenever it suits them, all while branding themselves a "truly remote", so-called transparent company.
I fail to see how this disproves it, it merely proves that you're not a good choice for them (because you live in a, relatively, expensive region).
If what OP is saying is true, then the company might as well be upfront about it.
Regardless of that, the average salaries are posted online, and they seem to suggest quite the opposite of what they offer people outside of the US. It's just an indication of how much of a bully culture they have in negotiations, or of how discriminatory they are; they might as well just outsource the site and not have a team of their own at all.
This argument sounds more selfish than fair to me.
A company should follow some rules to neutralise the discrimination between employees; it shouldn't be all pick-and-choose, taking advantage where they can. Getting cheap labor is not really the reason one should have a remote, distributed team.
Everyone has their own copy of the gitlab remote.
Does that mean you should only use gitlab for toy projects? I don't think so.
I think they'll quickly learn from this.
Yes, our process is too tightly tied to a single service, but it happens because we don't have the resources to self-host our own solution. We love GitLab, but this has absolutely got us looking elsewhere.
If you love GitLab and don't want to self host you can pay them for GitLab Hosted (paid customers had no troubles today).
If you self-hosted code reviews/CI, would you expect to have a similar downtime-causing problem in the future?
I expect the answer to be "Yes" for most companies.
I don't know. I'm very hesitant trying out GitLab now whereas I was interested before.
If "too much transparency" is a turn off for you, you're probably just an authoritarian trying to scheme and scam your way into profit, and you probably lack the confidence required to put whatever skill you think you have on display.
Recently, our Redis cluster failed and both master & slave host machines rebooted.
This caused a latency spike in our app, from roughly a 70ms to 400ms response time for less than 10 minutes.
We posted to status within 60 seconds and posted 3 updates within those 10 minutes.
The next day, a new customer (who hadn't gone live with the app yet) cancelled their subscription because the app was "not reliable".
I guess my point is that there is a balance to strike. Our customers are not tech-savvy in any way and treat any small issue as the end of the world. Maybe there's no need to freak people out for a minutes-long latency spike.
Just in case you ever use that phrase in critical correspondence. Better to err on the side of caution.
Working with people is hard, specifically because you have to get outside of your head and think about how they see the world.
There's no reason to hide behind a new account other than to be an asshole.
It's not just the impact, which is fairly sizeable in its own right; it's the HUGE oversight on their part, and the fact that they tried to pin part of this on PostgreSQL.
Credit where it's due: their report/transparency was good, if a little unprofessional, and something I'd like to see more of from other companies.
Putting on my BOFH hat, this is what happens when you let Devs do operational stuff.
What they went through is what you'd expect from a first-time pet project, not a professional business out to make money. I suspect they prioritized trying to scale with enough features to catch up with the competition above all else, to the point where sustainability is an afterthought. When desperately trying to gain market share is more important than the quality of the product, this is what you can expect.
"DBA" is a full-time position, not an addon to a developer's duties. They are separate skillsets; you don't get a "2-for-1" special by hiring an expert developer + DBA in one person for one lowly salary.
Do you realize this event would have never happened if you had hired a pure DBA? Or do you really believe you can pin the blame on your ruby developers for not being able to wrangle a production postgres database?
The oblivious or intentionally cheap "the DBA must be an amazing ruby developer" expectation shows your hiring staff - or the management guiding them - has absolutely no clue what they are doing. I can just imagine the internal discussion right now; pointing the finger at the developers with no postgres experience, or downplaying the significance of this event and pretending like it was simply bad luck, and lying to yourselves about how "it will never happen again".
This job posting is completely outside the realm of reason. If that job posting has been up for months or years, I can see its description being exactly why you didn't have the right talent on board to avoid this incident.
This is a mistake. The number of DBAs who are good with postgres is very small compared to something like mysql. The truly talented pool for such a position is too small to expect them to also be a developer. The very mention of terms like "ruby" and "programming" should be removed from that job post. It's not a realistic expectation.
> I'm the CEO of GitLab https://about.gitlab.com/ More information about me is on http://sytse.com
But maybe you meant part-time?
It is always a HUGE red flag when a company opts to have all or a majority of their workforce working remotely. It's a cost-cutting measure, nothing more. Cutting costs equates to cutting corners, and the business - and its customers - suffer the deserved consequences.
This really explains the flippant "it's 11pm and I want to go to bed" reaction in their report. The guy doesn't have an office to go to when shit hits the fan. He's sitting at home, with a bloody ssh terminal open, trying to remotely debug critical engineering problems over a slow vpn connection. Alarmingly huge red warning flags.
Might wanna revise that today.
Instead of backing up every system in isolation, have well-thought-out backup/restore processes for all parts of their operation.
Saying that as someone who's done exactly this before (professionally, for mission critical places). ;)
· Inventory the systems (boxes, services, etc)
· Determine what each needs (package dependencies, etc)
· Create scripting (etc) for consistent backups (see the sketch after this list)
· Work out the restore processes
· Make it work (can take several test/dev iterations)
· Document it
And also (importantly):
· Have the ops staff perform the documented processes, to reveal holes in the docs, and show up parts which need simplifying
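For the scripting step, a minimal sketch of what a consistent backup script might look like; every name and path here is hypothetical, and the details depend entirely on the systems inventoried above:

#!/bin/sh
set -eu
ts=$(date +%Y%m%d-%H%M)
dest=/backups/$ts
mkdir -p "$dest"
# dump the database atomically instead of copying live data files
pg_dump --format=custom --file="$dest/db.dump" production_db
# capture config and repository data
tar czf "$dest/etc.tar.gz" /etc/myapp
rsync -a /var/lib/myapp/repos/ "$dest/repos/"
# ship it off the box; a backup on the same machine is not a backup
aws s3 sync "$dest" "s3://example-backups/$ts/"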
People like to say they have backups or a "backup procedure", but in my experience almost none of them ever tested the backup... Not even once. 95% of the time "having a backup procedure" just means "we have a replica of some data sitting somewhere with no idea how/if we can restore it, or how long it takes".
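Testing a restore doesn't have to be elaborate, either. Even a sketch like this (made-up paths and table name) will catch an empty or corrupt dump:

latest=$(ls -t /backups/*.dump | head -1)
createdb restore_test
pg_restore --no-owner -d restore_test "$latest"
psql -d restore_test -c 'SELECT count(*) FROM users;'  # sanity check on real data
dropdb restore_test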
The best thing to happen to OSS is GitHub, not GitLab. GitLab is just a fast follower and likely wouldn't even exist without the former.
I for one am happy to throw money GitHub's way for their role in so dramatically changing how we code.
More seriously: Github is a business focused on turning open source community ethos into hard $$$. There is nothing wrong with that, and their model depends on people buying into the carefully constructed marketing of Github -hearts- open source. It's inconvenient for Github for people to point out that it is a closed source for-profit business with shareholders to please.
(One might even say that Gitlab is even more focused on turning open-source community ethos into hard $$$, because GitLab takes open-sourced community contributions to GitLab CE and rolls them directly into their paid, closed-source GitLab EE. Not that I'm suggesting that's necessarily a bad thing; I just want to make it clear that this isn't a Mozilla-style "take contributions from the community for the community's sake"-type thing.)
Just FYI, the GitLab EE source code is available at https://gitlab.com/gitlab-org/gitlab-ee/. You're right about the "paid" part, though.
SaaS companies can find it difficult to release their core software, because doing so severely undercuts their viability if it's easy for people to set up their own installations.
What follows is naïve speculation as to counterarguments:
One could argue that they could differentiate themselves on better quality of service and support. It's possible, of course, but there may be enough people willing to put up with lesser quality that it may not be feasible for them to survive.
One could counter that taking advantage of community software contributions can lower their own internal development costs. Maybe, but there are very real costs when managing an open source project, including not being able to move as fast as they may want to. For example, if GitHub wanted to make a change for expediency's sake that wouldn't be appropriate for the community code, they're saddled with maintaining their own fork, which adds considerable overhead.
Of course GitHub loves open source. Their business is to provide a service that benefits open source (and closed source, for that matter) projects by providing infrastructure to manage their code. Some of their current and former employees are well-known in the open source community. I refuse to see anything nefarious or untoward (or even the slightest bit bad) about this until I see them doing something that impedes or actively works against open source. I see nothing "inconvenient" for GitHub about this. Some people are far too quick to go out of their way to look for hypocrisy.
(And as for "turning open source community ethos into hard $$$": my goodness. Yes, they make money. I hope plenty to continue to support the business, improve the service, and pay their employees well. How many community projects pay nothing for hosting on GitHub? How many devs pay to host their own projects, or fork others? The only reason I've considered paying was to host private repos: not open source. It'd be interesting to see their revenue numbers, but I suspect most of their revenue is for enterprise installs for software that isn't open source. Any corrections as to this most welcome.)
And there are alternatives for those who want open source. GitLab appears to be a great one.
For a company that praises open-source so much they really do try to be as closed as possible themselves.
Github enables centralization of code, which is not really in line with the philosophy of git being fully distributed. Gitlab, with its OSS model and the ability to host your own git service, is a bit closer to that vision than Github is.
Sourceforge had integrated mailing-list support, bug trackers, page-hosting, and more.
There was a community there, it was just that each community was based around a particular project. There was little chance of a user of project A interacting with project B. But I guess the same could be said of github.
Sourceforge failed in part because of feature-creep, and availability issues. But I think it would be unfair to say that it wasn't "social".
Found some tiny project that does almost everything you want, but is missing a tiny feature you can implement easily? No worries: fork it, add it.
Even if it's never pushed upstream, it's easy for someone to find and perhaps use themselves.
I do this with things like zabbix templates/scripts and docker containers that are 99% what I want.
Previously there was no easy way to do this - I'd have to set up an entirely new sourceforge/google code/etc account, which is a lot more effort.
That's the part that wasn't really possible before git, though. But you could always do an anonymous checkout and make your changes there; you could even do a pull request (via a patch). There was no need to sign up for that, even though most of us had a SourceForge account (I still do) anyway.
Just running an apt-get install gitlab gives me around 350 dependencies. Among those dependencies, I see python, ruby, nodejs, redis, postgres.
With a little Java, and a little Go, plus some admin scripts written in Perl, the picture would be mostly complete...
I may be a little harsh, but when I see a piece of software with so much complexity in it that it requires 3 stacks, I'm not completely convinced it's a well-conceived piece of software.
Having 5 backup systems, with none of them working properly kind of falls in the same category.
Rarely, I see Java or Go instead for the backend. I cannot recall the last time I saw more than three languages in production for anything nontrivial. I've seen companies significantly larger subsisting on one language for backend (even in microservice architecture!) and one language for frontend. That's not to say there is no sophistication, just that the number of actual technologies in play is slimmer.
This isn't a comment on GitLab's utility or stability, of course. I haven't worked with them in this context, and I'm not a user. I'm just pointing out that, assuming those dependencies are all for GitLab and not e.g. git itself, that is quite a stack to maintain. I don't know if we can extrapolate that to a systemic issue with GitLab that caused a data loss incident, though. That seems uncharitable.
I can understand "right tool for the job" but at some point it should all come together. An MVP can be hacked together from bits and bobs, but when it becomes a business it should be refactored to reduce complexity where possible.
Trying to maintain five separate backup solutions, let alone trying to restore from all five of them, sounds like my worst nightmare. Trying to restore from one backup is often hard enough by itself.
Everyone knows how important backups are and talks about them all the time, but I bet a lot of companies don't do it right. It is expensive, doesn't add real value (except when it does), etc.
Today I added automated replication of backups from AWS to another cloud provider. Just in case...
I did this because recently a local Brazilian cloud provider (ServerLoft) didn't pay for its servers (at Equinix) and went offline forever. 16k companies went offline without time to recover anything there.
I'm using this as a great example of a reason to bump some of the fixes for our backup and replication issues up the priority list. And it's much easier to sell to some of the "higher-ups" when you can point at a concrete example of how badly a misstep here can hurt.
I'm floored by their honesty and openness; I can honestly say I wouldn't be able to put this out there like they have... But I'm really glad they are doing it, and I'm really happy at the outpouring of support they are getting for it from people like 2ndQuadrant.
I manage our company's GitLab instance. It's connected to our massive AD and it works just fine - the only thing I really miss is LDAP group creation and assignment. GitLab uses only the name and email attributes from LDAP/AD either way, and I think if I have some spare time I'll just write an hourly cronjob that manages groups and assignments using the GitLab API.
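Something along those lines ought to work. A rough sketch against the v4 API; the token, host, group path, user ID and access level are all placeholders:

# create a group, then add a member with Developer access (level 30)
curl -s -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  -d "name=devs&path=devs" \
  https://gitlab.example.com/api/v4/groups
curl -s -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  -d "user_id=42&access_level=30" \
  https://gitlab.example.com/api/v4/groups/devs/members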
Being open about it inspires confidence that they will improve in the future, instead of being quiet and having repeat incidents.
I feel like the next step for me is scaling my business so that we have an actual usage for my newly found interests :)
But please realize that this is a horrible example. Almost everything done was wrong, technical choices, processes, everything - please don't use it as a positive example.
Someone not involved with helping fix the problem set up the stream from their home, while we continued work as normal.
I think the overall spirit was that it was comfortable to do it like this. Note that no one was required to work like this, and we'd happily have stopped streaming if anyone had had any problems with it.
..and yep, it's listed.
In my own experience, I'd still take PostgreSQL over MySQL. MySQL doesn't allow for DDL modifications within a transaction, which makes database migration with tools like Flyway a little less resilient. On the other hand, you can use one connection for multiple databases with MySQL, MSSQL and others, which you can't with Postgres.
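Concretely, transactional DDL means a failed migration rolls back cleanly, schema changes included. A sketch (database, table and column names invented):

psql mydb <<'SQL'
BEGIN;
ALTER TABLE users ADD COLUMN age integer;
-- if anything later in the migration fails, the ALTER is undone too
ROLLBACK;
SQL

In MySQL, ALTER TABLE triggers an implicit commit, so a migration that dies halfway can leave the schema in an in-between state.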
I mean really, they all have trade-offs. It really just depends on your specific use case.
I actually regret not using MySQL, because MySQL at the time supported logical replication out of the box, which would have made database upgrades easier. We haven't upgraded our Postgres DB boxes because it would involve a lot of pain, whereas if we were using MySQL we would probably just upgrade the slaves and wait for a failure.
I really hope this becomes a thing.
Have a single point of contact that provides information about the recovery process. Being transparent and providing technical info is good, but that task should not be handled directly by the admins at the same time they are focusing on the drop-everything-shit-is-broke emergency.
Not sure if we'll be able to get the full 8+ hours up.
From , complete db is ~300GB and from some iffy pixel measurement of the graph at the very bottom of that page, copying speed between otherwise idle db hosts was about 22.8 GB/hour (in-production replication is probably slower than that).
From that, 4GB of replication lag would represent 1.3% of db by size, or 10+ minutes of lag (as measured by time required to catch up under ideal circumstances).
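Spelling out the arithmetic behind those figures, under the same assumptions:
4 GB lag / 300 GB total ≈ 1.3% of the database by size
4 GB / 22.8 GB per hour ≈ 0.18 hours ≈ 10.5 minutes to catch up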
However, scale was the wrong word for what I was wondering about. My question should've been whether 1% of your total DB/10 minutes of replication lag seems reasonable/nothing to worry about, like the article suggested.
> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place. We ended up restoring a 6 hours old backup.
That must be _terrifying_ to realize. I mean, thank goodness they had a 6-hour-old backup or they'd be in such an awful spot.
It would be like Boeing or Airbus announcing all the safety features on their airliners were non functioning.
To add to the other replies: I'm not trusting them with keeping my code safe (everybody has copies on their own computers). I'm trusting them with facilitating my workflow, helping me collaborate, and also to keep my issues etc. safe.
So yes, this can have a relatively large impact in terms of not being able to work as efficiently for a day and potentially losing some issues, but nowhere near as disastrous as losing all my code.
If you want to move it, in most cases you can simply create a new bare repo on another remote and push yours to it. It's probably the easiest system out there to simply pick up your data and go, unlike the walled gardens of social networking, video (YouTube isn't video, it's video + annotations + another ecosystem of tools that's not easy to export), and other services.
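A sketch of the whole move (host and paths invented):

# on the new server
git init --bare /srv/git/myproject.git
# locally: add the new remote and push all refs, tags included
git remote add neworigin ssh://you@newhost/srv/git/myproject.git
git push --mirror neworigin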
Even blog software isn't as resilient. You still have to export your Wordpress or Ghost blog when you want to move it. With git, when you work on it, you already have a full copy (with a few exceptions of course, like remote branches people prune without merging or local branches people never push).
This might be a good time to note that GitHub dropped their database a while back (twice!). Bad Things always happen; you recover, learn, and hopefully fewer Bad Things happen in the future. Nature of the beast, unfortunately.
It can be argued that once a company messes up this bad it will make sure nothing similar happens ever again. However, it can also be argued that if a tech company has all five of its back up procedures fail it's borderline criminally negligent.
I have another company that runs on a completely open stack, where pretty much nothing is integrated by a specific vendor. We have hiccups, but we've never had the OS get hijacked and upgraded.
I've noticed most start-ups run by devs run on a more open stack and hack their way through problems on the cheap, and the ones run by corporate executives try to keep things as closed as possible, but end up spending millions to solve problems that they could have had some people solve for fun on the internet.
I prefer to use the right tool for the right job, but I wish companies like Microsoft would be more open when they cause huge issues that end up causing monetary loss. I make sure all my critical infrastructure is open source these days.
Weird that you group OSX in there. The only company in mainstream tech more secretive than Microsoft when it comes to problems is... Apple.
"Test-recover backups" is ops 101. "Monitor your backup process to be sure your backup store isn't empty" is ops 101. "Script your rollouts so you don't have an ops person doing SSH on boxes" is... ok, that one might be ops 102.
This points to a company with almost no understanding of how to operationalize software. There are certain to be far more landmines, possibly even bigger ones. Hiring an ops person to fix these problems is definitely possible - and I sincerely wish GitLab luck getting a competent ops team in place before the next crisis.
The day was saved by the fact that Oracle stored data on a block device directly. There was no data loss, and we just had to restore the machine itself.
Since that day, I never run any scripts in /, /etc/, ...
Use local installers. Or tarballs.
I've since set up wal-e with daily base backups, deleting anything older than a week, along with nightly pg_dumps and a hot standby. Maybe that's overkill, but after having lost data once: never again!
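For reference, the cron side of a setup like that can be tiny. A sketch for /etc/cron.d, assuming wal-e's credentials live in an envdir and archive_command already points at wal-push (paths are placeholders):

# nightly base backup, then prune down to the last 7 base backups
0 2 * * * postgres envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.6/main
0 4 * * * postgres envdir /etc/wal-e.d/env wal-e delete --confirm retain 7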
The nice part about doing wal archiving like barman or wal-e do is you can do more than just backup/restore. You can do it with some time target in mind as well.
Someone somehow does a massive update, delete, or insert of millions of rows of garbage? No worries: stop, destroy, restore to a previous point in time, continue onwards.
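With wal-e, that point-in-time restore is just a recovery.conf on a fresh data directory, something like this (timestamp and envdir path invented):

restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
recovery_target_time = '2017-01-31 22:00:00'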
A bug in postgres or the kernel or the filesystem or any other multi-million line codebase in that stack screw up? Most likely the WAL segments are good up to a point still.
Hot standby gets you potentially sub-minute failovers if you automate them, or short enough to be ok even with manual failovers. WAL archiving gets you another whole level of safety net that is hard to beat.
"If it's not tested it doesn't work"
while this was originally talking about software, it's amazing how many other places it applies. Do you require code reviews before commit? Periodically sample a few random commits and see if one was done for each.
Heck, if you don't test your system for running automated tests, it may be that you aren't even testing what you think you're testing.
GitLab is surely losing subscribers faster than land in Crimea after this...
--it happens. I suspect the guy responsible for the final straw is feeling pretty bad. I know I've come close to doing similar things on production environments I really didn't want to be touching while they were falling apart.
But they've been honest about it. If they learn from it and six hours of database data is the worst data loss they ever experience I think it'll be a credit to them that they've been promptly transparent.
I think companies prefer other databases like MySQL because they "just work."
On performance: Every database will have some areas where it performs better than others - Uber just happened to hit a case that MySQL is well optimised for.
On data loss: MySQL's binlog replication is hardly a 'just works' solution. It's definitely got a lot better over time (particularly in 5.7), but it's not like a replica falling behind is some incredibly unusual event. Go back a fair while and it was really broken - much more so than Postgres' solution has ever been. It's fair to say it's more mature right now, and one reason to prefer MySQL for replication is that there's certainly more help out there for it.
The data loss in this incident was human error, not PostgreSQL's fault.