status.github.com reports "We're failing over a data storage system in order to restore access to GitHub.com."
GitHub engineers, if you are reading this (probably not), KEEP IT UP, it happens to the best of us! <3
"...with the aim of serving fully consistent data within the next 2 hours."
That's somewhat significant
> Multiple services on GitHub.com were affected by a network partition and subsequent database failure resulting in inconsistent information being presented on our website. Out of an abundance of caution we have taken steps to ensure the integrity of your data, including pausing webhook events and other internal processing systems.
> Information displayed on GitHub.com is likely to appear out of date; however no data was lost. Once service is fully restored, everything should appear as expected. Further, this incident only impacted website metadata stored in our MySQL databases, such as issues and pull requests. Git repository data remains unaffected and has been available throughout the incident.
- prove it's a human that typed it
- there is code that prevents repeating the same message twice
either way it's entertaining... But it's Monday morning in Australia and we need to release! (yep we do this via pr/tagging etc.)
Off-Topic: If it's not possible to write a tweet with the exact text from a deleted tweet, then a way to prove someone wrote a tweet and then deleted it would be to have them try to write it again.
Ironic isn't it? The whole point of git was to be distributed, and yet we're at a place where a bunch of companies can't deploy software when a single git provider is down. I myself am in the same boat. Sure, I could reconfigure for a different repo, but is it even worth the effort?
All this for $0.
I agree, gitlab and github will cost the same in the long run. Time costs money and self hosted has a lot of downsides as well.
However, gitlab being self hosted (and not costing anything in terms of separate invoice) makes us feel like we're more in control and are somehow saving some money (as we add more developers as well). This comment here was just sparked by a sigh of relief because this week is an important release for us and had we been on github this monday wouldn't have started too well.
Overall, all the comments listing the trade-offs are true. Time costs $$$ and a paid hosted solution is worth the minimal cost, time and expertise required.
I mean kudos to you for setting it up but it's a bit naive to believe it's better than a hosted solution right away.
We're a company making WISP/ISP software here in India and our cloud (hosted) offering is also hosted on servers at our office itself. Our infra manages hundreds of thousands of internet subscribers in India.
We're a small tech company with 25-30 people or so, with Linux kernel experts who have experience running software on questionable hardware so that service providers can take the internet to new places in India. That experience helps a lot in keeping all our services up with minimal downtime.
Our office runs 24x7 with support. If anything goes down our business will be in jeopardy, so we have protocols to be back up in the minimum amount of time.
How many hours did we spend on it? Less than a couple of hours setting up the self-hosted GitLab instance. More time getting people used to it and setting up the runner for CI, but I would have to spend that time on other paid solutions too.
As for time costing $0, definitely not, but people in India don't cost nearly as much as they do in the USA. Also, because of our experience with self-hosting, it's less of a big deal for us to spend time on this.
We have the multiple bandwidth backups, electricity backup, etc. needed to self-host, and we've always stuck with self-hosting.
We have a SaaS offering for provisioning, subscriber management (AAA, RADIUS), billing, OSS, CRM, customer portals (including mobile apps) and more!
Please let me know if you'd like to know more, also I would really love to know what software you use to manage your users, etc. and what WISP software is popular in the UK. Thanks!
As for this:
> Who will be woken up at night when there's an outage?
Unless they are working with teams around the world, does it matter if their server is down the whole night, as long as they can get it back up first thing in the morning?
> GitHub has the right to suspend access to all or any part of the Website or your Account at any time, with or without cause, with or without notice, effective immediately.
So, we already had a 3-DC virtualisation platform.
We also chose GitLab, with a primary in our primary DC (where dev/QA infrastructure resided), with daily full backups (as created by GitLab) rsynced to the standby GitLab instance (and backups from the standby rsynced to the primary) and repos synced hourly. Plus, all volumes on the SAN-attached storage array were snapshotted and backed up to virtual tape in both sites.
The 3rd DC was too small for the same footprint, but had a smaller SAN array and we would be able to host the gitlab instance there in ~1 hour if both the primary and standby failed.
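For the curious, the nightly backup shuffle amounted to roughly the following. This is a minimal sketch, not our actual script: the hostname and paths are placeholders, and it assumes the Omnibus `gitlab-backup` wrapper and rsync are available on the box (older GitLab versions use `gitlab-rake gitlab:backup:create` instead).

```python
# Rough outline of the nightly job: create a full GitLab backup, then ship the
# backup directory to the standby instance. Hostname/paths are hypothetical.
import subprocess

BACKUP_DIR = "/var/opt/gitlab/backups"          # default Omnibus backup path
STANDBY = "git.secondary-dc.company"            # assumption: standby hostname

def nightly_backup():
    # Full application backup (repos, db dump, uploads) written to BACKUP_DIR.
    subprocess.run(["gitlab-backup", "create"], check=True)
    # Mirror the backup directory to the standby over ssh.
    subprocess.run(
        ["rsync", "-a", "--delete", BACKUP_DIR + "/", f"{STANDBY}:{BACKUP_DIR}/"],
        check=True,
    )

if __name__ == "__main__":
    nightly_backup()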
* 2 VMs with 8GiB ram. All other costs were negligible.
* 2x50GiB SAN volumes (at ~ $1/GB for capex+5-year-opex)
* Some virtual tape costs (negligible to our other backup costs)
* Approx 1 hour to set gitlab up
* Approx 1 hour to setup backup replication and test restores
* Approx 1 hour spent on Gitlab upgrades per year
* Benefits: Integration with other on-premise infrastructure we would prefer not to expose to the 'cloud' via e.g. VPC gateways etc.
* Unlimited (except by ram/storage assigned to VMs) private repos
* Ability to deploy applications regardless of any other external outage (network, cloud etc.): priceless
I am not aware of any failures in the first 3 years of operation.
> What did the server cost?
Negligible, we already had 12 physical hosts with 256GiB ram each in the primary and secondary site
> How many hours did you spend on it?
3 more hours than we would have spent with a hosted solution, assuming the hosted solution would have cost 0 hours to integrate with our on-premise infrastructure in non-internet-exposed DMZs.
> What's your SLA?
Our standard SLA was 98.5% availability of all internal services measured monthly with 1h MTTR.
Same team that gets woken up for the other ~200 VMs that are more critical than the git repo
> What's your backup and recovery procedure?
Assuming we couldn't recover the primary:
* Ensure the primary gitlab instance isn't taking writes/updates (e.g. disable the virtual NIC)
* Start a restore from the latest backup on the standby
* Check if there were any commits to the git repos since the last backup; if so, push them to the recovered gitlab instance from the clones
* Check that the repos are accessible and working
* Flip the DNS from git.company to point to git.secondary-dc.company instead of git.primary-dc.company
We built everything to be able to fail over from the primary to the secondary site and had tested all failovers (application servers, multiple databases, authentication servers, monitoring, infrastructure management applications etc.).
> but it's a bit naive to believe it's better than a hosted solution right away.
Depends on your use case and existing infrastructure/skills/operational posture etc.
This is precisely the problem with developers. You decided to go with a complex self-built solution because it "feels" cheaper. If I had received a dollar every time someone made a wrong decision based on a feeling, I'd be a rich man today.
Self-hosting vs 3rd party hosting has a ton of tradeoffs, but systems can fall over no matter who's hosting them.
With hard drive failure rates of 1.6% per year, and all other bits of computer hardware being very reliable, I would guess that an unmaintained self hosted solution will have a longer 'time to failure' than a service provider.
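To put a rough number on that (back-of-envelope, assuming failures are independent and the rate stays constant):

```python
# Chance that a single drive survives a 5-year stretch at a 1.6%/yr failure rate.
annual_failure_rate = 0.016
years = 5
p_survives = (1 - annual_failure_rate) ** years
print(f"P(drive survives {years} years) ~= {p_survives:.1%}")  # ~92.3%
```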
The main benefit of a hosted service IMO is less setup time and a support team to help you when they inevitably 'retire' the 'old' api.
Luckily for us, we do have experience with figuring out servers and are usually required to fix crashes with minutes, rather than days, as the unit of measure.
GitLab may well be the right choice for many teams, but I’ve not yet seen price be the winning factor over the long term. Between general sysadmin tasks, scaling with the size of your team, doing upgrades, and performing database migrations, the cost of running GitLab will probably be the same ballpark as GitHub.
We're a profitable software company that intends to remain in business forever. We're well staffed, though, and people do have time to try things out and even do R&D on the side. Frugality is built into our DNA, and having spare time on top means we never really amortize the time costs properly.
However, it might not seem a wise choice given GitLab's history of breaking every week and a half. (That has improved significantly in recent months, though.)
GitLab is a far better tool for the enterprise, as it can group repos into projects and have a shared issue tracker at the project level (as well as per repo). It also has first-class CI integration (GitLab Runners). All of this beats paying for GitHub.
I worked for a very large financial news company that has partially migrated from Stash/Bitbucket to GitHub. With a mix of public/private repos it's a total mess. The tools for managing more than 40 repos simply don't exist. Not only is it more expensive, it doesn't integrate well with Jira (well, that's not entirely GitHub's fault); it literally is just a git HTTP web interface with a ticketing system bolted on.
Thank you for sharing your feedback and comments with everyone, we really appreciate that. It's great to hear how you are handling your self-hosted GitLab instance, that's awesome!
You say that now but if you're at the sort of scale where you might have the same issue Github is having you'll be a lot less glad someone else isn't fixing it. Especially if it's 3am. On a Saturday. In the middle of your vacation.
Gitlab is awesome but it's not invincible. Problems happen regardless of what solution you pick. I'm happy to pay Github to resolve them for me faster than I could do myself, and better, and with solutions in place up front to make sure I don't actually lose any data.
I'm interested to see what our experience with bugs and updates is in the future, my happiness could very well be premature.
Obligatory "that's what happens when the whole world relies on a centralized git repo" and a reference to gitea, which has a very slick github-esque UI and is incredibly easy and light to deploy/run (on an existing server, your own PC, a raspberry pi, a docker vm, or whatever): https://gitea.io/en-us/
• We aim to serve fully consistent data within the next 2 hours.
• 45 minutes later. On track for serving fully consistent data within the next 1.5 hours
• 45 minutes later. On track to serve consistent data within the hour.
• One and a half hours later. We estimate they will be caught up in an hour and a half.
They had a split brain when multiple masters were running. Then they were not able to choose a master to keep because the data in both masters was 'corrupted' so they are now restoring from a backup.
So how do they get data corruption from multiple masters running:
1) Performing reads from slaves during an update operation. If you perform a read from a slave then you might get data from the other master. If you update data on one master based on data from another master then you get data corruption. Probably they don't do this much because if you had any slave lag then you would notice this problem during normal operation. However, they might do it for checking permissions. You can imagine because this data almost never changes it would never show up as a problem normally.
2) Data stored outside of the database. This would be the repositories themselves and cookie/session storage. Imagine repos have an incrementing id. Then with two masters, a repo gets created with the same id on both masters (sketched below). This is very bad because now two people can see each other's data. You have the same problem with cookies. Imagine you use the user id as an incrementing id. The two masters create a user with the same id, and an encrypted cookie or another storage system (redis) stores the user id. Now depending on which master you get routed to, you appear as a different user.
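To make the incrementing-id case concrete, here's a toy sketch (pure illustration, nothing to do with GitHub's actual schema; the fix at the end mirrors MySQL's auto_increment_increment/auto_increment_offset knobs, which exist, though I have no idea whether they use them):

```python
# Two "masters" that can't see each other both hand out the next id.
class Master:
    def __init__(self, start=1, step=1):
        self.next_id = start
        self.step = step
        self.rows = {}

    def create_repo(self, name):
        rid = self.next_id
        self.next_id += self.step
        self.rows[rid] = name
        return rid

# During the partition both sides keep accepting writes...
a, b = Master(), Master()
print(a.create_repo("alice/project"))   # -> 1
print(b.create_repo("bob/secrets"))     # -> 1, same id: rows collide on merge

# Interleaved sequences (offset + step) keep the id spaces disjoint:
a2, b2 = Master(start=1, step=2), Master(start=2, step=2)
print(a2.create_repo("alice/project"))  # -> 1
print(b2.create_repo("bob/secrets"))    # -> 2
```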
Weirdly enough this would always be a problem based on how their failover system works. The only safe way I know of to turn a master-slave system into a safe HA system is how Joyent does it (https://github.com/joyent/manatee/blob/master/docs/user-guid... : you basically need a vote at commit time; if you have this then only one master can commit). However, I'm guessing most of the time they only have two masters running for < 1 minute, but probably this time they had two masters running for a long time.
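The voting idea boils down to something like this (toy sketch, not manatee's actual protocol):

```python
# A write only counts as committed if a majority of replicas acknowledge it,
# so the minority side of a partition aborts instead of silently diverging.
class Replica:
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.log = []

    def accept(self, txn):
        if not self.reachable:
            return False
        self.log.append(txn)
        return True

def commit(replicas, txn):
    acks = sum(1 for r in replicas if r.accept(txn))
    quorum = len(replicas) // 2 + 1
    return acks >= quorum

# 3 replicas, partitioned so we can only reach 1 of them: the write is refused.
cluster = [Replica(), Replica(reachable=False), Replica(reachable=False)]
print(commit(cluster, "UPDATE issues SET title = '...'"))  # False
```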
EDIT: Oh, they said it affected issues & pull requests [they have different dbs for different stuff], so the repo and authentication stuff wouldn't apply. Oh, and they said no data was lost as well, so presumably that would exclude a split brain running for a long period of time.
I guess this is a 'bug' because when you do stuff like this you should include some kind of version identifier to catch a concurrent update in the normal case. This scenario is also 'storing stuff outside the DB', kind of similar to the user_id in the cookie getting out of sync.
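Something like this, sketched against an in-memory sqlite table (illustrative only, obviously not their real schema):

```python
# Optimistic locking: the UPDATE only applies if the version we read is still
# current; a concurrent writer bumps the version and our write touches 0 rows.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT, version INTEGER)")
db.execute("INSERT INTO issues VALUES (1, 'original title', 1)")

def update_title(conn, issue_id, new_title, expected_version):
    cur = conn.execute(
        "UPDATE issues SET title = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_title, issue_id, expected_version),
    )
    return cur.rowcount == 1  # 0 rows touched -> someone updated it under us

print(update_title(db, 1, "renamed", expected_version=1))      # True, applied
print(update_title(db, 1, "stale write", expected_version=1))  # False, rejected
```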
Also, Microsoft posts plenty of postmortems, like this detailed one from the VSTS outage in September: https://blogs.msdn.microsoft.com/vsoservice/?p=17485
Looking forward to the write-up. I'm also curious as to how this significant outage lines up with their SLA for enterprise users.
> At 10:52 pm Sunday UTC, multiple services on GitHub.com were affected by a network partition and subsequent database failure resulting in inconsistent information being presented on our website. Out of an abundance of caution we have taken steps to ensure the integrity of your data, including pausing webhook events and other internal processing systems.
> We are aware of how important our services are to your development workflows and are actively working to establish an estimated timeframe for full recovery. We will share this information with you as soon as it is available. During this time, information displayed on GitHub.com is likely to appear out of date; however no data was lost. Once service is fully restored, everything should appear as expected. Further, this incident only impacted website metadata stored in our MySQL databases, such as issues and pull requests. Git repository data remains unaffected and has been available throughout the incident.
> We will continue to provide updates and an estimated time to resolution via our status page.
> We continue to monitor restores which are taking longer than anticipated. We estimate they will be caught up in an hour and a half.
What I don't understand is why they don't set it all to "read only" until the problem is sorted. Looks like any update goes to /dev/null, just let your users know!
(unless they plan to replay those updates, somehow)
If replaying is even on the table (which sounds very dangerous to me), that requires a huge coordination and I am pretty sure they are not able to tell right now if that is going to work once the actual issue is fixed.
Better to stay quiet while fixing the issue and only say things that you actually know 100% to be correct.
Great way to start the week.
Your code and your wiki (if you use it) are already in git repos. If the industry moved forward on issues in dvcs repos, you have another central point of failure removed.
You also have less hassle to work across dvcs hosts, which is exactly why GitHub would never embrace this.
> We are currently in the later stages of a restore operation, with the aim of serving fully consistent data within the next 2 hours.
(source: my website is hosted on github pages)
In the enterprise space, a 'data storage system' could be an Array or a SAN or a lightpath etc, with usually quite long failover times. For an org like GitHub I'd think more like an object store (an Array-of-Hosts, if you will) or whatever storage mechanism holds their database files. Do they self host this sort of thing or is it an AWS/GCE/Azure service?
FWIW, all git commands are working fine for me (create a branch, push, colleagues can fetch my branch), but the UI doesn't show my branch, nor can I review/comment on PRs.
Snark aside, this is a great time to reassess your deployment strategies and look into things like local apt and pypi proxies. I'm confident you can find similar projects that will transparently cache your dependencies.
This is absolutely a perfect time to assess deployment strategies and challenge all the advice about how big a company has to be before it's worth doing X.
If you can't do your job without a tool, then you need to have a plan B on hand for when that tool fails. Both in the micro sense of the tools you use to code (editor, browser, laptop, mouse, coffee mug, etc) and in the macro sense of tools your organization uses (ticketing systems, chat, bathrooms). Show some initiative, figure out some mirrors for your dependencies, and try standing up a local caching proxy for your team. It'll probably take a lot less time than you think.
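For a feel of how little machinery a transparent cache needs, here's a toy sketch (not production-grade; the upstream URL, port and cache directory are placeholder assumptions, and a real setup would reach for devpi, apt-cacher-ng or similar instead):

```python
# Toy caching pass-through: serve files from a local cache, fetching from the
# upstream mirror only on a miss, so builds keep working during an outage.
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://files.pythonhosted.org"  # assumption: whatever you depend on
CACHE_DIR = "./pkg-cache"

class CachingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        local = os.path.join(CACHE_DIR, self.path.lstrip("/").replace("/", "_"))
        if not os.path.exists(local):
            # Cache miss: fetch from upstream and keep a copy on disk.
            os.makedirs(CACHE_DIR, exist_ok=True)
            try:
                with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                    data = resp.read()
            except OSError:
                self.send_error(502, "upstream unreachable and not cached")
                return
            with open(local, "wb") as f:
                f.write(data)
        with open(local, "rb") as f:
            data = f.read()
        self.send_response(200)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), CachingHandler).serve_forever()
```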
It could be a networking issue, but you'd expect more sites to be impacted.
If it were a software issue, you'd expect a big player like google to be aggressively patching and talking about a bad release.
More likely it's just a coincidence.
Not even sure if that's feasible but it's an idea.
I personally don't find a relation likely.
What are the (good) alternatives to Github? Gitlab supposedly is Google-backed, so I don't want to have my private code there. Is Bitbucket the only one left?
I don't mind paying monthly, which I already do for GitHub.
This strikes me as odd. May I ask why usage of GCP is a deal-breaker for you? While I can understand not wanting to use Google products directly as a consumer, I believe it would be - for lack of a better term - platform suicide for Google to intercept and perform its usual analytical shenanigans on the data content of transmissions to/from their platform.
Either way, Phacility's Phabricator is $20/user/mo.
I use gmail, I'm quite happy to trust that contract.
It's only recently that Gmail's contract has involved keeping out of your data. I think they also only say they abstain from using your data for targeted advertising, not that they don't use it for other purposes. I haven't read the terms in quite a while though, and I could be mistaken.
Great products though. I really do wish I could pay for them in exchange for a real, trustworthy, comprehensive privacy promise.