Status.github.com: “We're failing over a data storage system”

jypepin · on Oct 22, 2018

It's been over 6hrs without update, except the hourly message which states the same. I'm really looking forward to this post mortem, but having been in the situation where I've had to deal with large scale outages like this one, I guarantee some engineers are having a bad time right now, and I feel for them.

Github engineers, if you are reading me (probably not), KEEP IT UP, it happens to the best of us! <3

olingern · on Oct 22, 2018

Posted at 15:51 Japan Standard Time

"...with the aim of serving fully consistent data within the next 2 hours."

That's somewhat significant

Cthulhu_ · on Oct 22, 2018

Yeah, sounds like a backup restore or a RAID synchronization is in progress.

woogle · on Oct 22, 2018

They've just posted an "Incident Report": https://blog.github.com/2018-10-21-october21-incident-report...

> Multiple services on GitHub.com were affected by a network partition and subsequent database failure resulting in inconsistent information being presented on our website. Out of an abundance of caution we have taken steps to ensure the integrity of your data, including pausing webhook events and other internal processing systems.

[...]

> Information displayed on GitHub.com is likely to appear out of date; however no data was lost. Once service is fully restored, everything should appear as expected. Further, this incident only impacted website metadata stored in our MySQL databases, such as issues and pull requests. Git repository data remains unaffected and has been available throughout the incident.

keyle · on Oct 22, 2018

I don't know if they keep changing the text updates with a slightly different version to

- prove it's a human that typed it

- there is code that prevents repeating the same message twice

either way it's entertaining... But it's Monday morning in Australia and we need to release! (yep we do this via pr/tagging etc.)

rococode · on Oct 22, 2018

I like how the times are unevenly spaced too haha. I'm imagining a bunch of devs sitting at home in their pajamas (Sunday night) talking on Slack as they try to fix the site and every time there's a lull one person's like "hey we should probably refresh the status again".

Fuzzwah · on Oct 22, 2018

It is to ensure that the updates get mirrored onto twitter, where exact duplicates can't be posted.

hiccuphippo · on Oct 22, 2018

Isn't it enough to delete the older tweet? Or maybe just add a timestamp to the message.

Off-Topic: If it's not possible to write a twit with the exact text from a deleted twit, then a way to prove someone wrote a twit and then deleted it would be to have them try to write it again.

askmike · on Oct 22, 2018

Deleting tweets is a terrible workaround. Once github start tweeting and people start linking to those tweets they can't go ahead and delete them 50 minutes later..

owl57 · on Oct 22, 2018

Re: off-topic: You have to do it fast. You certainly can tweet a lot of identical tweets if you aren't in a hurry: there is a "Rock in the forest"[1] that tweets "Nothing happened today" (in Russian) every day for several years, and most of the time the text is exactly the same. Interestingly enough, the time of tweet varies widely from day to day, so it looks like a human or a program specifically made to imitate one.

[1] https://twitter.com/kamen_v_lesu

hnarn · on Oct 22, 2018

If you include a unique hashtag, would the message be considered unique? In that case, you could include the epoch timestamp as a hashtag :-)

drb91 · on Oct 22, 2018

Or even just prefix the tweet with the update ordinal.
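
The two ideas above (an epoch timestamp as a hashtag, or an ordinal prefix) can be sketched as follows. This is a minimal illustration only: the function names are made up for this example, and nothing in the thread confirms whether Twitter's duplicate check would treat either variant as a distinct message.

```python
import time


def with_epoch_hashtag(message, now=None):
    """Append the Unix epoch time as a hashtag so repeated text differs."""
    epoch = int(time.time() if now is None else now)
    return f"{message} #{epoch}"


def with_ordinal(message, ordinal):
    """Prefix the message with its position in the sequence of updates."""
    return f"Update {ordinal}: {message}"


# Status text quoted elsewhere in this thread.
status = "We are continuing to repair a data storage system for GitHub.com."

print(with_epoch_hashtag(status, now=1540166400))
print(with_ordinal(status, 7))
```

Either way the wording of the update stays identical; only the suffix or prefix changes between posts.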

dataflow · on Oct 22, 2018

They can't alternate?

OJFord · on Oct 22, 2018

For several hours it was alternating.

aapeli · on Oct 22, 2018

https://github-status-generator.com/

Ayesh · on Oct 23, 2018

I was thinking this guy must be from Australia to be annoyed this much, and turns out he is!

jedberg · on Oct 22, 2018

> But it's Monday morning in Australia and we need to release!

Ironic isn't it? The whole point of git was to be distributed, and yet we're at a place where a bunch of companies can't deploy software when a single git provider is down. I myself am in the same boat. Sure, I could reconfigure for a different repo, but is it even worth the effort?

beatgammit · on Oct 23, 2018

I've seen some that are spaced at regular intervals with very subtle changes between messages, none of which actually say anything. I really like Github's messages since they seem like they're actually being updated by a human.

ajb257 · on Oct 22, 2018

Given the latest update is "We are currently in the later stages of a restore operation, with the aim of serving fully consistent data within the next 2 hours", I can only assume someone at GH read your comment!

majewsky · on Oct 22, 2018

Hacker News is, surprisingly, not the center of the universe.

r_singh · on Oct 22, 2018

Just a few weeks ago my organization was in the position of choosing a version control platform for our repos. I'm so glad we went ahead with self hosted gitlab. We installed it on a CentOS server on our premises, SSL'd via Let's Encrypt, I've even set up a dedicated gitlab runner to use Gitlab CI for continuous delivery and so far the testing is progressing pretty smoothly.

All this for $0.

Update:

I agree, gitlab and github will cost the same in the long run. Time costs money and self hosted has a lot of downsides as well.

However, gitlab being self hosted (and not costing anything in terms of separate invoice) makes us feel like we're more in control and are somehow saving some money (as we add more developers as well). This comment here was just sparked by a sigh of relief because this week is an important release for us and had we been on github this monday wouldn't have started too well.

Overall, all the comments listing the trade-offs are true. Time costs $$$ and a paid hosted solution is worth the minimal cost, time and expertise required.

Cthulhu_ · on Oct 22, 2018

What did the server cost? How many hours did you spend on it? What's your SLA? Who will be woken up at night when there's an outage? What's your backup and recovery procedure?

I mean kudos to you for setting it up but it's a bit naive to believe it's better than a hosted solution right away.

r_singh · on Oct 22, 2018

Fair questions.

We're a company making WISP/ISP software here in India and our cloud (hosted) offering is also hosted on servers at our office itself[1]. Our infra manages hundreds of thousands of internet subscribers in India.

We're a small tech company with 25-30 people or so, with linux kernel experts who have experience of running software on questionable hardware so that service providers can take internet to new places in India, that experience helps a lot in keeping all our services up with minimum downtime.

Our office runs 24x7 with support. If anything goes down our business will be in jeopardy so we have protocol to be back up in minimum amount of time.

How many hours we spent on it? Less than a couple hours on setting up the self hosted gitlab instance. More time on getting people to get used to it and setting up the runner for CI, but I would have to spend that time on other paid solutions too.

As for time costing $0, definitely not, but people in India don't cost nearly as much as they do in the USA. Also because of experience with self hosting it is less of a deal for us to spend time on this.

[1] we have multiple bandwidth backup, electricity backup, etc needed to self host and we've always stuck with self hosting.

simonjgreen · on Oct 22, 2018

What do you make out of curiosity? I run an ISP in the UK.

r_singh · on Oct 22, 2018

Awesome!

We have SaaS offering for provisioning, subscriber management (AAA, Radius), billing, OSS, CRM, customer portals (including mobile apps) and more!

Please let me know if you'd like to know more, also I would really love to know what software you use to manage your users, etc. and what WISP software is popular in the UK. Thanks!

k_ · on Oct 22, 2018

I agree that most of these are good points, although they can be worth it depending on your priorities.

As for this:

> Who will be woken up at night when there's an outage?

Unless they are working with teams around the world, does it matter if their server is down the whole night, as long as they can get it back up first thing in the morning?

codedokode · on Oct 22, 2018

There is a big difference though: self-hosted Gitlab won't kick them out:

> GitHub has the right to suspend access to all or any part of the Website or your Account at any time, with or without cause, with or without notice, effective immediately.

https://help.github.com/articles/github-corporate-terms-of-s...

ti_ranger · on Oct 22, 2018

I am not the OP, but in my previous job, we ran infrastructure that could absolutely NOT be cloud hosted (network management control plane for an ISP), and had to be geo-redundant (3 DCs in our country, which also hosted the core routers and other data-plane components). The ISPs applications also ran on the same infrastructure.

So, we already had a 3-DC virtualisation platform.

We also chose Gitlab, with a primary in our primary DC (where dev/QA infrastructure resided), with daily full backups rsynced (as created by Gitlab) to the standby gitlab instance (and backups from the standby rsynced to the primary) and repos synced hourly. Plus, all volumes on the SAN-attached storage array were snapshotted and backed up to virtual tape in both sites.

The 3rd DC was too small for the same footprint, but had a smaller SAN array and we would be able to host the gitlab instance there in ~1 hour if both the primary and standby failed.

Total cost:

* 2 VMs with 8GiB ram. All other costs were negligible.
* 2x50GiB SAN volumes (at ~ $1/GB for capex+5-year-opex)
* Some virtual tape costs (negligible to our other backup costs)

Time:

* Approx 1 hour to set gitlab up
* Approx 1 hour to setup backup replication and test restores
* Approx 1 hour spent on Gitlab upgrades per year

Benefits:

* Integration with other on-premise infrastructure we would prefer not to expose to the 'cloud' via e.g. VPC gateways etc.
* Unlimited (except by ram/storage assigned to VMs) private repos
* Unlimited (except by ram/storage assigned to VMs)
* Ability to deploy applications regardless of any other external outage (network, cloud etc.): priceless

I am not aware of any failures in the first 3 years of operation.

> What did the server cost?

Negligible, we already had 12 physical hosts with 256GiB ram each in the primary and secondary site

> How many hours did you spend on it?

3 more hours than we would have if we had used a hosted solution, if the hosted solution would have cost 0 hours to integrate with our on-premise infrastructure in non-internet-exposed DMZs.

> What's your SLA?

Our standard SLA was 98.5% availability of all internal services measured monthly with 1h MTTR.
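
To put that figure in perspective, here is a rough worked calculation of the downtime such an SLA allows. It assumes a 30-day month (an assumption made for this sketch, since month lengths vary); the function name is made up for the example.

```python
def downtime_budget_hours(availability_pct, days_in_month=30):
    """Hours of downtime per month permitted by an availability target."""
    total_hours = days_in_month * 24
    return total_hours * (1 - availability_pct / 100)


budget = downtime_budget_hours(98.5)
print(f"Allowed downtime at 98.5%: {budget:.1f} hours per month")

# With a 1h MTTR, this is roughly how many incidents fit in the budget.
print(f"Incidents at 1h each: {int(budget)}")
```

Under that assumption, 98.5% monthly availability leaves about 10.8 hours of downtime per month, i.e. around ten incidents if each is resolved within the 1h MTTR.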

> Who will be woken up at night when there's an outage?

Same team that gets woken up for the other ~200 VMs that are more critical than the git repo

> What's your backup and recovery procedure?

Assuming we couldn't recover the primary:

* Ensure the primary gitlab instance isn't taking writes/updates (e.g. disable the virtual NIC)
* Start a restore from the latest backup on the standby
* Check if there were any commits to the git repos since the last backup; if so, push them to the recovered gitlab instance from the clones
* Check that the repos are accessible and working
* Flip the DNS from git.company to point to git.secondary-dc.company instead of git.primary-dc.company

We built everything to be able to fail over from the primary to secondary site and had tested all failovers (application servers, multiple databases, authentication servers, monitoring, infrastructure management applications etc.).

> but it's a bit naive to believe it's better than a hosted solution right away.

Depends on your use case and existing infrastructure/skills/operational posture etc.

vbezhenar · on Oct 22, 2018

Git is tool for developers. If developer can't fix broken server, he's doing something wrong. No need to outsource trivial tasks.

codetrotter · on Oct 22, 2018

Time spent fixing broken servers is time not spent working on other things. In some cases it makes sense to spend your time that way, in other cases it does not.

softawre · on Oct 22, 2018

Might as well have your developers do the support and the testing and answering the phones and cleaning the bathrooms too, eh?

vbezhenar · on Oct 23, 2018

Support and testing may be useful. Cleaning bathrooms no.

dustinmoris · on Oct 22, 2018

> makes us feel like we're more in control and are somehow saving some money

This is precisely the problem with developers. You decided to go with a complex self built solution because it "feels" cheaper. If only I would have received a $ every time someone made a wrong decision based on a feeling I'd be a rich man today.

r_singh · on Oct 22, 2018

True, enthusiastic devs are definitely bad at amortizing their own time’s cost.

lowry · on Oct 22, 2018

There's no cloud. Just someone else's computer.

akerl_ · on Oct 22, 2018

I feel like this is the first chapter of a story whose ending is probably happy, but whose plot twist is "and then the on-prem server crashed and it took a week to figure out how to fix it".

Self-hosting vs 3rd party hosting has a ton of tradeoffs, but systems can fall over no matter who's hosting them.

londons_explore · on Oct 22, 2018

"and then the service we were using got 'sunsetted' and we spent weeks migrating to a new incompatible provider".

With hard drive failure rates of 1.6% per year, and all other bits of computer hardware being very reliable, I would guess that an unmaintained self hosted solution will have a longer 'time to failure' than a service provider.
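
As a back-of-the-envelope check on that guess, the sketch below treats the 1.6% figure as a constant, independent annual failure probability for a single drive and ignores every other failure mode. Those are simplifying assumptions of this example, not claims from the comment, and the function names are made up for illustration.

```python
ANNUAL_FAILURE_RATE = 0.016  # the 1.6% per year quoted above


def survival_probability(years, afr=ANNUAL_FAILURE_RATE):
    """Chance a single drive is still working after `years` years."""
    return (1 - afr) ** years


def mean_years_to_failure(afr=ANNUAL_FAILURE_RATE):
    """Expected years until failure under a constant annual rate."""
    return 1 / afr


for years in (1, 3, 5, 10):
    print(f"{years:>2} years: {survival_probability(years):.1%} chance of no failure")

print(f"Mean time to failure: {mean_years_to_failure():.1f} years")
```

Real drives do not fail at a constant rate over their lifetime, so this is only a rough way to turn the quoted percentage into a time scale.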

The main benefit of a hosted service IMO is less setup time and a support team to help you when they inevitably 'retire' the 'old' api.

r_singh · on Oct 22, 2018

This would definitely be true if we didn't have experience with self hosting services that are critical to the businesses of our customers.

Luckily for us, we do have experience with figuring out servers and are usually required to fix crashes within minutes rather than days.

danpalmer · on Oct 22, 2018

Your time isn’t free.

GitLab may well be the right choice for many teams, but I’ve not yet seen price be the winning factor over the long term. Between general sysadmin tasks, scaling with the size of your team, doing upgrades, and performing database migrations, the cost of running GitLab will probably be the same ballpark as GitHub.

r_singh · on Oct 22, 2018

This is true, we've always had troubles with amortizing costs for time spent on doing something.

We're a profitable software company that intends to remain in business forever, we're well staffed though and people do have time to try out things and even do R&D on the side. Frugality is built in to our DNA and having spare time on top means failure at amortizing time costs.

KaiserPro · on Oct 22, 2018

Hosted gitlab has more features than github. But you are on the hook for security, config and uptime.

However, it might seem a wise choice given gitlab's history of breaking every week and a half. (However that has improved significantly in recent months.)

Gitlab is a far better tool for enterprise, as it can group repos into projects and have a shared issue tracker at a project level (as well as repo). It also has first class CI integration (gitlab runners). All of this beats paying for github[1]

[1] I worked for a very large financial news company that has partially migrated from stash/bitbucket to github. With a mix of public/private repos it's a total mess. The tools for managing more than 40 repos simply don't exist. Not only is it more expensive, it doesn't integrate well with Jira (well, that's not entirely github's fault); it literally is just a git HTTP web interface, with a ticketing system bolted on.

dsumenkovic · on Oct 22, 2018

Hello, Community Advocate from GitLab here.

Thank you for sharing your feedback and comments with everyone, we really appreciate that. It's great to hear how you are handling your self-hosted GitLab instance, that's awesome!

onion2k · on Oct 22, 2018

> I'm so glad we went ahead with self hosted gitlab.

You say that now but if you're at the sort of scale where you might have the same issue Github is having you'll be a lot less glad someone else isn't fixing it. Especially if it's 3am. On a Saturday. In the middle of your vacation.

Gitlab is awesome but it's not invincible. Problems happen regardless of what solution you pick. I'm happy to pay Github to resolve them for me faster than I could do myself, and better, and with solutions in place up front to make sure I don't actually lose any data.

deadbunny · on Oct 22, 2018

I think you've forgotten this is HN, a place where operational issues are "easy", till they happen. Then they realize they might actually need to hire someone who has a clue what they're doing when it comes to operations that don't involve turning it off and on again.

yread · on Oct 22, 2018

You value your time at 0$ ?

r_singh · on Oct 22, 2018

Definitely not, that was just a hyperbole sparked by a sigh of relief that our release won't be affected because of this outage.

dis-sys · on Oct 22, 2018

you are basically assuming that your deployed version of GitLab is bug free and your hardware will be able to serve you 24/7 without downtime in the next few years.

r_singh · on Oct 22, 2018

I'm confident about the downtime not being an issue for us because of experience with all kinds of server issues.

I'm interested to see what our experience with bugs and updates is in the future, my happiness could very well be premature.

ksec · on Oct 22, 2018

The only reason for sticking with GitHub is UX. Gitlab is expanding in all directions; maybe they will catch up.

ryanmccullagh · on Oct 22, 2018

It's definitely not zero cost. Your time and maintenance efforts will cost you time.

woolvalley · on Oct 22, 2018

+ cost of additional labor

TanakaTarou · on Oct 22, 2018

Plus cost of nobody can work because their single server crashed

Grue3 · on Oct 22, 2018

Git is a distributed system. People can work if they want to, but it's certainly a convenient excuse to not be working.

deadbunny · on Oct 22, 2018

Sure, but the tooling around git isn't. Hard to do things if you can't make PRs, which trigger tests, which trigger builds, which trigger test deployments, which trigger tests, etc...

ComputerGuru · on Oct 22, 2018

I can't add comments to pull requests ("you can't do that right now") and any commits pushed to branches are not updating the visible status in the web interface, nor are newly created branches showing up. However, if you navigate to a new commit directly with its SHA (so you can share it with someone if you really want), it'll show up (so they're just not being indexed).

EDIT:

Obligatory "that's what happens when the whole world relies on a centralized git repo" and a reference to gitea, which has a very slick github-esque UI and is incredibly easy and light to deploy/run (on an existing server, your own PC, a raspberry pi, a docker vm, or whatever): https://gitea.io/en-us/

rodorgas · on Oct 22, 2018

I’m receiving emails (a lot of duplicates) from comments in PR, but they won’t show up in the browser. I guess people are trying to submit multiple times, the email is sent but comment isn’t posted.

majidazimi · on Oct 22, 2018

Welcome to eventual consistency.

hnarn · on Oct 22, 2018

Mañana Consistency(tm)

samuel1604 · on Oct 22, 2018

yes, but does it scale tho ? is it DR ? I mean sure fine it's easy and pretty but there is some operation challenge to figure out,

perlgeek · on Oct 22, 2018

If you host just your own Open Source projects, does it need to scale?

metildaa · on Oct 22, 2018

Likely not, you can probably even host your friends too without much server load. Scaling is a concern when you have many users, just a handful of users won't tend to create much load.

biddlesby · on Oct 22, 2018

The status updates are bearing a remarkable resemblance to a Windows loading bar.

• We aim to serve fully consistent data within the next 2 hours.

• 45 minutes later. On track for serving fully consistent data within the next 1.5 hours

• 45 minutes later. On track to serve consistent data within the hour.

• One and a half hours later. We estimate they will be caught up in an hour and a half.

tjoff · on Oct 22, 2018

I appreciate that they try. As a user I would never rely on those estimates, but it at least gives you a sense of what's going on and what they are doing. I take that over the more common silence or "we are working on it" any day.

Ma8ee · on Oct 22, 2018

Exactly. We all know not to trust that those estimates are exact, but they are much better than nothing. When they say nothing we don't know if it's half an hour, half a day or half a week.

radiospiel · on Oct 22, 2018

Well, so they say it is within the hour, but we still don't know if it's half an hour, half a day or half a week.

fredley · on Oct 22, 2018

They got to the end of the restore, realised there was a problem, and had to start over.

tomas789 · on Oct 22, 2018

My Jenkins is creating CI Builds like crazy. If your CI is on the pay-as-you-go service, make sure you are not burning money (or credits).

benmmurphy · on Oct 22, 2018

My guess at what happened:

They had a split brain when multiple masters were running. Then they were not able to choose a master to keep because the data in both masters was 'corrupted' so they are now restoring from a backup.

So how do they get data corruption from multiple masters running:

1) Performing reads from slaves during an update operation. If you perform a read from a slave then you might get data from the other master. If you update data on one master based on data from another master then you get data corruption. Probably they don't do this much because if you had any slave lag then you would notice this problem during normal operation. However, they might do it for checking permissions. You can imagine because this data almost never changes it would never show up as a problem normally.

2) Data stored outside of the database. This would be the repositories themselves and cookie-session storage. Imagine repos have an incrementing id. Then if you have two masters a repo gets created with the same id on both masters. This is very bad because now two people can see each others data. You have the same problem with cookies. Imagine if you have user id as an incrementing id. The two masters create a user with the same id and an encrypted cookie or another storage system (redis) stores the user id. Now depending on which master you get routed to you appear as a different user.
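
The id collision described in 2) can be reproduced with a toy model. This is a sketch under stated assumptions: `Master` is a made-up in-memory class standing in for a database primary with an auto-incrementing key, and it says nothing about how GitHub's systems actually allocate ids.

```python
import itertools


class Master:
    """Toy primary that hands out auto-incrementing user ids (hypothetical)."""

    def __init__(self, name, next_id):
        self.name = name
        self._ids = itertools.count(next_id)
        self.users = {}

    def create_user(self, login):
        user_id = next(self._ids)
        self.users[user_id] = login
        return user_id


# Both masters start from the same replicated state: next free id is 101.
master_a = Master("a", next_id=101)
master_b = Master("b", next_id=101)

# During the split, each side accepts a signup independently.
alice_id = master_a.create_user("alice")
bob_id = master_b.create_user("bob")

# A cookie that stores only the numeric id is now ambiguous: which user
# it resolves to depends on which master serves the request.
session = {"user_id": alice_id}
for master in (master_a, master_b):
    print(master.name, "resolves the session to", master.users[session["user_id"]])
```

Both signups receive the same id, so the same session maps to a different account on each side of the split.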

Weirdly enough this would always be a problem based on how their failover system works. The only safe way i know how to turn a master-slave system into a safe HA system is how joyent does it (https://github.com/joyent/manatee/blob/master/docs/user-guid... : you basically need a vote at commit time. if you have this then only 1 master can commit). However, I'm guessing most of the time they only have two masters running for < 1 minute but probably this time they had two masters running for a long time.
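
The "vote at commit time" idea can be illustrated with a generic majority-quorum sketch. This is not Manatee's implementation, which the truncated link above documents; `Voter`, `try_commit` and the notion of a leadership "term" are assumptions introduced here purely to show why only one master can gather a majority.

```python
class Voter:
    """Toy voter that remembers the newest leadership term it has promised."""

    def __init__(self):
        self.promised_term = 0

    def promise(self, term):
        """Election step: side with the newest term seen so far."""
        if term > self.promised_term:
            self.promised_term = term
            return True
        return False

    def ack_commit(self, term):
        """Acknowledge a commit only from the term this voter last promised."""
        return term == self.promised_term


def try_commit(term, reachable_voters, cluster_size):
    """A commit succeeds only if a strict majority of the cluster acks it."""
    acks = sum(voter.ack_commit(term) for voter in reachable_voters)
    return acks > cluster_size // 2


voters = [Voter() for _ in range(3)]

# Term 1: the old master is elected by every voter.
for voter in voters:
    voter.promise(1)

# Partition: the new master wins term 2 using the two voters it can reach.
for voter in voters[1:]:
    voter.promise(2)

old_master_ok = try_commit(1, voters[:1], cluster_size=3)  # reaches 1 voter
new_master_ok = try_commit(2, voters[1:], cluster_size=3)  # reaches 2 voters

print("old master can commit:", old_master_ok)
print("new master can commit:", new_master_ok)
```

Because two majorities of the same cluster always overlap, the old master cannot collect enough acknowledgements once a majority has promised a newer term, even if the partition heals.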

EDIT: oh. they said it affected issues & pull requests [they have different dbs for different stuff] so repo and authentication stuff wouldn't apply. oh, they said no data was lost as well so presumably that would exclude a split brain running for a long period of time.

sudhirj · on Oct 22, 2018

Split brain can run for a while if all the identifiers are UUIDs and tables are used as append only. Restoration is complex, though.
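
A minimal sketch of why that combination helps, assuming rows are plain dictionaries keyed by a random UUID and are never updated after insertion (the helper names are invented for this example):

```python
import uuid


def new_row(body):
    """Create an immutable row identified by a random UUID."""
    return {"id": str(uuid.uuid4()), "body": body}


def merge_append_only(*tables):
    """Union rows from several diverged copies of an append-only table."""
    merged = {}
    for table in tables:
        for row in table:
            # Rows are never updated, so the same id implies the same row.
            merged.setdefault(row["id"], row)
    return list(merged.values())


shared = [new_row("comment written before the partition")]
master_a = shared + [new_row("comment written on master A")]
master_b = shared + [new_row("comment written on master B")]

merged = merge_append_only(master_a, master_b)
for row in merged:
    print(row["id"], row["body"])
```

The merge keeps all three rows without any id collisions. It does not recover a global ordering of the writes or reconcile anything that was updated in place, which is part of why restoration remains complex.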

benmmurphy · on Oct 22, 2018

depending on your application logic you can get external inconsistencies. like your typical CRUD app might dump the state of an entity in an edit form, then a user might edit one field, and the whole form is sent to the backend and written over whatever exists there. if you have two different masters then you can get a series of updates where you can't tell if the update is because the user meant to make the change or if it was because the change was incorrectly propagated from the other master. like even if you knew the user read from the other master you can't tell if the user intended the change or not. maybe they saw the value was the value they wanted so they didn't change it.

i guess this is a 'bug' because when you do stuff like this you should include some kind of version identifier to catch a concurrent update in the normal case. this scenario is also 'storing stuff outside the DB' kind of similar to the user_id in the cookie getting out of sync.
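
The "version identifier" mentioned above is usually called optimistic concurrency control. Here is a minimal in-memory sketch; `Store` and `StaleWrite` are hypothetical names, and a real application would typically do the check inside the database update itself. It covers the normal single-master case; with two masters each side could advance versions independently, so this alone would not resolve a split brain.

```python
class StaleWrite(Exception):
    """Raised when a form was loaded from an older version of the row."""


class Store:
    """Toy entity store that versions every row (hypothetical)."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, fields):
        self._rows[key] = {"version": 1, "fields": dict(fields)}

    def read(self, key):
        row = self._rows[key]
        return row["version"], dict(row["fields"])

    def update(self, key, expected_version, fields):
        row = self._rows[key]
        if row["version"] != expected_version:
            raise StaleWrite(f"{key}: expected v{expected_version}, found v{row['version']}")
        row["fields"] = dict(fields)
        row["version"] += 1
        return row["version"]


store = Store()
store.insert("issue-1", {"title": "Old title", "body": "Old body"})

# Two users open the edit form at the same version.
v_first, form_first = store.read("issue-1")
v_second, form_second = store.read("issue-1")

# The first user changes the title and saves.
form_first["title"] = "New title"
store.update("issue-1", v_first, form_first)

# The second user changes only the body, but the whole form is submitted,
# so without the version check the old title would overwrite the new one.
form_second["body"] = "New body"
try:
    store.update("issue-1", v_second, form_second)
    rejected = False
except StaleWrite as error:
    rejected = True
    print("rejected:", error)
```

The second save is refused instead of silently reverting the title, so the application can reload the row and ask the user to reapply the change.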

theSage · on Oct 22, 2018

I'm really looking forward to the post mortem that comes out of this (if it does). I always learn a lot from reading those.

reindeerer · on Oct 22, 2018

[flagged]

JoyrexJ9 · on Oct 22, 2018

GitHub are not part of Microsoft yet, the deal has only just been approved. They are still totally separate, and will be for some considerable time.

Also Microsoft post plenty of postmortems, like this detailed one from the VSTS outage in Sept. https://blogs.msdn.microsoft.com/vsoservice/?p=17485

_shadi · on Oct 22, 2018

I always thought it was silly that we pay for both private repos on github and host our own enterprise github on premise; turns out whoever set this up knew what they were doing.

olingern · on Oct 22, 2018

Local to Tokyo, Github has been down for most of the day. I never realized how much of my day centers around it: pr review, creating / commenting on issues, etc.

Looking forward to the write-up. I'm also curious as to how this significant outage lines up with their SLA for enterprise users.

k_ · on Oct 22, 2018

An update has been posted there: https://blog.github.com/2018-10-21-october21-incident-report...

> At 10:52 pm Sunday UTC, multiple services on GitHub.com were affected by a network partition and subsequent database failure resulting in inconsistent information being presented on our website. Out of an abundance of caution we have taken steps to ensure the integrity of your data, including pausing webhook events and other internal processing systems.

> We are aware of how important our services are to your development workflows and are actively working to establish an estimated timeframe for full recovery. We will share this information with you as soon as it is available. During this time, information displayed on GitHub.com is likely to appear out of date; however no data was lost. Once service is fully restored, everything should appear as expected. Further, this incident only impacted website metadata stored in our MySQL databases, such as issues and pull requests. Git repository data remains unaffected and has been available throughout the incident.

> We will continue to provide updates and an estimated time to resolution via our status page.

k_ · on Oct 22, 2018

1.5 hours later (a couple minutes ago), despite an estimated time of less than an hour:

> We continue to monitor restores which are taking longer than anticipated. We estimate they will be caught up in an hour and a half.

reidrac · on Oct 22, 2018

Someone posted a comment on an open issue, I got the mail, but there's nothing on the web interface.

What I don't understand is why they don't set it all to "read only" until the problem is sorted. Looks like any update goes to /dev/null, just let your users know!

(unless they plan to replay those updates, somehow)

askmike · on Oct 22, 2018

I don't think it's easy to predict what is going to happen to all updates now. They run a big distributed infrastructure. I doubt they can even predict which updates will be committed and which ones will not.

If replaying is even on the table (which sounds very dangerous to me), that requires a huge coordination and I am pretty sure they are not able to tell right now if that is going to work once the actual issue is fixed.

Better to stay quiet while fixing the issue and only say things that you actually know 100% to be correct.

dfcowell · on Oct 22, 2018

PR comments are also failing with HTTP 405 error code.

Great way to start the week.

keyle · on Oct 22, 2018

Well for the poor sods fixing it in the US, it's still Sunday evening...

geerlingguy · on Oct 22, 2018

It's when I typically get an hour or two to crank out some open source PRs and issue queue cleanup. Sadly, the outage means I don't get that time to devote this week :(

hayd · on Oct 22, 2018

Surely they have SRE teams around the world?

megakid · on Oct 22, 2018

I'm trying to add a new user to my organisation (Monday morning, new joiners...!) and despite getting confirmations I purchased another seat, I cannot invite them. Or when I do, those seats are gone, guess it's down to which data store I'm hitting on each request.

calmconviction · on Oct 22, 2018

They probably moved their storage backends to Windows 10

geggam · on Oct 22, 2018

Interesting how everyone uses a tool designed to eliminate SPOF in a way it has a SPOF.

geerlingguy · on Oct 22, 2018

GitHub does not equal git; the reason I’m paused is because I use Github’s issue queues to organize my OSS work. Much easier than self hosting an issue repository/bug tracker. I am still able to do all my work, run new containers, etc., but GitHub is more tied into business processes than actual code (which is what causes the pain during these outages).

dfcowell · on Oct 22, 2018

This is certainly making me rethink the workflows we are using in my team, particularly those around PRs and code review.

stephenr · on Oct 22, 2018

Obligatory: to all the people running around like de-headed chickens: this is what happens when you focus on a service not on the dvcs.

Your code and your wiki (if you use it) are already in git repos. If the industry moved forward on issues in dvcs repos, you have another central point of failure removed.

You also have less hassle to work across dvcs hosts, which is exactly why GitHub would never embrace this.

6t6t6t6 · on Oct 22, 2018

Next: "We are restoring tape backups from some of our storage systems"

6t6t6t6 · on Oct 22, 2018

"We are continuing to repair a data storage system for GitHub.com. You may see inconsistent results during this process."

josteink · on Oct 22, 2018

Seems like they're on their way back up:

> We are currently in the later stages of a restore operation, with the aim of serving fully consistent data within the next 2 hours.

nodesocket · on Oct 22, 2018

I created a gist four hours ago, which as of now is not returning and showing a 404. Hopefully the gist did not get lost.

mattio · on Oct 23, 2018

They started to replay all webhook events yesterday evening, triggering all kinds of events in our systems, as deployments and the like :-(

gavreh · on Oct 22, 2018

My issue comments are not being saved, and if I do a tag push I don't see that reflected on the "releases/tags" area of GitHub.

diegoperini · on Oct 22, 2018

I just lost my session.

shry4ns · on Oct 22, 2018

I don't know if someone's mentioned it yet but TravisCI is also not running on the commits that do show up on git, in Houston

_lffv · on Oct 22, 2018

Earlier GitHub pages was "down for maintenance". Probably not related but might be of note.

(source: my website is hosted on github pages)

beamso · on Oct 22, 2018

Seeing a lot of issues with Pull Requests.

igni · on Oct 22, 2018

Before all the trolling about Microsoft starts up, does anyone have current information on what these systems are?

In the enterprise space, a 'data storage system' could be an Array or a SAN or a lightpath etc, with usually quite long failover times. For an org like GitHub I'd think more like an object store (an Array-of-Hosts, if you will) or whatever storage mechanism holds their database files. Do they self host this sort of thing or is it an AWS/GCE/Azure service?

FWIW, all git commands are working fine for me (create a branch, push, colleagues can fetch my branch), but the UI doesn't show my branch & nor can I review/comment on PRs.

sciurus · on Oct 22, 2018

For storing things other than git repos, GitHub is heavily invested in MySQL. AFAIK all of GitHub is hosted on their own hardware.

https://githubengineering.com/mysql-high-availability-at-git...

rurban · on Oct 22, 2018

This blog post describes exactly the scenario we were experiencing here. A master (single writer) failure, with missing failover. You can only guess what went wrong with this plan. Looks good on paper, but some unexpected network or HW or routing problem could have caused a problem in identifying the single writer.

reindeerer · on Oct 22, 2018

I think it's obvious that it's their SQL storage that holds the website up that is pretty much in read-only mode, and has been for a couple hours now, not the git repos themselves.

geerlingguy · on Oct 22, 2018

Yeah, been trying to post comments and new issues and keep getting "405 Not Allowed" responses.

friedman23 · on Oct 22, 2018

I can push a branch but I cannot access the pull requests that I create off of that branch

dvfjsdhgfv · on Oct 22, 2018

[flagged]

jononor · on Oct 22, 2018

It is incredibly unlikely that there is any Azure in the production infrastructure. The acquisition is too recent for big changes like that.

paavohtl · on Oct 22, 2018

It's actually so recent GitHub hasn't even been acquired yet. That's going to happen some time next year.

dvfjsdhgfv · on Oct 22, 2018

I admire your confidence but will withhold my judgement until the postmortem is published.

karaokeyoga · on Oct 22, 2018

Is it just me or is that status message wonky? Failing over something to restore access?

radicality · on Oct 22, 2018

"Failing over" in this context most likely means either migrating to some replica (say doing a dead master mysql promotion) or spinning up some backup storage system to serve master reads/writes.

cesarb · on Oct 22, 2018

In this context, "failing over" means switching the active system to a replica. Which means their primary data storage system had some issue, and they were switching to a secondary data storage system, which hopefully contains the same data (replicated in real time, or nearly real time). Clearly that switch didn't quite work as expected, otherwise it would have taken just a couple of minutes before everything went back to normal...
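
As a rough illustration of that switch, here is a toy sketch of the decision logic. It is not how GitHub's tooling works; `Node`, `choose_primary`, and the 5-second lag threshold are made-up names and values, assuming each node reports a health flag and a replication lag.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool
    replication_lag_seconds: float = 0.0


def choose_primary(current_primary, replicas, max_lag_seconds=5.0):
    """Keep the current primary if it is healthy, otherwise pick a replica.

    Only healthy replicas within the lag threshold are candidates, and the
    one with the least replication lag is promoted.
    """
    if current_primary.healthy:
        return current_primary
    candidates = [
        r for r in replicas
        if r.healthy and r.replication_lag_seconds <= max_lag_seconds
    ]
    if not candidates:
        # No replica is close enough to the primary's data: stop and escalate.
        raise RuntimeError("no replica is safe to promote")
    return min(candidates, key=lambda r: r.replication_lag_seconds)
```

The `RuntimeError` branch is the interesting one: when every replica is unhealthy or too far behind, there is nothing safe to promote automatically, and recovery becomes a manual job.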

rphillips · on Oct 22, 2018

Sounds like an engineer knee deep in diagnosing the issue.

diegoperini · on Oct 22, 2018

That person needs our support more than anyone else right now.

samontar · on Oct 22, 2018

It’s just a variant referencing https://en.m.wikipedia.org/wiki/Failover

karaokeyoga · on Oct 22, 2018

failing -> migrating

pinneycolton · on Oct 22, 2018

Well, they've got the "failing" part right. It's the "over" that's taking a while ;)

hartator · on Oct 22, 2018

I can access GitHub.com here (Austin,TX), but it's taking forever.

mellisdesigns · on Oct 22, 2018

What particular services are down? I noticed I can hit the UI.

avip · on Oct 22, 2018

For me: can't fork, can't login (tested incognito), can't clone new repos.

BadassFractal · on Oct 22, 2018

Wonder if GitHub is "too useful to fail" at this point. As in, most people and companies won't switch git repo providers unless GitHub is down for many days at a time during the work week.

avip · on Oct 22, 2018

Transition to BitBucket or GitLab is one click away. Companies surely will move if incentives are there.

jononor · on Oct 22, 2018

Not when you use third party integrations. Like when using Travis for CI, and Travis doing your deploy.

Skinney · on Oct 22, 2018

Travis supports both Bitbucket and Gitlab

jwilk · on Oct 22, 2018

No, Travis CI doesn't support any code hosting other than GitHub.

dustinmoris · on Oct 22, 2018

GitHub must be migrating to Azure...

guardian5x · on Oct 22, 2018

Actually, they just announced that they will stay with AWS, as long as things run well. Maybe now they have a reason to.

reindeerer · on Oct 22, 2018

[flagged]

mohammedbin · on Oct 22, 2018

Snark, except snark only comes from people with little knowledge, as in this case you seem unaware that they are a completely independent entity as of today.

benatkin · on Oct 22, 2018

http://howfuckedismydatabase.com/mssql/

julienfr112 · on Oct 22, 2018

[flagged]

CodeM0nkey · on Oct 22, 2018

GitHub has not yet been acquired by Microsoft, they're completely separate.

arthurcolle · on Oct 22, 2018

I wonder if this will vaporize the deal.

manigandham · on Oct 22, 2018

That's not how these deals work. A little infrastructure downtime is not going to stop a 7.5B platform acquisition.

____Sash---701_ · on Oct 22, 2018

What!?

guywhocodes · on Oct 22, 2018

What's strange to me is how many hours we've been given the same message. Today is an important day for my organization and not getting any information that hints how long this will last is a huge problem for our planning.

steventhedev · on Oct 22, 2018

Are you a paying customer of GitHub Enterprise? If not, then you're getting your money's worth.

Snark aside, this is a great time to reassess your deployment strategies and look into things like local apt and pypi proxies. I'm confident you can find similar projects that will transparently cache your dependencies.

guywhocodes · on Oct 22, 2018

I don't know how much my org is paying but it's no small amount, Enterprise no. Are we getting what we pay for? I don't think so, I think an effective SLA of 99% isn't good enough for any SaaS. But of course you understand that moving away from github is no small decision.

This absolutely is a perfect time to assess deployment strategies and challenge all the advice of how big a company has to be before it's worth doing X.
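
To put the 99% figure in perspective, this quick back-of-the-envelope calculation shows the downtime budget each availability target allows per year. The function name is just for illustration, and a 365-day year is assumed.

```python
def allowed_downtime_hours(availability_percent, period_days=365):
    """Hours of downtime permitted over the period at the given availability."""
    total_hours = period_days * 24
    return total_hours * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours/year")
```

At 99%, that is roughly 87.6 hours (about 3.65 days) a year, so a single outage lasting most of a day still fits inside the budget.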

steventhedev · on Oct 22, 2018

It sounds like you're using GH for ticketing also, which means that your R&D org's productivity is tightly coupled with an external company's uptime. Self-hosting Gitlab is easy, as is running Jira, redmine, and a dozen other tools.

If you can't do your job without a tool, then you need to have a plan B on hand for when that tool fails. Both in the micro sense of the tools you use to code (editor, browser, laptop, mouse, coffee mug, etc) and in the macro sense of tools your organization uses (ticketing systems, chat, bathrooms). Show some initiative, figure out some mirrors for your dependencies, and try standing up a local caching proxy for your team. It'll probably take a lot less time than you think.
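
The mirror idea boils down to "try the usual source, then fall back". Here is a minimal sketch, assuming you pass in your own `fetch` callable (for example a thin wrapper around `urllib.request.urlopen`); `fetch_with_fallback` and the source names in the usage example are hypothetical.

```python
def fetch_with_fallback(sources, fetch):
    """Try each source in order and return (source, payload) from the first success.

    `fetch` is any callable that takes a source and returns its payload,
    raising OSError (or a subclass such as ConnectionError) on failure.
    """
    errors = []
    for source in sources:
        try:
            return source, fetch(source)
        except OSError as exc:
            # Remember why this source failed and move on to the next one.
            errors.append(f"{source}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


def fake_fetch(source):
    # Stand-in for a real download: pretend the primary host is down.
    if source == "primary.example":
        raise ConnectionError("host unreachable")
    return f"package bytes from {source}"


used, payload = fetch_with_fallback(["primary.example", "mirror.example"], fake_fetch)
print(used, payload)
```

A real caching proxy does more than this (it stores what it fetched and serves it later), but the ordering of sources is the part that keeps a build going when the upstream host is down.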

samontar · on Oct 22, 2018

If you’re paying for GHE you don’t have any trouble. That’s delivered as an appliance you host.

steventhedev · on Oct 22, 2018

You are quite correct. I had meant the paid hosted version. Maybe they used to call it GHE at some point? Dunno why it got stuck in my mind like that...

OJFord · on Oct 22, 2018

No it isn't: https://enterprise.github.com/features#pricing

akx · on Oct 22, 2018

Yes it is. https://enterprise.github.com/faq#faq-3

sajithdilshan · on Oct 22, 2018

I think it is obvious what happened. They must have moved their backend to Windows 10.

sajithdilshan · on Oct 22, 2018

It's a joke people. Why downvote?

deadbunny · on Oct 22, 2018

Because it's a boring, obvious joke which wasn't funny even in the 90s?

Memosyne · on Oct 22, 2018

Anyone know if this could be related to the Youtube outage? It's been a while since I've seen these big websites go down.

bognition · on Oct 22, 2018

In what way? Youtube runs on internal google hardware, GitHub runs on its own hardware too.

It could be a networking issue, but you'd expect more sites to be impacted.

If it were a software issue, you'd expect a big player like google to be aggressively patching and talking about a bad release.

More likely, it's just a coincidence.

danielhlockard · on Oct 22, 2018

I'm not sure how you could postulate that they're related

ObsoleteNerd · on Oct 22, 2018

Targeted attacks on specific aspects/assets of a major website's infrastructure? Eg target a specific service that both sites have in common in their back end?

Not even sure if that's feasible but it's an idea.

Memosyne · on Oct 22, 2018

I figured it was just a coincidence, but the HN network is so well informed I thought I'd just ask the question on the off-chance it wasn't.

I personally don't find a relation likely.

ObsoleteNerd · on Oct 22, 2018

Oh I find the relation very unlikely. I was just trying to think up a potential connection for fun/curiosity, as the coincidence is pretty crazy that two huge sites known for very few/little outages had major outages in the same week.

_0nac · on Oct 22, 2018

Github is not hosted by Google and was acquired by Microsoft some time back, so that seems... unlikely.

platinium · on Oct 22, 2018

This has been super frustrating, as people have deadlines and are working to finish projects before Monday morning.

What are the (good) alternatives to Github? Gitlab supposedly is Google-backed, so I don't want to have my private code there. Is Bitbucket the only one left?

I don't mind paying monthly, which I already do for GitHub.

Tecuane · on Oct 22, 2018

> Gitlab supposedly is Google-backed, so I don't want to have my private code there.

This strikes me as odd. May I ask why usage of GCP is a deal-breaker for you? While I can understand not wanting to use Google products directly as a consumer, I believe it would be - for lack of a better term - platform suicide for Google to intercept and perform its usual analytical shenanigans on the data content of transmissions to/from their platform.

Either way, Phacility's Phabricator[1] is $20/user/mo.

1. https://www.phacility.com/pricing/

Jedi72 · on Oct 22, 2018

Nobody trusts Google for any reason any more as they have proven unworthy of our trust.

hactually · on Oct 22, 2018

That's a silly statement. By saying 'nobody' - a single point of data invalidates your assertion.

I use gmail, I'm quite happy to trust that contract.

jammygit · on Oct 22, 2018

You're saying that you trust that contract today, or are you saying that you have always trusted that contract?

It's only recently that gmail's contract involved keeping out of your data. I think they also only say they abstain from using your data for targeted advertising, not that they don't use it for other purposes. I haven't read the terms in quite a while though and I could be mistaken.

Great products though. I really do wish I could pay for them in exchange for a real, trustworthy, comprehensive privacy promise.

deadbunny · on Oct 22, 2018

There is a whole world of difference between paid for and not paid for Google services.

naner · on Oct 22, 2018

This type of thing will always be a risk with cloud infrastructure no matter what service you choose.

toomanybeersies · on Oct 22, 2018

You can self-host Gitlab, as well as Gogs, Gitea, and a handful of other solutions.

cmroanirgo · on Oct 22, 2018

If you've a server of your own (and even if you don't, you can self host for a few $/mo), gitea is an easy choice: https://gitea.io

patrickg_zill · on Oct 22, 2018

I have been meaning to evaluate fossil-scm.org for a while...