
GitHub: October 21 Incident Report - pietroalbini
https://blog.github.com/2018-10-21-october21-incident-report/
======
_verandaguy
I saw some comments on reddit which highlighted a pretty serious problem -
many orgs rely on github as a fully integrated CD platform, with everything
from code hosting, to running CI hooks, to pushing to staging or prod.

It seems very unwise to have essentially your whole deployment process in the
hands of an entity which you not only have no control over, but which has
regularly been targeted in attacks by nation-state-level actors because of its
role as a code hosting platform.

EDIT: GH hasn't been targeted regularly, but it has been so historically, so
this is a plausible thing which might happen again.

~~~
Klathmon
It's a tradeoff, just like anything else.

We use Github heavily at my work, and at past jobs as well. But at the same
time it's not the ONLY way we can work. If github has an outage, our CD will
shut down, but we can still run all tests locally, and we can still push to
the server directly, and we can still push code between ourselves manually and
review it.

Sure, it's a hiccup in our day when it goes down, but it's not like the entire
company grinds to a halt. And the alternative of maintaining servers and
systems to replicate all of that would take significantly more time,
potentially cost more, and is probably more likely to go down.

~~~
colonelpopcorn
On-premises git hosting with Gitea or GitLab, with mirrors to GitHub, seems
like a smart idea going forward.
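
The mirroring half of that needs nothing fancy; here's a minimal sketch with
plain git, assuming the repo already exists on both hosts (URLs and names here
are made up):

    # one-time setup on the on-prem side:
    git clone --mirror git@git.internal.example.com:team/project.git
    cd project.git
    git remote add github git@github.com:example-org/project.git
    
    # run periodically (e.g. from cron) to keep the GitHub mirror in sync:
    git fetch --prune origin
    git push --mirror github

(GitLab also ships built-in repository mirroring, if you'd rather not script
it yourself.)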

~~~
elliotlarson
But then you have to host it and maintain it. It's a slippery slope. How many
3rd-party services do you bring in-house with self-hosted open-source
software? Pretty soon you're spending a huge chunk of your time doing ops
work. And where do you host it? On AWS, which can also go down, or on hardware
hosted at your office? With on-premises hosting, now you're in the hardware
game too.

~~~
module0000
Well, your code is your company's IP. It might be prudent to have that IP on-
prem (e.g. GitLab). Every business has varying requirements, but I've yet to
be employed at one using 3rd-party hosting for source control without an
on-site mirror. Disclaimer: my employment has been at megacorps so far;
smaller shops may not do this.

~~~
Klathmon
With git, you have a mirror on every single developer machine (kinda...
depending on what you consider "all" of the code).

We also have a full copy on our CI server (which is hosted on another service,
so still not "in-house"), and we have a copy of at least the master branch
(and all its history) on our production boxes (which also gets pulled into
our whole backup system there).

In a disaster recovery scenario, that's more than enough for me.

Sure, if github blinked out of existence we would probably be at a fraction of
our normal productivity for a while until we fully recover everything and find
new workflows, but the risk vs reward there is well within the margins of what
I'd consider acceptable for a company like us.

~~~
ams6110
> In a disaster recovery scenario

That's not really enough. You need more than "all the code is around here
somewhere"; you need a plan with specific steps that have been tested.

~~~
twunde
Honestly for git, it probably is enough. We're talking about someone deleting
your github account or github closing overnight with no warning (it's been
acquired by Microsoft, so it's much more likely that the company you're
working for will shutter). It should take ~30 minutes to push your repo to
another provider, including looking up instructions. Unlike database backups,
there is rarely any data loss, and any data loss should be recoverable. It's
also not client-facing; it's a temporary problem, similar to the wifi going
down at your office. An inconvenience and hassle, yes. A long-term problem, no.
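
Roughly, assuming you've already created an empty repo on the other provider
(the URL here is hypothetical):

    # from any up-to-date clone:
    git remote add fallback git@gitlab.com:example-org/project.git
    git push --all fallback     # every branch
    git push --tags fallback    # every tag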

Furthermore, the problem with these disaster scenarios is that there are much
more dangerous problems than your account being deleted. Someone with admin
access could insert a back door or sell your source code to someone else.
That's honestly scarier.

~~~
CamTin
That's probably the case for valley-style startups where the whole team can
fit in a room and they all hack on the same handful of repos, but most
"enterprise" customers will have hundreds of repos with not necessarily
anybody hacking on most of them at any given moment. It's very good policy for
such organizations to have a plan in place to "break glass in case Github is
down" with local mirroring of all data and a tested process for doing deploys
without Github.

~~~
module0000
We have exactly that... in our DR plan, there is a section for how to cope
with the 3rd party source control provider being unavailable/compromised/etc.
Update DNS for the equivalent of "upstream-git.foo.com" to an internal
address, and continue business as usual.
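
The mechanics are roughly this (a sketch, using the example name above; in
normal operation the alias resolves to the hosted provider, and during an
outage ops repoints it at the internal mirror):

    # every clone references the stable internal alias, never the provider:
    git remote set-url origin git@upstream-git.foo.com:team/project.git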

It's like you said: smaller shops probably think it's over-planning and
overkill, but we do indeed have hundreds of projects that are "mission
critical" but might not have been touched in 1+ years.

------
shashwat986
Yeah, we're still facing issues with, erm, Github issues.

Also, while they haven't updated this blog post for a while, their status page
has been very up-to-date and informative:
[https://status.github.com/messages](https://status.github.com/messages)

~~~
AdamJacobMuller
> very up-to-date and informative

Is that satire? It said a 2-hour ETA 5 hours ago, and the last update was over
two hours ago.

~~~
lol768
I see an update 7 minutes ago.

>12:56 British Summer Time

>The majority of restore processes have completed. We anticipate all data
stores will be fully consistent within the next hour.

~~~
LeoNatan25
Every hour they promise something will be done until the next hour. I haven’t
been able to work all day so far.

~~~
Piskvorrr
Consider this a lesson on serverlessness. (We have been similarly afflicted,
but their git backend seems to be up; and even further, we have rediscovered
what we stopped paying attention to: that with Git, a centralized repo is just
a convenience, not a requirement.)
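
To make that concrete, here's a rough sketch of moving work between two
developer machines with no central remote at all (paths and hostnames are
made up):

    # package the commits on your branch as a single file:
    git bundle create /tmp/feature.bundle master..feature
    # hand the file over however you like; the recipient runs:
    git fetch /tmp/feature.bundle feature:feature
    # or skip the file and fetch straight from a colleague's clone over SSH:
    git fetch ssh://dev-box.local/home/alice/project feature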

~~~
LeoNatan25
Yes, I agree, but here it’s not up to me to choose the infrastructure.

I wonder what the total cost of this ordeal must be. Surely in the tens of
millions.

------
platinium
On the plus side, this disastrous calamity by Github really made me try out
Gitlab, and in the process I will now set up a second remote on my repos:

[https://stackoverflow.com/questions/11690709/can-a-project-have-multiple-origins](https://stackoverflow.com/questions/11690709/can-a-project-have-multiple-origins)

    
    
      Quoted:
      "Try adding a remote called "github" instead:
    
       $ git remote add github https://github.com/Company_Name/repository_name.git
    
       # push master to github
       $ git push github master
    
       # Push my-branch to github and set it to track github/my-branch
       $ git push -u github my-branch
    
       # Make some existing branch track github instead of origin
       $ git branch --set-upstream other-branch github/other-branch"
    
    

Actually, I don't know why I pay for Github private repos... I might as well
set up two origins, one at Gitlab and one at Bitbucket, for all my private
ones. Then keep Github as a public front-facing portal.
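
If the goal is redundancy on every push, rather than remotes you push to
separately, a single remote can also carry multiple push URLs; a sketch with
placeholder URLs:

    # make every push to "origin" go to both providers:
    git remote set-url --add --push origin git@gitlab.com:me/project.git
    git remote set-url --add --push origin git@bitbucket.org:me/project.git
    # note: once any push URL is set, the fetch URL is no longer pushed to,
    # so add it back explicitly if you still want it:
    git remote set-url --add --push origin git@github.com:me/project.git
    
    git push origin master    # now updates all of them in one command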

~~~
robinhood
Every company has some kind of incident at some point. It just happens. GitHub
is rock solid most of the time and doesn’t deserve to be judged on one major
incident like this. On the contrary, they need our support. Bitbucket and
GitLab have both had problems of the same magnitude.

~~~
platinium
But a multiple-origin solution seems the most sensible. We have failover for
all our infra and services... it seems we now need failover for cloud-hosted
code too. It just seems logical, especially when all 3 of the big cloud
providers have big incidents.

------
jbb67
Ugh, I detest the term "abundance of caution" as used in that message. Weasel
words designed to stop you thinking about the problem, since who wouldn't want
them to be overly cautious?

~~~
deathhand
It's not that bad. X definitely modifies Y, but it may also modify Z. You are
informing the audience of this fact, and it provides them with 'warm fuzzies'.

------
vqng
> During this time, information displayed on GitHub.com is likely to appear
> out of date;

If the new data is not presented, users will typically retry, which may result
in duplicated new content.

~~~
rashthedude
Happened to me already. I created a new branch, but it kept informing me that
it didn't exist; I retried a few times, and shortly after it was created,
along with a bunch of duplicates.

~~~
geerlingguy
Same here, but last night I was trying to post a comment to another repo,
refreshed and tried again like 5 times before realizing there was an outage
and it wasn't on my end. This morning that issue has 5 dupe comments (so I
look like an idiot), and deleting them does nothing; they just reappear when I
refresh the issue.

------
diegoperini
As long as the same problem doesn't occur again, I have no hard feelings
against them.

------
Ajedi32
Wow, is this still going on? I noticed this yesterday and figured it would be
fixed in a few hours, but [GitHub's status page][1] is still showing red. Is
this the longest outage GitHub has had?

[1]: [https://status.github.com/messages](https://status.github.com/messages)

------
LaserToy
This incident report tells very little. I hope they release what actually
happened, how it affected their services, and how they plan to avoid it in the
future. Almost any issue can be publicly described as "stuff broke because of
network".

~~~
WorldMaker
"Partition failure" is a relatively specific term of art that implies it was a
partial database failure, so "stuff broke because we lost part of a database".

It may not be in GitHub's best interest to describe much more than that
because which database, which tables, and how they were partitioned can be
secret performance sauce.

Postgres partitioning: [https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html](https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html)

Redis partitioning: [https://redis.io/topics/partitioning](https://redis.io/topics/partitioning)

~~~
zzzcpan
"Network partition" is unrelated to partial database failure or database
partitioning. It means that database servers got disconnected from each other.
Normally this shouldn't be a problem, but as far as I know they don't use
proper distributed algorithms and so it's possible that disconnected servers
each became masters and were serving requests independently, which is why they
speak about inconsistencies. I believe this problem is commonly known as split
brain [1].

[1] [https://en.wikipedia.org/wiki/Split-brain_(computing)](https://en.wikipedia.org/wiki/Split-brain_\(computing\))

~~~
WorldMaker
It's not "unrelated", it's an intertwined phenomenon, especially in practical
consequences. In Redis, as one primary example I already mentioned, all
database partitioning is network partitioning (and vice versa). The logical
object is still a "database" even if the physical object suffering problems is
the "network".

For similar reasons that you suppose they "don't use proper distributed
algorithms" (which seems an overly harsh way to put it), I simply presumed the
logical entity in question is still a "database", even in the case of a
network partition problem. I don't think it was either Postgres or Redis in
this case, but they are simple enough examples to illustrate the overall
problem.

Either way, my case still seems to stand that we aren't likely to get more
information about the specifics of the partition failure because likely both
the logical (which "database") and physical layer specifics (which
datacenters/"network") are things that are internal to GitHub that they may
not be able to publicly describe much more than what they have.

~~~
LaserToy
It doesn’t really matter what they used; I just pointed out that their
incident report is BS. I can blame virtually any production problem on the
network, but the ugly truth is that it is not the network's fault in most
cases, but the fault of the engineers who developed the system with
unreasonable assumptions.

------
nik736
I don't understand why code hosting platforms like GitHub, GitLab or BitBucket
have so many issues so regularly. Is there anything special about them?

~~~
NL807
Define regularly.

I can recall only 2 incidents this year. I think that's not too bad
considering the level of traffic they have to contend with.

~~~
Illniyar
Considering this has been going on for several hours now, their availability
is down to 99.9% at best and dropping by the hour. Their business SLA is
99.95% (though I have no idea exactly what it covers), so it's quite possible
that they are in breach.

Still not bad, but 2 incidents like this a year is usually considered
unacceptable for infrastructure service providers.
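
For scale, the downtime budget a 99.95% SLA implies, assuming it's measured
over a year:

    # (1 - 0.9995) * 365 * 24 = 4.38 hours of allowed downtime per year,
    # i.e. roughly 22 minutes per month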

~~~
dudul
I wonder how they define their SLA though. If only some of the features are
down, does it impact the SLA?

------
dvfjsdhgfv
It would be interesting to know what the cause of the network partition
failure was.

------
mr_toad
Something that struck me yesterday as this started is that Github isn’t
really just a dvcs hosting solution; Github is a social network.

It’s easy to change the remote origin of a git repository. It’s not hard to
migrate a project to Gitlab etc, and duplicate all the technical features of
Github.
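
For an individual clone it really is one command (the URL here is made up):

    # point the existing clone at the new host:
    git remote set-url origin git@gitlab.com:example-org/project.git
    git remote -v    # confirm fetch/push now target the new host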

What’s really hard is replacing the social graph. If you’re a large project
with a lot of contributors, onboarding everyone is not going to be easy. About
as easy as convincing all your friends to stop using Facebook.

------
josteink
> Further, this incident only impacted website metadata stored in our MySQL
> databases, such as issues and pull requests.

Basically the only things I was going to do on Github today :)

------
sneak
Why bother telling people five incorrect estimates for service recovery?

[https://twitter.com/sneakdotberlin/status/105434537971655884...](https://twitter.com/sneakdotberlin/status/1054345379716558849)

After one or two wrong guesses it just shows that you’re making it up.

------
richardwhiuk
I hope they are going to provide a better RCA than that!

~~~
rplnt
Sure, but I don't think they have the time since the outage is still ongoing.

------
foxhop
Don't try creating a new repo: it will create part of the metadata but won't
allow you to see or use the repo, and the repo name gets taken.

~~~
Symbiote
I created a new repository at about 9 UTC, and it started to work some of the
time at 12 UTC — I've pushed the code from my computer, and pulled it from
elsewhere.

However, it's still intermittently failing as of 14 UTC, so I haven't managed
a Maven release build yet.

------
sneak
GitHub also silently published a private repo of mine this week. I checked the
audit logs for both the owner user and the org, and they didn’t show a
permissions change anywhere.

Netlify has caused me to stop using GitHub Pages, and between the clownshoes
outage reports and the security issue I am now a GitLab user.

This is GitHub’s jump the shark episode. :(

~~~
andyana
Gitlab has had its own issues.

------
wangyjx
Was this a head-on blow for MS?

------
k_
Also discussed here:
[https://news.ycombinator.com/item?id=18271180](https://news.ycombinator.com/item?id=18271180)

------
mncharity
As I rearrange today's todo list, I recall wishing that I'd used the Microsoft
purchase event to encourage folks to increase their familiarity with gitlab.
So I now note:

[https://gitlab.com/explore/projects](https://gitlab.com/explore/projects) is
a live feed of project activity, suitable for code surfing. It also sorts by
stars and trending.

------
edejong
Let's do a quick back-of-the-envelope calculation:

Github reported 28,337,706 users as of 2018-06-05 [1]. Let's assume 50% of
these are active. Let's also assume that, due to the unavailability of GH,
around 2 usable hours per developer are lost. Another assumption is that each
developer contributes around 50 US$ per hour.

This means this outage has cost us users: (28337706 * .5 * 2 * 50) ≈ 1.417
billion US$.

Perhaps not use MySQL for such critical systems?

[1]
[https://github.com/search?q=type:user&type=Users](https://github.com/search?q=type:user&type=Users)

~~~
alexeiz
The assumption that developer time is lost when Github is unavailable is
wrong. The whole idea of Git being a distributed VCS is that it does not
require any connection to the main server (i.e. Github) to work with a local
copy of the repository. If Github is down, I can still do my work locally and
then push changes to Github when it's back online. The only case where I may
get blocked is when I need to fetch project dependencies hosted on Github to
do the initial build of my project, which doesn't happen very often.
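
For example, all of the following works with no network at all; only the final
push needs Github back (the branch name is made up):

    git checkout -b fix-parser       # branching is local
    git add -A && git commit -m "fix parser edge case"
    git log --oneline --graph        # full history is on disk
    # once Github is reachable again:
    git push -u origin fix-parser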

~~~
goykasi
But you are ignoring a huge component of using github. If you only use github
as a central, shared repo, sure, you lost almost nothing. Their git infra
seemed to be operational throughout this, but if you use the major features of
github (issues, PRs, review, webhooks for CI, etc.), you probably did lose out
on developer time.

My team and I pushed tons of code to origin today (JST, btw), but we were
almost at a standstill as far as merging to master and closing out branches.
Github being down had a huge effect on the process: review, merging, and CI
success. Merging was our big one, since we have protected branches via github
(master); CI was second, since we received no webhook events. So either we
wait it out (~8 hours) or we throw out our process and do something different
until they fix it. The latter didn't seem like a reasonable course of action.

github != git

It's easy to say, "well, they are down, that doesn't affect git", but the
reality is that a lot of orgs don't just use git. They use github. That fact
envelops a lot of process, routine, infra, schedule, money, etc. Luckily, it's
only been one day. But developer time has definitely been lost. You can't
reasonably say otherwise if you use github.

edit: I will say I don't fully agree with the GP. Blaming the downtime on
MySQL is silly, and coming up with dollars lost is over the top. Things
happen. That comment was a weird attack on a reasonably stable database.
Github had a bad day; it happens.

~~~
platinium
Going forward, I see that the sensible solution for all small companies
relying on cloud-hosted git is to have a secondary cloud provider at all
times.

~~~
goykasi
> Which brings into question... what is so special about the "front-end"
> features that Git provides? ... bare bones skeleton that we can swap in/out
> at will

Your comment disappeared. I think you had a good point, and it'd be cool if it
could be a real thing, but...

I don't disagree. But our team realized that we rely almost too much on
github; we just decided to put up with it. Is there a solution that doesn't
depend on running your own "github"? My infra lead and I had the usual fun,
tongue-in-cheek chat that began with... them: "maybe it's time we switch to
gitlab", me: "can we be up tomorrow?" We've had that same discussion many
times before.

In the end, it comes down to process. If you buy into the features (PRs, CI
hooks, etc.), then it's really hard to just say "well, we can maintain and
replicate the alternative for the .1% edge case". Otherwise, you might as well
just use the alternative and not github, gitlab, etc. It's hard to decouple
from github. They do that by nature: dev still continues; they are a piece of
the process puzzle. I think abstracting them away just complicates things
unnecessarily.

~~~
platinium
Ah yeah, it disappeared because I thought I would get more reads if I moved it
up in the post chain, lol.

May I ask what makes your company feel secure about using cloud-hosted
solutions? I mean, can’t a disgruntled employee easily clone the git repo to
his own GitHub account? I suppose they could do that anyway, just by copying
the repo somewhere, but having an entire company’s secret code on the cloud
just seems to remove too many of the barriers protecting the code.

------
smilesmile2018
The GitHub team seems to be _VERY_ unprofessional. A 15-hour outage means
~99.83% availability over a year, which is extremely bad. 9 hours ago they
also said they would fix the problem within 2 hours... still not fixed!!!

~~~
fishnchips
In order to objectively assess their level of professionalism, you'd need to
know what happened and what's going on in there right now. Think about it this
way: can someone break things where you work right now and cause an outage of
this magnitude? For all the places I've worked, the answer would be "yes, of
course".

~~~
platinium
It doesn't matter, really. It's a black box from a business perspective. Some
users have lost faith, and some people will migrate to other solutions.
Regardless of how fair or unfair the incident was, it is a fact that it was
poor uptime, especially for such an important cloud provider for code.

~~~
fishnchips
Making decisions based on a single event is risky. One could even argue that
an event of this magnitude is likely to cause significant improvements in
reliability. And you just paid the price for that as a user, so unless they
keep failing repeatedly, you may be better off sticking with them.

~~~
zzzcpan
> One could even argue that an event of this magnitude is likely to cause
> significant improvements in reliability.

Doesn't happen in practice and usually the whole thing is just blamed on
"process" with subsequent "process changes".

