
Update on 1/28 service outage - traviskuhl
https://github.com/blog/2101-update-on-1-28-service-outage
======
rburhum
Yesterday I was being a bit of an ass to a few people about how "the whole
point of using git is so that we can do decentralized code management, and why
were these dependencies being pulled from our private GitHub if they could be
sent point to point, yadda yadda yadda". Then they went over the list of
package managers and dependencies we use and I had to shut up. Even when we
host our own Docker Hub and package managers (we do), if you dig far enough,
you can find some dependency of a dependency of a dependency that relies on
GitHub. Brew/npm/build script/whatever. It is crazy how much everything has
changed in the past few years. GitHub went from something that was really nice
to have to a core requirement for complex systems that rely heavily on open
source.

~~~
nickpsecurity
That's a good point. I've been putting off learning Git for as long as I can,
but almost everything on my todo list heavily uses it, or ties into it as you
said. So I'm going to have to bite the bullet and learn it.

Yet I swore Git fans told me its decentralized design avoids single points of
failure: everyone has a copy and can still work when a node is down, just not
necessarily coordinate or sync in a straightforward way. This situation makes
me think that, either for Git or just GitHub, there's some gap between the
ideal they described and how things work in practice. I mean, even CVS or
Subversion repos on high-availability systems didn't have 2 hours of downtime
in my experience.

When I pick up Git/GitHub, I think I'll set up a way to constantly pull
anything from Git projects into local repos and copies. Probably non-Git
copies as a backup. I used to also use append-only storage for changes in
potentially buggy or malicious services. Sounds like that might be a good idea
here, too, to prevent some issues.
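
From what I've read so far, something like this standing mirror might do it (a
minimal sketch; the paths, URL, and schedule are hypothetical):

    # one-time: create a full mirror (all refs, no working tree)
    git clone --mirror https://github.com/some/project.git /backups/project.git

    # cron entry: refresh the mirror every 15 minutes
    # */15 * * * * git --git-dir=/backups/project.git remote update --prune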

~~~
ajkjk
I'm sorry to be rude, but it sounds like you should go learn Git and come back
to this conversation.

The decentralized design _does_ avoid single points of failure, and everyone
_does_ have a copy. So: check, check, great. Unfortunately (maybe...) everyone
has put their master repos in the same place, which somewhat counteracts the
decentralization. But there is certainly no immediate coupling between the Git
repository on your computer and the GitHub repository it's pulling from. It's
not like GitHub being down in any way prevents you from working on code you've
already checked out, unless you need to go check out more code.

(The same obviously may not be true for package managers and build scripts
that are not running in isolation from your upstream repository, which is
where the problems have arisen.)
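
For what it's worth, the day-to-day operations are all purely local; only the
commands that actually talk to a remote care whether GitHub is up:

    # none of these touch the network
    git status
    git commit -am "keep working"
    git log --oneline

    # only these do
    git fetch origin
    git push origin master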

~~~
nickpsecurity
"I'm sorry to be rude, but, it sounds like you should go learn Git and come
back to this conversation."

It looks like it.

"The decentralized design does avoid single points of failures, and everyone
does have a copy. "

So, like many decentralized systems I've used, a master node gets worked
around by other nodes that communicate in another way? Or would some absurd
situation be possible where...

"Unfortunately (maybe..) everyone has put their master repos in the same
place, which somewhat counteracts the decentralization."

...one node going down could prevent collaboration? Oh, you answered that.
That sounds better than CVS but shit by distributed-systems standards. I'll
still learn it anyway since everyone is using it. Probably in the next week or
two.

~~~
tehbeard
It's a PEBKAC issue. The software is fully capable of having multiple remotes,
but it's rarely used that way.

~~~
zwp
Is there an easy config for that? Suppose I want to push to, e.g., GitHub and
Bitbucket (without sharing my creds with IFTTT or similar)? Is a post-receive
hook on a local pseudo-master the way to go?

~~~
Symbiote
See, for example, here: [http://stackoverflow.com/questions/14290113/git-pushing-code...](http://stackoverflow.com/questions/14290113/git-pushing-code-to-two-remotes/14290145#14290145)

    # give "origin" two push URLs so a single push updates both remotes
    git remote set-url --add --push origin git://original/repo.git
    git remote set-url --add --push origin git://another/repo.git
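
With both push URLs in place, a single push goes to both remotes:

    git push origin master   # sent to both configured push URLs
    git remote -v            # verify the fetch/push URLs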

------
skewart
Am I the only one who is a little shocked that a power outage could have such
a huge effect and bring them down for so long? I'm not an infrastructure guy,
and I don't know anything about GitHub's systems, but aren't data center power
outages pretty much exactly the kind of thing you plan for with multi-region
failover and whatnot? Is it actually frighteningly easy for this kind of thing
to happen despite following best practices? Or is it more likely that there's
more to the story than what they're sharing now?

~~~
LinuxBender
I am not at all surprised. There are 'best practices' and then there is what
really happens based on business processes and needs. In reality, even the
most cloudy of cloud providers will run into this problem at some point. Folks
often come up with ideas of implementing something like Chaos Monkey in their
data center, then realize the actual impact it will have and find it is almost
impossible to get the rest of the business to agree to the concept. It isn't
as easy as it sounds. I only know of two businesses that have actually
implemented Chaos Monkey, one being the company that coined the term. Even
regular reboots won't catch these problems, and if folks were honest, you
would see 1+ year uptimes on most servers in most places. That is just based
on my experience, and I am sure some of you have seen different.

~~~
JetSetWilly
The problem is most environments are very heterogeneous. I evaluated the Chaos
Monkey approach for a big bank; the issue is that Netflix has whole data
centres full of machines doing pretty much the same thing: streaming and
serving.

And the worst that can happen is a customer's stream stops and they have to
restart it.

But in most big companies you have thousands of apps that are all doing very
different things. Perhaps a critical app runs on 4 hosts spread across two
data centres; you're not going to convince people to have Chaos Monkey
regularly and randomly bringing down those hosts. It would cause real impact,
and it's risky. Yeah, in theory it should be able to cope, but in reality the
scales in most orgs are quite different.

That said, GitHub sounds a lot more like the Netflix end of the scale, doing
one specific thing at large scale.

~~~
drather19
While Netflix as a company is focused on doing one specific thing at large
scale, they're heavily invested in microservices and do actually have
"thousands of apps that are all doing very different things".

Chaos Monkey fits when people build and deploy their services with the notion
that any particular instance (or dependency) could fail at any given time.
It's a tough road to evolve out of a legacy, monolithic stack without much
redundancy baked in.

~~~
JetSetWilly
Whether they have broken up their apps into microservices doesn't seem to
matter to me. That's just a matter of how they have organised their code;
monolithic or microlithic, the result is the same.

They have a focused business with relatively little variation in how they make
money: all their customers simply pay for a streaming service.

Most large companies, certainly banks anyway, have thousands of apps because
there are also thousands of different parts of the business making money in
their own unique ways, each with their own unique needs.

What works for Netflix therefore can't work for other businesses, because the
actual business is much more heterogeneous than Netflix's, and the technology
will reflect that whether it is organised in microservices or monolithically;
that part is totally irrelevant.

------
nickpsecurity
Here's the only page I could quickly find on GitHub's architecture, for those
interested:

[https://github.com/blog/530-how-we-made-github-fast](https://github.com/blog/530-how-we-made-github-fast)

This looks like a single datacenter. I don't see anything here indicating high
availability or other datacenters; you'll usually spot either an outright
mention of it or certain components/setups common to it. They might have
updated their stuff for redundancy since then. However, if it's the same
architecture, then the reason for the downtime might be an intentional design
in which only a single datacenter has to go down.

Might be fine given how people apparently use the service. It's just good to
know that this is the case so users can factor that into how they use the
product and have a way of working around the expected downtime if it's
critical to do so.

------
bhaak
"Millions of people and businesses depend on GitHub"

Well, we shouldn't depend on it so much.

I shudder at the thought of what an outage of GitHub would mean for our
company. This time we were lucky, as it happened during the night in Europe.

Unfortunately, I don't have the power to test this scenario in our company.

~~~
banku_brougham
I, like others, am confused by this common sentiment. GitHub is the remote
repo, but the version control is distributed, so everyone has a copy. I'm
pretty sure I can fill a few hours or more with work needing to be done on my
local repo. FYI, I'm not a professional software developer, but I would like
to know.

The things that come to mind: issue trackers, messaging, not being able to see
latest pull requests.

Update: Now I'm starting to understand the build dependency issue. Still, why
do you need to pull all dependencies from GitHub repos to build the
application? Can't the currently available versions work?

~~~
yeukhon
Continuous integration, continuous delivery. Your Jenkins jobs all point to
repos on GitHub? Do you plan to fix every single URL? Some tools actually pull
stuff straight from GitHub. If you don't have a mirror privately somewhere,
where do you push your code? How can you tell you actually have the latest of
everything? Time to compare with every co-worker.
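
For what it's worth, repointing an existing checkout at an internal mirror is
a one-liner (hostname hypothetical):

    # swap origin over to the internal mirror until GitHub is back
    git remote set-url origin git@git.internal.example.com:org/repo.git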

------
anton_gogolev
It's one thing to temporarily lose access to remote repositories for pushes.
Quite bearable, because you can exchange code across your corporate network
using patches and whatnot. It's totally different when you cannot friggin
_build_ anything because package managers grab dependencies directly off of
GitHub.
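
The patch route really is workable with stock Git, for example (refs
illustrative):

    # export the local commits that aren't upstream yet, one .patch file each
    git format-patch origin/master

    # a colleague applies them on their side
    git am *.patch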

~~~
msbarnett
This is more an argument for caching or vendoring dependencies than anything
else.

If the ability to make builds is critical to your org, making your build
process depend on the availability of third-party services over which you have
no control is going to end in tears.
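
For example (a minimal sketch; the internal URLs are hypothetical):

    # point the package manager at an internal registry/cache
    npm config set registry https://npm.internal.example.com/

    # or vendor a Git dependency outright via a local mirror
    git submodule add https://git.internal.example.com/mirrors/somedep.git vendor/somedep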

~~~
banku_brougham
This is it. Production builds have to have dependencies hosted internally, not
all over the web.

------
bjacobel
Not much detail here. A more thorough postmortem would give me more confidence
they can recover from another similar issue. Hoping to see one soon.

~~~
anon987
Yep, I think most of these post-mortems, from any company, are pointless from
a technical perspective. It's 4 paragraphs that boil down to "someone did
something wrong and we'll make sure it doesn't happen again" with zero
specifics.

There's no point in reading these because there's no technical information.
Stuff like this is something you send to your customers because they want a
root cause.

~~~
Zikes
I strongly disagree that these sorts of communications are pointless. In every
major service outage I've seen where the company maintained a degree of
silence, it's caused major damage to their public relations and consumer
trust.

I know it doesn't tell you much about exactly what happened, but the truth is
they may still be sorting that out and focusing on ensuring it does not happen
again. An in-depth post-mortem accompanied by a description of the fix would
be great. In the meantime, admitting culpability and apologizing are the
essential first steps.

------
frik
You can see the cascade effect on their status page graphs:
[https://status.github.com/](https://status.github.com/)

~~~
Loic
What is impressive is that, with the website down for 2h, they can still
announce 97% availability for the day, even though the graph clearly shows the
2h of failures... :-/

~~~
arthurschreiber
Unless I'm mistaken, 97% of 24 hours = 23.28 hours of uptime, i.e. only about
43 minutes of downtime; a full 2 hours down would be 22/24 ≈ 91.7% for the
day.

~~~
mirekrusin
Yes, it went down to 89% or something just after the problem.

------
tommoor
This post makes it sound like GitHub has its own data centers and power
infrastructure, which is definitely news to me. I'd presumed colo at best.

~~~
noazark
The last news I've heard about it was back in 2009:
[https://github.com/blog/493-github-is-moving-to-rackspace](https://github.com/blog/493-github-is-moving-to-rackspace).
But I've also heard that they have some infrastructure on site (clearly not
what they were talking about here).

------
moondev
GitHub doesn't deploy their services in multiple AZs?

~~~
rs999gti
Maybe they do.

But this two-hour failure tells me that they have never really tried a hot
failover and failback scenario in order to test the resiliency of their site.

~~~
detaro
Or something happened that didn't happen in the tests. And if they suspected
something might be in an inconsistent state, taking some downtime to make sure
it comes back up properly clearly is the better option.

------
beachstartup
it seriously makes me lol that people are upset, or surprised, that an
internet service went down for a couple of hours. a couple of hours! get some
perspective please. go for a walk, get a tasty burrito, try a new brand of hot
sauce.

"why didn't they do X, Y, or Z"

the answer in every case is it's extremely expensive, or extremely hard to do,
or both. you want a reason, there's the reason. maybe they'll fix it. maybe
they won't. next question.

make your own backups and redundant systems. "but github is so critical!" --
even more reason to have a backup. bad shit happens in this world. even to
good people. prepare or suffer the consequences.

------
ljk
Maybe I'm ignorant, but why do companies rely on GitHub? Why not just host it
in-house? If there's a power outage in the office then everything would be
down anyway, right?

~~~
danneu
A rare two-hour GitHub outage isn't enough to make anyone on my team want to
start dicking around with internal tools.

------
gavazzy
Would a cross between Git and torrents be possible? Rather than having a
central server to pull from and push to, the server would provide a list of
clients. If the server goes down, the list is still available, and so people
who depend on it would still be able to communicate.
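
A toy sketch of the fetch side, assuming some tracker-like service had already
handed out a peer list (all URLs hypothetical):

    # try each peer mirroring the repo until one fetch succeeds
    for peer in https://peer1.example.com/project.git \
                https://peer2.example.com/project.git; do
        git fetch "$peer" && break
    done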

------
matt_wulfeck
Why is it so hard for us to distribute our dependencies? Hash the package to a
SHA and put it anywhere on the Internet. Then we just need a service that
holds and updates the locations of the hashes, and we can fetch them from
anywhere.
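
A minimal sketch of that flow (mirror URL hypothetical; <hash> stands in for
the real digest):

    # publisher: compute the content hash once
    sha256sum mypkg-1.0.tgz

    # consumer: fetch from whichever mirror has it, then verify the bytes
    curl -fLo mypkg-1.0.tgz "https://mirror.example.com/sha256/<hash>"
    echo "<hash>  mypkg-1.0.tgz" | sha256sum -c -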

------
ibejoeb
For those that have been affected by this, what parts of your process were
disrupted? I've read, so far:

      * Build fails due to unreachable dependencies hosted on GitHub
      * Development process depends on PRs

------
free2rhyme214
Chinese DDoS? Somehow I don't buy power going out at a server farm.

~~~
oxguy3
Why not? Things break. Electricity is one of those magical things for which
it's very hard to have insanely good uptime -- frankly, it's incredibly
impressive that power outages aren't more common.

And why would GitHub not disclose that it was a DDoS? They were very
forthcoming when there actually _was_ a Chinese DDoS last April:
[http://arstechnica.com/security/2015/04/ddos-attacks-that-cr...](http://arstechnica.com/security/2015/04/ddos-attacks-that-crippled-github-linked-to-great-firewall-of-china/)

And in a DDoS, the service typically becomes slower and slower until it
reaches the point where only like one in a hundred requests succeeds. With the
GitHub outage, it died fairly instantaneously, and it was completely 100%
dead. There was no timeout as the servers tried to respond -- the "no servers
are available" error page loaded instantly every time.

------
smaili
It's always scary when a cloud service you rely on goes down but great to see
GitHub recover. Well done!

------
out_of_protocol
Various date/time formats across the world bring me to my knees. If the 1/28
outage was _that_ rough, 2/28 would be twice as bad and 28/28 would feel like
armageddon, maybe?

------
ryanfitz
I recently read a blog post from GitHub about them operating their own
datacenter:
[http://githubengineering.com/githubs-metal-cloud/](http://githubengineering.com/githubs-metal-cloud/)

I'm not positive, but it sounds like a fairly recent switch from a cloud
provider to their own datacenter. If that's the case, I'd expect a number of
outages to come in the following months.

~~~
secure
AFAIK, they never used a cloud provider.

~~~
ryanfitz
GitHub was hosted at Rackspace; here is their blog post about it:
[https://github.com/blog/493-github-is-moving-to-rackspace](https://github.com/blog/493-github-is-moving-to-rackspace)

From their blog post last month:

_As we started transitioning hosts and services to our own data center, we
quickly realized we'd also need an efficient process for installing and
configuring operating systems on this new hardware._

