
Downtime last Saturday
https://github.com/blog/1364-downtime-last-saturday
======
ghshephard
This may sound selfish, but github does such a great job of writing up post
mortems, that I almost look forward to their outages just because I know I'm
going to learn a lot when they write their follow up.

~~~
tptacek
Came here to say the same thing. This is Mark Imbriaco's wonder twin power. It
really is something to generate goodwill from an outage writeup. Part of it is
just unflinching transparency combined with nerdy details; you never feel like
they're hiding anything, _and_ you get to learn about all the operational
doodads they're working with to run at this scale.

~~~
imbriaco
Thanks, I really appreciate that.

For me, the motivation for transparency came from too many frustrating
instances of being kept in the dark after things had gone wrong. The worst
thing both during and after an outage is poor communication, so I do my best
to explain as much as I can what is going on during an incident and what's
happened after one is resolved.

There's a very simple formula that I follow when writing a public post-mortem:

1. Apologize. You'd be surprised how many people don't do this, to their
detriment. If you've harmed someone else because of downtime, the least you
can do is apologize to them.

2. Demonstrate understanding of the events that took place.

3. Explain the remediation steps that you're going to take to help prevent
further problems of the same type.

Just following those three very simple rules results in an incredibly
effective public explanation.

~~~
daeken
This sort of approach is the reason that when I need to upgrade to a higher
plan on Github, I don't flinch. In fact, I _love_ giving you guys more money,
simply because you make my life completely painless; I don't think I can say
the same about any other service. Keep up the awesome work.

------
jetsnoc
Wow, I'm very glad our company chose a routed design with an interior routing
protocol (OSPF). I've never been able to push the limits of a layer-two
network as far as GitHub has. A routed network helps segment things so that
when systems fail or a re-convergence mistakenly occurs, only a few racks
have problems rather than the entire system. It also makes it easy for us to
push routes to our exterior routing protocol (BGP).

I also find it interesting that they don't use a separate out-of-band network
for heartbeats/management, especially given how unstable their layer-two
network has been. It sounds like the file servers need a secondary, stable
heartbeat network, even if it's only 10/100. No judgement being passed here;
it just seems like a lot of eggs in one basket. That said, thank you for this
write-up and for sharing so openly and honestly. Happy GitHub customer here!

EDIT: Yes, I know routed networks can have similar problems but they are
designed for routing, pathing and redundancy with a lot less overhead on the
broadcast domain.

~~~
imbriaco
You're right on all counts. We have a great many plans underway for how we
want our network to operate.

------
ewokhead
Note to Github:

Freeze prod changes two weeks before and two weeks after all major holidays.

Your employees probably don't appreciate the hassle when all they are thinking
about is "YEAH! DAYS OFF!"

Just my opinion and how I run my systems in the DC.

~~~
nixgeek
Holidays are actually one of the best times to be making changes as traffic is
significantly lower, and IMO, one should be aiming for an infrastructure where
you can always ship changes without being afraid of the ramifications.

Architecturally that may mean many things - hitting "SHIP IT!" might push code
into a staging environment for some final testing before delivering it onto a
platter in production. Should you have multiple sites, it might involve
rolling out the new stuff to just one of them until you see how it goes. Maybe
you have feature flags and want to introduce a new change to all servers, but
just 1% of the user population?
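
To make the 1% idea concrete, here's a minimal sketch (hash-based bucketing
in Python; the function and feature names are made up, not any real
platform's API):

    import hashlib

    def in_rollout(user_id, feature, percent):
        # Hash user_id together with the feature name so each feature
        # gets its own stable 0-99 bucket per user; ramping from 1% to
        # 100% only ever adds users, never flips anyone back out.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percent

    # Ship the new code path to 1% of users, then ramp as confidence grows.
    print(in_rollout("u12345", "new-merge-ui", 1))    # stable True/False
    print(in_rollout("u12345", "new-merge-ui", 100))  # always True at 100%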

Fundamentally hitting "SHIP IT!" should be doing just that. Any constraints
you put on how fast it gets to 100% of the user population are a risk control,
and you need to optimize for a balance of developer happiness and system
stability.

When you concede "We can't make changes because we're frozen" outside of a
critical systems ('life critical') environment, you should quit your IT job
and go become a fisherman or something.

~~~
ewokhead
Shipping code changes is a different beast.

I am talking about the infrastructure side of things.

I have built the kind of large-scale percentage-deployment, slice-deployment
(whatever you want to call them) scenarios you speak of, but modifying an AGG
switch that provides connectivity to your entire prod space... uh, go ahead
and use your philosophy for managing large infrastructure and I will enjoy my
days off, thanks.

This change is not a SHIP IT! change. This is a switching infrastructure
upgrade, not a push from your CI into your rolling release system that
updates prod applications.

This is an underlying infrastructure change with high impact and high
visibility with many stakeholders at risk.

Sorry for any confusion my very vague post caused.

Maybe someday I will become a fisherman. But for now, I will keep these
switches and servers up and running with 99.999% uptime. It is what I love to
do!

Got any fishing tips?

~~~
nixgeek
I guess we have differing viewpoints; I see absolutely no fundamental reason
why infrastructure should be treated all that much differently from code. It
should be possible to fire off a test suite, to automate its deployment, etc.

I would agree that is not where most folks are at today.

I would argue the far more interesting discussion is how we develop and mature
tools to get more folks there in future.

~~~
caw
Since not everyone here is ops: if your holiday is going to be potentially
impacted by a deployment, you are fully aware of that going into the
deployment. We take note of people with blacked-out dates (e.g. you booked
your flight before we ever started talking about this), and everyone else
impacted knows what's on the docket. While the issues are sudden, everyone at
least has that nagging feeling that they might get a call to action.

I agree that we should be moving toward automated infrastructure testing and
the like. To some extent it may be possible via puppet/chef/automation tools;
however, not all infrastructure is like that. Sometimes you have to go
physically move stuff during your downtime window, and you can't always do
redundant wiring (particularly for the network). I've been bitten by network
outages more than anything else, particularly partial/undetected failures.

I think we're seeing a move toward that "treat infrastructure as code"
future, with things like clustered fileservers (NetApp 8 Cluster-Mode, or
Isilon systems). You'll be able to "seamlessly" migrate data and virtual
interfaces around without impacting production. I'm looking forward to seeing
how that changes ops.

------
raverbashing
And High Availability isn't. Again.

It seems redundancy protocols end up grappling with each other more often
than not.

Unfortunately there is no easy answer, and I'm sure GitHub employs people
with lots of experience.

It makes me wonder how, after so many people have worked on problems like
this, it's still a challenge.

~~~
ChuckMcM
You can't predict what you can't predict. This is what makes the Chaos Monkey
experiment so interesting. And yes, HA is hard, and as with many things, the
hard part is latency.

The more latency you can tolerate, the easier HA becomes. At an extreme, if
you can tolerate one minute of latency, then each request can come in and
compute the most likely way to complete against the most authoritative set of
actors. Few people, though, are willing to tolerate a commit taking 15
minutes, much less a couple of hours.
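
A toy illustration of that trade (hypothetical names, nobody's real system):
with latency budget to spare, you can ask every replica and take the majority
answer instead of trusting whichever node responds first.

    from collections import Counter

    def quorum_read(replicas, key):
        # In reality these would be parallel RPCs with timeouts; polling
        # serially is fine for illustration. Trust the value a majority
        # of replicas agree on, at the cost of waiting for all of them.
        answers = [r.get(key) for r in replicas]
        value, votes = Counter(answers).most_common(1)[0]
        if votes > len(replicas) // 2:
            return value
        raise RuntimeError("no quorum; retry or reconcile")

    # Three replicas, one stale after a partition:
    replicas = [{"HEAD": "abc123"}, {"HEAD": "abc123"}, {"HEAD": "000fff"}]
    print(quorum_read(replicas, "HEAD"))  # -> abc123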

This was one of the most insightful things about NFS and the whole stateless
design. By burdening the client with the state the server could be much
simpler.

Once you get above a certain size the problems change, becoming both easier
and harder (easier in that you can disperse your data further; harder in that
your confidence in agreement between copies takes longer to compute, which
increases latency). It would be interesting if Google shared their work on
Spanner (they seem to have tackled this problem at large scale), and given
Netflix's experience (Chaos Monkey's dad), it seems like Amazon still hasn't
quite gotten the recipe right.

It is a deliciously thorny problem with subtle complexity and unexpected
inter-dependencies.

------
gleb
It seems that every GitHub downtime I can recall was caused by automated
failover.

~~~
rdl
I wonder how much downtime they've avoided through automated failover.

~~~
gleb
Yeah, I have the same question :-). Hard to do a cost/benefit analysis when
you only see the costs.

~~~
rdl
In general, automated failover seems to make most small problems
non-problems, but it turns some small problems into big problems. Whether
that trade makes sense for your app probably depends on the actual numbers.

For some systems, I'd take getting rid of small outages -- I'll happily take
an increased risk of a projected 15 minute loss of heart function becoming a
>60 minute loss of heart function if it also eliminates what would otherwise
be a bunch of 5 minute losses of heart function, since even the 5 minute
interruptions would be fatal.

(Or, for a better example, revolvers vs. semi-autos. A revolver is generally
more reliable, but if it goes out of timing it's basically doomed, whereas a
semi-auto can jam or pieces can break, but a monkey can clear it and a
trained monkey can fix it.)

~~~
kyrra
Failover is meant to deal with hardware failures, and for those it tends to
work just fine. But if the node you are failing over onto is already at 60%
capacity and you add another 60% of load during the failover, things are
going to get worse.
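
The arithmetic is worth spelling out (numbers are illustrative, not
GitHub's):

    def safe_to_failover(standby_load, primary_load, capacity=1.0):
        # Failing a 60%-loaded primary onto a standby already carrying
        # 60% asks one node to serve 120% of its capacity -- the
        # "recovery" makes things worse. Check before flipping.
        return standby_load + primary_load <= capacity

    print(safe_to_failover(0.60, 0.60))  # False: 120% > 100%
    print(safe_to_failover(0.20, 0.60))  # True: 80% fits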

The top-level systems probably need to be able to deal with increased latency
or timeouts, and properly handle retries and throttling of traffic.

If you have an HA failover setup going but your alternate is already being
used more for load balancing than for failover, problems like this will
occur.

(I used to work on failover drivers for a SAN).

~~~
jackowayed
GitHub's failover problems have never been load-related. GitHub has pairs of
fileservers where one is the master and the other's sole job is to follow
along with the master and take over if it thinks the master is down, so when
they do failover, it is to a node with just as much capacity as the previous
master.

All the failover problems I can think of since they moved to this
architecture 4 years ago have been coordination problems, where something
undesired happened when transitioning from one member of a pair to the other.
In this case, network problems led them to a state where both members of a
pair thought they were the master.
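
A toy model of that failure mode (not GitHub's actual code, just the general
trap): each node treats "peer unreachable" as "peer dead", so a network
partition promotes both sides at once.

    import time

    class Node:
        # One member of an active/standby pair. The trap: "peer silent"
        # is indistinguishable from "peer dead" over a single network.
        def __init__(self, name, is_master):
            self.name, self.is_master = name, is_master
            self.last_heartbeat = time.monotonic()

        def check_peer(self, timeout=5.0):
            if time.monotonic() - self.last_heartbeat > timeout:
                self.is_master = True  # assume a silent peer is a dead peer

    a = Node("fs1a", is_master=True)
    b = Node("fs1b", is_master=False)
    # Simulate a partition: neither node has heard the other for 10s.
    a.last_heartbeat -= 10
    b.last_heartbeat -= 10
    a.check_peer()
    b.check_peer()
    print(a.is_master, b.is_master)  # True True -> split brain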

------
onetwothreefour
Ahhh... good old Heartbeat.

We used to use Heartbeat in a similar setup back in 2001. It was the worst
architectural decision we ever made, and after one failure too many (where
STONITH/split-brain/etc. killed the wrong machine, or both machines) we threw
it out.

TL;DR: This will happen again. Guaranteed.

~~~
justincormack
I was brought up to only use STONITH via serial or another non-switched
network connection. Running it over the same network is bound to cause
problems. But it's not a great solution anyway.

~~~
nixgeek
Agreed, but the world isn't always as kind. Providers often have funny rules
about how you can cable things up in their datacenters, and as noted in #4 of
GitHub's "where do we go from here" list, that needs to be addressed.

------
ewokhead
I just realized that the Sys. Admin/Prod. Ops to Developer ratio here is crazy
low. Everyone assumes I am talking about code changes when the article is
about prod switching and network transit device changes.

MLAG or any LAG technology (LACP, bonds, whatever) should never impact the
deployment of code. It should be invisible when it is working. Obviously it
is very visible when it breaks, though.

My heart goes out to the Github guys!

Sorry for the confusion everyone.

------
akg_67
I get the impression the issue wasn't network hardware but bad
high-availability design on the fileserver side. Why does GitHub have its
failover network on the same network hardware as the primary network? But I'm
not surprised, as I see this at a lot of clients: they put the failover
network on a separate VLAN on the same network hardware, and whenever they
have a network hardware issue, the servers run into split-brain problems.

The failover network should be totally separate, physically and logically,
from the primary network. The heartbeat between file servers should be
checked through both the primary and failover networks. If a server can't be
reached by its partner over the primary network, it should be gracefully
taken offline by the partner through the failover network.
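
A sketch of that dual-path check, purely to make the logic concrete (the
states and wording are mine, not from the post):

    def peer_state(primary_ok, failover_ok):
        # Only declare the peer dead when BOTH independent networks
        # agree; a single switch failure then reads as "network
        # problem", not "peer failure", and nobody gets shot by mistake.
        if primary_ok and failover_ok:
            return "healthy"
        if primary_ok or failover_ok:
            return "degraded: fix the dead path, do not fence"
        return "lost on both paths: fence via out-of-band power control"

    print(peer_state(primary_ok=False, failover_ok=True))   # degraded, no STONITH
    print(peer_state(primary_ok=False, failover_ok=False))  # now fencing is justified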

------
el_cuadrado
High Availability strikes again. No surprise there.

But I am mildly flabbergasted by the fact that GitHub uses STONITH. The
technology is about as safe as an open-core nuclear reactor, and it works
reliably only in very simple conditions.

~~~
mofraw
STONITH is designed for critical failures so you don't end up with a split-
brain situation, which is far worse than a dead node. STONITH is a good
thing. The problem here is more that the cluster wasn't configured to survive
a catastrophic switching failure.

~~~
el_cuadrado
Yep, right - a fencing solution that relies on the network to stop the
service. How could that possibly fail?

------
chuhnk
GitHub, as a sysadmin/systems engineer I feel your pain. I understand it
completely and know the horrors of failures leading into multi-hour recovery.
That said, you need to do better. This is a heavily relied upon resource for
the open source community. Perhaps you didn't anticipate this sort of growth,
but now that you're here, I'm sorry, but the weight does fall on your
shoulders. I heavily commend you guys on the service you've provided thus far
and on what you've done to pull all the varying language/library communities
together. I honestly want to see this scale and serve 5 9s year round. You
need to take a good hard look at the architecture of the stack and find a way
to get it multi-homed.

~~~
Ricapar
Sure, GitHub is a "heavily relied upon resource for the open source
community" - but what is really the impact of their downtime?

You have to wait a little longer to do your push/pull/merge/etc?

Give them a break.

~~~
chuhnk
Remember that individuals and organisations pay for private repo hosting.
Aren't they entitled to an adequate SLA of, say, 4 9s a year?

~~~
nixgeek
As with any purchasing decision you can read the T&Cs when making up your mind
and are free to vote with your wallet if you disagree.

Another factoid is that Amazon Web Services (AWS) only offers a 99.95% SLA,
so where did the 99.99% for GitHub come from?

~~~
chuhnk
Why does it matter that AWS is 99.95%? "Cloud" services like AWS and App
Engine have more relaxed SLAs. GitHub is not a cloud service; it's a
centralized version control repository. And the 4 9s, well, that's just
something I've always strived for when providing a service, but I do believe
we should all hold ourselves to it.

------
mikec3k
I always love reading post-mortems like this. It's fascinating to see how a
simple event can trigger a massive failure, and we can learn a lot from them.

------
cbsmith
tl;dr: It's really hard to get high availability systems right, and we still
run the entire service out of a single colo.

I can totally understand this kind of thing going wrong, but particularly
given the service they provide, why not have a second colo, with a relatively
recent clone of the repo, that you can route people to? Heck, you can likely
even do an automatic merge once the other repo is working again...

~~~
MichaelGG
Relatively recent clone? Sounds like that would screw customers up pretty
badly if they don't realise the problem.

If they went to a second site, synchronous commit to both sites is how it
should be done, no? The extra latency on infrequent git pushes is far less of
an inconvenience than the possibility of grabbing the wrong code.
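
A sketch of what that might look like (hypothetical API, nobody's real
implementation): the client's push is acknowledged only once both sites have
accepted it.

    class Site:
        # Stand-in for one datacenter's repository storage.
        def __init__(self):
            self.refs = {}
        def accept(self, ref, commit):
            self.refs[ref] = commit
            return True
        def rollback(self, ref):
            self.refs.pop(ref, None)

    def replicated_push(primary, secondary, ref, commit):
        # Acknowledge the client only after BOTH sites have the push,
        # so a failover can never serve a state the client was never
        # told about. The price is extra latency on every push.
        if not primary.accept(ref, commit):
            return False
        if not secondary.accept(ref, commit):
            primary.rollback(ref)  # keep the two sites in agreement
            return False
        return True

    primary, secondary = Site(), Site()
    print(replicated_push(primary, secondary, "refs/heads/master", "abc123"))  # True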

~~~
cbsmith
> Sounds like that would screw customers up pretty bad if they don't realise
> the problem.

I think there'd be a variety of ways to have the system fail until the
customer made some kind of adjustment indicating they grokked that there was
a failure (like, say, changing your upstream).

------
ksec
I have absolutely ZERO knowledge of enterprise networking. But it strikes me
that something as dumb as routers, switches, and other network equipment is
still so unstable.

A hypothetical question: why not use something like an 8-core 64-bit ARM
Linux computer as a switch, making more of the logic reside in software
instead?

~~~
regularfry
There is a move to put more of the switching and routing logic in software
(google for openvswitch if you're interested), but part of the problem is
that general-purpose hardware doesn't stand a hope in hell of keeping up with
interesting network data rates. You absolutely need to be doing a fair
portion of the work in hardware; the CPUs just coordinate it.
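
Some back-of-envelope numbers (mine, not from the thread) on why:

    # At 10 Gb/s with minimum-size 64-byte Ethernet frames (84 bytes on
    # the wire once preamble and inter-frame gap are counted), one port
    # must handle:
    link_bps = 10e9
    wire_bytes = 84
    pps = link_bps / (wire_bytes * 8)  # ~14.9 million packets/sec
    ns_per_packet = 1e9 / pps          # ~67 ns per packet
    print(f"{pps / 1e6:.1f} Mpps, {ns_per_packet:.0f} ns/packet")
    # ~67 ns is roughly 200 cycles on a 3 GHz core -- per packet, per
    # port, before you've done any useful work. Hence ASICs.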

------
apeace
It's about time we heard from them. I understand the timing of this was
unfortunate (right before a holiday), but the trust alluded to in the
conclusion of that post would be bolstered by faster post-mortems on major
outages like this one.

~~~
InclinedPlane
I disagree. Other than satisfying curiosity, what's the value in a faster
post-mortem? Especially considering that a faster post-mortem would likely be
less accurate and less complete. What is the difference in actionability for
anyone between an update in 2 days vs. an update in 4 days? I see none.

~~~
apeace
To me, a post-mortem doesn't just satisfy curiosity, it eases fears that the
problem will return and informs me of future plans which may help prevent the
problem, or may bring it back. It helps me form my own plan, since I'm a user
of Github.

I for one spent my break watching my email like a hawk, in case further
Github outages caused any of my automated deploy scripts to fail. I realize
it's my own responsibility to write scripts that handle failure scenarios,
but the fact is my company pays Github to host our repositories. Downtime
happens; I'm understanding of that. But when it does, I want to know what's
going on as soon as that information's available--especially when I'm on
holiday.

I don't think a blog post written after they resolved all the issues would
have been less accurate. It just would have inconvenienced whoever wrote it on
a holiday.

Not the biggest deal in the world--it would take me a lot more than this slip-
up to switch from Github. But IMO a service provider should get at least some
information out faster than this.

~~~
imbriaco
I appreciate your point of view but respectfully disagree. The post-mortem
would have absolutely been less accurate if we had delivered it sooner since
we did not have the details about why the MLAG failover did not happen as
expected until late in the evening on Christmas Eve. We've worked as quickly
as possible since then to provide a full post-mortem.

~~~
rdl
IMO anything less than a billing cycle (minus a week or so) is pretty
acceptable for a post-mortem, although a week is better.

------
treskot
I've noticed MC-LAG / MLAG failing quite often in my own environment. Any
details on the MLAG failure? Are we doing it wrong? Alternatives?

------
robomartin
<http://www.youtube.com/watch?v=c8N72t7aScY>

------
dos1
Who's their network switch vendor? I'm not a networking expert, but boy - it
sure seems like their switch vendor has screwed some things up royally. Or
perhaps this is common with all complex network topologies regardless of
hardware vendor?

EDIT: I would just like to say, along with others, I greatly enjoy their
postmortems and I feel as though I learn something every time. Kudos to them
for being forthright. I host my personal and professional projects with them
and am supremely confident that my data is as safe with them as it is with
anyone.

~~~
sounds
This is likely the reason the large cloud companies all do testing that
_actively_ causes outages. I'm over-simplifying on purpose here: this is
something that requires a lot of thought.

At first glance that seems foolish, but to quote you, "complex network
topologies" are very prone to falling over badly. Since they all seem to be
one-off custom setups these days, how can you be sure yours won't fall over?

Here are the testing approaches I know about:

1. Netflix Chaos Monkey:
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
\- but that doesn't mean Netflix has it all together. They still have
outages.

2. Google does it. They have a team that goes around unplugging network
cables and monitoring how fast the engineers can find and fix the problem. I
can't dig it up but it was only a few months ago - hey, Google, your search
engine can't find an article about you. :)

~~~
tonfa
http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/
(search for DiRT).

