
January 28th Incident Report - Oompa
https://github.com/blog/2106-january-28th-incident-report
======
eric_h
> One of the biggest customer-facing effects of this delay was that
> status.github.com wasn't set to status red until 00:32am UTC, eight minutes
> after the site became inaccessible. We consider this to be an unacceptably
> long delay, and will ensure faster communication to our users in the future.

Amazon could learn a thing or two from Github in terms of understanding
customer expectations.

~~~
dmunoz
I recently stepped into a role with a devops component, and one of my first
surprises was just how slow status.aws.amazon.com was to update about ongoing
issues. I had to scramble to find confirmation on Twitter and external forums
for the client.

~~~
atom_enger
What's even worse is that when Amazon finally updates their status page, it's
usually still a green icon with a little "i" badge for "information", even if
it was a partial outage. It takes a lot for the icons to go red, which is
what you'd look for if you're experiencing issues.

I do the same thing, often searching Twitter for "aws" or "outage" and find
people complaining about the problem which confirms my suspicions. It's a sad
state of affairs when you have to do this and Amazon doesn't seem interested
in fixing it.

~~~
click170
If you have a support agreement with them then file a ticket requesting better
customer communication and link back here as an example of how to do it right.

I think everyone complains in forums and online but doesn't actually file
tickets about it. These things are worth tickets too.

~~~
bmurphy1976
I take it you have no experience filing tickets with them. A typical ticket
goes something like this:

1. File ticket.

2. Wait. Then wait some more. Even if you pay big money for a support
contract, they take a long time to respond (often > 1 hour).

3. Get a response from a first level rep who has no access to anything, has
little dev experience, and asks some inane questions which I'm _convinced_
are a purposeful stalling tactic.

4. Play the dumb question/obvious response dance, waiting an hour or more for
a response each time.

5. If you are lucky (usually a couple hours in now) they acknowledge there's
some problem (but never give you any detail) and escalate your ticket to a
higher level internal team. If you are unlucky, you are calling up your
account rep (do you even have one??) and getting them to harass tech support.

6. Usually around now the problem "magically" disappears if you haven't
already fixed it yourself.

7. If you are lucky, a few hours, days, or weeks later you get a response
asking if you are still having the problem. You, of course, are NOT having
the problem since you long ago solved it yourself. If you are really unlucky
they try to schedule a meeting with one of their "solution architects" who is
then going to waste an hour of your time telling you how to properly "design"
your software for the cloud (i.e. trying to sell you on even more of their
services).

8. Ticket is closed having never gotten to the bottom of the problem; maybe
you get a survey.

I've _never_ seen this go down differently. Filing more tickets isn't going to
change this. You want to really change things?

STOP PAYING THEM!

If a few mid-sized customers stop paying them and make a big stink when they
do it, then I guarantee you things will change! Until then, they have little
incentive to improve and the big customers have a direct line to Amazon so
they can circumvent all this crap. It's up to the small and mid-sized
customers to push for change and the most effective way to do this would be to
spend your money elsewhere.

~~~
abrookewood
To be honest, I've always found their support to be really good. Sometimes it
can be a little slow to start, but I regularly encounter technicians who go
way above what I would expect to assist me and deliver a great outcome. If
other companies in Australia (e.g. telcos) were as responsive, I'd be a very
happy man. EDIT: I'm on Business Support, so maybe that's your issue?

~~~
jsjohnst
I'm on business support too and am generally talking to a rep in minutes.
They aren't always able to find the problem before I do, but I always get
follow-up details later on the how/why that they did determine.

~~~
bmurphy1976
I wish our experience was like this. We used to have business level but we
dropped it because we weren't getting value for it. Our experience was
slightly better when we had it but we still ended up either fixing most
problems on our own or waiting them out.

~~~
jsjohnst
How much do you pay per month for AWS? That might be a difference.

------
bosdev
There's no mention of why they don't have redundant systems in more than one
datacenter. As they say, it is unavoidable to have power or connectivity
disruptions in a datacenter. This is why reliable configurations have
redundancy in another datacenter elsewhere in the world.

~~~
nemothekid
Given the dependency in question is Redis, such a solution is probably
complicated by the fact that Redis has never really had a decent HA story.

This is also hidden by the fact that Redis is _really_ reliable (in my
experience at least). It usually takes an ops event (like adding more RAM to
the redis machine) to reveal where critical paths have quietly come to lean
on Redis as a crutch.

~~~
yeukhon
A lot of tools and services people use either don't have HA at all or don't
have native support for true distributed HA. But that doesn't stop people
from building HA-like solutions. I am not sure what they use Redis for, but
for caching and key-value storage they must have figured out how to
invalidate data, otherwise they'd be running only a single instance of Redis.
I.e. they are running "HA" just within a single data center, so logically
speaking that shouldn't be difficult to port over to another data center.

~~~
mwpmaybe
I'm not familiar enough with Redis's clustering features to speak to the exact
issues with what you're proposing, but generally speaking, HA is almost a
completely different problem than disaster recovery (DR). Sure, the protocol
is the protocol, but you wouldn't want to cluster local and remote nodes
together for several reasons, primarily latency, security, and resiliency.
Performance will suffer if they're clustered together and a single issue could
take down nodes in both data centers, which kind of defeats the purpose.

What you really want is a completely separate cluster running in a different
data center (site). It should be isolated on its own network and ideally it
should have different admin rights/credentials and a different software
maintenance (patching) schedule. A completely empty site isn't much use so
you'll need some kind of replication scheme. Naturally, these isolating steps
make site replication difficult. You might patch one site and now the
replication stream is incompatible with the other site. (You can't patch both
sites at the same time because the patch might take down the cluster.) Or
whatever you're using to replicate the sites, which has credentials to both
sites, breaks and blows everything up. You need a way to demote and promote
sites and a constraint on only one site being the "master" at a time. What
happens if network connectivity is lost between sites? What happens if one
site is down for an extended period of time? Maybe you need a third, tie-
breaking site?
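
To make the quorum idea concrete, here's a toy version of the "only one
master site at a time" rule with a third, tie-breaking site. Everything here
is invented for illustration; in practice you'd lean on a consensus store
rather than rolling your own:

    # Each observer site reports which data site it can reach.
    # A data site is promoted only with a majority of observers,
    # so a partition can never produce two masters at once.
    def pick_active_site(votes):
        tally = {}
        for observer, seen in votes.items():
            if seen is not None:
                tally[seen] = tally.get(seen, 0) + 1
        majority = len(votes) // 2 + 1
        winners = [site for site, n in tally.items() if n >= majority]
        return winners[0] if winners else None  # None: nobody promoted

    # West is cut off from everyone; east keeps the tie-breaker's vote.
    assert pick_active_site(
        {"east": "east", "west": None, "tiebreaker": "east"}) == "east"

    # Split-brain candidate: each side sees only itself. No majority, so
    # neither site is promoted and both stay read-only.
    assert pick_active_site(
        {"east": "east", "west": "west", "tiebreaker": None}) is None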

Once you work through these issues, you are still exposed to user error. Your
replication scheme might be perfect... perfect enough that an
inadvertently dropped table (or whatever) is instantly replicated to the other
site and is now unrecoverable without going to tape. Maybe you introduce a
delay in the replication to catch these oopsies, but now your RPO is affected.
Anyway, it's a bit of a shell game of compromises and margins of error.

Source: 10 years designing and building HA/DR solutions for Discover Card.

------
danielvf
For all that work to be done in just two hours is amazing, especially with
degraded internal tools, and both hardware and ops teams working
simultaneously.

~~~
imbriaco
You're absolutely right.

We should collectively be using incidents like this as an opportunity to
learn, much like the GitHub team does. Our entire industry is held back by the
lack of knowledge sharing when it comes to problem response and the fact that
so many companies are terrified of being transparent in the face of failure.

This is a very well-written retrospective that gives us a glimpse into the
internal review that they conducted. Imagine how much we could collectively
learn if everyone was fearless about sharing.

~~~
totally
relevant:

[https://codeascraft.com/2012/05/22/blameless-postmortems/](https://codeascraft.com/2012/05/22/blameless-postmortems/)

------
DarkTree
I don't know enough about server infrastructure to comment on whether or not
Github was adequately prepared or reacted appropriately to fix the problem.

But wow it is refreshing to hear a company take full responsibility and own up
to a mistake/failure and apologize for it.

Like people, all companies will make mistakes and have momentary problems.
It's normal. So own up to it and learn how to avoid the mistake in the future.

~~~
eric_h
As I said in another comment, the fact that they found an 8 minute delay from
outage to status page update to be unacceptable speaks volumes to how much
they value their relationship with their customers.

As an aside, I feel quite fortunate to work in the EST timezone, as their
outage apparently started at about 7pm my time. We have a general rule at my
company not to deploy after 6pm unless an emergency fix absolutely needs to
go up.

I saw the title of the story and said to myself, what outage? :P

------
pedalpete
Does Github run anything like Netflix's Simian Army against its services? As
a company by engineers, for engineers, at the scale Github has reached, I'm a
bit surprised they don't have a bit more redundancy. Though they may not need
the uptime of Netflix, an outage of more than a few minutes on Github could
affect businesses that rely on the service.

~~~
imbriaco
Google "Netflix downtime" for evidence that Netflix also has outages. Google
has outages, sometimes very significant ones of Google Apps. Facebook has
outages.

Complex systems fail. Period. All the time. Things like the Simian Army are
fantastic tools that help you identify a host of problems and remediate them
in advance, but they cannot test every combinatorial possibility in a complex
distributed system.

At the end of the day, the best defense is to have skilled people who are
practiced at responding to problems. GitHub has those in spades, which is why
they could respond to a widespread failure of their physical layer in just
over 2 hours.

The biggest win with the Simian Army isn't that it improves your redundancy.
It's that it gives your people opportunities to _practice_ responses.

~~~
drdrey
More than practicing responses, Chaos Monkey and Failure Injection Testing
allow us to verify that we don't have unexpected hard dependencies. Sometimes
you find out that your service can't start if another one becomes latent, in
which case you can plan for it by adding redundancy/extra capacity, fallbacks
or working in degraded mode.
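
A stripped-down illustration of the idea (names and numbers are invented;
the real fault-injection tooling is far more targeted):

    import random

    # Wrap a dependency call so a fraction of calls fail on purpose,
    # then check that callers degrade instead of refusing to serve.
    def inject_faults(func, failure_rate=0.3):
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("injected fault")
            return func(*args, **kwargs)
        return wrapper

    def fetch_recommendations(user):
        return ["repo-a", "repo-b"]   # stand-in for a remote call

    fetch_recommendations = inject_faults(fetch_recommendations)

    def homepage(user):
        try:
            recs = fetch_recommendations(user)
        except ConnectionError:
            recs = []                 # degraded mode: page still renders
        return {"user": user, "recommendations": recs}

    for _ in range(20):
        assert "user" in homepage("alice")   # healthy or not, we serve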

------
onetwotree
Every time I read about a massive systems failure, I think of Jurassic Park
and am mildly grateful that the velociraptor paddock didn't depend on the
system's operation.

~~~
mattdeboard
Well as long as you're not Samuel L. Jackson in that scenario you should be
fine. Ish.

~~~
onetwotree
Samuel L. Jackson taught me everything I know about ethics in software
engineering.

Including the principle that if your software breaks, you're the one who has
to go get savaged by velociraptors to fix it.

------
mjevans
This just shows how difficult it is to avoid hidden dependencies without a
complete, cleanly isolated testing environment of sufficient scale to
replicate production operations and run strange system fault scenarios
somewhere that won't kill production.

~~~
ones_and_zeros
Or use the Netflix model: Chaos testing in production.

~~~
toomuchtodo
No system is perfect; as you continue to add 9s, the cost increases steeply.

Usually it's just cheaper to be down for an hour or two, versus architect for
the end of times.

~~~
knodi123
> Usually it's just cheaper to be down for an hour or two, versus architect
> for the end of times

The opposite of this philosophy was the motivation behind creation of the
internet in the first place.

~~~
jessaustin
This seems precisely wrong. Some reading:

[http://web.mit.edu/Saltzer/www/publications/endtoend/endtoen...](http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf)

[https://www.jwz.org/doc/worse-is-better.html](https://www.jwz.org/doc/worse-is-better.html)

[thanks for the hint 'thinkpad20! I don't know what I was thinking.]

~~~
knodi123
It is not precisely wrong, and thanks for tricking me into opening an obscene
picture at work, asshole.

The internet is designed to be highly fault tolerant, because it was based on
an ARPANET project to design a network that would NOT go down, even if there
was damage to a significant percentage of nodes.

~~~
jessaustin
The "asshole" in this case is JWZ, [randomly?] switching on the Referer
header. Apparently he has a hard-on for HN; he's not the only one, but I won't
be linking to his site again. (Although, is that really "obscene"? It doesn't
do anything for me?) Try this instead, since Stanford are unlikely to engage
in such shenanigans:

[https://web.stanford.edu/class/cs240/old/sp2014/readings/wor...](https://web.stanford.edu/class/cs240/old/sp2014/readings/worse-is-better.html)

It's funny, my original comment had the links in plaintext so copying-and-
pasting was required and Referer wasn't involved. I changed that on request.
b^)

------
viraptor
> ... Updating our tooling to automatically open issues for the team when new
> firmware updates are available will force us to review the changelogs
> against our environment.

That's an awesome idea. I wish all companies published their firmware
releases as simple RSS feeds, so everyone could easily integrate them with
their trackers.

(If someone's bored, that may be a nice service actually ;) )
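
Most of it could look something like this sketch (the feed URL, repo, and
token are made up; it assumes the feedparser and requests libraries):

    import feedparser
    import requests

    FEED = "https://vendor.example.com/firmware.rss"  # hypothetical feed
    REPO = "myorg/ops"                                # hypothetical repo
    TOKEN = "..."                                     # tracker API token

    seen = set()  # persist this somewhere real (file/db)

    def poll_once():
        for entry in feedparser.parse(FEED).entries:
            if entry.id in seen:
                continue
            seen.add(entry.id)
            # Open a tracker issue so the changelog gets reviewed.
            requests.post(
                "https://api.github.com/repos/%s/issues" % REPO,
                headers={"Authorization": "token %s" % TOKEN},
                json={"title": "Firmware update: %s" % entry.title,
                      "body": entry.link},
            )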

~~~
vhost-
This was one of the toughest things about admining hardware clusters.
Firmware updates (and firmware issues) are so hard to track down. It's so
annoying. I remember spending a week tracking down an issue with a RAID
controller and then spending another day or two on the phone with the vendor
trying to get a firmware update so we didn't have 2 racks of hardware sitting
on a ticking time bomb.

------
matt_wulfeck
> Remote access console screenshots from the failed hardware showed boot
> failures because the physical drives were no longer recognized.

I'm getting flashbacks. All of the servers in the DC reboot and NONE of them
come online. No network or anything. Even remotely rebooting them again got
us nothing. Finally getting a screen (which is a pain in itself), we saw they
were all stuck on a grub screen. Grub had detected an error and decided not
to boot automatically. Needless to say we patched grub and removed this
"feature" promptly!

------
gaius
You can very clearly see two kinds of people posting on this thread: those who
have actually dealt with failures of complex distributed systems, and those
who think it's easy.

------
Animats
_" We identified the hardware issue resulting in servers being unable to view
their own drives after power-cycling as a known firmware issue that we are
updating across our fleet."_

Tell us which vendor shipped that firmware, so everyone else can stop buying
from them.

~~~
gruez
I'm guessing they didn't disclose the vendor because they didn't want to be
sued for defamation.

~~~
Animats
Truth is an absolute defense to libel in the US.

~~~
mikeash
It doesn't stop you from getting sued, though, it merely stops you from
losing. It's pretty reasonable to want to avoid a lawsuit you're absolutely
certain you could win.

~~~
Animats
Vendors very seldom sue customers for publicly saying their product is
defective. The negative publicity tends to backfire. Legal action can backfire
even worse. If the vendor claims the product isn't defective, they have to
prove that in court to win a libel action. That means discovery and
examination of the company's internal documents and the complaints of other
customers, all on the record.

------
merqurio
I feel it was a good incident for the Open Source community, to see how
dependent we are on GitHub today. I feel sad whenever I see another large
project like Python moving to GitHub, a closed-source company. I know GitLab
is there as an alternative, but I would love to see all the big Open Source
projects putting pressure on GitHub to open their source code, as right now
they are a big player in open source, like it or not.

~~~
BinaryIdiot
Git is a distributed version control system. Github is simply a place to host
a repository and some issues. There is nothing stopping anyone from pushing to
another remote hub for redundancy.

So you want Github to open source where they put your git repo and issues?
Who cares about that? It's unimportant because regardless they're still the
central endpoint for many open source projects, open or closed source. If you
want open source, use Gitlab or any other service that sprinkles extra
features around git.

I'll never understand this outrage over dependence on Github when you have a
distributed version control system. It's not like it should be on Github to
set up third-party repositories for you.

~~~
nissehulth
From a developer's point of view, you're right. But there are package
management systems and other stuff that depend on being able to download from
Github.

Of course, Github isn't to blame for this, rather the ones who thought Github
would be great to use as a CDN.

~~~
Animats
_" But there are package management systems and other stuff depending on being
able to download from Github."_

Rust program building, for example, seems to require that Github be up.

~~~
steveklabnik
Only if you need to fetch new dependencies.

~~~
Animats
Or do a clean rebuild.

------
rqebmm
It must be nice to know that the majority of your customers are familiar
enough with the nature of your work that they'll actually understand a
relatively complex issue like this. Almost by definition, we've all been
there.

------
dsmithatx
If only Bitbucket could give such comprehensive reports. A few months back
outages seemed almost daily. Things are more stable now; I hope that holds
for the long term.

~~~
viraptor
Isn't BB's problem basically that there are too many users? GH's outage
writeup is cool because it's a one-off and it can be analysed. When BB is
just overloaded for a long time and needs more power, it's not going to be
very interesting.

(Unless I missed some specific non-capacity-related outages?)

~~~
yeukhon
Maybe. Bitbucket was also an acquisition, so for some time I believe there
was a lack of resources provided to them and a huge technical
debt/integration effort required. At this point, I don't know if Atlassian
actually cares much about Bitbucket. They are probably more concerned about
delivering Stash than Bitbucket, my wild guess.

I was an active BB user a couple years ago, and the project I worked on would
hg clone from BB many times a day, so I would be the first one to notice a
503 or whatever error coming from their service. Typically I would see one or
two outages per month, some lasting a few minutes, some lasting several
hours. Most of the time the outage impacted git/hg checkout, so I think that
was their technical bottleneck.

~~~
lhc-
FYI, Stash is now Bitbucket Server, and the plan as I've heard it is to work
towards feature parity between the two.

~~~
mattdeboard
We use Stash and it is surprisingly not bad at all. Github is much more
polished, but for code browsing and review, Stash does what it's supposed to
do.

------
guelo
Weird that they didn't say what caused the power outage and what the
mitigations are for that.

~~~
gsibble
I'm also confused about how the racks would lose power. Surely they had UPSes.

~~~
abrookewood
Generally speaking, I'd recommend AGAINST running UPSes in racks that are
managed by top-tier data centres. I've had way more trouble with UPSes
misbehaving than I ever have with data centres losing power. EDIT: I'd also
point out that 2 hours is a long time to be running on in-rack UPSes. I've
usually seen them designed to withstand about an hour, but not much more.

~~~
WatchDog
The power outage was only brief: enough to halt the servers, but much shorter
than the 2-hour outage window.

------
tmsh
> Over the past week, we have devoted significant time and effort towards
> understanding the nature of the cascading failure which led to GitHub being
> unavailable for over two hours.

I don't mean to be blasphemous, but from a high level, are the performance
issues with Ruby (and Rails) that necessitate close binding with Redis (i.e.,
lots of caching) part of the problem?

It sounds like the fundamental issue is not Ruby, nor Redis, but the close
coupling between them. That's sort of interesting.

~~~
byroot
No, the fundamental issue is that an application should not require any
external service in order to boot.

It has nothing to do with Ruby, or Rails, or even Redis. It's just a design
flaw in the application, one that you often learn about the hard way.
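
A minimal sketch of the distinction (invented names; it assumes the redis-py
client, and GitHub's stack is Ruby, so this is only the shape of the fix):

    import redis

    client = redis.StrictRedis(host="redis.internal")  # hypothetical host
    _flags = None

    # The broken pattern is calling Redis at import/boot time, e.g.
    #     FLAGS = client.hgetall("feature_flags")
    # which makes process startup depend on Redis being up.

    def feature_flags():
        """Fetch lazily with a fallback, so only a request degrades."""
        global _flags
        if _flags is None:
            try:
                _flags = client.hgetall("feature_flags")
            except redis.ConnectionError:
                return {}  # degraded default; startup is never blocked
        return _flags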

------
cognivore
Um, work from your local cache for a few hours? Isn't that one of the main
reasons for git?

~~~
majewsky
Not all processes that involve GitHub are development processes. I've seen
automated deployments fail inside a corporate network when the resident HTTP
proxy had a bad day and could not connect to github.com.

------
timiblossom
If you use Redis, you should try out Dynomite at
[http://github.com/Netflix/Dynomite](http://github.com/Netflix/Dynomite). It
can provide HA for Redis servers.

------
rurounijones
I would have expected there to be a notification system owned by the DC that
literally sends an email to clients saying "Power blipped / failed".

That would have given them immediate context instead of wasting time on DDoS
protection.

------
spydum
So, while it sounds like they have reasonable HA, they fell down on DR.
Unrelated: I could not comprehend what this means: "technicians to bring
these servers back online by draining the flea power to bring"

Flea power?

~~~
Someone1234
I assume they mean completely disconnect the equipment from ALL external power
sources. Typically even when a piece of equipment is offline in a data center,
it continues to draw power, and will often keep running systems like DRAC and
other management/status tools (since the whole concept of a data center is
NEVER having to get up out of your chair, so even a "shutdown" system needs to
be able to be remotely started).

Since the firmware had a bug, bad state could be stored; completely removing
power may clear that state, and appears to have done so in this case. They
may have also needed to pull the backup battery and reset the firmware
settings, but I wouldn't presume that just from the term "flea power."

~~~
spydum
sure enough, it's a real term, and it's relatively old..
[http://answers.google.com/answers/threadview/id/185999.html](http://answers.google.com/answers/threadview/id/185999.html)

I have never known what to call this, but have definitely been engaged in
draining a few fleas.

Also, I can't believe it's been that long since google answers has been
closed..

------
tonylxc
TL;DR: "We don’t believe it is possible to fully prevent the events that
resulted in a large part of our infrastructure losing power, ..."

This doesn't sound very good.

~~~
jrockway
If your plan to avoid downtime is to prevent power outages, you're going to
have downtime. All their sentence says is they can't prevent power outages.
That's fine, because the other 1/nth of your servers are on a different power
grid in a different state.

~~~
tonylxc
I totally share the view that the best way to avoid failure is to embrace it
and cope with it.

It is true that their whole sentence is about recovery; however, it is
disappointing that they didn't mention anything about a redundant datacenter.

------
mattdeboard
Anyone have a link to a description of the firmware bug that caused the disk-
mounting failure after power was restored?

~~~
ymse
I'm going to guess that these are Dell R730xd boxes with PERC H730 Mini
controllers (LSI MegaRAID SAS-3 3108).

A failed/failing drive present during cold boot could cause the controller to
believe there were no drives present. To add insult to injury, on early BIOS
versions this made the UEFI interface inaccessible. The only way to recover
from this state was to re-seat the RAID controller.

There were also two bizarre cases where the operating system SSD RAID1 would
be wiped and replaced with an NTFS partition after upgrading the controller
firmware (and more) on an affected system (hanging/flapping drives). Attempts
to enter UEFI caused a fatal crash, but a reinstall (over PXE) worked fine. A
BIOS upgrade from within the fresh install restored it.

From the changelog:

    
    
        Fixes: 
        - Decreased latency impact for passthrough commands on SATA disks
        - Improved error handling for iDRAC / CEM storage functions
        - Usability improvements for CTRL-R and HII utilities
        - Resolved several cases where foreign drives could not be imported
        - Resolved several issues where the presence of failed drives could lead to controller hangs
        - Resolved issues with managing controllers in HBA mode from iDRAC / CEM
        - Resolved issues with displayed Virtual Disk and Non-RAID Drive counts in BIOS boot mode
        - Corrected issue with tape media on H330 where tape was not being treated as sequential device
        - resolved an issue where Inserted hard drives might not get detected properly.

------
TazeTSchnitzel
> We had inadvertently added a hard dependency on our Redis cluster being
> available within the boot path of our application code.

I seem to recall a recent post on here about how you shouldn't have such hard
dependencies. It's good advice.

Incidentally, this type of dependency is unlikely to happen if you have a
shared-nothing model (like PHP has, for instance), because in such a system
each request is isolated and tries to connect on its own.
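
A rough sketch of the contrast, with Python standing in for PHP and all
names invented:

    import redis  # assumes the redis-py client

    def handle_request(key):
        # Shared-nothing style: the connection is made per request, so
        # there is no process-wide boot path for Redis to block.
        client = redis.StrictRedis(host="redis.internal")  # hypothetical
        try:
            return client.get(key)
        except redis.ConnectionError:
            return None  # one request degrades; nothing shared breaks

A Redis outage can then fail individual requests, but it can never stop the
application process from starting.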

------
totally
> Because we have experience mitigating DDoS attacks, our response procedure
> is now habit and we are pleased we could act quickly and confidently without
> distracting other efforts to resolve the incident.

The thing that fixed the last problem doesn't always fix the current problem.

~~~
dgritsko
Occam's razor isn't a bad rule of thumb, however.

------
swrobel
Anyone got a good tl;dr version?

~~~
alblue
Power outage in the DC brought many machines down. Redis clusters failed to
start owing to disk issues (not cleanly unmounted?). The reboot of the
remaining machines uncovered an unknown dependency: the machines needed the
Redis cluster to be up in order to boot.

There were other learning points, such as immediately going into anti-DDoS
mode, and human communication issues that meant the problem wasn't recognised
or escalated until some time after the issues started occurring.

------
jargonless
What is this "HA" jargon?

I would STFW, but searching for "HA" isn't helpful.

~~~
xzlzx
You could google "HA", click on the Wikipedia link that shows all the things
"HA" may refer to, and deduce that the most logical thing in the list, given
the context, would be this link:
[https://en.wikipedia.org/wiki/High_availability](https://en.wikipedia.org/wiki/High_availability).

~~~
dgritsko
Would it have been so hard to just type "high availability" rather than making
him feel bad for being one of today's 10,000?
[https://xkcd.com/1053/](https://xkcd.com/1053/)

~~~
ryanlol
Why would you feel bad about not being familiar with an abbreviation?

~~~
bcook
The statement about having poor deductive logic skills was the more insulting
part of the post (compared to ignorance of an initialism, which I think you
are correct in thinking is insignificant).

~~~
xzlzx
Oddly enough, I wasn't stating that the person had poor deductive logic. I was
stating the exact steps I took to find the answer myself.

------
julesbond007
I seriously doubt this version of the story. While it's possible for several
pieces of hardware/firmware to fail in all your datacenters, for them to fail
at the same time is highly unlikely. This may just be PR spin to make people
think they're not vulnerable to security attacks.

While this was happening at Github, I noticed several other companies facing
the same issue at the same time. Atlassian was down for the most part. It
could have been an issue with a service Github uses, but they won't admit
that. Notice they never said what the firmware issue was, instead blaming it
on 'hardware'.

I think they should be transparent with people about such a vulnerability,
but I suspect they would never say so because then they would lose revenue.

Here on my blog I talked about this issue:
[http://julesjaypaulynice.com/simple-server-malicious-attacks...](http://julesjaypaulynice.com/simple-server-malicious-attacks/)

I think it was some ddos campaign going on over the web.

~~~
dandandan
They're not hosted in multiple datacenters; there was a power interruption in
their single datacenter that exposed this firmware bug. The point of this
postmortem isn't the initial power interruption but rather its repercussions,
why it took so long to recover from and how they can improve their response
and communications in the future.

~~~
julesbond007
Ok... so this is more PR... without admitting the issue. I don't know
Github's infrastructure, but they have a single point of failure? Last I
knew, every place these days has backup power, especially a datacenter... so
those weren't working either? My point is that it's much better to be upfront
sometimes. In fact Github didn't have to say anything about the whole thing
since everyone had already forgotten...

