
Cause of today's Github outage - jlangenauer
https://github.com/blog/744-today-s-outage
======
aaronbrethorst
Ouch. I think we've all done this once or twice, in some fashion or another.
I'm just happy they're so open about it. Learning experience == good thing.

From Chris' Twitter stream (<http://twitter.com/#!/defunkt>):

> _Seriously, I blame whoever wrote our crappy continuous integration software._

> _Oh that's me_

~~~
nigelsampson
Exactly, I don't think you've really lived till you've experienced that pit of
your stomach feeling when you realise you've just wiped out a product website
/ database.

Thankfully for me it was a small website and no one really noticed. I can't
imagine what that feeling would be like on something like github.

~~~
SkyMarshal
_> Exactly, I don't think you've really lived till you've experienced that pit
of your stomach feeling when you realise you've just wiped out a product
website / database._

Or sent a test email to thousands of customers in your prod database
encouraging them to use web check-in for their non-existent flight tomorrow.

Yeah, did that five years ago, talk about heart-attack-inducing. Quickly
remedied by sending a second email to the same test set, thankfully, but
that's the kind of mistake you never forget.

~~~
resdirector
I have a strict rule for myself: never use any curse words in any comments,
variable names, dummy accounts etc.

~~~
jrockway
I have the opposite strict rule: use as many curse words in comments, variable
names, and dummy accounts as possible. That way you'll find out quickly when
someone else notices!

~~~
1337p337
I recommend against that. :) A team I was on had a demo for a client, and a
shaky database schema. I had used, as a test account, the username
"MOTHERFUCKINYEAH" for the same purpose. The purging of this account caused a
few 500 errors, and we almost lost the client, and although I didn't get
fired, we were all shuffled around after that to less critical projects, and
one of us actually got sent out of the state.

------
burgerbrain
Good thing git is distributed. I've been working on my code all day and never
even noticed!

~~~
axod
So what's the point of using github if you don't notice it's offline?

~~~
burgerbrain
Github provides a convenient place for me to publish my work to the world,
allowing others to pull it as they desire (my work machine is a laptop, and
can never be relied upon to be up, or at any particular address). Because git
is distributed, my workflow is not affected in the slightest by "somebody
else's" (github's) completely separate repo being down.

Contrast this with something older like subversion and sourceforge. If
sourceforge went down you were shit out of luck.

------
donw
This is why it's important to isolate production from other environments.
Three rules have kept me from ever borking a production database:

1\. Production DB credentials are only stored on the production appservers,
and copied in at deploy time.

2\. The production DB can only be accessed from the IPs of the production
webfarm.

3\. Staging, Testing, Development, and Everything Else live on networks and
machines separate from production.
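
A minimal sketch of rules 1 and 2, assuming MySQL; the hosts, subnet, and
credentials are all made up:

    # Rule 2: the production DB user only exists for the webfarm's subnet,
    # so a connection attempt from a CI or dev box is refused outright.
    # (Note: no DROP privilege granted, either.)
    mysql -h prod-db -u root -p <<'SQL'
    CREATE USER 'app'@'10.0.1.%' IDENTIFIED BY 'prod-only-secret';
    GRANT SELECT, INSERT, UPDATE, DELETE ON app_production.* TO 'app'@'10.0.1.%';
    SQL

    # Rule 1: credentials live only on the deploy host and get copied onto
    # the production appservers at deploy time, never committed anywhere.
    for host in app1 app2 app3; do
      scp /secure/database.yml deploy@"$host":/var/www/app/shared/config/
    done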

~~~
oddi
While this is part of the problem, to me it seems like they didn't have proper
restore procedures or at least hadn't tested them enough. There are countless
ways to corrupt a database and restoring from backup is part of most remedies.

------
peterwwillis
If someone ever produces a good book of best practices for sysadm/syseng,
please provide examples like these of why it's important to follow these best
practices.

Yes, we've all made silly mistakes. But if you're in that design meeting and
somebody asks, should we do ABC in case of XYZ, try not to think about how
complicated or time consuming it might be to do ABC. Think about the worst
case. If not doing it could at some point bring down the whole business,
perhaps you should ponder it some more.

Actually, screw a book... Does anyone else want to start some wiki pages on
their experiences with screw-ups, the causes, and the solutions? Does this
exist in a comprehensive way and I just haven't found it?

~~~
michael_dorfman
While a book of screw-ups might be amusing, I think it might be more
instructive to look at the "old school" ways that screw-ups like this were
avoided.

I worked, back in the dark ages, for a health insurer that had two parallel
environments-- one for testing/development, and one for production. None of
the developers were allowed anywhere near the production environment. There
was one full time employee, a former developer, whose primary job was to move
code into production-- which he would only do when he received a signed
document authorizing the change. Said document included the telephone numbers
where the developer responsible for the code would be for the next 24 hours,
so that one of the operators could call you in the middle of the night if your
code caused any problems to the system.

At the time, I used to think that this was ridiculous. After managing a staff
of programmers myself, I'm not so sure.

~~~
viraptor
It depends on the software a bit, but I got a taste of a "programmers not
allowed on production" environment (on a very small scale). The problem I ran
into is that you have no idea what's going on in the production environment.
Some characteristics may be reproducible on the dev site, but actual users
will always do something differently. Sometimes not being able to poke the
live system in a specific way while it's live will cost you weeks of guessing
in the dev environment.

Then again, there is no perfect solution for this, is there? If you tried
installing a new version with more debugging points, you'd be deploying
something unstable over something previously unstable, and trying to push more
data out, which might be problematic in itself. I'm not even going into
"rollback to stable didn't work" scenarios :(

~~~
peterwwillis
In my experience the best solution for a large-scale dynamic site is a
combination of read-only access to production and deploy management.

Deploy management combines a change management system with a deployment tool
in a flexible way. So for example, with the right options, in an emergency the
manager of a team of developers could deploy anything to the site immediately
without requiring approval from change management. The tools are still right
there to revert any change, and of course there are snapshots and daily
backups for the most critical data.

Except for emergencies, all changes to production would come with an approval
from a higher-up with potential code review back-and-forth first. Contact data
and reversion capabilities are built in, so everybody knows who did what, how
to contact them and how to revert it if they're unavailable. And of course
your trending data will tell you when your peak use is and code pushes are
typically frozen during that time, minimizing further potential loss.

However, besides having read-only access to production, devs should also have
two kinds of testing: "development" and "staging". Development is where the
bleeding-edge broken stuff lives and code is written. Staging is an
_identical_ machine to those in production. Often you'll see test or qa
machines which aren't identical to production, usually because changes aren't
pushed to them the way they are in production. The staging machine gets all
changes pushed to it like any other production machine, except it lives in a
LAN that cannot access production or anything else. A method to reproduce
incoming requests and sessions from the production system on this staging
server will give you a pretty good idea of what "real traffic" looks like on
this box, if you need it.
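
One crude way to get that "real traffic" replay, as a sketch; the log path and
format are assumptions, and this is only safe for idempotent GETs:

    # Replay the read-only part of production traffic against staging,
    # pulled from a combined-format access log.
    awk '$6 == "\"GET" {print $7}' /var/log/nginx/access.log |
      while read -r path; do
        curl -s -o /dev/null "http://staging.internal$path"
      done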

------
tlb
Forthright and classy. Compare to register.com, which had a big DNS outage
Friday (affecting anybots.com) and never admitted to a problem.

~~~
invisible
I believe they had a message on their homepage during the event about being
DDoSed [1], but yeah, the lack of any after-the-fact remarks is kind of ugly in
my opinion. They could at least post something and hide it from the general
customer base.

1) <http://seclists.org/nanog/2010/Nov/415>

------
seanmcq
Lesson: don't let your CI machine talk to your production servers (firewalls
are good at this).
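
For example, a minimal rule pair on the production DB host (addresses are
hypothetical) that makes the CI box physically unable to connect:

    # Allow MySQL only from the production webfarm subnet, drop the rest.
    iptables -A INPUT -p tcp --dport 3306 -s 10.0.1.0/24 -j ACCEPT
    iptables -A INPUT -p tcp --dport 3306 -j DROP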

~~~
nettdata
In my environment, our devs (individuals or environments/subnets) don't have
access to PROD or QA, and our CIT boxes are in DEV. Likewise, QA and PROD only
have access to their own environments.

We have a build master that promotes a reviewed deployment package to QA
and/or PROD environments, where the appropriate QA or PROD operations folks do
the actual deployment.

It's a luxury to have the resources available for this, but it's a life saver,
because it really is stupidly easy to make a simple mistake and totally screw
things up.

The last time something similar happened to me, it happened to be at the end
of a REALLY long day. And what do you know... that day was then made 24 hours
longer, interspersed with the occasional cat nap while backups were being
restored and verified.

Fun times. Not.

~~~
smokinn
I'm not so sure it's a luxury. Maybe it is if your startup is servicing a
group of techies who know what problems lie in the background, but most
clients don't know and don't care.

It's not that hard to keep up, really. It's a question of a day or two of
setup and then the hardish part: constant discipline to not take the "easy way
out" and poke holes in the segregation you've set up. Mainly it takes a single
team lead or CTO or whatever being really explicit that you just don't break
the steps, and you'll avoid a LOT of problems. You'll still have problems,
problems are inevitable, but in general you'll have mitigated them, and with a
proper backup and merge procedure you'll minimize downtime.

~~~
nettdata
Agreed, but to be clear, the luxury I was referring to was specifically having
the DEV and PROD staff be totally separate teams/individuals, not just the
separation of environments. In other words, nobody on the DEV team did any
PROD operations, except in the case of bug investigation, tuning advice, etc.

------
latch
I agree with what's been said so far: 1 - shit happens; 2 - we've all done
stupid stuff; 3 - the testing environment shouldn't have access to production.

What hasn't been said is how refreshing it is to see an honest and quick
explanation. I know this type of approach is getting more and more common (see
the foursquare outage), but in the grand scheme of things, it's still quite
rare.

~~~
chesser
I have a slightly higher threshold for considering something _refreshing_.

It took several levels of basic mistakes for this to happen AND for the
restore to be as slow as it is.

Using MySQL with no transactions, no binary backups, no way to do a quick
restore, and no separation of dev/production. DROP as opposed to DELETE cannot
be rolled back and is therefore scary -- unless you aren't using transactions
in the first place, in which case, WHEE!
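
To make the distinction concrete, a sketch against a scratch InnoDB table
(assumes a throwaway `test` schema; `--force` keeps the client going past the
expected error):

    mysql --force test <<'SQL'
    CREATE TABLE t (id INT) ENGINE=InnoDB;
    INSERT INTO t VALUES (1), (2);

    START TRANSACTION;
    DELETE FROM t;           -- DML: undoable while the transaction is open
    ROLLBACK;
    SELECT COUNT(*) FROM t;  -- 2: the rows came back

    START TRANSACTION;
    DROP TABLE t;            -- DDL: implicitly commits; ROLLBACK can't save you
    ROLLBACK;
    SELECT COUNT(*) FROM t;  -- ERROR 1146: the table no longer exists
    SQL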

~~~
matwood
I don't have much MySQL experience, but I have lots of experience with other
RDBMSs. Large restores take a long time no matter what. They shouldn't take
days, but since we don't know how large the events table was, we have no way
of knowing if a faster restore was possible. A hot restore, which sounds like
what they are doing, may take longer by the simple fact that it's restoring
while the table is in use.

And DROPs (or TRUNCATEs) are almost always used if the goal is to remove all
the data in a table, as you would want when rebuilding the entire system.
Something like a 100M-record transaction is not generally considered a good
thing.

~~~
chesser
They take a long time if you have to rebuild the indexes, which is what you
have to do if you only have a text dump.

For MySQL, using the default storage engine (MyISAM), you can just do a file
copy if you bothered to do a proper backup of the binary files.
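
A sketch of such a binary backup with the stock mysqlhotcopy script (paths are
assumptions); it takes a read lock, copies the .frm/.MYD/.MYI files, and
releases:

    mysqlhotcopy app_production /backups/$(date +%F)

    # Restoring is then a file copy plus a flush, not an hours-long reload:
    #   cp /backups/2010-11-14/app_production/* /var/lib/mysql/app_production/
    #   mysql -e "FLUSH TABLES;"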

------
random42
I am a software developer, so I know "shit happens", but having the same
database configuration as the testing environment (same superuser name and
password), with no isolation from the test environment, is pretty criminal
even for a first-time mistake, IMHO, especially for a product like GitHub,
which businesses small and big trust with a business-critical piece (their
repositories).

If I were running some critical code, I would seriously reconsider GitHub, or
at least ask for a detailed explanation of their engineering practices and
fail-safe mechanisms.

~~~
1337p337
I can only hope it shocks some sense into kids that use GitHub for
distribution rather than putting a tarball named $name-$version.tgz (or bz2 or
xz or whatever). As much as I love GitHub, it has been the bane of my
automated-build existence. I don't want to ever have to make a build script
that guesses at a SHA1 (or punts by doing a clone at depth zero) again.

~~~
yosh
git ls-remote will help you to get the hash of HEAD. It's much cheaper than
doing a zero-depth clone.
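
For example (repo URL made up):

    # Ask the remote for its refs without fetching a single object:
    git ls-remote git://github.com/user/repo.git HEAD
    # prints something like:  f5b7fc2...  HEAD

    # Tags work too, so a build script can pin a release by hash:
    git ls-remote --tags git://github.com/user/repo.git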

------
jlangenauer
I think it's a measure of the goodwill in the community for Github (and
perhaps, the fact that a lot of us have done something similar in the past)
that they won't cop much flak at all for this.

~~~
binaryfinery
I don't know. I use github, but my paid, private repos are elsewhere. The fact
that someone, anyone, can run against the production system and nuke it raises
some basic questions about password storage. I don't run a site anything like
github, but my production and test databases have different passwords and none
of them are stored in a way that the test environment could get access to the
live db, nor could the tests be run on the production environment. There's bad
luck, and there's asking for trouble.

~~~
alextgordon
If github spent their time making sure their site raised absolutely no basic
questions, they'd still be in beta now.

~~~
chesser
To me, that sounds like encouraging a race to the bottom.

When you're in the business of storing other people's data, things like
transactions, binary backups, QUICK RESTORES, and _separating dev from
production_ shouldn't be afterthoughts. They are core attributes.

I would understand this for a demo, but not after a couple of years and half a
million users. They have paying customers.

 _"Reliable code hosting

We spend all day and night making sure your repositories are secure, backed up
and always available."_

~~~
risotto
The half a million users validates their techniques, no matter how much
armchair quarterbacking.

The code repos were never in danger, and they've been killing it in the market
because they are racing to add awesome extra features, not racing to the
bottom.

I don't disagree that they can do more in their db operations, and that it's
fine for us paying customers to demand more, but the reality of startups is
that it very much is a race and features, for availability or recovery or
otherwise, are viciously prioritized, and many things don't happen until
something breaks.

If you don't think Github cuts the mustard, svn on Google Code probably won't
have problems like this...

Disclaimer: I'm personal friends with much of the Github crew.

~~~
chesser
> _The half a million users validates their techniques,_

Let me introduce you to my friend, GeoCities.

Funny thing about free hosting.

> _no matter how much armchair quarterbacking._

Transactions aren't armchair quarterbacking. Binary backups aren't armchair
quarterbacking. Separating development from production isn't armchair
quarterbacking. Kindly, you have no idea what you're talking about.

You can't give them credit for the repositories being mostly intact, when the
ONLY parts that broke were the parts they mucked with to tie them into their
database.

> _but the reality of startups is that it very much is a race and features,
> for availability or recovery_

And those are exactly the places where they screwed up.

~~~
risotto
Are you sitting in a chair? Do you not work at Github? Then you're armchair
quarterbacking. We all are. Even if you are a quarterback for another team
(I'm a DBA myself).

Anyway, it's a great discussion, so we can learn from other people's mistakes.
I will be triple-checking my restores later today, and likely halt the project
I'm working on to get cold standbys shored up ASAP (but it's hard to
prioritize housekeeping over customer-centric features in the race I'm running
alongside the Github crew).

~~~
chesser
"a pejorative modifier to refer to a person who experiences something
vicariously rather than first-hand, or to a casual critic who _lacks practical
experience_ "

Ironically, the most fitting application would be to say that they're armchair
quarterbacking their own database administration.

~~~
burgerbrain
Unless you work for github, you lack practical experience in github's internal
operation.

~~~
chesser
Fortunately, there is this thing called "science" which means we can
understand things about the world regardless of where we live. As Dawkins
would say, there is no such thing as "Chinese Science" or "French Science",
just science. Similarly, there is no such thing as "Github MySQL" or "Github
separation of production and development systems" in that same sense.

These are _categorical_ mistakes.

------
jorangreef
I moved my repos off GitHub to my EC2 server a month or two back since they're
private and I was only using GitHub for keeping a copy of my code offsite.
It's faster for simple push/pull and considering the sunk cost of my EC2
server, also free. I was trying to browse some repos on GitHub yesterday
during the downtime and was thankful that my own were still available.

~~~
jorangreef
Also, if you just need something for replicating code, Amazon is offering free
micro servers on new accounts for up to a year I think.

------
woan
rookie mistake, better security if they isolate their networks too...

~~~
gregwebs
I don't know why you first got down-voted. I have never worked on a project
where a testing environment could access the production db.

~~~
storborg
I think he got downvoted for calling the GitHub team "rookies". Sure, they may
not have tons of experience running 5-nines systems, but they've clearly built
a great product that a lot of people love and respect.

~~~
chesser
They may not be rookies as Ruby hackers, but the evidence _clearly_ points to
them being rookies from the standpoint of data robustness. That's embarrassing
enough when it's _not_ your core business offering.

They even brag about it on their main page.

------
sankara
Maybe it's foresight or maybe it's just paranoia. We've always used an
entirely different username/password in prod, and the prod password never sits
in the config files. That's saved us a couple of times. Sometimes it doesn't
require a highly sophisticated setup to prevent a catastrophe.

------
bytesong
Minutes before the outage, my account appeared non-existent and all my
repositories were gone. They really scared the hell out of me.

This will be a good reminder for me to _always_ keep a local copy.

~~~
SkyMarshal
Since it's git-based, is it even possible to not keep a local copy? I thought
everything that goes into github has to be committed to your local git repo
first?

~~~
bytesong
Well, there is always a local copy since you have to commit to the local repo
first before you push it to github. That is, of course, as long as you don't
remove it for no apparent reason.

------
dacort
Second time I've heard of this happening fairly recently. Another incident,
same cause:
<http://www.bigdoor.com/blog/bigdoor-api-service-has-been-restored/>

~~~
spudlyo
This was pretty painful. Their backups were too old, so it was necessary to do
InnoDB data recovery rather than a straightforward restore from backup.

Since InnoDB table spaces never shrink, 80G of their truncated data was still
all available in a single monolithic ibdata file. An InnoDB recovery tool
named page_parser read their 80G ibdata file and spit out a maze of 16k InnoDB
page files organized by an arbitrary index id.

There are two internal InnoDB meta tables called SYS_INDEXES and SYS_TABLES
which can give you a mapping from table name to PK index id. Unfortunately
after the mass TRUNCATE all the tables got new index mappings, so it became a
bit of table hide-and-seek.

The InnoDB recovery tools lack a certain polish and maturity. You need to
create a C header file for each table you want to recover from the pile of 16k
page files. You end up having to build a separate version of the
constraints_parser binary for each table. There were bugs with the output of
negative numbers, unicode handling, VARCHAR types with >= 127 characters, and
some edge cases where InnoDB chooses to store CHAR(15) types as VARCHARs
internally. Aleksander at Percona really saved the day; he was able to find
and fix these bugs pretty quickly.

I remember that magic moment when I finally was able to successfully checksum
a huge block of the recovered data against the too-old-to-be-useful backup.

"I love the smell of CRC32 in the morning. It smells like... victory."

------
jrockway
I try to make my apps work against SQLite _and_ the production database, so I
can run all my tests against an in-memory SQLite database. This makes the
tests run Really Fast, and it prevents a configuration error from causing my
production data to go away.

(It's not possible to do this in every case, especially if you make heavy use
of stored procedures and triggers, but I don't. If I need client-independent
behavior or integrity checks on top of the database, I just use a small RPC
server. This makes testing and scaling easier, since there are just simple
components that speak over the network. Much easier than predicting everything
that could possibly happen to the database.)
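
A sketch of that arrangement; DB_ADAPTER/DB_NAME are hypothetical knobs your
app's config layer would have to honor:

    # Day to day: every test run gets a fresh in-memory SQLite database,
    # so a misconfigured run physically can't touch production.
    DB_ADAPTER=sqlite DB_NAME=':memory:' prove -lr t/

    # Occasionally: the same suite against a throwaway MySQL schema to
    # catch engine-specific differences.
    DB_ADAPTER=mysql DB_NAME='app_test' prove -lr t/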

------
geophile
Anyone know what database system they use?

~~~
blutonium
MySQL. Not that using anything else would have prevented this or would help
restore the table faster...

~~~
trezor
Incorrect. The article states that they _lost data_ between backup time and
whenever the incident happened.

Any reasonable-quality transactional database system has a transaction log
_for a reason_. If you were using a commercial DB system like Microsoft SQL
Server or Oracle (even a decade ago), this would not be an issue. No data
loss. Businesses should care about their data, and I guess this is why the
commercial databases are still doing fine in a landscape increasingly
dominated by FOSS everywhere.

I realize licensing costs do matter, but I can't fathom why people put up with
the sad excuse for an RDBMS that is MySQL. For any nontrivial task it is slow,
it's unreliable, attempting to secure your data by taking database backups
(which seemingly can't even provide you with transactional safety!!!) renders
the DB unusable and locked while the backup is performed, having databases
bigger than what you can store in memory makes it perform like a flat file,
etc. etc. ad infinitum.

Surely there must be something better people can use which is still free?
Postgres for instance?

~~~
bigiain
FWIW, even MySQL lets you store "binlogs", which allow you to "rerun" all
commands which changed the data since your last backup -- if you've configured
it to.

Which makes me think: last time I had anything to do with adminning MySQL,
binlogs were required for replication, so I'm guessing this means github isn't
replicating that database anywhere either...
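
A sketch of the point-in-time recovery that binlogs buy you; file names and
timestamps are made up:

    # Prerequisite in my.cnf:  log-bin = mysql-bin
    # After restoring the last full backup, replay everything up to just
    # before the destructive statement:
    mysqlbinlog --start-datetime="2010-11-14 02:00:00" \
                --stop-datetime="2010-11-14 11:58:00" \
                /var/log/mysql/mysql-bin.000042 | mysql -u root -p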

------
jacques_chester
Well done on coming clean. However this is why the dinosaur pens have such
arduous red tape -- to try and catch serious errors before they hit
production. A mate of mine works in that world and he regularly stops code
going into production that would hose mission-critical government data.

I prefer my agility to remain on the dev-and-test side of the fence.

------
1337p337
It kind of makes me wish NILFS2 would become production-ready faster. Give
MySQL its own partition, and just roll back to a previous checkpoint if you
wipe everything. Not a substitute for backups, but a pretty speedy way to
recover for a minor snafu like this.

~~~
jacques_chester
NILFS is very nifty for write-oriented work, but I'm not sure if that's their
workload.

But if you're logging stuff, NILFS rocks:
<http://lists.luaforge.net/pipermail/kepler-project/2009-June/003452.html>

------
fizx
MySQL 5.6+ has delayed replication. Until then, there are always tools like:

<http://www.maatkit.org/doc/mk-slave-delay.html>
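
Both approaches in brief; host names are made up, and MASTER_DELAY is the
documented 5.6 syntax:

    # MySQL 5.6+: native delayed replication, run on the slave:
    mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 3600; START SLAVE;"

    # Pre-5.6: Maatkit's mk-slave-delay holds the slave's SQL thread back:
    mk-slave-delay --delay 1h --interval 15s slave-host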

------
ammmir
Simply checking whether you're talking to a production instance could avert
something like this: keep some metadata in the db about whether the data
stored there is production, and at what version and deployment level, so tests
can do a sanity check before destructive activities.
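
A sketch of such a guard in a test harness; the table and column names are
hypothetical:

    # Refuse to run destructive fixtures unless the target database
    # identifies itself as a test instance.
    env=$(mysql -N -e "SELECT value FROM app_metadata WHERE name='environment'" app_db)
    if [ "$env" != "test" ]; then
      echo "refusing destructive tests: database says it is '$env'" >&2
      exit 1
    fi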

~~~
ammmir
why the downvote?

~~~
seldo
Presumably because the comment is obvious, unhelpful, and somewhat mean-
spirited. I don't think anybody here needs a lesson in how to avoid
accidentally dropping your production database.

~~~
ammmir
thanks for the explanation, i didn't mean to be evil. my suggestion might be
obvious to some, but definitely not to everyone, considering this isn't a
singular occurrence.

------
mkramlich
Kudos to them for having the guts to say publicly that they accidentally
destroyed their production database.

It's been a great service, and I think as long as this kind of thing is rare,
and none of my code repositories get corrupted or destroyed, I plan to stick
with them.

------
chunkbot
I thought systems written in Erlang never go down! ;-)

