
GitHub availability this week - tanoku
https://github.com/blog/1261-github-availability-this-week
======
pbiggar
I know that they have to be apologetic like this, but the simple fact is that
GitHub's uptime is fantastic.

I run <http://CircleCi.com>, and so we have upwards of 10,000 interactions
with GitHub per day: API calls, clones, pulls, webhooks, etc. A seriously
small number of them fail. They know what they're doing, and they do a great
job.

------
cagenut
I'd like to welcome the github ops/dbas to the club of people who've learned
the hard way that automated database failover usually causes more downtime
than it prevents.

Here's sort of the seminal post on the matter in the MySQL community:
<http://www.xaprb.com/blog/2009/08/30/failure-scenarios-and-solutions-in-master-master-replication/>

Though it turns into an MMM pile-on, the tool doesn't matter so much as the
scenarios. Automated failover is simply unlikely to make things better, and
likely to make things worse, in most scenarios.

~~~
ghshephard
Automated database failover is absolutely mandatory for HA environments (as
in, there is no way to run a 5 9s system without it) but, poorly done, results
in actually reducing your uptime (which is a separate concept from HA).

I've been in a couple of environments in which developers have successfully
rolled out automated database failover, and my takeaway is that it's usually
not worth the cost - with very, very few exceptions, most organizations can
take the several minutes of downtime needed to do a manual failover.

In general, when rolling out these operational environments, they are only
ready once you've found and demonstrated 10-12 failure cases and come up with
workarounds.

In other words - if you can't demonstrate how your environment will fail, then
it's not ready for an HA deployment.

~~~
Xorlev
Every HA deployment I've done, the HA manager inevitably had issues to begin
with. It takes time, patience, and a few late nights.

~~~
ghshephard
With the possible exception of life safety systems, credit card processing,
stock exchanges, and other "High $ per second applications" - I just don't see
getting HA right on transactional databases as worth the effort. Properly
rehearsed, a good Ops/DBA team (and, in the right environment, a NOC team) can
execute a decent failover in just a few minutes - and there aren't that many
environments (with the exceptions listed above) that can't take two or three
5-minute outages a year.

The alternative is your HA manager decides to act wacky on you, and your
database downtime is extended.

For some reason - this rarely (almost never, in my practical experience) is a
problem with HA systems in networking. With just a modicum of planning, HA
Routers, Switches, and Load Balancers Just Seem to Work (tm).

Likewise, HA Storage arrays are bullet proof to the point at which a lot of
reasonably conservative companies are comfortable picking up a single
array/frame.

But HA transactional databases - still don't seem to be there.

------
WestCoastJustin
Here are the makings of a bad week (Monday of all things)

\- MySQL schema migration causes high load, automated HA solution causes
cascading database failure

\- MySQL cluster becomes out of sync

\- HA solution segfaults

\- Redis and MySQL become out of sync

\- Incorrect users have access to private repositories!

Cleanup and recovery take time; all I can say is, _I'm glad it was not me who
had that mess to clean up_. I'm sure they are still working on it too!

This brings to mind some of my bad days: the OOM killer decides your Sybase
database is using too much memory; a hardware error on a DRBD master causes
silent data corruption (this took a lot of recovery time on TBs of data). I've
also been bitten by a MySQL master/slave becoming out of sync. That is a bad
place to be in: do you copy your master database to the slaves? That takes a
long time even on a fast network.

~~~
cageface
This kind of thing is one of the main reasons I prefer to do app development
instead of backend work now. I don't get calls at 3am any more.

------
andrewljohnson
The lack of any negative response on this thread is a testament both to the
thoroughness of the post-mortem, and the outstanding quality of GitHub in
general.

In GitHub we trust. I can't imagine putting my code anywhere else right now.

~~~
gbog
I like github too but please remember that things come and go. Some time ago
it was SourceForge that was hot.

~~~
mckoss
... but never as well loved.

------
cwb71
The part of this post that really blew my mind:

    
    
      We host our status site on Heroku to ensure its availability
      during an outage. However, during our downtime on Tuesday
      our status site experienced some availability issues.
    
      As traffic to the status site began to ramp up, we increased
      the number of dynos running from 8 to 64 and finally 90.
      This had a negative effect since we were running an old
      development database addon (shared database). The number of
      dynos maxed out the available connections to the database
      causing additional processes to crash.
    

Ninety dynos for a status page? What was going on there?

~~~
mbell
Anyone tested S3's static page hosting under heavy load? I would think you
could just update the static file as a result of some events fired by your
internal monitoring process.
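
The upload side is only a few lines; a rough sketch with boto3 (the bucket
name, key, and cache settings are just placeholders):

    import json
    import boto3

    def publish_status(state):
        # Render the current status as a single JSON object and push it to S3.
        # "status-bucket" and "status.json" are hypothetical names; the short
        # CacheControl keeps browsers/CDNs from holding a stale copy for long.
        body = json.dumps({"overall": state["overall"],
                           "services": state["services"]})
        s3 = boto3.client("s3")
        s3.put_object(
            Bucket="status-bucket",
            Key="status.json",
            Body=body.encode("utf-8"),
            ContentType="application/json",
            CacheControl="max-age=5",
        )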

~~~
dustym
We use S3 behind CloudFront with a 1-second max-age to serve The Verge
liveblog. It's been nothing but rock solid. We essentially create a static
site and push up JSON blobs. See here:

<http://product.voxmedia.com/post/25113965826/introducing-syllabus-vox-medias-s3-powered-liveblog>

~~~
spicyj
This is really interesting -- thanks for sharing. It seems to me that you
could probably have nginx running on a regular box and then CloudFront as a
caching CDN to avoid the S3 update delay.

~~~
dustym
Probably could figure that out, yeah. But we didn't want to take any chances
given how important it was to get our live blog situation under control.

[edit]

Which is to say, we wanted a rock solid network and to essentially be a drop
in a bucket of traffic, even at the insane burst that The Verge live blog
gets.

------
druiid
Well, I have to say... replication-related issues like this are why I/we are
now using a Galera-backed DB cluster. No need to worry about which server is
active/passive: you can technically have them all live all the time. In our
case we have two live and one failover that only gets accessed by backup
scripts and some maintenance tasks.

Once we got the kinks worked out it has been performing amazingly! Wonder if
GitHub looked into this kind of a setup before selecting the cluster they did.

~~~
aaronblohowiak
any details on the kinks you worked out?

~~~
druiid
Sure. Maybe I should do a writeup for it on my blog at some point in the near
future :).

The two main issues we encountered both had to do with search for
products/categories on our sites. The first was that Galera/WSREP doesn't
support MyISAM replication (it has beta support, but I wouldn't trust it).
This meant we had to transition our fulltext data to something else - in this
case Solr, which has been a much better solution anyway (the fulltext-based
search was legacy, so I can count that as a win).

The second issue, and the one that was causing random OOM crashes, was partly
due to a bug and partly due to the way the developer responsible for the
search changes implemented things. The bug is that Galera doesn't specifically
differentiate between a normal table and a temp table. When you have very
small, short-lived temporary tables that are created and truncated before the
creation of the table has replicated across the cluster, it can leave some of
these tables open in memory (memory leak, whoo!). We were able to fix this and
have been happy ever since.

If there's any interest I can do a larger writeup about actual implementation
of the cluster, caveats and the like.

~~~
sciurus
Consider this an expression of extreme interest on my part.

~~~
gsibble
+1

------
aaronblohowiak
If Github hasn't gotten their custom HA solution right, will you?

Digging into their fix, they disabled automatic failover -- so all DB failures
will now require manual intervention. While that addresses this particular
(erroneous) failover condition, it raises the minimum downtime for true
failures. Their MySQL replica's misconfiguration upon switching masters is
also tied to their (stopgap) approach to preventing hot failover. So the
second problem was due to a misuse/misunderstanding of maintenance mode.

How is it possible that the slave could be pointed at the wrong master and
have nobody notice for a day? What is the checklist to confirm that failover
has occurred correctly?
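
Even a small automated check would answer that; a rough sketch (PyMySQL, with
made-up hostnames and thresholds) that could run from cron after any failover:

    import pymysql

    EXPECTED_MASTER = "db-master-01.internal"  # hypothetical hostname
    MAX_LAG_SECONDS = 30

    def check_replica(host, password):
        conn = pymysql.connect(host=host, user="monitor", password=password,
                               cursorclass=pymysql.cursors.DictCursor)
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
        problems = []
        if status is None:
            problems.append("replication is not configured")
        else:
            if status["Master_Host"] != EXPECTED_MASTER:
                problems.append("replica points at %s" % status["Master_Host"])
            if (status["Slave_IO_Running"] != "Yes"
                    or status["Slave_SQL_Running"] != "Yes"):
                problems.append("replication threads are stopped")
            lag = status["Seconds_Behind_Master"]
            if lag is None or lag > MAX_LAG_SECONDS:
                problems.append("replication lag: %s" % lag)
        return problems  # page someone if this list is non-empty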

There is also a lesson to be learned in the fact that their status page had
scaling issues due to DB connection limits. Static files are the most
dependable!

~~~
autotravis
"There is also lesson to be learned in the fact that their status page had
scaling issues due to db connection limits. Static files are the most
dependable!"

Seriously, why would a status page need to query a db?

~~~
gsibble
I assume that the status server is not actively checking every Github
server/service whenever someone pings it. It probably polls the servers every
X seconds. The best place to store that type of data is in a DB.

Where else would you put it?

~~~
aaronblohowiak
> It probably polls the servers every X seconds.

And then you could write out a new static file, just once, and send it to your
edge server of choice.
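
Roughly like this (the endpoints and file path are invented; the resulting
file is what you'd sync to S3 or an edge cache):

    import json
    import time
    import urllib.request

    SERVICES = {
        "web": "https://example.com/health",       # placeholder endpoints
        "api": "https://api.example.com/health",
    }

    def poll_once():
        results = {}
        for name, url in SERVICES.items():
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    results[name] = "up" if resp.status == 200 else "degraded"
            except OSError:
                results[name] = "down"
        return results

    while True:
        # Rewrite one static file; no database and no per-request work.
        with open("/var/www/status/index.json", "w") as f:
            json.dump({"generated_at": time.time(),
                       "services": poll_once()}, f)
        time.sleep(30)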

------
jyap
"As traffic to the status site began to ramp up, we increased the number of
dynos running from 8 to 64 and finally 90."

Wait, why isn't there some caching layer? E.g. generate a static page or use
Varnish.

This part makes no sense at all.

At most you're then firing up another 5 dynos (or none) to handle the traffic.
90 is ridiculous.

------
jluxenberg
_"16 of these repositories were private, and for seven minutes from 8:19 AM to
8:26 AM PDT on Tuesday, Sept 11th, were accessible to people outside of the
repository's list of collaborators or team members"_

ouch!

~~~
nslocum
One of those repos was mine. :( Fortunately it was a fresh Rails app without
anything important. However, it does make me rethink the security of storing
my code on github.

~~~
mckoss
I store proprietary code on github, but I would never recommend storing actual
_secrets_ (like keys or passwords).

------
dumbluck
This was the awesome kind of explanation about what went wrong and what was
learned that I wish everyone would do.

------
donavanm
The update strategy of master first is interesting. I've always seen it the
other way: update the standby, flip to the standby, verify, update the
original master. Auto-increment DB keys once again cause horribleness; nothing
new there, I suppose. And as mentioned, the multi-dyno + DB-read status page
is craaaazy. Why oh why isn't this a couple of static objects? Automagically
generate and push them if you want. Give 'em a 60-second TTL and call it a
day. Put them behind a different CDN & DNS than the rest of your site for
bonus points.

------
akoumjian
I would love to know more about this two pass migration strategy.

~~~
jnewland
We use <https://github.com/soundcloud/large-hadron-migrator/>
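
For the curious, the rough idea behind this class of tools (a hand-wavy
sketch, not LHM's actual code, and skipping the write-capture step a real tool
needs while the copy runs) is: build a shadow table with the new schema,
backfill it in small primary-key chunks so no single statement stalls
replication, then swap the tables atomically:

    # Illustrative only: invented table/column names; the trigger/journal
    # step that captures writes during the copy is omitted.
    import pymysql

    conn = pymysql.connect(host="localhost", user="app", password="secret",
                           database="app")
    cur = conn.cursor()

    # 1. Shadow table with the new schema.
    cur.execute("CREATE TABLE users_new LIKE users")
    cur.execute("ALTER TABLE users_new ADD COLUMN locale VARCHAR(8) "
                "NOT NULL DEFAULT 'en'")

    # 2. Backfill in small chunks so no single statement holds locks or
    #    stalls replication for long.
    cur.execute("SELECT MIN(id), MAX(id) FROM users")
    lo, hi = cur.fetchone()
    chunk = 1000
    for start in range(lo, hi + 1, chunk):
        cur.execute("INSERT IGNORE INTO users_new SELECT *, 'en' FROM users "
                    "WHERE id BETWEEN %s AND %s", (start, start + chunk - 1))
        conn.commit()

    # 3. Atomic swap; keep the old table around in case of rollback.
    cur.execute("RENAME TABLE users TO users_old, users_new TO users")
    conn.commit()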

------
cschep
Interesting to read about github using MySQL instead of Postgres. Anyone know
why? I am just curious because of all the MySQL bashing I hear in the echo
chamber.

~~~
technoweenie
Mostly because of legacy reasons, at this point.

~~~
lonnyk
Do you have a source for this information?

~~~
technoweenie
I have the source code, yes :)

------
gbog
Genuine question: GitHub is built upon git, which is a rock-solid system for
storing data, and in these reports we read that GitHub relies a lot on MySQL,
so... did the GitHub guys ponder using git as their data store? Just an
example: in git one can add comments on commits, so would it be possible to
use that for the GitHub comment function? Or maybe it already is?

~~~
holman
Generally, Git will be way too slow for that. Git is typically our bottleneck,
since you're dealing with so much overhead and disk access to perform
functions.

Databases are best for, well, performing relational queries. In the case of
commenting on a commit, if you store them only in the repository it becomes
non-trivial to ask "show me all of the comments by this user" unless you have
an intermediary cache layer (in which case you're back where you started).
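
A toy illustration of the difference (SQLite with a made-up schema, not the
real thing): the query below is one indexed lookup, whereas answering it from
data stored only inside each repository means walking every repository's
history.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE comments (
                      id INTEGER PRIMARY KEY,
                      user_id INTEGER,
                      commit_sha TEXT,
                      body TEXT)""")
    db.execute("CREATE INDEX idx_comments_user ON comments (user_id)")

    # "Show me all of the comments by this user" is a single indexed lookup.
    rows = db.execute("SELECT commit_sha, body FROM comments "
                      "WHERE user_id = ?", (42,)).fetchall()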

~~~
gbog
Thanks for answering. Tell me if I'm wrong but MySQL would be behind a caching
layer anyway, so the choice would be between cached git or cached git + mysql.

In git, logging commits on a file from an author is also a kind of join, and
it is surprisingly fast, so using git as a data store is a weird idea that I
cannot take out of my head.

------
lokotecla1
What is this page for? I'm new here.

