For me, the motivation for transparency came from too many frustrating instances of being kept in the dark after things had gone wrong. The worst thing both during and after an outage is poor communication, so I do my best to explain as much as I can about what is going on during an incident, and about what happened after one is resolved.
There's a very simple formula that I follow when writing a public post-mortem:
1. Apologize. You'd be surprised how many people don't do this, to their detriment. If you've harmed someone else because of downtime, the least you can do is apologize to them.
2. Demonstrate understanding of the events that took place.
3. Explain the remediation steps that you're going to take to help prevent further problems of the same type.
Just following those three very simple rules results in an incredibly effective public explanation.
Everybody suffers from downtime. It's how you handle it that matters.
I also find it interesting that they don't use any additional out-of-band network for heartbeats/management, especially given how unstable their layer-two network has been. It sounds like the file servers need a secondary, stable heartbeat network, even if it's only 10/100. No judgement being passed here; it just seems like a lot of eggs in one basket. That being said, thank you for this write-up and for sharing so openly and honestly. Happy GitHub customer here!
EDIT: Yes, I know routed networks can have similar problems but they are designed for routing, pathing and redundancy with a lot less overhead on the broadcast domain.
Freeze prod changes two weeks before and two weeks after all major holidays.
Your employees probably don't appreciate the hassle when all they are thinking about is "YEAH! DAYS OFF!"
Just my opinion and how I run my systems in the DC.
Architecturally that may mean many things - hitting "SHIP IT!" might push code into a staging environment for some final testing before serving it up on a platter in production. Should you have multiple sites, it might involve rolling out the new stuff to just one of them until you see how it goes. Maybe you have feature flags and want to introduce a new change to all servers, but to just 1% of the user population?
Fundamentally hitting "SHIP IT!" should be doing just that. Any constraints you put on how fast it gets to 100% of the user population are a risk control, and you need to optimize for a balance of developer happiness and system stability.
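To make the 1% idea concrete, here's a minimal sketch in Python (the function and flag names are hypothetical, not anyone's actual flag system): hash the user ID so the same user consistently lands in or out of the rollout, then dial the percentage up as confidence grows.

    import hashlib

    def in_rollout(user_id: str, feature: str, rollout_pct: int) -> bool:
        # Stable bucket in [0, 100): the same user always gets the same
        # answer for a given feature, so the rollout doesn't flicker.
        digest = hashlib.sha1(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < rollout_pct

    # Ship to 1% of users first, then raise the number as confidence grows:
    # if in_rollout(user.id, "new-thing", 1): ...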
When you concede "We can't make changes because we're frozen" outside of a critical systems ('life critical') environment, you should quit your IT job and go become a fisherman or something.
I am talking about the infrastructure side of things.
I have built large-scale percentage-deployment, slice-deployment (whatever you want to call them) scenarios like you speak of, but modifying an AGG switch that provides connectivity to your entire prod space... Uh, go ahead and use your philosophy for managing large infrastructure and I will enjoy my days off, thanks.
This change is not a SHIP IT! change. This is a switching infrastructure upgrade. This is not a push from your CI into your rolling rel. system that updates prod applications.
This is an underlying infrastructure change with high impact and high visibility with many stakeholders at risk.
Sorry for any confusion that my, very vague, post caused.
Maybe someday I will become a fisherman. But for now, I will keep these switches and servers up and running with 99.999% uptime. It is what I love to do!
Got any fishing tips?
I would agree that is not where most folks are at today.
I would argue the far more interesting discussion is how we develop and mature tools to get more folks there in future.
I agree that we should be moving to automated infrastructure testing and stuff like that. To some extent, it may be possible via puppet/chef/auto tools; however, not all infrastructure is like that. Sometimes you have to go physically move stuff during your downtime window, and you can't do redundant wiring (particularly for network). I've been bitten by network outages more than anything else, particularly partial/undetected failures.
I think we're seeing a move to the "treat infrastructure as code" future, such as clustered file servers (NetApp 8 Cluster-Mode, or Isilon systems). You'll be able to "seamlessly" migrate data and virtual interfaces around without impacting production. I'm looking forward to seeing how that changes ops.
I've always found that if you compensate people for it, it's not a problem. I personally would prefer 2-3 weeks extra vacation some other time in exchange for working through the holidays. It's way cheaper to travel in Jan/Feb, too.
It depends on your team, though -- if it's a bunch of people with kids who have school holidays at certain times, it might be more of an issue.
Github's a service provider, and I would be surprised if none of their customers had peak traffic over Christmas for this or a similar reason.
In any case, I agree with nixgeek; you shouldn't ever expect outages when upgrading infrastructure. On the other hand, you also have to measure the pain of fixing an outage, and that's likely to be higher during the holidays because of reduced staff availability.
Also, they most likely scheduled this at this time due to the lower traffic, probably the lowest of the year for them. While half of Github was probably enjoying time with their families, the other half had been planning this for a long time. Given the size of the operation I don't think anyone took it lightly.
Prod code push != prod infrastructure changes. Which is what the article is talking about. Specifically the agg. switching layer.
My reply is not about code deployments. It is about managing network devices with high visibility and impact.
I still stand by my original comment with a critical detail added:
Freeze prod ~infrastructure~ changes two weeks before and two weeks after major holidays.
Push code all you want.
The RFO that they provided addresses link aggregation changes which are a part of an infrastructure change.
When I say prod, I mean prod infrastructure. Not code.
Sorry for the confusion.
It seems redundancy protocols end up grappling with each other more often than not.
Unfortunately, there is no easy answer, and I'm sure Github employs people with lots of experience.
It makes me wonder how, after so many people have worked on problems like this, it's still such a challenge.
The more latency you can tolerate, the easier HA becomes. At an extreme, if you can tolerate one minute of latency, then each request can come in and compute the most likely way to complete it with the most authoritative set of actors. Few people, though, are willing to tolerate a commit taking 15 minutes, much less a couple of hours.
This was one of the most insightful things about NFS and the whole stateless design. By burdening the client with the state the server could be much simpler.
Once you get above a certain size the problems change, becoming both easier and harder (easier in that you can disperse your data further; harder in that your confidence in agreement between copies takes longer to compute, and thus increases latency). It would be interesting if Google shared their work on Spanner (they seem to have tackled this problem at a large scale), and given Netflix's experience (Chaos Monkey's dad), it seems like Amazon still hasn't quite gotten the recipe right.
It is a deliciously thorny problem with subtle complexity and unexpected inter-dependencies.
For some systems, I'd take getting rid of small outages -- I'll happily take an increased risk of a projected 15 minute loss of heart function becoming a >60 minute loss of heart function if it also eliminates what would otherwise be a bunch of 5 minute losses of heart function, since even the 5 minute interruptions would be fatal.
(Or, for a better example, revolvers vs. semi-autos. A revolver generally is more reliable, but if it goes out of timing, it's basically doomed, whereas a semi-auto can jam or pieces can break, but a monkey can clear, and a trained monkey can fix.)
The top-level systems probably need to be able to deal with increased latency or timeouts, and properly handle retries and throttling of traffic.
If you have some HA failover setup going but your alternate is already being used more for load balancing than for failover, problems like this will occur.
(I used to work on failover drivers for a SAN).
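For what handling retries and throttling might look like at the top level, here's a minimal retry-with-backoff sketch (names hypothetical; a sketch, not a prescription):

    import random
    import time

    def call_with_retries(fn, attempts=5, base_delay=0.5):
        for attempt in range(attempts):
            try:
                return fn()
            except TimeoutError:
                if attempt == attempts - 1:
                    raise  # give up and let the caller throttle or queue
                # jittered exponential backoff keeps a fleet of retrying
                # clients from stampeding the recovering service in sync
                time.sleep(random.uniform(0, base_delay * 2 ** attempt))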
All the failover problems I can think of since they moved to this architecture 4 years ago have been coordination problems, where something undesired happened when transitioning from one member of a pair to another. In this case, network problems led them to a state where both members of a pair thought they were the master.
I have seen facility-scale UPSes go bad, and sometimes in weird ways, but an order of magnitude less frequently than grid power.
I think we reached a crossover point where designing for facility-scale survivability vs. replicating facilities ceased to be worthwhile for most Internet applications sometime in the past 10 years. It doesn't really make sense to drop $2b on a ~100k sf datacenter like AboveNet used to do for e.g. 365 Main. There are still some systems where replication is a pain, but even for those, I think metro area replication shouldn't be that hard. Even just running 5km of fiber in a loop between a few buildings in the same town gets you a huge amount of resilience against most facility problems.
We used to use Heartbeat in a similar setup back in 2001. It was the worst architectural decision we ever made, and after one failure too many (where STONITH/split-brain/etc. killed the wrong machine, or both machines) we threw it out.
TL;DR: This will happen again. Guaranteed.
We ended up with dedicated serial cards, and multiple network links, just so STONITH worked properly. And even then, Heartbeat was buggy as hell back then, so we'd end up in an active-active situation way too many times and end up with FS corruption.
We were running Reiser on DRBD (ahh, the good old days). Had to hack the kernel a fair bit to make it all work.
Anyway, in the end, we abandoned the automated Heartbeat failover stuff and used it only to alert us. We manually ran the failover scripts when a human determined that there was in fact a real failure.
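Something like this, conceptually -- a monitor that only pages a human and never acts on its own (the address and helper are made up for illustration):

    import subprocess
    import time

    PEER = "10.0.0.2"  # assumed heartbeat address of the partner node

    def peer_alive() -> bool:
        return subprocess.call(["ping", "-c", "3", "-W", "2", PEER]) == 0

    def page_oncall(msg: str) -> None:
        print("PAGE:", msg)  # stand-in for whatever alerting you use

    while True:
        if not peer_alive():
            # No STONITH, no automatic takeover: a human decides whether
            # this is a real failure before any failover script runs.
            page_oncall(f"peer {PEER} unreachable -- investigate/failover?")
        time.sleep(10)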
We (in a new place) use our own fail over code now, but we're running databases instead of file servers -- and you usually don't want to STONITH a database.
For file servers we use Gluster, and it works great for us (I'm sure there are reasons GitHub are doing active-passive DRBD) and you don't need STONITH since it's shared-nothing.
Gluster will just throw its hands up during self-heal if that happens, and you'll need to resolve it manually.
Your filesystem structure won't break, but your files certainly can.
That would be similar to comparing 2.4 kernel problems to the most recent 3.7 kernel. Likely not an overly useful anecdote.
For the record we use heartbeat at work with no issues such as this.
Software quality improves, sure, but you can still learn from the past.
Our heartbeat links are segregated from the public network with separate network cards/switches, so the specific issue GitHub hit here isn't one we would have encountered. We have quad NIC cards in the systems we run for a reason; this issue GitHub hit is one of the various reasons you don't run heartbeat over the public topology. It will bite you in the ass no matter what cluster software you are running.
Unless you also have a disk heartbeat over shared fibre/SCSI, or maybe serial, but same difference. Depending on the public network, though, is a lost cause.
MLAG or any LAG technology (LACP, bonds, whatever) should never impact the deployment of code. It should be invisible when it is working. Obviously it is very visible when it breaks, though.
My heart goes out to the Github guys!
Sorry for the confusion everyone.
The failover network should be totally separate, physically and logically, from the primary network. The heartbeat between file servers should be checked over both the primary network and the failover network. If a server can't be reached by its partner over the primary network, it should be gracefully taken offline by the partner through the failover network.
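A sketch of that decision rule in Python (addresses hypothetical): only fence the partner when it's dead on both the primary and the failover network; dead on one path but alive on the other means the network, not the server, is the problem.

    import subprocess

    def reachable(addr: str) -> bool:
        return subprocess.call(["ping", "-c", "3", "-W", "2", addr]) == 0

    def should_fence(peer_primary: str = "10.0.0.2",
                     peer_failover: str = "192.168.9.2") -> bool:
        # Fencing on the evidence of a single flaky network is exactly
        # how you end up with two nodes that both think they're master.
        return not reachable(peer_primary) and not reachable(peer_failover)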
But I am mildly flabbergasted by the fact that GitHub uses STONITH. The technology is about as safe as an open-core nuclear reactor, and it works reliably only in very simple conditions.
GitHub is doing better by proactively upgrading their network. During the upgrade, they have run in to some technical difficulties. There are maybe, what, a thousand people in the U.S. that have worked on network administration for something GitHub's size? And far fewer who could have predicted this kind of trouble?
If we're going to shit on someone, we should at least make it nutrient-rich: include recommendations based on experience from dealing with problems of that nature and magnitude.
My first recommendation would be setting up a warm standby, or even a hot standby if possible, in another datacenter. It all comes down to state, meaning the data/repos. They are currently using DRBD for block-level replication of the disks in their file servers; this can be done over the WAN using something like DRBD Proxy. Another option is to write something in-house which replicates write requests to git repos hosted elsewhere. They wrote a collection of services to access git repos on the file servers via RPC; it's entirely plausible to extend this. If they are using MySQL anywhere, that's a pretty simple task for master-slave replication, or semi-sync replication in MySQL 5.5. Redis also supports replication.
I'm sort of hinting at high availability with read/write, but starting out with a read-only solution hosted in another datacenter, on the order of 60-120 seconds out of sync, is not impossible. I think we would all benefit from that kind of continuity, right?
Replicate your data to a secondary site, keep it in read-only mode, and set low TTLs in DNS. Take it a step further and start looking at globally distributed file systems if you are feeling adventurous.
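As a sketch of the "write something in-house" option (all names hypothetical): after each accepted push, asynchronously mirror the repo to the standby site instead of relying on block-level replication.

    import queue
    import subprocess
    import threading

    mirror_q: "queue.Queue[str]" = queue.Queue()

    def on_push_accepted(repo_path: str) -> None:
        mirror_q.put(repo_path)  # ack the user now, mirror in the background

    def mirror_worker() -> None:
        while True:
            repo = mirror_q.get()
            # 'standby' would be a git remote pointing at the secondary DC;
            # a failed push can simply be re-queued and retried later.
            subprocess.call(["git", "-C", repo, "push", "--mirror", "standby"])
            mirror_q.task_done()

    threading.Thread(target=mirror_worker, daemon=True).start()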
I guess what you're saying is, in the event of an outage, making the site entirely read-only from another data center. That could be of some use, but I would imagine it would require a complete re-architecture to separate out all writes, and also a method of recombining stuff like logging, which will still be written later on.
On the git level, though, it would be useful for a major use case of Github, which is to grab code.
My point was really from the standpoint of a system- and network-wide outage from which it will take multiple hours to recover. It's about business continuity. I've been making comments from the standpoint of a user and the open source community, but what about from the other side? I guess it's a subscription-like service for paying customers, so downtime has no immediate consequence on revenue, but in places I've worked, downtime is money lost, and customers would actually require us to have SLAs and disaster recovery procedures in place before they would sign contracts with us. Anyway, yeah, the step from single server to multi-server to multiple datacenters does normally involve some re-architecting at the state layer; everything else is load balancing.
One big recommendation I'd offer is to have a secondary network with as few moving pieces as possible (think a dumb unmanaged Netgear switch or two) to run things like heartbeats and DRBD over. That sort of thing should not be living on the same segment as your production traffic.
All of the major vendors' MLAG implementations are limited to two agg switches; assuming you have a pair of 24-port agg switches and a 48-port rack-level switch hanging off each, that still puts total servers at less than 1000.
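One way the arithmetic might run (my assumptions, not necessarily the poster's: each ToR is dual-homed to both agg switches, and each server is dual-homed to its ToR):

    AGG_PORTS = 24   # ports on each of the two MLAG aggregation switches
    TOR_PORTS = 48   # ports on each rack-level (ToR) switch
    UPLINKS   = 2    # ToR ports spent on uplinks, one to each agg switch

    tors = AGG_PORTS                             # one port per ToR per agg
    server_ports = tors * (TOR_PORTS - UPLINKS)  # 24 * 46 = 1104 ports
    dual_homed_servers = server_ports // 2       # 552 -- under 1000
    print(tors, server_ports, dual_homed_servers)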
What stops someone from having multiple zones within a datacenter interconnected with suitably large LAGs thereby avoiding being limited by the port capacity of any one device?
Why is the exact number of servers important to the topic, or is it just to satisfy your curiosity? :)
> There are maybe, what, a thousand people in the U.S. that have worked on network administration for something GitHub's size?
I didn't intend to deflate any nerd egos, just putting things in realistic terms. GitHub is cool. It's the biggest in its industry. It's definitely an interesting problem. But it's still hosted within Rackspace, which in itself isn't that big.
You have to wait a little longer to do your push/pull/merge/etc?
Give them a break.
Another factoid is that Amazon Web Services (AWS) only offers a 99.95% SLA, so where did the 99.99% for GitHub come from?
What's really the impact of Facebook downtime? You have to wait a little longer to "poke" your crush's profile/look at their picture/etc? Give them a break.
Rubygems is kind of unusable without Github.
Several open-source projects' documentation is unavailable without Github.
Part of the beauty of git is that it would be absolutely _trivial_ to have a secondary "backup URL" located somewhere else if Github were down. And git would automatically handle the rest.
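For instance, a deploy script could try the primary URL and fall back to a mirror (URLs hypothetical); git itself doesn't care which one answered:

    import subprocess

    REMOTES = [
        "git@github.com:example/project.git",          # primary
        "git@mirror.example.com:example/project.git",  # backup clone
    ]

    def fetch_with_fallback(repo: str = ".") -> str:
        for url in REMOTES:
            # git fetch accepts a raw URL; no remote config needed
            if subprocess.call(["git", "-C", repo, "fetch", url]) == 0:
                return url
        raise RuntimeError("no remote reachable")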
I can totally understand this kind of thing going wrong, but particularly given the service they provide, why not have a second colo, with a relatively recent clone of the repo, that you can route people to? Heck, you can likely even do an automatic merge once the other repo is working again...
If they went to a second site, having synchronous commit to both sites is how it should be done, no? The extra latency on infrequent git pushes is far less an inconvenience than the possibility of grabbing the wrong code.
I think there'd be a variety of ways to have the system fail until the customer made some kind of adjustment that indicated they grokked that there was a failure (like, say... changing your upstream).
A hypothetical question: why not use something like an 8-core, 64-bit ARM Linux computer as a switch, so that more of the logic resides in software instead?
I for one spent my break watching my email like a hawk, in case further Github outages caused any of my automated deploy scripts to fail. I realize it's my own responsibility to write scripts that handle failure scenarios, but the fact is my company pays Github to host our repositories. Downtime happens; I'm understanding of that. But when it does, I want to know what's going on as soon as that information's available--especially when I'm on holiday.
I don't think a blog post written after they resolved all the issues would have been less accurate. It just would have inconvenienced whoever wrote it on a holiday.
Not the biggest deal in the world--it would take me a lot more than this slip-up to switch from Github. But IMO a service provider should get at least some information out faster than this.
I don't believe any smart company would put out a write-up as quickly as possible if they weren't sure of the facts; spreading unverified information doesn't help anyone. Plus, developers need to focus on fixing the actual problem and the root causes first; pretty write-ups come second.
For up-to-the-second info, the status page should be more than enough.
EDIT: I would just like to say, along with others, I greatly enjoy their postmortems and I feel as though I learn something every time. Kudos to them for being forthright. I host my personal and professional projects with them and am supremely confident that my data is as safe with them as it is with anyone.
At first glance that seems foolish, but to quote you, "complex network topologies" are very prone to falling over badly. Since they all seem to be a one-off custom setup these days, how can you be sure it won't fall over?
Here are the testing approaches I know about:
1. Netflix Chaos Monkey: http://techblog.netflix.com/2012/07/chaos-monkey-released-in... - but that doesn't mean Netflix has it all together. They still have outages.
2. Google does it. They have a team that goes around unplugging network cables and monitoring how fast the engineers can find and fix the problem. I can't dig it up but it was only a few months ago - hey, Google, your search engine can't find an article about you. :)
The lesson seems to be that if you're going to have automated failover, it must go in one direction only. That means you're guaranteed never to get flaps or Mexican standoffs, and determining which node was active is a (relatively) simple question of "did a failover happen?"
I'm sure there's a downside to this, but I can't see one that outweighs the gains from having much simpler failure modes.
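In code, the one-direction rule is basically a latch (a sketch with hypothetical names): once tripped it never resets itself, so there's no flapping, and failing back is an explicit human action.

    class OneWayFailover:
        def __init__(self) -> None:
            self.failed_over = False

        def on_primary_dead(self) -> None:
            self.failed_over = True  # latch trips: standby is promoted

        def on_primary_recovered(self) -> None:
            pass  # deliberately a no-op: no automatic fail-back, ever

        def active_node(self) -> str:
            return "standby" if self.failed_over else "primary"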
I'm curious if they'd even be able to support an Arista network. Although this would probably not be a problem with that kind of network.
(oh, I see you are with GitHub. Somehow I suspect you did not push for this downgrade.)
(followup to sounds: I'm not particularly pro cisco except that all-cisco lets you avoid interop problems. Since at least some of this is cisco, there's a good argument for all cisco.
I've actually used HP and Dell successfully for certain things, but only because I kept it as simple as possible.
In the past using other stuff was necessary at the high end, too, because cisco didn't provide 10GE very well, etc.
What I am against is cheaping out on your ~few aggregation switches, which I've seen other people do, when you're already spending good money on everything else. Maybe it's different once you get to Google or Facebook scale, but I've never been responsible for quite that many switches, and an extra $500-$1000 per 24 ports isn't going to kill you.
I actually prefer non-cisco substantially for routing and for security products.)
Please don't take this the wrong way – can you explain some of your experiences with Cisco?
I sense you're pretty loyal to Cisco. I've had the exact opposite experience, getting rock solid uptime and no-compromises performance from commodity hardware. My experience with Cisco is typified by the IOS command line: hard to use, even harder to hire someone who makes it easy, and every minute you feel the pull on your wallet. The one place Cisco seems to shine is at the boardroom table when negotiating contract renewals. :)
Again, I don't mean to offend.
This isn't so much applicable to datacenter usage, but serving some of our satellite offices requires placing equipment in places where I'd rather not have any hardware; in these situations Cisco has been much more reliable than the HP switches we migrated away from.
I've seen some really wonky behavior in Cisco equipment even in a really small ISP network, but I have no idea if this is normal or not. Maybe one of the very few people that's responsible for a network of GitHub's size could chime in; I'd be interested too.
Whilst everyone prefers to work with a blank canvas, that is a rare opportunity, and one which still has constraints like budget limitations or "We need this by Wednesday!".
All vendors have their foibles. No vendor is perfect. Datasheets are rarely 100% accurate, behaviour fluctuates based on what code train/version you're running on devices. Network vendors are like everyone else - humans - so just like everyone else they break things occasionally too.
One of the best ways to mitigate this as demonstrated by some of the largest players is to have N+1 physical sites and to be able to control traffic distribution to each of them, including taking one entirely offline without your customers noticing that it's happened. It's also extremely hard to do when you start looking at how CAP theorem applies to your given application, and not just the infrastructure changes required to network it all together, but the application changes required to not have it just explode on you.
I think they should be more forward about naming and shaming, but I understand why they'd rather not. Personally I have horrible experience with Nortel/Avaya and would recommend anyone against using their equipment for core switching. The least worst, again in my experience, seem to be Cisco and Juniper.