This may sound selfish, but GitHub does such a great job of writing up post-mortems that I almost look forward to their outages, because I know I'm going to learn a lot when they write their follow-up.
Came here to say the same thing. This is Mark Imbriaco's wonder twin power. It really is something to generate goodwill from an outage writeup. Part of it is just unflinching transparency combined with nerdy details; you never feel like they're hiding anything, and you get to learn about all the operational doodads they're working with to run at this scale.
For me, the motivation for transparency came from too many frustrating instances of being kept in the dark after things had gone wrong. The worst thing both during and after an outage is poor communication, so I do my best to explain as much as I can about what is going on during an incident and what happened after one is resolved.
There's a very simple formula that I follow when writing a public post-mortem:
1. Apologize. You'd be surprised how many people don't do this, to their detriment. If you've harmed someone else because of downtime, the least you can do is apologize to them.
2. Demonstrate understanding of the events that took place.
3. Explain the remediation steps that you're going to take to help prevent further problems of the same type.
Just following those three very simple rules results in an incredibly effective public explanation.
This sort of approach is the reason that when I need to upgrade to a higher plan on Github, I don't flinch. In fact, I love giving you guys more money, simply because you make my life completely painless; I don't think I can say the same about any other service. Keep up the awesome work.
Git itself is a distributed VCS, so the GitHub downtime shouldn't have been too devastating to most people. However, speaking as somebody who uses GitHub Pages a lot and had it go down at one of the worst possible times for me, I can say that this postmortem definitely quenched any possibility of me considering moving somewhere else, since I am confident my data/uptime is as safe with them as it could be with anybody.
Everybody suffers from downtime. It's how you handle it that matters.
Wow, I'm very glad our company chose a routed design with an interior routing protocol (OSPF). I've never been able to push the limits of a layer two network as far as GitHub has. A routed network helps segment things so that when systems fail or a re-convergence mistakenly occurs, only a few racks have problems rather than the entire system. It also makes it easy for us to push routes to our exterior routing protocol (BGP).
I also find it interesting that they don't use any additional out-of-band network for heartbeats/management, especially given how unstable their layer two network has been. It sounds like the file servers need a secondary, stable heartbeat network even if it's only 10/100. No judgement being passed here; it just seems like a lot of eggs in one basket. That said, thank you for this write-up and for sharing so openly and honestly. Happy GitHub customer here!
EDIT: Yes, I know routed networks can have similar problems but they are designed for routing, pathing and redundancy with a lot less overhead on the broadcast domain.
Routed networks tend to fail closed where switched networks tend to fail open. That usually seems to be the problem with layer two failures, they can easily spread traffic everywhere and it takes a while for things to calm down, in the meantime the network hardware is often overwhelmed. With layer three networks the failures tend to be that you lose the ability to route somewhere but in my experience it's often easier to recover from that situation.
Holidays are actually one of the best times to be making changes as traffic is significantly lower, and IMO, one should be aiming for an infrastructure where you can always ship changes without being afraid of the ramifications.
Architecturally that may mean many things - hitting "SHIP IT!" might push code into a staging environment for some final testing before delivering it onto a platter in production. Should you have multiple sites, it might involve rolling out the new stuff to just one of them until you see how it goes. Maybe you have feature flags and want to introduce a new change to all servers, but just 1% of the user population?
Fundamentally hitting "SHIP IT!" should be doing just that. Any constraints you put on how fast it gets to 100% of the user population are a risk control, and you need to optimize for a balance of developer happiness and system stability.
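For the feature-flag angle, here's a minimal sketch (Python, with purely hypothetical names; not anyone's actual implementation) of how a percentage rollout can bucket users deterministically, so hitting "SHIP IT!" still means the code is live everywhere while only a slice of users exercise it:

    import hashlib

    def in_rollout(user_id: str, feature: str, percent: float) -> bool:
        # Hash user + feature into a stable bucket in [0, 1); the same user
        # always lands in the same bucket, so the rollout ramps up smoothly.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0x100000000
        return bucket < percent / 100.0

    # Hypothetical usage: new code path for 1% of users, old path for the rest.
    if in_rollout("user-42", "new-merge-ui", 1.0):
        print("serving new code path")
    else:
        print("serving existing code path")

Ramping the percentage up is then just a config change, which is the risk control without the deploy freeze.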
When you concede "We can't make changes because we're frozen" outside of a critical systems ('life critical') environment, you should quit your IT job and go become a fisherman or something.
I am talking about the infrastructure side of things.
I have built large-scale percentage-deployment / slice-deployment (whatever you want to call them) scenarios like you speak of, but modifying an agg switch that provides connectivity to your entire prod space... Uh, go ahead and use your philosophy for managing large infrastructure, and I will enjoy my days off, thanks.
This change is not a SHIP IT! change. This is a switching infrastructure upgrade. This is not a push from your CI into your rolling rel. system that updates prod applications.
This is an underlying infrastructure change with high impact and high visibility with many stakeholders at risk.
Sorry for any confusion that my, very vague, post caused.
Maybe someday I will become a fisherman. But for now, I will keep these switches and servers up and running with 99.999% uptime. It is what I love to do!
I guess we have differing viewpoints, I see absolutely no fundamental reason why infrastructure should be treated all that much differently to code. It should be possible to fire off a test suite, to automate its deployment, etc.
I would agree that is not where most folks are at today.
I would argue the far more interesting discussion is how we develop and mature tools to get more folks there in future.
Since not everyone here is ops, if your holiday is going to be potentially impacted by a deployment, you are fully aware of that going into the deployment. We take note of people with blacked out dates (e.g. you booked your flight before we ever started talking about this), and everyone else impacted knows what's on the docket. While the issues are sudden, everyone at least has that nagging feeling that they might get a call to action.
I agree that we should be moving to automated infrastructure testing and the like. To some extent it may be possible via Puppet/Chef/automation tools; however, not all infrastructure is like that. Sometimes you have to physically go move stuff during your downtime window, and you can't do redundant wiring (particularly for network). I've been bitten by network outages more than anything else, particularly with partial/undetected failures.
I think we're seeing a move toward that "treat infrastructure as code" future, such as clustered fileservers (NetApp 8 cluster-mode, or Isilon systems). You'll be able to "seamlessly" migrate data and virtual interfaces around without impacting production. I'm looking forward to seeing how that changes ops.
I love doing major upgrades over Xmas/NY, Easter, and Labor Day. Lower traffic, and much better chances of fixing things if they go wrong.
I've always found that if you compensate people for it, it's not a problem. I personally would prefer 2-3 weeks extra vacation some other time in exchange for working through the holidays. It's way cheaper to travel in Jan/Feb, too.
It depends on your team, though -- if it's a bunch of people with kids who have school holidays at certain times, it might be more of an issue.
It's not just the employees - if you provide a service that your customers depend on to provide their service, they're not going to appreciate the hassle on a holiday especially when it's not at all their fault. This isn't true so much for GitHub, but is for infrastructure providers like AWS.
Tell that to Rackspace. I'm not sure how many were affected on Christmas Day when they decided to update a router, causing a 3-1/2 hour outage from approximately 8:30 to 11:00 EST. It took out part of my infrastructure and made for a nice Christmas morning surprise. It affected the ORD datacenter; the only thing I could find on the Rackspace status site is https://status.rackspace.com/index/viewincidents?group=2&.... And of course the ticket in my account.
A lot of people replying to this are talking about how holidays are their low period. I work in mobile; the Christmas holiday is the busiest time of the year for us because that's when everyone gets their new devices and downloads apps and has all day to play with them.
GitHub's a service provider, and I would be surprised if none of their customers had peak traffic over Christmas for this or a similar reason.
In any case, I agree with nixgeek; you shouldn't ever expect outages when upgrading infrastructure. On the other hand, you also have to measure the pain of fixing an outage, and that's likely to be higher during the holidays because of availability.
Stopping coders from deploying stuff due to a risk of an unspecified "something" going bad makes for frustrated employees. By that time you might as well give them the days off as they'll have little to strive toward. You don't need to freeze all code deployments and other things that have little risk.
Also, they most likely scheduled this at this time due to the lower traffic, probably the lowest of the year for them. While half of Github was probably enjoying their families the other half was planning for this for a long time. Given the size of the operation I don't think anyone took it lightly.
One month of no deploys seems pretty high for an organization that does upwards of a hundred deploys a day. 2 days seems far more reasonable for the organizational philosophy they're going for.
Yep, it is a long time, which is why I meant to speak specifically about the infrastructure side of things. Modifying the agg layer providing connectivity for all prod systems during a holiday weekend would suck.
When I say prod, I mean prod infrastructure. Not code.
You can't predict what you can't predict. This is what makes the Chaos Monkey experiment so interesting. And yes, HA is hard, and with many things it is hard with respect to latency.
The more latency you can tolerate, the easier HA becomes. At an extreme, if you can tolerate one minute of latency, then each request can come in and compute the most likely way to complete it with the most authoritative set of actors. Few people, though, are willing to tolerate a commit taking 15 minutes, much less a couple of hours.
This was one of the most insightful things about NFS and the whole stateless design. By burdening the client with the state the server could be much simpler.
Once you get above a certain size the problems change becoming both easier and harder (easier in that you can disperse your data further, harder in that your confidence in agreement between copies takes longer to compute and thus increases latency). It would be interesting if Google shared their work on Spanner (they seemed to have tackled this problem at a large scale) and given Netflix's experience (Chaos Monkey's dad) it seems like Amazon still hasn't quite gotten the recipe right.
It is a deliciously thorny problem with subtle complexity and unexpected inter-dependencies.
In general, automated failover seems to make most small problems non-problems, but it turns some small problems into big problems. What makes sense for your app probably depends on the actual numbers.
For some systems, I'd take getting rid of small outages -- I'll happily take an increased risk of a projected 15 minute loss of heart function becoming a >60 minute loss of heart function if it also eliminates what would otherwise be a bunch of 5 minute losses of heart function, since even the 5 minute interruptions would be fatal.
(Or, for a better example, revolvers vs. semi-autos. A revolver generally is more reliable, but if it goes out of timing, it's basically doomed, whereas a semi-auto can jam or pieces can break, but a monkey can clear, and a trained monkey can fix.)
Failover is meant to deal with hardware failures, and for those it tends to work just fine. But if the node you are failing over onto is already at 60% capacity and you add another 60% of load during the failover, things are going to get worse.
The top-level systems probably need to be able to deal with increased latency or timeouts, and properly handle retries and throttling of traffic.
If you have some HA failover setup going but your alternate is already being used more for load balancing than for failover, problems like this will occur.
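To make the "retries and throttling" point concrete, here's a rough sketch (Python, illustrative only, not how GitHub does it) of a client-side retry with capped exponential backoff and jitter, which keeps a degraded or failing-over node from being buried under a retry storm:

    import random
    import time

    def call_with_backoff(request_fn, max_attempts=5, base_delay=0.2, max_delay=10.0):
        # Retry a flaky call, waiting longer after each failure instead of
        # hammering a node that may already be absorbing failover load.
        for attempt in range(max_attempts):
            try:
                return request_fn()
            except (TimeoutError, ConnectionError):
                if attempt == max_attempts - 1:
                    raise
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))  # full jitter spreads retries out

    # Hypothetical usage: wrap whatever RPC the frontend makes to a fileserver.
    # call_with_backoff(lambda: fetch_ref("my/repo", "refs/heads/master"))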
GitHub's failover problems have never been load-related. GitHub has pairs of fileservers where one is the master and the other's sole job is to follow along with the master and take over if it thinks the master is down, so when they do failover, it is to a node with just as much capacity as the previous master.
All the failover problems I can think of since they moved to this architecture four years ago have been coordination problems, where something undesired happened when transitioning from one member of a pair to the other. In this case, network problems led them to a state where both members of a pair thought they were the master.
I've never seen a redundant PSU be worse than a single PSU, actually (like the dual-line-cord modules on many servers or network devices). PSU failures have gone way down in the past 15 years or so that I've been observing them. The only power supplies I routinely see dying are external transformers on low end network devices and on systems exposed to really dirty power.
I have seen facility-scale UPSes go bad, and sometimes in weird ways, but an order of magnitude less frequently than grid power.
I think we reached a crossover point where designing for facility-scale survivability vs. replicating facilities ceased to be worthwhile for most Internet applications sometime in the past 10 years. It doesn't really make sense to drop $2b on a ~100k sf datacenter like AboveNet used to do for e.g. 365 Main. There are still some systems where replication is a pain, but even for those, I think metro area replication shouldn't be that hard. Even just running 5km of fiber in a loop between a few buildings in the same town gets you a huge amount of resilience against most facility problems.
We used Heartbeat in a similar setup back in 2001. It was the worst architectural decision we ever made, and after one too many failures (where STONITH/split-brain/etc. killed the wrong machine, or both machines) we threw it out.
I was brought up to only use STONITH via serial or another non-switched network connection. Running it over the same network is bound to cause problems. But it's not a great solution anyway.
We used to STONITH (with an APC network power switch) over serial. It didn't really help because serial ports have really bad quality control, and lose a lot of 'packets'. :)
We ended up with dedicated serial cards and multiple network links just so STONITH worked properly. And even then, Heartbeat was buggy as hell back then, so we'd end up in an active-active situation way too many times, and we'd end up with FS corruption.
We were running Reiser on DRBD (ahh, the good old days). Had to hack the kernel a fair bit to make it all work.
Anyway, in the end, we just abandoned the automated Heartbeat failover stuff, and just used it to alert us. We manually ran the failover scripts when a human determined that there was in fact a real failure.
Agreed, completely, but the world isn't as kind. Providers often have funny rules about how you can cable things up in their datacenters, and as noted in item #4 of where GitHub goes from here, that needs to be addressed.
I don't know how good Heartbeat is these days. I'm just going based on our experience back then and GitHub's stated experience now.
We (in a new place) use our own fail over code now, but we're running databases instead of file servers -- and you usually don't want to STONITH a database.
For file servers we use Gluster, and it works great for us (I'm sure there are reasons GitHub is doing active-passive DRBD), and you don't need STONITH since it's shared-nothing.
There's still plenty of things to keep in mind with Gluster, though. If you use a replicated setup for redundancy, and you get a split network, you can very well end up with an inconsistent state because you can get into situations where you write to different replicas from different clients and replication doesn't succeed.
Gluster will just throw its hands up during self-heal if that happens, and you'll need to manually resolve it.
Your filesystem structure won't break, but your files certainly can.
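As an illustration of what "your files can break" means in practice, here's a rough sketch (Python; the brick paths are hypothetical and this is not Gluster's own tooling) of spotting replicas that have silently diverged so they can be resolved by hand:

    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_divergent(brick_a: Path, brick_b: Path):
        # Yield files whose contents differ between two replica bricks.
        for file_a in brick_a.rglob("*"):
            if not file_a.is_file():
                continue
            rel = file_a.relative_to(brick_a)
            file_b = brick_b / rel
            if file_b.is_file() and sha256_of(file_a) != sha256_of(file_b):
                yield rel

    # Hypothetical brick paths, e.g. mounted from the two replica servers.
    for rel in find_divergent(Path("/bricks/replica-a"), Path("/bricks/replica-b")):
        print("needs manual resolution:", rel)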
I'd say it's more an implementation issue than a software issue, and one that was solved even 10+ years ago with Heartbeat.
Our heartbeat links are segregated from the public network with separate network cards/switches, so the specific issue GitHub hit here isn't one we would have encountered. We have quad NIC cards in the systems we run for a reason; this issue GitHub hit is one of the various reasons you don't run heartbeat over the public topology. It will bite you in the ass no matter what cluster software you are running.
Unless you also have a disk heartbeat over shared fibre/SCSI, or maybe serial, but same difference. Depending on the public network, though, is a lost cause.
I just realized that the Sys. Admin/Prod. Ops to Developer ratio here is crazy low. Everyone assumes I am talking about code changes when the article is about prod switching and network transit device changes.
MLAG or any LAG technology (LACP, bonds, whatever) should never impact the deployment of code. It should be invisible when it is working. Obviously it is very visible when it breaks, though.
I get the impression the issue wasn't the network hardware but bad high-availability design on the fileserver side. Why does GitHub have its failover network on the same network hardware as the primary network? I'm not surprised, though, as I see this at a lot of clients: they put the failover network on a separate VLAN on the same network hardware, and whenever they have a network hardware issue, the servers run into split-brain problems.
The failover network should be totally different, physically and logically, from the primary network. The heartbeat between file servers should be checked through both the primary network and the failover network. If a server can't be reached by its partner over the primary network, it should be gracefully taken offline by the partner through the failover network.
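A minimal sketch of that rule (Python; the addresses and the TCP-port probe are hypothetical stand-ins for a real heartbeat protocol), where a takeover is only even considered when the partner is unreachable on both paths:

    import socket

    def reachable(host: str, port: int = 22, timeout: float = 2.0) -> bool:
        # Crude liveness probe: can we open a TCP connection to the peer?
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def partner_is_dead(primary_addr: str, failover_addr: str) -> bool:
        # Only declare the partner dead if it answers on NEITHER network.
        # If the primary network is down but the dedicated heartbeat link
        # still responds, promoting yourself would create a split brain.
        return not reachable(primary_addr) and not reachable(failover_addr)

    # Hypothetical addresses: one on the production network, one on a
    # physically separate heartbeat segment.
    if partner_is_dead("10.0.0.12", "192.168.100.12"):
        print("partner gone on both paths: fence it, then take over")
    else:
        print("partner still answering somewhere: do not promote")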
High Availability strikes again. No surprise there.
But I am mildly flabbergasted by the fact that GitHub uses STONITH. The technology is about as safe as an open-core nuclear reactor, and it works reliably only in very simple conditions.
STONITH is designed for critical failures so you don't end up with a split-brain situation, which is far worse than a dead node. STONITH is a good thing. The problem here is more that the cluster wasn't configured to survive a catastrophic switching failure.
GitHub, as a sysadmin/systems engineer I feel your pain. I understand it completely and know the horrors of failures leading into multi-hour recovery. That said, you need to do better. This is a heavily relied-upon resource for the open source community. Perhaps you didn't estimate this sort of growth, but now that you are here, I'm sorry, but the weight does fall on your shoulders. I heavily commend you guys on the service you've provided thus far and what you've done to pull all the varying language/library communities together. I honestly want to see this scale and serve five nines year round. You need to take a good hard look at the architecture of the stack and find a way to get it multi-homed.
Please don't take this personally, but I'd like to propose a new rule of etiquette on HN: it's unacceptably lazy to say that some business "needs to do better" without explaining exactly what they should do better, or how.
GitHub is doing better by proactively upgrading their network. During the upgrade, they have run in to some technical difficulties. There are maybe, what, a thousand people in the U.S. that have worked on network administration for something GitHub's size? And far fewer who could have predicted this kind of trouble?
If we're going to shit on someone, we should at least make it nutrient-rich: include recommendations based on experience from dealing with problems of that nature and magnitude.
Actually, you are right, it was a bit lazy. OK, so the first question is: are they multi-homed (as in multiple datacenters)? If not, why not? That should be the first step. We could talk about all the ways you can build redundancy and automated recovery into a cluster of machines in a single datacenter, but the obvious point is that the datacenter itself is a single point of failure. Rackspace is an extremely solid hosting provider (we used them at my old company for five years), but even they can suffer fiber cuts, power cuts, UPSes dying, generators not working, or someone performing network maintenance and crapping out the routing for the entire DC.
My first recommendation would be setting up a warm standby, or even a hot standby if possible, in another datacenter. It all comes down to state, meaning the data/repos. They are currently using DRBD for block-level replication of disks in their file servers; this can be done over WAN using something like DRBD Proxy. Another option is to write something in-house that replicates write requests to git repos hosted elsewhere. They wrote a collection of services to access git repos on the file servers via RPC, so it's entirely plausible to extend this. If they are using MySQL anywhere, that's a pretty simple task with master-slave replication or semi-sync replication in MySQL 5.5. Redis also supports replication.
I'm sort of hinting at high availability with read/write, but starting out with a read-only solution hosted in another datacenter, on the order of 60-120 seconds out of sync, is not impossible. I think we would all benefit from that kind of continuity, right?
Replicate your data to a secondary site, keep it in read-only mode, and set low TTLs in DNS. Take it a step further and start looking at globally distributed file systems if you are feeling adventurous.
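A rough sketch of the traffic-steering half of that (Python; the health-check URL and the DNS update call are hypothetical), flipping to the read-only standby only after several consecutive failed checks:

    import time
    import urllib.request

    PRIMARY_HEALTH = "https://primary.example.com/healthz"   # hypothetical endpoint
    FAIL_THRESHOLD = 3   # consecutive failures before flipping

    def healthy(url: str, timeout: float = 3.0) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    failures = 0
    while True:
        failures = 0 if healthy(PRIMARY_HEALTH) else failures + 1
        if failures >= FAIL_THRESHOLD:
            # With a low DNS TTL, clients converge on the standby quickly.
            # update_dns("www.example.com", "standby-ip")  # hypothetical provider call
            print("pointing traffic at the read-only standby")
        else:
            print("primary healthy, serving read/write as usual")
        time.sleep(30)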
That wouldn't have really helped them here though?
If the closely connected storage nodes were unable to detect whether state was consistent, how would the storage node in another data center have known?
I guess what you're saying is that on an error the site becomes entirely read-only, served from another data center. That could be of some use, but I would imagine it would require a complete re-architecture to separate out all writes, plus a method of recombining stuff like logging that will still be written later on.
At the git level, though, it would be useful for a major use case of GitHub, which is grabbing code.
If it's a fileserver pair that goes down, and it serves a subset of repos, then what do you do? Well, you could keep serving all the other repos and start a recovery process, whatever that may be. Perhaps redirect the requests for those repos to another datacenter where there's another copy, or make the decision to fail the entire site over from the primary to the secondary datacenter.
My point was really from the standpoint of a system- and network-wide outage from which it takes multiple hours to recover. It's about business continuity. I've been making comments from the standpoint of a user and the open source community, but what about from the other side? I guess it's a subscription-like service for paying customers, so it has no immediate consequence on revenue, but in places I've worked, downtime is money lost, and customers would actually require us to have SLAs and disaster recovery procedures in place before they would sign contracts with us. Anyway, yes, the step from single server to multi-server to multiple datacenters does normally involve some re-architecting at the state layer; everything else is load balancing.
I understand they have a lot of nerd/hacker cred in and around the tech hotspots of the country, but GitHub is by no measure "big". The most recent number I could find puts them at 33 servers total. I've worked for non-tech companies that have that many as hot standbys.
One big recommendation I'd offer is to have a secondary network with as few moving pieces as possible (think a dumb unmanaged Netgear switch or two) to run things like heartbeats and DRBD over. That sort of thing should not be living on the same segment as your production traffic.
All of the major vendors' MLAG implementations are limited to two agg switches. Assuming you have a pair of 24-port agg switches with a 48-port rack-level switch hanging off each port, that still puts the total number of servers at less than 1000.
By that simplistic logic, an Arista 7500-series [1] would give you 384 ports of aggregation (per device), with 48-port rack switches hung off it, for something like 8000+ servers.
What stops someone from having multiple zones within a datacenter interconnected with suitably large LAGs thereby avoiding being limited by the port capacity of any one device?
Why is the exact number of servers important to the topic, or is it just to satisfy your curiosity? :)
I'm specifically addressing the original parent's straw man.
> There are maybe, what, a thousand people in the U.S. that have worked on network administration for something GitHub's size?
I didn't intend to deflate any nerd egos, just putting things in realistic terms. GitHub is cool. It's the biggest in its industry. It's definitely an interesting problem. But it's still hosted within Rackspace, which in itself isn't that big.
Why does it matter that AWS is 99.95%? "Cloud" services like AWS and App Engine have more relaxed SLAs. GitHub is not a cloud service; it's a centralized version control repository. And the four nines? Well, that's just something I've always strived for when providing a service, but I do believe we should all hold ourselves to it.
When seen in that light, lots of nonessential services seem equally unimportant.
What's really the impact of Facebook downtime? You have to wait a little longer to "poke" your crush's profile/look at their picture/etc? Give them a break.
The other comment was closer to the heart of it: if those projects used Github's paid hosting maybe they'd have a reason to complain.
Part of the beauty of git is that it would be absolutely _trivial_ to have a secondary "backup URL" located somewhere else if Github were down. And git would automatically handle the rest.
tl;dr: It's really hard to get high availability systems right, and we still run the entire service as a single colo.
I can totally understand this kind of thing going wrong, but particularly given the service they provide, why not have a second colo, with a relatively recent clone of the repo, that you can route people to? Heck, you can likely even do an automatic merge once the other repo is working again...
Relatively recent clone? Sounds like that would screw customers up pretty bad if they don't realise the problem.
If they went to a second site, having synchronous commit to both sites is how it should be done, no? The extra latency on infrequent git pushes is far less an inconvenience than the possibility of grabbing the wrong code.
> Sounds like that would screw customers up pretty bad if they don't realise the problem.
I think there'd be a variety of ways to have the system fail until the customer made some kind of adjustment that indicated they grokked that there was a failure (like, say, changing your upstream).
I have absolutely ZERO knowledge of enterprise networking. But it strikes me that something as dumb as routers, switches, and other network equipment is still so unstable.
A hypothetical question: why not use something like an 8-core, 64-bit ARM Linux computer as a switch, making more of the logic reside in software instead?
There is a move to put more of the switching and routing logic in software (google for Open vSwitch if you're interested), but part of the problem is that general-purpose hardware doesn't stand a hope in hell of keeping up with interesting network data rates. You absolutely need to be doing a fair portion of the work in hardware; the CPUs just coordinate it.
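Some back-of-envelope arithmetic (Python) on why:

    # Worst case: minimum 64-byte Ethernet frames plus 20 bytes of
    # preamble + inter-frame gap on the wire.
    def worst_case_pps(link_gbps: float, frame_bytes: int = 64, overhead_bytes: int = 20) -> float:
        bits_per_frame = (frame_bytes + overhead_bytes) * 8
        return link_gbps * 1e9 / bits_per_frame

    per_port = worst_case_pps(10)
    print(f"{per_port / 1e6:.2f} Mpps per 10GbE port")        # ~14.88 Mpps
    print(f"{48 * per_port / 1e6:.0f} Mpps across 48 ports")  # ~714 Mpps

    # Even a 3 GHz, 8-core CPU has only ~24e9 cycles/s to spend, i.e. a few
    # dozen cycles per packet at that rate, so the heavy lifting stays in
    # switching ASICs and the CPUs run the control plane.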
It's about time we heard from them. I understand the timing of this was unfortunate (right before a holiday), but the trust alluded to in the conclusion of that post would be bolstered by faster post-mortems on major outages like this one.
I disagree. Other than satisfying curiosity, what's the value in a faster post-mortem? Especially considering that a faster post-mortem would likely be a less accurate and less complete one. What is the difference in actionability for anyone between an update in 2 days versus an update in 4 days? I see none.
To me, a post-mortem doesn't just satisfy curiosity, it eases fears that the problem will return and informs me of future plans which may help prevent the problem, or may bring it back. It helps me form my own plan, since I'm a user of Github.
I for one spent my break watching my email like a hawk, in case further GitHub outages caused any of my automated deploy scripts to fail. I realize it's my own responsibility to write scripts that handle failure scenarios, but the fact is my company pays GitHub to host our repositories. Downtime happens; I'm understanding of that. But when it does, I want to know what's going on as soon as that information's available, especially when I'm on holiday.
I don't think a blog post written after they resolved all the issues would have been less accurate. It just would have inconvenienced whoever wrote it on a holiday.
Not the biggest deal in the world--it would take me a lot more than this slip-up to switch from Github. But IMO a service provider should get at least some information out faster than this.
I appreciate your point of view but respectfully disagree. The post-mortem would have absolutely been less accurate if we had delivered it sooner since we did not have the details about why the MLAG failover did not happen as expected until late in the evening on Christmas Eve. We've worked as quickly as possible since then to provide a full post-mortem.
I'd consider yourself lucky when it comes to GitHub post-mortems. They seem to respond very quickly to large events, compared to AWS write-ups, which can come weeks after the event, once all the data has been gathered and a summary written.
I don't believe any smart company would put out a write-up as quickly as possible if they weren't sure of the facts; spreading unverified information doesn't help anyone. Plus, developers need to be focused on fixing the actual problem and root causes first; pretty write-ups come second.
For up-to-the-second info, the status page should be more than enough.
Better to take a few days to gather evidence and really understand the problem than to rush out a post-mortem that is inaccurate or incorrect, leaving the impression that downtime is not taken seriously or investigated thoroughly. IMO.
Who's their network switch vendor? I'm not a networking expert, but boy - it sure seems like their switch vendor has screwed some things up royally. Or perhaps this is common with all complex network topologies regardless of hardware vendor?
EDIT: I would just like to say, along with others, I greatly enjoy their postmortems and I feel as though I learn something every time. Kudos to them for being forthright. I host my personal and professional projects with them and am supremely confident that my data is as safe with them as it is with anyone.
This is likely the reason the large cloud companies all do testing that actively causes outages. I'm over-simplifying on purpose here: this is something that requires a lot of thought.
At first glance that seems foolish, but to quote you, "complex network topologies" are very prone to falling over badly. Since they all seem to be a one-off custom setup these days, how can you be sure it won't fall over?
Google does it. They have a team that goes around unplugging network cables and monitoring how fast the engineers can find and fix the problem. I can't dig it up, but it was only a few months ago. Hey, Google, your search engine can't find an article about you. :)
It doesn't sound to me like the switch upgrade was the problem here. Yes, it was the trigger, but the massive failure was caused by the failover setup.
The lesson seems to be that if you're going to have automated failover, it must go in one direction only. That way you're guaranteed never to get flaps or Mexican standoffs, and determining which node was active becomes a (relatively) simple question of "did a failover happen?"
I'm sure there's a downside to this, but I can't see one that outweighs the gains from having much simpler failure modes.
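A tiny sketch of what one-directional failover might look like as a state machine (Python, purely illustrative): once a failover happens the state is sticky, so flapping is structurally impossible and "which node is active?" is answered by a single bit.

    from enum import Enum, auto

    class State(Enum):
        ON_PRIMARY = auto()
        FAILED_OVER = auto()   # sticky until an operator resets it

    class OneWayFailover:
        def __init__(self):
            self.state = State.ON_PRIMARY

        def primary_unhealthy(self):
            # The only automatic transition: primary -> standby.
            if self.state is State.ON_PRIMARY:
                self.state = State.FAILED_OVER
                print("fencing old primary, promoting standby")

        def primary_recovered(self):
            # Deliberately a no-op: failing back is a manual, planned step.
            print("primary looks healthy again, but auto fail-back is disabled")

    fo = OneWayFailover()
    fo.primary_unhealthy()
    fo.primary_recovered()
    print("active node is the standby:", fo.state is State.FAILED_OVER)

The obvious downside is that failing back requires a human, but that is arguably the point.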
(oh, I see you are with GitHub. Somehow I suspect you did not push for this downgrade.)
(Follow-up to sounds: I'm not particularly pro-Cisco, except that all-Cisco lets you avoid interop problems. Since at least some of this is Cisco, there's a good argument for all Cisco.
I've actually used HP and Dell successfully for certain things, but only because I kept it as simple as possible.
In the past, using other stuff was necessary at the high end too, because Cisco didn't provide 10GE very well, etc.
What I am against is cheaping out on your ~few aggregation switches, which I've seen other people do, when you're already spending good money on everything else. Maybe it's different once you get to Google or Facebook scale, but I've never been responsible for quite that many switches, and an extra $500-$1000 per 24 ports isn't going to kill you.
I actually prefer non-Cisco substantially for routing and for security products.)
Please don't take this the wrong way – can you explain some of your experiences with Cisco?
I sense you're pretty loyal to Cisco. I've had the exact opposite experience, getting rock solid uptime and no-compromises performance from commodity hardware. My experience with Cisco is typified by the IOS command line: hard to use, even harder to hire someone who makes it easy, and every minute you feel the pull on your wallet. The one place Cisco seems to shine is at the boardroom table when negotiating contract renewals. :)
Not the person you replied to, but in my experience the other place Cisco shines (relatively speaking) is in hellish environments.
This isn't so much applicable to datacenter usage, but serving some of our satellite offices requires placing equipment in places where I'd rather not have any hardware; in these situations Cisco has been much more reliable than the HP switches we migrated away from.
It's also way easier to find people with experience using Cisco equipment in "moderately complex" environments than HP/Dell/Arista/Juniper/Force10/Foundry/Extreme/etc.
I've seen some really wonky behavior in Cisco equipment even in a really small ISP network, but I have no idea if this is normal or not. Maybe one of the very few people that's responsible for a network of GitHub's size could chime in; I'd be interested too.
Adapting a legacy network to deal with growing pains is hard. Anyone who says different is probably lying, or hasn't administered a network outside of 'bedroom scale' previously.
Whilst everyone prefers to work with a blank canvas that is a rare opportunity, and one which still has constraints like budget limitations or "We need this by Wednesday!".
All vendors have their foibles. No vendor is perfect. Datasheets are rarely 100% accurate, behaviour fluctuates based on what code train/version you're running on devices. Network vendors are like everyone else - humans - so just like everyone else they break things occasionally too.
One of the best ways to mitigate this as demonstrated by some of the largest players is to have N+1 physical sites and to be able to control traffic distribution to each of them, including taking one entirely offline without your customers noticing that it's happened. It's also extremely hard to do when you start looking at how CAP theorem applies to your given application, and not just the infrastructure changes required to network it all together, but the application changes required to not have it just explode on you.
Agreed that technical debt is a pain to deal with. But it's definitely possible to migrate and upgrade live systems without significant technical risk. The company just needs to decide to invest the proper resources and keep executive-level focus on operations. From your comments it sounds like this was a failure of management to properly plan out these changes. Lack of resources shouldn't be an issue here, since clearly GitHub's $100 million+ in funding is enough to build a reliable HA system with adequate redundancies and hire the people who are qualified to run it. There just isn't anyone else to point fingers at, including vendor screw-ups; all of those items would be addressed in a proper risk assessment plan.
That was my question too. My guess is that it's Arista, based on the MLAG, ISSU, and 'agent' references. There's a chance it could be Cisco or Juniper, but I don't think so. The author of their post-mortems does a very good job of removing any vendor-specific language that would give it away.
I think they should be more forward about naming and shaming, but I understand why they'd rather not. Personally I have horrible experience with Nortel/Avaya and would recommend anyone against using their equipment for core switching. The least worst, again in my experience, seem to be Cisco and Juniper.