October 21 post-incident analysis (blog.github.com)
464 points by pietroalbini 3 months ago | 114 comments

The timeline of events was interesting (and much appreciated), but the root cause analysis doesn't really go much deeper than "we had a brief network partition, and our systems weren't designed to cope with it", which still leaves a whole lot of question marks.

Of course, without detailed knowledge of how GitHub's internals work, all we can do is speculate. But just based on what was explained in this blog post, it sounds like they're replicating database updates asynchronously, without waiting for the updates to be acknowledged by slaves before the master allows them to commit. Which means the data on slaves is always slightly out-of-date, and becomes more out-of-date when the slaves are partitioned from the master. Which means that promoting a slave to master will by definition lose some committed writes.

If "guarding the confidentiality and integrity of user data is GitHub’s highest priority", then why would they build and deploy an automated failover system whose purpose is to preserve availability at the cost of consistency? And why were they apparently caught off-guard when it operated as designed?

(Reading point 1 under "technical initiatives", it seems that they consider intra-DC failover to be "safe", and cross-DC failover to be "unsafe". But the exact same failure mode is present in both cases; the only difference is the length of the time during which in-flight writes can be lost.)

I believe you're right. They go into their design in their MySQL HA post:


> In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. It provides a way to achieve lossless failovers: any change applied on the master is either applied or waiting to be applied on one of the replicas.

They only require one other replica (as opposed to a quorum) to be reachable from the master for the master to continue acknowledging writes. If a new master has been elected on the other side of the partition, both will continue acknowledging writes.
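To make the failure mode concrete, here is a hypothetical sketch (not GitHub's actual code) of semi-synchronous commit with a single-ack requirement. During a partition, each side still has at least one reachable local replica, so both masters keep acknowledging writes and their logs diverge:

```python
# Hypothetical model of semi-sync commit requiring only ONE replica ack.
# With a master plus a local replica on each side of a partition, both
# sides keep committing -- the split-brain discussed above.

class Node:
    def __init__(self):
        self.log = []

    def append(self, txn):
        self.log.append(txn)

def semi_sync_commit(master, reachable_replicas, txn):
    """Commit succeeds if at least one replica acknowledges the change."""
    for replica in reachable_replicas:
        replica.append(txn)
        master.append(txn)
        return True   # one ack is enough -- no quorum required
    return False      # no replica reachable: writes stall

# East side of the partition: old master + one local replica.
east_master, east_replica = Node(), Node()
# West side: freshly promoted master + one local replica.
west_master, west_replica = Node(), Node()

assert semi_sync_commit(east_master, [east_replica], "txn-e1")  # east accepts
assert semi_sync_commit(west_master, [west_replica], "txn-w1")  # west accepts
# Both sides acknowledged writes the other never saw.
assert east_master.log != west_master.log
```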

They noted this as a limitation that they were working on (but unfortunately a bit too late in hindsight):

> Notably, on a data center isolation scenario, and assuming a master is in the isolated DC, apps in that DC are still able to write to the master. This may result in state inconsistency once network is brought back up. We are working to mitigate this split-brain by implementing a reliable STONITH from within the very isolated DC. As before, some time will pass before bringing down the master, and there could be a short period of split-brain. The operational cost of avoiding split-brains altogether is very high.

I'm by no means an expert on any of this stuff, but:

> They only require one other replica (as opposed to a quorum) to be reachable from the master for the master to continue acknowledging writes (from your comment)

> Orchestrator considers a number of variables during this process and is built on top of Raft for consensus... (from the article)

Doesn't quite make sense to me. Doesn't raft require the master to wait for a quorum before committing writes? I understood it as a pretty important aspect.

> A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers (from the raft paper [1])

I understood that the highlighted claim ("every committed entry must be present in at least one of those servers") is only true if commits are acknowledged by the quorum

Edit: I'm not implying your post is incorrect, just trying to understand how the two fit together

[1]: https://raft.github.io/raft.pdf

> Doesn't quite make sense to me. Doesn't raft require the master to wait for a quorum before committing writes? I understood it as a pretty important aspect.

Different kinds of masters. I've forced myself to always qualify a term with its application whenever terminology gets reused across applications in a discussion. As far as I understand it:

- The mysql-master only commits an sql-level-transaction if a mysql-replica acknowledges replication of the transaction. This avoids data loss during a failover.

- The orchestrator-nodes elect an orchestrator-master via raft mechanics.

- After this, the orchestrator-master starts pondering about the mysql-master / mysql-replica situation and potentially promotes the most current mysql-replica to mysql-master.

- This promotion requires writes on an orchestrator/raft level, which in turn requires the orchestrator-master to get an acknowledgement from the orchestrator-quorum.
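A toy sketch of the two layers described above (hypothetical numbers; a 3-node orchestrator cluster is assumed). The point is that the Raft quorum protects orchestrator decisions, while MySQL data commits need only a single replica ack, so the two guarantees don't compose:

```python
# Hypothetical sketch: raft quorum governs orchestrator *decisions*,
# while mysql data commits are semi-sync with a single-ack requirement.

ORCHESTRATOR_NODES = 3                       # assumed raft cluster size
RAFT_QUORUM = ORCHESTRATOR_NODES // 2 + 1    # majority = 2 of 3

def orchestrator_promote(orchestrator_acks):
    """Promotion is a raft-level write: needs a majority of orchestrator nodes."""
    return orchestrator_acks >= RAFT_QUORUM

def mysql_commit(replica_acks):
    """A data commit is semi-sync: one replica ack suffices."""
    return replica_acks >= 1

assert orchestrator_promote(2)   # 2/3 orchestrator nodes agree: promotion allowed
assert mysql_commit(1)           # 1 replica ack: transaction committed
# Note: the raft majority says nothing about whether the promoted
# mysql-replica actually holds every acknowledged transaction.
```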

This was my first thought, you beat me to it. It seems like they contradict each other. Is it Raft with quorum or is it a single replica node that is caught up with master? You can't have both (unless you only have three nodes).

> unless you only have three nodes

Interestingly enough this seems to be how managed sql solutions do it. You just have one primary and one failover replica in another zone. So a quorum write is just a synchronous write to both (2 out of 2). You don't have the split-brain problem because the original primary can't make progress if it can't contact the replica.
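A minimal sketch of that 2-of-2 arrangement (hypothetical, not any particular vendor's implementation): because the write quorum equals the full membership, a partitioned primary cannot make progress on its own, so there is nothing to diverge:

```python
# Hypothetical 2-of-2 setup: one primary, one synchronous failover replica.
# A write is acknowledged only if BOTH nodes apply it, so a partitioned
# primary stalls instead of accepting writes the replica never sees.

def quorum_commit(primary_applied, replica_applied):
    """2-out-of-2 write: every member must acknowledge."""
    return primary_applied and replica_applied

assert quorum_commit(True, True)        # healthy: write acknowledged
assert not quorum_commit(True, False)   # replica unreachable: primary stalls
```

The trade-off is availability: lose either node and writes stop until failover completes.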



Github's problem is that they were trying to be too smart and allow any of their replicas to be candidate masters without increasing their quorum size. In theory this is higher availability than the managed sql solutions (they can be available even if their entire coast gets nuked) but they do it at the cost of consistency in the more common failure scenarios.

I only skimmed that post, and haven't dealt with MySQL in years, but... that bit has another wiggle with writes committed to one of many replicas. I'm presuming that's any one of many. Which means I can now have writes committed not just in two places (terrible), but in N number of places (zomg). Recovering that (linearly) would tend towards impossible without something like a vector clock. But if you had a reliable clock of changes you probably wouldn't be in that mess at all.

You don't need vector clocks for master/slave replication, even if there's 1 master and N slaves.

For consistency, all of the slaves should be applying the stream of updates to their local replicas in the same order that they happened on the master. That means each slave has seen a prefix of the master's data; its state can be uniquely represented by a position in the replication log. So you can easily determine which replica is most up-to-date by taking the one with the largest position. (If you're doing synchronous replication correctly, at least one replica is guaranteed to be completely up-to-date, even after an unexpected failure.)
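A sketch of that "largest position wins" selection (hypothetical positions; in MySQL this would be a binlog offset or GTID set). Since each replica holds a prefix of the master's log, a single integer fully orders them:

```python
# Pick the most up-to-date replica by replication-log position.
# Each replica's state is a prefix of the master's log, so
# "largest position" == "fewest lost writes" -- no vector clock needed.

replicas = {
    "replica-a": 10_482,   # hypothetical log positions
    "replica-b": 10_497,
    "replica-c": 10_471,
}

best = max(replicas, key=replicas.get)
assert best == "replica-b"   # promote the replica with the largest position
```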

Vector clocks only come into play when you are simultaneously accepting writes on multiple masters, and you need to keep track of which updates have already been applied from each source.

> If you're doing synchronous replication correctly, at least one replica is guaranteed to be completely up-to-date, even after an unexpected failure.

Exactly? Maybe? I'm reading a lot into little information here. It wasn't clear to me if "one or more" was the master + a single specific replica, or something closer to master + 1/N. And then there's the automated optimistic leader election. Combine those with (multiple) partitions and you could have a very bad day.

Rereading their RCA now, I'm guessing the above isn't the case. "Restore the entire datacenter" reads more like they preserved the ~40 seconds of East Coast partition writes, restored everything in the East to a known checkpoint, and resumed replication off the more advanced masters on the West Coast. If it's "just" two masters per cluster, both of which are available and internally consistent, you "just" have to reconcile those two write streams.

Github chose availability over consistency. Bad choice?

I would say yes. The service was available but it was even more chaotic because of all the inconsistencies.

It sounds as if a switch was replaced, so network connectivity was lost for 40 seconds (a partition, if you will). The network recovered, but during those 40 seconds a lot of "things" had to be queued and processed, and things cascaded out of control. Whenever people talk about CAP they seem to spend time thinking about behavior while there is a partition, and that's neat and all, but the really challenging parts come when the partition is healed, and people don't seem to spend enough time thinking about that mode of execution.

Or when there isn’t a clear/complete partition and the network is kinda working but not consistently.

> But just based on what was explained in this blog post, it sounds like they're replicating database updates asynchronously, without waiting for the updates to be acknowledged by slaves before the master allows them to commit. Which means the data on slaves is always slightly out-of-date...

Anecdotally I think I see this regularly. Part of the build pipeline I work on requires recursively downloading a particular folder for a given repo. From time to time, I'll have the latest git ref give me data from the previous git ref. Everywhere else on GitHub the information is up-to-date. But not the API endpoint to download files. When this happens it takes roughly 5-7 minutes for the correct information to show up.

After reading this and all the other resources people have posted I'm definitely going to see if I can reliably reproduce the error.

Why would you use an API endpoint to download files in a build pipeline? Surely you should just use a clone of the repository?

I could believe that an endpoint to download files looks at a cache, and if you are just using 'master' or 'latest' as opposed to a commit, I could believe that would be out of date.

Bit late, but replying to this because my comment does have missing context. Most of the repos we have are large enough that downloading three individual files is faster than cloning the whole repo. Even with a shallow clone with a depth of 1. I've tried the combinations and even tried downloading a zip file of the repo. Still faster to download the three files. And I haven't even added parallel downloads because it's fine as is.

But given the issues, why continue to use the API? Because those files are related to the build pipeline logic and they change maybe once in 3-4 months. I can live with an error once in 3-4 months. Recently I was working on changing some of the logic of the pipeline and that's when I ran into it enough to say that there was something odd from Github's end.

"Which means that promoting a slave to master will by definition lose some committed writes."

Sure. But that is where the application code comes in. You usually have at least two DB connections: one to the slave and one to the master. If you are checking whether an account exists [important], you can ask the master, since it will be in an up-to-date state. If you want to build someone's feed on GitHub, then doing all your SQL through the slave is fine. They also probably have layers of atomic cache on top which handle 99% of the requests [memcache, redis, whatever]. E.g., does the account exist? Check the cache, then check master.

Restating what I think you're suggesting:

Use master for transactional read/writes.

Use replicas (slaves) for readonly.

More or less?

I'm sure that is what they do. But application code does not save you from this scenario. If your master changes, the application does not know that the new master was a node that did not have full replication.

The typical solution is to mix synchronous block-level replication intra-DC with asynchronous replication inter-DC. I believe RDS utilizes DRBD for this intra-region.

This may be why failover is "safe" in certain circumstances.

> it sounds like they're replicating database updates asynchronously, without waiting for the updates to be acknowledged by slaves before the master allows them to commit. Which means the data on slaves is always slightly out-of-date, and becomes more out-of-date when the slaves are partitioned from the master. Which means that promoting a slave to master will by definition lose some committed writes.

Unless you temporarily pause writes to the primary before promoting a slave.

Sure, that's fine if you're talking about planned downtime. You still lose data if there's an unexpected failure, because by the time you realize the primary is gone, it's too late; the writes have already happened and been acknowledged.

> With this incident, we failed you, and we are deeply sorry.

I'm always impressed when people actually apologize instead of dancing around an almost apology.

Overall this is one of the best post mortems I've seen - great tone, very well written, super informative, has all of the steps (apology, information on issue, real steps to prevent it happening again) that hurt customers generally want to see. Really impressive timelines too - 43s reconnect after initial issue, 15 minutes to change status.

Gitlab overall seems to have more incidents like this and I really like their custom of working through them in public Google docs. Definitely seems a better idea for incident communication than relying on your own tech (github pages) for incident communication.

I think this no-fault mode of incident recovery is a bit overdone. In every HN thread, I find people advocating it as some sort of major revelation.

The mindset that problems are caused by failures in process, not people, is probably right. Even if not, I think it has become popular because it describes a world we would like to live in. It reduces anxiety, and might not negatively affect people's diligence, because there tend to be diminishing returns with such negative feelings (c.f. why the fear of starvation does not promote entrepreneurship).

It might, however, be misapplied when used for companies. Gitlab is the best example here: It seems like they actually came out ahead in the famous incident you're referring to, even though that incident was the most breathtaking series of organisational incompetence imaginable.

So it's completely OK to spare the head of the one guy who happened to issue the final 'drop db;'. But let's not congratulate them too often for at least telling us that of their three redundant backup mechanisms, one had never worked, one silently stopped working months earlier, and the last was just erased by a 20-something who mixed up his terminal windows (paraphrasing from memory).

Definitely check out GitLab's post mortem from last year: https://about.gitlab.com/2017/02/10/postmortem-of-database-o...

It's very informative and also dives deep into their "Whys" pertaining to the outage.

Thanks for sharing that. We are glad that you appreciate our transparency which is one of the core values at GitLab.

The Freakonomics podcast had an episode on apologies recently - http://freakonomics.com/podcast/apologies/ - one of the more interesting things discussed was how the start and end of an apology have a dominating impact. It seems people are incredibly sharp at recognising insincerity too. I'd say GitHub did a great job here.

Disclosure: I work on Google Cloud.

I’m a little confused by this part:

> While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service. This procedure is tested daily at minimum, so the recovery time frame was well understood, however until this incident we have never needed to fully rebuild an entire cluster from backup and had instead been able to rely on other strategies such as delayed replicas.

At first, I had assumed this was Glacier (“it took a long time to download”). But the daily retrieval testing suggests it’s likely just regular S3. Multiple TB sounds like less than 10.

So the question becomes “Did GitHub have less than 100 Gbps of peering to AWS?”. I hope that’s an action item if restores were meant to be quick (and likely this will be resolved by migrating to Azure, getting lots of connectivity, etc.).

We’ve been burned by long download times not only when transferring from a remote cloud, but also by IOPS requirements when transferring tons of small files. Now in each DC, we keep a day or two of backups on several servers loaded with enterprise PCIe storage and 40Gbps NICs to reduce that issue. Curious to know the number of terabytes needed (or if it was more of an IOPS issue)?

It sounds to me like the time involved was rebuilding MULTIPLE replicas.

Maybe I overestimated the “significant” part:

> A significant portion of the time was consumed transferring the data from the remote backup service.

I get the time to rebuild part, but I’m curious about the download part.

This was an interesting point, but prefixing it with an appeal to authority was both needless and distracting. Why did you include it?

I didn't read the Google Cloud part as an assertion of authority, just as a disclosure - if they're talking about competitors (and especially how the choice of competitor negatively impacted GitHub) I appreciate it.

(disclosure - I work for a competitor, not on cloud stuff)

I also appreciate it. It's very common for owners/employees to criticize/attack competitors online anonymously. While the GP wasn't attacking, it's just nice to know he works for a competitor.

That wasn’t my intent, quite the opposite actually. I don’t want it to read as “S3 is awfully slow, that’s what the problem was. If they were using GCS then < awesome outcome >”. Knowing that I work at a competing vendor is useful knowledge in this case.

>I don’t want it to read as “S3 is awfully slow, that’s what the problem was. If they were using GCS then < awesome outcome >”.

You don't mention Google at all outside of the opening statement so who would read it that way?

Because if he doesn't disclose it and we find out, it looks terrible and like he's trying to hide something.

If I know a competitor wrote it, I don't read it as negative. To me, it's a welcome courtesy, but as you say, I couldn't assume it if absent; so I'd probably paint a negative picture of S3 with the same albeit unprefaced sentence.

(Plus, I'm always pleased to see someone not call it a 'disclaimer'!)

Disclosure of interest should be common practice across all industry and web discussions. It provides a lot of additional perspective, yet most places do not adhere to it.

Fortunately HN has kept the culture of disclosure intact.

Google cloud is a significant direct competitor to AWS. Anything criticizing AWS reads that way to some extent.

He mentions Google frequently in his posts for context. It's very useful, and there was nothing disparaging that he said.

In my career, the worst outages (longest downtime) I can recall have been due to HA + automatic failover. Everything from early NetApp clustering solutions corrupting the filesystem to cross-country split-brain issues like this.

Admittedly, I don't recall all the incidents where automatic failover minimized downtime, and probably if a human had to intervene in each of those, the cumulative downtime would be more significant.

But boy, it sure doesn't feel like it.

In cases where you can rely on the system to self-repair, automatically moving writes quickly seems reasonable. But otherwise, it seems like you want the system to cope with the situation where writes are unavailable -- clearly a lot of things will be broken, but if writes fail fast, reads are still viable.

Assuming you have that, it's OK to rely on a human to assess the situation, make sure the dead master is really dead, salvage any partially replicated transactions, and crown a new master. With the right tools, it could take only a few minutes -- a bit longer if you have to wait for the old master to boot to see if it had locally committed transactions that didn't make it to the network. If it takes 5 minutes to resolve this (including time to get to the console), you can do this ten times a year and still have three nines.
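Checking the arithmetic in that claim: ten 5-minute manual failovers per year is 50 minutes of downtime, comfortably inside the ~525-minute annual budget that 99.9% ("three nines") availability allows:

```python
# Sanity-check: ten 5-minute incidents per year still fits three nines.

minutes_per_year = 365 * 24 * 60                 # 525,600
downtime = 10 * 5                                # ten incidents, 5 min each
budget = minutes_per_year * 0.001                # ~525.6 min allowed at 99.9%
availability = 1 - downtime / minutes_per_year

assert downtime < budget
assert availability > 0.999
```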

For the more likely case where it's a network blip, the situation resolves itself (in a nice way) by the time the operator gets to the console.

Indeed, given recent history I’d almost suggest it is better to take the site down for a few minutes than let the automatic failover systems put you into a 24-hr degraded service situation.

We're more aware of simple processes that don't work well than of complex ones that work flawlessly. - Marvin Minsky, MIT AI lab co-founder ... via http://github.com/globalcitizen/taoup

I share that sentiment. HA and automated failover seems to be either simple because the application supports it, or really really hard.

Stateless applications are simple. Systems built with this in mind, like Cassandra, Elasticsearch, and Redis+Sentinel, just get it right after two or three settings like minimum quorum sizes.

But if you have systems without this built in, like NFS, MySQL, Postgres? I guess we don't hear about the successful automated failovers, but we surely hear about really messy automated failover attempts.

Even in cases like Cassandra it's far from flawless, mostly due to the massive complexity. Failover works great, compaction works great, schema changes work great, repairs work... but what happens if two of those happen at once? Or all of them? There have been quite a few bugs over the years that involve corner cases when the various coordination and deferred-work systems interact.

I wish we had a public database of outages / postmortems somewhere that anyone could contribute to, so that each of us could move from his own slowly acquired experience to more thorough statistical data. Does such thing already exist?

Reading this post-mortem and their MySQL HA post, this incident deserves a talk titled: "MySQL semi-synchronous replication with automatic inter-DC failover to a DR site: how to turn a 47s outage into a 24h outage requiring manual data fixes"


> In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. [...]

> Consistency comes with a cost: a risk to availability. Should no replica acknowledge receipt of changes, the master will block and writes will stall. Fortunately, there is a timeout configuration, after which the master can revert back to asynchronous replication mode, making writes available again.

> We have set our timeout at a reasonably low value: 500ms. It is more than enough to ship changes from the master to local DC replicas, and typically also to remote DCs.


> The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.

> However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls.

> Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.

Network computing is rough. (I am not being at all sarcastic).

2000x multiplier between connectivity loss and service degradation.

This is rather common; too many HA technologies really aren’t designed for wide-area failover, or anything besides the total and complete failure of a node.

I once saw a ten-second “grey failure” in the network cause an entire SAN cluster to go read-only for a day while it checksummed all replicas.

If you don’t Chaos-monkey test these things in production, they really won’t work when needed. And you can’t just “unplug a box” in your testing, you need to test slowdowns and intermittent errors under production loads.

Few enterprises are willing to risk or invest in that sort of testing. So we all hope for the best and point fingers upstream at vendors when the HA doesn’t work as advertised.

I've suffered more outages at the hands of HA database technologies (that were administered in-house) than such have prevented. There are very few organizations I trust to run them properly.

I’ve always stuck with log shipping to a warm standby wherever possible for the same reason. Well-managed single instance databases can do 3+ nines easily; very few applications actually require more availability than that. Especially at 5x the cost in opex for a theoretical-but-not-really-wink-wink extra 9.

In github's case, of course, they can't do a single instance db for performance reasons, even before you get to HA. If I understand it right.

Depends on your architecture. GitHub as a whole likely has a volume of writes greater than a single server instance can handle, but components could either be split into separate DBs (which seems to be their approach) or shard/partition the data.

Throughput is orthogonal to reliability. You can improve both by creating read replicas, but that doesn't imply automated fail-over.

I find their incident management times pretty impressive. 2 minutes to detect & alert, ~2+ minutes response, 10 minutes to initial triage, 15 minutes to internal change control, 17 minutes to escalation & public communication, 19 minutes to major incident escalation & broad engagement, 73 minutes to remediation start & further public communication.

Initial triage inside 10 minutes then change control, escalation, major incident, and public communication within another 9 minutes. That's hard to beat with humans in the loop.

Most places I've seen have similar times to that, all except the public communication bit.

Lots of companies are wary about letting engineers declare downtime directly - it could have costs down the line giving customers refunds etc.

Getting legal, PR, and 3 levels of management to all agree on a message before it gets published is a recipe for the downtime notice taking hours to be published.

Do you have a NOC or similar dedicated group? Both initial engagement and the escalation path had a ~2 minute response time. Paging an on-call team member, VPN connection, etc. takes ~5-10 minutes in my experience. Similarly, their incident manager apparently assessed the situation and further escalated in ~2 minutes as well. That's pretty dang fast for coming up to speed on an as-yet-undiagnosed major incident.

WRT public communication, yeah. Overly onerous and ad hoc seems to be the norm. But we do have "analysts" publicly using status-post count as inverse evidence of reliability. That creates some real bad incentives for the actual customers.

For a core service, GitHub is sure to have a 5-minute human SLA from dedicated SREs who have an on-call rotation etc. That means someone is already at their computer, logged in, and ready to go. They'll typically be investigating within 30 seconds.

Probably within another 30 seconds that human sees many alerts coming in that don't look like false alarms and hits the 'escalate' button.

If it's within office hours, anything that looks major will usually have a bunch of SREs on the case within 3 or 4 minutes. Usually in various roles, from 'I acknowledged this page, it's my job to fix it!' to 'I'm just laughing at your misfortune, but paying a little attention in case the broken thing turns out to be that change I made yesterday and I have to take over mitigation efforts'.

It's great to see how open they are about what happened and how it got fixed. I also appreciate the decision to prioritise data integrity over site usability too.

Looks like Github could be OK after the Microsoft acquisition.

> I also appreciate the decision to prioritise data integrity over site usability too.

That actually goes against what happened; they seemed to have implemented an automated failover which resulted in a split-brain situation rather than letting one DC completely degrade to read-only.

However mistakes happen and hindsight is 20-20.

> Looks like Github could be OK after the Microsoft acquisition.

Meh, I'm not against the Microsoft acquisition but I think it's still too soon to be making judgment calls like this. I can't imagine they've had much influence over internal processes just yet.

Offtopic, but this reminds me: did google ever issue an incident report on why youtube went down earlier this month? A quick search didn't turn up anything.

I haven't seen anything disclosed. Best I've found are articles from 5+ years ago about previous YouTube outages.

I would've also appreciated some acknowledgement of the fact that one of their official Twitter accounts was tweeting links to a crypto scam during the outage.

It wasn't just youtube - various other google services had trouble too.

I'd guess a bit of internal but not regional google infrastructure, like spanner, had downtime.

There is often a trade-off between a large distributed central store and several independent ones. The primary con of the former is incidents like this one, whereas the con of the latter is the added complexity of separate systems performing eventually consistent aggregation to support centralized features. I wonder if there is any value in GitHub decentralizing the metadata pipeline. So many of the actions are namespaced that this could be reasonable theoretically, at a large practical cost.

On a related note, I often reach for Cassandra when starting projects knowing that building my application around its limited access approach has data replication benefits in the future. For all the flexibility benefits to devs given by SQL/RDBMSs, there are flexibility downsides.

An interesting point. Namespaced data would include anything related to a repo, GitHub Pages domain, or organization, i.e. almost everything. This would therefore all be readily shardable. Non-namespaced data is basically just user credentials and settings. However, looking forward in a post-GDPR/China-data-law world, "network political bloc"-based sharding of user data may be either a regulatory requirement or a worthwhile forward-looking protective measure. (We just had this conversation yesterday at Infinite Food.) One organization where I believe global user-data sharding already happens is Google.

Sounds like a ton of complicated, fragile work done under crunch times and probably lots of stress. Hats off to the teams at GitHub.

If I read things correctly, they made a fairly... interesting... tradeoff: 954 still-as-yet-unreconciled DB writes in exchange for 24 hours of site downtime.

I think I'd have made a different choice, but cool that they were upfront about it.

> we captured the ... writes ... that were not replicated ... For example, one of our busiest clusters had 954 writes

Their wording ("one of") makes it sound like they had up to perhaps one order of magnitude more (1k-10k), but they do not actually give us a useful number, merely say "it wasn't too much" and "one of an unknown number of total clusters had 1k".

> I think I'd have made a different choice

I think you might misunderstand how these things go and the tradeoffs.

At the beginning of the incident, they find themselves with lots of writes in the west-coast master which aren't in the east-coast one, and some in the east-coast one that aren't in the west-coast one.

Orchestrator cannot promote a working master because they have diverged.

Your choices are:

1. Lose data (large unknown amount) by dumping an hour of west-coast data and going back to east-coast data; can be done in maybe 3 hours total by just deleting west-coast and serving all traffic from east-coast as you rebuild the west-coast cluster.

2. Lose data (small amount) by rebuilding east-coast from a backup so it can be promoted to, replicating west-coast data to it, promoting to it (what they did, 24 hours of time), and then trying to manually fix up the small amount of lost data after the fact (ongoing).

3. Develop tooling to automate the reconciliation of data while the site is down such that the east-coast side can be merge-promoted without a rebuild, probably takes at least 3 days to build and might break everything, but if it works it probably merge-promotes in under an hour.

4. Keep the site down until east-coast data is manually reconciled, still requires rebuilding east to promote west to it, but then requires manually handling some lost writes... probably about 30-40 hours total downtime.

5. Update the application servers to work fine when the us-west DC is the master, probably about 2 months of development with the site down for the duration.

Which choice would you have made instead? I'm willing to bet they went with either 2 or 4 (and if 4, changed to 2 when it took longer than expected). They probably assumed it would take about 4 hours, and then it simply took much longer than expected because computers are complicated, it turns out.

The solution I would have gone with (granted, in a very different industry; finance is quite different from developer SaaS) would probably be closest to #4: identify the writes that we're about to lose; get all the account/user ids somehow related to those writes; back up everything that needs to be backed up; and then go ahead and "lose" that data in the live system, bringing the site up while blocking only the involved accounts until data is manually reconciled from backups. Granted, this requires enough separation that you can selectively make some customers' data fully inaccessible so that other writes won't affect it, but that seems reasonable for GitHub, because datasets are mostly isolated/grouped in repositories.

10000 writes can be manually reconciled, been there done that, and the number of users that it affects is small enough that it's not appropriate to keep the site down for everyone for a day if you can rather keep it down for 2 days for a small number of users that you can contact (and compensate!) separately; the downtime for the unaffected users could be significantly reduced that way.

I'd have opted for 1 personally (losing 30 minutes of data) assuming I knew the alternative was 24 hours. I'll grant you they probably figured it would be much faster than that.

Alternatively, why not just let the West-Coast replicate to an East-Coast slave, and switch em. Surely the peering bandwidth between West-Coast/East-Coast is higher than whatever they were doing pulling full backups from the cloud?

> I'd have opted for 1 personally (losing 30 minutes of data) assuming I knew the alternative was 24 hours [of read-only access to prod]

With no offence meant, I'm glad you don't get to make decisions at github then. I'd lose faith in github for that since their primary purpose is to store other people's data (git blobs+metadata, comments, etc). If it was their data, it could be fine, but since it's user data, I don't think losing it is acceptable, even if that means being read-only for an entire week.

> why not just let the West-Coast replicate to an East-Coast slave, and switch em?

At the time of the incident, east-coast has already diverged, so it's not possible to replay logs from west-coast without moving east-coast back in time to some point prior to west-coast (but a point recent enough that west-coast still has full replica logs to replay).

It's typically not possible to simply "rewind" a MySQL database without restoring a backup.

It's not possible in many mysql clustering solutions to catch-up to a peer unless you're already within a recent time window where the replay logs are still available.

As such, the only option is to restore from backup and then catch up, which is what they did.

I assume if taking another backup of west-coast, transferring it to east-coast over the DC link, and restoring it was faster, they would have done it, but I would be totally unsurprised if that was about the same total time.

I'd also say that if you have a practiced procedure (restoring from a scheduled backup in the cloud) vs an ad-hoc procedure (custom backup, custom storage and transfer), the former is probably much safer to do when you're under fire and want to minimize risky moves stressed engineers have to make.
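The replay-log constraint described above can be sketched as a toy containment check, roughly what GTID-based MySQL replication does when deciding whether a replica can catch up from a peer. This is a hedged illustration only: plain Python sets stand in for GTID sets, and the transaction ids are made up.

```python
# Toy model of the GTID containment check that decides whether a replica
# can catch up from a peer purely via replication, or needs a restore.
# GTID sets are modeled here as plain Python sets of transaction ids.

def can_catch_up(replica_executed, peer_executed, peer_purged):
    if not replica_executed <= peer_executed:
        return False  # replica holds writes the peer lacks: diverged
    missing = peer_executed - replica_executed
    if missing & peer_purged:
        return False  # needed events already purged from the peer's binlogs
    return True

# East coast executed a few seconds of writes (txns 5-6 here) that the
# west coast never received, while the west coast went on to ingest ~40
# minutes of its own writes (txns 100-139). Neither side can catch up
# from the other:
east = {1, 2, 3, 4, 5, 6}
west = {1, 2, 3, 4} | set(range(100, 140))
print(can_catch_up(east, west, set()))          # False: east has diverged

# A replica restored from a common backup (txns 1-4) can catch up fine:
print(can_catch_up({1, 2, 3, 4}, west, set()))  # True
```

Which is why restoring east-coast from a backup first, then letting it replicate forward, was the only replication-compatible path.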

We're just speculating how much data would have been lost but one option would be to tell the owner of the data "hey you commented on pull request foo with comment 'baz' but we were unable to save it. Please comment again if necessary."

It's not a great UX but neither is 24 hours of downtime.

The source code is paramount, no argument there. But this wasn't about that. I find the metadata less important, but as you say, I don't make the decisions. The factor you aren't considering is opportunity cost given productivity loss. In your hyperbolic example where you claim you'd rather they were read-only for a week than lose 30 minutes of data? I'd rather get things done during that week personally, and if the price I have to pay is resubmitting a comment or re-approving a PR, I guess I'd pay it.

Also, MSSQL has had quick delta snapshots/restores forever (assuming you have them turned on and running every hour or so). Does MySQL really not have a similar feature?

MySQL does not have any kind of native snapshotting capability, no.

It is becoming somewhat common to run large MySQL deployments on filesystems which do (like ZFS) to get a similar effect, albeit without the ability to restore a snapshot on a host while that database host is online. There is even some experimental work going on in the community to use ZFS snapshots as a state-transfer mechanism for laggy nodes in a Galera MySQL cluster, which seems promising.

The incident lasted 24 hours, but the vast majority of that time was spent in a degraded state, not fully down. It's likely they didn't know the scope of the unreconciled writes until later in the incident, too.

I’m not super versed in MySQL failover, but am I correct to conclude that Orchestrator and Raft also did them in? And isn’t that architectural component's entire reason to exist to prevent exactly such a situation from happening? Genuine question.

Orchestrator and Raft did not help them, but it was mostly their configuration. The setup allowed an automatic failover to the west coast data centre, so when the connectivity failed the tools did their job and voted in the west coast.

The solution applied is to prohibit cross-country failover in Orchestrator, but allow it to continue doing failover within regions.
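For what it's worth, orchestrator exposes a configuration flag for exactly this policy; if I remember its docs right, it looks something like the following (key name per orchestrator's documentation, surrounding config omitted):

```json
{
  "PreventCrossDataCenterMasterFailover": true
}
```

With that set, orchestrator will still fail over within a data center but refuses to promote a master in a different one.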


Please assume best intentions. I read the whole thing. Top to bottom. Just trying to learn.

you're right, sorry.

Postmortems are always really interesting, if you're preparing for an interview where system design will be discussed you could do worse than incorporate reading a couple to help in preparation.

I wondered "is there a list of interesting postmortems, like there is for 'falsehoods programmers believe' posts?", and found one at https://github.com/danluu/post-mortems

a good halloween read for sure!!!

>At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds

lesson: a technician replacing a switch must be able to do it faster than the leader heartbeat timeout of the consensus protocol. (reminds me how in high school we trained to very quickly disassemble/assemble Kalashnikov machine gun - something like under 20 sec. total - the whole choreographed sequence of movements was learned and practiced like a samurai sword kata :)
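The joke has a kernel of truth: whether a blip triggers a leader election depends only on its length relative to the election timeout. A toy model of that race (timeout values are illustrative, not orchestrator/raft defaults):

```python
import random

# Toy model: a raft follower starts an election if no heartbeat arrives
# within its randomized election timeout. A 43-second link outage is an
# eternity next to timeouts typically measured in hundreds of milliseconds.

ELECTION_TIMEOUT_RANGE = (0.15, 0.30)  # seconds, illustrative values

def triggers_election(outage_seconds, rng=random.Random(42)):
    timeout = rng.uniform(*ELECTION_TIMEOUT_RANGE)
    return outage_seconds > timeout

print(triggers_election(43.0))   # True: the 43 s partition forces an election
print(triggers_election(0.05))   # False: heartbeats resume within the timeout
```

So the technician would have needed to swap the optics in a few hundred milliseconds, which even the best kata won't achieve.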

>> It’s possible for Orchestrator to implement topologies that applications are unable to support, therefore care must be taken to align Orchestrator’s configuration with application-level expectations.

So basically Orchestrator acted correctly, but the application layer is not nicely integrated with it? Sounds like something not well designed on the application side, which defeats the whole point of global consensus. Not much detail is provided on what exactly went wrong here, though.

They mentioned that they'll take extra care of this but I'm still very concerned about the reality that two systems (orchestrator & app) are so loosely coupled.

I don't think that's the right interpretation of it. The "unable to support" in this case seems to have been "can't reliably work due to larger latency between components". A tighter coupling between the two would likely not have encoded this knowledge either.

It seems like the obvious play here is to fail-over all workloads to the west coast, so they don't incur cross-regional latency. Do they explain why this wasn't possible? If so I cannot find it.

> All database primaries established in US East Coast again. This resulted in the site becoming far more responsive as writes were now directed to a database server that was co-located in the same physical data center as our application tier.

All their application servers were in the east coast DC. There wasn't anything to fail over to in the west coast DC. Why they had this fail over config for their DB is a bit unusual given the topology, it seems like they probably meant to use these replicas purely for DR.

The quote does not address my question. It just says that everything was fine once their masters were in the east again with their application servers. The question of why there are no application servers in the west to fail-over to is what I'm curious about.

That's a great question I was surprised to see unaddressed. It's unclear what workloads they normally have in East coast vs West coast, or if they have enough capacity to run everything in the West coast.

This is a great write up - making these public with all the gory technical details helps us all be better at our jobs.

I’m curious whether shutting down the east coast apps entirely and running off of the west coast was considered? Not enough capacity or some other problem?

edit: I guess the third technical initiative strongly implies that it was just a capacity issue: "This project has the goal of supporting N+1 redundancy at the facility level. The goal of that work is to tolerate the full failure of a single data center failure without user impact."

In the article they said that both regions consumed writes and got into an inconsistent state during the process.

> This effort was challenging because by this point the West Coast database cluster had ingested writes from our application tier for nearly 40 minutes. Additionally, there were the several seconds of writes that existed in the East Coast cluster that had not been replicated to the West Coast and prevented replication of new writes back to the East Coast.

As I read it their solution to the east coast write conflict was to just put those aside and reconcile them later (and they are still working on that) - I don't think that would impact their ability to turn off east coast app servers too.

It's better to go down hard than to have a crappy fix that tries to keep things alive.

On a side note, does anyone know if anything special was used to create the descriptive images in this blog post? They look great and describe connections between related regions really well.

Eg https://blog.github.com/assets/img/2018-10-25-oct21-post-inc...

The engineering team sketched them out by hand, and they were later translated into vectors by the design team.

The vector graphics were made in Sketch (https://www.sketchapp.com). Another popular alternative at GitHub is Figma (https://www.figma.com/).

That’s not it, but I really like Whimsical (https://whimsical.co) to produce great-looking diagrams.

I also don't know, but they don't look too difficult to make in Inkscape. That's my go-to tool for diagrams.


"..we will also begin a systemic practice of validating failure scenarios before they have a chance to affect you. This work will involve future investment in fault injection and chaos engineering tooling at GitHub."

The first diagram labeled "Normal Topology" shows Master in East and no other master. Later they acknowledge that lots of stuff doesn't work if Master is not in East because of latency. So then there's all this Orchestration, and it never could have worked in the first place?

That seems incredible - what am I missing?

The CAP theorem strikes again.

Untested RPC timeouts strike again. Every service needs integration tests that exercise timeouts. Some config and code can't be exercised in automated tests and needs regular disaster readiness testing.

Service client libraries need unit tests that show the library returning expected errors for all the failure scenarios: missing config, name lookup failure, unreachable, refusing connections, closing connections, returning error responses, returning garbage responses, refusing writes, responding with high latency, and responding with low throughput.
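A sketch of what one such test might look like, exercising the timeout path without any network. All the names here (`fetch_status`, `ServiceTimeout`, the fake connection) are hypothetical, just to illustrate the shape of the test:

```python
import socket
from unittest import mock

class ServiceTimeout(Exception):
    """Typed error the application can branch on."""

def fetch_status(conn, timeout=0.5):
    # Hypothetical client call: always bound the wait, and translate the
    # low-level timeout into the library's own error type instead of
    # leaking the raw exception (or hanging forever with no timeout).
    try:
        return conn.request("/status", timeout=timeout)
    except socket.timeout as exc:
        raise ServiceTimeout("status check timed out") from exc

# Force the timeout path: a fake connection whose request() raises,
# standing in for an unresponsive dependency.
conn = mock.Mock()
conn.request.side_effect = socket.timeout("read timed out")
try:
    fetch_status(conn)
except ServiceTimeout as e:
    print(f"handled: {e}")  # prints "handled: status check timed out"
```

The same fake-transport pattern extends to the other failure modes: connection refused, garbage responses, slow reads, and so on.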

This makes me wonder why GitHub is effectively using a single database (though replicated with HA considerations and CQRS-ized).

From the article, it appears they are partitioning based on function (commits in this DB, PRs in that cluster)... but I just don't see a strong business need to glob all commit data together into one massive datastore. Perhaps it's an economy-of-scale driver?

This comment... I've thought about this a lot. I think asking this question is probably the most meaningful insight into this failure that has appeared.

What GitHub has done is create a giant single point of failure in the form of a globe spanning under-tested mysql database to manage a vast number of unrelated git repos, most of which are dormant. At any moment only a minuscule fraction of these repos are actually mutating.

One can imagine a system that doesn't conflate every GitHub repo into a monster distributed mysql database. Each repo could be coupled with an independent SQLite database, for instance, that spins up in a few microseconds and synchronizes with its distributed copies only when necessary. Otherwise these lie dormant and available, safe from the vagaries of some under-specified planet-scale mysql instance.

I imagine testing such a design would be vastly easier as well; one need not replicate the conditions of this mighty distributed mysql construct. Whatever aggregation is ultimately necessary (analysis, billing, etc.) could be performed with asynchronous ETL into some database with minimal coordination and no risk to availability for customers.
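The per-repo idea above can be sketched in a few lines (entirely hypothetical schema, in-memory databases for illustration): each repository gets its own tiny store, so a failure or lock is scoped to one repo rather than the whole fleet.

```python
import sqlite3

# Toy model of per-repo metadata stores: one SQLite database per
# repository (in-memory here; on disk each would live as a file beside
# the repo). A problem opening one affects only that repo.

def open_repo_db(repo_name):
    db = sqlite3.connect(":memory:")  # on disk: f"{repo_name}/meta.db"
    db.execute(
        "CREATE TABLE IF NOT EXISTS comments (id INTEGER PRIMARY KEY, body TEXT)"
    )
    return db

dbs = {name: open_repo_db(name) for name in ("octocat/hello", "torvalds/linux")}
dbs["octocat/hello"].execute("INSERT INTO comments (body) VALUES ('LGTM')")
count = dbs["octocat/hello"].execute("SELECT COUNT(*) FROM comments").fetchone()[0]
print(count)  # 1 -- and 'torvalds/linux' remains untouched
```

Whether coordinating millions of such stores is actually simpler than one replicated MySQL fleet is, of course, the real question.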

No reflection on the root cause? I'm not into hardware, but shouldn't there be a redundant network connection in place, or at least one set up?

Very impressive complexity. Really appreciate the transparency. Must have been a couple of stressful days.

Didn't lose data perhaps, but sounds like information was definitely lost and/or corrupted.

I didn't learn any of this in school. Any MOOCs for cloud architecture?

So they geographically replicated their mysql in order to survive such a partition and instead they destroyed their entire database and now have no reason to believe any of it is consistent at all.

Just five days before Microsoft completed the GitHub acquisition so they had to lower their QoS to meet the expectations of everyone using Azure...
