Of course, without detailed knowledge of how GitHub's internals work, all we can do is speculate. But just based on what was explained in this blog post, it sounds like they're replicating database updates asynchronously, without waiting for the updates to be acknowledged by slaves before the master allows them to commit. Which means the data on slaves is always slightly out-of-date, and becomes more out-of-date when the slaves are partitioned from the master. Which means that promoting a slave to master will by definition lose some committed writes.
If "guarding the confidentiality and integrity of user data is GitHub’s highest priority", then why would they build and deploy an automated failover system whose purpose is to preserve availability at the cost of consistency? And why were they apparently caught off-guard when it operated as designed?
(Reading point 1 under "technical initiatives", it seems that they consider intra-DC failover to be "safe", and cross-DC failover to be "unsafe". But the exact same failure mode is present in both cases; the only difference is the length of the time window during which in-flight writes can be lost.)
> In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. It provides a way to achieve lossless failovers: any change applied on the master is either applied or waiting to be applied on one of the replicas.
They only require one other replica (as opposed to a quorum) to be reachable from the master for the master to continue acknowledging writes. If a new master has been elected on the other side of the partition, both will continue acknowledging writes.
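To see why a single-ack requirement permits this, here's a toy sketch (not GitHub's actual topology; the Master/Replica classes and names are all invented):

```python
# Toy model: a master keeps acknowledging writes as long as ANY one reachable
# replica acks. Nothing forces the two sides of a partition to agree.

class Replica:
    def ack(self, txn):
        return True

class Master:
    def __init__(self, name, replicas):
        self.name = name
        self.replicas = replicas      # replicas this master can still reach
        self.committed = []

    def write(self, txn):
        # Semi-sync style: one ack from any reachable replica is enough.
        if any(r.ack(txn) for r in self.replicas):
            self.committed.append(txn)
            return True
        return False                  # in real life: block, then time out

# Partition: east keeps a local replica; west promotes and keeps its own.
east = Master("east-master", [Replica()])
west = Master("west-master (newly promoted)", [Replica()])

east.write("txn-A")
west.write("txn-B")
# Both "masters" now hold committed writes the other never saw: split brain.
print(east.committed, west.committed)   # ['txn-A'] ['txn-B']
```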
They noted this as a limitation that they were working on (but unfortunately a bit too late in hindsight):
> Notably, on a data center isolation scenario, and assuming a master is in the isolated DC, apps in that DC are still able to write to the master. This may result in state inconsistency once network is brought back up. We are working to mitigate this split-brain by implementing a reliable STONITH from within the very isolated DC. As before, some time will pass before bringing down the master, and there could be a short period of split-brain. The operational cost of avoiding split-brains altogether is very high.
> They only require one other replica (as opposed to a quorum) to be reachable from the master for the master to continue acknowledging writes (from your comment)
> Orchestrator considers a number of variables during this process and is built on top of Raft for consensus... (from the article)
Doesn't quite make sense to me. Doesn't raft require the master to wait for a quorum before committing writes? I understood it as a pretty important aspect.
> A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers (from the raft paper)
I understood that the bolded part is only true if commits are acknowledged by a quorum.
Edit: I'm not implying your post is incorrect, just trying to understand how the two fit together
Different kinds of masters. I've forced myself to always include the application name when terminology gets reused across applications in a discussion. As far as I understand it (rough sketch after the list):
- The mysql-master only commits an sql-level-transaction if a mysql-replica acknowledges replication of the transaction. This avoids data loss during a failover.
- The orchestrator-nodes elect an orchestrator-master via raft mechanics.
- After this, the orchestrator-master starts pondering about the mysql-master / mysql-replica situation and potentially promotes the most current mysql-replica to mysql-master.
- This promotion requires writes on an orchestrator/raft level, which in turn requires the orchestrator-master to get an acknowledgement from the orchestrator-quorum.
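A very rough sketch of how I read the two layers fitting together (all names and numbers invented, not orchestrator's actual API):

```python
ORCHESTRATOR_NODES = 3   # orchestrator/raft cluster size

def orchestrator_quorum(acks: int) -> bool:
    """Orchestrator-level decisions (leader election, failover) need a raft majority."""
    return acks > ORCHESTRATOR_NODES // 2

def mysql_semi_sync_commit(replica_acks: int) -> bool:
    """MySQL data writes: the mysql-master commits once any single mysql-replica acks."""
    return replica_acks >= 1

# A failover is an orchestrator-level write, so it needs 2-of-3 orchestrator acks,
# even though the data being failed over only ever required 1 replica ack.
print(orchestrator_quorum(2), mysql_semi_sync_commit(1))   # True True
```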
Interestingly enough this seems to be how managed sql solutions do it. You just have one primary and one failover replica in another zone. So a quorum write is just a synchronous write to both (2 out of 2). You don't have the split-brain problem because the original primary can't make progress if it can't contact the replica.
Github's problem is that they were trying to be too smart and allow any of their replicas to be candidate masters without increasing their quorum size. In theory this is higher availability than the managed sql solutions (they can be available even if their entire coast gets nuked) but they do it at the cost of consistency in the more common failure scenarios.
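A sketch of that contrast, assuming exactly one primary plus one synchronous failover replica (function names made up):

```python
# 2-of-2 synchronous: a write commits only if BOTH the primary and the single
# designated failover replica have it. A partitioned primary stalls instead of
# diverging, so there is no split brain to reconcile later.

def commit_two_of_two(primary_ok: bool, replica_ok: bool) -> bool:
    return primary_ok and replica_ok

# "Ack from any 1 of N replicas": a partitioned primary that can still reach
# one local replica keeps committing, which is where divergence starts.
def commit_any_one(replica_acks: int) -> bool:
    return replica_acks >= 1

print(commit_two_of_two(True, False))   # False -> primary blocks during a partition
print(commit_any_one(1))                # True  -> writes keep flowing on both sides
```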
For consistency, all of the slaves should be applying the stream of updates to their local replicas in the same order that they happened on the master. That means each slave has seen a prefix of the master's data; its state can be uniquely represented by a position in the replication log. So you can easily determine which replica is most up-to-date by taking the one with the largest position. (If you're doing synchronous replication correctly, at least one replica is guaranteed to be completely up-to-date, even after an unexpected failure.)
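For example (invented hostnames and positions), picking the most up-to-date replica is just a max over log positions:

```python
# Each replica has applied a prefix of the master's replication log, so its
# state is fully described by one number: its position in that log.
replica_positions = {"db-replica-1": 10_482, "db-replica-2": 10_517, "db-replica-3": 10_301}

most_up_to_date = max(replica_positions, key=replica_positions.get)
print(most_up_to_date)   # db-replica-2
```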
Vector clocks only come into play when you are simultaneously accepting writes on multiple masters, and you need to keep track of which updates have already been applied from each source.
Exactly? Maybe? I'm reading a lot into little information here. It wasn't clear to me if "one or more" was the master + a single specific replica, or something closer to master + 1/N. And then there's the automated optimistic leader election. Combine those with (multiple) partitions and you could have a very bad day.
Rereading their RCA now, I'm guessing the above isn't the case. "Restore the entire datacenter" reads more like they preserved the ~40 seconds of east coast partition writes, restored everything in east to a known checkpoint, and resumed replication off the more advanced masters in the west coast. If it's "just" two masters per cluster, both of which are available and internally consistent, you "just" have to reconcile those two write streams.
Anecdotally I think I see this regularly. Part of the build pipeline I work on requires recursively downloading a particular folder for a given repo. From time to time, I'll have the latest git ref give me data from the previous git ref. Everywhere else on GitHub the information is up-to-date. But not the API endpoint to download files. When this happens it takes roughly 5-7 minutes for the correct information to show up.
After reading this and all the other resources people have posted I'm definitely going to see if I can reliably reproduce the error.
I could believe that an endpoint to download files looks at a cache, and if you are just using 'master' or 'latest' as opposed to a commit, I could believe that would be out of date.
But given the issues, why continue to use the API? Because those files are related to the build pipeline logic and they change maybe once in 3-4 months. I can live with an error once in 3-4 months. Recently I was working on changing some of the logic of the pipeline and that's when I ran into it enough to say that there was something odd from Github's end.
Sure. But that is where the application code comes in. You usually have at least two db connections: one to a slave and one to the master. If you are checking whether an account exists [important], you can ask the master since it will be in an up-to-date state. If you want to build someone's feed on github, then doing all your sql through the slave is fine. They also probably have layers of atomic cache on top which handle 99% of the requests [memcache, redis, whatever]. E.g. does the account exist? Check the cache, then check the master.
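Something like this hypothetical routing helper (names invented, nothing like GitHub's actual code):

```python
cache = {}             # stand-in for memcache/redis
master_db = {"alice"}  # stand-in for the master (authoritative, up to date)
replica_db = set()     # stand-in for a replica that may lag behind

def account_exists(username: str) -> bool:
    if username in cache:        # 1. cache absorbs most of the traffic
        return True
    if username in master_db:    # 2. correctness matters here -> ask the master
        cache[username] = True
        return True
    return False

def build_feed(username: str) -> list:
    # Feed contents can tolerate replication lag -> read from the replica.
    return sorted(replica_db)

print(account_exists("alice"))   # True, even if the replica hasn't caught up yet
```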
Use master for transactional read/writes.
Use replicas (slaves) for readonly.
More or less?
This may be why failover is "safe" in certain circumstances.
Unless you temporarily pause writes to the primary before promoting a slave.
I'm always impressed when people actually apologize instead of dancing around an almost apology.
Overall this is one of the best post mortems I've seen - great tone, very well written, super informative, has all of the steps (apology, information on issue, real steps to prevent it happening again) that hurt customers generally want to see. Really impressive timelines too - 43s reconnect after initial issue, 15 minutes to change status.
Gitlab overall seems to have more incidents like this and I really like their custom of working through them in public Google docs. Definitely seems like a better idea than relying on your own tech (github pages) for incident communication.
The mindset that problems are caused by failures in process, and not people, is probably right. Even if not, I think it has become popular because it describes a world we would like to live in. It reduces anxiety, and might not negatively affect people's diligence, because there tends to be diminishing returns with such negative feelings (cf. why the fear of starvation does not promote entrepreneurship).
It might, however, be misapplied when used for companies. Gitlab is the best example here: It seems like they actually came out ahead in the famous incident you're referring to, even though that incident was the most breathtaking series of organisational incompetence imaginable.
So it's completely ok to spare the head of the one guy who happened to issue the final 'drop db;'. But let's not congratulate them too often for at least telling us that of their three redundant backup mechanisms, one had never worked, one silently stopped working months earlier, and the last was just erased by a 20-something who mixed up his terminal windows (paraphrasing from memory).
It's very informative and also dives deep into their "Whys" pertaining to the outage.
I’m a little confused by this part:
> While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service. This procedure is tested daily at minimum, so the recovery time frame was well understood, however until this incident we have never needed to fully rebuild an entire cluster from backup and had instead been able to rely on other strategies such as delayed replicas.
At first, I had assumed this was Glacier (“it took a long time to download”). But the daily retrieval testing suggests it’s likely just regular S3. Multiple TB sounds like less than 10.
So the question becomes “Did GitHub have less than 100 Gbps of peering to AWS?”. I hope that’s an action item if restores were meant to be quick (and likely this will be resolved by migrating to Azure, getting lots of connectivity, etc.).
> A significant portion of the time was consumed transferring the data from the remote backup service.
I get the time to rebuild part, but I’m curious about the download part.
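Back-of-the-envelope, assuming ~10 TB (the post only says "multiple terabytes"): the transfer alone only takes hours if the effective bandwidth is well below 100 Gbps.

```python
backup_tb = 10                      # "multiple terabytes"; 10 TB is a guess
backup_bits = backup_tb * 8 * 10**12

for gbps in (100, 10, 1):
    hours = backup_bits / (gbps * 10**9) / 3600
    print(f"{gbps:>3} Gbps -> {hours:.1f} h")

# 100 Gbps -> 0.2 h
#  10 Gbps -> 2.2 h
#   1 Gbps -> 22.2 h
```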
(disclosure - I work for a competitor, not on cloud stuff)
You don't mention Google at all outside of the opening statement so who would read it that way?
(Plus, I'm always pleased to see someone not call it a 'disclaimer'!)
Fortunately HN has kept the culture of disclosure intact.
Admittedly, I don't recall all the incidents where automatic failover minimized downtime, and probably if a human had to intervene in each of those, the cumulative downtime would be more significant.
But boy, it sure doesn't feel like it.
Assuming you have that, it's OK to rely on a human to assess the situation, make sure the dead master is really dead, salvage any partially replicated transactions, and crown a new master. With the right tools, it could take only a few minutes -- a bit longer if you have to wait for the old master to boot to see if it had locally committed transactions that didn't make it to the network. If it takes 5 minutes to resolve this (including time to get to the console), you can do this ten times a year and still have three nines.
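The arithmetic behind that claim (three nines works out to roughly 525 minutes of allowed downtime per year):

```python
minutes_per_year = 365 * 24 * 60              # 525,600
three_nines_budget = minutes_per_year * 0.001 # ~525.6 minutes/year

incidents, minutes_each = 10, 5
print(round(three_nines_budget), incidents * minutes_each)   # 526 vs 50
```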
For the more likely case where it's a network blip, the situation resolves itself (in a nice way) by the time the operator gets to the console.
Stateless applications are simple. Systems built with this in mind, like cassandra, elasticsearch, or redis+sentinel, just do it right after two or three settings, like minimum quorum sizes.
But if you have systems without this built in, like NFS, Mysql, Postgres? I guess we don't hear about the successful automated failovers, but we surely hear about really messy automated failover attempts.
> In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. [...]
> Consistency comes with a cost: a risk to availability. Should no replica acknowledge receipt of changes, the master will block and writes will stall. Fortunately, there is a timeout configuration, after which the master can revert back to asynchronous replication mode, making writes available again.
> We have set our timeout at a reasonably low value: 500ms. It is more than enough to ship changes from the master to local DC replicas, and typically also to remote DCs.
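A toy model of that revert-to-async behavior (the 500ms is from the quote above; everything else is invented):

```python
import time

SEMI_SYNC_TIMEOUT = 0.5   # the 500ms from the quote

def commit(replica_acked) -> str:
    """replica_acked() returns True once some replica has acknowledged the change."""
    deadline = time.monotonic() + SEMI_SYNC_TIMEOUT
    while time.monotonic() < deadline:
        if replica_acked():
            return "committed (semi-sync)"
        time.sleep(0.01)
    # No ack in time: writes stay available, but the durability guarantee is gone.
    return "committed (reverted to async)"

print(commit(lambda: True))    # replica reachable -> semi-sync commit
print(commit(lambda: False))   # no replica reachable -> async fallback after ~500ms
```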
> The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.
> However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls.
Network computing is rough. (I am not being at all sarcastic).
I once saw a ten-second “grey failure” in the network cause an entire SAN cluster to go read-only for a day while it checksummed all replicas.
If you don’t Chaos-monkey test these things in production, they really won’t work when needed. And you can’t just “unplug a box” in your testing, you need to test slowdowns and intermittent errors under production loads.
Few enterprises are willing to risk or invest in that sort of testing. So we all hope for the best and point fingers upstream at vendors when the HA doesn’t work as advertised.
Initial triage inside 10 minutes then change control, escalation, major incident, and public communication within another 9 minutes. That's hard to beat with humans in the loop.
Lots of companies are wary about letting engineers declare downtime directly - it could have costs down the line giving customers refunds etc.
Getting legal, PR, and 3 levels of management to all agree on a message before it gets published is a recipe for the downtime notice taking hours to be published.
WRT public communication, yeah. Overly onerous and ad-hoc seems to be the norm. But we do have "analysts" publicly using status post count as inverse evidence of reliability. Creates some real bad incentives for the actual customers.
Probably within another 30 seconds that human sees many alerts coming in that don't look like false alarms and hits the 'escalate' button.
If it's within office hours, anything that looks major will usually have a bunch of SREs on the case within 3 or 4 minutes. Usually in various roles, from 'I acknowledged this page, it's my job to fix it!' to 'I'm just laughing at your misfortune, but paying a little attention just in case the broken thing turns out to be that change I made yesterday and I have to take over mitigation efforts'.
Looks like Github could be OK after the Microsoft acquisition.
That actually goes against what happened; they seemed to have implemented an automated failover which resulted in a split-brain situation rather than letting one DC completely degrade to read-only.
However mistakes happen and hindsight is 20-20.
Meh, I'm not against the Microsoft acquisition but I think it's still too soon to be making judgment calls like this. I can't imagine they've had much influence over internal processes just yet.
I'd guess a bit of internal (but not regional) google infrastructure, like spanner, had downtime.
On a related note, I often reach for Cassandra when starting projects knowing that building my application around its limited access approach has data replication benefits in the future. For all the flexibility benefits to devs given by SQL/RDBMSs, there are flexibility downsides.
I think I'd have made a different choice, but cool that they were upfront about it.
Their wording ("one of") makes it sound like they had up to perhaps one order of magnitude more (1k-10k), but they do not actually give us a useful number, merely say "it wasn't too much" and "one of an unknown number of total clusters had 1k".
> I think I'd have made a different choice
I think you might misunderstand how these things go and the tradeoffs.
At the beginning of the incident, they find themselves with lots of writes in the west-coast master which aren't in the east-coast one, and some in the east-coast one that aren't in the west-coast one.
Orchestrator cannot promote a working master because they have diverged.
Your choices are:
1. Lose data (large unknown amount) by dumping an hour of west-coast data and going back to east-coast data, can be done in maybe 3 hours total by just deleting west-coast and serving all traffic from east-coast as you rebuild the west-coast cluster.
2. Lose data (small amount) by rebuilding east-coast from a backup so it can be promoted to, replicating west coast data to it, promoting to it (what they did, 24 hours of time), and then try to manually fix up the small amount of lost data after-the-fact (ongoing)
3. Develop tooling to automate the reconciliation of data while the site is down such that the east-coast side can be merge-promoted without a rebuild, probably takes at least 3 days to build and might break everything, but if it works it probably merge-promotes in under an hour.
4. Keep the site down until east-coast data is manually reconciled, still requires rebuilding east to promote west to it, but then requires manually handling some lost writes... probably about 30-40 hours total downtime.
5. Update the application servers to work fine when the us-west DC is the master, probably about 2 months of development with the site down for the duration.
Which choice would you have made instead? I'm willing to bet they went with either 2 or 4 (and if 4, changed to 2 when it took longer than expected). They probably assumed it would take about 4 hours, and then it simply took much longer than expected because computers are complicated, it turns out.
10,000 writes can be manually reconciled (been there, done that), and the number of users it affects is small enough that it's not appropriate to keep the site down for everyone for a day when you could instead keep it down for 2 days for a small number of users you can contact (and compensate!) separately; the downtime for the unaffected users could be significantly reduced that way.
Alternatively, why not just let the West-Coast replicate to an East-Coast slave, and switch em. Surely the peering bandwidth between West-Coast/East-Coast is higher than whatever they were doing pulling full backups from the cloud?
With no offence meant, I'm glad you don't get to make decisions at github then. I'd lose faith in github for that since their primary purpose is to store other people's data (git blobs+metadata, comments, etc). If it was their data, it could be fine, but since it's user data, I don't think losing it is acceptable, even if that means being read-only for an entire week.
> why not just let the West-Coast replicate to an East-Coast slave, and switch em?
At the time of the incident, east-coast has already diverged, so it's not possible to replay logs from west-coast without moving east-coast back in time to some point prior to west-coast (but a point recent enough that west-coast still has full replica logs to replay).
It's not possible to simply "rewind" a mysql database typically without restoring a backup.
It's not possible in many mysql clustering solutions to catch-up to a peer unless you're already within a recent time window where the replay logs are still available.
As such, the only option is to restore from backup and then catch up, which is what they did.
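Roughly, the constraint looks like this (invented positions, not MySQL's actual GTID/binlog machinery):

```python
def can_catch_up(follower_pos: int, peer_oldest_retained_pos: int,
                 follower_has_diverged: bool) -> bool:
    """Replaying the peer's logs only works if we haven't diverged and the peer
    still retains logs back to our position; otherwise: restore backup, then replay."""
    if follower_has_diverged:      # east coast had writes of its own
        return False
    return follower_pos >= peer_oldest_retained_pos

print(can_catch_up(4_182, 3_000, follower_has_diverged=True))    # False -> rebuild from backup
print(can_catch_up(4_182, 3_000, follower_has_diverged=False))   # True  -> just replay logs
```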
I assume if taking another backup of west-coast, transferring it to east-coast over the DC link, and restoring it was faster, they would have done it, but I would be totally unsurprised if that was about the same total time.
I'd also say that if you have a practiced procedure (restoring from a scheduled backup in the cloud) vs an ad-hoc procedure (custom backup, custom storage and transfer), the former is probably much safer to do when you're under fire and want to minimize risky moves stressed engineers have to make.
It's not a great UX but neither is 24 hours of downtime.
Also, MSSQL has had quick delta snapshot/restores forever (assuming you have them turned on and going every hour or so), mysql really does not have a similar feature?
It is becoming somewhat common to run large MySQL deployments on filesystems which do (like ZFS) to get a similar effect, albeit without the ability to restore a snapshot on a host while that database host is online. There is even some experimental work going on in the community to use ZFS snapshots as a state-transfer mechanism for laggy nodes in a Galera MySQL cluster, which seems promising.
The solution applied is to prohibit cross-country failover in Orchestrator, but allow it to continue doing failover within regions.
lesson: a technician replacing a switch must be able to do it faster than the leader heartbeat timeout of the consensus protocol. (Reminds me of how in high school we trained to very quickly disassemble/assemble a Kalashnikov machine gun - something like under 20 sec. total - the whole choreographed sequence of movements was learned and practiced like a samurai sword kata :)
So basically Orchestrator acted correctly but the application layer is not nicely integrated with it? Sounds like something not well designed on the application side, which defeats the whole point of global consensus. Not much detail is provided on what exactly went wrong here, though.
They mentioned that they'll take extra care of this but I'm still very concerned about the reality that two systems (orchestrator & app) are so loosely coupled.
All their application servers were in the east coast DC. There wasn't anything to fail over to in the west coast DC. Having this failover config for their DB is a bit unusual given the topology; it seems like they probably meant to use these replicas purely for DR.
I’m curious whether shutting down the east coast apps entirely and running off of the west coast was considered? Not enough capacity or some other problem?
edit: I guess the third technical initiative strongly implies that it was just a capacity issue: "This project has the goal of supporting N+1 redundancy at the facility level. The goal of that work is to tolerate the full failure of a single data center failure without user impact."
> This effort was challenging because by this point the West Coast database cluster had ingested writes from our application tier for nearly 40 minutes. Additionally, there were the several seconds of writes that existed in the East Coast cluster that had not been replicated to the West Coast and prevented replication of new writes back to the East Coast.
The vector graphics were made in Sketch (https://www.sketchapp.com). Another popular alternative at GitHub is Figma (https://www.figma.com/).
"..we will also begin a systemic practice of validating failure scenarios before they have a chance to affect you. This work will involve future investment in fault injection and chaos engineering tooling at GitHub."
That seems incredible - what am I missing?
Service client libraries need unit tests that show the library returning expected errors for all the failure scenarios: missing config, name lookup failure, unreachable, refusing connections, closing connections, returning error responses, returning garbage responses, refusing writes, responding with high latency, and responding with low throughput.
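A hypothetical example of one such test (all names invented): assert that a connection-refused failure surfaces as a typed error instead of something generic or a hang.

```python
import socket
import unittest

class UnreachableError(Exception):
    """Typed error the (imaginary) client library should raise."""

class FakeServiceClient:
    def fetch(self, host: str, port: int, timeout: float = 0.2) -> bytes:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return b"ok"
        except OSError as exc:
            raise UnreachableError(f"{host}:{port}") from exc

class FailureModeTests(unittest.TestCase):
    def test_connection_refused_maps_to_typed_error(self):
        client = FakeServiceClient()
        # Assumes nothing is listening on localhost port 9 (discard).
        with self.assertRaises(UnreachableError):
            client.fetch("127.0.0.1", 9)

if __name__ == "__main__":
    unittest.main()
```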
From the article, it appears they are partitioning based on function (commits in this DB, PRs in this cluster)... but I just don't see a strong business need to glob all commit data together into one massive datastore. Perhaps it's an economy-of-scale driver?
What GitHub has done is create a giant single point of failure in the form of a globe spanning under-tested mysql database to manage a vast number of unrelated git repos, most of which are dormant. At any moment only a minuscule fraction of these repos are actually mutating.
One can imagine a system that doesn't conflate every GitHub repo into a monster distributed mysql database. Each repo could be coupled with an independent SQLite database, for instance, that spins up in a few microseconds and synchronizes with its distributed copies only when necessary. Otherwise these lay dormant and available, safe from the vagaries of some under-specified planet scale mysql instance.
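A minimal sketch of that per-repo idea (purely hypothetical, nothing like GitHub's actual schema):

```python
import sqlite3
from pathlib import Path

def repo_db(repo: str) -> sqlite3.Connection:
    """One tiny, independent database per repo; dormant repos cost nothing."""
    path = Path("repo-meta") / (repo.replace("/", "__") + ".sqlite")
    path.parent.mkdir(exist_ok=True)
    conn = sqlite3.connect(str(path))
    conn.execute("CREATE TABLE IF NOT EXISTS pull_requests (id INTEGER PRIMARY KEY, title TEXT)")
    return conn

conn = repo_db("octocat/hello-world")
conn.execute("INSERT INTO pull_requests (title) VALUES (?)", ("Fix typo",))
conn.commit()
```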
I imagine testing such a design would be vastly easier as well; one need not replicate the conditions of this mighty distributed mysql construct. Whatever aggregation is ultimately necessary (analysis, billing, etc.) could be performed with asynchronous ETL into some database with minimal coordination and no risk to availability for customers.