
October 21 post-incident analysis - pietroalbini
https://blog.github.com/2018-10-30-oct21-post-incident-analysis/
======
teraflop
The timeline of events was interesting (and much appreciated), but the root
cause analysis doesn't really go much deeper than "we had a brief network
partition, and our systems weren't designed to cope with it", which still
leaves a whole lot of question marks.

Of course, without detailed knowledge of how GitHub's internals work, all we
can do is speculate. But just based on what was explained in this blog post,
it sounds like they're replicating database updates asynchronously, without
waiting for the updates to be acknowledged by slaves before the master allows
them to commit. Which means the data on slaves is always slightly out-of-date,
and becomes more out-of-date when the slaves are partitioned from the master.
Which means that promoting a slave to master will _by definition_ lose some
committed writes.
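
As a toy model of that failure mode (hypothetical Python, nothing GitHub-specific):

```python
# Async replication: the master acks commits before any replica has them,
# so promoting a replica after a partition drops the tail of the log.
class Node:
    def __init__(self):
        self.log = []

master, replica = Node(), Node()

def commit(tx):
    master.log.append(tx)           # acknowledged to the client immediately

def replicate():
    replica.log = list(master.log)  # best-effort, happens "later"

commit("tx1"); replicate()
commit("tx2"); commit("tx3")        # partition hits before the next replicate()

promoted = replica                  # automated failover promotes the slave
lost = [tx for tx in master.log if tx not in promoted.log]
print(lost)                         # ['tx2', 'tx3'] -- acknowledged, but gone
```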

If "guarding the confidentiality and integrity of user data is GitHub’s
highest priority", then why would they build and deploy an automated failover
system whose purpose is to preserve availability at the cost of consistency?
And why were they apparently caught off-guard when it operated as designed?

(Reading point 1 under "technical initiatives", it seems that they consider
intra-DC failover to be "safe", and cross-DC failover to be "unsafe". But the
exact same failure mode is present in both cases; the only difference is the
length of time during which in-flight writes can be lost.)

~~~
zawerf
I believe you're right. They go into their design in their MySQL HA post:

https://githubengineering.com/mysql-high-availability-at-github/#semi-synchronous-replication

> In MySQL’s semi-synchronous replication a master does not acknowledge a
> transaction commit until the change is known to have shipped to one or more
> replicas. It provides a way to achieve lossless failovers: any change
> applied on the master is either applied or waiting to be applied on one of
> the replicas.

They only require one other replica (as opposed to a quorum) to be reachable
from the master for the master to continue acknowledging writes. If a new
master has been elected on the other side of the partition, both will continue
acknowledging writes.

They noted this as a limitation that they were working on (but unfortunately a
bit too late in hindsight):

> Notably, on a data center isolation scenario, and assuming a master is in
> the isolated DC, apps in that DC are still able to write to the master. This
> may result in state inconsistency once network is brought back up. We are
> working to mitigate this split-brain by implementing a reliable STONITH from
> within the very isolated DC. As before, some time will pass before bringing
> down the master, and there could be a short period of split-brain. The
> operational cost of avoiding split-brains altogether is very high.
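
To make the split-brain concrete, a sketch with a made-up four-node topology (not GitHub's actual config):

```python
# Semi-sync with "wait for >= 1 replica ack": any side of a partition that
# keeps at least one replica keeps committing. A majority quorum would not.
def semi_sync_ok(reachable_replicas):
    return len(reachable_replicas) >= 1

def majority_ok(reachable_nodes, cluster_size):
    return reachable_nodes > cluster_size // 2

# 4 nodes split 2/2: east = old master + 1 replica, west = new master + 1.
print(semi_sync_ok(["east-2"]))  # True  -> east keeps taking writes
print(semi_sync_ok(["west-2"]))  # True  -> so does west: split-brain
print(majority_ok(2, 4))         # False -> a quorum would stall both sides
```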

~~~
sealjam
I'm by no means an expert on any of this stuff, but:

> They only require one other replica (as opposed to a quorum) to be reachable
> from the master for the master to continue acknowledging writes (from your
> comment)

> Orchestrator considers a number of variables during this process and is
> built on top of Raft for consensus... (from the article)

Doesn't quite make sense to me. Doesn't Raft require the master to wait for a
quorum before committing writes? I understood it as a pretty important aspect.

> A candidate must contact a majority of the cluster in order to be elected,
> which means that _every committed entry must be present in at least one of
> those servers_ (from the raft paper [1])

I understood that the emphasized part is only true if commits are
acknowledged by the quorum.
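
For concreteness, the commit rule as I understand it (a sketch, not the paper's pseudocode):

```python
# Raft commits an entry only once a majority holds it; election also needs
# a majority, and any two majorities overlap, so a new leader must have it.
def majority(n):
    return n // 2 + 1

def committed(replicas_holding_entry, cluster_size):
    return replicas_holding_entry >= majority(cluster_size)

print(committed(3, 5))  # True:  every 3-vote election majority overlaps
print(committed(2, 5))  # False: the other 3 nodes could elect a leader
```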

Edit: I'm not implying your post is incorrect, just trying to understand how
the two fit together

[1]: https://raft.github.io/raft.pdf

~~~
sethammons
This was my first thought, you beat me to it. It seems like they contradict
each other. Is it Raft with quorum or is it a single replica node that is
caught up with master? You can't have both (unless you only have three nodes).

~~~
zawerf
> unless you only have three nodes

Interestingly enough, this seems to be how managed SQL solutions do it. You
just have one primary and one failover replica in another zone. So a quorum
write is just a synchronous write to both (2 out of 2). You don't have the
split-brain problem because the original primary can't make progress if it
can't contact the replica.

https://cloud.google.com/sql/docs/mysql/high-availability

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html

GitHub's problem is that they were trying to be too smart and allow any of
their replicas to be candidate masters without increasing their quorum size.
In theory this is higher availability than the managed sql solutions (they can
be available even if their entire coast gets nuked) but they do it at the cost
of consistency in the more common failure scenarios.
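
In sketch form, assuming a toy two-node setup:

```python
# 2-of-2 synchronous writes: if the primary can't reach its only replica,
# writes block rather than diverge -- consistency kept, availability lost.
def write(replica_reachable):
    if not replica_reachable:
        raise TimeoutError("replica unreachable; blocking writes")
    return "committed on both nodes"

print(write(True))   # committed on both nodes
try:
    write(False)     # partition: the primary stalls instead of splitting
except TimeoutError as e:
    print(e)
```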

------
sixhobbits
> With this incident, we failed you, and we are deeply sorry.

I'm always impressed when people actually apologize instead of dancing around
an almost apology.

Overall this is one of the best post-mortems I've seen - great tone, very well
written, super informative, and it has all of the steps (apology, information
on the issue, real steps to prevent it happening again) that hurt customers
generally want to see. Really impressive timelines too - 43s reconnect after
the initial issue, 15 minutes to change status.

GitLab overall seems to have more incidents like this, and I really like their
custom of working through them in public Google Docs. That definitely seems a
better idea than relying on your own tech (GitHub Pages) for incident
communication.

~~~
nraval1729
Definitely check out GitLab's post-mortem from last year:

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

It's very informative and also dives deep into their "Whys" pertaining to the
outage.

~~~
dsumenkovic
Thanks for sharing that. We're glad you appreciate our transparency, which is
one of the core values at GitLab.

------
boulos
Disclosure: I work on Google Cloud.

I’m a little confused by this part:

> While MySQL data backups occur every four hours and are retained for many
> years, the backups are stored remotely in a public cloud blob storage
> service. The time required to restore multiple terabytes of backup data
> caused the process to take hours. A significant portion of the time was
> consumed transferring the data from the remote backup service. This
> procedure is tested daily at minimum, so the recovery time frame was well
> understood, however until this incident we have never needed to fully
> rebuild an entire cluster from backup and had instead been able to rely on
> other strategies such as delayed replicas.

At first, I had assumed this was Glacier (“it took a long time to download”).
But the daily retrieval testing suggests it’s likely just regular S3. Multiple
TB sounds like less than 10.

So the question becomes “Did GitHub have less than 100 Gbps of peering to
AWS?”. I hope that’s an action item if restores were meant to be quick (and
likely this will be resolved by migrating to Azure, getting lots of
connectivity, etc.).

~~~
hosay123
This was an interesting point, but prefixing it with an appeal to authority
was both needless and distracting. Why did you include it?

~~~
aeling
I didn't read the Google Cloud part as an assertion of authority, just as a
disclosure - if they're talking about competitors (and especially how the
choice of competitor negatively impacted GitHub) I appreciate it.

(disclosure - I work for a competitor, not on cloud stuff)

~~~
wnevets
I also appreciate it. It's very common for owners/employees to criticize/attack
competitors online anonymously. While the GP wasn't attacking, it's just nice
to know he works for a competitor.

------
js2
In my career, the worst outages (longest downtime) I can recall have been due
to HA + automatic failover. Everything from early NetApp clustering solutions
corrupting the filesystem to cross-country split-brain issues like this.

Admittedly, I don't recall all the incidents where automatic failover
minimized downtime, and probably if a human had to intervene in each of those,
the cumulative downtime would be more significant.

But boy, it sure doesn't feel like it.

~~~
toast0
In cases where you can rely on the system to self-repair, automatically moving
writes quickly seems reasonable. But otherwise, it seems like you want the
system to cope with the situation where writes are unavailable -- clearly a
lot of things will be broken, but if writes fail fast, reads are still viable.

Assuming you have that, it's OK to rely on a human to assess the situation,
make sure the dead master is really dead, salvage any partially replicated
transactions, and crown a new master. With the right tools, it could take only
a few minutes -- a bit longer if you have to wait for the old master to boot
to see if it had locally committed transactions that didn't make it to the
network. If it takes 5 minutes to resolve this (including time to get to the
console), you can do this ten times a year and still have three nines.
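
The arithmetic, as a quick sanity check:

```python
# Ten 5-minute manual failovers a year, checked against the 99.9% budget.
downtime = 10 * 5                 # minutes of downtime per year
minutes_per_year = 365 * 24 * 60  # 525,600
budget = minutes_per_year * 0.001 # 525.6 minutes allowed at three nines
print(f"{downtime} of {budget:.1f} min -> {1 - downtime / minutes_per_year:.5%}")
# 50 of 525.6 min -> 99.99049%
```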

For the more likely case where it's a network blip, the situation resolves
itself (in a nice way) by the time the operator gets to the console.

~~~
xenadu02
Indeed, given recent history I’d almost suggest it is better to take the site
down for a few minutes than let the automatic failover systems put you into a
24-hr degraded service situation.

------
terom
Reading this post-mortem and their MySQL HA post, this incident deserves a
talk titled: "MySQL semi-synchronous replication with automatic inter-DC
failover to a DR site: how to turn a 43s outage into a 24h outage requiring
manual data fixes"

https://githubengineering.com/mysql-high-availability-at-github/#semi-synchronous-replication

> In MySQL’s semi-synchronous replication a master does not acknowledge a
> transaction commit until the change is known to have shipped to one or more
> replicas. [...]

> Consistency comes with a cost: a risk to availability. Should no replica
> acknowledge receipt of changes, the master will block and writes will stall.
> Fortunately, there is a timeout configuration, after which the master can
> revert back to asynchronous replication mode, making writes available again.

> We have set our timeout at a reasonably low value: 500ms. It is more than
> enough to ship changes from the master to local DC replicas, and typically
> also to remote DCs.

https://blog.github.com/2018-10-30-oct21-post-incident-analysis/#2018-october-21-2252-utc

> The database servers in the US East Coast data center contained a brief
> period of writes that had not been replicated to the US West Coast facility.
> Because the database clusters in both data centers now contained writes that
> were not present in the other data center, we were unable to fail the
> primary back over to the US East Coast data center safely.

> However, applications running in the East Coast that depend on writing
> information to a West Coast MySQL cluster are currently unable to cope with
> the additional latency introduced by a cross-country round trip for the
> majority of their database calls.
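
Putting those two quotes together, the failure window is easy to model. A toy sketch (the real knob is MySQL's rpl_semi_sync_master_timeout):

```python
# After 500ms with no replica ack, the master reverts to async and keeps
# acknowledging writes that exist nowhere else; the 43s partition turned
# that window into divergence on both sides.
SEMI_SYNC_TIMEOUT_MS = 500

def commit(replica_ack_ms):
    if replica_ack_ms is not None and replica_ack_ms <= SEMI_SYNC_TIMEOUT_MS:
        return "lossless: on master and at least one replica"
    return "reverted to async: acknowledged, but on this master only"

print(commit(12))    # normal case: a local-DC replica acks in a few ms
print(commit(None))  # partition: no ack arrives -> async fallback
```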

------
jrochkind1
> Connectivity between these locations was restored in 43 seconds, but this
> brief outage triggered a chain of events that led to 24 hours and 11 minutes
> of service degradation.

Network computing is rough. (I am not being at all sarcastic).

~~~
why_only_15
2000x multiplier between connectivity loss and service degradation.

~~~
tatersolid
This is rather common; too many HA technologies really aren’t designed for
wide-area failover, or anything besides the total and complete failure of a
node.

I once saw a ten-second “grey failure” in the network cause an entire SAN
cluster to go read-only for a day while it checksummed all replicas.

If you don’t Chaos-monkey test these things in production, they really won’t
work when needed. And you can’t just “unplug a box” in your testing; you need
to test slowdowns and intermittent errors under production loads.

Few enterprises are willing to risk or invest in that sort of testing. So we
all hope for the best and point fingers upstream at vendors when the HA
doesn’t work as advertised.

~~~
jeremyjh
I've suffered more outages at the hands of HA database technologies (that were
administered in-house) than they have prevented. There are very few
organizations I trust to run them properly.

~~~
tatersolid
I’ve always stuck with log shipping to a warm standby wherever possible for
the same reason. Well-managed single instance databases can do 3+ nines
easily; very few applications actually require more availability than that.
Especially at 5x the cost in opex for a theoretical-but-not-really-wink-wink
extra 9.

~~~
jrochkind1
In GitHub's case, of course, they can't do a single-instance DB for
performance reasons, even before you get to HA. If I understand it right.

~~~
manquer
Depends on how your architecture, github as a whole likely has volume of
writes greater than a single server instance, however either components could
be split in separate dbs ( seems to be their approach) or shard / partition
the data

------
donavanm
I find their incident management times pretty impressive. 2 minutes to detect
& alert, ~2+ minutes response, 10 minutes to initial triage, 15 minutes to
internal change control, 17 minutes to escalation & public communication, 19
minutes to major incident escalation & broad engagement, 73 minutes to
remediation start & further public communication.

Initial triage inside 10 minutes then change control, escalation, major
incident, and public communication within another 9 minutes. That's hard to
beat with humans in the loop.

~~~
londons_explore
Most places I've seen have similar times to that, all except the public
communication bit.

Lots of companies are wary about letting engineers declare downtime directly -
it could have costs down the line, like giving customers refunds.

Getting legal, PR, and 3 levels of management to all agree on a message is a
recipe for the downtime notice taking hours to be published.

~~~
donavanm
Do you have a NOC or similar dedicated group? Both initial engagement and
their escalation path had a ~2 minute response time. Paging an oncall team
member, VPN connection, etc. takes ~5-10 minutes in my experience. Similarly,
their incident manager apparently assessed the situation and further escalated
in ~2 minutes as well. That's pretty dang fast for coming up to speed on an
as-yet-undiagnosed major incident.

WRT public communication, yeah. Overly onerous and ad hoc seems to be the
norm. But we do have "analysts" publicly using status post count as inverse
evidence of reliability. That creates some real bad incentives for the actual
customers.

~~~
londons_explore
For a core service, GitHub is sure to have a 5 minute human SLA from dedicated
SREs who have an on-call rotation etc. That means someone is already at their
computer, logged in, and ready to go. They'll typically be investigating
within 30 seconds.

Probably within another 30 seconds that human sees many alerts coming in that
don't look like false alarms and hits the 'escalate' button.

If it's within office hours, anything that looks major will usually have a
bunch of SREs on the case within 3 or 4 minutes. Usually in various roles,
from 'I acknowledged this page, it's my job to fix it!' to 'I'm just laughing
at your misfortune, but paying a little attention just in case the broken
thing turns out to be that change I made yesterday and I have to take over
mitigation efforts'.

------
gitgud
It's great to see how open they are about what happened and how it got fixed.
I also appreciate the decision to prioritise _data integrity_ over _site
usability_ too.

Looks like GitHub could be OK after the Microsoft acquisition.

~~~
wuunderbar
> I also appreciate the decision to prioritise data integrity over site
> usability too.

That actually goes against what happened; they seem to have implemented an
automated failover which resulted in a split-brain situation rather than
letting one DC completely degrade to read-only.

However, mistakes happen and hindsight is 20/20.

------
IvyMike
Offtopic, but this reminds me: did Google ever issue an incident report on why
YouTube went down earlier this month? A quick search didn't turn up anything.

~~~
FuckOffNeemo
I haven't seen anything disclosed. Best I've found are articles from 5+ years
ago about previous YouTube outages.

------
kodablah
There is often a trade-off between a large distributed central store and
several independent ones. The primary con of the former is incidents like this
one, whereas the con of the latter is the added complexity of separate systems
performing eventually consistent aggregation to support centralized features.
I wonder if there is any value in GitHub decentralizing the metadata pipeline.
So many of the actions are namespaced that this could be reasonable in theory,
at a large practical cost.

On a related note, I often reach for Cassandra when starting projects knowing
that building my application around its limited access approach has data
replication benefits in the future. For all the flexibility benefits to devs
given by SQL/RDBMSs, there are flexibility downsides.

~~~
contingencies
An interesting point. Namespaced data would include anything related to a
repo, GitHub Pages domain, or organization, i.e. almost everything. This would
therefore all be readily shardable. Non-namespaced data is basically just user
credentials and settings. However, looking forward in a post-GDPR/China data
law world, "network political bloc" based sharding of user data may also be
either a regulatory requirement or a worthwhile forward-looking protective
measure. (We just had this conversation yesterday at Infinite Food.) One
organization where I believe global user-data sharding already happens is
Google.

------
sanqui
Sounds like a ton of complicated, fragile work done under time pressure and
probably lots of stress. Hats off to the teams at GitHub.

------
eric_b
If I read things correctly, they made a fairly... interesting... tradeoff: 954
still-as-yet-unreconciled DB writes in exchange for 24 hours of site downtime.

I think I'd have made a different choice, but cool that they were upfront
about it.

~~~
TheDong
> we captured the ... writes ... that were not replicated ... For example,
> _one of_ our busiest clusters had 954 writes

Their wording ("one of") makes it sound like they had up to perhaps one order
of magnitude more (1k-10k), but they do not actually give us a useful number,
merely say "it wasn't too much" and "one of an unknown number of total
clusters had 1k".

> I think I'd have made a different choice

I think you might misunderstand how these things go and the tradeoffs.

At the beginning of the incident, they find themselves with lots of writes in
the west-coast master which aren't in the east-coast one, and some in the
east-coast one that aren't in the west-coast one.

Orchestrator cannot promote a working master because they have diverged.

Your choices are:

1. Lose data (a large, unknown amount) by dumping an hour of west-coast data
and going back to east-coast data. Can be done in maybe 3 hours total by just
deleting west-coast and serving all traffic from east-coast as you rebuild the
west-coast cluster.

2. Lose data (a small amount) by rebuilding east-coast from a backup so it can
be promoted to, replicating west-coast data to it, promoting to it (what they
did: 24 hours of time), and then trying to manually fix up the small amount of
lost data after the fact (ongoing).

3. Develop tooling to automate the reconciliation of data while the site is
down, such that the east-coast side can be merge-promoted without a rebuild.
Probably takes at least 3 days to build and might break everything, but if it
works it probably merge-promotes in under an hour.

4. Keep the site down until east-coast data is manually reconciled. Still
requires rebuilding east to promote west to it, but then requires manually
handling some lost writes... probably about 30-40 hours total downtime.

5. Update the application servers to work fine when the us-west DC is the
master. Probably about 2 months of development with the site down for the
duration.

Which choice would you have made instead? I'm willing to bet they went with
either 2 or 4 (and if 4, changed to 2 when it took longer than expected). They
probably assumed it would take about 4 hours, and then it simply took much
longer because computers are complicated, it turns out.

~~~
eric_b
I'd have opted for 1 personally (losing 30 minutes of data) assuming I knew
the alternative was 24 hours. I'll grant you they probably figured it would be
much faster than that.

Alternatively, why not just let the West Coast replicate to an East Coast
slave, and switch 'em? Surely the peering bandwidth between the West Coast and
East Coast is higher than whatever they were doing pulling full backups from
the cloud?

~~~
TheDong
> I'd have opted for 1 personally (losing 30 minutes of data) assuming I knew
> the alternative was 24 hours [of read-only access to prod]

With no offence meant, I'm glad you don't get to make decisions at GitHub
then. I'd lose faith in GitHub for that, since their primary purpose is to
store other people's data (git blobs+metadata, comments, etc). If it was their
data, it could be fine, but since it's user data, I don't think losing it is
acceptable, even if that means being read-only for an entire week.

> why not just let the West-Coast replicate to an East-Coast slave, and switch
> em?

At the time of the incident, east-coast has already diverged, so it's not
possible to replay logs from west-coast without moving east-coast back in time
to some point prior to west-coast (but a point recent enough that west-coast
still has full replica logs to replay).

It's not possible to simply "rewind" a mysql database typically without
restoring a backup.

It's not possible in many mysql clustering solutions to catch-up to a peer
unless you're already within a recent time window where the replay logs are
still available.

As such, the only option is to restore from backup and then catch up, which is
what they did.

I assume if taking another backup of west-coast, transferring it to east-coast
over the DC link, and restoring it was faster, they would have done it, but I
would be totally unsurprised if that was about the same total time.

I'd also say that if you have a practiced procedure (restoring from a
scheduled backup in the cloud) vs an ad-hoc procedure (custom backup, custom
storage and transfer), the former is probably much safer to do when you're
under fire and want to minimize risky moves stressed engineers have to make.

~~~
eric_b
The source code is paramount, no argument there. But this wasn't about that. I
find the metadata less important, but as you say, I don't make the decisions.
The factor you aren't considering is the opportunity cost of the productivity
loss. Take your hyperbolic example, where you claim you'd rather they were
read-only for a week than lose 30 minutes of data: I'd rather get things done
during that week personally, and if the price I have to pay is resubmitting a
comment or re-approving a PR, I guess I'd pay it.

Also, MSSQL has had quick delta snapshot/restores forever (assuming you have
them turned on and going every hour or so); MySQL really does not have a
similar feature?

~~~
sethhochberg
MySQL does not have any kind of native snapshotting capability, no.

It is becoming somewhat common to run large MySQL deployments on filesystems
which do (like ZFS) to get a similar effect, albeit without the ability to
restore a snapshot on a host while that database host is online. There is even
some experimental work going on in the community to use ZFS snapshots as a
state-transfer mechanism for laggy nodes in a Galera MySQL cluster, which
seems promising.
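
A hypothetical sketch of that approach, assuming a dataset named tank/mysql backs the datadir (`zfs snapshot`/`zfs rollback` are the standard commands):

```python
# Snapshot the dataset backing the MySQL datadir, then roll back locally
# instead of pulling terabytes of backups from remote blob storage.
import subprocess

DATASET = "tank/mysql"  # hypothetical dataset name

def snapshot(name):
    subprocess.run(["zfs", "snapshot", f"{DATASET}@{name}"], check=True)

def rollback(name):
    # Requires the database to be stopped; near-instant, since ZFS only
    # discards blocks written after the snapshot (-r drops newer snapshots).
    subprocess.run(["zfs", "rollback", "-r", f"{DATASET}@{name}"], check=True)

snapshot("pre-failover")
# ... divergence happens ...
rollback("pre-failover")
```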

------
tnolet
I’m not super versed in MySQL failover, but am I correct to conclude that
Orchestrator and Raft also did them in? And isn’t that architectural
component’s entire reason to exist to prohibit exactly such a situation from
happening? Genuine question.

~~~
sitharus
Orchestrator and Raft did not help them, but mostly it was their
configuration. The setup allowed an automatic failover to the west coast data
centre, so when the connectivity failed the tools did their job and voted in
the west coast.

The solution applied is to prohibit cross-country failover in Orchestrator,
but allow it to continue doing failover within regions.

------
40acres
Postmortems are always really interesting. If you're preparing for an
interview where system design will be discussed, you could do worse than
reading a couple of them as preparation.

~~~
cesarb
I wondered "is there a list of interesting postmortems, like there is for
'falsehoods programmers believe' posts?", and found one at
[https://github.com/danluu/post-mortems](https://github.com/danluu/post-
mortems)

~~~
bepvte
a good Halloween read for sure!!!

------
trhway
> At 22:52 UTC on October 21, routine maintenance work to replace failing 100G
> optical equipment resulted in the loss of connectivity between our US East
> Coast network hub and our primary US East Coast data center. Connectivity
> between these locations was restored in 43 seconds

lesson: a technician replacing a switch must be able to do it faster than the
leader heartbeat timeout of the consensus protocol. (reminds me how in high
school we trained to very quickly disassemble/assemble Kalashnikov machine gun
- something like under 20 sec. total - the whole choreographed sequence of
movements was learned and practiced like a samurai sword kata :)

------
azurezyq
> It’s possible for Orchestrator to implement topologies that applications
> are unable to support, therefore care must be taken to align Orchestrator’s
> configuration with application-level expectations.

So basically Orchestrator acted correctly, but the application layer is not
nicely integrated with it? Sounds like something not well designed on the
application side, which defeats the whole point of global consensus. Not much
detail is provided on what exactly went wrong there, though.

They mentioned that they'll take extra care of this but I'm still very
concerned about the reality that two systems (orchestrator & app) are so
loosely coupled.

~~~
detaro
I don't think that's the right interpretation of it. The "unable to support"
in this case seems to have been "can't reliably work due to the larger latency
between components". A tighter coupling between the two would likely not have
encoded this knowledge either.

------
jeremyjh
It seems like the obvious play here is to fail over all workloads to the west
coast, so they don't incur cross-regional latency. Do they explain why this
wasn't possible? If so, I cannot find it.

~~~
grogers
> All database primaries established in US East Coast again. This resulted in
> the site becoming far more responsive as writes were now directed to a
> database server that was co-located in the same physical data center as our
> application tier.

All their application servers were in the east coast DC. There wasn't anything
to fail over to in the west coast DC. That they had this failover config for
their DB is a bit unusual given the topology; it seems like they probably
meant to use these replicas purely for DR.

~~~
jeremyjh
The quote does not address my question. It just says that everything was fine
once their masters were in the east again with their application servers. The
question of why there are no application servers in the west to fail-over to
is what I'm curious about.

------
itsdrewmiller
This is a great write up - making these public with all the gory technical
details helps us all be better at our jobs.

I’m curious whether shutting down the east coast apps entirely and running off
of the west coast was considered? Not enough capacity or some other problem?

edit: I guess the third technical initiative strongly implies that it was just
a capacity issue: "This project has the goal of supporting N+1 redundancy at
the facility level. The goal of that work is to tolerate the full failure of a
single data center failure without user impact."

~~~
collinf
In the article they said that both regions consumed writes and got into an
inconsistent state during the process.

> This effort was challenging because by this point the West Coast database
> cluster had ingested writes from our application tier for nearly 40 minutes.
> Additionally, there were the several seconds of writes that existed in the
> East Coast cluster that had not been replicated to the West Coast and
> prevented replication of new writes back to the East Coast.

~~~
itsdrewmiller
As I read it their solution to the east coast write conflict was to just put
those aside and reconcile them later (and they are still working on that) - I
don't think that would impact their ability to turn off east coast app servers
too.

------
jacquesm
It's better to go down hard than to have a crappy fix that tries to keep
things alive.

------
crescentfresh
On a side note, does anyone know if anything special was used to create the
descriptive images in this blog post? They look great and describe connections
between related regions really well.

E.g. https://blog.github.com/assets/img/2018-10-25-oct21-post-incident-analysis/recovery-flow.png

~~~
mschoening
The engineering team sketched them out by hand, and they were later translated
into vectors by the design team.

The vector graphics were made in Sketch (https://www.sketchapp.com). Another
popular alternative at GitHub is Figma (https://www.figma.com/).

------
carlsborg
Basically

"..we will also begin a systemic practice of validating failure scenarios
before they have a chance to affect you. This work will involve future
investment in fault injection and chaos engineering tooling at GitHub."

------
tlynchpin
The first diagram labeled "Normal Topology" shows Master in East and no other
master. Later they acknowledge that lots of stuff doesn't work if Master is
not in East because of latency. So then there's all this Orchestration, and it
never could have worked in the first place?

That seems incredible - what am I missing?

------
arde
The CAP theorem strikes again.

------
mleonhard
Untested RPC timeouts strike again. Every service needs integration tests that
exercise timeouts. Some config and code can't be exercised in automated tests
and needs regular disaster readiness testing.

Service client libraries need unit tests that show the library returning
expected errors for all the failure scenarios: missing config, name lookup
failure, unreachable, refusing connections, closing connections, returning
error responses, returning garbage responses, refusing writes, responding with
high latency, and responding with low throughput.
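
A sketch of one such test; the client, error type, and address are made up:

```python
import socket
import unittest

class UpstreamTimeout(Exception):
    """The documented error our hypothetical client promises on timeout."""

def fetch(host, port, timeout=0.2):
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            return s.recv(1024)
    except OSError as e:  # covers timeout, refused, unreachable
        raise UpstreamTimeout(f"{host}:{port}: {e}") from e

class TestFailureModes(unittest.TestCase):
    def test_unreachable_host_raises_documented_error(self):
        # 192.0.2.1 is in TEST-NET-1 (RFC 5737): reserved, never routable.
        with self.assertRaises(UpstreamTimeout):
            fetch("192.0.2.1", 80, timeout=0.2)

if __name__ == "__main__":
    unittest.main()
```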

------
Jupe
This makes me wonder why GitHub is effectively using a single database (though
replicated with HA considerations and CQRS-ized).

From the article, it appears they are partitioning based on function (commits
in this DB, PRs in this cluster)... but I just don't see a strong business
need to glob all commit data together into one massive datastore. Perhaps it's
an economy-of-scale driver?

~~~
topspin
This comment... I've thought about this a lot. I think asking this question is
probably the most meaningful insight into this failure that has appeared.

What GitHub has done is create a giant single point of failure in the form of
a globe-spanning, under-tested MySQL database that manages a vast number of
unrelated git repos, most of which are dormant. At any moment only a minuscule
fraction of these repos are actually mutating.

One can imagine a system that doesn't conflate every GitHub repo into a
monster distributed MySQL database. Each repo could be coupled with an
independent SQLite database, for instance, that spins up in a few microseconds
and synchronizes with its distributed copies only when necessary. Otherwise
these lie dormant and available, safe from the vagaries of some
under-specified planet-scale MySQL instance.

I imagine testing such a design would be vastly easier as well; one need not
replicate the conditions of this mighty distributed MySQL construct. Whatever
aggregation is ultimately necessary (analysis, billing, etc.) could be
performed with asynchronous ETL into some database with minimal coordination
and no risk to availability for customers.
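
A sketch of the per-repo idea (hypothetical layout, nothing GitHub actually uses):

```python
# One SQLite file per repo: metadata writes are scoped to a single repo,
# so no global database has to stay consistent for the rest of the site.
import sqlite3
from pathlib import Path

def repo_db(owner, name):
    path = Path("repos") / owner / name / "meta.sqlite3"
    path.parent.mkdir(parents=True, exist_ok=True)
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS issues"
               " (id INTEGER PRIMARY KEY, title TEXT)")
    return db

db = repo_db("octocat", "hello-world")
db.execute("INSERT INTO issues (title) VALUES (?)",
           ("a failure here touches exactly one repo",))
db.commit()
```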

------
LolNoGenerics
No reflection on the root cause? I am not into hardware, but shouldn't there
be a redundant network connection in place, or at least shouldn't they set one
up?
------
hartator
Very impressive complexity. Really appreciate the transparency. Must have been
a couple of stressful days.

------
Rapzid
Didn't lose data perhaps, but sounds like information was definitely lost
and/or corrupted.

------
person_of_color
I didn't learn any of this in school. Any MOOCs for cloud architecture?

------
romed
So they geographically replicated their MySQL in order to survive such a
partition, and instead they destroyed their entire database and now have no
reason to believe any of it is consistent at all.

------
anonymousisme
Just five days before Microsoft completed the GitHub acquisition so they had
to lower their QoS to meet the expectations of everyone using Azure...

