Hi all - I'm the head of engineering at GitHub. Please accept my sincere apology for this downtime. The cause was a bad deploy (a db migration that changed an index). We were able to revert in about 30 minutes. This is slower than we'd like, and we'll be doing a full RCA of this outage.
Thanks for taking the time to personally give a status update while things are on fire. I hope you and all the others who are dealing with this emergency will have an especially restful weekend.
I was just griping on Twitter yesterday about how many developers won't immediately revert an update that causes downtime, but will actually spend time trying to solve the problem while Rome burns.
Sometimes reverting is not reasonably possible--suppose you updated a database schema and clients immediately started filling it with new data that would have no home in any backup--you'd end up in another unanticipated state.
@keithba I have built a private GitHub Action around https://github.com/sbdchd/squawk (for Postgres) that lints all our migration files on each PR. The action extracts the raw SQL from the codebase and passes it to squawk.
It catches many migrations that take exclusive locks or are missing `index concurrently`, which would otherwise have been released to production and caused downtime or degraded service. Maybe something you should start doing.
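To give a rough idea of the kind of check I mean, here is a toy sketch in Python. It is not how squawk works internally (squawk parses the SQL properly; this just uses a regex), and the function name `find_blocking_index_statements` and the sample migration are made up for illustration.

```python
import re
from typing import List

# Matches CREATE [UNIQUE] INDEX when it is NOT followed by CONCURRENTLY.
# Assumption: a plain regex is good enough for a demo; a real linter
# should parse the SQL instead.
CREATE_INDEX_WITHOUT_CONCURRENTLY = re.compile(
    r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)",
    re.IGNORECASE,
)


def find_blocking_index_statements(sql: str) -> List[str]:
    """Return the statements that create an index without CONCURRENTLY."""
    # Naive split on ';' (breaks on semicolons inside string literals).
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return [s for s in statements if CREATE_INDEX_WITHOUT_CONCURRENTLY.search(s)]


# Hypothetical migration file contents.
migration = """
CREATE INDEX idx_users_email ON users (email);
CREATE INDEX CONCURRENTLY idx_users_name ON users (name);
"""

flagged = find_blocking_index_statements(migration)
for statement in flagged:
    print("index created without CONCURRENTLY:", statement)
```

In CI you would fail the PR whenever `flagged` is non-empty. In Postgres a plain `CREATE INDEX` blocks writes to the table while the index builds, which is why this particular pattern is worth flagging.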
GitHub uses MySQL, not Postgres. They built the best-in-class online schema change tool gh-ost [1], and have a custom declarative schema change execution system built around Skeema [2], which contains a wealth of linters [3].
Even so, it's always possible for an engineer to submit a schema change which is detrimental to performance. For example, dropping an important index, or changing it such that some necessary column is no longer present. Linters simply cannot catch some classes of these problems, as they're application/workload-specific. Usually they must be caught in code review, but people make mistakes and could approve a bad change.
Disclosure: I'm the author of Skeema, but have not worked for or with GitHub in any capacity.
For those who are interested, on the first Wednesday of each month, I write a blog post on our availability. Most recent one is here: https://github.blog/2021-03-03-github-availability-report-fe...