Hi all - I'm the head of engineering at GitHub. Please accept my sincere apology...

mwcampbell · on March 12, 2021

Thanks for taking the time to personally give a status update while things are on fire. I hope you and all the others who are dealing with this emergency will have an especially restful weekend.

rvz · on March 12, 2021

There's a reason why deploying on a Friday is not really a good idea.

Uehreka · on March 12, 2021

I was just griping on Twitter yesterday about how many developers won't immediately revert an update that causes downtime, but will actually spend time trying to solve the problem while Rome burns.

Thank you for not doing that.

sk5t · on March 12, 2021

Sometimes reverting is not reasonably possible--suppose you updated a database schema and clients immediately started filling it with new data that would have no home in any backup--you'd end up in another unanticipated state.

hanniabu · on March 12, 2021

Any comment or insight you can share on the overall increase in downtime over the past few years?

junon · on March 12, 2021

Growth.

oomathias · on March 12, 2021

@keithba I have built a private GitHub Action around https://github.com/sbdchd/squawk (for Postgres) that lints all our migration files on each PR. The action extracts raw SQL from the codebase and passes it to squawk. It catches many migrations that take exclusive locks or are missing `index concurrently`, which would otherwise have been released to production and caused downtime or degraded service. Maybe something you should start doing.
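
To give a rough idea of the kind of check involved, here is a minimal sketch of a single rule: flagging `CREATE INDEX` statements that omit `CONCURRENTLY`. This is not squawk's implementation and not the private action itself; the function name `find_blocking_index_creations` and the sample migration are made up for illustration, and a real linter parses SQL properly rather than using a regex.

```python
import re
from typing import List

# Toy rule for illustration only (an assumption, not how squawk works):
# match CREATE [UNIQUE] INDEX and capture the rest of the statement up to ';'.
CREATE_INDEX = re.compile(
    r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?P<rest>[^;]*)", re.IGNORECASE
)


def find_blocking_index_creations(sql: str) -> List[str]:
    """Return the CREATE INDEX statements in `sql` that omit CONCURRENTLY."""
    offenders = []
    for match in CREATE_INDEX.finditer(sql):
        rest = match.group("rest")
        # CONCURRENTLY must directly follow the INDEX keyword.
        if not re.match(r"\s+CONCURRENTLY\b", rest, re.IGNORECASE):
            # Collapse whitespace so the report is one line per statement.
            offenders.append(" ".join(match.group(0).split()))
    return offenders


# Hypothetical migration content.
migration = """
CREATE INDEX idx_users_email ON users (email);
CREATE INDEX CONCURRENTLY idx_users_name ON users (name);
"""

for statement in find_blocking_index_creations(migration):
    print("missing CONCURRENTLY:", statement)
```

In a CI job you would run something like this over the SQL extracted from each PR and fail the build if the list is non-empty.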

evanelias · on March 12, 2021

GitHub uses MySQL, not Postgres. They built the best-in-class online schema change tool gh-ost [1], and have a custom declarative schema change execution system built around Skeema [2], which contains a wealth of linters [3].

Even so, it's always possible for an engineer to submit a schema change which is detrimental to performance. For example, dropping an important index, or changing it such that some necessary column is no longer present. Linters simply cannot catch some classes of these problems, as they're application/workload-specific. Usually they must be caught in code review, but people make mistakes and could approve a bad change.
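
As a rough illustration of that limit, a static check can report which columns an index change removes, but it cannot tell whether any query depends on them. A minimal sketch, assuming index definitions are available as plain name-to-columns mappings; the function `dropped_index_columns` and the index names are hypothetical and not part of Skeema or gh-ost:

```python
from typing import Dict, List


def dropped_index_columns(
    before: Dict[str, List[str]], after: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Map each index name to the columns it loses between two schema versions.

    An index that is dropped entirely loses all of its columns.
    """
    report = {}
    for name, columns in before.items():
        remaining = after.get(name, [])
        missing = [c for c in columns if c not in remaining]
        if missing:
            report[name] = missing
    return report


# Hypothetical schema versions: the change narrows a composite index.
before = {"idx_orders_user_created": ["user_id", "created_at"]}
after = {"idx_orders_user_created": ["user_id"]}

report = dropped_index_columns(before, after)
print(report)
```

The check can say that `created_at` left the index. Whether a hot query filters or sorts on that column is a property of the workload, which is why a reviewer still has to make the call.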

Disclosure: I'm the author of Skeema, but have not worked for or with GitHub in any capacity.

[1] https://github.com/github/gh-ost

[2] https://github.blog/2020-02-14-automating-mysql-schema-migra...

[3] https://www.skeema.io/docs/options/#lint

oomathias · on March 15, 2021

Thanks, I didn't know about this. Indeed, nothing is foolproof.