
Hi all - I'm the head of engineering at GitHub. Please accept my sincere apology for this downtime. The cause was a bad deploy (a db migration that changed an index). We were able to revert in about 30 minutes. This is slower than we'd like, and we'll be doing a full RCA of this outage.

For those who are interested, on the first Wednesday of each month, I write a blog post on our availability. Most recent one is here: https://github.blog/2021-03-03-github-availability-report-fe...




Thanks for taking the time to personally give a status update while things are on fire. I hope you and all the others who are dealing with this emergency will have an especially restful weekend.


There's a reason why deploying on a Friday is not really a good idea.


I was just griping on Twitter yesterday about how many developers won't immediately revert an update that causes downtime, but will actually spend time trying to solve the problem while Rome burns.

Thank you for not doing that.


Sometimes reverting is not reasonably possible. Suppose you updated a database schema and clients immediately started filling it with new data that would have no home in any backup: reverting would just land you in another unanticipated state.


Any comment or insight you can share on the overall increase in downtime over the past few years?


Growth.


@keithba I built a private GitHub Action around https://github.com/sbdchd/squawk (for Postgres) that lints all our migration files on each PR. The action extracts the raw SQL from the codebase and passes it into squawk. It catches many exclusive-lock migrations and missing `index concurrently` clauses that would otherwise have been released to production, causing downtime or degraded service. Maybe something you should start doing.
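
In case it helps, here's a rough sketch of the shape of that check (not the actual action: the migrations path is a hypothetical layout, and it assumes the squawk CLI is already installed on the CI runner):

    #!/usr/bin/env python3
    """Sketch: lint SQL migration files with squawk in CI (not the actual action)."""
    import glob
    import subprocess
    import sys

    # Hypothetical layout; the real action extracts raw SQL from the codebase instead.
    MIGRATIONS_GLOB = "db/migrations/*.sql"

    def main() -> int:
        files = sorted(glob.glob(MIGRATIONS_GLOB))
        if not files:
            print("no migration files to lint")
            return 0
        # squawk prints its findings (exclusive locks, CREATE INDEX without
        # CONCURRENTLY, etc.) and exits non-zero when a rule fires, which is
        # what fails the pull request check.
        return subprocess.run(["squawk", *files]).returncode

    if __name__ == "__main__":
        sys.exit(main())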


GitHub uses MySQL, not Postgres. They built the best-in-class online schema change tool gh-ost [1] (a rough invocation sketch follows the links below), and have a custom declarative schema change execution system built around Skeema [2], which contains a wealth of linters [3].

Even so, it's always possible for an engineer to submit a schema change which is detrimental to performance. For example, dropping an important index, or changing it such that some necessary column is no longer present. Linters simply cannot catch some classes of these problems, as they're application/workload-specific. Usually they must be caught in code review, but people make mistakes and could approve a bad change.

Disclosure: I'm the author of Skeema, but have not worked for or with GitHub in any capacity.

[1] https://github.com/github/gh-ost

[2] https://github.blog/2020-02-14-automating-mysql-schema-migra...

[3] https://www.skeema.io/docs/options/#lint
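
To make [1] a bit more concrete, here's roughly what driving gh-ost for an online index change looks like, wrapped in a small Python driver. Every host, credential, and DDL value below is a placeholder, and throttling flags like --max-load need tuning for your own workload:

    #!/usr/bin/env python3
    """Sketch: drive a gh-ost online schema change. Placeholder values throughout."""
    import subprocess

    cmd = [
        "gh-ost",
        "--host=db-replica.example.internal",  # typically pointed at a replica
        "--user=ghost",
        "--password=REDACTED",
        "--database=mydb",
        "--table=repositories",
        "--alter=ADD INDEX idx_owner_name (owner_id, name)",
        "--max-load=Threads_running=25",  # throttle when the cluster is busy
        "--chunk-size=1000",
        "--verbose",
        "--execute",  # leave this off for a dry run
    ]
    subprocess.run(cmd, check=True)

gh-ost copies rows into a ghost table and tails the binlog to keep it in sync, so the table stays online until the atomic cut-over at the end.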


Thanks, I didn't know about this. Indeed, nothing is foolproof.



