I’m a little confused by this part:
> While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service. This procedure is tested daily at minimum, so the recovery time frame was well understood, however until this incident we have never needed to fully rebuild an entire cluster from backup and had instead been able to rely on other strategies such as delayed replicas.
At first I assumed this was Glacier (“it took a long time to download”), but the daily retrieval testing suggests it’s more likely plain S3. “Multiple terabytes” sounds like less than 10 TB.
So the question becomes “did GitHub have less than 100 Gbps of peering to AWS?” At 100 Gbps, 10 TB would move in roughly 13 minutes, so a multi-hour transfer implies far less effective bandwidth. I hope that’s an action item if restores were meant to be quick (and it will likely be resolved by the migration to Azure, with plenty of connectivity, etc.).
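For a rough sanity check, here’s a back-of-envelope sketch. The 10 TB figure and the link speeds are my assumptions, not from the report, and it ignores protocol overhead, parallelism, and restore-side work:

    # Back-of-envelope: ideal bulk-transfer time at a given link speed.
    # Assumptions (mine, not the incident report's): 10 TB of backup
    # data, decimal units (1 TB = 8,000 gigabits), a single saturated link.

    def transfer_hours(data_tb: float, link_gbps: float) -> float:
        """Ideal transfer time in hours, ignoring all overhead."""
        gigabits = data_tb * 8_000
        return gigabits / link_gbps / 3_600

    for gbps in (1, 10, 100):
        print(f"{gbps:>3} Gbps -> {transfer_hours(10, gbps):.2f} h for 10 TB")

    #   1 Gbps -> 22.22 h for 10 TB
    #  10 Gbps ->  2.22 h for 10 TB
    # 100 Gbps ->  0.22 h for 10 TB

Even at a modest 10 Gbps that’s only about two hours, so transfer being “a significant portion” of a multi-hour restore points at something closer to single-digit Gbps of effective throughput.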
> A significant portion of the time was consumed transferring the data from the remote backup service.
I get the time-to-rebuild part, but I’m curious about the download part.
(disclosure - I work for a competitor, not on cloud stuff)
You don’t mention Google at all outside of the opening statement, so who would read it that way?
(Plus, I'm always pleased to see someone not call it a 'disclaimer'!)
Fortunately HN has kept the culture of disclosure intact.