Interesting. I assumed FB would have hot replicas (kept up to date with live master-slave replication at all times) - ready to go any time a main database fails. Cascade that to another layer of slaves and there's no need to restore anything ever.
Facebook has hot replicas in every region. But replicas and backups serve completely different purposes.
Replicas are for failover and read scalability. In terms of failover: when a master dies unexpectedly, Facebook's automation promotes a replica to be the new master in under 30 seconds, with no loss of committed data.
Backups are for when something goes horribly wrong -- e.g. due to human error -- and you need to restore the state of something (a row, a table, an entire db, ...) to a previous point in time. Or perhaps effectively skip one specific transaction, or set of transactions. Replicas don't help with this; as you mentioned, they're kept up-to-date with the master, so a bad statement run on the master will also affect the replicas.
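For example, a point-in-time restore with stock MySQL tooling looks roughly like this (a minimal sketch -- the timestamps, positions, and file names are made up, and the actual tooling at FB scale is obviously far more automated):

    # Restore the most recent full backup, then replay binlogs up to just
    # before the bad statement ran
    mysql < full_backup.sql
    mysqlbinlog --stop-datetime="2016-09-04 10:59:00" binlog.000123 | mysql

    # Or effectively skip one specific transaction by replaying around its offsets
    mysqlbinlog --stop-position=4567 binlog.000123 | mysql
    mysqlbinlog --start-position=5890 binlog.000123 binlog.000124 | mysql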
Occasionally you have a massive failure that involves both concepts -- say you have 4 replicas and they're all broken or corrupted in some way -- and backups are helpful in that case as well.
I suppose at Facebook scale it might be infeasible, but couldn't you get the same effect by archiving log segments plus a periodic binary full backup? This is precisely what I do with my PostgreSQL databases (though with some friendly automation via pg barman), and I assume you could do the same with some tooling around MySQL's binlog facilities.
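Something like the following, I'd imagine (a rough sketch only; the paths, schedule, and binlog file names are placeholders and depend on your log_bin setting):

    # Periodic full backup (logical dump here; a physical tool like Percona
    # XtraBackup would be the "binary" equivalent)
    mysqldump --single-transaction --all-databases > /backups/full-$(date +%F).sql

    # Rotate to a fresh binlog, then archive the closed ones
    mysqladmin flush-logs
    rsync -a /var/lib/mysql/binlog.0* /backups/binlogs/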
Yes, although if you use the binlogs as-is, that's effectively an incremental backup rather than a differential one. The disadvantages of incremental solutions are that they require more storage and take longer to restore (especially if you only do full backups every few days); the upside is less complexity.
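Concretely, the restore path for that approach is the last full backup plus every archived binlog since it, replayed in order -- which is where the restore time goes as the gap between fulls grows (sketch only, reusing the illustrative layout from the comment above):

    mysql < /backups/full-2016-09-01.sql
    for f in /backups/binlogs/binlog.0*; do
        mysqlbinlog "$f" | mysql
    done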
They do, but that doesn't help if bad code writes corrupted data, or against other unexpected disasters that take down an entire replica set at once. For large public companies, "I didn't think that failure would happen!" isn't an acceptable excuse.
Replication alone is only a useful system if you can trust that failed databases simply disappear (i.e. that machines fail cleanly).
In failure modes like an accidental DELETE without a WHERE clause, or a write that corrupts business-logic validity, it's useless: you can watch the logs and see all your slave machines keenly and unquestioningly repeating the issue.
We had that at my previous job. Fortunately, for some reason the slave was delayed by about 10 minutes, which saved us from a very, very serious problem -- especially since we only did backups every 6 hours.
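For what it's worth, MySQL 5.6+ lets you make that delay deliberate rather than lucky, on a replica you keep around as a rewind buffer (PostgreSQL has recovery_min_apply_delay for the same trick). Roughly:

    # On the designated delayed replica; 600s mirrors the ~10 minutes above
    mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 600; START SLAVE;"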
Per their post about their MySQL backup solution[1], it sounds like they sort of do this? Although it's not a slave that's ready to be promoted to master, but rather a cold standby that isn't far off from a hot standby, given how recent the binlogs are?
The wording is a bit cryptic, but they do seem to have hot-standby-esque capabilities in place, in addition to long-term storage of their incremental/full backups.