
Continuous MySQL backup validation: Restoring backups - jivid
https://code.facebook.com/posts/1007323976059780/continuous-mysql-backup-validation-restoring-backups
======
mappu
_> We do these by taking another full dump and only storing the difference
between it and the last full backup that was taken. _

Do you diff against the same base, or create an incremental chain? How many
diffs do you take in between recapturing a full image? At $DAYJOB we always
take full backups into a fast in-house deduplicating store.

 _> Periodically, each peon syncs with the ORC DB to look for new jobs
assigned to it_

Is there no better way to handle this than polling?

 _> LOAD - Load the downloaded backup into the peon’s local MySQL instance.
Individual tables are restored in parallel by parsing out statements
pertaining to those tables from the backup file, similar to Percona's
mydumper._

Presumably you can only get this parallelism by disabling FK integrity. Is it
re-enabled in the following VERIFY stage?

~~~
jivid
_> Do you diff against the same base, or create an incremental chain? How many
diffs do you take in between recapturing a full image? At $DAYJOB we always
take full backups into a fast in-house deduplicating store._

We always diff against the same base, with 5 days between subsequent full
dumps. The number of days comes from a trade-off between the space occupied
by the backups and the time it takes to generate them.
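
Roughly, a same-base scheme looks like this with stock tools (the file names
and the diff/patch mechanism here are illustrative sketches, not the exact
internals):

    # Day 0: take the full base dump
    mysqldump --single-transaction mydb > base.sql
    # Days 1-4: dump again, but keep only the diff against the same base
    mysqldump --single-transaction mydb > today.sql
    diff base.sql today.sql > day3.diff
    rm today.sql
    # Restore: re-apply the diff to the base, then load the result
    patch -o restored.sql base.sql day3.diff
    mysql mydb < restored.sql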

 _> Is there no better way to handle this than polling?_

There are certainly different ways to approach this; we find polling works
well for us. We also use the same database for crash recovery, so doing the
assignments through it serves both purposes.
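
As a rough illustration of the polling pattern (the `jobs` table and its
columns here are hypothetical, not the actual ORC schema):

    # Each peon periodically asks the ORC DB for work assigned to it
    mysql orc_db -e "SELECT job_id, backup_location FROM jobs
                     WHERE assigned_peon = 'peon-042'
                       AND status = 'pending';"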

 _> Presumably you can only get this parallelism by disabling FK integrity. Is
it re-enabled in the following VERIFY stage? _

I'm not sure what you mean by parallelism through disabling FK integrity.
Splitting the backup into its tables means we can restore a subset of tables
instead of the entire backup. This lets us load individual tables
concurrently, and it also means we don't have to wait for a massive database
to load when all we need is a few small tables.
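
A very rough sketch of that statement parsing, keyed off the section
comments mysqldump writes into its output (the real parser is presumably
more robust, and the table names are illustrative):

    # Extract one table's section -- from its header comment up to the
    # next table's header comment -- and load just that table
    sed -n '/^-- Table structure for table `user`/,/^-- Table structure for table `post`/p' \
        backup.sql | mysql mydb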

~~~
TimWolla
> I'm not sure what you mean by parallelism through disabling FK integrity.

Say you have a `user` table and a `post` table with `post.user_id` being a
FOREIGN KEY on `user.user_id`. Without disabling FK integrity, you would not
be able to restore a post without restoring the user first. When restoring in
parallel, this might or might not work out.
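
For concreteness, with stock MySQL (the dump file name is illustrative):

    # Loading posts before their users trips the constraint:
    mysql mydb < post.sql
    # ERROR 1452 (23000): Cannot add or update a child row:
    #   a foreign key constraint fails
    # Disabling FK checks for the session allows loading in any order:
    ( echo "SET SESSION foreign_key_checks = 0;"; cat post.sql ) | mysql mydb

(mysqldump's own output header sets `FOREIGN_KEY_CHECKS=0` for exactly this
reason.)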

~~~
evanelias
Facebook (along with almost everyone else using MySQL at massive scale)
doesn't use foreign keys.

They scale poorly in MySQL, and they lose a lot of purpose in a massively
sharded environment anyway. For example, say you like a status post on
Facebook, or friend another user. It's very unlikely that the liked status or
friended user exists on the same shard as your account, and there's no way to
enforce a foreign key relationship in an inherently non-distributed database
like MySQL.

So instead integrity is handled at the application layer, with additional
background processes to fix the occasional integrity problem and detect
integrity anomalies.
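
On a single shard, an anomaly check can be as simple as an anti-join (table
and column names borrowed from TimWolla's example; a cross-shard check has to
do the equivalent at the application layer):

    # Find posts whose author row is missing
    mysql mydb -e "SELECT p.post_id FROM post p
                   LEFT JOIN user u ON u.user_id = p.user_id
                   WHERE u.user_id IS NULL;"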

------
wanderr
Why mysqldump instead of Percona's xtrabackup/innobackupex? I was under the
impression that the latter had a lot of advantages, and I've been considering
switching -- wondering if there's a good reason not to.

~~~
jivid
We take logical backups, not physical, and mysqldump is the best option for
that. Having logical backups means we can do logical diffs as well (see
[https://youtu.be/Fe2oLZ4CWD4?t=951](https://youtu.be/Fe2oLZ4CWD4?t=951)), and
we've added table checksum support to mysqldump in our branch of MySQL
([https://github.com/facebook/mysql-5.6/commit/54acbbf915935a0...](https://github.com/facebook/mysql-5.6/commit/54acbbf915935a0161af9ddb7173579ce1348e73))

~~~
wanderr
xtrabackup does differential backups as well... Is there another advantage to
doing logical backups? Are the diffs significantly smaller?

~~~
jivid
In addition to what evanelias said, a logical dump also means we can load it
into a MySQL instance running a different storage engine. In our case, it
allows us to take a mysqldump from an InnoDB instance and load it into a
MyRocks instance if we wish.
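
A minimal sketch of that cross-engine load (the sed rewrite is illustrative;
assuming the target instance has the RocksDB engine enabled, it's enough for
the CREATE TABLE statements to name it):

    # Rewrite the engine clause in the dump, then load into MyRocks
    sed 's/ENGINE=InnoDB/ENGINE=ROCKSDB/g' backup.sql \
        | mysql --host=myrocks-host mydb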

------
predakanga
Interesting, though I was hoping for a bit more focus on the validation part -
unless I'm missing something, there's nothing protecting the actual SQL dumps
against bitrot.

I had to design against that particular problem recently, and ended up taking
a pt-table-checksum at dump time and verifying newly restored backups against
it to ensure the backup's integrity.

Unfortunately that requires halting replication temporarily, so I was hoping
to hear of a more ingenious solution.
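
For reference, the capture step looks roughly like this (exact flags depend
on the topology; by default the per-chunk results land in a
`percona.checksums` table, which the restored copy can later be checked
against):

    # At dump time: checksum the source tables
    pt-table-checksum --databases=mydb h=source-host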

~~~
jivid
There are a few different ways to verify, some of them involving stopping
replication, like you pointed out. These can also be quite expensive, so the
verification method can be tuned depending on the type of verification
required.

We also implemented table checksums inside of mysqldump, allowing us to dump
out the restored data and compare checksums as well, if required.
[https://github.com/facebook/mysql-5.6/commit/54acbbf915935a0...](https://github.com/facebook/mysql-5.6/commit/54acbbf915935a0161af9ddb7173579ce1348e73)
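
For anyone on stock mysqldump, `CHECKSUM TABLE` is a related (if cruder)
primitive -- unlike the linked patch it's a separate pass over the data
rather than a checksum emitted during the dump itself, and both sides need a
comparable row format:

    # Recompute checksums on the restored copy and compare with the
    # same command's output on the source
    mysql restored_db -e "CHECKSUM TABLE user, post;"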

~~~
predakanga
Very nice, thanks for the link!

I'll definitely be seeing if I can replace my system with that; reducing the
overhead to a single transaction would be a big win.

------
babuskov
Interesting. I assumed FB would have hot replicas (kept up to date with live
master-slave replication at all times), ready to go any time a main database
fails. Cascade that to another layer of slaves and there's no need to restore
anything, ever.

~~~
evanelias
Facebook has hot replicas in every region. But replicas and backups serve
completely different purposes.

Replicas are for failover and read scalability. In terms of failover, when a
master dies unexpectedly, Facebook's automation promotes a replica to be the
new master in under 30 seconds, with no loss of committed data.

Backups are for when something goes horribly wrong -- e.g. due to human error
-- and you need to restore the state of something (a row, table, entire db,
...) to a previous point in time. Or perhaps effectively skip one specific
transaction, or set of transactions. Replicas don't help with this; as you
mentioned, they're kept up-to-date with the master, so a bad statement run
on the master will also affect the replicas.
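
A minimal sketch of that point-in-time flow with stock tools (the binlog
file name and positions are illustrative):

    # Load the most recent backup, then replay binlogs up to just
    # before the bad transaction...
    mysql mydb < backup.sql
    mysqlbinlog --stop-position=4567890 binlog.000123 | mysql mydb
    # ...and optionally resume just past it, skipping that transaction
    mysqlbinlog --start-position=4568012 binlog.000123 | mysql mydb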

Occasionally you have some massive failure involving both concepts -- say you
have 4 replicas and they're all broken or corrupted in some way -- and
backups are helpful in that case as well.

~~~
snuxoll
I suppose at Facebook scale it might be infeasible, but couldn't you get the
same effect by archiving log segments and taking periodic full binary
backups? This is precisely what I do with my PostgreSQL databases (though
with some friendly automation via pg barman); I assume you could do the same
with some tooling around MySQL's binlog facilities.
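
For MySQL the rough analogue of WAL archiving would be something like this
(paths and binlog names depend on the `log_bin` setting, and in practice the
segment still being written should be skipped):

    # Rotate the current binlog, then ship the finished segments off-host
    mysql -e "FLUSH BINARY LOGS;"
    rsync -a /var/lib/mysql/binlog.[0-9]* archive-host:/backups/binlogs/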

~~~
evanelias
Yes, although if using the binlogs as-is, that's effectively incremental
backup instead of differential. The disadvantages of incremental solutions are
that they require more storage and take longer to restore (especially if only
doing full backups every few days); the upside is less complexity.

