
Making Backup Validation Easier - jaw
https://brokensandals.net/technical/backup-tooling/making-backup-validation-easier/
======
gruez
This seems worse than just hashing the file. Random bit flips will probably go
undetected with this method, but they won't be with hashing.
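The basic idea, as a rough sketch (file paths are just placeholders):

    import hashlib

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            # read in chunks so large files don't have to fit in memory
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    # Any single flipped bit in the copy produces a completely different digest.
    if sha256_of('original.jpg') != sha256_of('backup/original.jpg'):
        print('backup copy is corrupted')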

~~~
jaw
I'm mostly trying to address cases where there is no original file that I
fully trust. If I'm exporting my data from some web app/service, I can't get a
hash of the data as it is in the actual source of truth on their servers, and
there are multiple points at which an error could be introduced before the
completed export file lands on my machine.

It's a good point that hashing is a better method when you have access to the
original files.

~~~
close04
> and there are multiple points at which an error could be introduced before the
> completed export file lands on my machine

Aren't all bets off at this point? I mean, validating the backup seems like
skipping steps if you are not validating the source. Scrolling through
thumbnails is better than nothing, sure. But it's really prone to false
negatives. Corrupted images can look good in a thumbnail, and your eyes might
miss even glaring corruption if you scroll too fast. If it's not an image file
it gets even more challenging.

You seem to have one of those corner cases where basically no automated method
can solve your problem, but the volume of data is just low enough that a bit of
manual intervention can alleviate the issues.

~~~
jaw
> Scrolling through thumbnails is better than nothing, sure.

"Better than nothing" is pretty much what I'm going for here. Almost all my
personal data stored in cloud services falls into this "corner case": I only
have indirect access to the source, it's important enough to me that I want to
do some level of checking, but it's not important enough to spend the huge
amounts of time it would take to inspect every individual datum.

------
close04
I think making a list of the files to be copied and their hashes, then a list
of the files that were actually copied and their hashes, and comparing the two
lists should provide an even quicker way to validate. Or even hashing the
entire source and destination (a hash of the list of hashes) and presenting
both values to the user to compare visually.
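
Something like this, as a rough sketch (the directory paths and the helper
name are made up for illustration, not from the article or its tooling):

    import hashlib
    from pathlib import Path

    def hash_tree(root):
        """Return {relative_path: sha256} for every file under root."""
        hashes = {}
        for path in sorted(Path(root).rglob('*')):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                hashes[str(path.relative_to(root))] = digest
        return hashes

    source = hash_tree('/data/source')        # assumed source location
    backup = hash_tree('/mnt/backup/source')  # assumed backup location

    if source == backup:
        # "hash of the list of hashes" for a quick visual check
        combined = hashlib.sha256(
            '\n'.join(f'{p} {h}' for p, h in sorted(source.items())).encode()
        ).hexdigest()
        print('OK, combined hash:', combined)
    else:
        missing = set(source) - set(backup)
        changed = {p for p in source.keys() & backup.keys()
                   if source[p] != backup[p]}
        print('missing from backup:', sorted(missing))
        print('hash mismatch:', sorted(changed))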

As far as I can tell, the method described in the article doesn't really
_validate_ the backups in any way; it just provides some statistics, and those
can fail to catch errors in very plausible ways.

And of course, if the data is important to you and there are special
circumstances that could affect the process, nothing beats an actual restore
test.

~~~
jaw
I replied to a similar point about hashing here -
[https://news.ycombinator.com/item?id=23032633](https://news.ycombinator.com/item?id=23032633)

You're correct that the methods I described are a far cry from actually
guaranteeing that the backup has no errors. In the same way that a unit test
doesn't prove code is error-free, but _can_ justify increased confidence in
the code, I'm interested in techniques that can justify increased confidence
in my backups. Particularly in cases where I don't have direct access to the
original data, and where exhaustively checking the data manually is too time-
consuming to be worth it.

------
wila
What I did was to move all my work into a VMware virtual machine.

Then I wrote software for backing up VMs automatically (disclaimer: this is a
commercial product I sell).

There are options for getting an email on success, failure, or both. The VM
files are all hashed.

VMs are easy to restore, so an actual restore test is pretty easy without
risking overwriting the original. If a file hash does not match on restore, my
software will complain, but continue the restore anyway.
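
The gist of that hash check on restore is something like this (just a rough
sketch of the general idea, not my product's actual code; the manifest format
and paths are made up):

    import hashlib
    import shutil
    from pathlib import Path

    def restore(manifest, backup_dir, target_dir):
        """manifest maps relative file paths to their expected sha256 digests."""
        for rel, expected in manifest.items():
            src = Path(backup_dir) / rel
            actual = hashlib.sha256(src.read_bytes()).hexdigest()
            if actual != expected:
                # complain, but keep going with the restore
                print(f'WARNING: hash mismatch for {rel}, restoring anyway')
            dst = Path(target_dir) / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)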

FWIW, all my code etc. is also in source control, so I am not relying on a
single layer for that.

