Ask HN: How to Ensure File Integrity
5 points by soupshield on May 18, 2021 | 3 comments
I have a lot of data that I try to back up carefully. One thing that concerns me is the integrity of my files: it's no good backing them up if they become corrupt without me knowing. Most of my files, like photos and old emails, will never change, and I want to keep it that way. My worry is bit rot or, more likely, me inadvertently changing a file somehow.

I'm familiar with source control like git, and if I could check all my files into some sort of super git repo, then a git status would show me if any of my files had become corrupt. Git isn't great with large files, though, and git LFS seems complicated and possibly overkill for this situation. I only need the "has a file changed" part, so I'm wondering if anyone has a good way to handle this. I don't need versioning, as I already have that with my backups.




One method I use is to md5sum the files and create a manifest, which is just a sorted text file of filenames and checksums. If you generate a manifest periodically, you can write a small function to send you the diff of which files have changed: e.g. cat two manifests together, then sort and uniq -c them. If a filename shows up with two or more hashes across passes and you did not expect it to change, it becomes obvious immediately. md5sum is sufficient for this task and faster than sha256sum.
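A minimal sketch of that workflow, assuming GNU coreutils; the data directory and manifest names are just placeholders:

    # Build a sorted manifest of checksums for everything under ~/data
    find ~/data -type f -print0 | xargs -0 md5sum | sort -k2 > manifest-$(date +%F).txt

    # Compare two manifests: lines appearing only once are files that
    # changed, appeared, or vanished between the two passes
    cat manifest-2021-04-18.txt manifest-2021-05-18.txt | sort | uniq -c | awk '$1 == 1'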

Another method would be to run rsync with --dry-run and see what has changed between your backups that way. If you have sets of files you know should never change, you could filter on them as a canary.
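Something like the following (paths are illustrative); adding --checksum makes rsync compare file contents rather than just size and mtime, which is what catches silent corruption:

    # Itemize content differences between the live copy and the backup
    # without transferring anything (-n = --dry-run, -c = --checksum)
    rsync -rcn --itemize-changes ~/data/ /mnt/backup/data/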


You can certainly use par2 or something you cook up yourself using SHAs, but it'd be a whole lot easier if you just used a filesystem that handles this for you.

I wrote up a lengthy description of the problem and several different solutions here:

https://photostructure.com/faq/how-do-i-safely-store-files

I personally use a Synology and an Unraid box, both using btrfs, with scheduled data scrubs and periodic SMART health checks.
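On a bare Linux box the equivalent checks look roughly like this (the mount point and device are placeholders; NAS appliances like Synology and Unraid schedule the same operations from their GUIs):

    # Scrub re-reads all data and verifies btrfs checksums
    btrfs scrub start /mnt/pool
    btrfs scrub status /mnt/pool

    # Short SMART self-test, then report overall drive health
    smartctl -t short /dev/sda
    smartctl -H /dev/sda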

I also rsnapshot to an external drive, just formatted as ext4, as Yet Another Copy of my stuff.


> One thing which concerns me is the integrity of my files. It's no good backing them up if they become corrupt without me knowing. Most of my files like photos, old emails, etc will never change and I want to keep it that way.

Look into par2 files (https://github.com/Parchive/par2cmdline).

Arrange your backup data in some logical way, then generate par2 files for logical "groups" (e.g. per directory containing a set of files) with an amount of redundancy you feel comfortable with.

You can then periodically run par2 scans of the static backups to both detect and repair changes (provided the damage is not larger than the amount of redundancy you originally requested).
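A sketch of that cycle with par2cmdline; the directory name and the 10% redundancy level are just illustrative:

    # Create recovery data covering all files in a directory
    par2 create -r10 photos.par2 photos/*

    # Later: verify the files against the recovery data
    par2 verify photos.par2

    # If verify reports damage, repair from the redundancy blocks
    par2 repair photos.par2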



