Show HN: Saf – simple, reliable, rsync-based, battle tested, rounded backup (github.com/dusanx)
89 points by binaryapparatus 11 months ago | 24 comments
I had this backup code working reliably for years, using a local file system, VPS/dedicated server, or remote storage as the backup target, and then I finally got time to write up the README, iron out a few missing switches, and publish. It should be production ready and reliable, so it could be useful to others. Contributors are welcome.


How do you automate checking that the backup worked correctly, in the face of saf bugs, rsync bugs/misconfiguration, or bit rot?

My solution is to pick a few random files (plus whatever is new), and compute their hashes on both the local and remote versions. But it's slow and probabilistic. ZFS also helps, but I feel it's too transparent to rely on (what if the remote storage changes filesystems?).
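That random spot check is simple to script. A minimal sketch in Python, assuming the backup is reachable as a local path (e.g. a mount); `spot_check` and `sha256_of` are made-up names for illustration, not part of any tool mentioned here:

```python
import hashlib
import random
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def spot_check(source: Path, backup: Path, sample_size: int = 5) -> list[Path]:
    """Hash a random sample of source files and compare against the backup.

    Returns the relative paths whose backup copy is missing or disagrees."""
    files = [p for p in source.rglob("*") if p.is_file()]
    mismatches = []
    for f in random.sample(files, min(sample_size, len(files))):
        rel = f.relative_to(source)
        copy = backup / rel
        if not copy.is_file() or sha256_of(f) != sha256_of(copy):
            mismatches.append(rel)
    return mismatches
```

Run with a large enough `sample_size` and it degenerates into a full verification pass, which is the slow-but-certain end of the same trade-off.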

Those same questions always bug me, and I have tried everything from very smart to very brute-force solutions. I love ZFS, but then we can question ZFS and OS bugs in the same manner as saf or rsync -- that rabbit hole is deep and quickly becomes expensive, since ZFS may need ECC RAM and other more expensive components.

Lately, over the last few years, I have been leaning towards using many cheap backups instead of clever and more expensive ones, with the idea that they can't all break at the same time. Yes, occasional checks are good, but safety in numbers seems like a good strategy.

It is not an accident that saf's tagline says "one backup is saf, two are safe, three are safer" ;)

"saf bugs, rsync bugs/misconfiguration"

On top of many cheap backups, I am also trying not to rely on any single piece of technology (I know, it is not ideal that the hardware and OS remain the same on any computer no matter what backup is used). If I use saf as my preferred rsync-based solution, I will also use Borg or duply/duplicity as an additional backup to avoid rsync bugs.

Having two or more rsync-based backups, so they all go through the same rsync pipe, makes much less sense than mixing completely different backup solutions, right?

> But it's slow and probabilistic.

A couple of things I do:

1. Generate a list of files on both sides with their sizes & dates, and compare them, ignoring any that have changed/appeared since before the last backup cycle started. Unless your backups are truly massive in terms of number of files, this is practical to automate and run at least as often as your backup cycle, and it catches many system errors or simple failures of the backups to run at all.

2. Occasionally checksum the whole damn lot in your latest snapshot and the originals. This can take a lot of time (and expense, if you are using cold storage with read-access charges) so you want to do it less often, but it catches bit rot and similar issues. Again, you have to skip files that have been touched since the start of the last backup cycle.

3. If you keep a checksum (or a list of files with checksums) for each snapshot, occasionally pick one and verify it from scratch. As with hashing the latest snapshot, this can be quite resource-intensive for massive backups but is fine for mine. You can also just compare metadata (files, sizes, dates) to a stored list, which will catch some types of filesystem corruption affecting your older snapshots.
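Step 1 can be sketched in a few lines. This is a hypothetical illustration, not the poster's actual scripts; it assumes both trees are locally readable and uses size+mtime as the comparison key:

```python
from pathlib import Path

def listing(root: Path) -> dict[str, tuple[int, int]]:
    """Map relative path -> (size, mtime seconds) for every file under root."""
    out = {}
    for p in root.rglob("*"):
        if p.is_file():
            st = p.stat()
            out[str(p.relative_to(root))] = (st.st_size, int(st.st_mtime))
    return out

def compare(source: Path, backup: Path, backup_started: float) -> list[str]:
    """Report paths that differ, skipping files touched since the backup began
    (those are legitimately out of sync until the next cycle runs)."""
    src, bak = listing(source), listing(backup)
    problems = []
    for rel in sorted(set(src) | set(bak)):
        if rel in src and (source / rel).stat().st_mtime >= backup_started:
            continue  # changed after the cycle started; next run will catch it
        if src.get(rel) != bak.get(rel):
            problems.append(rel)
    return problems
```

An empty result means the listings agree; anything returned is either a missed file, a stale copy, or a file that exists only on one side.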

One of these days I might get around to tidying+documenting+publishing my scripts that run all this…

That's close to what I do[1]. The size and date comparison is done by rsync, and I keep a text file with all expected file hashes, so if there's any disagreement between copies I know which one to trust.

These hashes are also ordered so that the files at the top are the ones that haven't been checked for the longest; part of the script takes the top N files, checksums them, and moves them to the bottom of the list. This guarantees every file gets checksummed on a regular rotation.
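That rotation might look something like this (a hypothetical sketch, not the linked script; it assumes a plain text file of '<sha256>  <relative path>' lines):

```python
import hashlib
from pathlib import Path

def rotate_check(hash_file: Path, root: Path, n: int) -> list[str]:
    """Verify the N files at the top of the hash list, then rotate them
    to the bottom, so the least-recently-checked files are always next.

    Returns the paths whose current hash no longer matches the record."""
    lines = hash_file.read_text().splitlines()
    head, tail = lines[:n], lines[n:]
    bad = []
    for line in head:
        expected, rel = line.split("  ", 1)
        actual = hashlib.sha256((root / rel).read_bytes()).hexdigest()
        if actual != expected:
            bad.append(rel)
    # checked entries go to the bottom; everything else shifts up
    hash_file.write_text("\n".join(tail + head) + "\n")
    return bad
```

With T total files and N checked per run, every file gets re-verified once every T/N runs.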

I also download a random file on every run, to make sure the connection is not broken.

My use case is personal photos and videos, so I also make sure that my local files are never changed.

And finally, I highly recommend Hetzner Storage Boxes. Not only are they dirt cheap while still giving you ZFS and Samba access, you can actually SSH into the box and run simple commands on the files locally, like sha256sum, without paying for network transfers.

[1] https://github.com/boppreh/cloud_backup_script/

Wow, I like this a lot, as it looks easy to run and it can sync to multiple targets. My local backup consists of JBOD (not RAID, ZFS or BTRFS), so I think this should work nicely. I've been using a shell script to do something similar for backup, but it lacked a lot of the features.

It might be safer to use an rsync library that calls librsync, or at least wraps the calls for you. I'm always suspicious of sub-shelling.

How does it deal with interrupted backups?

Can it automatically prune backups older than N days?

I don’t see anything about encryption.

> How does it deal with interrupted backups?

Any new backup is hardlinked against the previous one in a temporary 'in-progress' directory, then renamed to the proper name at the end. If a backup breaks, a new 'saf backup' by default first removes 'in-progress' and then starts things again (linking with the latest good one), but you can run 'saf backup --resume' to try to finish the interrupted one. I prefer a clean try again (which is the default), but --resume works well too.
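The described flow (hardlink against the previous snapshot into 'in-progress', then rename) can be illustrated with a pure-Python stand-in for rsync's --link-dest. This is not saf's actual code; the 'latest' symlink layout and the size+mtime change test are assumptions for the sketch:

```python
import os
import shutil
from pathlib import Path

def snapshot(source: Path, target: Path, name: str, resume: bool = False) -> None:
    """Build the new snapshot in an 'in-progress' directory, hardlinking
    files unchanged since the previous snapshot, then atomically rename
    the directory into place once everything has transferred."""
    work = target / "in-progress"
    if work.exists() and not resume:
        shutil.rmtree(work)               # default: discard a broken attempt
    work.mkdir(exist_ok=True)
    latest = target / "latest"            # symlink to newest completed snapshot
    for item in sorted(source.rglob("*")):
        rel = item.relative_to(source)
        dest = work / rel
        if item.is_dir():
            dest.mkdir(parents=True, exist_ok=True)
            continue
        if dest.exists():
            continue                      # --resume: this file already made it
        prev = latest / rel
        st = item.stat()
        if (prev.is_file() and prev.stat().st_size == st.st_size
                and prev.stat().st_mtime_ns == st.st_mtime_ns):
            os.link(prev, dest)           # unchanged: share the inode
        else:
            shutil.copy2(item, dest)      # new or changed: real copy
    os.rename(work, target / name)        # atomic: snapshot is all-or-nothing
    if latest.is_symlink():
        latest.unlink()
    latest.symlink_to(name)
```

The rename at the end is what makes interrupted backups safe to discard: a snapshot directory either has its final name and is complete, or it is still 'in-progress' and disposable.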

> Can it automatically prune backups older than N days?

Yes: manually via 'saf prune', on top of 'saf backup' doing a prune itself. Prune periods are defined in each .saf.conf, per backup source location, with defaults of 2/30/60/730/3650 days for all/daily/weekly/monthly/yearly backups. All defaults are easy to change per source.
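A tiered retention plan like those defaults could be computed roughly like this (a sketch of the general technique, not saf's actual prune algorithm; `prune_plan` is a hypothetical name):

```python
from datetime import datetime

def prune_plan(snapshots: list[datetime], now: datetime) -> list[datetime]:
    """Return the snapshots to delete under tiered retention: keep
    everything for 2 days, one per day for 30, one per week for 60,
    one per month for 730, one per year for 3650, nothing beyond."""
    keep, seen = set(), set()
    for snap in sorted(snapshots, reverse=True):   # newest first wins a bucket
        age = (now - snap).days
        iso = snap.isocalendar()
        if age < 2:
            keep.add(snap)                         # 'all' tier: keep everything
            continue
        if age < 30:
            bucket = ("day", snap.date())
        elif age < 60:
            bucket = ("week", iso.year, iso.week)
        elif age < 730:
            bucket = ("month", snap.year, snap.month)
        elif age < 3650:
            bucket = ("year", snap.year)
        else:
            continue                               # older than 10 years: never kept
        if bucket not in seen:
            seen.add(bucket)
            keep.add(snap)
    return sorted(s for s in snapshots if s not in keep)
```

The output is just a plan; an actual prune would map each datetime back to a snapshot directory and delete it.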

> I don’t see anything about encryption.

saf doesn't deal with encryption, only with transport. I prefer to use another specialized tool for encryption if I have a backup target that needs it.

> I don’t see anything about encryption.

Many prefer to deal with encryption separately, encrypting the volumes being backed up to rather than relying on the backup system to manage that.

Of course this adds a consideration to your system: how to back up your encryption keys, and use them to remount the volumes when needed, in a way that does not render the whole thing pointless by accidentally exposing your keys to the wild. Then again, the encryption included in many backup systems has these issues for you to resolve too.

Why not rsnapshot? I've been using it to back up servers to servers for a lot of years.

In my understanding, rsnapshot is the equivalent of 'saf backup', which is only one bit of saf's functionality. saf has a few more commands, to be able to see and analyze what's on the backup target side.

rsnapshot uses a centralized rsnapshot.conf; saf has a git-style .saf.conf per backup source location.

Apart from using rsync, there are more differences than similarities between rsnapshot and saf.

how is it better/safer than manually using rsync?

Have a look at restic for a good alternative to this.

Typical HN to immediately tell everyone to use something else when someone posts something they spent time and effort on making, just because it's not 100% unique

These alternative recommendations are exactly what I'm looking for when browsing the comments.

Author here. Yes, I am also one who loves to see alternatives on a topic of interest to me on HN. However, putting in the effort to publish a different take on something already 'solved' happens either because I want my own solution no matter what (I don't) or because other solutions didn't fully cover all my use cases (they didn't). I would gladly use somebody else's code instead of my own, if that were possible.

Two scripts linked at the bottom of README.md are almost perfect, and I used one or the other for years, but one of them sometimes fails to hardlink with the previous backup, creating a new full backup and wasting resources -- not something that's easy to detect, since they all look correct before 'du' or similar analysis.
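That failure mode (a snapshot that silently became a full copy) can be caught without 'du' by counting bytes whose inodes haven't appeared in earlier snapshots. A hypothetical sketch, not code from saf or the mentioned scripts:

```python
from pathlib import Path

def unique_bytes(snapshot: Path, seen_inodes: set[int]) -> int:
    """Bytes in this snapshot not hardlink-shared with snapshots scanned
    so far. Call it on snapshots oldest to newest with a shared set; a
    mostly-unchanged snapshot reporting nearly its full size probably
    failed to hardlink against its predecessor."""
    total = 0
    for p in snapshot.rglob("*"):
        if p.is_file():
            st = p.stat()
            if st.st_ino not in seen_inodes:
                seen_inodes.add(st.st_ino)
                total += st.st_size
    return total
```

Properly hardlinked snapshots show a small incremental cost; an accidental full backup stands out as a large one.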

saf solves a few of the problems I faced over the years in an elegant way: multiple targets, target-related commands, reliability. Why not XY? Why not saf?

I didn't mean to detract from the post. I use both rsync and restic and this looks interesting. Just wanted to provide a recommendation for an alternative that does very similar things; I hope that's okay.

With me that's absolutely okay. We all need more choice, and HN is a great way to learn about alternatives. A few nice things I borrowed from those two scripts mentioned in the readme (credit given) wouldn't have been possible without alternatives and open source.

"It behooves every man to remember that the work of the critic is of altogether secondary importance, and that, in the end, progress is accomplished by the man who does things." -- Theodore Roosevelt

Restic is great, but the lack of support for empty passwords, and the developer's response about it, is very grating:


He very politely said he thinks it's better to keep the password requirement in place and decided to do that. What's grating about that? Personally, I think his concern about users mistakenly not setting a password could be alleviated with an explicit --insecure flag, or similar.

This is the exact reason why I do not use restic.

This is a backup tool, not a security one. The fact that the author does not understand this is a real problem and a red flag.

It's a good idea to enforce passwords for security. The features of backups done right are incremental backups, snapshots, deduplication, encryption and compression.
