
Incremental Backups Using GNU Tar and S3 - cirowrc
https://ops.tips/blog/incremental-backup-linux/
======
rsync
If you're intrigued by using basic unix primitives for tasks like this you'd
probably also be intrigued by a cloud storage product that _was built to act
like a unix primitive_.[1][2]

If you're interested in point-in-time snapshots, you're probably also
intrigued by our ZFS platform that gives you day/week/month/quarter/year
snapshots that you don't have to configure or maintain - you just do a "dumb"
rsync (or whatever) to us and the snapshots just appear.
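
Just as an illustration (the hostname is a placeholder, not an actual
account), the client side really is a single plain rsync:

    # push a plain copy over ssh; the day/week/month snapshots
    # are created server-side with no client configuration
    rsync -az --delete ~/documents/ user@usw-s001.rsync.net:documents/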

If you're interested in _encrypted backups_ you should look into the 'borg
backup' tool which has become the de facto standard for remote, encrypted,
changes-only-upload backups.[3][4]
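
For a taste of it, a minimal borg session against a remote repository looks
like this (hostname and paths are placeholders):

    # one-time: create an encrypted repository on the remote machine
    borg init --encryption=repokey user@backup.example.com:repo

    # each run uploads only the chunks the repository hasn't seen yet
    borg create --stats user@backup.example.com:repo::{hostname}-{now} ~/documents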

Finally, if S3 pricing is important, you should email us about our "HN
Readers'" discount.

[1] [https://www.rsync.net](https://www.rsync.net)

[2] Examples: [http://www.rsync.net/resources/howto/remote_commands.html](http://www.rsync.net/resources/howto/remote_commands.html)

[3] [https://www.stavros.io/posts/holy-grail-backups/](https://www.stavros.io/posts/holy-grail-backups/)

[4] [http://rsync.net/products/attic.html](http://rsync.net/products/attic.html)

~~~
turblety
Regarding rsync.net, I'm sorry but I won't trust any company to protect my
data if they can't spend a few minutes to put an SSL certificate on their
website. The rsync.net url just redirects to an http link with no SSL
certificate.

~~~
ctdean
What's the threat model for serving a read-only, public, marketing site over
http? Or is it just a general principle?

~~~
harshreality
That website is not a read-only, public marketing site. It contains their
order page, which asks for personal information:
[https://www.rsync.net/signup/order.html](https://www.rsync.net/signup/order.html)

Their _unencrypted_ pricing page links to that encrypted order form page. We
all agree there should be no http-to-https transitions like that, right?

Note that the encrypted order page is on the same host as their unencrypted
pages; both rsync.net and www.rsync.net are covered by the cert. _They
already have SSL set up, and they just purposely redirect away to http for
their static pages. That is a well-known SSL antipattern._

~~~
brazzledazzle
When you say "on the same host" do you mean "have the same DNS name" or do you
mean literally on the same server? It's possible it's just behind the same
load balancer, so I'm curious what threat model, specifically, you're
concerned about.

To be clear: I don't like transitions like that either, but that concern is
something I've previously only had with sites that do e-commerce or have a
login portal that isn't on a different (sub)domain. Apple and some banking
sites are notable examples that used to concern me (though I doubt they are
still like that).

~~~
mvkg
The threat for http-to-https transitions is that a man in the middle can
rewrite, drop, or add data before the user reaches the https site. See
sslstrip [0] for an example of this attack.

[0] [https://moxie.org/software/sslstrip/](https://moxie.org/software/sslstrip/)

------
loeg
You can do much the same thing with less work using the Tarsnap tool and
service. Rather than incremental backup, blocks are deduplicated on the client
and only unique blocks are backed up. This has about the same storage
efficiency as incremental backups, but has the benefit of not relying upon
"full" backup plus incremental diffs to achieve the final snapshot contents.

(I have no affiliation with Tarsnap other than Colin seems like a nice guy and
I am a customer.)

I have full backups configured every 12 hours. Just for example, on my last
backup, the total logical size was 12 GB; the compressed size of the
undeduplicated blocks was 7.2 GB; the total size of actual new data,
uncompressed, was 181 MB; and the final sum uploaded for this new full backup
was 72 MB. Logically, Tarsnap stores 2.9 TB of my backups, but after
compression and deduplication the "physical" requirement is only 16 GB.

For this I pay about 17¢ USD/day, or $62/year. I could probably lower my
storage use somewhat (the largest component of that cost, at 13.4¢/day), but
it hasn't been worth my time yet.
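
For anyone curious, a "full" backup is just a tar-like invocation; the
deduplication against everything already stored happens automatically
(archive name and path are illustrative):

    # create a new archive; only blocks not already stored get uploaded
    tarsnap -c -f "home-$(date +%Y%m%d-%H%M)" /home/user

    # show deduplicated totals like the ones quoted above
    tarsnap --print-stats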

~~~
raphinou
Tarsnap is really slow to restore big files. Just saying in the hope it can
help some people.

~~~
Scaevolus
As a tar-based backup system without a separate index, restoring a single file
is approximately as expensive as restoring the entire backup.

Systems that split metadata and data streams do a lot better (restic, attic,
borg, bup).

~~~
cperciva
Err... Tarsnap _does_ split metadata and data.

------
funkaster
I used to have a "poor man's time machine" system based on rsync + hard links
to files that didn't change between backups. Essentially it was the same
concept as Time Machine. Of course you couldn't upload a single "snapshot"
because tar wouldn't know what's a hard link. One advantage of using rsync is
that you can also keep track of things you delete.
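
The core of that trick is rsync's --link-dest option; a minimal sketch (dates
and paths are made up):

    # files identical to the previous snapshot become hard links, so
    # every snapshot looks complete but only changed files use space
    rsync -a --delete --link-dest=../2017-12-01 /home/user/ /backups/2017-12-02/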

Today I'm using zfs with real snapshots. For systems with no zfs support (my
wife's iMac, for instance), I have a zfs filesystem that those systems rsync
to; after the rsync is done, I create a snapshot. All scripted. The snapshots
can be stored on another server for an additional layer of backup, or
incrementally sent to S3 if you want.
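
A sketch of that flow, with made-up pool and dataset names:

    # after the rsync into the dataset finishes, take a snapshot
    zfs snapshot tank/backups/imac@2017-12-02

    # later, replicate only the delta since the previous snapshot
    zfs send -i tank/backups/imac@2017-12-01 tank/backups/imac@2017-12-02 \
        | ssh otherserver zfs receive tank/backups/imac
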

~~~
Rotareti
_> I used to have a "poor man's time machine" system based on rsync + hard
links to files that didn't change with new backups._

I built something similar that runs on a Raspberry Pi and creates backups for
the machines in my home network [0]. The Pi pings each machine every hour; if
a machine is online and a backup is due, it starts a backup process. My Pi
uses a USB battery as a UPS (unlimited power supply) [1]. I put all the
hardware in a little medicine cabinet on the wall [2]. It's been running
stably for months now, without a single reboot. It needs about 15 minutes to
back up my dev machine over WiFi. It's a little independent backup module,
and I'm really happy with it. :)

[0] [https://github.com/feluxe/very-hungry-pi](https://github.com/feluxe/very-hungry-pi)

[1] [https://www.amazon.de/gp/product/B00FAU7ZB2/ref=oh_aui_searc...](https://www.amazon.de/gp/product/B00FAU7ZB2/ref=oh_aui_search_detailpage?ie=UTF8&psc=1)

[2] [https://www.amazon.de/HMF-Medizinschrank-Arzneischrank-Hausa...](https://www.amazon.de/HMF-Medizinschrank-Arzneischrank-Hausapotheke-Original/dp/B002CVKEWY)

~~~
pmalynin
Surely you mean Uninterruptible Power Supply. I would indeed be interested in
an unlimited one myself.

~~~
Rotareti
This makes sense now that you say it. Whatever it is, it _powers_ my Pi very
well :)

------
magnetic
If you have a large amount of data to backup, cloud storage may be too
expensive.

My data set is about 8 TB (my wife is a professional photographer), and it
would be too expensive to keep in S3, so I have an "offsite backup system"
hosted at my in-laws'. It's just a Raspberry Pi + an 8 TB drive encrypted
with LUKS (in case it gets stolen or tampered with).

Every night, the RPi syncs the data from my house with rsnapshot (a Time
Machine-like tool that uses rsync over ssh with hard links).

Because of how rsnapshot works, I can always go there and look for a file in
one of the directories: it looks just like the original hierarchy, and I can
just browse the filesystem as I would on the original system.
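
For reference, a minimal rsnapshot.conf along those lines might look like
this (values are illustrative, and the fields must be separated by tabs):

    snapshot_root   /mnt/backup/snapshots/
    retain          daily   7
    retain          weekly  4
    # pull /home from the house over ssh into snapshots/<interval>/myhouse/
    backup          user@myhouse.example.com:/home/    myhouse/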

I also don't have to "trust a 3rd party" that the data is really there. I
remember some posts on HN about people who used some backup services
successfully... until restore time. I'm always cautious about the magical "set
it and forget it" service that is a black box.

The first sync has to be done "on the premises" because of the sheer amount of
data to transfer, but then the few daily Gigs of data can easily go over the
net while everyone is sleeping.

~~~
icebraining
S3 is expensive not just because it's "cloud", but because it keeps multiple
copies on monitored drives, whereas your single drive is a ticking time bomb.

I have a couple of drives in RAID 1 (mirror), yet I still don't rely on it
exclusively for really important data.

~~~
magnetic
Well, all of my data exists in 3 copies. One in my house, one in the detached
"granny unit", and one in the offsite location that I described.

My single offsite drive is indeed a ticking time bomb, but it's easily
replaceable with no loss of data when it dies.

The problematic case is when all the drives hosting the 3 copies happen to die
at the same time. Perhaps I don't have good protection against that case, but
I think I've reached diminishing returns in terms of backups.

------
rasengan
I would really recommend using duplicity [1]. It supports GPG encryption,
incremental backups, and more.

[1] [http://duplicity.nongnu.org/](http://duplicity.nongnu.org/)
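
A typical session looks like this (the GPG key id and bucket name are
placeholders):

    # encrypted backup to S3: a full backup on the first run,
    # incrementals on subsequent runs
    duplicity --encrypt-key DEADBEEF /home/user s3+http://my-bucket/home

    # restore goes in the other direction
    duplicity restore s3+http://my-bucket/home /tmp/restored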

~~~
abrkn
Can anyone else vouch for this being a stable, long-term solution for backing
up my home directory from macOS? Every single backup/cloud file storage
service gives me anxiety that I did something wrong, or that there are bugs
that wouldn't be discovered by retrieving random archives as a test.

~~~
jopsen
From:
[http://duplicity.nongnu.org/features.html](http://duplicity.nongnu.org/features.html)

> Although you should never have to look at a duplicity archive manually, if
> the need should arise they can be produced and processed using GnuPG, rdiff,
> and tar.

I haven't set it up yet, but it's going to be my holiday project this
Christmas.

------
arca_vorago
Tar works for small things, but beyond a certain combination of file count
and file size it becomes unwieldy. Rsync, rbackup, and rsnapshot are great
tools based on rsync, and these days Borg has been getting a lot of traction.
Bacula is great but complicated to set up and manage. There is a newer, very
interesting one called Ur that seems promising... And my mind is blanking on
some of the others; I'll comment when I'm not mobile so I can look at my list.

Also don't forget zfs/btrfs functions that might be relevant.

~~~
blunte
I second Bacula as something to look into. The configuration does take some
time to understand, but it is incredibly flexible. It's also industrial
strength and reliable.

------
bauerd
Worth checking out for backups is git-annex with remote bup repositories:

* [https://github.com/bup/bup](https://github.com/bup/bup)

* [http://git-annex.branchable.com/special_remotes/bup/](http://git-annex.branchable.com/special_remotes/bup/)
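
A minimal local bup round trip, to give a flavor of it (paths and names are
illustrative):

    bup init                       # create the repository (~/.bup)
    bup index /home/user           # scan for new and changed files
    bup save -n home /home/user    # store a deduplicated snapshot
    # add "-r user@server:" to save straight to a remote repository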

~~~
cirowrc
I didn't know about `bup`! Thanks for the reference! I was reading the README
and it indeed looks pretty great. Have you been using it? What's your
experience with it?

The ability to resume from an interrupted state seems pretty amazing:

> You can back up directly to a remote bup server, without needing tons of
> temporary disk space on the computer being backed up. And if your backup is
> interrupted halfway through, the next run will pick up where you left off.

------
terrik
A great, though non-free, backup tool that supports incremental encrypted
backups is Arq[0].

Arq supports painless backups to multiple cloud storage providers, with
budget enforcement, etc.

[0] [https://www.arqbackup.com](https://www.arqbackup.com)

------
muxator
With Btrfs or ZFS the whole concept of incremental backup is reshuffled:
basically you get consistent full backups that occupy only the space of one
full backup plus all the diffs, and you only transfer the changed blocks.
It's really something with no comparison to other methods.
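
With btrfs, for example, the delta transfer boils down to this (names are
made up):

    # take a read-only snapshot, then ship only the blocks that
    # changed relative to the previous (also read-only) snapshot
    btrfs subvolume snapshot -r /data /snapshots/data-today
    btrfs send -p /snapshots/data-yesterday /snapshots/data-today \
        | ssh backuphost btrfs receive /backups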

While I love ZFS, let's not forget that btrfs has its strengths (a lot more
flexibility, mainline kernel support), and, provided that your use case is
single disk, RAID 1, or RAID 10, it has been working super reliably for some
years now.

------
aedocw
I recently wanted to get backups going in a similar way (incremental, with
copies on-site and off-site). I ended up using Borg and rclone, sending the
backups to S3-compatible Wasabi storage. It's been working great, and coupled
with ZeroTier on machines that are not in the house (family members'), I
ended up with a pretty resilient system that only costs a few dollars per
month. I wrote up the details here:
[https://localconspiracy.com/2017/10/backup-everything.html](https://localconspiracy.com/2017/10/backup-everything.html)
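
The glue step is essentially one command (the remote name and bucket are
placeholders, assuming an rclone remote already configured for Wasabi):

    # after borg finishes writing to the local repo, mirror it out
    rclone sync /backups/borg-repo wasabi:my-backup-bucket/borg-repo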

------
boramalper
I'm not sure if tar is a good format for backups, due to its extremely
sequential nature ( _e.g._ you can't get a list of the files in an archive
without scanning the whole archive...)

~~~
chrisseaton
What overhead does that add in practice? Like a second or so to scan a few GB
of tarball? Does that matter?

~~~
ctur
It often does, especially if the tarball is compressed (especially if you use
anything but Zstandard) or sits on a network store rather than a fast local
SSD. A typical rule of thumb is that a spinning hard drive may give around
100 MB/sec of throughput, so that is likely the bigger bottleneck, followed
by CPU if using slower compression algorithms.

Zstandard at a high compression level gives a good tradeoff of decompression
speed vs underlying physical throughput, be it network or spindle.

Even so, yes, the sequential nature of tar is not great for this reason.

Squashfs files, which can be used on Linux and Mac (using FUSE for OS X), are
a very good alternative for archives, though without some of the features
described in the linked article.
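
For example, assuming a squashfs-tools build with zstd support (paths are
illustrative):

    # pack the tree; squashfs keeps an index, so listing is cheap
    mksquashfs /home/user/photos photos.sqfs -comp zstd

    # list the contents without scanning everything, or just mount it
    unsquashfs -l photos.sqfs
    mount -o loop,ro photos.sqfs /mnt/photos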

------
IgorPartola
Note that one point of failure of this setup is that you are keeping the
snapshot/index files locally. Ideally, you want to back those up as well.
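
With the article's GNU tar approach that means copying the snapshot (.snar)
file along with the archives; a sketch with a placeholder bucket:

    tar --listed-incremental=backup.snar -czf backup-0.tar.gz /home/user
    aws s3 cp backup-0.tar.gz s3://my-bucket/
    aws s3 cp backup.snar s3://my-bucket/    # back up the index too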

------
atomi
I use restic backing up to minio, stored on a RAID 1 btrfs system, and I love
it.
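
For anyone wanting to try that combination: minio speaks the S3 API, so
restic's s3 backend points at it directly (endpoint, bucket, and credentials
are placeholders):

    export AWS_ACCESS_KEY_ID=minio-access-key
    export AWS_SECRET_ACCESS_KEY=minio-secret-key
    restic -r s3:http://minio.local:9000/backups init
    restic -r s3:http://minio.local:9000/backups backup /home/user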

------
ape4
You'd definitely want to do a full backup occasionally.

~~~
cirowrc
I totally agree!

