Incremental Backups Using GNU Tar and S3 (ops.tips)
133 points by cirowrc on Dec 2, 2017 | 63 comments

If you're intrigued by using basic unix primitives for tasks like this you'd probably also be intrigued by a cloud storage product that was built to act like a unix primitive.[1][2]

If you're interested in point-in-time snapshots, you're probably also intrigued by our ZFS platform that gives you day/week/month/quarter/year snapshots that you don't have to configure or maintain - you just do a "dumb" rsync (or whatever) to us and the snapshots just appear.

If you're interested in encrypted backups you should look into the 'borg backup' tool which has become the de facto standard for remote, encrypted, changes-only-upload backups.[3][4]

Finally, if S3 pricing is important, you should email us about our "HN Readers'" discount.

[1] https://www.rsync.net

[2] Examples: http://www.rsync.net/resources/howto/remote_commands.html

[3] https://www.stavros.io/posts/holy-grail-backups/

[4] http://rsync.net/products/attic.html

Regarding rsync.net, I'm sorry but I won't trust any company to protect my data if they can't spend a few minutes to put an SSL certificate on their website. The rsync.net url just redirects to an http link with no SSL certificate.

What's the threat model for serving a read-only, public, marketing site over http? Or is it just a general principle?

That website is not a read-only, public marketing site. It contains their order page, which asks for personal information: https://www.rsync.net/signup/order.html

Their unencrypted pricing page links to that encrypted order form page. We all agree there should be no http to https transitions like that, right?

If you'll note, that encrypted order page is on the same host as their unencrypted pages. Both rsync.net and www.rsync.net are covered by the cert. They have SSL set up already, and they just purposely redirect away to http for their static pages. That is a well-known SSL antipattern.

When you say "on the same host" do you mean "have the same DNS name" or do you mean literally on the same server? It's possible it's just behind the same load balancer so I'm curious what the threat model that you're concerned about is specifically.

To be clear: I don't like transitions like that either but that concern is something I've only previously had with sites that do e-commerce or login portal that's not on a different (sub)domain. Apple and some banking sites are notable examples that used to concern me (though I doubt they are still like that).

The threat with http to https transitions is that a man in the middle can rewrite, drop, or add data before the user reaches the https site. See sslstrip[0] for an example of this attack.

[0] https://moxie.org/software/sslstrip/

That load balancer is a server. It has the TLS key, and so is authoritative for the content of the site.

The model of “surrogate origin server” is sometimes more helpful than “middlebox” and similar.

Users are for example protected from ISP injecting ads or tracking in the website.

It’s a sign that security isn’t their first priority. It’s almost 0 effort to put ssl on an entire site. Pretty much no reason not to.

The actual answer is probably going to surprise you ...

The reason that the vast majority of rsync.net is not covered by https is because I personally have not yet decided how I feel, philosophically, about the notion of https-only websites.

As silly as it may sound to you, I very much like the fact that you can read rsync.net HOWTO pages with netcat. Or telnet'ing to port 80. Or grab them with old versions of 'fetch'.

I used to do all kinds of interesting and time-saving tasks on the pre-https internet with simple, static UNIX tools. As you can imagine, almost all of those workflows are now either broken or require a big huge pile of python libraries.

So in summary, I am sorry to deprive you of your righteous indignation. You can see from our properly implemented order form that we understand https just fine, thank you.

The real answer is that https-everywhere sounds like it's probably a good idea ... but also sounds like you should maybe get off of my lawn ... but I'm not sure yet.

If you want to support HTTP too, that's fine. But why redirect people who use HTTPS URLs back to the HTTP page?

In any case, modern software toolchains have no trouble with HTTPS. I can interact with it from the command line pretty easily with `wget` or `curl` and similar.

Man, you don't even understand basic security practices.

When going from an HTTP page to an HTTPS page, what guarantees can you give that I see the original content, that I clicked the proper link, and that I was not MITM-ed? A potential attacker can MITM and easily redirect me to any order page they want... This is security 101.

I understand the attack you've outlined just fine.

What you don't understand is that people that don't notice being redirected to a different domain are not smart enough to be using rsync.net in the first place.

So it's not an issue.

Or, I should say, in the fifteen years that we have been providing "cloud storage"[1] it hasn't been an issue.

[1] It wasn't called "cloud storage" back then - our service predates the term.

Probably I (working in security and even checking certificates sometimes) would not even notice if it pointed to a different domain like rsync-app.net or some random name with a valid cert. There are a ton of examples of, and valid reasons for, putting different things on different TLDs. Why should I be suspicious? How should I know which domain or subdomain you chose, and why?

Maybe even https://order.rsync.net could be the link and YOU (the sysadmin of the service) might not even notice, because I'm pretty sure you don't check/monitor your DNS records every couple of minutes.

The argument "it has not happened yet" is not valid, because it could happen at any time in your service's lifetime. It's like an open door where no robbery has happened yet: the likelihood of one happening is higher than if you at least closed the door. It would be silly to complain "It has been open for a long time and there was no robbery" after it happened.

> "people that don't notice being redirected to a different domain are not smart enough to be using rsync.net in the first place."

This is just an assumption, I would not make that. You could be surprised. Sometimes even web developers don't understand how x509 certs and https work.

Your reasons may be plentiful, but if it's because you prefer the protocol, without the obstruction of handshake, I'd recommend using openssl s_client.

Obvious disclaimer: openssl binary may or may not be part of your system (although the same can be said about netcat and telnet)

Just checked... and OMG that was terrible.

How can they be trusted with backups when they can't even run an HTTPS website?

I've seen your service advertised for years, but only started using it myself since August. I've had two outages in that time, once lasting for two+ days.

However despite the outages I have to say I've been genuinely impressed at how smoothly the process of signing up was, and how well the (borg) product works.

I have my own local backups, so I'd only ever need to restore from your copies in the event of a catastrophic failure, and while I hope that never happens I do test-restores now and again and things have always been great.

I've been a happy rsync.net customer for several years. It's been my personal backup solution (borg + append-only + locked down SSH key) for a while, and I just had to use it to restore my computer yesterday.

Their rates are incredible, and their support is excellent. I highly recommend it.

You can do much the same thing with less work using the Tarsnap tool and service. Rather than incremental backup, blocks are deduplicated on the client and only unique blocks are backed up. This has about the same storage efficiency as incremental backups, but has the benefit of not relying upon "full" backup plus incremental diffs to achieve the final snapshot contents.

(I have no affiliation with Tarsnap other than Colin seems like a nice guy and I am a customer.)

I have full backups configured every 12 hours. Just for example, on my last backup, the total logical size was 12 GB; the compressed size of the undeduplicated blocks was 7.2 GB; the total size of actual new data, uncompressed, was 181 MB; and the final sum uploaded for this new full backup was 72 MB. Logically, Tarsnap stores 2.9 TB of my backups, but after compression and deduplication the "physical" requirement is only 16 GB.

For this I pay about 17¢ USD/day, or $62/year. I could probably try to lower my storage use somewhat (the largest component of that cost, 13.4¢/day) but it hasn't been worth my time yet.
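For anyone wondering how "every backup is a full backup, but only new blocks get uploaded" can work, here is a minimal Python sketch of client-side block deduplication. It is not Tarsnap's actual implementation: real tools typically use variable-size chunking, compression and encryption, while this sketch uses fixed-size blocks and a plain dict as the "remote" store. `BLOCK_SIZE`, `backup` and `restore` are names made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, chosen only for illustration


def backup(data: bytes, store: dict) -> tuple:
    """Record `data` as a list of block hashes; return (manifest, bytes_uploaded)."""
    manifest = []
    uploaded = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            # Only blocks the store has never seen are "uploaded".
            store[digest] = block
            uploaded += len(block)
        manifest.append(digest)
    return manifest, uploaded


def restore(manifest: list, store: dict) -> bytes:
    """Rebuild a snapshot from its own manifest, without replaying any diffs."""
    return b"".join(store[digest] for digest in manifest)
```

Each manifest describes a complete snapshot on its own, so deleting an old snapshot never invalidates a newer one, as long as blocks still referenced by some manifest are kept.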

Tarsnap is really slow to restore big files. Just saying in the hope it can help some people.

It's slow to restore smaller files as well. It is certainly not near-line disaster recovery. Conversely, it has already saved my bacon a few times and I trust its design.

As a tar-based backup system without a separate index, restoring a single file is approximately as expensive as restoring the entire backup.

Systems that split metadata and data streams do a lot better (restic, attic, borg, bup).

Err... Tarsnap does split metadata and data.

I used to have a "poor man's time machine" system based on rsync + hard links to files that didn't change with new backups. Essentially it was the same concept as Time Machine. Of course you couldn't upload a single "snapshot" because tar wouldn't know what's a hard link. One advantage of using rsync is that you can also keep track of things you delete.
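A rough Python sketch of the hard-link idea, for readers who haven't seen it: each snapshot directory looks like a full copy, but files that are unchanged since the previous snapshot are hard links to the same data. This is only an illustration, not what rsync does internally (rsync by default compares size and modification time, while this sketch compares file contents), and `snapshot` is a made-up name.

```python
import filecmp
import os
import shutil
from typing import Optional


def snapshot(source: str, previous: Optional[str], target: str) -> None:
    """Copy `source` into `target`, hard-linking files unchanged since `previous`."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        target_dir = os.path.normpath(os.path.join(target, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_dir, name)
            old = os.path.normpath(os.path.join(previous, rel, name)) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)       # unchanged: share the data with the old snapshot
            else:
                shutil.copy2(src, dst)  # new or changed: store a fresh copy
```

Files deleted from the source simply do not appear in the next snapshot, while older snapshots still hold them, which is how deletions stay traceable.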

Today I'm using zfs with real snapshots. For systems with no zfs support (my wife's iMac for instance), I have a zfs fs that those systems rsync to, after the rsync is done I create a snapshot. All scripted. The snapshots can be stored in another server for an additional layer of backup, or incrementally send them to s3 if you want.

> I used to have a "poor man's time machine" system based on rsync + hard links to files that didn't change with new backups.

I built something similar that runs on a Raspberry Pi and creates backups for the machines in my home network [0]. The Pi pings each machine every hour; if a machine is online and a backup is due, it starts a backup process. My Pi uses a USB battery as a UPS (unlimited power supply) [1]. I put all the hardware in a little medicine cabinet on the wall [2]. It's been running stably for months now, without a single reboot. It needs about 15 minutes to back up my dev machine over WiFi. It's a little independent backup module, and I'm really happy with it. :)

[0] https://github.com/feluxe/very-hungry-pi

[1] https://www.amazon.de/gp/product/B00FAU7ZB2/ref=oh_aui_searc...

[2] https://www.amazon.de/HMF-Medizinschrank-Arzneischrank-Hausa...

Surely you mean Uninterruptible Power Supply. I would indeed be interested in an unlimited one myself.

This makes sense now that you say it. Whatever it is, it powers my Pi very well :)

I similarly have a Pi in a cupboard doing the backups. In my case though, it uses wake-on-lan to wake the computers up in the middle of the night to run the backups, then lets them go back to sleep after.
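In case it helps anyone copying this setup: waking a machine over the network just means broadcasting a "magic packet" (six 0xFF bytes followed by the target's MAC address repeated 16 times). A small Python sketch follows; the function names, the broadcast address and the port are my own assumptions (UDP port 9 is a common choice), and the target's firmware/NIC must have wake-on-LAN enabled.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return b"\xff" * 6 + raw * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

Usage would be something like `wake("00:11:22:33:44:55")` (a placeholder MAC) from the Pi, followed by a wait until the machine answers before starting the backup.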

I use rsnapshot to an encrypted external USB disk.

I used to have a Pi for that, but the poor IO was annoying when you needed to back up large files. I got a small Atom board instead, which runs circles around it. And since it's a regular x86 board, I can use rtcwake to make it sleep between each backup, so the power consumption is quite good.

I wrote a tool called Snebu (www.snebu.com), which is intended to give you the same thing as rsync + hard links. Uses find and tar to grab file metadata and retrieve files, on the back end it stores file data in .lzo compressed files with all metadata for each snapshot (including symlinks, hardlinks, etc) stored in a sqlite3 DB file.

The design goals were to make something that didn't store data in a proprietary format (you can analyse the backups using straight sql commands, and access the data using lzop), and to be able to back up systems without installing a client agent on them, and support compression, and avoid filesystem issues you run into with a large number of hard links (such as https://news.ycombinator.com/item?id=8305283). So far I've been using it to back up a few dozen RHEL 4 - RHEL 7 systems over the last couple years without issues.

I was surprised how easy it is to set up a time-machine-like backup with rsync. I'm backing up every hour to my server, and a cron job prunes the backups every night. Disarmingly simple.

I still use rsnapshot to manage such backups. I like that I can use an old Mac to run it. Doesn't ZFS require more from the system, RAM particularly?

Cool setup. But if you're incrementally sending snapshots to a non-ZFS host (S3), your S3 storage will keep growing indefinitely, because you can't delete any snapshots from there ever, since they reference each other. Correct?

Seems like you could probably use a keyframe like approach to get a nice middle ground. Every month, or biweekly or so, take a full backup and base the snapshots on that.
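A rough sketch of that keyframe idea in Python. Everything here is hypothetical (the 30-day interval, the names `plan_backup` and `expired_chains`); the point is only that once a newer full backup exists, an older full and all the incrementals based on it can be deleted together, so storage stops growing without bound.

```python
from datetime import date
from typing import List, Optional

FULL_EVERY_DAYS = 30  # hypothetical interval ("every month, or biweekly or so")


def plan_backup(today: date, last_full: Optional[date]) -> str:
    """Decide whether today's run should be a full or an incremental backup."""
    if last_full is None or (today - last_full).days >= FULL_EVERY_DAYS:
        return "full"
    return "incremental"


def expired_chains(full_dates: List[date], keep: int) -> List[date]:
    """Return the full-backup dates whose whole chain (full + incrementals) may go."""
    ordered = sorted(full_dates)
    return ordered[:-keep] if keep > 0 else ordered
```

For example, with fulls taken on the first of October, November and December and `keep=2`, only the October chain is reported as deletable.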

If you have a large amount of data to backup, cloud storage may be too expensive.

My data set is about 8 TB (my wife is a professional photographer), and it would be too expensive to keep in S3, so I have an "offsite backup system" that is hosted at my in-laws. It's just a RaspberryPi + an 8 TB drive encrypted with LUKS (in case it gets stolen or tampered with).

Every night, the RPi syncs the data from my house with rsnapshot (which is a TimeMachine like tool that uses hard links with rsync over ssh).

Because of how rsnapshot works, I can always go there and look for a file in one of the directories: it looks just like the original hierarchy, and I can just browse the filesystem as I would on the original system.

I also don't have to "trust a 3rd party" that the data is really there. I remember some posts on HN about people who used some backup services successfully... until restore time. I'm always cautious about the magical "set it and forget it" service that is a black box.

The first sync has to be done "on the premises" because of the sheer amount of data to transfer, but then the few daily Gigs of data can easily go over the net while everyone is sleeping.

S3 is expensive not just because it's "cloud", but because it keeps multiple copies on monitored drives, whereas your single drive is a ticking time bomb.

I have a couple of drives in RAID 1 (mirror), yet I still don't rely on it exclusively for really important data.

Well, all of my data exists in 3 copies. One in my house, one in the detached "granny unit", and one in the offsite location that I described.

My single offsite drive is indeed a ticking time bomb, but it's easily replaceable with no loss of data when it dies.

The problematic case is when all the drives hosting the 3 copies happen to die at the same time. Perhaps I don't have good protection against this case, but I think I've reached diminishing returns in terms of backups.

He doesn't just have a single drive though, he has (at least) two (the drive in active use and the off-site backup drive).

Not the ideal 3-2-1 rule of thumb, but also not a ticking time bomb.

Glacier would cost about $30 per month to store 8 terabytes, according to the price calculator (calculator.s3.amazonaws.com). S3 infrequent access storage would cost $100 per month. These amounts seem pretty cheap for storing 8 terabytes offline and online respectively.

If you do the math, you see that an 8TB drive can be found for $125 (granted, it's a special deal) at https://slickdeals.net/f/10919999-8tb-seagate-backup-plus-hu... - but it's quite common to find them around $200. Then the RPi 3 is under $50.

Without even talking about the difficulty of dealing with Glacier files in an incremental backup scheme, doing the initial sync, and checking the backup data regularly, we're talking about something that will probably cost $250 for a couple of years of backup (assuming the drive only lasts 2 years) vs $720 (glacier) or $2,400 (S3) if I use your numbers.

It seems like a significant difference to me, especially because the assumptions on how often my drive will fail are quite pessimistic.
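Spelling out that arithmetic in Python, using only the figures quoted in this subthread ($30/month for Glacier, $100/month for S3 infrequent access, roughly $200 for the drive plus $50 for the Pi) and the pessimistic assumption that the drive lasts two years. The function name is made up, and the numbers ignore transfer, request and electricity costs.

```python
def two_year_costs(months: int = 24) -> dict:
    """Total cost in USD of storing ~8 TB for `months`, from the quoted prices."""
    glacier_per_month = 30   # quoted Glacier estimate
    s3_ia_per_month = 100    # quoted S3 infrequent-access estimate
    drive = 200              # typical 8 TB drive price mentioned above
    raspberry_pi = 50        # "under $50", rounded up
    return {
        "glacier": glacier_per_month * months,
        "s3_infrequent_access": s3_ia_per_month * months,
        "diy_pi_plus_drive": drive + raspberry_pi,
    }


print(two_year_costs())
# {'glacier': 720, 's3_infrequent_access': 2400, 'diy_pi_plus_drive': 250}
```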

I would really recommend using duplicity[1]. It supports gpg, incremental backups and more.

[1] http://duplicity.nongnu.org/

Though anecdotal, the one time I tried to restore my data from a duplicity backup (aside from when I initially started using it), I got a stack trace with an unhandled KeyError. I was never able to find out why the error happened or how to fix it, but thankfully I was just testing my ability to restore.

Anyway, I would recommend against using it - errors like that should be impossible when you're not doing anything exotic.

Can anyone else vouch for this being a stable, long term solution for backing up my home directory from macOS? Every single backup/cloud file storage gives me anxiety that I did something wrong or that there are bugs not discovered by retrieving random archives as a test.

From: http://duplicity.nongnu.org/features.html

> Although you should never have to look at a duplicity archive manually, if the need should arise they can be produced and processed using GnuPG, rdiff, and tar.

I haven't set it up yet, but it's going to be my holiday project this Christmas..

There is also Duplicati https://github.com/duplicati/duplicati

Similarly there’s the annoyingly named duplicati which also works pretty well. I have it set to send snapshots to google drive, encrypted of course.

Tar works for small things, but it becomes unwieldy once the number and size of files grow arbitrarily. Rsync, and rsync-based tools such as rbackup and rsnapshot, are great, and these days Borg has been getting a lot of traction. Bacula is great but complicated to set up and manage. There is a newer, very interesting one called Ur that seems promising... And my mind is blanking on some of the others; I'll comment when I'm not mobile so I can look at my list.

Also don't forget zfs/btrfs functions that might be relevant.

I second Bacula as something to look into. The configuration does take some time to understand, but it is incredibly flexible. It's also industrial strength and reliable.

Worth checking out for backups is git-annex with remote bup repositories:

* https://github.com/bup/bup

* http://git-annex.branchable.com/special_remotes/bup/

I didn't know about `bup`! Thanks for the reference! I was reading the README and it indeed looks pretty great. Have you been using it? What's your experience with it?

The ability to resume from an interrupted state seems pretty amazing:

> You can back up directly to a remote bup server, without needing tons of temporary disk space on the computer being backed up. And if your backup is interrupted halfway through, the next run will pick up where you left off.

A great, though non-free, backup tool that supports incremental encrypted backups is Arq[0].

Arq supports painless backups to multiple cloud storage providers, with budget enforcement, etc.

[0] https://www.arqbackup.com

With Btrfs or ZFS the whole concept of incremental backup is reshuffled: basically you have consistent full backups, only occupying the space of a full backup + all the diffs, and you only transfer the changed blocks. It's really something with no comparison to other methods.

While I love ZFS, let's not forget that btrfs has its strengths (a lot more flexibility, mainline support), and, provided that your use case is single disk, RAID 1 or RAID 10, has been working super reliably for some years now.

I recently wanted to get backups going in a similar way (incremental with copies on-site and off-site). I ended up using Borg and rclone, sending the backups to s3 compatible Wasabi storage. It's been working great, and coupled with zerotier on machines that are not in the house (family members), I ended up with a pretty resilient system that only costs a few dollars per month. I wrote up the details here: https://localconspiracy.com/2017/10/backup-everything.html

I'm not sure if tar is a good format for backups, due to its extremely sequential nature (e.g. you can't get a list of the files in an archive without scanning the whole archive...)

What overhead does that add in practice? Like a second or so to scan a few GB of tarball? Does that matter?

It often does, especially if the tarball is compressed (especially if you use anything but Zstandard) or on a network store rather than a fast local SSD. As a rule of thumb, a hard drive with a spindle may give around 100 MB/s of throughput, so that is likely the bigger bottleneck, followed by CPU if using slower compression algorithms.

Zstandard at a high compression level gives a good tradeoff of decompression speed vs underlying physical throughput, be it network or spindle.

Even so, yes, the sequential nature of tar is not great for this reason.

Squashfs files, which can be used on Linux and Mac (using fuse for OS X), are a very good alternative for archives, though without some of the features described in the linked article.

You need to download the entire thing to scan it.

If it already resides locally on seekable media, there's little drawback.

I guess it can be cumbersome as well if there are many small* files (since you need to iterate over all of them) until you find the one you are looking for.

Think of it like a linked list.

*: As opposed to a few large files.

This issue occurred to me as well (especially for encrypted tarballs). My workaround was to create a listing of the files that accompanies the tarball, such as: "backup.tar" and "backup.list".
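A minimal sketch of that workaround using Python's standard `tarfile` module, in case it is useful. The function name and default file names are just placeholders, and the encryption step is left out; the listing would be written before the tarball is encrypted.

```python
import tarfile


def create_backup(paths, tar_path="backup.tar", list_path="backup.list"):
    """Write `paths` into a tarball and save a plain-text listing next to it."""
    with tarfile.open(tar_path, "w") as tar:
        for path in paths:
            tar.add(path)  # directories are added recursively
    # Scan the tarball once, locally, so later lookups only need the small
    # listing file instead of the whole (possibly remote) archive.
    with tarfile.open(tar_path, "r") as tar, open(list_path, "w") as listing:
        for member in tar.getmembers():
            listing.write(member.name + "\n")
```

Finding out whether a file is in a given backup then becomes a grep over `backup.list` rather than a download and scan of `backup.tar`.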

Note that one point of failure of this setup is that you are keeping the snapshot/index files locally. Ideally, you want to back those up as well.

I use restic to minio stored on a raid1 btrfs system and I love it.

You'd definitely want to do a full backup occasionally.

I totally agree!
