Duplicity + S3: Easy, cheap, encrypted, automated full-disk backups (phusion.nl)
171 points by tortilla on Nov 11, 2013 | 98 comments

I use Arq[1] on my macs and it does the same thing but handles all the "complicated" bits that I'd rather be programmatically dealt with. I have 6 macs, about 2TB of backups in Glacier and a ~20GB "system image" backup in hot-storage on S3.

[1] http://www.haystacksoftware.com/arq/

I also use Arq and recommend it to everyone on OS X. It has a sane transparent format (modelled on git), it can store to Glacier and it doesn't periodically explode like Backblaze used to.

Interested to hear more about your "system image". How do you do it and what do you store in it? Documents / preferences / Library etc? How would you use it in a recovery situation?

It's a separate backup vault in Arq that's maintained by clean OS install + my essential documents, dotfiles, and an encrypted PII volume with scans of passport/SSC/Drivers license/Birth certificate. It's all on a 60GB SSD that's booted every week to update to the latest versions of everything and then to be Arq backed up.

Restoration is downloading the vault to a new SSD and shoving it in a computer. The "root" user is included in the backup as an encrypted volume. My setup's a little strange in that the OS disks on my computers are filevaulted and each user's homedir is also in an encrypted volume, which only they have the password to, that's mounted to the correct /Users/folder with fstab.

(It'd probably be helpful to note that this is done on a hackintosh with 10 internal bays, one of which holds this SSD, with a clone of this SSD kept elsewhere)

Could you write that up in a blog post please? I'd love to read how that works. (I'm the author of Arq)

I just needed to upvote you for creating Arq. For a developer it's easy to guess at what it's going to do and it doesn't ever seem to do anything it shouldn't.

There is one question I've had about it. I keep everything in Glacier (about 700GB, I think). My understanding is that nothing is deleted from Glacier. If I deleted a whole backup set and then re-added it, would it upload the content again?

Also, I'm keeping an eye on http://www.filosync.com and I'm planning on trying it out soon. We currently run aerofs which is mostly really good - but sometimes we get into a situation (of our own making, we run it on an aws spot instance that can vanish) and sorting it out is hard work (and time consuming). The most frustrating aspect is that we don't know how it works, so we don't know what to expect when we perform certain operations. Sometimes it doesn't work how we expect and that bites us (I'm currently waiting for 12GB to sync - even though all the machines already have all the content).

Given my (really positive) experiences with Arq I'd like to try a product that was a bit more open about how it worked.

I could probably whack something together, but what specifically were you wondering about?

I'm wondering about the whole process. It sounds like an external drive that you periodically copy your documents, dot files etc to, and then let Arq back up? Or is it something else? (You talk about "booting" the disk -- do you mean booting the computer from that disk, or just mounting it?) Also I didn't understand the "root" user part. And is this a multi-user computer? You mentioned something about "each user".

Ah, I suppose I should break down my setup.

My main system: Hackintosh -- core i5 / 32GB RAM / GTX680, 2 x 250GB SSD, 2x 60GB SSD, 6 x 3TB 7.2K drive. (The handy thing here is my hackintosh build runs the vanilla kernel and I patch all my kexts at runtime so all the binaries are stock -- thus the backups are portable). This is my workstation and it's primarily a single-user system but there is the occasional remote user login. I use this same setup on all my macs though, and this is how it works:

   1 60GB SSD for the OS, applications, and caches
   X number of SSDs (usually 256GB), one for each user
The 60GB OS SSD is filevaulted with a password known to all users, so the disk is FDE'd while any of us can boot it. Each user SSD is filevault encrypted with their personal key. Each user is assigned a bash alias to mount their drive (So you can paste keys if you're using pubkeys with secure entry):

       alias maroch='diskutil cs unlockVolume <logicalvolumeid>'  ## mount aroch's home directory
And there's the associated fstab entry

    /Volumes/aroch /Users/aroch hfs rw,bind 0 0
The "root" account is an admin account set up much the same way, but it is just an encrypted DMG and is mounted at boot using launchd and a hardware dongle as the password. If no dongle is present, there's a non-admin account with a real homedir that can also be used to elevate into an admin shell.

The 60GB backup SSD (in slot #4): An install of (now) OSX Mavericks that includes my standard set of homebrew installs, xcode and the ~20 applications I use daily + Arq. It also has a backup of my fstab and the root account DMG. There's a daily cron that copies my personal dotfiles to it as well. There's a weekly cron that reboots the workstation into the backup SSD, and there's a startup script on the backup SSD that checks for system updates / MAS updates, applies them, runs an Arq backup (takes ~10 mins usually) and then reboots back into the main OS SSD. The files total about 17GB right now but the delta is usually less than 100MB.

If needed, the OS and settings can be restored with Arq by using the backup SSD's vault

Sounds like an awesome setup. So in theory you could restore your backup system drive from the workstation to a laptop, as you are running the vanilla kernel? How would one do that in practice? I was thinking about a way to sync my workstation and laptop in an easy way on a system level at night/morning. Would something like this work for that? E.g. using the nightly system drive image of the workstation and cloning it to the laptop in the morning?

Yes, in fact I have restored from my workstation to my mac mini and MBP in the past. Cloning across two drives daily is probably not worth it. You'd be better off using rsync / rdiff over your local network.

ok sounds reasonable, but can you also sync system files (like installed homebrew etc) that way ?

Yes, you can sync /opt/ and make sure to have some key excludes in there (/var/ and /etc/ being the most likely)

Wow! That is an impressive setup!

Is there something like Arq compatible with both OSX and Windows ?

Features I'm interested in:
- must use S3 or similar, as long as it's private and "unlimited"
- must support multiple clients, even though I don't need real time sync
- should encrypt on the client without need of a separate sw

I've been using Dropbox + CloudFogger so far, but dropbox doesn't scale anymore with the quantity of data I have, mostly due to the number of files (and I don't like depending on two different sw for one task).

The only two I have used are git-annex [1] and duplicati [2]. I much prefer git-annex but the windows version isn't so stable.

[1] http://git-annex.branchable.com/

[2] http://www.duplicati.com/

Another happy Arq user here -- I just use it for Glacier backup, no need for hot S3 storage.

How do you like glacier? It seems it's fairly cheap until you want your data back. Any insights?

Personally I've factored that in to my cost of recovery. It's going to cost me about £120 to restore my data over the course of a week. It depends on your use case but for me - in the event that glacier is the last point I can get my data from, £120 is a small price to pay to get it back. I used backblaze before and it was about the same to get them to post out a hdd (which was an awesome service on the one occasion I had to use it).

In addition, Glacier should be thought of as "offline" storage , like a tape rotation scheme. When thought of that way, the cost for restoration is non-existent.

It's not for "I may need this file next week" (that's what your local NAS / Time Machine is for). Glacier is for "what happens if this place burns down". At that point a week and £120 is not high on the list of your concerns...

It's too bad Amazon doesn't offer a discounted rate to restore from Glacier if you pay upfront.

Sorry to chase you between threads, but yes, I'd like to get in touch for the Getty files! Care to share contact info?

m8r-fhf8r1@mailinator.com (apologies, would prefer not to share my public email address here)

I'll reply back from my personal email account.

I like it quite a bit. Arq is my "backup of last resort" and what I use to provision new Macs if they're not with me. I have daily/weekly/monthly local backups and semi-local, bi-weekly backups (stored offsite or in a building with a different mac). Setting up a new Mac remotely costs about $30 for the 60GB provisioning profile to be downloaded and restored (maybe more or less depending on the internet connection at the place). I pay a few hundred yearly for backups that I have no doubt will be there if I need them.

Arq is simple and works great, although it seems that half of my S3 bill is for the thousands of requests that Arq makes back and forth. A simpler upload & forget backup model could probably save a few bucks...

I make, on average, 40K requests a month which is like $2 while my storage costs are ~$22. At the low end you'll be paying what seems like a lot for requests but as things scale your requests costs become a much smaller percentage.

You can cut down on requests by moving to daily backups (what I do) and narrowing down what you're backing up. Backing up your applications folders will take a ton of requests (some apps have upwards of 1000 assets whose bless values may change if you open the application, and this triggers Arq).

Has anyone been able to accurately figure out how much Glacier will cost once storage is retrieved from it?

Above the free retrieval quota (5% of your storage per month), pricing is based entirely on your peak hourly retrieval rate for the month, billed at $7.20 x [peak rate measured in GB/hr].

So for example, if you retrieve 100 GB in one hour, that's $720. If you retrieve 10 TB at the rate of 100 GB/hr for 100 hours, that's also $720.

Translated into bandwidth (assuming you're pulling at this rate for at least an hour), it's about $3 per Mbps. So if you limit your retrieval to 10 Mbps, you'll pay $30, regardless of how much data you retrieve. This makes it fairly easy to cap your expenditure, if you have a client that can stagger its requests appropriately (you need to throttle the Glacier retrieval requests, possibly using Glacier's range requests if you have very large files, not throttle at the network level).
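The arithmetic above can be checked with a small sketch. It assumes the pricing rule exactly as stated in this comment ($7.20 per GB/hr of peak retrieval rate) and ignores the free 5% quota; the function names are made up for illustration and are not part of any AWS API.

```python
# Sketch of the Glacier retrieval pricing described in the comment above.
# Assumption: cost depends only on the peak hourly retrieval rate, and the
# free monthly quota is ignored to keep the arithmetic simple.
RATE_PER_GB_HR = 7.20  # dollars per (GB/hour) of peak rate, as quoted above


def retrieval_cost(peak_gb_per_hour: float) -> float:
    """Monthly retrieval charge for a given peak hourly retrieval rate."""
    return RATE_PER_GB_HR * peak_gb_per_hour


def mbps_to_gb_per_hour(mbps: float) -> float:
    """Convert a sustained rate in megabits/s to decimal gigabytes/hour."""
    megabytes_per_second = mbps / 8
    return megabytes_per_second * 3600 / 1000


def hours_to_retrieve(total_gb: float, peak_gb_per_hour: float) -> float:
    """How long a retrieval takes when throttled to the given peak rate."""
    return total_gb / peak_gb_per_hour


if __name__ == "__main__":
    # 100 GB in one hour versus 10 TB spread over 100 hours: same peak rate.
    print(retrieval_cost(100), hours_to_retrieve(10_000, 100))
    # Capping retrieval requests at roughly 10 Mbps.
    print(retrieval_cost(mbps_to_gb_per_hour(10)))
```

The point of the model is that stretching a restore over more hours lowers the peak rate, and with it the bill, without changing how much data comes back.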

Here's my last bill:

    $0.010 per GB / month - Storage    2112.146 GB-Mo     21.12
    $0.050 per 1,000 Requests          40,447 Requests     2.02

My last "restoration" was actually setting up a new iMac at a remote site. It involved downloading ~60GB over ~20 hours and cost me about $30.

Do you recommend Glacier over Crashplan? I'm thinking about switching

They are very different.

Glacier takes hours just to get the listing of files, or start the file download.

Their pricing is very different. Crashplan charges per computer. Glacier charges per GB of storage and per GB of transfer out.

In what cases would it be better to use Glacier? I don't have too many GB, ~30GB.

Glacier is for when you want to store essentially never-changing backups that you expect to very, very rarely access. You can store a couple GB or petabytes.

Think an archive of family videos and photos spanning 30 years, or all the tax documents for a large enterprise.

A 30GB backup, that's updated nightly, on AWS Glacier will cost somewhere in the neighborhood of $1 a month. A 300GB backup somewhere around $5 on AWS.

Glacier is cheaper for small backups that you almost never touch.

This article is about backups for servers. I think some people are missing that part.

For my personal home machines, I was recently researching backup systems. I decided against cloud backups and opted for the simple solution of a USB 3.0 drive. It's cheap, and if I ever need to get my backup quickly, I just pick it up and take it. For critical documents, I can't imagine most people have more than a few megabytes anyways. That stuff can be archived/encrypted periodically and sent to cloud email (gmail, yahoo) much cheaper and easier than worrying with S3.

For the actual backups, I'm using Linux so I went with rsync. See here for roughly how it's done: http://www.mikerubel.org/computers/rsync_snapshots/

The USB drive is encrypted, I then tell rsync to use hard links and it does the incremental backup. The nice thing about hard links is you can get at any file you want to, in any backup you want to while preserving space. Files in new backups that exist in prior backups are simply hard links. Then you just "rm" older backups when no longer needed. It's an incredibly elegant solution on Linux, especially if you combine it with LVM snapshots.
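For readers who want to see the hard-link idea in isolation, here is a minimal Python sketch. It is not rsync and not the linked script: it compares file contents byte for byte (rsync by default looks at size and modification time), handles regular files only, and the function name `snapshot` is hypothetical.

```python
import filecmp
import os
import shutil
from typing import Optional


def snapshot(source: str, previous: Optional[str], target: str) -> None:
    """Copy `source` into `target`, hard-linking files unchanged since `previous`.

    `previous` is the directory of the last snapshot, or None for the first run.
    """
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        os.makedirs(os.path.join(target, rel), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target, rel, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                # Unchanged: the new snapshot shares the old snapshot's data.
                os.link(old, dst)
            else:
                # New or modified: store a fresh copy (with metadata).
                shutil.copy2(src, dst)
```

Because every snapshot directory holds a complete tree, deleting an old snapshot with "rm" only frees the data that no newer snapshot still links to.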

Caution: if you are using a single drive you are in danger.

I also recently did a bunch of research on the best way to back up around 500 GB of data (too expensive for my taste to put on something like S3 and I also needed a NAS). The conclusion I walked away with was that at least two drives and ZFS is a must. Silent drive corruption is a real problem. ZFS takes care of that much more intelligently than any hardware or software RAID and sure as hell better than no redundancy at all.

IMO, the ideal setup for a home backup system is two servers with ECC RAM with 2-3 drives each running ZFS and connected over at least a gigabit link. One of these would be the primary NAS server, and the other the backup. All important documents: your Documents folder, tax returns, home videos, pictures of kids, etc. go on the primary server. The backup retains months of copies of the primary server and exposes the files over SMB and AFP as read-only (unless you are all Linux, don't bother with NFS as nothing else supports it properly, even Macs). The backup server also has raw space for system-level backups, such as Time Machine, etc. Basically, space that's not backed up twice.

My current setup is simpler: just one machine with no ECC (this makes me very nervous), with two drives running ZFS in a mirror setup. Since I am treating this array as reliable, I am using it for both primary storage and backup storage, though I will be changing this at some point soon when I get the time/money to build things out a bit. My biggest worry is physically separating primary and backup servers to different locations, at least within the house.

As for software, I use netatalk for AFP (Mac file sharing), Samba for SMB (Windows), and rsnapshot for backups. rsnapshot is particularly nice because it's the least pain in the neck to browse backups. It is similar to Time Machine but FOSS.

I've researched this for an extremely long time as well and came to very similar conclusions:

* Small-ish type stuff on SpiderOak and Dropbox (both on free tiers). I assume both of these are "compromised".

* Data that I want to keep /private/ go into an encrypted USB flash drive and onto Tarsnap. I doubt this will ever get larger than 10-20MB.

* I have ~5 full HDD images of Linux boxes that I need to keep for archival purposes and storing that on a cloud backup service would be a pain to upload and download. Those are on a Synology 213+ and on a larger capacity USB3 drive.

I'm still evaluating git-annex, but right now I can't seem to find a workflow to my liking. I feel like it /could/ be something I'd use to replace most of the above, but I can't see it yet.

> I'm still evaluating git-annex

I thought git-annex would be a good solution but I can't seem to get it to stop locking files I'm working on without using its 'unsafe' mode. I can't seem to figure out the workflow it was designed to support.

May I ask which model of USB drive you're using, if it works reliably with Linux?

I was reading about these just last weekend, and was surprised to hear that several of the current, big name brand USB hard drives seem to come with "clever" software, not necessarily easily removable, that means they don't work like a vanilla disk any more. The result is that operating systems without supplied drivers, such as Linux, won't necessarily be able to access the drives properly.

I have no way to tell how much of this is real and how much is down to misunderstanding, but it was a recurring theme and applied to multiple major vendors, so as appealing as this simple approach is, I don't really feel like trusting my critical back-ups to any of these drives until I've got some facts.

Seagate and WD drives have worked fine for me, both on ARM and x86.

WD drives detect a wd_ses device as well as the drive and I think that gives me LED control (great for turning off blinkenlights at night)

My main problem with backup drives at the moment is getting hdparm/power management settings to stick so the drives will spin down after an amount of time. I've had more luck with USB than ESATA in this case.

I'd recommend AGAINST any ESATA/USB "docks" - the USB to SATA bridges seem to have a very high failure rate (https://bugzilla.redhat.com/show_bug.cgi?id=895085 for the first one I saw on Google)

I use this, specifically: http://www.newegg.com/Product/Product.aspx?Item=N82E16822178...

It's a Seagate Backup Plus 3TB. I believe Newegg had it on sale for $100 when I picked it up. Been using it for maybe 3 months now. So far so good.

The drive seems to have no problem spinning down. But I typically unmount and unplug when I'm not using it, just to be on the safe side.

I also looked into NAS, which is what I originally wanted to go with. But my concern was none of the cheap ones seemed capable or fast enough to do encryption.

I use a similar approach, rsnapshot. What I worry about though is that it seems very wasteful when backing up a whole system. Each time the backup is happening I hear the disk churning for half an hour, and I realize it's mostly recreating hard links to files that _haven't_ changed. So I wonder which backup method would require the least amount of disk activity.

I just use plain rsync and then do a btrfs snapshot each time. Very fast and easy to set up. Probably just as easy with zfs if you prefer that.

Of course I've got some setup around creating datestamped snapshots and keeping the N last ( http://sprunge.us/OKHb?bash for the bash function that does this), but in essence it's just

    $ rsync -a src/ trg/current/
    $ btrfs subvolume snapshot -r trg/current "trg/snapshot-$(date +%F_%H%M%S)"
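The "keep the N last" part can be sketched in a few lines. This is not the linked bash function; it is an illustration that assumes snapshot names embed a sortable timestamp (for example snapshot-2013-11-11), and `snapshots_to_delete` is a hypothetical name.

```python
from typing import List


def snapshots_to_delete(names: List[str], keep: int) -> List[str]:
    """Return the snapshot names that fall outside the newest `keep`.

    Assumes names sort chronologically, e.g. an ISO date suffix.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(names)  # oldest first, because ISO dates sort as text
    return ordered[:-keep] if keep else ordered
```

Each returned name would then be passed to whatever removes a snapshot on your filesystem; that step is left out here on purpose.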

Not sure. You might want to see if the --checksum option is enabled (or disabled?). Also, rsync man page says it does a checksum even without that option, to verify the file has been copied correctly. This may be what is causing a lot of I/O.

I also forgot to mention that hashdeep/md5deep should probably be integrated into this sort of backup system (http://md5deep.sourceforge.net/). What hashdeep does is generate a list of checksums for a given backup. You would store this list with the backup, and then use hashdeep to "audit" the backup when you restore it, or to check if hardware is failing.
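The manifest-then-audit idea looks roughly like this in Python. It is only a sketch of the concept using the standard library: it does not read or write hashdeep's file format, the choice of SHA-256 is an assumption, and the function names are invented for the example.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def audit(root, manifest):
    """Compare `root` with a stored manifest; return (missing, changed, extra)."""
    current = build_manifest(root)
    missing = sorted(set(manifest) - set(current))
    extra = sorted(set(current) - set(manifest))
    changed = sorted(
        path for path in set(manifest) & set(current)
        if manifest[path] != current[path]
    )
    return missing, changed, extra
```

Build the manifest right after a backup and store it alongside; a later audit that reports changed files nobody touched is a hint that the media is going bad.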

I use duplicity extensively and I've written my own frontend around it. Duplicity itself works very well, though has many areas in which it could improve.

One is, it could play within the Python ecosystem more nicely; it doesn't install with the usual setup.py mechanism and also is very resistant to any kind of API/in-process usage; its various features are locked up tight within its commandline interface so writing frontends means you pretty much have to pipe to a process to work with it.

Another is that it needs a feature known as "synthetic backup". If you read up on how people use duplicity, a common theme is the need to "run a full backup" periodically. This is because you don't want to have months worth of backups as an endless long stream of incremental files; it becomes unwieldy and difficult to restore from. In my case, I'm backing up files from my laptop, and a full backup takes hours - I'd rather some process exists which can directly squash a series of incremental backup files into a full backup, and write them back out to S3. I'm actually doing this now, though in a less direct way; my front-end includes a synthetic backup feature which I run on my synology box - it restores from the incremental backup to a temp directory, then pushes everything back up as a new, full backup. My laptop doesn't need to remain open, it only needs to push small incremental files, and I get a nice full backup of everything nightly.

The issue I have with duplicity is that every N backups have to be a full backup, so a large backup (such as my photo collection) becomes impossible.

Is this a technical limitation? I'm not seeing anything in the docs that forces you to do a full backup after N amount of backups (maybe I missed this).

There is no technical limitation. You can do as many incremental backups as you want and are not forced to do a full backup after the initial one. As a practical matter though, I've found it's a good idea to occasionally do a full backup because the metadata can get out of sync (especially if something goes wrong like you lose your connection during an incremental backup).

In addition to the other reasons given, there's a reliability problem in relying on an ever-growing chain of incremental backups: you can only recover your latest state if every increment in the chain is uncorrupted.

I would much rather use an incremental backup that records the diffs in reverse, so that the latest state is stored explicitly, and the earlier versions are recovered through a chain of reverse diffs.
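A toy model makes the difference concrete. It treats a backup as a dict of path to content and a delta as a dict of changes; real tools use binary deltas, so this is only an illustration of the chain structure, and all names here are hypothetical.

```python
def diff(old, new):
    """Delta that turns `old` into `new`; None marks a deleted path."""
    delta = {path: content for path, content in new.items() if old.get(path) != content}
    delta.update({path: None for path in old if path not in new})
    return delta


def apply_delta(state, delta):
    """Return a new state with `delta` applied."""
    result = dict(state)
    for path, content in delta.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def forward_chain(versions):
    """Oldest version stored in full, then deltas going forward in time."""
    return versions[0], [diff(a, b) for a, b in zip(versions, versions[1:])]


def reverse_chain(versions):
    """Newest version stored in full, then deltas going backward in time."""
    backward = [diff(b, a) for a, b in zip(versions, versions[1:])]
    return versions[-1], backward[::-1]


def restore(full, deltas, steps):
    """Apply the first `steps` deltas to the full copy."""
    state = full
    for delta in deltas[:steps]:
        state = apply_delta(state, delta)
    return state
```

With a forward chain of n increments the latest state needs all n deltas intact; with a reverse chain it needs none, and only older versions depend on the deltas.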

Duplicity's security goals make it hard to implement this kind of reverse diff structure.

You can do as many incrementals as you want, but you can't delete any.

So if you want to expire old backups after a while you must do a full backup, then delete the entire full+incrementals chain.

rdiff-backup stores diffs in reverse so you can always easily delete the oldest ones.

Duplicity archives and encrypts all its files so it really doesn't have much choice in the matter. rdiff-backup doesn't encrypt anything - it can't - it needs the full file on the other end to use for rsync.

You can do many many incrementals, but the restoration process, especially from a remote source is incredibly slow.

It's not that bad. I do it from a shiva plug over a 10mbit DSL link for a few dozen gigs regularly. It's as fast as the DSL will go.

I've been using BackBlaze. I find it both cost effective and time saving. Here's why:

Price: It costs only $3.96 a month if you pay for two years up front. ($4.17 a month if you pay for only 1 year up front.) They offer unlimited storage of everything on internal and external HDDs. 30 days of file revisions too. I think I have close to 1TB of data backed up with their service. (Using Duplicity w/ S3 would cost me $90 a MONTH in comparison.)

Time: I've always been using Windows File History to backup onto an external USB HDD, but I wanted an off-site backup. I initially started backing up to S3 myself and even dabbled with Glacier. However, maintaining this setup proved to be quite messy. Dealing with encryption and decryption was painful as well. Having a simple foolproof interface was worth it.

I've been using them for the past two years as a remote backup. Locally, Time Machine. I don't even have to think about it, which is the biggest appeal for me. I come home and everything gets backed up around midnight.

Quick correction: it's not true that rdiff-backup has to be installed on the remote server. Rdiff-backup works brilliantly with a dumb target for storage and that's how we use it in production.

Our servers periodically run rdiff-backup of all important data to a /backup partition on the same server. Then this /backup partition -- with the backup version history and metadata -- is rsynced to various dumb backup storage locations.

We've tried many backup solutions, and rdiff-backup is by far the fastest and most robust backup program we know.

You might want to reconsider this approach. I do something similar, but my remote backup also uses rdiff-backup. While it's awfully tempting to rsync the local backup to a remote location, it puts your entire backup at risk because rsync doesn't care about maintaining the state that rdiff-backup depends on. If your local disk is failing with transient errors, or your local /backup mount point disappears, the next rsync could mirror the corruption, leaving you without any backup at all. It's not worth the time savings and after the initial full backup the network overhead is negligible, even over ssh.

> is rsynced to various dumb backup storage locations.

Rsync requires a program on the other end. True, it's a more commonly installed program than rdiff-backup but it's not a dumb backup.

The difference is that rdiff-backup requires the exact same version on both ends.

What's the benefit of Duplicity over Tarsnap http://www.tarsnap.com/ ? Thanks

I played a bit with both and they both worked well.

Tarsnap: easier to set up and use

Duplicity: fully open source, and I think the restore speed was significantly faster ... but I've not properly benchmarked it apples-to-apples.

If you are backing up a lot of frequently changing files (eg. database backups) then tarsnap's bandwidth charges (probably caused by the underlying EC2 charges) can swamp the storage cost. Running duplicity against rsync.net avoids this.

restore speed was significantly faster with duplicity? I haven't used tarsnap yet, but one of my main gripes with duplicity is the rather slow restore speed. Is tarsnap even slower?

Depends. tarsnap restore times should be constant regardless of which backup you pick, since it's just downloading the relevant blocks. duplicity does full backup + diffs, so if your data changes quite a bit you can end up downloading a lot of data that doesn't land on disk. (I believe that's right, not verified.)

Duplicity is an open source project that can be used with different types of storage (I use them with OpenStack Object Storage instead of S3).

You can host your own backups if you want (supports ftp, sftp, ... check the docs: http://duplicity.nongnu.org/duplicity.1.html), but Tarsnap is a hosted solution.

I know it's not the point of the article, but "Duplicity" is a horrible name. It has negative connotations of dishonesty and is only in the same family as the word "duplication."

How would one do something similarly easy and economically efficient for a Windows platform? We've looked at different services but they are always so expensive. I would love a piece of software that we can place on each server and let it do its thing.... of course there is always SQL Server to contend with, again... expensive solutions exist.

Check out Crashplan. You can use their software for incremental backups to the destination of your choice at no cost, and it's multi-platform provided you install Java.

Another Crashplan fan here. You can also install the Crashplan client on headless Linux-based NAS devices, so long as they can run Java (such as ReadyNAS).

I feel your pain. I actually ported Duplicity to Windows about 3 years ago for a client, but the code was never merged (or commented on).[1]

In the end, that client decided that it was easier to use a commercial package, given their typical use case. Tried Carbonite (didn't work, support unhelpful) but settled on JungleDisk. YMMV.

1. https://code.launchpad.net/~kevinoid/duplicity/windows-port

Duplicati[1] is essentially duplicity reimplemented for Windows. I use that on a few computers and it seems to mostly work.

1. http://www.duplicati.com/

I haven't used it, but some of our customers use it and it seems to work pretty well.

Also it's easy to use: http://www.memset.com/blog/backup-your-windows-server-memsto...

It has a very stupid limitation: it cannot run as a service. The programmer has been struggling for years to implement this feature.

Personally I'd probably use something to sync the lot over to a linux box and do the rest of the work from there. Depends on how much data you need to sync / how many servers are involved. Could be a pain to restore too (depending on the type of data you're backing up) so you would want to weigh all of that up too. With the Sql Server backups you could run periodic backups from Sql Server itself and have those backed up - depending on the data this could be really suboptimal but I think sql server allows incremental backups in the backup set, doesn't it? Been a number of years since I've had to deal with it.

You can use some sort of rsync alternative to get the content from the windows boxes to the linux machine (cwrsync is recommended - though I've never used it).


Has anyone tried the new Glacier-clone of OVH [0] yet? At 0.008 EUR/GBP it seems to be priced very reasonably, and there are no crazy retrieval price structures (although retrieval has a 4 hour lead time, just like Glacier).

[0] http://www.ovh.co.uk/cloud/archive/

Cheap? Pish. A lot of server hosts have a local backup solution.

For a 2GB Linode, we're talking about backing up 96GB of storage. More than that, three backups are made (daily, weekly and 2-weekly). 288GB of active backup storage. Handled for you. For $10 a month.

Amazon is over twice that before you even consider the hassle of setting it up and restoring backups.

Until linode breaks or is compromised and your server and backups both get broken or hacked and deleted.

I don't trust hosting my server and backups with the same company. I backup to s3 with s3 credentials that forbid removal; files are expired out of the duplicity-related s3 buckets using native amazon functionality, not duplicity. It does mean there are some stale incrementals, but that's a small price to pay.

Homepage at linode.com lists 2GB Linode at $40 a month. Am I missing something?

duplicity is nice, but it doesn't do dedupe. It's also slow.

I currently prefer http://liw.fi/obnam/ which does dedupe and every backup is a snapshot. Only downside is that encrypted backups are broken if you use the latest gpg.

I use obnam too. Way better than duplicity with generation snapshots.

If you're interested in this, you should look into DreamObjects (http://www.dreamhost.com/cloud/dreamobjects/). It's a redundant cloud storage service which is compatible with the S3 and Swift protocols at less than half the price ($0.05/GB). It can be used with duplicity (http://www.dreamhost.com/dreamscape/2013/02/11/backing-up-to...).

duplicity + anything = easy, cheap, encrypted, automated full-disk/incremental/etc backups.

duplicity's simple yet pretty awesome.

S3 is actually not that cheap. I have well over 100 GB of data to back up (and that's not even a full backup by any means, just the stuff I don't want to lose and can't get back by redownloading and reinstalling things), which racks up a significant bill on S3. CrashPlan isn't cheap either but they're the cheapest of the bunch, so I use them instead. Actually getting the data back in the case of disaster would take a while but I could certainly restore the most important things first, then backfill the rest as I go.

I'm hoping that Space Monkey (http://www.spacemonkey.com/) helps solve some of this. It backs up to a local appliance-style box, then syncs with other space monkey boxes in a peer to peer manner. Like cloud storage, but cheaper at $10/month for 1TB.

Duplicity also works with Rackspace Cloud files as detailed here: http://gsusmonzon.blogspot.com/2013/07/backup-with-duplicity...

I couldn't disagree more. Duplicity is awesome, I had it running in a prod environment that had some intense security requirements (addressed with GPGcards for keying)

But for system backups, you need something simple and easy. Just get arq/etc.

Duplicity has consistently failed to restore for me in the form of deja-dup on Ubuntu.

Restoring from duplicity takes time proportional to the backup chain length, as archives are stored as forward-deltas.

Nothing against forward deltas, but just consider that:

- restore time increases with increment count
- corruption of a single archive makes all further increments worthless

These are very important things to consider when doing a backup.

Currently, you have to limit the chain length by performing a full backup every N increments (with N being as short as possible), which defeats the purpose of efficient increments.

I have requested the ability to manually specify the ancestor of the increments (by time or count), so that one could implement a non-linear hierarchy with multiple levels like one normally does with tar/timestamps, but the request was dismissed as "unnecessary given the delta efficiency" (despite the fact that efficiency is just one variable). Having a 3-level backup (daily, weekly, monthly) would make duplicity much more space efficient, reduce the number of full backups needed, and shorten the chains to the level that would make restores of "last year" actually _possible_.
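To put numbers on the chain-length argument, here is a small model of how many archives a restore has to read under each scheme. This is only a sketch: the period values (a full backup every 364 days, "monthly" every 28 days, weekly every 7) are assumptions chosen so the levels line up, and the leveled scheme is the tar-style hierarchy described above, not something duplicity offers.

```python
def linear_chain_length(day, full_every):
    """Archives read to restore `day` when each increment depends on the
    previous one and a full backup is taken every `full_every` days."""
    return day % full_every + 1


def leveled_chain_length(day, periods):
    """Archives read when each level is a delta against the most recent
    backup of the level above it (tar-style levels).

    `periods` lists the period of each level in days, coarsest first,
    e.g. (364, 28, 7, 1) for full / monthly / weekly / daily. Each period
    is assumed to divide the previous one evenly.
    """
    count = 1                       # the full backup is always needed
    remainder = day % periods[0]    # days since the last full backup
    for period in periods[1:]:
        if remainder >= period:     # a backup of this level exists since its parent
            count += 1
        remainder %= period         # days since the last backup of this level
    return count


if __name__ == "__main__":
    day = 363  # last day before the next yearly full backup
    print("linear: ", linear_chain_length(day, 364))                  # 364
    print("leveled:", leveled_chain_length(day, (364, 28, 7, 1)))     # 4
```

In this model the worst-case restore reads 364 archives with a single linear chain and 4 with the leveled scheme, and a corrupted daily archive only affects that one day rather than everything after it.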

I sent several patches to fix duplicity's behavior with files larger than 1 GB (by limiting block counts), which got integrated, but they are still a far cry from making duplicity work decently as a whole-system backup & restore solution. It's just too slow. And like you said, several bugs afflicted duplicity in the past that would make restores fail in many circumstances. I also debugged my share of issues, which led me to think that very few people actually tried to restore from arbitrary increments with duplicity and/or used it to archive a large system.

Many got fixed, but I won't consider duplicity again until I can control the incremental ancestor and reduce the chain lengths (and it's silly to think that "rdiffdir", distributed with duplicity, would allow for that easily).

Nowadays I use rdiff-backup, and use a second script to tar/encrypt the deltas after the backup has completed.

I'm keeping an eye on "bup" (https://github.com/bup/bup), but I cannot backup "forever", thus without the ability of purging old versions it's only useful in a limited set of cases.

> I'm keeping an eye on "bup" (https://github.com/bup/bup), but I cannot backup "forever", thus without the ability of purging old versions it's only useful in a limited set of cases.

I wrote ddar before I knew about bup. It doesn't have this limitation; you can arbitrarily remove any archive. However, it does not do encryption, so I wouldn't recommend using it to store on S3 directly.

I actually tried again today, and this is still prominent in the limitations:

bup currently has no features that prune away old backups.

Because of the way the packfile system works, backups become "entangled" in weird ways and it's not actually possible to delete one pack (corresponding approximately to one backup) without risking screwing up other backups.

Thanks for pointing that out. Encryption is not really an issue for me. It's very easy to perform encryption using an encrypting FUSE filesystem, or afterwards when needed.

Just for clarification, doesn't this backup approach only give you your most recent backup? In other words, is it true that it doesn't do "versions", i.e. doesn't let you restore from n backups ago?

Isn't silence-unless-failed just a way of doing > /dev/null ? Since the stderr isn't redirected cron will get the errors but the other stuff goes to null.
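For comparison, a minimal sketch of what a "silence unless failed" wrapper could look like. This assumes the behaviour the name suggests (buffer all output, show it only when the command exits non-zero) and is not the article's actual script; `run_silenced` is a made-up name.

```python
import subprocess
import sys


def run_silenced(argv):
    """Run `argv`, capturing stdout and stderr together.

    Returns (exit_code, output_to_show): the captured output if the
    command failed, or an empty string if it succeeded.
    """
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    )
    shown = proc.stdout if proc.returncode != 0 else ""
    return proc.returncode, shown


if __name__ == "__main__" and len(sys.argv) > 1:
    code, shown = run_silenced(sys.argv[1:])
    sys.stdout.write(shown)  # cron mails this only when there is output
    sys.exit(code)
```

Under that assumption it differs from `> /dev/null` in two ways: on failure the stdout part of the log is kept instead of thrown away, and on success anything the command wrote to stderr (warnings, progress noise) is suppressed too, so cron stays quiet.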

I'm currently using Duplicity and a Raspberry Pi to push my encrypted files off a NAS to a VPS on a daily basis. Works out to be $1.38/month for 50 GB of storage.

The main reason why I prefer Duplicity is its portability. I can easily use the same setup for backups over ssh, rsync, s3, ftp, whatever.

OOOrrrr... QNAP home-disk array + deja-dup = automated, really cheap full-disk backup NOT stored on some US-NSA-server..

duplicity + s4[0], because s4 is greater than s3.

[0] https://leastauthority.com/

Ahh, the bums who think they have a right to my name, despite me using it since '00 or so. This still looks like an interesting service.

care to elaborate?

They both call themselves "phusion".
