Attic is one of the new generation of hash-based backup tools (like obnam, zbackup, Vembu Hive, etc.). It provides encrypted incremental-forever backups (unlike duplicity, duplicati, rsnapshot, rdiff-backup, Ahsay, etc.) with no server-side processing and a convenient CLI, and it does let you prune old backups.
All other common tools seem to fail on one of the following points:
- Incremental forever (bandwidth is expensive in a lot of countries)
- Untrusted remote storage (so I can hook it up to a dodgy lowendbox VPS)
- Optional: no server-side processing needed (so I can hook it up to S3 or Dropbox)
If your backup model is based on the old 'original + diff(original, v1) + diff(v1, v2) + ...' style, then you're going to have a slow time restoring. rdiff-backup gets this right by reversing the incremental chain. However, as soon as you need to consolidate incremental images, you lose the possibility of encrypting the data (since encrypt(diff()) is useless from a diff perspective).
But with a hash-based backup system? All restore points take constant time to restore.
Duplicity, Duplicati 1.x, and Ahsay 5 don't support incremental-forever. Ahsay 6 supports incremental-forever at the expense of requiring trust in the server (server-side decryption to consolidate images). Duplicati 2 attempted to move to a hash-based system, but it uses fixed block offsets rather than checksum-based offsets, so incremental detection becomes inefficient after any insertion point.
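For context, the difference is between cutting chunks at fixed offsets and cutting them wherever a rolling checksum hits a boundary condition. Here's a toy sketch of the latter; the window size, mask, and chunk limits are made up, and real tools use a proper rolling hash (Attic uses a buzhash variant, IIRC):

```python
def chunk_boundaries(data, window=48, mask=0x1FFF, min_size=2048, max_size=65536):
    """Yield (start, end) chunk offsets chosen by a rolling checksum.

    Boundaries depend only on local content, so inserting bytes early in a
    file only shifts the chunks around the insertion point; later chunks
    keep their old hashes and still deduplicate. (Toy additive checksum,
    not what any real tool actually uses.)
    """
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]          # slide the window forward
        size = i - start + 1
        at_boundary = size >= min_size and (rolling & mask) == 0
        if at_boundary or size >= max_size:
            yield start, i + 1
            start, rolling = i + 1, 0
    if start < len(data):
        yield start, len(data)


# With fixed offsets, inserting one byte at the front of a file shifts every
# subsequent block and defeats deduplication; with content-defined chunks,
# only the chunks touching the insertion point change.
```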
IMO Attic gets everything right. There are patches for Windows support on their GitHub. I wrote a munin plugin for it.
Disclaimer: I work in the SMB backup industry.
If the original box is ever compromised, I don't want the attacker to gain any access to the backup. If you use a dumb storage like S3 as your backup server, you need to store your keys on the original box, and anyone who gains control of the original box can destroy your S3 bucket as well. Ditto for any SSH-based backup scheme that requires keys to be stored on the original box. A compromised box could also lie about checksums, silently corrupting your backups.
Backups should be pulled from the backup box, not pushed from the original box. Pushing backups is only acceptable for consumer devices, and even then, only because we don't have a reliable way to pull data from them (due to frequently changing IP addresses, NAT, etc).
The backup box needs to be even more trustworthy than the original box, not the other way around. I'm willing to live with a significant amount of overhead, both in storage and in bandwidth, in order not to violate this principle.
The backup box, of course, could push encrypted data to untrusted storage, such as S3. But only after it has pulled from the original box. In both cases, the connection is initiated from the backup box, not the other way around. The backup box never accepts any incoming connection.
Does Attic support this kind of use case? The documentation doesn't seem to have anything to say about backing up remote files to local repositories. I don't see any reason why it wouldn't be supported (since rsync does it), but "nominally supported" is different from "optimized for that use case", and I suspect that many of the latest generation of backup tools are optimized for the opposite use case.
"Ditto for any SSH-based backup scheme that requires keys to be stored on the original box. A compromised box could also lie about checksums, silently corrupting your backups."
This is a good thought - you should indeed be thinking about an attacker compromising your system and then using the SSH keys they find and wiping out the offsite backup. All they need to do is look into cron and find the jobs that point to the servers that ...
So how do we solve this? All of our accounts have ZFS snapshots enabled by default. You may not be aware of it, but ZFS snapshots are immutable. Completely. Even root can't write or delete in a snapshot. The snapshot has to be deliberately destroyed with ZFS commands run by root - which, of course, the attacker would not have access to. It's a nice safety net - even if your current copy is wiped out, you have your ZFS snapshots in place.
"Backups should be pulled from the backup box, not pushed from the original box."
This was the tipping point - I had to comment. Since day one, we have, free of charge, set up "pull jobs" for any customer that asks for it. Works just like you'd like it to on whatever schedule they can cram into a cron format. It's a value add we've always been happy to provide.
 You know who we are.
 Yes, if you don't notice for 7 days and 4 weeks that the attacker has wiped you out, at that point your snapshots will all rotate into nothingness as well. Nothing's perfect.
But in the modern commercial market, who are you going to trust as your backup box provider? A USA company subject to NSLs? Run your own in a rack somewhere? Having an untrusted server greatly decreases cost and increases the chance that you can actually produce a successful backup infrastructure at all.
It is possible to do it safely.
Since you say you're "willing to live with a significant amount of overhead", I would suggest a two-tier push/pull configuration. Desktop pushes to site A; then site B pulls from site A. This also increases redundancy and spreads out the attack surface.
Append-only is another good solution - I don't believe attic formally supports this today, but it should be as simple as patching `attic serve` to ignore delete requests. Good first patch.
(Also if you really trust your backup server, then you don't need encryption anyway and can just run rdiff-backup over ssh.)
Not really. On my production boxes, I usually set up a "backup" account and give it read-only access to paths that need to be backed up. Nothing fancy, just standard POSIX filesystem permissions. The backup box uses this account to ssh in, so it can only read what it needs to read, and it can never write anything to the production box. I wouldn't call that "full control".
> Desktop pushes to site A; then site B pulls from site A.
What you described is similar to my own two-tier configuration, except I pull first and then push to untrusted storage like S3 (using encryption, of course). The first step uses rsync over ssh. The second step is just tar/gzip/gpg at the moment, but if I want deduplication I can easily switch to something like tarsnap.
- if your backup box is on another network then it can be coerced into malicious reads (leaking private information, trade secrets, your competitive advantage etc).
- if it's on the same network then it's subject to your same failure patterns.
Push backup has some disadvantages, but there's a lot of peace-of-mind in never (intentionally) granting additional users access to the unencrypted data.
Two-tier is one approach. There's another comment in this thread about snapshotting filesystems (ZFS, or I suppose LVM snapshots might be easier), which would be another method of addressing concerns about the client tampering with the backed-up data.
When I had no resources for this (e.g. as a low-income student), I had a server at my mom's place that did this for me. Low-cost, offsite, trustable backup server for personal usage.
I believe this isn't strictly necessary if you use asymmetric cryptography (e.g. curve25519). For a file, generate a temporary key pair, use it together with the backup server's public key to encrypt the file, then throw out the private key and send the encrypted file + ephemeral public key to the server.
Apple uses this technique to move files to the "Accessible while unlocked" state without having the key for that state (i.e. while the device is locked).
- Generate a temporary key
- Symmetrically encrypt with that key
- Encrypt that key with your long-term asymmetric public key, and send the encrypted version along with your backups.
And before you hack around your own version, I'd like to point out this is exactly what PGP (and really, any crypto scheme that involves asymmetric keys) does. So, basically, just GPG your backups.
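For the curious, here's a minimal sketch of the ephemeral-key (curve25519) variant using PyNaCl's SealedBox, which does the generate-a-temporary-key-and-throw-it-away dance internally. The library choice, key handling, and file path are just illustrative; as said above, GPG already gives you the same construction:

```python
from nacl.public import PrivateKey, SealedBox

# One-time setup on a trusted machine: generate the long-term keypair.
# Only the *public* half ever lives on the box being backed up.
backup_key = PrivateKey.generate()
public_half = backup_key.public_key

# On the production box: encrypt a file to the public key. SealedBox
# creates an ephemeral keypair, derives a shared secret, encrypts, and
# discards the ephemeral private key, so the box cannot decrypt its own
# output afterwards.
with open("/etc/fstab", "rb") as f:          # any file; path is just an example
    ciphertext = SealedBox(public_half).encrypt(f.read())

# Off-site, on the machine that holds the private key:
plaintext = SealedBox(backup_key).decrypt(ciphertext)
```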
1. Use sub accounts on rsync.net so backups from different parts of the system are isolated from each other.
2. Use a different GPG keypair and passphrase for each host being backed up.
3. Have an isolated machine out on the internet somewhere (that, importantly, isn't referenced by anything in the main system, including documentation / internal wikis, i.e. so the attackers don't know it exists) that does a daily copy of the latest and previous full backup plus any current incrementals directly from rsync.net's storage. This way I'm still covered (and can restore relatively quickly) if an attacker gets into the system and deletes the rsync.net-hosted backups for lulz.
If you're truly paranoid, or need to protect backups going back over months, you could also introduce a final routine that duplicates the data from the ghost machine to Amazon Glacier (and then optionally pay for an HDD to be shipped periodically to your offices).
The ZFS snapshots are immutable. Completely. Even root can't write or delete in a snapshot. The snapshot has to be deliberately destroyed with ZFS commands run by root - which of course, the attacker would not have access to.
Also, thanks for your business :)
Even stricter access controls (write once, no read) might help with that. Not sure if you can do that with S3 though.
Even if the source can only perform new backups, a deduplicating system is still open to a timing attack. The attacker can attempt to back up chosen data to infer properties of the existing backups.
You can remove this only by removing deduplication (or by crippling deduplication to work only on the server side, incurring wasteful network requests).
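To make the side channel concrete, here's a toy deduplicating store: whether an upload actually transfers any bytes (or how long the backup takes, or how much the repository grows) tells the attacker whether a chosen chunk already existed. Entirely hypothetical code, just to illustrate the leak:

```python
import hashlib

class DedupStore:
    """Toy content-addressed store with client-side deduplication."""
    def __init__(self):
        self.chunks = {}

    def put(self, chunk: bytes) -> bool:
        """Store a chunk; return True if bytes actually had to be uploaded."""
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in self.chunks:
            return False                 # upload skipped -> observable savings
        self.chunks[digest] = chunk      # "upload"
        return True

store = DedupStore()
store.put(b"SECRET=hunter2\n")           # victim's earlier backup

# An attacker with backup-only access guesses candidate secrets and watches
# whether anything gets transferred:
for guess in (b"SECRET=password\n", b"SECRET=hunter2\n"):
    uploaded = store.put(guess)
    print(guess, "was already in the repo" if not uploaded else "is new")
```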
"Attic can initialize and access repositories on remote hosts if the host is accessible using SSH."
Fantastic. Will work perfectly here. We are happy to support this, just like we've supported duplicity all of these years. EDIT: it appears obnam also works over plain old SSH. Can't tell about zbackup, however...
As always, email us to discuss the "HN Readers" discount.
Although upon reading yours, it looks like they perform somewhat different tasks.
It's possible that this is a bad way to run things, since my master server is then a SPoF. I do trust this server more, since I monitor it much more closely than I would a dozen random VPSes with a half-dozen different providers.
The scenario you're describing, however, sounds like the opposite in terms of trust. And in that case pull may make sense. However, it doesn't sound like attic itself natively supports that sort of config. I could envision a sort of hybrid approach where the local machine encrypts to a local attic repository, and then the remote backup server pulls a copy of it. There's nothing stopping you from setting that up, either with attic as-is or with this wrapper script.
Here's a blog post about it:
Take it with a grain of salt, but I am surprised to say that attic is at least as fast as bup.
I've been amazed to notice how all my files have atimes from July, when I stopped using rsync to make backups and switched to bup. :p
Could you or someone explain why these fantastic-sounding tools don't get a developed front-end? Or if they do, why am I missing them?
The best solution I've found is ChronoSync.
For duplicity, there is quite a good UI in the form of Deja Dup (see http://www.howtogeek.com/108869/how-to-back-up-ubuntu-the-ea...). It's really nice and easy to use, and if I recall correctly it's installed by default on Ubuntu.
Moreover, many of these backup tools run on servers (or as an automated background process) where a graphical interface is more of a handicap than an asset.
I think that in open-source software, "pretty interface" often sounds like "customer-oriented", which in turn sounds like "getting paid".
There are currently multiple interfaces available to bup, but each one has some quirks. bup's web interface is still very embryonic and would need the magic touch of some designers / integration specialists to make it fun to work with.
My personal choice of attic over obnam was based on this performance comparison
Elsewhere in this discussion people note some shortcomings of bup, namely not having its own encryption and not having the ability to delete old backups. For my applications, lack of encryption isn't an issue, since I make the backups locally on a full-disk-encrypted device and transmit them for long-term storage (to another full-disk-encrypted device) only over ssh. The inability to easily delete old backups is also not an issue since (1) I don't want to delete them (I want a complete history), and (2) the approach to deduplication and compression in bup makes it extremely space-efficient, and it doesn't get (noticeably) slower as the number of commits gets large; this is in contrast to ZFS, where performance can degrade dramatically if you make a large number of snapshots, or other much less space-efficient approaches where you have to regularly delete backups or you run out of space.
In this discussion people also discuss ZFS and deduplication. With SageMathCloud, the filesystem all user projects use is a de-duplicated ZFS-on-Linux filesystem (most on an SSD), with lz4 compression and rolling snapshots (using zfssnap). This configuration works well in practice, since projects have limited quota so there's only a few hundred gigabytes of data (so far less than even 1TB), but the machines have quite a lot of RAM (50+GB) since they are configured for lots of mathematics computation, running IPython notebooks, etc.
It's a great tool, but since you've contributed already, I cannot overstate the importance of pruning old archives. For online/disk-based backup solutions, space is always going to run out eventually.
I'm using bup where I already know the backup size will grow in a way that I can manage for the next 1-2 years.
For "classical" backup scenarios though, where binaries and many changes are involved and the backup grows by roughly 10-20% a week due to changes alone, I have to resort to tools where I can prune old archives because I would either have to reduce the number of increments (which I don't want to do) or increase the backup space by a factor of 50x (which I practically cannot do either).
Others have complained here that bup doesn't support deleting old backups. ddar doesn't have such an issue. Deleting snapshots works just fine (all other snapshots remain).
I think the underlying difference is that ddar uses sqlite to keep track of the chunks, whereas bup is tied to git's pack format, which isn't really geared towards large backups. git's pack files are expected to be rewritten, which works fine for code repositories but not for terabytes of data.
The real struggle of bup is how to know whether a hash is already stored, and how to know it screaming fast. It could be interesting to compare bup style and standard sqlite style, as you do in ddar.
Also, it seems ddar stores each object in its own file, like git loose objects. SQLite has a page that compares this with storing blobs in SQLite, and I don't know what the median size of your objects is, but if it's < 20k it seems to be better to just store them as blobs.
I can't remember the exact number off the top of my head, but I designed the average size of each object to be much bigger - more like 64-256M than kilobytes. IMHO this works far better for backups. So I just use the filesystem to store the blobs, which I think works better.
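For anyone curious, here's a minimal sketch of what the "standard sqlite style" chunk index mentioned above could look like; the schema, table, and column names are invented for illustration and are not ddar's actual format:

```python
import hashlib
import sqlite3

# Map chunk hash -> where the chunk lives; the primary key makes the
# "have I already stored this hash?" question a single fast lookup.
db = sqlite3.connect("chunks.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        hash    TEXT PRIMARY KEY,
        archive TEXT NOT NULL,
        offset  INTEGER NOT NULL,
        length  INTEGER NOT NULL
    )
""")

def have_chunk(chunk: bytes) -> bool:
    digest = hashlib.sha256(chunk).hexdigest()
    row = db.execute("SELECT 1 FROM chunks WHERE hash = ?", (digest,)).fetchone()
    return row is not None

def record_chunk(chunk: bytes, archive: str, offset: int) -> None:
    digest = hashlib.sha256(chunk).hexdigest()
    db.execute("INSERT OR IGNORE INTO chunks VALUES (?, ?, ?, ?)",
               (digest, archive, offset, len(chunk)))
    db.commit()
```

Unlike git's pack files, rows here never need to be rewritten when new backups arrive, which is the point being made about large repositories.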
Obnam, attic and similar use a normal read/write disk area, without any server side processing, so presumably an errant/malicious user is free to delete the entire backup?
You can create a key that has write only access for your automated backups and a different key that has full access for administrative purposes.
That is what we do anyway.
However, bup goes one step further and has built-in support for "par2", which adds error correction - in a way, it efficiently re-duplicates chunks so that whichever one (or two, or however many you decide) breaks, you can still recover the complete backup.
The reason I ask is that I had a difficult time finding a backup tool that suited my own needs, so I wrote and open-sourced my own (http://www.snebu.com), and now that some people are starting to use it in production I'd like to get a deeper peer review to ensure quality and feature completeness. (I actually didn't think I'd be this nervous about people using any of my code, but backups are kind of critical, so I'd like to ensure it is done as correctly as possible.)
Not polished, but it's working for me.
- only runs on machines I control
- server requirement is only rsync, ssh and coreutils
- basic conflict detection
- encfs --reverse to encrypt locally, store remotely
- history is rsnapshot-style hard links
- inspect history using sshfs
- can purge old history
encfs isn't ideal but it's the only thing that does the job. Ideally I'd use something that didn't leak so much, but it doesn't exist.
I can't work (very effectively) in two places at once, so I don't need robust merging, just CYA synchronisation. Using only rsync features I can do a full 2-way rsync merge and catch potential conflicts, erring on the conservative side so I have reasonable confidence I don't lose any work.
Minimal workstation dependencies: only bash, encfs, rsync, ssh, coreutils/findutils and optionally atd for automation. encfs is optional, too.
Instead of dvcs-autosync and XMPP I just use periodic execution. I partition my stuff into smaller/high-frequency vs larger/lower-frequency to keep this efficient. These are triggered from bash (PROMPT_COMMAND, in the background) and recursive at jobs (atd).
The local data is unencrypted on disk from this tool's POV. I use encfs --reverse and rsync the result. To browse the history, I combine sshfs with encfs in forward mode.
Linux only because that's what I use, but it should be possible to support OSX.
All in all I'm pleased I'm able to use such minimal tooling for such a successful result.
Obnam and bup seemed to work mostly the way I wanted, but obnam was by far the most mature tool, so that is what I chose in the end.
On the plus side, it provides both push and pull modes. Encryption and expiration works. The minus points are no Windows support, and some horror stories about performance. Apparently it can slow to a crawl with many files. I haven't run into that problem despite hundreds of gig in the backup set, but most are large files.
On the whole it's been very stable and unobtrusive during the time I've used it, but I haven't used it in anger yet. So a careful recommendation for obnam from me.
However, it's a dangerous feature to add in (backup tools should never screw up their storage, and this feature goes and removes things), so a lot of care is needed.
The boring answer is: it's coming, and we need a lot of help vetting patches and testing them out.
So they did the easy parts and skipped the hard parts.
Hardly a perfect backup.
I'm not sure a backup tool with all three of these even exists: deleting old backups, encryption, and uploading only differences.
I think it might be one of those "pick any two" things, but if anyone knows of a backup tool with all three, let me know.
How are you calculating a diff of an encrypted file? I looked through the technical documentation but if it explains it I missed it.
Have you thought about how to quantify this tradeoff?
I suppose you could pad each encrypted chunk so they're all the same size, but then if you don't want to waste a ton of space you'd have to restrict your chunking algorithm to output chunks with relatively similar sizes, at which point you lose some of the benefits of chunking.
That doesn't mean that it's impossible, of course; just that it would require someone smarter than me. ;-)
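On the question of quantifying the tradeoff: if the chunker's output falls roughly uniformly between a minimum and a maximum size, padding every chunk up to the maximum wastes on the order of the gap between the mean and the maximum. A back-of-the-envelope sketch, where the bounds and the uniform distribution are assumptions rather than measurements of any real tool:

```python
import random

random.seed(0)
min_size, max_size = 512 * 1024, 2 * 1024 * 1024   # assumed 512 KiB .. 2 MiB chunker bounds

# Simulate chunk sizes as uniform between the bounds (real chunkers are
# skewed toward the target size, so treat this as a rough upper bound).
sizes = [random.randint(min_size, max_size) for _ in range(100_000)]

payload = sum(sizes)
padded = max_size * len(sizes)                      # every chunk padded to the max
overhead = (padded - payload) / payload

print(f"padding overhead: {overhead:.0%}")          # roughly 60% for these bounds
```

Tightening the chunker's min/max bounds shrinks that overhead, but, as noted above, at the cost of worse boundary resynchronisation after edits.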
git annex is for more than just backups. In particular, it lets you store files on multiple machines and retrieve them at will. This lets you do backups to e.g. S3, but it also lets you e.g. store your mp3 collection on your NAS and then easily copy some files to your laptop before leaving on a trip. Any changes you make while you're offline can be sync'ed back up when you come back online.
You can prune old files in git-annex, and it also supports encryption. git-annex deduplicates identical files, but unlike Attic & co, it does not have special handling of incremental changes to files; if you change a file, you have to re-upload it to the remote server.
git-annex is actively developed, and I've found the developer to be really friendly and helpful.
You can prune the old files, but because the metadata history -- basically, the filename-to-hash mapping -- is stored in git, you can't prune that. In practice you'd need to have a pretty big repository with a high rate of change for this to matter.
Edited for formatting.
"git-annex is not a backup system."
Some people have reported using an encrypted storage backend like ecryptfs to store their bup repositories in. That option shouldn't be too hard to put together.
I like & use Duplicity, but it's a little finicky to set up right and some of the more obscure features are counterintuitive or even broken.
My personal obstacle in using a tool like bup is the back-up space. I could definitely use this for on-site/external storage devices, but I also like to keep online/cloud copies. I currently use CrashPlan for that which affords me unlimited space. If CrashPlan would let me use their cloud with bup, wow, I would switch in a heartbeat. Perhaps cloud backup tools could learn some tricks from bup.
It is really easy to set up what folders to backup and where, and I use it whenever a backup is simply take all files from X, do the rolling backups at Y, and done.
The one most likely to be a showstopper seems to be: "bup currently has no way to prune old backups."
git's tools, namely filter-branch and gc, have been reported to work on limited-size bup repositories, but they very quickly eat up all RAM and CPU and never finish because of the sheer number of objects usually stored in a bup repository.
There are existing discussions about the way bup pruning would have to work:
Sounds like they are actually making progress on it.
Looking at http://burp.grke.org/burp2/08results1.html, it seems burp can outperform bup in some situations.
Can someone more experienced with ZFS say why?
It boils down to the fact that ZFS maintains a mapping from hashes to LBNs. This allows write-time deduplication (as opposed to a scrubber that runs periodically and retroactively deduplicates already-written blocks). This is somewhat memory-intensive, though. For smaller ZFS pools you can get away with just having lots of RAM (and, with or without dedupe, ZFS performs better the more RAM you have). For larger ones, you can add an SSD to act as additional disk cache.
Here's a quick description of that setup:
Note in this example that they were already showing 128GB of RAM for a 17TB pool; the L2ARC was to augment that. In general, ZFS was designed with a much higher RAM/Disk ratio than a workstation typically has.
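As a rough sanity check on why dedup wants so much RAM: the dedup table costs on the order of a few hundred bytes of memory per unique block (a commonly quoted rule-of-thumb figure is around 320 bytes). A back-of-the-envelope estimate for a pool like the one above, assuming the default 128 KiB recordsize:

```python
pool_size   = 17 * 2**40          # 17 TiB pool, as in the example above
record_size = 128 * 2**10         # default 128 KiB recordsize (assumed)
ddt_entry   = 320                 # rough bytes of RAM per unique block (rule of thumb)

blocks = pool_size / record_size
print(f"~{blocks / 1e6:.0f}M blocks -> ~{blocks * ddt_entry / 2**30:.0f} GiB for the dedup table")
# ~143M blocks -> ~42 GiB just for the DDT; smaller records or more data make it much worse.
```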
bup currently doesn't do that, but there's been some talk of using inotify or another such method of knowing exactly which files are modified, and when, so that bup could instantly work on those.
In theory it should be feasible; it's just not implemented yet.
There are some areas in which bup is trying to innovate in order to change how backups are thought about (storage size being one of those).