
Bup – towards the perfect backup - hachiya
http://wrouesnel.github.io/articles/bup%20-%20towards%20the%20perfect%20backup/
======
mappu
A shoutout for attic [https://attic-backup.org/](https://attic-backup.org/)

Attic is one of the new-generation hash-backup tools (like obnam, zbackup,
Vembu Hive etc). It provides encrypted incremental-forever (unlike duplicity,
duplicati, rsnapshot, rdiff-backup, Ahsay etc) with no server-side processing
and a convenient CLI interface, and it _does_ let you prune old backups.

All other common tools seem to fail on one of the following points:

\- Incremental _forever_ (bandwidth is expensive in a lot of countries)

\- Untrusted remote storage (so i can hook it up to a dodgy lowendbox VPS)

\- Optional: No server-side processing needed (so i can hook it up to S3 or
Dropbox)

If your backup model is based on the old 'original + diff(original, v1) +
diff(v1, v2) + ...' chain, then you're going to have a slow time restoring. rdiff-backup
gets this right by reversing the incremental chain. However, as soon as you
need to consolidate incremental images, you lose the possibility of encrypting
the data (since encrypt(diff()) is useless from a diff perspective).

But with a hash-based backup system? All restore points take constant time to
restore.
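
Roughly, the idea in a toy Python sketch (just to show the shape of it, not how
attic or bup actually lay out their repositories):

    import hashlib
    
    store = {}      # chunk hash -> chunk bytes (the shared, deduplicated pool)
    snapshots = {}  # snapshot name -> ordered list of chunk hashes
    
    def backup(name, chunks):
        """A snapshot is just a list of hashes; only unseen chunks get stored."""
        refs = []
        for chunk in chunks:
            h = hashlib.sha1(chunk).hexdigest()
            store.setdefault(h, chunk)   # already-known chunks cost nothing extra
            refs.append(h)
        snapshots[name] = refs
    
    def restore(name):
        """Any snapshot, old or new, is rebuilt by direct lookups -- no diff chain."""
        return b"".join(store[h] for h in snapshots[name])
    
    backup("monday", [b"aaaa", b"bbbb"])
    backup("tuesday", [b"aaaa", b"cccc"])   # only b"cccc" is new data
    assert restore("monday") == b"aaaabbbb"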

Duplicity, Duplicati 1.x, and Ahsay 5 don't support incremental-forever. Ahsay
6 supports incremental-forever at the expense of requiring trust in the server
(server-side decrypt to consolidate images). Duplicati 2 attempted to move to
a hash-based system, but they chose fixed block offsets rather than
checksum-based (content-defined) offsets, so incremental detection is
inefficient after an insertion point.
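
The difference is easy to demonstrate (plain Python; a trivial sliding-window
hash stands in here for the real rolling checksums bup and attic use):

    import hashlib, random
    
    def fixed_chunks(data, size=64):
        """Cut at fixed offsets: an insertion shifts every later boundary."""
        return [data[i:i + size] for i in range(0, len(data), size)]
    
    def content_defined_chunks(data, window=16, mask=0x3F):
        """Cut wherever a hash of the last `window` bytes hits a magic value,
        so boundaries follow the content and re-align after an insertion."""
        chunks, start = [], 0
        for i in range(window, len(data)):
            if hashlib.sha1(data[i - window:i]).digest()[-1] & mask == 0:
                chunks.append(data[start:i])
                start = i
        chunks.append(data[start:])
        return chunks
    
    random.seed(0)
    original = bytes(random.randrange(256) for _ in range(8192))
    shifted = b"X" + original   # one byte inserted at the front
    
    for label, chunker in (("fixed", fixed_chunks),
                           ("content-defined", content_defined_chunks)):
        a, b = set(chunker(original)), set(chunker(shifted))
        print(label, "chunks reused:", len(a & b), "of", len(a))

With fixed offsets essentially nothing matches after the insert; with
content-defined boundaries almost every chunk is reused.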

IMO Attic gets everything right. There are patches for Windows support on their
GitHub. I wrote a Munin plugin for it.

Disclaimer: I work in the SMB backup industry.

~~~
kijin
Sorry, but "Untrusted remote storage" and "No server-side processing" are
exactly the opposite of what I need.

If the original box is ever compromised, I don't want the attacker to gain any
access to the backup. If you use a dumb storage like S3 as your backup server,
you need to store your keys on the original box, and anyone who gains control
of the original box can destroy your S3 bucket as well. Ditto for any SSH-
based backup scheme that requires keys to be stored on the original box. A
compromised box could also lie about checksums, silently corrupting your
backups.

Backups should be _pulled_ from the backup box, not _pushed_ from the original
box. Pushing backups is only acceptable for consumer devices, and even then,
only because we don't have a reliable way to pull data from them (due to
frequently changing IP addresses, NAT, etc).

The backup box needs to be even more trustworthy than the original box, not
the other way around. I'm willing to live with a significant amount of
overhead, both in storage and in bandwidth, in order not to violate this
principle.

The backup box, of course, could push encrypted data to untrusted storage,
such as S3. But only after it has pulled from the original box. In both cases,
the connection is initiated from the backup box, not the other way around. The
backup box never accepts any incoming connection.

Does Attic support this kind of use case? The documentation doesn't seem to
have anything to say about backing up remote files to local repositories. I
don't see any reason why it won't be supported (since rsync does), but
"nominally supported" is different from "optimized for that use case", and I
suspect that many of the latest generation of backup tools are optimized for
the opposite use case.

~~~
mappu
By pulling backups, you're giving the backup box full control over your
computer, meaning that yes it must be more trustworthy. Push backups can
indeed allow the initiator to wreck the remote state.

But in the modern commercial market, who are you going to trust as your backup
box provider? A USA company subject to NSLs? Run your own in a rack somewhere?
Having an untrusted server greatly decreases cost and increases the chance
that you can actually produce a successful backup infrastructure at all.

It is possible to do it safely.

Since you say you're "willing to live with a significant amount of overhead",
i would suggest a two-tier push/pull configuration. Desktop pushes to site A;
then site B pulls from site A. This also increases redundancy and spreads out
the attack surface.

Append-only is another good solution - i don't believe attic formally supports
this today but it should be as simple as patching `attic serve` to ignore
delete requests. Good first patch.
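
(I haven't read attic's serve code, so this is only the shape of the idea in
Python -- hypothetical names, not attic's actual RPC API:)

    class AppendOnlyRepo:
        """Toy server-side store: accepts new objects, refuses to rewrite or
        delete history. A hypothetical sketch, not a real attic patch."""
    
        def __init__(self):
            self._objects = {}
    
        def put(self, key, data):
            if key in self._objects and self._objects[key] != data:
                raise PermissionError("append-only: refusing to overwrite " + key)
            self._objects[key] = data
    
        def delete(self, key):
            # a compromised client can ask, but the server just says no
            raise PermissionError("append-only: delete requests are ignored")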

(Also if you really trust your backup server, then you don't need encryption
anyway and can just run rdiff-backup over ssh.)

~~~
kijin
> _By pulling backups, you're giving the backup box full control over your
> computer_

Not really. On my production boxes, I usually set up a "backup" account and
give it read-only access to paths that need to be backed up. Nothing fancy,
just standard POSIX filesystem permissions. The backup box uses this account
to ssh in, so it can only read what it needs to read, and it can never write
anything to the production box. I wouldn't call that "full control".

> _Desktop pushes to site A; then site B pulls from site A._

What you described is similar to my own two-tier configuration, except I pull
first and then push to untrusted storage like S3 (using encryption, of
course). The first step uses rsync over ssh. The second step is just
tar/gzip/gpg at the moment, but if I want deduplication I can easily switch to
something like tarsnap.
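
That second step is only a few lines. Something like this sketch (paths are
placeholders, gpg will prompt for the passphrase, and the eventual S3 upload is
left out):

    import datetime
    import subprocess
    
    SRC = "/backups/pulled"   # hypothetical: where the pull from production landed
    OUT = "/backups/outgoing/backup-%s.tar.gz.gpg" % datetime.date.today()
    
    # tar+gzip to stdout, encrypt before anything leaves the backup box
    tar = subprocess.Popen(["tar", "-czf", "-", "-C", SRC, "."],
                           stdout=subprocess.PIPE)
    subprocess.run(["gpg", "--symmetric", "--cipher-algo", "AES256", "-o", OUT],
                   stdin=tar.stdout, check=True)
    tar.stdout.close()
    if tar.wait() != 0:
        raise RuntimeError("tar failed")
    # OUT can now be pushed to S3 or any other dumb storage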

~~~
mappu
I guess it depends on your security model. With one single pull backup,

\- if your backup box is on another network then it can be coerced into
malicious reads (leaking private information, trade secrets, your competitive
advantage etc).

\- if it's on the same network then it's subject to your same failure
patterns.

Push backup has some disadvantages, but there's a lot of peace-of-mind in
never (intentionally) granting additional users access to the unencrypted
data.

Two-tier is one approach. There's another comment in this thread about
snapshotting filesystems (ZFS, or i suppose LVM snapshots might be easier)
which would be another method of addressing concerns about the client
tampering with the backed up data.

------
williamstein
I've long been a huge fan of bup, and have even contributed some code. I might
be by far their single biggest user, since I host 96748 bup repositories at
[https://cloud.sagemath.com](https://cloud.sagemath.com), where the snapshots
for all user projects are made using bup (and mounted using bup-fuse).

Elsewhere in this discussion people note some shortcomings of bup, namely not
having its own encryption and not having the ability to delete old backups.
For my applications, lack of encryption isn't an issue, since I make the
backups locally on a full-disk-encrypted device and transmit them for long-term
storage (to another full-disk-encrypted device) only over ssh. The lack of
being able to easily delete old backups is also not an issue, since (1) I don't
want to delete them (I want a complete history), and (2) the approach to
deduplication and compression in bup makes it extremely space-efficient,
and it doesn't get (noticeably) slower as the number of commits gets large;
this is in contrast to ZFS, where performance can degrade dramatically if you
make a large number of snapshots, or other much less space efficient
approaches where you _have_ to regularly delete backups or you run out of
space.

People here also discuss ZFS and deduplication. With
SageMathCloud, the filesystem that all user projects use is a de-duplicated ZFS-on-
Linux filesystem (most on an SSD), with lz4 compression and rolling snapshots
(using zfssnap). This configuration works well in practice, since projects
have limited quota so there's only a few hundred gigabytes of data (so far
less than even 1TB), but the machines have quite a lot of RAM (50+GB) since
they are configured for lots of mathematics computation, running IPython
notebooks, etc.

~~~
tenfingers
I've also been using bup to replace (in some areas) my use of rdiff-backup.

It's a great tool, but since you've contributed already, I cannot overstate the
importance of pruning old archives. For online/disk-based backup solutions,
space is always going to run out eventually.

I'm using bup where I already know the backup size will grow in a way that I
can manage for the next 1-2 years.

For "classical" backup scenarios though, where binaries and many changes are
involved and the backup grows by roughly 10-20% a week due to changes alone, I
_have_ to resort to tools where I can prune old archives because I would
either have to reduce the number of increments (which I don't want to do) or
increase the backup space by a factor of 50x (which I practically cannot do
either).

------
rlpb
I wrote a very similar tool before I knew about bup - ddar
([https://github.com/basak/ddar](https://github.com/basak/ddar) \- with more
documentation at
[http://web.archive.org/web/20131209161307/http://www.synctus...](http://web.archive.org/web/20131209161307/http://www.synctus.com/ddar/)).

Others have complained here that bup doesn't support deleting old backups.
ddar doesn't have such an issue. Deleting snapshots works just fine (all other
snapshots remain).

I think the underlying difference is that ddar uses sqlite to keep track of
the chunks, whereas bup is tied to git's pack format, which isn't really
geared towards large backups. git's pack files are expected to be rewritten,
which works fine for code repositories but not for terabytes of data.

~~~
rakoo
It's true that git's pack files are made for being rewritten, but bup doesn't
do that. Every new run will create a new pack along with its .idx (which means
that some packs may be quasi-empty) and the size of packs is capped at 1GB
(Giga, not Gibi).

The real struggle of bup is how to know whether a hash is already stored, and
how to know it _screaming fast_. It could be interesting to compare bup style
and standard sqlite style, as you do in ddar.

Also, it seems ddar stores each object in its own file, like git loose
objects. SQLite has a page [0] that compares this approach with storing blobs in SQLite,
and I don't know what's the median size of your objects but if it's < 20k it
seems to be better to just store them as blobs.

[0] [https://www.sqlite.org/intern-v-extern-
blob.html](https://www.sqlite.org/intern-v-extern-blob.html)
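
For what it's worth, the blobs-in-SQLite variant is only a few lines (a toy
sketch, not ddar's actual schema):

    import hashlib, sqlite3
    
    db = sqlite3.connect("chunks.db")
    db.execute("CREATE TABLE IF NOT EXISTS chunk (hash TEXT PRIMARY KEY, data BLOB)")
    
    def put(chunk):
        h = hashlib.sha256(chunk).hexdigest()
        # INSERT OR IGNORE makes duplicate chunks free; the primary-key index
        # answers "is this hash already stored?" without reading the blob itself
        db.execute("INSERT OR IGNORE INTO chunk VALUES (?, ?)", (h, chunk))
        db.commit()
        return h
    
    def get(h):
        return db.execute("SELECT data FROM chunk WHERE hash = ?", (h,)).fetchone()[0]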

~~~
rlpb
> I don't know what's the median size of your objects but if it's < 20k...

I can't remember the exact number off the top of my head, but I designed the
average size of each object to be much bigger - more like 64-256M than
kilobytes. IMHO this works far better for backups. So I just use the
filesystem to store the blobs, which I think is the better fit.

~~~
rakoo
Doesn't that reduce the benefit of deduplication, if you're working on
multi-megabyte objects? You'll end up copying a lot, unless I'm missing
something obvious...

~~~
rlpb
Most backups have very large areas of duplication. If a small file has
changed, chances are that the small files around it have changed also. So de-
duplicating with a larger chunk size seems to work fine in practice.

------
femto
Is there anything out there that does continuous incremental backups to a
remote location (like obnam, attic, ...) but allows "append-only" access? That
is, you are only allowed to add to the backup, and the network protocol
inherently does not allow past history to be deleted or modified? Pruning old
backups might be allowed, but only using credentials that are reserved for
special use.

Obnam, attic and similar use a normal read/write disk area, without any
server-side processing, so presumably an errant/malicious user is free to delete the
entire backup?

~~~
Negitivefrags
Tarsnap.

You can create a key that has write-only access for your automated backups and
a different key that has full access for administrative purposes.

That is what we do anyway.

------
beagle3
Haven't seen this mentioned - but, since bup de-duplicates chunks (and thus
may take very little space - e.g., when you back up a 40GB virtual machine,
each snapshot takes little more than the actual changes inside the virtual
machine), every byte of the backup is actually very important and fragile, as
it may be referenced from thousands of files and snapshots. This is of
course true for all deduplicating and incremental backups.

However, bup goes one step further and has built-in support for "par2", which
adds error correction - in a way, it efficiently re-duplicates chunks so that
whichever one (or two, or however many you decide) break, you can still
recover the complete backup.

~~~
abcd_f
I saw "par2" and got all excited to see if lelutin re-implemented all the
Galois field goodness from scratch, looked at the source and - no, bup merely
spawns the par2 binary. Damn :)

------
derekp7
I was wondering if someone's done a side-by-side comparison of the various
newer open-source backup tools? Specifically, I'm looking for performance,
compression, encryption, type of deduplication (file-level vs. block-level,
and dedup between generations only vs. dedup across all files). Also, the
specifics of the implementation, since some of the tools don't really explain
that too well, along with any unique features.

The reason I ask is that I had a difficult time finding a backup tool that suited
my own needs, so I wrote and open-sourced my own
([http://www.snebu.com](http://www.snebu.com)), and now that some people are
starting to use it in production I'd like to get a deeper peer review to
ensure quality and feature completeness. (I actually didn't think I'd be this
nervous about people using any of my code, but backups are kind of critical, so
I'd like to ensure it is done as correctly as possible.)

~~~
konradb
There's
[http://burp.grke.org/burp2/08results1.html](http://burp.grke.org/burp2/08results1.html)

------
uint32
Like any good hacker I got tired of other solutions that didn't quite match my
needs and made my own dropbox-like backup/sync using only rsync, ssh and
encfs.

[https://github.com/avdd/rsyncsync](https://github.com/avdd/rsyncsync)

Not polished, but it's working for me.

    
    
        - only runs on machines I control
        - server requirement is only rsync, ssh and coreutils
        - basic conflict detection
        - encfs --reverse to encrypt locally, store remotely
        - history is rsnapshot-style hard links
        - inspect history using sshfs
        - can purge old history
    

shell aliases showing how I use it are in my config repository

encfs isn't ideal but it's the only thing that does the job. Ideally I'd use
something that didn't leak so much, but it doesn't exist.

~~~
rsync
How does that compare to this:

[https://raymii.org/s/articles/Set_up_your_own_truly_secure_e...](https://raymii.org/s/articles/Set_up_your_own_truly_secure_encrypted_shared_storage_aka_Dropbox_clone.html)

~~~
uint32
Same principle, just different mechanics and assumptions.

I can't work (very effectively) in two places at once, so I don't need robust
merging, just CYA synchronisation. Using only rsync features I can do a full
2-way rsync merge and catch potential conflicts, erring on the conservative side,
so I have reasonable confidence that I don't lose any work.

Minimal workstation dependencies: only bash, encfs, rsync, ssh,
coreutils/findutils and optionally atd for automation. encfs is optional, too.

Instead of dvc-autosync and XMPP I just use periodic execution. I partition my
stuff into smaller/high-frequency vs larger/lower-frequency to keep this
efficient. These are triggered from bash (PROMPT_COMMAND, in the background)
and recursive at (atd).

The local data is unencrypted on disk from this tool's POV. I use encfs
--reverse and rsync the result. To browse the history, I combine sshfs with
encfs in forward mode.

Linux only because that's what I use, but it should be possible to support
OSX.

All in all I'm pleased I'm able to use such minimal tooling for such a
successful result.

------
xorcist
I tried some backup software (of the rdiff variety, not the amanda variety)
last year when I set up a small backup server for friends and family.

Obnam and bup seemed to work mostly the way I wanted, but obnam was by far
the most mature tool, so that is what I chose in the end.

On the plus side, it provides both push and pull modes, and encryption and
expiration both work. The minus points are no Windows support and some horror
stories about performance. Apparently it can slow to a crawl with many files.
I haven't run into that problem despite hundreds of gig in the backup set, but
most are large files.

On the whole it's been very stable and unobtrusive during the time I've used
it, but I haven't used it in anger yet. So a careful recommendation for obnam
from me.

------
franole
Does anyone use zpaq[1]? It has compression, deduplication, incremental
backup, encryption, backup versioning (unlike bup, with the ability to delete
old ones), and it's written in C++. But I'm not sure about its performance over
the network and how it compares with bup or rsync.

[1] [http://mattmahoney.net/dc/zpaq.html](http://mattmahoney.net/dc/zpaq.html)

------
mynegation
Deleting old backups and the lack of encryption is what stopped me from using
bup.

~~~
lelutin
some ppl have already started working on this and there's been activity on the
mailing list lately about this topic.

however it's a dangerous feature to add in (backup tools should _never_ screw
up their storage, and this feature goes and removes things) so a lot of care
is needed.

the boring answer is: it's coming, and we need a lot of help for vetting
patches and testing them out.

~~~
metrix
I think naming backups and archiving the old files should be separate
applications - that could be generic to whatever backup tool you decide to
use.

------
jlebar
Adding a plug for git-annex. [https://git-annex.branchable.com/](https://git-
annex.branchable.com/)

git annex is for more than just backups. In particular, it lets you store
files on multiple machines and retrieve them at will. This lets you do backups
to e.g. S3, but it also lets you e.g. store your mp3 collection on your NAS
and then easily copy some files to your laptop before leaving on a trip. Any
changes you make while you're offline can be sync'ed back up when you come
back online.

You can prune old files in git-annex [1], and it also supports encryption.
git-annex deduplicates identical files, but unlike Attic &co, it does not have
special handling of incremental changes to files; if you change a file, you
have to re-upload it to the remote server.

git-annex is actively developed, and I've found the developer to be really
friendly and helpful.

[1] You can prune the old files, but because the metadata history --
basically, the filename to hash mapping -- is stored in git, you can't prune
that. In practice you'd need to have a pretty big repository with a high rate
of change for this to matter.

 _Edited for formatting._

~~~
glandium
[http://git-annex.branchable.com/not/](http://git-annex.branchable.com/not/)

"git-annex is not a backup system."

------
eli
Is there an easy way to have the backups encrypted at rest? That's a nice
feature of Duplicity. I don't have to worry about someone who hacks my backup
server or borrows my USB drive getting access to my data.

~~~
lelutin
currently bup doesn't implement encryption (since it's a pretty hard feature
to get right and we do want to finish coding other key features -- like old
backup removal -- before we get to that)

some ppl have reported using an encrypted storage backend like ecryptfs to
store their bup repositories in. that option shouldn't be too hard to put
together.

~~~
eli
Makes sense, I'd probably prioritize pruning old backups first too. bup looks
quite cool regardless, I'll be keeping an eye on it.

I like & use Duplicity, but it's a little finicky to set up right and some of
the more obscure features are counterintuitive or even broken.

------
keehun
This seems like a fantastic tool, and I would love to try this out. And, it's
free!

My personal obstacle in using a tool like bup is the backup space. I could
definitely use this for on-site/external storage devices, but I also like to
keep online/cloud copies. I currently use CrashPlan for that, which affords me
unlimited space. If CrashPlan would let me use their cloud with bup, wow, I
would switch in a heartbeat. Perhaps cloud backup tools could learn some
tricks from bup.

~~~
cpach
You might want to have a look at Tarsnap:
[https://www.tarsnap.com/efficiency.html](https://www.tarsnap.com/efficiency.html)

~~~
keehun
I'm aware of Tarsnap and it looks very attractive, but I'd rather pay a flat rate,
because I store a lot of video (I'm a musician) and other large files. I have about
750GB stored, pre-deduplication. That's a lot more $ if I go with Tarsnap.

------
zanny
If you want a fantastic graphical frontend for bup, there is kup, which is a
kde app: [http://kde-
apps.org/content/show.php/Kup+Backup+System?conte...](http://kde-
apps.org/content/show.php/Kup+Backup+System?content=147465)

It is really easy to set up which folders to back up and where, and I use it
whenever a backup is simply "take all files from X, do the rolling backups at
Y, and done".

------
rcthompson
If you're considering using it, keep in mind the limitations:
[https://github.com/bup/bup/blob/master/README.md#things-
that...](https://github.com/bup/bup/blob/master/README.md#things-that-are-
stupid-for-now-but-which-well-fix-later)

The one most likely to be a showstopper seems to be: "bup currently has no way
to prune old backups."

~~~
Smudge
I can see how this would be theoretically possible in the same way I could see
using `git filter-branch` to remove one or more commits from a code
repository. But as it requires walking back up the tree to recalculate all of
the commit hashes based on the new state of your files, I suspect it would be
an extremely slow/expensive operation in bup's case. Someone who knows more
about bup's internals can correct me if I'm wrong.

~~~
anoother
Isn't this exactly what Obnam does?

~~~
Smudge
I suspect that Obnam doesn't cause all of the de-duplicated chunks to get so
"entagled" (as bup puts it, in their readme).

There are existing discussions about the way bup pruning would have to work:

[https://groups.google.com/forum/#!searchin/bup-
list/prune/bu...](https://groups.google.com/forum/#!searchin/bup-
list/prune/bup-list/86j9ovQ-TaI/NOZPfOAu8WkJ)

Sounds like they are actually making progress on it.

------
labianchin
I've been using duply [http://duply.net/](http://duply.net/) for a while. It
is a simple frontend for duplicity
[http://duplicity.nongnu.org/](http://duplicity.nongnu.org/). I find it very
easy to set up. It also provides encrypted backups through GPG.

------
konradb
There's also Burp which is worth a look
[http://burp.grke.org/index.html](http://burp.grke.org/index.html)

Looking at
[http://burp.grke.org/burp2/08results1.html](http://burp.grke.org/burp2/08results1.html)
it seems it can outperform Bup in some situations.

------
fragmede
> That is a dataset which is already deduplicated via copy-on-write semantics
> (it was not using ZFS deduplication because you should basically never use
> ZFS deduplication).

Can someone more experienced with ZFS say why?

~~~
aidenn0
"Basically never" is an overstatement, but it is true to the point of "Never
unless you already know why I said 'basically never'"

It boils down to the fact that ZFS maintains a mapping from hashes to LBNs.
This allows write-time deduplication (as opposed to a scrubber that runs
periodically and retroactively deduplicates already written blocks). This is
somewhat memory intensive though. For smaller ZFS pools you can get away with
just having lots of RAM (and with or without dedupe ZFS performs better the
more RAM you have). For larger ones, you can add a SSD to act as additional
disk cache.

Here's a quick description of that setup:

[https://blogs.oracle.com/brendan/entry/test](https://blogs.oracle.com/brendan/entry/test)

Note in this example that they were already showing 128GB of RAM for a 17TB
pool; the L2ARC was to augment that. In general, ZFS was designed with a much
higher RAM/Disk ratio than a workstation typically has.
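
For a rough sense of the numbers, assuming the commonly quoted figure of roughly
320 bytes of dedup-table entry per unique block: a 17TB pool of 128K records is
on the order of 140 million entries, i.e. about 320 B × 140M ≈ 45GB of table that
you really want resident in ARC/L2ARC. Back-of-the-envelope only; the actual
per-entry size varies.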

~~~
pwildani
ZFS is also very far away from the state of the art in online dedup. For
instance, [http://users.soe.ucsc.edu/~avani/wildani-
icde13dedup.pdf](http://users.soe.ucsc.edu/~avani/wildani-icde13dedup.pdf) has
a theoretical dedup regime that needs only 1% of the RAM for 90% of the
benefit.

------
0x0
This looks very interesting as a replacement for rdiff-backup. Hopefully the
missing parts aren't too far away (expire old backups, restore from remote).

------
jshb
Can this new tool do incremental realtime disk image backup like Acronis True
Image?

~~~
lelutin
do you mean something that would automatically update the backup when a file
is changed on disk?

bup currently doesn't do that. but there's been some talk of using inotify or
another such method of knowing exactly which files are modified when they are
so that bup could instantly work on those.

in theory it should be feasible; it's not implemented yet, however.

~~~
williamstein
I had my Ph.D. student (Andrew Ohana) spend a while last summer implementing
exactly this using python-inotify, since I wanted it to greatly improve the
efficiency of [https://cloud.sagemath.com](https://cloud.sagemath.com), which
makes very frequent snapshots. It's pretty solid and is on github:
[https://github.com/ohanar/bup/tree/bup-
watch](https://github.com/ohanar/bup/tree/bup-watch). He's been busy with his
actual math thesis work and teaching, so he hasn't gotten this upstreamed into bup yet.
It also depends on changes he made to bup to store the index using sqlite
instead of some custom format.

------
greensoap
Given that old backups cannot be removed, isn't backuppc a better solution?

~~~
lelutin
backuppc is a mature technology and can very well be trusted for collecting
backups.

there are some areas in which bup is trying to innovate in order to change how
backups are considered (storage size being one of those)

