Bup 0.01: It backs things up

cperciva · on Jan 4, 2010

This is actually very similar to the "multitape" utility I wrote back in October/November 2006 which turned into "multitar" (when I integrated it with libarchive and made several major optimizations) in January-May 2007 before turning into Tarsnap (when I added encryption and an online storage protocol).

Important differences between bup and multitape: I used C, not Python; I used a more sophisticated chunking algorithm; I used tape names rather than just numbering them.

e1ven · on Jan 4, 2010

I'm curious- Was this ever released anywhere? I can see you mentioned a multitape layer in tarsnap, but I'm curious if you released a standalone version of the archiving tool.

Hopefully you don't view bup as competition for tarsnap. There are a lot of situations, such as Xen backups, where I really want to dump things to a local backup machine, then off to tape. I'm not interested in any form of hosted service, but tools that make binary diffs efficient and easy would certainly be welcome.

It might also be interesting to expand bup (or multitape/multitar, if it is public), to use the method described in your thesis paper- http://www.daemonology.net/bsdiff/

I apologize for potentially touching on a delicate subject. I certainly wouldn't want to come across as advising someone to use your theories to steal food out of your mouth, but local vs. remote backup seem to be sufficiently different markets.

cperciva · on Jan 4, 2010

Was this ever released anywhere?

No; I never really considered it to be useful except as a step towards Tarsnap.

Hopefully you don't view bup as competition for tarsnap

Not really, no. Of course, the author could follow the same path as I took, of integrating this with tar code, and end up producing a competitor to Tarsnap.

It might also be interesting to expand bup (or multitape/multitar, if it is public), to use the method described in your thesis paper- http://www.daemonology.net/bsdiff/*

That doesn't really work. Binary diffs are about comparing old and new files to produce a small patch; snapshotting compares the new file to a list of parts of the old file*. It's a tradeoff between needing more local state (with binary diffs you need to have the old file to compare against) and having larger deltas (with snapshotting, you have a new chunk even if only part of it changed).
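To make the tradeoff concrete, here is a minimal sketch of the snapshotting side. It is not how bup, multitape, or Tarsnap are implemented: it assumes fixed-size chunks and SHA-256 keys purely to keep the example short (real tools pick chunk boundaries from the content), and `CHUNK_SIZE`, `split`, `snapshot`, and `restore` are names invented for this illustration.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed size; real tools use content-defined boundaries


def split(data):
    """Cut data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def snapshot(data, store):
    """Store chunks not seen before.

    Only the chunk hashes already in `store` are consulted; the old file
    itself is never needed. Returns (list of chunk hashes, bytes newly stored).
    """
    recipe, added = [], 0
    for chunk in split(data):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:
            store[key] = chunk
            added += len(chunk)
        recipe.append(key)
    return recipe, added


def restore(recipe, store):
    """Rebuild a file from its list of chunk hashes."""
    return b"".join(store[key] for key in recipe)


store = {}
old = b"".join(bytes([n]) * CHUNK_SIZE for n in range(4))  # four distinct chunks
new = old[:5000] + b"\xff" + old[5001:]                    # change a single byte

old_recipe, old_added = snapshot(old, store)
new_recipe, new_added = snapshot(new, store)

print(old_added)  # 16384: every chunk of the first snapshot is new
print(new_added)  # 4096: one changed byte costs one whole chunk
```

The second snapshot stores a full 4096-byte chunk for a one-byte change, which is the "larger deltas" cost. A binary diff could encode that change in a few bytes, but only if the old file is still available locally to compare against.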

That said, my experience with bsdiff was certainly useful in terms of shaping how I think about efficient deltas and compression, so even though none of the ideas translate directly it definitely helped me in writing Tarsnap.

pasbesoin · on Jan 6, 2010

Nit: That URL needs cleansing by adding a space between it and the trailing asterisk when submitting (a problem with HN's parsing of '❄blah❄' markup that I've noticed/noted before; I seem to recall the breakage scenario as occurring when the markup is used at the end of a paragraph).

Test case: The last word of this sentence -- sans any trailing punctuation -- has the markup

Err... nope: It looks like it's only when wrapping a URL in the markup http://www.google.com/* a̶n̶d̶/̶o̶r̶ ̶w̶h̶e̶n̶ ̶t̶h̶a̶t̶ ̶U̶R̶L̶ ̶i̶s̶ ̶a̶t̶ ̶t̶h̶e̶ ̶e̶n̶d̶ ̶o̶f̶ ̶a̶ ̶p̶a̶r̶a̶g̶r̶a̶p̶h̶ ̶̶h̶t̶t̶p̶:̶/̶/̶w̶w̶w̶.̶g̶o̶o̶g̶l̶e̶.̶c̶o̶m̶/̶̶

Looks like I need to break some continuing italicization now.

Trying the same sort of markup but also wrapping a trailing space produces http://www.google.com/ , which is better.

----

Where ❄ represents an asterisk.

wisty · on Jan 4, 2010

Backing up the file system solves the wrong problem. The real problem is that applications (for example, VMs and databases) don't automatically map onto the file system, and even sophisticated users (like Jeff Atwood) don't find this very easy. I can see 3 ways around this:

- Applications could register with the back-up utility, so bup (for example) knows to get a hotcopy from the svn repository and a dump from the database.

- Applications could be told to dump to specific locations on the file system on a regular basis.

- Unix magic could be used, so reading from certain parts of the file system would trigger a dump from the appropriate applications. I'm not quite sure if this is possible (I'm a Unix weenie).

I don't care which is best (they would all work). The real solution would have the following features:

- Automated nags (SVN-style) about dirty looking locations.

- A white-list of locations to suppress nags on (the same way SVN can be set to ignore the /bin directory, and the .pyc files).

- A way to resolve the nags (i.e. telling the backup server what commands to run in order to backup certain applications).

- A file format for backup hints (left in a hidden file called .bup in the program's main directory), so applications could automatically give hints to the backup program on how to get them to dump.
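As a rough sketch of how the features above might fit together: everything here is hypothetical, since no such hint format exists in bup. It assumes a `.bup` file made of `key: value` lines with a `dump` key naming the command to run, and it treats any location with no hint that is not on the white-list as worth a nag. `parse_hints` and `plan_backup` are invented names, and nothing is actually executed.

```python
import shlex


def parse_hints(text):
    """Parse a hypothetical .bup hint file: one 'key: value' pair per line."""
    hints = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        hints[key.strip()] = value.strip()
    return hints


def plan_backup(locations, whitelist):
    """Decide what to run and what to nag about.

    `locations` maps a path to the text of its .bup file, or None if it
    has none. Returns (commands to run, paths to nag about).
    """
    commands, nags = [], []
    for path, hint_text in sorted(locations.items()):
        hints = parse_hints(hint_text) if hint_text else {}
        if "dump" in hints:
            commands.append(shlex.split(hints["dump"]))
        elif path not in whitelist:
            nags.append(path)
    return commands, nags


locations = {
    "/srv/svn/project": "# hypothetical hint file\n"
                        "dump: svnadmin hotcopy /srv/svn/project /var/backups/svn-project\n",
    "/var/lib/db": None,  # no hint: the backup tool should complain
    "/usr/bin": None,     # no hint, but white-listed below
}
commands, nags = plan_backup(locations, whitelist={"/usr/bin"})

print(commands)
print(nags)
```

Resolving a nag would then amount to either writing a hint file for that location or adding it to the white-list.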

A nice GUI that auto-suggests backup commands (with shell integration like TortoiseSVN) would be cool, but not essential.

I've tried to use "applications" consistently in the post. It could mean a database, a repo, a web server, or anything, as long as the "application" has some sane way to be backed up.

And yes, I do know that talk is cheap.

idlewords · on Jan 4, 2010

Jeff Atwood is a poor example to cite here. His data loss had nothing to do with the subtleties of how applications map onto the file system; it was just due to carelessness. His published advice to others (host images on S3 and back up your files to a different machine than the one you're on) would have prevented the whole mess.

wisty · on Jan 6, 2010

Yeah, but his host messed up by not being able to back up virtual machines.

idlewords · on Jan 6, 2010

That's like saying his disk messed up by crashing.

Braaf · on Jan 4, 2010