
Bup: Efficient file backup system based on the git packfile format - tekacs
https://github.com/bup/bup
======
tobias2014
Another backup possibility I currently use: ZFS on a backup server (not
necessarily ZFS on the system being backed up). Pull the data with rsync on
the backup host into a ZFS filesystem, then take a snapshot to get an
"incremental backup".

So simplified it's like: rsync -avx remote:/etc /backup/ && zfs snapshot
backup@`date`

With zfSnap
([https://github.com/graudeejs/zfSnap](https://github.com/graudeejs/zfSnap))
you can control how long incremental backups/snapshots are kept, e.g. "rsync
&& zfSnap -d -a 1w backup".

You can take advantage of the /backup/.zfs/snapshot directory to access all
snapshots, as well as ZFS's built-in compression and optional data
deduplication.

If you also have ZFS on the remote host, you can use zfs send and zfs receive
to transfer the snapshot directly to the backup server, instead of using rsync
for the diff.

~~~
tekacs
rsync.net is an example of a host that does something like this (with daily
snapshots):

    
    
        $> ssh rsyncnet ls .zfs/snapshot
        daily_2014-02-09
        daily_2014-02-10
        daily_2014-02-11
        daily_2014-02-12
        daily_2014-02-13
        daily_2014-02-14
        daily_2014-02-15

They also allow you to customise how long these snapshots are kept (IIRC), at
the cost of the incremental extra storage.

~~~
res0nat0r
rsync.net looks awesome, but it is just way too expensive for me.

~~~
rsync
You should email us and ask about the "new customer HN reader discount".

Also, note that 1TB accounts are 15c per GB per month, and 10TB accounts are
7c per GB per month - all with no traffic or usage costs.

This compares very favorably with S3 and blows Mozy Pro's pricing out of the
water[1].

[1]
[https://mozy.com/product/mozy/business](https://mozy.com/product/mozy/business)

------
gcr
Bup is lovely. I used it to back up my huge home folder and only switched away
to rdiff-backup because (at the time) there was no support for deleting old
revisions.

Is there any support for that? (Of course, for a large enough hard drive, it's
not much of a problem...)

~~~
rlpb
I wrote ddar, which is basically this but solves that particular problem by
using something other than the git packfile format.

[http://www.synctus.com/ddar](http://www.synctus.com/ddar) and
[http://github.com/basak/ddar](http://github.com/basak/ddar)

It's recently been made available on Homebrew, too.

~~~
aristidb
Nice. Could you add a description there of how the deduplication works? I
assume it uses rolling checksums to create chunk boundaries, just like bup?

~~~
rlpb
> I assume it uses rolling checksums to create chunk boundaries, just like
> bup?

That's right.
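For anyone curious, the general technique (content-defined chunking with a rolling checksum) can be sketched in a few lines of Python. This is a toy illustration with made-up window/mask parameters, not bup's or ddar's actual algorithm:

```python
import random

random.seed(42)
TABLE = [random.getrandbits(32) for _ in range(256)]  # random value per byte
WINDOW = 48            # rolling checksum window, in bytes
MASK = (1 << 13) - 1   # cut when the low 13 bits are set: ~8 KiB avg chunks


def _rol32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def split(data):
    """Split bytes into chunks wherever the buzhash-style rolling
    checksum of the last WINDOW bytes hits the magic MASK pattern.

    Boundaries depend only on local content, so inserting bytes near the
    start shifts the first chunk but leaves later chunks byte-identical,
    which is what makes deduplication work."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = _rol32(h, 1) ^ TABLE[b]
        if i >= WINDOW:
            # Cancel the contribution of the byte that left the window.
            h ^= _rol32(TABLE[data[i - WINDOW]], WINDOW)
        if (h & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Store the resulting chunks keyed by a strong hash of their contents and repeated or shifted data deduplicates automatically, since mostly the same chunks come back out.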

------
natch
How do people who would use this kind of thing manage to have remote servers
with terabytes of available disk space on them?

Anything is possible with money, of course, but how is this anything other
than really expensive?

For example, AWS S3 would be $235/month (that's $2,820/year!) for 3TB, not
even including any data-out transfer charges. Sure, there are others that are
cheaper, but only marginally so.

Is this really what people are doing? Makes the commercial services sound
really cheap.

~~~
apenwarr
My suggestion is to back up your cloud servers, which are expensive and
redundant and have good uplink speeds, to home servers, which are cheap and
have good downlink speeds. You don't need your backup file server to be
ultra-reliable or even up all the time, so the cheapest possible PC sitting
on a home internet connection is a pretty good choice. That way, 3TB is just
$150 or so plus your electricity, and it's not a per-month fee.
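Using the rough figures quoted in this thread (the $150 drive and natch's $235/month S3 estimate above), the back-of-the-envelope math is stark:

```python
# Rough cost comparison with the numbers quoted in this thread:
# a one-time ~$150 3TB drive at home vs. ~$235/month for 3TB on S3.
drive_cost = 150      # USD, one-time (plus electricity)
s3_monthly = 235      # USD per month

months_to_break_even = drive_cost / s3_monthly
first_year_savings = s3_monthly * 12 - drive_cost

print(months_to_break_even)  # ~0.64: the drive pays for itself in weeks
print(first_year_savings)    # 2670 (USD), ignoring electricity
```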

~~~
mverwijs
I'm curious: how would you restore that much data quickly from a home uplink
to the cloud?

~~~
cbhl
You would send the hard drive to Amazon via snail mail (AWS's Import/Export
service accepts physical disks).

A mass restoration is expected to be rare, so it's okay for it to be a bit
more expensive.

------
atso
I hear about rdiff-backup, but I think it has two main drawbacks:

* no new release on its webpage since 2009;

* no de-duplication.

I was considering moving my 5+-year-old rdiff-backup setup to one of these
new, promising programs:

* obnam [[http://liw.fi/obnam/](http://liw.fi/obnam/)]

* attic [[https://pythonhosted.org/Attic/](https://pythonhosted.org/Attic/)]

They both do automatic de-duplication, old backup deletion and remote
encryption.

~~~
limmeau
The original rdiff-backup author went on to create duplicity. Maybe a big part
of the rdiff-backup community has followed him?

I'm using obnam now.

~~~
sciurus
For a link, [http://duplicity.nongnu.org/](http://duplicity.nongnu.org/)

I back up locally using duplicity, then ship those files off to Amazon
Glacier using mt-aws.

[https://github.com/vsespb/mt-aws-glacier](https://github.com/vsespb/mt-aws-glacier)

------
mattdeboard
I'm assuming you're not the author. But just in case the author wanders by:
How did you decide which parts to write in C?

~~~
apenwarr
It's not too hard, actually. A line of Python is roughly 80x slower than a
line of C (no exaggeration), but a typical line of Python does a lot more
than a typical line of C. So things you can do with a "loose" loop (like once
per 64k block) are usually OK in Python; things you have to do with a "tight"
loop (like once per byte) need to be in C.
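A small illustration of that rule of thumb (not bup's code, just a sketch): both functions below compute the same CRC, but the first loops once per 64k block and lets zlib's C code do the per-byte work, while the second loops once per byte in Python and is dramatically slower.

```python
import zlib


def crc_loose(data, blocksize=64 * 1024):
    """'Loose' loop: one Python iteration per 64k block.
    The per-byte work happens inside zlib.crc32, which is C."""
    crc = 0
    for i in range(0, len(data), blocksize):
        crc = zlib.crc32(data[i:i + blocksize], crc)
    return crc


def crc_tight(data):
    """'Tight' loop: one Python iteration per byte -- exactly the kind
    of loop that is worth pushing down into C."""
    crc = 0
    for i in range(len(data)):
        crc = zlib.crc32(data[i:i + 1], crc)
    return crc
```

Both return the same value, because crc32 can be chained through its second argument; timing them on a few megabytes of input shows the per-byte version losing by a couple of orders of magnitude.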

I once did a presentation about python performance optimization lessons from
bup:
[http://lanyrd.com/2011/pycodeconf/sghxk/](http://lanyrd.com/2011/pycodeconf/sghxk/)

And it's true, I'm not the most active maintainer anymore. The people who took
over seem to be doing a pretty good job though.

~~~
tekacs
I'd be curious to know whether your stance on PyPy has changed at all since
2011 (if it's something you've taken a fresh look at since), given their
progress in that time.

For my part, I'd humbly submit that my own position has shifted to seeing
PyPy as a viable option for high-speed code (in substantial part due to its
better interaction with C these days).

------
atmosx
The most efficient backup system I've used so far for operating systems is
'tarsnap'. The only drawback is that restores are really slow.

~~~
tekacs
I love Tarsnap, but its S3 storage costs aren't exactly brilliant, and
figuring out exactly how you want to store the keys takes some upfront
thought (albeit one that arises from the increased security you get 'for
free').

Meanwhile, CrashPlan and other consumer-style services have a bad habit of
using very slow, heavy, world-slowing system-wide file scanning. :/

That said, I'm fairly certain I've seen lengthy discussions of the merits and
flaws of Tarsnap and similar backup services on previous HN posts.

([https://news.ycombinator.com/item?id=5767116](https://news.ycombinator.com/item?id=5767116)
is a good source for lots of that sort of discussion)

------
leephillips
Not really closely related, but another solution that uses git infrastructure
to back up large files is git-annex:

[https://git-annex.branchable.com/](https://git-annex.branchable.com/)

~~~
rsync
... and there are some tangential benefits to using git-annex as well:

[http://rsync.net/products/git-annex-pricing.html](http://rsync.net/products/git-annex-pricing.html)

------
brimstedt
So, it's 2014 and people still use homegrown variations of tar, rsync, git
and whatnot. Or a half-done solution like this, or an abandoned solution like
Box Backup.

Why on earth isn't there already a perfect cross-platform open source backup
program? :)

I know, I know... why don't I make one myself? Because we don't need another
half-done solution :-b

~~~
gcr
Every "perfect cross platform open source program" had to start as a homegrown
variation of tar, rsync, git, and whatnot. It's not like they fall out of the
sky.

~~~
brimstedt
Of course things don't come out of nothing; I just find it so odd that we
have top-quality open source OSes, monitoring systems, programming languages,
IDEs, browsers, graphics suites, etc., but no backup tool...

~~~
pjc50
Not having a backup isn't a "pain point" until you have a disaster. And
disasters are infrequent enough...

------
gesman
This is an outstanding project with great potential.

A killer for many overpriced commercial services.

~~~
natch
Where would you host the backups?

~~~
gesman
How about here:

[http://www.soyoustart.com/ca/en/offers.xml](http://www.soyoustart.com/ca/en/offers.xml)

------
kimjotki2
A 'backup system' that runs on Python. How oxymoronic.

~~~
jrockway
Seriously. I write all my critical software in assembly so that my super-fast
disks and networks aren't bottlenecked by unnecessary CPU instructions! Backup
software always values speed over correctness!

