
I wrote a very similar tool before I knew about bup - ddar (https://github.com/basak/ddar - with more documentation at http://web.archive.org/web/20131209161307/http://www.synctus...).

Others here have complained that bup doesn't support deleting old backups. ddar doesn't have that problem: deleting a snapshot works just fine (all other snapshots remain).

I think the underlying difference is that ddar uses SQLite to keep track of the chunks, whereas bup is tied to git's pack format, which isn't really geared towards large backups. git's pack files are expected to be rewritten, which works fine for code repositories but not for terabytes of data.

It's true that git's pack files are designed to be rewritten, but bup never does that. Each run creates a new pack along with its .idx (which means some packs may be nearly empty), and pack size is capped at 1 GB (giga, not gibi).

The real challenge for bup is knowing whether a hash is already stored, and knowing it screaming fast. It would be interesting to compare bup's approach with the standard SQLite approach you use in ddar.
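To make the comparison concrete, here's a minimal sketch of the two lookup styles. This is hypothetical illustration code, not bup's or ddar's actual implementation: an indexed SQLite table on one side, and a plain in-memory set standing in (very loosely) for bup's .idx/midx lookup structures on the other.

```python
# Hypothetical sketch: two ways to answer "is this chunk hash already stored?"
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (hash BLOB PRIMARY KEY)")

def have_chunk_sqlite(digest: bytes) -> bool:
    # ddar-style (assumed): one lookup against an indexed column per chunk
    row = db.execute("SELECT 1 FROM chunks WHERE hash = ?", (digest,)).fetchone()
    return row is not None

def store_chunk_sqlite(digest: bytes) -> None:
    db.execute("INSERT OR IGNORE INTO chunks (hash) VALUES (?)", (digest,))

# Stand-in for bup's pack-index lookups: membership test in memory.
seen: set[bytes] = set()

digest = hashlib.sha1(b"some chunk of backup data").digest()

assert not have_chunk_sqlite(digest)
store_chunk_sqlite(digest)
assert have_chunk_sqlite(digest)

seen.add(digest)
assert digest in seen
```

The interesting part is what each side costs at scale: the set is fast but must fit in RAM (bup splits this across per-pack indexes), while SQLite pushes the problem to a B-tree on disk.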

Also, it seems ddar stores each object in its own file, like git's loose objects. SQLite has a page [0] that compares this approach with storing blobs in SQLite, and I don't know what's the median size of your objects, but if it's < 20k it seems better to just store them as blobs.

[0] https://www.sqlite.org/intern-v-extern-blob.html
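For illustration, here is a hedged sketch of the two storage strategies that page benchmarks: small chunks as BLOB columns inside the database versus flat files named after their hash. The schema and function names are made up for the demo, not ddar's actual layout.

```python
# Hypothetical sketch: internal (BLOB column) vs external (flat file) storage.
import hashlib
import os
import sqlite3
import tempfile

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blobs (hash BLOB PRIMARY KEY, data BLOB)")
blob_dir = tempfile.mkdtemp()

def store_internal(data: bytes) -> bytes:
    # Content-addressed blob inside the database file.
    digest = hashlib.sha1(data).digest()
    db.execute("INSERT OR IGNORE INTO blobs VALUES (?, ?)", (digest, data))
    return digest

def load_internal(digest: bytes) -> bytes:
    return db.execute("SELECT data FROM blobs WHERE hash = ?",
                      (digest,)).fetchone()[0]

def store_external(data: bytes) -> bytes:
    # Content-addressed blob as a flat file named after its hash.
    digest = hashlib.sha1(data).digest()
    path = os.path.join(blob_dir, digest.hex())
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return digest

def load_external(digest: bytes) -> bytes:
    with open(os.path.join(blob_dir, digest.hex()), "rb") as f:
        return f.read()

chunk = b"x" * 1024
assert load_internal(store_internal(chunk)) == chunk
assert load_external(store_external(chunk)) == chunk
```

The SQLite page's finding is roughly that below ~20k per blob the database wins (one open file, no per-blob filesystem metadata), while large blobs favour flat files.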

> I don't know what's the median size of your objects but if it's < 20k...

I can't remember the exact number off the top of my head, but I designed the average object size to be much bigger - more like 64-256M than kilobytes - which IMHO works far better for backups. So I just use the filesystem to store the blobs.

Correction: I'd forgotten the details. It looks like I aimed for 256k, and that's the size that works well. I did take filesystem performance into account when I chose it, since I intended flat-file blob storage.

Doesn't that negate the benefit of deduplication if you're working with multi-megabyte objects? You'll end up copying a lot, unless I'm missing something obvious...

Most backups contain very large regions of duplication. If a small file has changed, chances are the small files around it have changed too, so deduplicating with a larger chunk size works fine in practice.
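The resynchronisation property behind this can be shown with a toy content-defined chunker. This is a generic sketch, not ddar's actual algorithm, and the window and target chunk sizes are shrunk for the demo: a cut point depends only on the bytes in a small rolling window, so unchanged regions chunk identically even after an insertion shifts everything.

```python
# Toy content-defined chunking: boundaries depend only on local bytes.
import hashlib
import random

WIN = 16     # rolling window size (bytes); real tools use larger windows
MASK = 0x3F  # cut when the low 6 bits are all set: ~64-byte average chunks

def chunks(data: bytes):
    """Yield chunks whose boundaries are chosen by window content alone."""
    start = 0
    for i in range(WIN - 1, len(data)):
        if (sum(data[i - WIN + 1:i + 1]) & MASK) == MASK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

random.seed(0)
a = bytes(random.randrange(256) for _ in range(4096))
b = b"INSERTED" + a  # the same stream with 8 bytes inserted at the front

ca = [hashlib.sha1(c).hexdigest() for c in chunks(a)]
cb = [hashlib.sha1(c).hexdigest() for c in chunks(b)]

# Only chunks touching the insertion differ; the rest line up again,
# so almost everything deduplicates despite the shift.
assert len(set(ca) & set(cb)) >= len(ca) - 2
```

With fixed-size chunks the insertion would shift every later boundary and nothing after it would deduplicate; content-defined cuts are what let the average chunk size be a tunable trade-off (index overhead vs. dedup granularity) rather than a correctness issue.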
