Hacker News new | comments | show | ask | jobs | submit login

There is a Better Way. Instead of using fixed sized blocks, use variable sized blocks. Decide the block boundaries using the data in the blocks themselves. This will reduce your search from O(n^2) to O(n).

Tarsnap does this. My project (ddar) does the same.

I've heard that called 'anchored hashing'. Its recently been used by FreeArc: http://encode.ru/threads/43-FreeArc?p=28237&viewfull=1#p...

How does ddar do it?

ddar calculates a 32-bit Rabin fingerprint over a sliding 48-byte window (a Rabin fingerprint is a type of checksum which allows quick calculation of a "slide" with just the previous fingerprint, the new byte coming in, and the old byte going away). The fingerprint is compared against a mask which has n bits set, where 2^n is the desired target block size. If there's a match, then the location of the window is taken to be a block boundary. With random data this should lead to a binomial distribution of block sizes centred around the target block size. But minimum and maximum block sizes are also enforced for pathological cases which skews this slightly.

Blocks whose boundaries are determined by this algorithm are hashed (SHA256 currently). The hash is used to key each block in a reference-counted key/value store.

Then each archive member is just a list of block hashes.

If you were to elaborate on and brush up the description, it would make an excellent HN submission. Good stuff anyways.

I wrote a library that does this. I haven't posted it publicly, because it's part of a much larger project (not yet complete), but could put it on github if anybody is interested.

There are implementations in Lua, Java, and C.

This would be really cool.

I'll ping you on twitter when I post it.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact