

Time and Space Tradeoffs in Version Control Storage - johns
http://www.ericsink.com/entries/time_space_tradeoffs.html

======
rarrrrrr
This is why SpiderOak (and I suspect Tarsnap, and probably others) uses a
completely different approach to storing historical versions, one that doesn't
explicitly involve deltas.

Instead, they chunk files and store individual data blocks, and use an
adaptation of the rsync algorithm to be able to tell exactly which bytes have
changed from one version to the next. Any segments of changed bytes are stored
as new data blocks. This has two big advantages: 1) any version can be
restored by retrieving exactly the set of data blocks that compose it, and 2)
any version in the chain can be purged, with a sort of garbage collection
determining which blocks can then be removed.
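A minimal sketch of that idea, not SpiderOak's or Tarsnap's actual code: the `BlockStore` class and its method names are hypothetical, blocks are keyed by SHA-256 as an assumption, and the caller supplies the chunks (a real system would find chunk boundaries with the rsync-style scan described above).

```python
import hashlib


class BlockStore:
    """Hypothetical content-addressed store: versions are lists of block digests."""

    def __init__(self):
        self.blocks = {}    # digest -> block bytes, each stored once
        self.versions = {}  # version id -> ordered list of digests

    def put_version(self, version_id, chunks):
        manifest = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            # Unchanged blocks are already present, so only new data is added.
            self.blocks.setdefault(digest, chunk)
            manifest.append(digest)
        self.versions[version_id] = manifest

    def restore(self, version_id):
        # Any version is rebuilt directly from its own blocks, no delta chain.
        return b"".join(self.blocks[d] for d in self.versions[version_id])

    def purge(self, version_id):
        del self.versions[version_id]
        # Garbage collection: keep only blocks some remaining version references.
        live = {d for manifest in self.versions.values() for d in manifest}
        for digest in list(self.blocks):
            if digest not in live:
                del self.blocks[digest]


store = BlockStore()
store.put_version("v1", [b"aaaa", b"bbbb", b"cccc"])
store.put_version("v2", [b"aaaa", b"XXXX", b"cccc"])  # one block changed
store.purge("v1")  # drops only the block that v2 does not use
```

Purging here scans every manifest, which is fine for a sketch; reference counts per block would be one way to avoid the full scan.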

The adaptation of the rsync algorithm has to be able to work with an arbitrary
number of data block sizes (normal rsync has a max of 2 sizes: one uniform
size for all blocks except the last, and the size of the trailing block).
This can make it very slow if not handled smartly.

Another option the author might consider when making the delta chain is to
compress the versions first with zlib using the --rsyncable option, and then
make the deltas between the compressed versions. Normally compression means
that two very similar uncompressed files will have very different compressed
output, because even small changes cascade through the rest of the stream.
--rsyncable fixes that, and adds about 1% to the compressed size.
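To illustrate the general idea behind an rsync-friendly compressor, here is a rough sketch using Python's `zlib` module. It is not the actual `--rsyncable` implementation: the function name, the window size, and the modulus are made-up values for illustration. The compressor state is reset (`Z_FULL_FLUSH`) whenever a rolling sum of the most recent input bytes hits a chosen value, so reset points depend only on nearby content and a small edit should stop cascading once the next reset point is reached.

```python
import zlib


def rsyncable_compress(data, window=32, modulus=512):
    """Compress `data`, resetting the deflate state at content-defined points.

    `window` and `modulus` are illustrative assumptions, not gzip's values.
    """
    comp = zlib.compressobj()
    out = []
    rolling = 0  # sum of the last `window` input bytes
    start = 0    # start of the chunk not yet handed to the compressor
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        # The boundary test looks only at the last `window` bytes, so the
        # same boundaries reappear after an edit once the window passes it.
        if i + 1 >= window and rolling % modulus == 0:
            out.append(comp.compress(data[start:i + 1]))
            out.append(comp.flush(zlib.Z_FULL_FLUSH))  # reset dictionary
            start = i + 1
    out.append(comp.compress(data[start:]))
    out.append(comp.flush())  # Z_FINISH: close the stream
    return b"".join(out)


sample = b"some versioned file contents\n" * 200
packed = rsyncable_compress(sample)
restored = zlib.decompress(packed)  # an ordinary zlib stream, reads back as usual
```

Each reset costs a little compression, which is where the small size overhead comes from.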

~~~
visitor4rmindia
zlib with the --rsyncable option sounds very interesting. Do you have any
pointers to more info? I'd really like to use it but I can't find any
references on <http://www.zlib.net/>

~~~
rarrrrrr
Oops, it's actually a patch to gzip, not zlib.

It's included in newer Debian/Ubuntu distros and explained in "man gzip". I
think this is the original patch:
<http://www.samba.org/netfilter/diary/gzip.rsync.patch>

------
amalcon
I don't see why a reverse diff chain would necessarily require re-encoding all
previous versions on every commit. A commit would be something like this:

\- Diff the new and current versions

\- Place the new version and the diff in the repository, remove the current
version

\- Verify the integrity of the current version

which executes in time independent of the number of versions.
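Roughly, the steps above could look like the sketch below. Everything here is hypothetical (`ReverseDeltaRepo`, `make_delta`, `apply_delta`), the delta format is an ad hoc copy/data list built with `difflib`, and the verification step is interpreted as "check that the diff really rebuilds the old version before discarding it".

```python
from difflib import SequenceMatcher


def make_delta(source, target):
    """Ops that rebuild `target` (list of lines) from `source`."""
    ops = []
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse lines from source
        elif tag in ("replace", "insert"):
            ops.append(("data", target[j1:j2]))   # lines only target has
        # "delete": source lines absent from target, nothing to emit
    return ops


def apply_delta(source, ops):
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(source[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class ReverseDeltaRepo:
    def __init__(self):
        self.head = None   # newest version, stored in full
        self.deltas = []   # deltas[k] rebuilds version k from version k + 1

    def commit(self, new_lines):
        if self.head is not None:
            delta = make_delta(new_lines, self.head)
            # Verify before the old full copy is dropped.
            if apply_delta(new_lines, delta) != self.head:
                raise ValueError("delta does not reproduce previous version")
            self.deltas.append(delta)
        self.head = list(new_lines)

    def checkout(self, version):
        lines = self.head
        for delta in reversed(self.deltas[version:]):
            lines = apply_delta(lines, delta)
        return list(lines)


repo = ReverseDeltaRepo()
repo.commit(["a", "b"])
repo.commit(["a", "b", "c"])
repo.commit(["a", "x", "c"])
oldest = repo.checkout(0)
```

Each `commit` does one diff against the current head and never touches the older deltas; only `checkout` of an old version walks the chain.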

He might be referring to the zlib encoding, where if you always put the
current version at the "beginning" of a single large file, everything will need
to be re-encoded. This is a limitation of zlib, not of encoding in general.
Besides, you should be able to work around it (even using vanilla zlib) by
keeping separate, uncompressed indices for the beginning of each diff, and
then keeping the current version at the end of the file instead.
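One way that workaround might be laid out, as a sketch only: `AppendLog` is a hypothetical name, an in-memory `BytesIO` stands in for the file, and each entry is compressed as its own zlib stream so the uncompressed index can point straight at it.

```python
import io
import zlib


class AppendLog:
    """Hypothetical layout: [diff 0][diff 1]...[current version at the end]."""

    def __init__(self):
        self.data = io.BytesIO()  # stand-in for the on-disk file
        self.index = []           # uncompressed (offset, length) per diff
        self.head = None          # (offset, length) of the current version

    def _append(self, payload):
        blob = zlib.compress(payload)       # independent stream per entry
        offset = self.data.seek(0, io.SEEK_END)
        self.data.write(blob)
        return offset, len(blob)

    def commit(self, new_version, reverse_diff=None):
        if self.head is not None:
            # Cut the old full copy off the tail and put its diff there.
            self.data.truncate(self.head[0])
            self.index.append(self._append(reverse_diff))
        self.head = self._append(new_version)

    def read(self, entry):
        offset, length = entry
        self.data.seek(offset)
        return zlib.decompress(self.data.read(length))


log = AppendLog()
log.commit(b"version 1")
log.commit(b"version 2", reverse_diff=b"diff 2->1")
log.commit(b"version 3", reverse_diff=b"diff 3->2")
current = log.read(log.head)
```

Earlier diffs are never rewritten: a commit only replaces the tail of the file and appends one index entry.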

