

Golang implementation of Bentley/McIlroy compression - jgrahamc
https://github.com/cloudflare/bm

======
mutagen
For those that don't know their RFCs / compression algorithms (like me):

Original paper:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep1&type=pdf)

RFC: [http://tools.ietf.org/html/rfc3284](http://tools.ietf.org/html/rfc3284)

John Graham-Cumming's amusing blog post applying the algorithm to everyone's
favorite Rick Astley song: [http://blog.jgc.org/2012/06/compression-of-lyrics-
of-never-g...](http://blog.jgc.org/2012/06/compression-of-lyrics-of-never-
gonna.html)

~~~
jgrahamc
I think this was a more fun compression of that song:
[http://blog.jgc.org/2012/06/animated-solution-never-gonna-
gi...](http://blog.jgc.org/2012/06/animated-solution-never-gonna-give-
you.html)

------
twotwotwo
I babbled at you on Twitter earlier--I wrote a differ somewhat like this
that's at
[https://github.com/twotwotwo/dltp/blob/master/diff/diff.go](https://github.com/twotwotwo/dltp/blob/master/diff/diff.go).

First, confession that my code style, interface, etc. are fairly awful
(panics, mixing Reader/Writer with passing []bytes around, and there's even a
line commented 'why?'). Can't justify it; I just never properly cleaned up the
first thing I got working. I also suspect you've got some performance wins
over my code--you probably saved overhead by not calling encoding/binary for
varints, for example, and your 'radix' constant (257) almost certainly makes
the hashing faster (the multiply even optimizes to an 'lea' instruction, I think).
Also, (de)serializing the dicts is handy and probably crucial for your use
case.

Here are some things we're doing differently--just to document them, not
claiming that anything is a win:

The rolling hash: We're using similar multiply-add-modulus hashes. I'm relying
on uint32 wraparound for the modulus (Go spec says you can rely on
wraparound), and I'm doing another multiply when I subtract values out of the
hash instead of using a 'save' array.

Blocks vs hash bits: I'm stealing a trick from rzip, where instead of saving
hashes of non-overlapping blocks, I save hashes whenever a certain number of
bits of the hash are zero. I don't know what works better, empirically.

Min. match length: I used 24 after trying out various values. Too low and you
find short matches when you could get longer ones, too high and you miss
matches. The right value is probably data-dependent anyway.

The hash table: I'm using a 128k-entry array that's directly indexed by some
bits of the rolling hash. Because I'm hashing a lot of documents and only
using each hashtable once, I worked out a scheme to reuse the array for
multiple diff tasks without generating garbage or zeroing it: I made the values in the
array indices into _all the bytes this MatchState has ever hashed_ , not into
the current document. After fetching an offset out of the hash table, I check
if it's before the start of the latest doc (if h < base) and ignore it if so,
and otherwise subtract 'base' from it to convert it into an offset into the
current doc. Costs something during hashing and matching, but clearing the
table was ultimately costing me more.

The encoding: I'm encoding each copy/literal as a protobufs-style signed
varint. Positive numbers give a number of literal bytes to copy into the
output, negative numbers give a number of bytes to be copied from the
reference text, zero means end of diff. Copy lengths are followed by another
signed varint that gives the location in the reference doc where the copy
should start, relative to a "cursor" position. Having that "cursor" allows a
copy that starts right after the end of the last copy to have a slightly
shorter encoding. Short encoding of matches isn't that critical anyway,
compared with doing well at _finding_ matches, so that part I sort of overdid.

Things I was intrigued by but haven't actually tried:

\- rzip separates the 'instructions' ('insert X literal bytes', 'copy Y bytes
from position Z in the original') from the text data. For rzip, that seems to
improve the secondary compressor (bzip)'s compression ratio. I'm curious if it
helps.

\- I think the Git packfile format makes the literal instruction always a
single byte, but the max. literal len is 127. I wonder if that saves output
bytes on net.

\- Lots of other packers look at multiple match candidates for the longest
match. It would probably make smaller diffs, but not at all sure that the
complexity and CPU-time costs would be worth it.

\- I could probably eke out a small CPU-time win by matching from "the ends"
of the input first, since often one can find longish matches there without
hashing.

\- I wonder if there's any win in checking whether a match can be extended
backwards to completely cover a preceding match. It seems complicated, and
probably not a huge win.

Thanks a lot for open sourcing this. When I get time, may try dropping bm into
the program I was playing with. (And both my code and my words above are
probably a little fried, forgive--the words are rushed, and the code was
nights-and-weekends stuff and my first Go project ever.)

------
hblanks
Thanks, John Graham-Cumming! It's neat to see (what at least might be) some
part of the Railgun making it back into the Go community.

~~~
jgrahamc
Thanks. We've been striving to open source as much stuff as possible (both
large and small). [http://cloudflare.github.io/](http://cloudflare.github.io/)

One thing I've been using Github for is keeping an archive of all the talks
I've given and related code and data: [https://github.com/cloudflare/jgc-
talks](https://github.com/cloudflare/jgc-talks)

------
redbad
Frustratingly non-idiomatic Go code :(

~~~
jgrahamc
What is non-idiomatic about it?

~~~
redbad
A mix of lint-y and style problems, and overuse of named return parameters.

[https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L4...](https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L47)

\-- comment block should start with Dictionary.

[https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L5...](https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L52)

\-- comment should precede the declaration.

[https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L6...](https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L67)
and others

\-- spurious newlines

[https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L7...](https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L76)

\-- needless named return parameters

[https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L1...](https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L139)

[https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L1...](https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L125)

[https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L2...](https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L245)

(many others)

\-- prefer early return or continue over if/else

[https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L1...](https://github.com/cloudflare/bm/blob/master/src/bm/bm.go#L156)

\-- more boilerplate due to the needless decision to use named return params

~~~
pkulak
Named return parameters are unfortunate. Probably better to never use them
unless you have to (like if you are "catching" a panic). Otherwise, I think
you are nitpicking a bit.

~~~
frou_dh
That usage with _recover_ is so you can set return values that would
otherwise come back as their zero values, correct?

~~~
pkulak
Exactly. You can't return from the outer function in a defer, but you do have
access to named return values. It's a bit of a kludge.

