

Compression algorithms used by Google - helwr
http://feedblog.org/2008/10/12/google-bigtable-compression-zippy-and-bmdiff/

======
barrkel
In-memory compression is another technique not to be forgotten. If you can
efficiently compress your in-memory representations of data you would
otherwise have to hit disk or network for, you can dramatically increase the
amount of data you can cache in memory, especially if you're aware of the
structure and general patterns of the data.

Once I had to cache file paths, millions of them. I got a fairly easy 6x
improvement by storing them in a Patricia trie, with each node's string stored
as a range of UTF-8 encoded bytes and indexed by arrays of bitfields, i.e.
bit-packed arrays. Child nodes were similarly indexed using bitfield arrays.
Turning what would be a 32-bit (or worse, 64-bit) pointer into, say, a 28-bit
integer really starts to add up when you have millions of them.
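
A minimal sketch of the bit-packing part of this, assuming nodes live in one
flat array and are referred to by index rather than by pointer (the 28-bit
width, the names, and the spare-word trick are illustrative assumptions, not
the actual code described above):

    #include <stdint.h>
    #include <stddef.h>

    /* References are 28-bit indices into a node array, packed back to back
       in a uint32_t array.  The array is assumed to be allocated with one
       spare word so the two-word window below never reads out of bounds. */
    #define REF_BITS 28u
    #define REF_MASK ((1u << REF_BITS) - 1)

    /* Read the i-th packed 28-bit reference. */
    static uint32_t ref_get(const uint32_t *packed, size_t i)
    {
        size_t bit = i * REF_BITS;
        size_t word = bit / 32;
        unsigned off = (unsigned)(bit % 32);
        uint64_t window = (uint64_t)packed[word]
                        | ((uint64_t)packed[word + 1] << 32);
        return (uint32_t)((window >> off) & REF_MASK);
    }

    /* Write the i-th packed 28-bit reference (value must fit in 28 bits). */
    static void ref_set(uint32_t *packed, size_t i, uint32_t value)
    {
        size_t bit = i * REF_BITS;
        size_t word = bit / 32;
        unsigned off = (unsigned)(bit % 32);
        uint64_t mask = (uint64_t)REF_MASK << off;
        uint64_t window = (uint64_t)packed[word]
                        | ((uint64_t)packed[word + 1] << 32);
        window = (window & ~mask) | ((uint64_t)(value & REF_MASK) << off);
        packed[word] = (uint32_t)window;
        packed[word + 1] = (uint32_t)(window >> 32);
    }

Each node's label works the same way: an (offset, length) pair into one shared
UTF-8 byte pool instead of a separately allocated string. At a million
references, 28 bits each is roughly 3.5 MB versus 8 MB for raw 64-bit
pointers, before counting what the shared pool saves on per-string headers and
allocator overhead.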

------
edge17
DB2 and Oracle both implement similar compression algorithms. There's nothing
terribly special about Google's compression algorithms.

~~~
MWinther
That they are used by Google makes them somewhat special, in that one would
assume Google put some thought and experimentation into choosing them. Their
use at that scale also amounts to pretty good field-testing.

~~~
edge17
Any company building enterprise-level products puts thought and
experimentation into choosing its algorithms. Google just advertises its
research publicly as part of its branding.

------
seiji
A few months ago I extracted bmdiff from Hypertable to use it standalone. Give
it a try: <http://bitbucket.org/mattsta/bmz/src>

~~~
coderdude
Thanks for extracting bmz into a standalone library. I actually went searching
for it after reading the article's comments last night and came across your
link. One question, though: how up to date is this version of bmz? It says
it's from a Hypertable commit dated Jan 25, 2009. Would I be able to find a
newer version in Hypertable's source?

