Google releases snappy, the compression library used in Bigtable (code.google.com)
235 points by tonfa on Mar 22, 2011 | 83 comments

I did a double-take when I saw this -- the library is called "zippy" internally, but there must have been some kind of trademark issue with that.

This is used in more than just BigTable; it's used extensively inside Google, in particular for the RPC system. It's great to see it open-sourced!

There are quite a lot of hits about random projects being called zippy.

Maybe the issue was regarding the comic strip, "Zippy the pinhead"?

My first association was with the character from the UK children's TV show Rainbow: http://en.wikipedia.org/wiki/Zippy_(Rainbow)

I doubt it's a trademark issue, as trademarks tend to be usage-specific. (e.g. Apple Computer vs. Apple Music) It's probably just a matter of Google feeling there are too many different things already associated with the name 'Zippy.'

(I also now have the "Rainbow" theme tune stuck in my head.)

Do you remember go?

Go is common enough that it is untrademarkable. Zippy? Not so much, hence the name change.

I wonder if they had evaluated LZO (http://www.oberhumer.com/opensource/lzo/) before writing this. It is quite well-tested (a variant on it runs in the Mars Rovers) and very very fast: the author reports 16MB/sec on a Pentium 133, on modern architectures it should easily get to the 500MB/sec claimed by snappy.

"The LZO algorithms and implementations are copyrighted OpenSource distributed under the GNU General Public License."

Does this even make sense? Can you apply the GPL to an algorithm? As I understand it, if there's no patent I should be able to implement it with no problems.

People who want to upsell you a commercial version tend to say such things.

    Be warned: the main source code in the 'src' directory 
    is a real pain to understand as I've experimented with 
    hundreds of slightly different versions. It contains 
    many #if and some gotos, and is *completely optimized  
    for speed* and not for readability. Code sharing of the 
    different algorithms is implemented by stressing the 
    preprocessor - this can be really confusing. Lots of 
    macros and assertions don't make things better.
Given the author's statements, I don't know that I'd feel comfortable using LZO in a production environment.

We've used it in our games since Ultimate Spiderman. I got an assembly version working on the PS2 IOP chip (33 MHz; it's the chip used to run PS1 games, and otherwise handles I/O and sound for the PS2).

I got 5 MB/s of decompressed data, so it sped up our disk access. I had to tweak the source code just a little to make sure unaligned 4-byte writes were used, and that made it 2x to 3x faster.

The README says:

  "In our tests, Snappy usually is faster than algorithms in the same class (e.g. LZO, LZF, FastLZ, QuickLZ, etc.) while achieving comparable compression ratios."

Interesting, I had missed that. It also says

> Finally, snappy can benchmark Snappy against a few other compression libraries (zlib, LZO, LZF, FastLZ and QuickLZ), if they were detected at configure time.

I'll do some tests, I'm curious about the results

Please post the results :)

  ~/snappy-read-only $ ./snappy_unittest -norun_microbenchmarks -lzo testdata/*
  testdata/alice29.txt                     :
  LZO:    [b 1M] bytes 152089 ->  82691 54.4%  comp  64.5 MB/s  uncomp 206.4 MB/s
  SNAPPY: [b 4M] bytes 152089 ->  90965 59.8%  comp 171.5 MB/s  uncomp 375.8 MB/s
  testdata/asyoulik.txt                    :
  LZO:    [b 1M] bytes 125179 ->  73217 58.5%  comp  52.5 MB/s  uncomp 173.2 MB/s
  SNAPPY: [b 4M] bytes 125179 ->  80207 64.1%  comp 137.8 MB/s  uncomp 301.7 MB/s
  testdata/baddata1.snappy                 :
  LZO:    [b 1M] bytes  27512 ->  26487 96.3%  comp  24.4 MB/s  uncomp 491.6 MB/s
  SNAPPY: [b 4M] bytes  27512 ->  26675 97.0%  comp 305.2 MB/s  uncomp 1465.7 MB/s
  testdata/baddata2.snappy                 :
  LZO:    [b 1M] bytes  27483 ->  26528 96.5%  comp  24.3 MB/s  uncomp 499.1 MB/s
  SNAPPY: [b 4M] bytes  27483 ->  26724 97.2%  comp 331.8 MB/s  uncomp 1660.4 MB/s
  testdata/baddata3.snappy                 :
  LZO:    [b 1M] bytes  28384 ->  27380 96.5%  comp  24.1 MB/s  uncomp 488.2 MB/s
  SNAPPY: [b 4M] bytes  28384 ->  27476 96.8%  comp 275.7 MB/s  uncomp 1346.5 MB/s
  testdata/cp.html                         :
  LZO:    [b 1M] bytes  24603 ->  11621 47.2%  comp  57.2 MB/s  uncomp 258.5 MB/s
  SNAPPY: [b 4M] bytes  24603 ->  11838 48.1%  comp 190.0 MB/s  uncomp 443.5 MB/s
  testdata/fields.c                        :
  LZO:    [b 1M] bytes  11150 ->   4663 41.8%  comp  73.8 MB/s  uncomp 259.8 MB/s
  SNAPPY: [b 4M] bytes  11150 ->   4728 42.4%  comp 207.3 MB/s  uncomp 431.7 MB/s
  testdata/geo.protodata                   :
  LZO:    [b 1M] bytes 100000 ->  20423 20.4%  comp 115.8 MB/s  uncomp 429.9 MB/s
  SNAPPY: [b 4M] bytes 100000 ->  23488 23.5%  comp 359.0 MB/s  uncomp 625.8 MB/s
  testdata/grammar.lsp                     :
  LZO:    [b 1M] bytes   3721 ->   1781 47.9%  comp  66.8 MB/s  uncomp 311.4 MB/s
  SNAPPY: [b 4M] bytes   3721 ->   1800 48.4%  comp 214.9 MB/s  uncomp 461.2 MB/s
  testdata/house.jpg                       :
  LZO:    [b 1M] bytes 126958 -> 127173 100.2%  comp  20.2 MB/s  uncomp 1420.8 MB/s
  SNAPPY: [b 4M] bytes 126958 -> 126803 99.9%  comp 2037.6 MB/s  uncomp 7578.7 MB/s
  testdata/html                            :
  LZO:    [b 1M] bytes 102400 ->  21027 20.5%  comp 115.1 MB/s  uncomp 423.8 MB/s
  SNAPPY: [b 4M] bytes 102400 ->  24140 23.6%  comp 362.4 MB/s  uncomp 703.5 MB/s
  testdata/html_x_4                        :
  LZO:    [b 1M] bytes 409600 ->  82980 20.3%  comp 120.9 MB/s  uncomp 416.4 MB/s
  SNAPPY: [b 4M] bytes 409600 ->  96472 23.6%  comp 359.0 MB/s  uncomp 698.0 MB/s
  testdata/kennedy.xls                     :
  LZO:    [b 1M] bytes 1029744 -> 357315 34.7%  comp 133.5 MB/s  uncomp 533.6 MB/s
  SNAPPY: [b 4M] bytes 1029744 -> 425735 41.3%  comp 294.1 MB/s  uncomp 432.6 MB/s
  testdata/kppkn.gtb                       :
  LZO:    [b 1M] bytes 184320 ->  71671 38.9%  comp  84.3 MB/s  uncomp 235.9 MB/s
  SNAPPY: [b 4M] bytes 184320 ->  70535 38.3%  comp 231.7 MB/s  uncomp 373.8 MB/s
  testdata/lcet10.txt                      :
  LZO:    [b 1M] bytes 426754 -> 221290 51.9%  comp  57.6 MB/s  uncomp 181.9 MB/s
  SNAPPY: [b 4M] bytes 426754 -> 243710 57.1%  comp 153.7 MB/s  uncomp 341.4 MB/s
  testdata/mapreduce-osdi-1.pdf            :
  LZO:    [b 1M] bytes  94330 ->  76999 81.6%  comp  24.9 MB/s  uncomp 810.9 MB/s
  SNAPPY: [b 4M] bytes  94330 ->  77477 82.1%  comp 709.0 MB/s  uncomp 1669.5 MB/s
  testdata/plrabn12.txt                    :
  LZO:    [b 1M] bytes 481861 -> 294610 61.1%  comp  51.4 MB/s  uncomp 164.8 MB/s
  SNAPPY: [b 4M] bytes 481861 -> 329339 68.3%  comp 129.8 MB/s  uncomp 281.9 MB/s
  testdata/ptt5                            :
  LZO:    [b 1M] bytes 513216 ->  86232 16.8%  comp 119.0 MB/s  uncomp 506.1 MB/s
  SNAPPY: [b 4M] bytes 513216 ->  93455 18.2%  comp 479.8 MB/s  uncomp 670.8 MB/s
  testdata/sum                             :
  LZO:    [b 1M] bytes  38240 ->  17686 46.2%  comp  57.8 MB/s  uncomp 267.0 MB/s
  SNAPPY: [b 4M] bytes  38240 ->  19837 51.9%  comp 194.5 MB/s  uncomp 401.0 MB/s
  testdata/urls.10K                        :
  LZO:    [b 1M] bytes 702087 -> 309320 44.1%  comp  55.7 MB/s  uncomp 265.1 MB/s
  SNAPPY: [b 4M] bytes 702087 -> 357267 50.9%  comp 216.9 MB/s  uncomp 499.3 MB/s
  testdata/xargs.1                         :
  LZO:    [b 1M] bytes   4227 ->   2450 58.0%  comp  56.3 MB/s  uncomp 286.2 MB/s
  SNAPPY: [b 4M] bytes   4227 ->   2509 59.4%  comp 173.8 MB/s  uncomp 397.3 MB/s
This is on a Core2, 64bit, GCC 4.5.2. Someone please port it to C as Linux kernel code so it can be used with zram instead of LZO.
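For anyone who wants to compare the two codecs numerically rather than by eye, here is a minimal Python sketch that parses result lines in the format shown above. The regular expression and the names `parse_line` and `speedup` are assumptions made for illustration; they are not part of snappy_unittest, and the pattern only covers the line layout visible in this output.

```python
import re

# Pattern for one result line, e.g.
#   LZO:    [b 1M] bytes 152089 ->  82691 54.4%  comp  64.5 MB/s  uncomp 206.4 MB/s
# (assumed layout, taken from the output pasted above)
LINE_RE = re.compile(
    r"^\s*(\w+):\s+\[b \d+M\]\s+bytes\s+(\d+)\s+->\s+(\d+)\s+([\d.]+)%"
    r"\s+comp\s+([\d.]+)\s+MB/s\s+uncomp\s+([\d.]+)\s+MB/s"
)


def parse_line(line):
    """Return a dict of fields for a result line, or None for any other line."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    codec, raw, packed, pct, comp, uncomp = m.groups()
    return {
        "codec": codec,
        "raw_bytes": int(raw),
        "packed_bytes": int(packed),
        "ratio_pct": float(pct),
        "comp_mb_s": float(comp),
        "uncomp_mb_s": float(uncomp),
    }


def speedup(baseline, candidate, key):
    """How many times faster the candidate is than the baseline for one field."""
    return candidate[key] / baseline[key]


if __name__ == "__main__":
    lzo = parse_line("  LZO:    [b 1M] bytes 152089 ->  82691 54.4%  comp  64.5 MB/s  uncomp 206.4 MB/s")
    snappy = parse_line("  SNAPPY: [b 4M] bytes 152089 ->  90965 59.8%  comp 171.5 MB/s  uncomp 375.8 MB/s")
    print(f"compression speedup: {speedup(lzo, snappy, 'comp_mb_s'):.2f}x")
```

File-name lines such as `testdata/alice29.txt :` do not match the pattern and come back as `None`, so the whole output can be fed through line by line.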

You should probably mention why someone would choose this library over another compression library. I think good advice would be to use Snappy to compress data that is meant to be kept in memory, as Bigtable does with the underlying SSTables. If you are reading from disk, a slower algorithm with a better compression ratio is probably a better choice because the cost of the disk seek will dominate the cost of the compression algorithm.
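A back-of-the-envelope model makes the trade-off above concrete. Everything in this sketch is an assumption for illustration: the function name `read_latency_ms`, the 10 ms seek, the disk and memory bandwidths, and the two hypothetical codecs (a fast one with a 0.5 ratio at 500 MB/s and a slower one with a 0.35 ratio at 100 MB/s). Decompression speed is taken to be measured in uncompressed bytes per second.

```python
def read_latency_ms(raw_bytes, ratio, decomp_mb_s, seek_ms, transfer_mb_s):
    """Estimated time to fetch and decompress one block, in milliseconds.

    ratio is compressed size / raw size; speeds are in MB/s (10**6 bytes).
    """
    compressed_bytes = raw_bytes * ratio
    transfer_ms = compressed_bytes / (transfer_mb_s * 1e6) * 1e3
    decomp_ms = raw_bytes / (decomp_mb_s * 1e6) * 1e3
    return seek_ms + transfer_ms + decomp_ms


BLOCK = 64 * 1024  # assumed block size

# Assumed spinning disk: 10 ms seek, 100 MB/s sequential transfer.
disk_fast = read_latency_ms(BLOCK, 0.50, 500, seek_ms=10, transfer_mb_s=100)
disk_slow = read_latency_ms(BLOCK, 0.35, 100, seek_ms=10, transfer_mb_s=100)

# Assumed in-memory read: no seek, 10 GB/s copy bandwidth.
mem_fast = read_latency_ms(BLOCK, 0.50, 500, seek_ms=0, transfer_mb_s=10_000)
mem_slow = read_latency_ms(BLOCK, 0.35, 100, seek_ms=0, transfer_mb_s=10_000)

if __name__ == "__main__":
    print(f"disk:   fast {disk_fast:.2f} ms, slow {disk_slow:.2f} ms")
    print(f"memory: fast {mem_fast:.3f} ms, slow {mem_slow:.3f} ms")
```

Under these made-up numbers the codec barely changes the on-disk latency because the seek dominates, so the better ratio is nearly free there, while in memory the decompression speed is almost the entire cost.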

Another incredible internal project open-sourced by Google. I really respect Google's dedication to improving the speed of the internet in general, and to open source.

Of course this benefits them as well, but it's a form of enlightened self-interest that, to me, is very refreshing compared to for example Microsoft, and other companies that only care about their own software/platforms and only release stuff on need-to-know basis.

IMHO, the build system could do with a little work:

* The various bits generated from and added by the autotools shouldn't be committed. autoreconf -i works really well these days. That's INSTALL Makefile.in aclocal.m4 compile config.guess config.h.in config.sub configure depcomp install-sh ltmain.sh missing mkinstalldirs.


* Needs to call AC_SUBST([LIBTOOL_DEPS]) or else the rule to rebuild libtool in Makefile.am won't work.

* A lot of macro calls are underquoted. It'll probably work fine, but it's poor style.

* The dance with EXTRA_LIBSNAPPY_LDFLAGS seems odd. It'd be more conventional to do something like:

and set the -version-info flag directly in Makefile.am. If it's to allow the user to provide custom LDFLAGS, it's unnecessary: LDFLAGS is part of libsnappy_la_LINK. Here's the snippet from Makefile.in:

    libsnappy_la_LINK = $(LIBTOOL) --tag=CXX $(AM_LIBTOOLFLAGS) \
            $(LIBTOOLFLAGS) --mode=link $(CXXLD) $(AM_CXXFLAGS) \
            $(CXXFLAGS) $(libsnappy_la_LDFLAGS) $(LDFLAGS) -o $@
* There should be an AC_ARG_WITH for gflags, because automagic dependencies aren't cool: http://www.gentoo.org/proj/en/qa/automagic.xml

* Shell variables starting with ac_ are in autoconf's namespace. Setting things like ac_have_builtin_ctz is therefore equally uncool. See http://www.gnu.org/s/hello/manual/autoconf/Macro-Names.html :

> To ensure that your macros don't conflict with present or future Autoconf macros, you should prefix your own macro names and any shell variables they use with some other sequence. Possibilities include your initials, or an abbreviation for the name of your organization or software package.

* Use AS_IF instead of directly using the shell's `if`: http://www.gnu.org/software/hello/manual/autoconf/Limitation... and http://www.gnu.org/s/hello/manual/autoconf/Common-Shell-Cons... .

* Consider adding -Wall to either AUTOMAKE_OPTIONS in Makefile.am or as an argument to AM_INIT_AUTOMAKE. If you don't mind using a modern automake (1.11 or later), also call AM_SILENT_RULES([yes]). Even MSYS has automake-1.11 these days.


* Adding $(GTEST_CPPFLAGS) to both snappy_unittest_CPPFLAGS and snappy_unittest_CXXFLAGS is redundant. See this part of Makefile.in:

    snappy_unittest-snappy-test.o: snappy-test.cc
    @am__fastdepCXX_TRUE@   $(CXX) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(snappy_unittest_CPPFLAGS) $(CPPFLAGS) $(snappy_unittest_CXXFLAGS) $(CXXFLAGS) -MT snappy_unittest-snappy-test.o -MD -MP -MF $(DEPDIR)/snappy_unittest-snappy-test.Tpo -c -o snappy_unittest-snappy-test.o `test -f 'snappy-test.cc' || echo '$(srcdir)/'`snappy-test.cc
    @am__fastdepCXX_TRUE@   $(am__mv) $(DEPDIR)/snappy_unittest-snappy-test.Tpo $(DEPDIR)/snappy_unittest-snappy-test.Po
    @AMDEP_TRUE@@am__fastdepCXX_FALSE@      source='snappy-test.cc' object='snappy_unittest-snappy-test.o' libtool=no @AMDEPBACKSLASH@
    @am__fastdepCXX_FALSE@  $(CXX) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(snappy_unittest_CPPFLAGS) $(CPPFLAGS) $(snappy_unittest_CXXFLAGS) $(CXXFLAGS) -c -o snappy_unittest-snappy-test.o `test -f 'snappy-test.cc' || echo '$(srcdir)/'`snappy-test.cc
* snappy_unittest should be in check_PROGRAMS, not noinst_PROGRAMS. That way, it's built as part of `make check`, not `make all`.

I wouldn't be surprised if this used to be part of a big internal build system, and they hacked up an autoconf/automake setup for the public release.

Definitely - this is the case with pretty much every Google component that gets open sourced. The only exceptions are projects like Chromium and Android that were designed from the start to be open-sourced.


Sorry to be a little offtopic, but do you have any pointers for where to learn this kind of knowledge (more "best practices" than "getting started") for autoconf et al? I just started using them and I am sure many of your criticisms would apply to my code; I'd like to do better... Thanks.

In case anyone wants to know, this is my go-to recommendation for "getting started": http://www.lrde.epita.fr/~adl/autotools.html . Extremely thorough, and shows how all the different pieces work together. I'd read it anyway, even though you've got started. It's very, very good.

I've also found Diego Pettenò's "Autotools Mythbuster" to be quite good: http://www.flameeyes.eu/autotools-mythbuster/index.html . His old article "Best practices with autotools" isn't bad, but a little light: http://www.linux.com/archive/articles/114061

You may also want to browse the autoconf and automake tags of Diego's blog: http://blog.flameeyes.eu/tag/autoconf and http://blog.flameeyes.eu/tag/automake

A thorough reading of the autoconf and automake manuals also points out common pitfalls.

My opinion is that many people hate autotools because they don't understand them. It was designed to handle just about any crazy build scenario you can throw at it -- of course it's going to be a little complex.

I started by spending a few hours forcing myself to read the following. It explains how autotools works and (importantly) why.


It is critical to understand what the autotools are actually doing before you just go and copy'n'paste somebody else's configure.ac and/or Makefile.am.

When you are ready to get your feet wet, here is a really good quick summary from the GNOME project:


Once you understand what is going on, finding specifics in the GNU manuals (and being able to interpret them) is much easier. Yes, you should understand how m4 works (it's not rocket science).

Soon, somebody will convert the build system to CMake, move it to github, clean up the insulting directory structure, then nobody will look at their google code page ever again.

People act like CMake is an improvement, but it's not. The language is not very good (lists as semicolon-separated strings, seriously?). For example: its pkg-config support is completely broken. It takes the output of pkg-config, parses it into a list (liberally sprinkling semicolons where the spaces should be) and then the semicolons make it into the compiler command line, causing all manner of cryptic errors.

Stick with automake. Seriously.

CMake is also very difficult to debug (e.g. to find out why a library test is failing), harder to fix once you've debugged it, has strange ways of accepting extra compiler/linker flags from the environment, has poor --help, tries to allow creating Xcode projects but mostly produces nonsense, etc...

I think it might actually be worse than autoconf in every way, which is surprising considering how bad autoconf is. The handwritten non-macro-expanding not-much-autogenerating configure/makefile in ffmpeg/libav/x264/vp8 is easier to deal with than either.

I disagree with you that autoconf is bad. Its design came from a lot of locally-optimal choices that don't look so good in 2011, and there's a lot of legacy code being copied around in people's configure.ac files.

To me, it's not perfect, but it's pretty good. Then again, I'm known to my friends as "that guy who knows automake" :-).

On Windows, automake is an order of magnitude slower to compile than the projects generated by CMake, not to mention that compiling with MSVC is very difficult to make work at all with autotools. automake just isn't a viable option if your projects need to be portable to Windows.

Speed differences like this are often down to process creation.

A lot of automation routines designed on unix-a-like systems involve creating short lived processes with reckless abandon because creating and tearing down a process in most Unix environments is relatively efficient. When you transplant these procedures to Windows you are hit by the fact that creating or forking a process there is relatively expensive. IIRC the minimum per-process memory footprint is higher under Windows too, though this doesn't affect build scripts where generally each process (or group of processes if chaining things together with pipes and redirection) is created and finished with in turn rather than many running concurrently.

This is why a lot of Unix services out there are process based rather than thread based but a Windows programmer would almost never consider a process based arrangement over a threaded one. Under most unix-a-like OSs the difference between thread creation and process creation is pretty small so (unless you need very efficient communication between things once they are created or might be running in a very small amount of RAM) a process based model can be easier to create/debug/maintain which is worth the small CPU efficiency difference. Under Windows the balance is different: creating processes and tearing them down after use is much more work for the OS than operating with threads is, so a threaded approach is generally preferred.
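The difference described above is easy to measure on your own machine. This is a rough sketch only: the function names are made up for illustration, and because each child process here is a full Python interpreter, the per-process figure includes interpreter start-up and overstates the raw OS cost. Treat the numbers as a comparison between platforms, not as absolute costs.

```python
import subprocess
import sys
import threading
import time


def seconds_per_process(count):
    """Average wall time to start a trivial child process and wait for it."""
    start = time.perf_counter()
    for _ in range(count):
        # sys.executable -c "pass" is a child that exits immediately.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / count


def seconds_per_thread(count):
    """Average wall time to start a trivial thread and join it."""
    start = time.perf_counter()
    for _ in range(count):
        worker = threading.Thread(target=lambda: None)
        worker.start()
        worker.join()
    return (time.perf_counter() - start) / count


if __name__ == "__main__":
    print(f"process: {seconds_per_process(20) * 1e3:.2f} ms each")
    print(f"thread:  {seconds_per_thread(20) * 1e3:.3f} ms each")
```

A configure script or recursive make run spawns thousands of such short-lived processes, so the per-process figure is the one that matters for build times.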

I always used MinGW on windows.

OT: If I understand correctly, CMake was purpose-built to support building Kitware's visualization application. Their app uses Tcl as an embedded language; how they could already be using Tcl in their app then insist on building an ad-hoc language into CMake (versus using Tcl, which already supports looping, conditions, variable setting/getting etc) is an occasional wonder to me.

I hate CMake. I hate autotools, too. But, if there's going to be a replacement for autotools, it's gotta be better than CMake. At least autotools is standard on every Linux/UNIX system these days. CMake is just another build dependency for very little gain.

Don't use them. Makefiles are fine. So far i haven't seen the need for configure magic. On the other hand, i do not work on software, which needs to compile on archaic AIX systems.

For a Win/OS X/Linux portable build, a Makefile should suffice. Example: https://github.com/MatzeB/cparser/blob/master/Makefile

That's probably fine as long as you're just compiling .c to .o. I suspect it breaks down in the face of anything more complex. For instance, I've got a problem right now where I need to use objcopy to turn arbitrary binaries into .o files, and that requires knowledge about the toolchain on the user's machine which, as far as I can tell, can only be gathered by compiling a throwaway file and sniffing its output with objdump. That's exactly the sort of task which autotools is good at, but I'm desperately trying to find a different way to get at the information so I don't have to introduce autotools to what is otherwise an already... erm... interesting build chain.

Why not just convert the binary into an array literal in a C source file and compile that?
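A minimal sketch of that approach, written as a Python generator script. The function name `to_c_array` and the emitted symbol names are made up for illustration; the output is ordinary C that any compiler in the toolchain can turn into a .o file, so no objcopy or objdump sniffing is needed.

```python
def to_c_array(data: bytes, name: str, per_line: int = 12) -> str:
    """Render bytes as a C array definition plus a length constant."""
    if not data:
        # An empty initializer list is not valid in older C standards.
        raise ValueError("data must not be empty")
    rows = []
    for offset in range(0, len(data), per_line):
        chunk = data[offset:offset + per_line]
        rows.append("    " + ", ".join(f"0x{byte:02x}" for byte in chunk) + ",")
    body = "\n".join(rows)
    return (
        f"const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned long {name}_len = {len(data)}UL;\n"
    )


if __name__ == "__main__":
    print(to_c_array(b"\x00\xff\x10hello", "blob"))
```

The cost is compile time and source size for large binaries, which is the usual reason people reach for objcopy instead.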

Whoah! Hold on, now. autotools (and CMake) exist for a reason. In many cases they make your life easier than maintaining makefiles. I do think make is much easier to work with (with some warts) than autotools, and it's certainly more comprehensible, since it's so much smaller and contains far fewer bits of magic. But to just throw out everything autotools were designed to deal with doesn't make sense.

Sure, if your build is simple, make works fine. But, the projects I've worked on where autotools was used, simply using make would have been a horrible experience. And, in most cases, the projects started out using make by itself and then moved to using autotools when the number of platform specific makefiles became too big to maintain.

Configuration magic aside, have fun supporting everything in http://www.gnu.org/prep/standards/html_node/Makefile-Convent... . You never know which features your eventual users will rely on. That's one of the main problems automake solves.

Custom-made build systems tend to break conventions that are useful to packagers (destdir) or home-directory installs (prefix) or developers (ccache, though your example isn't guilty of the last one). I'd rather build something that handles all this and integrates well everywhere than see variations that need custom patching.

What are your thoughts instead on djb's redo, which is being implemented and so far working nicely at https://github.com/apenwarr/redo ?

redo and ninja are replacements for make. They have a dependency graph and build it. Tool integration, configuration, feature detection, lifecycle (install, release, deploy…) have to be handled by something else.

I think the tup build tool is a wa-a-ay better solution than CMake. http://gittup.org/tup/

> The various bits generated from and added by the autotools shouldn't be committed. autoreconf -i works really well these days. That's INSTALL Makefile.in aclocal.m4 compile config.guess config.h.in config.sub configure depcomp install-sh ltmain.sh missing mkinstalldirs.

This is a matter of opinion. When you have dependencies which rely on specialist tools, it is a good idea (and accepted practice, though this obviously could be argued), to commit your generated files too. This means that the files don't change depending on the version of autoconf/bison/etc that's installed on a user's machine.

As I recall, years ago some folks wrote up all these "version control best practices" for some conference, and this was one of the "rules". But it's common sense too - autotools is deep enough magic that most people won't know to run bootstrap.sh, or autoreconf -i, or whatever.

(Ironically, google code doesn't agree with this, and won't let you trim generated code like this from the diffs they send. See http://code.google.com/p/support/issues/detail?id=197.)

> As I recall, years ago some folks wrote up all these "version control best practices" for some conference, and this was one of the "rules". But it's common sense too - autotools is deep enough magic that most people won't know to run bootstrap.sh, or autoreconf -i, or whatever.

I disagree. At this point in the game, autotooled systems are pretty common, and to assert that they are deep magic is like saying "CMake is deep enough magic that most people won't know to run mkdir build && cd build && cmake .., or whatever".

Developers are not like most people. A release tarball (created with `make distcheck`, say) will absolutely contain configure and other generated files. That's one of the points of the autotools - to have minimal dependencies at build time.

Developers, on the other hand, will need to look up a project's dependencies, install other special tools (parser generators, for instance). This should be documented somewhere, along with the bootstrap instructions.

These days (in my experience, which I believe is typical), no-one even uses release tarballs, everyone just uses the repositories.

Maybe you can require your developers to install esoteric tools, but that's no way to solicit contributions. For a compiler I worked on, we had a dependency on libphp5, which was enough to discourage anybody. But we also had dependencies on a particular version of flex, gengetopt, bison, autoconf, automake and _maketea_. The final one was a tool we wrote and had spun out into a separate project, which was written in Haskell, and required exactly ghc 6.10. Do you think we were going to get contributions from people who needed to install all of those just to fix a minor issue?

Bad comparison. Yours was an extreme case. Requiring semi-recent versions of autoconf and automake (in addition to any other dependencies the app has) is not unreasonable. They're not "esoteric tools." Neither is the need to run autogen.sh instead of (or before) configure.

Autoconf generates makefiles and such which make assertions about the current state of the machine it ran on. Running a build using those generated files on a different machine (or the same machine in the future!) would seem to be defying reality. I think you have to accept that your build isn't deterministic unless you ensure your build has no dependencies except on tool binaries you check in right along with the code.

I believe you've misunderstood me (I can see how what I said was ambiguous), so let me clarify.

What I mean is that you're using a different version than is being tested by the other authors. So for example, if developer A used flex 2.35, but developer B has flex 2.36 installed on their system, they might get a different result (or a bug, or a security flaw in the resulting binary) etc.

I'll add that this isn't without cause - it has caused me personally many hours of errors.

Note that you can use autoconf to require exact versions, but then you require that the developer install an exact version of bison, even though they aren't touching the parser (or exact versions of autoconf even though they aren't touching the build system).

As I see it, you should either check in the flex and bison binaries themselves (and whatever runtime dependences they have) and only use those during the build, or accept that your build will fail and require maintenance as time goes by, doing whatever's simplest without trying hard to avoid that. Autoconf is intended to facilitate dependencies on components outside source control which happen to exist on a single machine, so it's only appropriate to use at all if you go the latter route.

Most people don't commit Makefile, but do commit configure and Makefile.in (both of which are still generated automatically, but it's not quite as bad as committing config.status or Makefile).

> When you have dependencies which rely on specialist tools, it is a good idea [...] to commit your generated files too.

The problem I encounter regularly is that people have slightly different versions of autotools (e.g., autoconf 2.62 and autoconf 2.63).

When they commit with svn (and are not super careful), several commits have small diffs in the generated files (added newline, '\n' vs. real newline, etc.), making merges painful.

Good point. I generally had the rule to use the version of the tool that was last used (upgrading the tool could be done separately). This was a small chore, but only for the people who touched that portion of the code base (and touching the parser/lexer/IR definitions/build system is individually pretty rare).

When you can measure efficiency improvements like this in millions of dollars, I'm sure this makes a whole hell of a lot of sense. But for anyone below, say, Twitter's scale: is this ever an engineering win over zlib?

Many years ago I had to do a bunch of sorting and merging datasets on some underpowered hardware. I found that I got significant speedups by compressing my data during mergesorts because I was able to squeeze extra passes in RAM before data had to hit disk. (To be clear, I was writing to and from files, but found the data still in cache when I went to read it again.) If compression/decompression were an order of magnitude faster, the win could have been even bigger and more obvious.

Don't think of this as, "We can cheaply compress large amounts of data to save space." Think of this as, "We can compress/decompress stuff on the fly just because it is convenient for us." For the latter kind of usage the efficiency is an enabler, and the fact that the compression is supposed to be temporary makes interacting with other things a complete non-issue.
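As a rough sketch of the merge-sort trick described above, the following keeps each sorted run as a compressed blob and decompresses it incrementally during the merge. It uses zlib at its fastest level purely as a stand-in, since it ships with Python; a faster codec would slot into the same place. The function names and the newline-delimited record format are assumptions for illustration, and records are assumed to be non-empty strings without newlines.

```python
import heapq
import zlib


def compress_run(records):
    """Sort one run and store it as a single compressed blob."""
    payload = "".join(record + "\n" for record in sorted(records)).encode("utf-8")
    return zlib.compress(payload, 1)  # level 1: favour speed over ratio


def iter_run(blob, chunk_size=4096):
    """Yield the records of one run, decompressing a chunk at a time."""
    decomp = zlib.decompressobj()
    pending = b""
    for start in range(0, len(blob), chunk_size):
        pending += decomp.decompress(blob[start:start + chunk_size])
        # Everything before the last newline is a complete record.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line.decode("utf-8")
    pending += decomp.flush()
    for line in pending.split(b"\n")[:-1]:
        yield line.decode("utf-8")


def merge_runs(blobs):
    """Lazily merge several compressed sorted runs into one sorted stream."""
    return heapq.merge(*(iter_run(blob) for blob in blobs))


if __name__ == "__main__":
    runs = [compress_run(["pear", "apple"]), compress_run(["fig", "banana", "zebra"])]
    print(list(merge_runs(runs)))
```

The runs stay small while they sit in RAM or the page cache, and only a chunk of each is expanded at any moment, which is what lets more passes fit before data has to hit disk.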

It's a little difficult to say. One space where there seems to be very little research that I've been investigating recently is compression for database records (or in our case, serialised complex objects stored in a data grid) where the statistical model is built across many records and then is constant during compression and decompression, as opposed to being adaptive. This means that you can exploit global redundancy across many records and you don't need to store the model with every item, which is good if your items are small, you store zillions of them and space is at a premium (i.e. you're storing it in RAM).

These use cases are normally pretty application specific though, so I imagine a lot of in-house code gets written for things like this. Seems like a pretty similar use case to this, I'd be interested in seeing details of their algorithm - I can't see it anywhere obvious on their site.
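One readily available approximation of a model that is shared across records and stored only once is DEFLATE's preset dictionary, exposed in Python's zlib through the `zdict` argument. This is a swapped-in technique chosen because it is in the standard library, not the statistical model the comment above describes. The dictionary contents and the function names below are made up for illustration.

```python
import zlib

# Assumed shared dictionary: byte strings that recur across many records.
# It is stored once, outside the records, and must be identical on both sides.
SHARED = b'{"user_id": , "status": "active", "country": "US"}'


def compress_record(record: bytes, shared: bytes = SHARED) -> bytes:
    """Compress one small record against the shared preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=shared)
    return comp.compress(record) + comp.flush()


def decompress_record(blob: bytes, shared: bytes = SHARED) -> bytes:
    """Recover a record; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=shared)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    record = b'{"user_id": 1234, "status": "active", "country": "US"}'
    with_dict = compress_record(record)
    without_dict = zlib.compress(record, 9)
    print(len(record), len(without_dict), len(with_dict))
```

For records this small, plain per-record compression has almost nothing to work with, while the dictionary version can encode most of the record as back-references into the shared bytes.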

There's a lot to DB compression depending on the type of data access. The PFOR approach is amazing and simple for data arranged in columns without much deviation. For serialized trees you usually end up in LZ* land. But with careful organization it can be tolerable.
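To show the idea behind PFOR for a column without much deviation, here is a heavily simplified frame-of-reference sketch with patched exceptions. It is not the layout used by any real PFOR implementation: real ones pick the bit width per block and store exceptions more compactly. The function names are made up, the values list is assumed to be non-empty, and a Python integer stands in for the packed bit array.

```python
def pfor_encode(values, bits):
    """Pack value - min(values) into `bits` bits each; outliers become exceptions."""
    base = min(values)
    limit = 1 << bits
    packed = 0
    exceptions = []  # (position, original value) for values that do not fit
    for position, value in enumerate(values):
        delta = value - base
        if delta >= limit:
            exceptions.append((position, value))
            delta = 0  # placeholder slot, patched on decode
        packed |= delta << (position * bits)
    return base, bits, len(values), packed, exceptions


def pfor_decode(base, bits, count, packed, exceptions):
    """Unpack the small deltas, then patch the exceptions back in."""
    mask = (1 << bits) - 1
    values = [base + ((packed >> (position * bits)) & mask) for position in range(count)]
    for position, value in exceptions:
        values[position] = value
    return values


if __name__ == "__main__":
    column = [100, 103, 101, 250000, 102, 107]
    encoded = pfor_encode(column, bits=3)
    print(encoded)
    print(pfor_decode(*encoded))
```

Five of the six values fit in 3 bits each relative to the base of 100, and only the outlier is stored in full, which is why the scheme works well when most of a column sits in a narrow range.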

Perhaps a more pertinent question: Is it any better than LZJB or LZO?

The LZO codebase is one of the worst spaghetti code messes I've seen in my life.

It's got a better license than LZJB (and maybe LZO depending on your views).

Doesn't Google use GPL2 all over the place?

Just to use one example, I would assume that the crawler/indexer/ranker "secret sauce" in the appliance can't link against any GPL libraries.

If you don't distribute your GPL-tainted code, I don't believe you need to do anything to comply with the GPL. This is why the Affero GPL exists.

The keyword in the parent comment was "appliance" :)

Doh! Missed that. Yep.

And LZO.

Websites will benefit indirectly from this through packages getting better performance (like databases or caching systems). If this algorithm is as good as Google reported a few years ago this is big, big news.

Note: have you seen the LZO code? I bet not.

Curious, it's written in C++ .

IMHO I think straight C would have been easier for World + Dog to link against.

If exports are within `extern "C"`, it shouldn't be any more difficult.

Had a look at the code, it's quite neat and tidy, i'm really impressed and surprised considering the need for speed/optimisation in libraries like this tends to make the code unreadable...

For pure speed, check out QuickLZ[1]. While it probably doesn't compress as well as Snappy, it does hit 300MB+/core. But, it's GPL instead of Apache.

1: http://www.quicklz.com/

They claim in the README that Snappy is usually faster than QuickLZ. Haven't tested myself, though.

They compare with QuickLZ 1.0 from 2006 (you can see that from snappy_unittest.cc).

I've done a quick benchmark with the files on quicklz.com on the same machine with QuickLZ 1.5.0:

  test file       library         compressed size   compression speed (MB/s)
  average         QuickLZ         47.9%             308
  average         snappy          53.0%             261
  proteins.txt    QuickLZ 1.5.0   35.6%             331
  proteins.txt    snappy          40.5%             232
  plaintext.txt   QuickLZ 1.5.0   48.1%             245
  plaintext.txt   snappy          55.5%             193
  gdb.exe         QuickLZ 1.5.0   45.8%             270
  gdb.exe         snappy          51.1%             214
  flower.bmp      QuickLZ 1.5.0   86.7%             (not given)
  flower.bmp      snappy          91.5%             208
  northwind.mdf   QuickLZ 1.5.0   23.2%             (not given)
  northwind.mdf   snappy          26.4%             456


I work at Google, and I've been working on the Snappy open-sourcing. First of all, let me state that when doing our benchmarks, we were definitely comparing against the latest public version (1.5.0) of QuickLZ. I'm having problems reproducing your benchmark results; could you provide some more details about the exact settings, compiler version, architecture etc.? I can't even get all of the compression ratios to match up exactly with the ones I'm seeing, so there must be some sort of difference between the setups. Are you perchance running Snappy with assertions enabled? (The microbenchmark will complain if you do, so it's easy to check.)

For what it's worth, I ran the QuickLZ test suite on my own machine (you can see the exact testing environment below), and got quite different results:

  qlz-testsuite/flower.bmp                 :
  QUICKLZ: [b 1M] bytes 5922816 -> 5087471 85.9%  comp 132.4 MB/s  uncomp 140.8 MB/s
  SNAPPY: [b 4M] bytes 5922816 -> 5369439 90.7%  comp 180.6 MB/s  uncomp 484.0 MB/s
  qlz-testsuite/gdb.exe                    :
  QUICKLZ: [b 1M] bytes 8872609 -> 4074990 45.9%  comp 177.4 MB/s  uncomp 213.3 MB/s
  SNAPPY: [b 4M] bytes 8872609 -> 4530851 51.1%  comp 224.7 MB/s  uncomp 468.7 MB/s
  qlz-testsuite/northwind.mdf              :
  QUICKLZ: [b 1M] bytes 2752512 -> 628607 22.8%  comp 291.3 MB/s  uncomp 409.6 MB/s
  SNAPPY: [b 4M] bytes 2752512 -> 726437 26.4%  comp 456.6 MB/s  uncomp 790.1 MB/s
  qlz-testsuite/plaintext.txt              :
  QUICKLZ: [b 1M] bytes 2899483 -> 1436466 49.5%  comp 155.4 MB/s  uncomp 182.7 MB/s
  SNAPPY: [b 4M] bytes 2899483 -> 1629408 56.2%  comp 195.2 MB/s  uncomp 487.5 MB/s
  qlz-testsuite/proteins.txt               :
  QUICKLZ: [b 1M] bytes 7254685 -> 2600129 35.8%  comp 218.2 MB/s  uncomp 261.2 MB/s
  SNAPPY: [b 4M] bytes 7254685 -> 2934406 40.4%  comp 269.4 MB/s  uncomp 556.9 MB/s

  Average compression ratio (geometric mean, lower is better): QuickLZ 43.7%, Snappy 48.9%.
  Average compression speed (harmonic mean): QuickLZ 180.9 MB/s, Snappy 238.0 MB/s.
  Average decompression speed (harmonic mean): QuickLZ 212.5 MB/s, Snappy 536.9 MB/s.

In addition, there's one nearly-incompressible file listed on the same page, which I also ran the benchmark on (it's not included in the averages above):

  qlz-testsuite/NotTheMusic.mp4            :
  QUICKLZ: [b 1M] bytes 9832475 -> 9832565 100.0%  comp 289.3 MB/s  uncomp 2815.9 MB/s
  SNAPPY: [b 4M] bytes 9832475 -> 9485433 96.5%  comp 1308.4 MB/s  uncomp 2477.1 MB/s

This is on a Nehalem 2.27GHz, running Debian GNU/Linux 6.0 in 64-bit mode, GCC version 4.4.5, with flags -O2 -g -DNDEBUG for both Snappy (1.0.0) and QuickLZ (1.5.0), running snappy_unittest to compare. QuickLZ was left at the default settings, i.e., the files were exactly as downloaded from http://www.quicklz.com/quicklz.[ch] as of today. In particular, this means QuickLZ was running in unsafe mode, which means it could crash on corrupted compressed input. (Safe mode, which is comparable to how Snappy runs, is, according to http://www.quicklz.com/download.html, 15–20% slower at decompression, so the difference would be bigger.)

Exact benchmark results will, as always, vary with differing compiler, set of flags, CPU, and data set. The average numbers from this benchmark, however, show QuickLZ 1.5.0 (in unsafe mode) compressing about 11% more densely than Snappy 1.0.0 does, but Snappy compressing about 32% faster and decompressing about 153% faster (i.e., about 2.53 times as fast).
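
For readers who want to check the averages, here is a rough sketch of how a geometric mean (for the ratios) and a harmonic mean (for the speeds) can be computed from the per-file figures in the output above. The function and constant names are made up for illustration, and this is not the code snappy_unittest uses. Because the inputs are the rounded per-file numbers, the results should land close to the reported averages rather than match them digit for digit.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean: exp of the mean of the logs. A common choice for ratios.
double GeometricMean(const std::vector<double>& values) {
  assert(!values.empty());
  double log_sum = 0.0;
  for (double v : values) log_sum += std::log(v);
  return std::exp(log_sum / static_cast<double>(values.size()));
}

// Harmonic mean: n divided by the sum of reciprocals. A common choice for rates.
double HarmonicMean(const std::vector<double>& values) {
  assert(!values.empty());
  double reciprocal_sum = 0.0;
  for (double v : values) reciprocal_sum += 1.0 / v;
  return static_cast<double>(values.size()) / reciprocal_sum;
}

// Per-file figures copied from the output above, in the order
// flower.bmp, gdb.exe, northwind.mdf, plaintext.txt, proteins.txt.
const std::vector<double> kQuickLzRatioPct = {85.9, 45.9, 22.8, 49.5, 35.8};
const std::vector<double> kSnappyRatioPct = {90.7, 51.1, 26.4, 56.2, 40.4};
const std::vector<double> kQuickLzCompMBps = {132.4, 177.4, 291.3, 155.4, 218.2};
const std::vector<double> kSnappyCompMBps = {180.6, 224.7, 456.6, 195.2, 269.4};
const std::vector<double> kQuickLzUncompMBps = {140.8, 213.3, 409.6, 182.7, 261.2};
const std::vector<double> kSnappyUncompMBps = {484.0, 468.7, 790.1, 487.5, 556.9};
```

Dividing `HarmonicMean(kSnappyCompMBps)` by `HarmonicMean(kQuickLzCompMBps)` gives the speed ratio behind the "about 32% faster" figure, and the same division on the decompression vectors gives the roughly 2.53x figure.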

/* Steinar */

Sorry, had assertions on. Getting the same results now (using my own benchmark function, though):

  Benchmarking: benchdata/northwind.mdf
  snappy: Compressed 2752512 bytes into 726437 (26.4%) at 653.0 MiB/s
  QuickLZ: Compressed 2752512 bytes into 620933 (22.6%) at 456.8 MiB/s
  Benchmarking: benchdata/gdb.exe
  snappy: Compressed 8872609 bytes into 4530844 (51.1%) at 333.1 MiB/s
  QuickLZ: Compressed 8872609 bytes into 4056279 (45.7%) at 267.6 MiB/s
  Benchmarking: benchdata/pic.bmp
  snappy: Compressed 18108198 bytes into 16561772 (91.5%) at 292.5 MiB/s
  QuickLZ: Compressed 18108198 bytes into 15784615 (87.2%) at 197.2 MiB/s
  Benchmarking: benchdata/plaintext.txt
  snappy: Compressed 2988604 bytes into 1657747 (55.5%) at 289.4 MiB/s
  QuickLZ: Compressed 2988604 bytes into 1436646 (48.1%) at 238.5 MiB/s
  Benchmarking: benchdata/proteins.txt
  snappy: Compressed 7344249 bytes into 2964303 (40.4%) at 371.6 MiB/s
  QuickLZ: Compressed 7344249 bytes into 2586659 (35.2%) at 321.9 MiB/s

I may convert my monthly archive of websites over to snappy; the speed of the compression / decompression will allow me to implement a more consolidated storage scheme than I'm using now.

Sometimes it's odd to think a thousand lines of C++ is something folks were waiting years to see released.

I'm puzzled by snappy-stubs-internal.h, lines 105-118. Why would one log by instantiating a class and not using the result, so that the destructor gets called and writes the log message? Can anyone come up with a reason for this?

It uses the same construct as VLOG above does, which allows you to << an error string into CRASH_UNLESS and VLOG. The weirdness comes in when you want to conditionalize the log stream, which isn't implemented in the stubs here. The real Google logging classes are probably more sophisticated, and this is just a shim to get the snappy code to work unchanged.

Specifically, the answer to your question is that printing in the destructor is there to print a newline right before calling abort, but after having printed whatever was <<'ed to CRASH_UNLESS.
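
To make the destructor trick concrete, here is a minimal sketch of the general idiom. It is not the code from snappy-stubs-internal.h: `ScopedLogLine` and `FormatLine` are hypothetical names, and the sketch buffers the message and writes to a caller-supplied stream so that it can be tested, where the real stub presumably writes straight to stderr. The key property is that a temporary lives until the end of the full expression, so its destructor runs only after every `<<` in the statement has been applied.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical sketch of the idiom; not the snappy stub itself.
// Collects everything streamed into it and finishes the line in the
// destructor, i.e. at the end of the full expression for a temporary.
class ScopedLogLine {
 public:
  explicit ScopedLogLine(std::ostream& sink) : sink_(sink) {}
  ScopedLogLine(const ScopedLogLine&) = delete;
  ScopedLogLine& operator=(const ScopedLogLine&) = delete;

  // Runs after every operator<< in the statement has been evaluated.
  // A crashing variant could call abort() here, after the flush.
  ~ScopedLogLine() { sink_ << buffer_.str() << '\n'; }

  template <typename T>
  ScopedLogLine& operator<<(const T& value) {
    buffer_ << value;
    return *this;
  }

 private:
  std::ostream& sink_;
  std::ostringstream buffer_;
};

// Formats one message through a temporary ScopedLogLine and returns it.
std::string FormatLine(int checked_value) {
  std::ostringstream sink;
  ScopedLogLine(sink) << "check failed: value=" << checked_value;
  // The temporary is already destroyed here, so the newline has been written.
  assert(!sink.str().empty() && sink.str().back() == '\n');
  return sink.str();
}
```

With this shape, a macro can expand to a temporary and let the caller append `<< "details"` after it, which is the convenience the construct is buying.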

Check out libjingle's logging implementation which has some similar weirdness in it: http://code.google.com/p/libjingle/source/browse/trunk/talk/...

Anyone feel like writing a tiny C client? Looks like this comes only as a lib.

I think it is already a command line application (if you use the google cmdline flag lib, see the README).

I don't believe so. Look at the Makefile.am. It only declares a libtool library (lib_LTLIBRARIES = libsnappy.la, which causes libsnappy.so.x.y.z and libsnappy.a to be built), and a unittest (TESTS = snappy_unittest, noinst_PROGRAMS = $(TESTS)).

EDIT: I'm wrong:

> Actually, snappy_unittest is more than just a test, despite the name; in particular, if you have the gflags package installed, it can be used for benchmarking. Thus, it's useful to have it built as part of make all. I'm closing this one (but thanks for all the others!).

