
Brotli Compressed Data Format - Sami_Lehtinen
http://tools.ietf.org/html/draft-alakuijala-brotli-01
======
ot
It looks like the authors took DEFLATE and improved it mainly through better
engineering and by accounting for modern hardware (larger window sizes, ...),
while largely ignoring the theoretical research on LZ77 from the last few
years.

For example, a paper from last SODA [1] shows that by using a better
optimization algorithm and just simple universal codes instead of Huffman, it
is possible to beat zlib in space and compete with Snappy in decompression
speed (and I'm sure the decoder can be further optimized). This is just the
most recent one on the topic, but there have been quite a few.
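For readers unfamiliar with the term: a "universal code" maps integers to bit
strings without needing a symbol-frequency table, unlike Huffman. As an
illustrative sketch (the paper may use a different code), here is Elias gamma
coding, one of the simplest universal codes:

```python
def gamma_encode(n):
    """Elias gamma code for a positive integer n:
    (bit-length of n minus 1) zero bits, then n in binary."""
    assert n >= 1
    b = bin(n)[2:]                     # binary form without the '0b' prefix
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode one gamma-coded integer from a bit string;
    returns (value, remaining bits)."""
    zeros = 0
    while bits[zeros] == "0":          # count leading zeros
        zeros += 1
    value = int(bits[zeros:2 * zeros + 1], 2)
    return value, bits[2 * zeros + 1:]

# Small integers (common as LZ77 match lengths/distances) get short codes:
print(gamma_encode(1))   # "1"
print(gamma_encode(5))   # "00101"
```

No table needs to be stored or transmitted, which is part of why such codes
can be attractive for fast decoders.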

Caveats: compression is fairly slow (but can be improved with heuristics) and
the datasets used in the experiments are big and very repetitive (so it is not
clear how it performs on small files), but still I think that many ideas
should finally find their way into modern compression formats.

[1]
[http://arxiv.org/pdf/1307.3872v1.pdf](http://arxiv.org/pdf/1307.3872v1.pdf)

Disclaimer: the paper authors are from my CS department.

------
jzwinck
This is a bit strange: this comes from Google, who already created another
general-purpose compression scheme, Snappy [1], which is also usable for
streaming. TFA mentions gzip and deflate but not Snappy. Brotli seems to aim
for better compression whereas Snappy aims for speed, but both are based on
LZ77.

Brotli has something else in common with Snappy: the lack of a specified
framing mechanism. When Snappy first came out it had none, so if you wanted to
write Snappy-compressed data to a file, you had to invent your own header to
frame the compressed data. A framing format was later added to Snappy, but by
then it was too late: implementations had already come up with their own,
mutually incompatible framings. The same mistake seems to be happening again
with Brotli.
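To make concrete what "framing" means here: it is just an agreed-on container
around the raw compressed bytes. A toy sketch (entirely hypothetical, not any
real format) of the length-prefixed style many implementations invent:

```python
import struct
import zlib

MAGIC = b"MYFR"  # hypothetical 4-byte magic; not part of any real spec

def frame(chunks):
    """Wrap already-compressed chunks in a toy length-prefixed container.
    Empty chunks are not supported: a zero length marks end-of-stream."""
    out = [MAGIC]
    for c in chunks:
        out.append(struct.pack(">I", len(c)))  # 4-byte big-endian length
        out.append(c)
    out.append(struct.pack(">I", 0))           # end-of-stream marker
    return b"".join(out)

def unframe(data):
    assert data[:4] == MAGIC, "bad magic"
    pos, chunks = 4, []
    while True:
        (n,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if n == 0:
            return chunks
        chunks.append(data[pos:pos + n])
        pos += n

payload = [zlib.compress(b"hello"), zlib.compress(b"world")]
assert unframe(frame(payload)) == payload
```

Every detail above (magic bytes, endianness, terminator) is an arbitrary
choice, which is exactly why independently invented framings end up mutually
incompatible unless the spec nails them down.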

[1] [https://code.google.com/p/snappy/](https://code.google.com/p/snappy/)

~~~
robryk
You're right about the aims of Snappy and Brotli: Snappy's compression ratios
are worse than deflate's because it is LZ77 only: there is no entropy coding
(e.g. Huffman) afterwards. Brotli is similar to deflate in that it does LZ77
first and then compresses the output stream with an entropy code.
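You can see how much each stage contributes with zlib itself. zlib can't drop
the entropy stage the way Snappy does, but it can do the complement: the
Z_HUFFMAN_ONLY strategy disables the LZ77 match search, leaving entropy coding
alone (a rough illustration, not a Snappy comparison):

```python
import zlib

data = b"the quick brown fox " * 200  # highly repetitive input

full = zlib.compress(data, 9)  # LZ77 matching + Huffman coding

# Entropy coding alone: Z_HUFFMAN_ONLY disables the LZ77 match search,
# roughly the complement of what Snappy does (matching, no entropy code).
co = zlib.compressobj(9, zlib.DEFLATED, 15, 9, zlib.Z_HUFFMAN_ONLY)
huff_only = co.compress(data) + co.flush()

print(len(data), len(full), len(huff_only))
```

On repetitive data like this, the full pipeline wins by a wide margin, which
is the gap Snappy trades away for speed from the other direction.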

------
e_e
I'm a little disturbed that

    onclick="javascript:

was common enough in their corpus to end up in the dictionary.

Readable version:
[https://gist.github.com/anonymous/f66f6206afe40bea1f06](https://gist.github.com/anonymous/f66f6206afe40bea1f06)

~~~
duskwuff
The corpus seems oddly constructed. I have to wonder where they got the
English text from -- it seems to have a very academic bias, with lots of
strings containing "University". There's also a lot of redundancy; for
example, I see some near-duplicates, including:

    12010. "University of"
    12129. "University of "
    13106. "the University of"
    13227. "Oxford University"
    13534. " Oxford University"

(I'm sure there are other similar duplicates; I just happened to notice these
while looking at all the "University" entries in the corpus.)

~~~
acqq
It's a really strange standard. For those who didn't dig into it: this is part
of the "static dictionary given in Appendix A", which has to be present in
every implementation.

Especially given that it's supposed to be used to compress fonts.

------
userbinator
I think using a static dictionary is a bit of a "cheat" since it's not truly
"compressing", but just moving some of that data somewhere else, i.e. the
compressor/decompressor increases in size. It also makes it less general.
Compare this with e.g. pure LZ77 or Huffman, where all the data in the output
is coming from the input itself and not somehow embedded in the algorithm.
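The same idea already exists in zlib as "preset dictionaries", which makes the
trade-off easy to demonstrate: the dictionary is shared knowledge that both
sides must carry, in exchange for better ratios on short inputs. A sketch
(the dictionary contents below are made up, loosely echoing Brotli's
web-flavored entries):

```python
import zlib

# A shared dictionary agreed on out of band -- analogous in spirit to
# Brotli's Appendix A dictionary (contents here are invented).
ZDICT = b'onclick="javascript: University of the href="http://'

msg = (b'<a onclick="javascript:void(0)" href="http://example.edu">'
       b'University of Example</a>')

plain = zlib.compress(msg, 9)

# Same deflate, but matches may also point into the preset dictionary.
co = zlib.compressobj(9, zlib.DEFLATED, 15, 9,
                      zlib.Z_DEFAULT_STRATEGY, ZDICT)
with_dict = co.compress(msg) + co.flush()

do = zlib.decompressobj(zdict=ZDICT)   # decoder needs the same dictionary
assert do.decompress(with_dict) == msg
print(len(plain), len(with_dict))      # dictionary wins on short inputs
```

Which illustrates the "cheat" complaint: the bytes saved on the wire are paid
for by every implementation shipping the dictionary.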

Incidentally I really like LZ77 and its variants; it's my favourite
compression algorithm, due to its incredible simplicity (a decompressor is
just a bit over a dozen instructions) and intuitiveness. Thus I've always
considered it odd that Huffman's paper was more than 20 years before LZ's -
the Huffman algorithm is rather more complex. Perhaps the LZ algorithm was
well known already, and considered too trivial to write a paper about? I do
know that many have rediscovered LZ independently, without ever studying data
compression theory.
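The "dozen instructions" claim is easy to believe once you write one out. A
minimal LZ77 decoder over textbook-style (distance, length, literal) triples
(a toy, not any particular file format):

```python
def lz77_decode(tokens):
    """Decode a stream of (distance, length, literal) triples --
    the textbook LZ77 form. literal may be None for a pure match."""
    out = bytearray()
    for dist, length, lit in tokens:
        if length:
            start = len(out) - dist
            for i in range(length):          # byte-by-byte copy so that
                out.append(out[start + i])   # overlapping matches work
        if lit is not None:
            out.append(lit)
    return bytes(out)

# "ab" as two literals, then one overlapping match that extends past
# its own starting point (distance 2, length 6):
tokens = [(0, 0, ord("a")), (0, 0, ord("b")), (2, 6, None)]
print(lz77_decode(tokens))  # b'abababab'
```

The overlapping-match trick (distance smaller than length) is the whole
"intuitiveness" of LZ77 in one line: the output being copied is being produced
as you copy it.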

------
contingencies
TLDR: _Stream-oriented, resource-conscious, lossless compression algorithm to
replace gzip /deflate, with a compression ratio comparable to the best
currently available general-purpose compression methods and which decompresses
much faster than current LZMA implementations_.

------
rossjudson
Don't miss the point here, which is this:

 _Can be produced or consumed, even for an arbitrarily long sequentially
presented input data stream, using only an a priori bounded amount of
intermediate storage, and hence can be used in data communications or similar
structures, such as Unix filters;_

It's supposed to be a _safe_ compression standard.

~~~
userbinator
That's pretty common for compression algorithms - for Huffman the only thing
needed for decoding/encoding is the code table and possibly some adaptive
overhead on top of that, and with LZ it's a buffer the size of the sliding
window. At the moment I can't think of any others that need unbounded space
growing with the size of the input - maybe some of the ones based on Markov
chains?
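zlib's streaming API shows what "a priori bounded intermediate storage" looks
like in practice: you can cap how much output the decoder hands back per call,
regardless of input size. A sketch:

```python
import zlib

# ~10 MB of trivially compressible input; the compressed form is tiny.
data = zlib.compress(b"x" * 10_000_000)

do = zlib.decompressobj()
total = 0
chunk = do.decompress(data, 64 * 1024)   # cap each output chunk at 64 KiB
while chunk:
    total += len(chunk)                  # process the bounded chunk, drop it
    chunk = do.decompress(do.unconsumed_tail, 64 * 1024)
total += len(do.flush())                 # drain anything still buffered
print(total)  # 10000000
```

Memory use is bounded by the chunk cap plus the fixed 32 KB sliding window,
no matter how long the stream runs -- which is exactly the Unix-filter
property the spec is asking for.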

------
isomorphic
Some actual source code here, inside a font compressor:

[https://code.google.com/p/font-compression-reference/source/...](https://code.google.com/p/font-compression-reference/source/browse/#git%2Fbrotli%2Fenc)

------
robryk
The following two sets of slides describe the format and the ways in which it
is meant to be better than deflate:

[http://lists.w3.org/Archives/Public/public-webfonts-wg/2013O...](http://lists.w3.org/Archives/Public/public-webfonts-wg/2013Oct/att-0008/Brotli_Compression_Algortihm_-_motivation.pdf)

[http://lists.w3.org/Archives/Public/public-webfonts-wg/2013O...](http://lists.w3.org/Archives/Public/public-webfonts-wg/2013Oct/att-0008/Brotli_Compression_Algorithm_-_outline_of_a_specification.pdf)

EDIT: They are from October 2013, so they may be inaccurate.

------
twotwotwo
Context: rev 2.0 of the Web font standard was going to use LZMA, but LZMA had
issues (IP concerns, decoding speed, the lack of a clear spec for independent
implementations), so Google proposed a new format.

WOFF report on Brotli:
[http://www.w3.org/TR/WOFF20ER/#candidateb](http://www.w3.org/TR/WOFF20ER/#candidateb)
(and #candidatea says what they thought of LZMA)

Reference code: [https://code.google.com/p/font-compression-reference/](https://code.google.com/p/font-compression-reference/)
(this aims for very slow but good compression, like their zopfli zlib encoder)

Google's presentation:
[https://docs.google.com/presentation/d/1aigINmRR7fw_ml8rz0rJ...](https://docs.google.com/presentation/d/1aigINmRR7fw_ml8rz0rJ3NTv08Qb3n6lZ_qvmxo8CzQ/present)

To the authors, I suspect it was a feature that it's 'just' DEFLATE with 4MB
windows + more context for entropy encoding + tuned-up coding + a static
dictionary + some other stuff -- the WOFF spec mentions similarity to DEFLATE
as a benefit, and they would like to keep it easy enough for others to write
decoders.

I kind of wonder if it was derived from something Google had been using
internally. If you decode the static dictionary in the spec, it has a bunch of
words and phrases from the most-used languages as you'd expect, but also some
things that look like common code fragments from Web pages, which would be
right down Google's alley. And that's not stuff you need for font compression
in particular.

------
zvrba
This looks like (yet another) NIH type of work from google.

------
crazy2be
Presentation announcing the format, and discussing advantages/disadvantages
compared to existing formats:

[https://docs.google.com/presentation/d/1aigINmRR7fw_ml8rz0rJ...](https://docs.google.com/presentation/d/1aigINmRR7fw_ml8rz0rJ3NTv08Qb3n6lZ_qvmxo8CzQ/present)

------
m_mueller
Brotli.. I wonder where the name comes from. It sounds almost like the Swiss
German word 'Brötli' (= small loaf of bread) - but I can't find any
relationship to Switzerland.

~~~
robryk
This is not the only compression related piece of software/spec from Google
named after a kind of bread: there are also gipfeli[1] and zopfli[2].

[1] [https://code.google.com/p/gipfeli/](https://code.google.com/p/gipfeli/)
[2] [https://code.google.com/p/zopfli/](https://code.google.com/p/zopfli/)

~~~
m_mueller
Heh, I knew I've seen this naming pattern somewhere before, thanks for
pointing it out. Well, maybe at least CS people will come to know some Swiss
German words other than Muesli ;).

------
ShirsenduK
What is its Weissman score? :D

~~~
BorisMelnik
hah first thing that came to my mind as well

------
mcot2
Pied piper.

~~~
kapilvt
check the rfc author.. clearly hooli :-)

