
Gzip + poetry = awesome - jvns
http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equals-awesome/
======
ot
Compression is truly fascinating. It's what got me into reading computer
science papers several years ago, and it then became one of my research
topics.

What is shown here is the LZ77 [1] _factorization_ of a string. Compression in
general works by exploiting redundancy in the source, and natural language is
highly redundant, since many words repeat often. Hence the factors in the
factorization often look like frequent words or n-grams.
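
A minimal sketch of what such a factorization might look like, in Python (my
own toy code: quadratic, with no window limit, purely illustrative):

    def lz77_factorize(s):
        # Each factor is either a literal character or an
        # (offset, length) pointer into the already-seen prefix.
        factors, i = [], 0
        while i < len(s):
            best_len, best_pos = 0, -1
            for j in range(i):  # scan the prefix for the longest match
                length = 0
                while i + length < len(s) and s[j + length] == s[i + length]:
                    length += 1
                if length > best_len:
                    best_len, best_pos = length, j
            if best_len > 1:
                factors.append((best_pos, best_len))
                i += best_len
            else:
                factors.append(s[i])
                i += 1
        return factors

    # (0, 8) stands for "while I ", (10, 6) for "ndered"
    print(lz77_factorize("while I pondered, while I wandered"))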

A recent line of research is _grammar compression_, which tries to turn a
text into a _tree of rules_ that generates that text. While still not very
good at general-purpose compression, the generated trees are much more
interesting than the LZ77 factorization, since they "discover" something that
looks like a syntactic parse of the string, finding syllables, words,
phrases, sentences...
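
To make the "tree of rules" idea concrete, here is a hand-written toy grammar
in Python (my own example; algorithms like Sequitur try to infer such rules
automatically):

    # Each rule maps to a sequence of literals and other rule names.
    rules = {
        "S": ["Once upon a midnight dr", "E",
              ", while I pondered, weak and w", "E"],
        "E": ["eary"],  # the shared suffix of "dreary" and "weary"
    }

    def expand(symbol):
        return "".join(expand(part) if part in rules else part
                       for part in rules[symbol])

    print(expand("S"))
    # Once upon a midnight dreary, while I pondered, weak and weary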

The introduction of Craig Nevill-Manning's Ph.D. thesis [2] gives several
examples of inferred grammars for pieces of text, music, fractals, source
code, etc. While the algorithm presented there (Sequitur) is now somewhat
obsolete, the thesis is very interesting because it offers some observations
from a linguistic perspective.

[1]
[http://en.wikipedia.org/wiki/LZ77_and_LZ78](http://en.wikipedia.org/wiki/LZ77_and_LZ78)

[2]
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.1...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.1112&rep=rep1&type=pdf)

~~~
jmmcd
Wow, that PhD thesis is very interesting. I use grammars to generate music
(and other things sometimes), but I would also really like to be able to infer
them.

Since you say it's obsolete, do you have any links to papers or software
related to successors of the Sequitur algorithm?

~~~
kevgnulldev
I do a bit of algorithmic composition as well (mostly microsound/granular
synthesis), using NNs and other graphical models (e.g. HMMs) for higher-level
compositional control... for grammar inference, check out:

[http://www.mitpressjournals.org/doi/abs/10.1162/neco.1992.4....](http://www.mitpressjournals.org/doi/abs/10.1162/neco.1992.4.3.393)

Abstract: We show that a recurrent, second-order neural network using a real-
time, forward training algorithm readily learns to infer small regular
grammars from positive and negative string training samples. We present
simulations that show the effect of initial conditions, training set size and
order, and neural network architecture...

~~~
jmmcd
That sounds perfect -- thanks!

~~~
kevgnulldev
You're welcome. You might like this as well (the PDF is just the front matter
of the book):

[http://download.springer.com/static/pdf/967/bfm%253A978-3-64...](http://download.springer.com/static/pdf/967/bfm%253A978-3-642-66438-0%252F1.pdf?auth66=1382903728_da0c75be6801cbf6bc3be0f7e38a4094&ext=.pdf)

~~~
jmmcd
Hmm, that link doesn't work for me (probably some kind of session id is
needed). Can you post another url?

~~~
kevgnulldev
sorry for the late reply... just saw your post:
[http://www.springer.com/computer/ai/book/978-3-540-88008-0](http://www.springer.com/computer/ai/book/978-3-540-88008-0)

~~~
jmmcd
Thanks! Looks like I have a lot of reading to do...

------
CGamesPlay
Audio is unnecessary. The video shows a slow-motion decoding of a gzipped
version of the poem. The red text between brackets is a chunk of text that
occurred earlier and is included by back-reference (for example, "W{hile I }"
means that "hile I " was previously encoded; it occurred in the substring
"while I pondered"). You can see the red chunks quickly come to occupy most
of the poem, which visually highlights the repetition in the lyrics that the
computer uses to encode the file as gzip.

Pretty neat.
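
A decoder could reproduce that display roughly like this (a toy sketch with a
made-up factor encoding, not gzip's actual bit format):

    # A factor is a literal character or an (offset, length)
    # pointer back into the text decoded so far.
    factors = list("while I pondered, ") + [(0, 8), "w", "a", (10, 6)]

    def render(factors):
        text, shown = "", ""
        for f in factors:
            if isinstance(f, tuple):
                pos, length = f
                start = len(text)
                for k in range(length):
                    text += text[pos + k]  # char-wise copy allows overlap
                shown += "{" + text[start:] + "}"
            else:
                text += f
                shown += f
        return shown

    print(render(factors))  # while I pondered, {while I }wa{ndered}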

~~~
codezero
s/decryption/decompression

~~~
CGamesPlay
Fixed, thanks :)

~~~
codezero
I appreciate your indulging my pedantic nature!

------
heyadayo
Mirror:

[https://web.archive.org/web/20131025035238/http://jvns.ca/bl...](https://web.archive.org/web/20131025035238/http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equals-awesome/)

[http://www.youtube.com/watch?feature=player_embedded&v=D2JWS...](http://www.youtube.com/watch?feature=player_embedded&v=D2JWSNDgkoE)

------
alexholehouse
For some reason I find the theories and approaches in compression really
interesting. For those unfamiliar, I recommend Blelloch's introduction to
compression
([http://www.cs.cmu.edu/~guyb/realworld/compression.pdf](http://www.cs.cmu.edu/~guyb/realworld/compression.pdf))

~~~
willvarfar
Another excellent source is Matt Mahoney's Data Compression Explained:
[http://mattmahoney.net/dc/dce.html](http://mattmahoney.net/dc/dce.html)

~~~
tehwalrus
And yet another is David MacKay, who taught me at undergrad:

[http://www.inference.phy.cam.ac.uk/mackay/itila/](http://www.inference.phy.cam.ac.uk/mackay/itila/)

scroll to the bottom for the free PDF.

(the book is about all sorts of stuff, but compression is the main example
used when defining information theory.)

~~~
kevgnulldev
Along similar lines, but a more focused book (no learning, etc) is the classic
Cover and Thomas:

[http://www.elementsofinformationtheory.com/](http://www.elementsofinformationtheory.com/)

------
mvleming
Wow, this is an incredible visualization of how compression works. I never
understood how it worked before, but the simple mention of pointers and then
that video was all it took for me.

I've always wondered if this is true: as we approach infinite computational
power, does the amount of information we need to represent data decrease?
(Excuse any incorrect use of terminology here.) I think about a number like
pi, which has an infinite number of digits and which, as far as we know,
theoretically contains every message somewhere. So if we just store a pointer
to where the message occurs in the number, and then calculate the digits up
to that pointer, we'll have ourselves the message. Hence, more computational
power, less information needed.

~~~
sillysaurus2
There is a limit beyond which you can't compress data any further, called the
Shannon limit.

Any sequence of bytes is just a number. So if you think of pi as an RNG, then
the chance of finding a run of N digits equal to another number with N digits
at any given position is (1/10) to the power of N, which quickly makes the
search intractable.

And it isn't even proven that the digits of pi are unbiased (pi is only
conjectured to be normal), so finding a particular number of N digits could
be even less likely.

It would be easier to search for that number by randomly seeding an RNG and
then searching the RNG output for the number. Then you could just store the
seed + offset, which may be significantly less than the Shannon limit. But
since the chance of encountering such a number is (1/256)^N, it quickly
becomes impossible. And even if it weren't, the receiver would need to invoke
the RNG [offset] times, which will be a massive number due to the
probabilities involved. So it's not like you could precompute the index: the
receiver still needs to compute the answer, which requires just as many
computations as the sender.

In general, the closer you try to get to the Shannon limit, the more
computation is required. And perfect compression is impossible in practice
except in constrained cases, so I'd speculate that it would require infinite
computational resources.

~~~
ot
> There is a limit beyond which you can't compress data any further, called
> the Shannon limit.

I think you are confusing things a bit here. The Shannon limit has nothing to
do with lossless compression; it refers to communication over _noisy_
channels.

What you are referring to is the Shannon entropy, but even that is not
actually a lower bound on the compression of actual data, because it is
defined for stochastic sources.

When you have an actual string and you want to compress it, the only
meaningful lower bound is the Kolmogorov complexity of the string, that is,
the size of the smallest program that generates that string. It is of course
incomputable, but it is what you would get if you had infinite computing
power.
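
A quick illustration: a highly regular string can be enormous while its
Kolmogorov complexity stays tiny, because a short program generates it.

    s = "0" * 10**6                   # a million-character string
    program = 'print("0" * 10**6)'    # a program that generates it
    print(len(s), len(program))       # 1000000 vs 18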

~~~
sillysaurus2
I appreciate you correcting my terminology, but beyond that you haven't
actually disagreed with anything I've said.

Humans call it the Shannon limit, or Shannon entropy, or the Kolmogorov
complexity of a string, or whatever other names humans have come up with. The
central point is that _there is some limit_ which you can't circumvent
through trickery like "find it embedded within pi", regardless of what
species you are, or what your species chooses to name the fundamental
limitations of nature.

A program to generate digits of pi is small, and pi (if it is normal, as
conjectured) embeds every possible string. But generating digits all the way
up to the right offset is impossible in practice. And it probably won't net
you any savings anyway, since you'll still need to communicate the offset.
The offset's length in digits is probably close to the length of the original
string (after perfect compression), so it's unlikely you'd be able to
circumvent Shannon entropy.
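
Back-of-the-envelope, assuming the digits of pi behave like uniform random
digits:

    # A specific N-digit string first shows up at an expected offset
    # of roughly 10**N, and writing that offset down takes about
    # N + 1 digits, so nothing is saved over sending the string.
    N = 20
    expected_offset = 10**N
    print(len(str(expected_offset)))  # 21 digits just to store the offset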

But, if you do manage to create a practical way to bypass Shannon entropy, I'd
imagine it'll be headline news, along with netting you quite a lot of money.

~~~
iliis
If you're talking about Kolmogorov complexity here, you can't really
"circumvent" it. The Kolmogorov complexity of a string is basically defined
as the size of the shortest possible representation of said string. If you
manage to "bypass" the Kolmogorov limit, you actually just "lowered" it
(well, it always was by definition this low).

Your main point is totally valid, however. You cannot compress infinitely. A
simple proof: with any given N bits you can encode [at most] 2^N different
strings. As there exist more than 2^N strings, there must be one which cannot
be represented in N bits. As N can be arbitrarily large, you'll always find
strings which cannot be compressed below some chosen size. QED.
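
The count behind that proof, checked for a small N:

    N = 8
    strings_of_length_n = 2**N                          # 256
    shorter_descriptions = sum(2**k for k in range(N))  # 2**N - 1 = 255
    print(strings_of_length_n, shorter_descriptions)
    # 256 strings but only 255 shorter encodings, so at least one
    # length-N string cannot be compressed below N bits.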

------
annnnd
I get this:

    Offline Site: jvns.ca
    You have requested a site that is currently offline. This generally happens when a site is temporarily disabled for some reason, but has not been permanently removed...

~~~
josephb
Unexpected HN traffic has likely used up all the site owner's prepaid credit.

Edit: To clarify, NearlyFreeSpeech hosting is prepaid.

~~~
jvns
Yup, apparently the $2 I had left in my account didn't cover it. Back up now!

------
tempestn
This is extremely cool. For anyone curious what the various compression
levels of gzip do, the full explanation is here:
[http://www.gzip.org/algorithm.txt](http://www.gzip.org/algorithm.txt).
Basically, the higher the compression level, the more time it spends
searching for the longest possible matching string to point to.
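
You can see the size side of that trade-off with Python's zlib, which uses
the same DEFLATE algorithm as gzip (exact sizes will vary with the input):

    import zlib

    data = (b"Once upon a midnight dreary, "
            b"while I pondered, weak and weary, ") * 50
    for level in (1, 6, 9):  # 1 = fastest, 9 = best compression
        print(level, len(zlib.compress(data, level)))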

------
_quasimodo
As the author mentioned, poetry usually compresses quite well, thanks to
rhyming. Here is another fun example:

[http://codegolf.stackexchange.com/questions/6043/were-no-str...](http://codegolf.stackexchange.com/questions/6043/were-no-strangers-to-code-golf-you-know-the-rules-and-so-do-i)

------
JacksonGariety
Can anyone explain what is going on here?

~~~
jws
gzip works by constructing, on the fly, a dictionary of strings it has seen
before and sort of inserting them by reference.

When you see something red and in braces, that is a chunk that has occurred
before and been inserted by reference.

I'm guessing the decoding is happening at a constant symbol rate; you'll
notice it speeding up as it includes more and more text by reference instead
of a single character at a time.

------
pdknsk
What I noticed in the other text, Hamlet, is that HAMLET in the last sentence
has no pointer to Hamlet, obviously. This seems like an opportunity for
optimisation, for text at least.

Usually a word is either lowercase, capitalised, or uppercase. The more
complex and rare cases could be efficiently encoded bitwise (0 = keep, 1 =
swap case), so HAMLEt becomes 011110 plus a pointer to Hamlet.

I wonder if any compression algorithm does this. Probably not, because the
benefit is likely minimal relative to the significantly increased
de/compression time.
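
For what it's worth, the scheme is a few lines of Python (a sketch; the
function names are mine):

    def case_mask(word, variant):
        # 0 = keep the case of the dictionary word, 1 = swap it
        return "".join("0" if w == v else "1"
                       for w, v in zip(word, variant))

    def apply_mask(word, mask):
        return "".join(c.swapcase() if b == "1" else c
                       for c, b in zip(word, mask))

    mask = case_mask("Hamlet", "HAMLEt")
    print(mask)                        # 011110
    print(apply_mask("Hamlet", mask))  # HAMLEt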

~~~
cbhl
One problem is that this way of thinking is distinctly ASCII-centric -- what
happens when you have accents, umlauts, Chinese, and the rest of the stuff
that comes out of Unicode?

I suspect the encoding you propose would actually increase the size of the
average plaintext, simply because people will rarely type "Hamlet" and then
later type "HaMlEt" in the same document. I think a far more common case
would be "Hamlet" (beginning of sentence) and "hamlet" (inline), where you
can reuse the substring "amlet". (Either that, or they're consistent with the
usage of all-caps, like MICROSOFT(R).)

Besides, if you know you're dealing with English, you can save 12.5% simply
by using 7-bit ASCII instead of 8 bits per character.
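
That saving is easy to demonstrate with a toy bit-packer (my own sketch, not
any real format):

    def pack7(text):
        # Concatenate 7-bit ASCII codes, then pad to whole bytes.
        bits = "".join(format(ord(c), "07b") for c in text)
        bits += "0" * (-len(bits) % 8)
        return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

    s = "To be, or not to be" * 10
    print(len(s), len(pack7(s)))  # 190 characters pack into 167 bytes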

~~~
pdknsk
It seems simple enough to implement case-pairs for Unicode. Most likely it
already has been done. Either way, you're right. It was just a simple idea.

~~~
judk
[http://www.unicode.org/faq/casemap_charprop.html](http://www.unicode.org/faq/casemap_charprop.html)

Still provincial

------
tete
Julia programming Julia? Pretty cool too! That's how you recognize a real
hacker...

~~~
jvns
=D

Since a lot of the Julia core is written in Julia, it's also possible to have
Julia working on Julia in Julia.

------
aviksh
The deflation algorithm used by gzip is a variation of LZ77.

------
aortega
This is a demonstration of a simplified LZ77 algorithm, not Gzip.

Gzip is a unix utility, LZ77 is an algorithm, this distinction is not
pedantic.

This is what happens when you go to "hacker school" before regular CS school.

~~~
shawabawa3
> Gzip is a unix utility, LZ77 is an algorithm, this distinction is not
> pedantic.

It's basically the definition of pedantic.

Gzip uses the LZ77 algorithm, so it's a simplified demonstration of what gzip
does. It's also a simplified demonstration of LZ77.

~~~
jvns
The video title is in fact not quite accurate -- I chose to err in the
direction of "cool soundbyte", since "LZ77 + poetry = awesome" doesn't have
quite the same ring.

aortega is quite right that this video _actually_ explains LZ77, which is only
one part of how gzip compresses. gzip (or the DEFLATE algorithm) also uses
Huffman coding to compress the alphabet and pointers after LZ77 compression. I
explain a bit more about the phases at
[http://jvns.ca/blog/2013/10/23/day-15-how-gzip-works](http://jvns.ca/blog/2013/10/23/day-15-how-gzip-works)

And if you _really_ want to know how it works, you should read
[http://www.infinitepartitions.com/art001.html](http://www.infinitepartitions.com/art001.html),
which is the reference I used for writing my implementation of gunzip. It
includes C source code.
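
For a feel of that second phase, here is a minimal Huffman coder in Python
(a sketch only; DEFLATE actually uses a canonical-Huffman bit format):

    import heapq, itertools
    from collections import Counter

    def huffman_codes(text):
        tiebreak = itertools.count()  # avoids comparing dicts on ties
        heap = [(freq, next(tiebreak), {sym: ""})
                for sym, freq in Counter(text).items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            codes = {s: "0" + c for s, c in left.items()}
            codes.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (f1 + f2, next(tiebreak), codes))
        return heap[0][2]  # symbol -> bit string

    # frequent symbols (like the space) get shorter codes
    print(huffman_codes("once upon a midnight dreary"))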

The dig at hacker schoolers was unnecessary, though :)

~~~
aortega
Well, this is a much better description, thank you.

Actually, I like dumbed-down explanations as an introduction to something,
but I also believe they should include a description like this one at the
end, for the people who want to know more.

Sorry if the hacker school bashing sounded too harsh, but I believe a little
constructive criticism once in a while helps. Also, consider it part of
becoming a hacker :)

