Hacker News new | comments | show | ask | jobs | submit login
Gzip + poetry = awesome (jvns.ca)
370 points by jvns on Oct 25, 2013 | hide | past | web | favorite | 87 comments

Compression is truly fascinating. It's what got me into reading computer science papers several years ago, and then became one of my research topics.

What is shown here is the LZ77 [1] factorization of a string. Compression in general works by exploiting redundancy in the source, and natural language is highly redundant, since many words repeat often. Hence the factors in the factorization often look like frequent words or n-grams.

A recent line of research is grammar compression, which tries to turn a text into a tree of rules that generate that text. While still not very good at general-purpose compression, the generated tree are much more interesting than the LZ77 factorization, since they "discover" something that looks like a syntactic parsing of the string, finding syllabes, words, phrases, sentences...

In Craig Nevill-Manning's Ph.D. thesis [2] introduction there are several examples of the inferred grammar of pieces of text, music, fractals, source code, etc... While the algorithm presented there (Sequitur) is now kind of obsolete, the thesis is very interesting because it makes some considerations with a linguistic perspective.

[1] http://en.wikipedia.org/wiki/LZ77_and_LZ78

[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=

Wow, those are some interesting links.

As somebody who's learned a foreign language to near-fluency as an adult, I've spent a lot of time trying to figure out how my brain infers a grammar from large quantities of input. Subjectively, it feels like my brain learns fixed phrases, and then notices that certain chunks can be replaced by subtrees of the appropriate type.

There are complications: inflections, gender, agreement, probability, and so on. But still, grammar compression looks like a fascinating formulation of the problem.

Yes that's what algorithms like RePair and Sequitur do: they build the tree bottom-up by replacing frequent subtrees with a new rule at each iteration.

I think that this is also the basic idea behind the usage-based model in linguistics, a theory on how we learn languages:

    The usage-based model is based on inductive learning,
    meaning that linguistic knowledge is acquired in a
    bottom-up manner through use. It allows for redundancy 
    and generalizations, because the language user generalizes
    over recurring experiences of use.

Very interesting. I feel like the very concept of a grammar is only a model we cast on language to make it easier to deal with. Langage is pretty free and can work without it. Grammar feels to me like a normalization "by contract". We normalize, then study the normalization tool itself without being aware it is an artefact. We model the model, give it free parameters, randomness, serendipity to make it accomodate reality and its endless string of special cases. We even try to separate meaning from experience.

This all feels more like a network of relations and probabilities of coocurrence of various (free?) orders. I followed your link and I strongly feel the suffix "grammar" is the real burden here. We're trying to put structure on what is actually creating it.

Every language has grammar by definition. We're just trying to discover it.

It's like the laws of physics: We have laws which seem to work, but we know they're not the whole story, so we observe more of what happens and try to bring our laws into harmony with the actual underlying laws.

Wow, that PhD thesis is very interesting. I use grammars to generate music (and other things sometimes), but I would also really like to be able to infer them.

Since you say it's obsolete, do you have any links to papers or software related to successors of the Sequitur algorithm?

I do a bit of algorithmic composition as well (mostly microsound/granular synthesis) using NN and other graphical models (e.g. HMMs) for higher level compositional control... for grammar inference check out:


Abstract: We show that a recurrent, second-order neural network using a real-time, forward training algorithm readily learns to infer small regular grammars from positive and negative string training samples. We present simulations that show the effect of initial conditions, training set size and order, and neural network architecture...

That sounds perfect -- thanks!

You're welcome. You might like this as well (the PDF is just the front matter of the book):


Hmm, that link doesn't work for me (probably some kind of session id is needed). Can you post another url?

sorry for the late reply... just saw your post: http://www.springer.com/computer/ai/book/978-3-540-88008-0

Thanks! Looks like I have a lot of reading to do...

It depends on what you want to do. RePair [1] is currently the best grammar-based compressor, but it doesn't give good theoretical guarantees. In theory, instead, there is an O(log n) approximation to the smallest grammar problem [2], but AFAIK doesn't work very well in practice. In the same paper there is a good introduction to the problem and an analysis of the previous algorithms, including Sequitur.

[1] http://www.larsson.dogma.net/dcc99.pdf

[2] http://www.cs.virginia.edu/~shelat/papers/GrammarIEEE.pdf


I think RePair can be regarded as grammar-based (though they don't say so), but I don't think it produces a grammar which would be useful in any context other than reproducing the original sentence.

What I would really love (and maybe I'll just have to write it myself?) is a practical algorithm (not too worried about worst-case guarantees etc) which produces a semi-readable grammar from multiple example sentences, and can then be turned around to generate lots of new sentences.

The concept of grammar compression basically didn't exist at the time RePair was published, but it is by all means a grammar compressor (same goes for LZ78).

If you are interested in learning what a grammar should look like (which is very different from optimizing the number of rules) you may want to look into stochastic grammars http://en.wikipedia.org/wiki/Stochastic_grammar.

Well, yes, all my grammars are stochastic (else I'd be generating the same piece of music every time).

Audio is unnecessary. The video shows a slow-motion decoding of a gzipped version of the poem. The red text between brackets is a chunk of text that was earlier stored into the huffman tree (example "W{hile I }" means that the "hile I" was previously encoded; it occurred in the substring "while I pondered"). You can see the red chunks quickly occupy the larger volume of the poem, which visually highlights the symmetry in lyric that the computer is using to encode the file as gzip.

Pretty neat.

(Its not a Huffman tree. When you replace a fragment with a reference to previous content, its termed 'dictionary coding'. Huffman trees are what you use for `entropy coding`, which is how you compress the bits you didn't zap with the dictionary.)


Fixed, thanks :)

I appreciate your indulging my pedantic nature!

For some reason I find theories and approaches in compression really interesting. For those unfamiliar I recommend Blelloch's introduction to compression (http://www.cs.cmu.edu/~guyb/realworld/compression.pdf)

Another excellent source is Matt Mahoney's Data Compression Explained: http://mattmahoney.net/dc/dce.html

And yet another is David Mackay, who taught me at undergrad:


scroll to the bottom for the free PDF.

(the book is about all sorts of stuff, but compression is the main example used when defining information theory.)

Along similar lines, but a more focused book (no learning, etc) is the classic Cover and Thomas:


And let's not forget Marcus Hutter and Jurgen Schmidhuber, who use compression to explain general reasoning and human curiosity

http://prize.hutter1.net/ http://www.idsia.ch/~juergen/interest.html

Wow, this is a incredible visualization of how compression works. I never understood how it worked before, but the simple mentioning of pointers and then that video was all it took for me.

I've always wondered if this is true: If we approach infinite computational power, does the amount of information we need to represent data decrease? (Excuse any incorrect use of my terminology here.) I think about a number like pi that, as far as we know, has an infinite number of digits, and theoretically speaking every message is contained in that number. So if we just get the pointer of where that message is contained in the number and then calculate the number up to that pointer then we'll have ourselves the message. Hence, more computational power, less information needed.

There is a limit beyond which you can't compress data any further, called the Shannon limit.

Any sequence of bytes is just a number. So if you think of pi as an RNG, then the chances of finding a run of N digits equal to another number with N digits is (1/10) to the power of N, which quickly becomes intractible.

In reality the digits of pi are biased, so finding a particular number of N digits is even less likely.

It would be easier to search for that number by randomly seeding an RNG then searching the RNG output for the number. Then you could just store the seed+offset, which may be significantly less than the Shannon limit. But since the chance of encountering such a number is (1/256)^N, it quickly becomes impossible. And even if it weren't, the receiver would need to invoke the RNG [offset] times, which will be a massive number of times due to the probabilities involved. So it's not like you could precompute the index: the receiver still needs to compute the answer, which requires just as many computations as the sender.

In general, the closer you try to get to the Shannon limit, the more computation that is required. And perfect compression is impossible in practice except in constrained cases, so I'd speculate it requires infinite computational resources.

> There is a limit beyond which you can't compress data any further, called the Shannon limit.

I think you are confusing things a bit here. The Shannon limit has nothing to do with lossless compression, it refer to communication in noisy channels.

What you are referring to is the Shannon entropy, but even that is not actually a lower bound to compression of actual data, because it is defined for stochastic sources.

When you have an actual string and you want to compress it, the only meaningful lower bound is the Kolmogorov complexity of the string, that is the size of the smallest program that generates that string. It is of course incomputable, but it is what you would get if you had infinite computing power.

I appreciate you correcting my terminology, but beyond that you haven't actually disagreed with anything I've said.

Humans call it either the Shannon limit, or Shannon entropy, or Kolmogorov complexity of a string, or whatever other name humans have come up with for human names. The central point is there is some limit which you can't circumvent through trickery like "find it embedded within pi," regardless of what species you are, or what your species chooses to name fundamental limitations of nature.

A program to generate digits of pi is small, and pi embeds every possible string. But generating up to that offset of pi is impossible in practice. And it probably won't net you any savings anyway, since you'll still need to communicate the offset. The offset's length in digits is probably close to the length of the original string (after perfect compression), so it's unlikely you'd be able to circumvent Shannon entropy.

But, if you do manage to create a practical way to bypass Shannon entropy, I'd imagine it'll be headline news, along with netting you quite a lot of money.

If you're talking about Kolmogorov Complexity here, you can't really "circumvent" it. Kolmogorov Complexitiy of a string is basically defined as the size of the shortest possible representation of said string. If you manage to"bypass" the Kolmogorov Limit you actually just "lowered" it (well, it alwayws was by definition this low).

Your main point is totally valid however. You cannot compress infinitely. A simple proof: For any given N bits you can encode [at most] 2^N different strings. As there exist more than N strings there must be one which cannot be represented by N bits. As N can be arbitrarily large you'll always find strings which cannot be compressed lower than some chosen size. QED.

>In reality the digits of pi are biased, so finding a particular number of N digits is even less likely.

This is actually not known to be true. In fact, it's conjectured that pi is normal[1]--meaning that we should expect every sequence of digits of a given length to be as likely as any other, no matter what base we're using.

So if you know the position at which the string you're interested in appears, you can just 'compress' by giving the location in the binary expansion of pi at which your message appears, and the length of your message in bits.

Where this fails, of course, is that you're likely to have to search a long time to find your message in pi, and you're also probably unlikely to be able to express your starting offset in fewer bits than it would take to write your full message.

[1] http://en.wikipedia.org/wiki/Normal_number

> So if you know the position at which the string you're interested in appears, you can just 'compress' by giving the location in the binary expansion of pi at which your message appears, and the length of your message in bits.

It's a bit tongue in cheek, but this is exactly what piFS[1] does.

[1] https://github.com/philipl/pifs

I love his commitment to the joke.

That's beautiful.

So pi containing 3.141592653579 is equally likely as pi containing 1415926535797, which is equally likely to contain ABCDEFGHIJKL for every random ABCDEFGHIJKL? Fascinating. Completely impossible to take advantage of, too, but fascinating.

> Completely impossible to take advantage of

That depends on your goals, to do geeky things it's pretty cool :). I know my birth date (in the "ddmmyy" format) appears at the 262768th decimal of π.

By the way this example validates part of the answer of sillysaurus2 to mvleming: I need 6 digits to encode 6 other. Not that good a compression.

If pi is normal, and the two strings are of equal length, then yes... not only that, but if pi is normal it's likely that each appears infinitely often.

By the way, sorry to point this out, but after 535 pi actually goes 5358979...

"if pi is normal it's likely that each appears infinitely often."

Not likely, _certain_. If there is a sequence S that appears a finite number of times, there is a N such that as never appears after digit N. That, in turn, means that, in the limit of L => infinity, the probability that S appears in the first L digits is zero.

Oh. Right. Whoops. Thanks.

Hmm.. it would also mean that somewhere in pi there is a sequence of 1 trillion consecutive zeros.

If the conjecture of being normal is true, yes.

however, we can write a non-infinite computer program that generates arbitrary digits of pi, meaning we can compress it infinitely better than the Shannon Entropy of the actual baseN representation - because it is a special case (most data can't be defined by a few words of mathematical definition).

So if we just get the pointer of where that message is contained in the number and then calculate the number up to that pointer then we'll have ourselves the message.

As others have pointed out, the pointer would be longer than the original text in many cases.

It's very easy to express in layman's terms why perfect compression does not exist. If you had a compression algorithm that compressed every piece of data, it would introduce non-determinism. Say you had 4 bits (16 possibilities) and always compressed to fewer than 4 bits (8 possibilities), then one or more compressed datapoints would map to multiple uncompressed datapoints. Or in other words, you would lose information.

For a nice, format description, see Shannon's original paper:


There is a flaw in your random sequence approach; lets see if you can find it? Its a fun thinking exercise.

You can estimate the minimum bound on compression using Shannon's estimate http://en.wikipedia.org/wiki/Entropy_(information_theory)

An excellent problem for hobby coders is the Hutter Prize. Even if you don't win - and that'd be front page HN news if you did - its a really fun challenge and an excellent introduction to compression: http://prize.hutter1.net/

Also worth googling is Kolmogorov complexity.

Kolmogorov complexity captures your example. Kolmogorov complexity is the length of the shortest computer code that encodes a sequence. So the digits of Pi has a low computational complexity, because short programs exist that iteratively estimate PI. Random sequences have a high Kolmogorov complexity as their are no ways of generating the data.

Kolmogorov complexity measures the compressibility of a sequence. Its related to the information content, and is NP-Hard to compute!

The measure that takes into account the time to compute a sequence is the newer measure called "logical depth", but I only heard about that last week. The sequence of Ramsey numbers would have be very logically deep. A short program exists, but it takes forever to computer. The logical depth captures the amount of time you are saved by having the sequence precomputed.

It's impossible to have a lossless compression system that will always compress anything down to less size than it was originally, otherwise you could run it as many times as you want and compress everything down to 1 bit.

This includes "pointer to pi digits". What would probably happen there is the byte representation of the offset would be longer than the input string itself!

Although, now for curiousity, I'd like to actually try that "offset in pi" algorithm. :) Just for fun.

Not to spoil your weekend project, but someone's done it already :)


It's possible that for many messages, the pointer of the message into the representation of pi may actually be larger than the message itself.

It would be better to say that this is a visualization of how a certain kind of dictionary compression works.

The outer limits of data compression lie in using predictive models plus arithmetic coders. A visualization of how they work might look very different.

I get this:

    Offline Site: jvns.ca
    You have requested a site that is currently offline. This generally happens when a site is temporarily disabled for some reason, but has not been permanently removed...

Unexpected HN traffic is likely to have used all the site owners pre-paid credit.

Edit: To clarify, NearlyFreeSpeech hosting is prepaid.

Yup, apparently the $2 I had left in my account didn't cover it. Back up now!

This is extremely cool. For anyone curious what the various compression levels of gzip do, the full explanation is here: http://www.gzip.org/algorithm.txt. Basically the higher the compression level, the more times it spends searching for the longest possible matching string to point to.

As the author mentioned, poetry usually compresses quite well, thanks to rhyming. Here is another fun example:



To clear this up: This is exactly how gzip actually works. I wrote a version of gunzip from scratch and just inserted some print statements in the middle to get this visualization.

If anyone is interested, the part of the source code that produces the visualization is here: https://github.com/jvns/gzip.jl/blob/visualization/gzip.jl#L...

I ended up compiling Julia just so I could watch it in action myself: http://showterm.io/2b572805187a499194e91#fast


THIS IS COOL. I did not know about showterm.io.

You are quite right, I am an idiot.

Frankly, learning about LZW[0][1] and the Burrows-Wheeler Transform[2] was probably one of the highlights of my first undergrad algorithms course. The latter certainly made me feel like an idiot, too.

[0] http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welc...

[1] For the record, LZW isn't used in gzip; it was the patent-encumbered compression format used in GIF.

[2] http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transfo... (used in bzip2)

BWT is mindblowing. So many applications; e.g. it's used by pretty much all popular biological sequence alignment software to compress genomes so they fit in memory during the alignment :D

Can anyone explain what is going on here?

> Once upon a midnight dreary, while I {pon}ered

It's a visual demonstration how gzip works. The first line says it all. 'pon' matches the bytes in 'upon', the first first instance, thus it's pointer to those bytes instead of "duplicating" the pattern. As the text goes on, there are more matches, thus more pointers.

gzip works by constructing, on the fly, a dictionary of strings it has seen before and sort of inserting them by reference.

When you see something red and in braces, that is a bit that has occurred before and been inserted by reference.

I'm guessing that it is happening at a constant symbol rate, you'll notice it speeding up as it includes more and more text by reference instead of a single character at a time.

For those who don't feel like watching a youtube video that shows a recording of the animated command line output, simply click this link:


The author explains what he did to produce this output, and provides links to his program's source code in his post, so OP's link is still a worthwhile read, but that link cuts to the chase.

What I noticed in the other text, Hamlet, is that HAMLET in the last sentence has no pointer to Hamlet, obviously. This seems like an opportunity for optimisation, for text at least.

Usually a word is either lowercase, capitalised or uppercase. In more complex and rare cases this could be efficiently encoded bitwise (0 = keep, 1 = swap case), so HAMLEt becomes 011110 plus a pointer to Hamlet.

I wonder if any compression algorithm does this. Probably not, because the benefit is likely minimal at significantly increased de/compression time.

One problem is that way of thinking is distinctly ASCII-centric -- what happens when you have accents, umlats, Chinese, and the rest of the stuff that comes out of Unicode?

I suspect the encoding you propose would actually increase the size of the average plaintext, simply because people rarely will type "Hamlet" and then later type "HaMlEt" in the same document. I think a far more common case would be "Hamlet" (beginning of sentence) and "hamlet" (inline), where you can reuse the substring "amlet". (Either that, or they're consistent with the usage of all-caps, like MICROSOFT(R).)

Besides, if you know you're doing English, you can save 12.5% simply by using 7-bit ASCII instead of of 8-bit ASCII.

It seems simple enough to implement case-pairs for Unicode. Most likely it already is. Either way, you're right. It was just a simple idea.

Your idea is good, and many modern compression systems, like the PAQ system, exploit it [1].

[1] https://en.wikipedia.org/wiki/PAQ#Text_preprocessing

Julia programming Julia? Pretty cool too! That's how you recognize a real hacker...


Since a lot of the Julia core is written in Julia, it's also possible to have Julia working on Julia in Julia.

The deflation algorithm used by gzip is a variation of LZ77.


That's just great! Many thanks for the info.

chill out downvoters, I was being sarcy

This is a demonstration of a simplified LZ77 algorithm, not Gzip.

Gzip is a unix utility, LZ77 is an algorithm, this distinction is not pedantic.

This is what happens when you go to "hacker school" before regular CS school.

> Gzip is a unix utility, LZ77 is an algorithm, this distinction is not pedantic.

It's basically the definition of pedantic.

Gzip uses the LZ77 algorithm, so it's a simplified demonstration of what gzip does. It's also a simplified demonstration of LZ77.

The video title is in fact not quite accurate -- I chose to err in the direction of "cool soundbyte", since "LZ77 + poetry = awesome" doesn't have quite the same ring.

aortega is quite right that this video actually explains LZ77, which is only one part of how gzip compresses. gzip (or the DEFLATE algorithm) also uses Huffman coding to compress the alphabet and pointers after LZ77 compression. I explain a bit more about the phases at http://jvns.ca/blog/2013/10/23/day-15-how-gzip-works

And if you really want to know how it works, you should read http://www.infinitepartitions.com/art001.html, which is the reference I used for writing my implementation of gunzip. It includes C source code.

The dig at hacker schoolers was unnecessary, though :)

Well this is a much better description, thanks you.

Actually I like dumbed-down explanations as an introduction to something, but also I believe they should include a description like this one at the end, for the people that want to know more.

Sorry if the hacker school bashing sounded too harsh, but I believe that a little constructive criticism once in a while helps. Also, consider it a part of becoming a hacker :)

That last line was not necessary, there are plenty of people with CS degrees who wouldn't know the difference between different versions of LZW.

Yet the basic premise of the video is still wrong. Gzip uses many algorithms internally, among them LZ77 and huffman. It's like doing a video about "this is how mysql works" and showing the quicksort algorithm.

I agree with you. Gzip is merely a container format, just like AVI, MKV are for video/audio content. Gzip supports several compression encodings with Deflate being the the most common one I believe.

Who says you need to be CS?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact