What is shown here is the LZ77 factorization of a string. Compression in general works by exploiting redundancy in the source, and natural language is highly redundant, since many words repeat often. Hence the factors in the factorization often look like frequent words or n-grams.
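To make the idea concrete, here's a toy sketch (my own code, not the author's) of a greedy LZ77-style factorization: each factor is either a single new character or the longest prefix of the remainder that already occurred earlier in the text (ignoring the self-overlapping references real LZ77 allows).

```python
def lz77_factorize(s):
    """Greedy LZ77-style factorization: each factor is one fresh
    character, or the longest prefix of the rest of the string that
    already occurs in the processed part (naive, quadratic search)."""
    factors, i = [], 0
    while i < len(s):
        best = 0
        for length in range(1, len(s) - i + 1):
            # If a length-L match fails, no longer match can succeed,
            # since any occurring string's prefixes also occur.
            if s[i:i + length] in s[:i]:
                best = length
            else:
                break
        if best == 0:
            factors.append(s[i])
            i += 1
        else:
            factors.append(s[i:i + best])
            i += best
    return factors
```

For example, `lz77_factorize("abab")` gives `["a", "b", "ab"]`; on English text the longer factors tend to be exactly the frequent words and n-grams described above.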
A recent line of research is grammar compression, which tries to turn a text into a tree of rules that generate that text. While still not very good at general-purpose compression, the generated trees are much more interesting than the LZ77 factorization, since they "discover" something that looks like a syntactic parse of the string, finding syllables, words, phrases, sentences...
In the introduction of Craig Nevill-Manning's Ph.D. thesis there are several examples of inferred grammars for pieces of text, music, fractals, source code, etc... While the algorithm presented there (Sequitur) is now kind of obsolete, the thesis is very interesting because it offers some observations from a linguistic perspective.
As somebody who's learned a foreign language to near-fluency as an adult, I've spent a lot of time trying to figure out how my brain infers a grammar from large quantities of input. Subjectively, it feels like my brain learns fixed phrases, and then notices that certain chunks can be replaced by subtrees of the appropriate type.
There are complications: inflections, gender, agreement, probability, and so on. But still, grammar compression looks like a fascinating formulation of the problem.
I think that this is also the basic idea behind the usage-based model in linguistics, a theory on how we learn languages:
The usage-based model is based on inductive learning, meaning that linguistic knowledge is acquired in a bottom-up manner through use. It allows for redundancy and generalizations, because the language user generalizes over recurring experiences of use.
This all feels more like a network of relations and probabilities of co-occurrence of various (free?) orders. I followed your link and I strongly feel the suffix "grammar" is the real burden here. We're trying to put structure on what is actually creating it.
It's like the laws of physics: We have laws which seem to work, but we know they're not the whole story, so we observe more of what happens and try to bring our laws into harmony with the actual underlying laws.
Since you say it's obsolete, do you have any links to papers or software related to successors of the Sequitur algorithm?
Abstract: We show that a recurrent, second-order neural network using a real-time, forward training algorithm readily learns to infer small regular grammars from positive and negative string training samples. We present simulations that show the effect of initial conditions, training set size and order, and neural network architecture...
I think RePair can be regarded as grammar-based (though they don't say so), but I don't think it produces a grammar which would be useful in any context other than reproducing the original sentence.
What I would really love (and maybe I'll just have to write it myself?) is a practical algorithm (not too worried about worst-case guarantees etc) which produces a semi-readable grammar from multiple example sentences, and can then be turned around to generate lots of new sentences.
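For reference, the core of RePair mentioned above fits in a few lines. This is my own simplified sketch (the real algorithm uses clever data structures to run in linear time): repeatedly replace the most frequent adjacent pair of symbols with a fresh nonterminal, recording a grammar rule each time.

```python
from collections import Counter

def repair(seq):
    """Simplified RePair: replace the most frequent adjacent pair
    with a new nonterminal until no pair occurs twice. Returns the
    final sequence and the grammar rules produced along the way."""
    seq = list(seq)
    rules, next_sym = {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        nt = f"R{next_sym}"
        next_sym += 1
        rules[nt] = pair
        # Rewrite the sequence, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules
```

For example, `repair("abab")` compresses the sequence to `["R0", "R0"]` with the rule `R0 -> (a, b)`; the hierarchy of rules is the "grammar", though as noted it mostly just reproduces the input.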
If you are interested in learning what a grammar should look like (which is very different from optimizing the number of rules) you may want to look into stochastic grammars http://en.wikipedia.org/wiki/Stochastic_grammar.
scroll to the bottom for the free PDF.
(the book is about all sorts of stuff, but compression is the main example used when defining information theory.)
I've always wondered if this is true: if we approach infinite computational power, does the amount of information we need to represent data decrease? (Excuse any incorrect use of my terminology here.) I think about a number like pi that, as far as we know, has an infinite number of digits, and theoretically speaking every message is contained in that number. So if we just get the pointer to where that message is contained in the number, and then calculate the number up to that pointer, we'll have ourselves the message. Hence, more computational power, less information needed.
Any sequence of bytes is just a number. So if you think of pi as an RNG, then the chances of finding a run of N digits equal to another number with N digits is (1/10) to the power of N, which quickly becomes intractable.
In reality the digits of pi are biased, so finding a particular number of N digits is even less likely.
It would be easier to search for that number by randomly seeding an RNG and then searching the RNG output for the number. Then you could just store the seed+offset, which may be significantly less than the Shannon limit. But since the chance of encountering such a number is (1/256)^N, it quickly becomes impossible. And even if it weren't, the receiver would need to invoke the RNG [offset] times, which will be a massive number of times due to the probabilities involved. So it's not like you could precompute the index: the receiver still needs to compute the answer, which requires just as many computations as the sender.
In general, the closer you try to get to the Shannon limit, the more computation that is required. And perfect compression is impossible in practice except in constrained cases, so I'd speculate it requires infinite computational resources.
I think you are confusing things a bit here. The Shannon limit has nothing to do with lossless compression; it refers to communication over noisy channels.
What you are referring to is the Shannon entropy, but even that is not actually a lower bound to compression of actual data, because it is defined for stochastic sources.
When you have an actual string and you want to compress it, the only meaningful lower bound is the Kolmogorov complexity of the string, that is the size of the smallest program that generates that string. It is of course incomputable, but it is what you would get if you had infinite computing power.
Humans call it either the Shannon limit, or Shannon entropy, or Kolmogorov complexity of a string, or whatever other name humans have come up with for human names. The central point is there is some limit which you can't circumvent through trickery like "find it embedded within pi," regardless of what species you are, or what your species chooses to name fundamental limitations of nature.
A program to generate digits of pi is small, and pi embeds every possible string. But generating up to that offset of pi is impossible in practice. And it probably won't net you any savings anyway, since you'll still need to communicate the offset. The offset's length in digits is probably close to the length of the original string (after perfect compression), so it's unlikely you'd be able to circumvent Shannon entropy.
But, if you do manage to create a practical way to bypass Shannon entropy, I'd imagine it'll be headline news, along with netting you quite a lot of money.
Your main point is totally valid however. You cannot compress infinitely. A simple proof: for any given N bits you can encode at most 2^N different strings. As there exist more than 2^N strings, there must be one which cannot be represented by N bits. As N can be arbitrarily large, you'll always find strings which cannot be compressed below some chosen size. QED.
This is actually not known to be true. In fact, it's conjectured that pi is normal--meaning that we should expect every sequence of digits of a given length to be as likely as any other, no matter what base we're using.
So if you know the position at which the string you're interested in appears, you can just 'compress' by giving the location in the binary expansion of pi at which your message appears, and the length of your message in bits.
Where this fails, of course, is that you're likely to have to search a long time to find your message in pi, and you're also probably unlikely to be able to express your starting offset in fewer bits than it would take to write your full message.
It's a bit tongue in cheek, but this is exactly what piFS does.
That depends on your goals, to do geeky things it's pretty cool :). I know my birth date (in the "ddmmyy" format) appears at the 262768th decimal of π.
By the way this example validates part of sillysaurus2's answer to mvleming: I need 6 digits to encode 6 others. Not that good a compression.
By the way, sorry to point this out, but after 535 pi actually goes 5358979...
Not likely, _certain_. If there is a sequence S that appears only a finite number of times, there is an N such that S never appears after digit N. That, in turn, means that, in the limit of L => infinity, the probability that S appears in the first L digits is zero.
As others have pointed out, the pointer would be longer than the original text in many cases.
It's very easy to express in layman's terms why perfect compression does not exist. If you had a compression algorithm that compressed every piece of data, it would introduce non-determinism. Say you had 4 bits (16 possibilities) and always compressed to fewer than 4 bits (at most 1 + 2 + 4 + 8 = 15 possibilities, counting the empty string); then one or more compressed datapoints would map to multiple uncompressed datapoints. Or in other words, you would lose information.
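The counting argument is easy to check by brute force; a small sketch:

```python
from itertools import product

# All 4-bit strings: 2^4 = 16 of them.
inputs = ["".join(bits) for bits in product("01", repeat=4)]

# All strictly shorter bit strings, including the empty one:
# 1 + 2 + 4 + 8 = 15 of them.
outputs = ["".join(bits) for n in range(4) for bits in product("01", repeat=n)]

# By the pigeonhole principle, no injective (i.e. lossless) map from
# inputs to outputs can exist: some output must decode ambiguously.
assert len(inputs) == 16 and len(outputs) == 15
```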
For a nice, formal description, see Shannon's original paper:
You can estimate a lower bound on compression using Shannon's entropy: http://en.wikipedia.org/wiki/Entropy_(information_theory)
An excellent problem for hobby coders is the Hutter Prize. Even if you don't win - and that'd be front page HN news if you did - it's a really fun challenge and an excellent introduction to compression: http://prize.hutter1.net/
Also worth googling is Kolmogorov complexity.
Kolmogorov complexity measures the compressibility of a sequence. It's related to the information content, and it isn't just hard to compute: it's uncomputable!
The measure that takes into account the time to compute a sequence is a newer one called "logical depth", but I only heard about that last week. The sequence of Ramsey numbers would be very logically deep: a short program exists, but it takes forever to compute. Logical depth captures the amount of time you are saved by having the sequence precomputed.
This includes "pointer to pi digits". What would probably happen there is the byte representation of the offset would be longer than the input string itself!
Although, now for curiosity, I'd like to actually try that "offset in pi" algorithm. :) Just for fun.
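If you want to try it, here's a small pure-Python sketch using Gibbons' unbounded spigot algorithm for the digits of pi, searching only a bounded prefix (it gets slow fast, which is rather the point of the thread):

```python
def pi_digits():
    """Gibbons' unbounded spigot algorithm: yields 3, 1, 4, 1, 5, ..."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

def find_in_pi(target, limit=10_000):
    """Return the 0-based offset of the digit string `target` in pi
    (counting the leading 3 as digit 0), or None if not found within
    the first `limit` digits."""
    digits, gen = "", pi_digits()
    for i in range(limit):
        digits += str(next(gen))
        if digits.endswith(target):
            return i - len(target) + 1
    return None
```

For instance, `find_in_pi("5897")` returns 10, since pi starts 3.1415926535 8979...; as discussed above, the offset you'd have to store is typically no shorter than the message itself.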
The outer limits of data compression lie in using predictive models plus arithmetic coders. A visualization of how they work might look very different.
Offline Site: jvns.ca
You have requested a site that is currently offline. This generally happens when a site is temporarily disabled for some reason, but has not been permanently removed...
Edit: To clarify, NearlyFreeSpeech hosting is prepaid.
 For the record, LZW isn't used in gzip; it was the patent-encumbered compression format used in GIF.
 http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transfo... (used in bzip2)
It's a visual demonstration of how gzip works. The first line says it all: 'pon' matches the bytes in 'upon', the first instance, so it's a pointer to those bytes instead of a "duplicate" of the pattern. As the text goes on, there are more matches, thus more pointers.
When you see something red and in braces, that is a bit that has occurred before and been inserted by reference.
I'm guessing that it is happening at a constant symbol rate; you'll notice it speeding up as it includes more and more text by reference instead of a single character at a time.
The author explains what he did to produce this output, and provides links to his program's source code in his post, so OP's link is still a worthwhile read, but that link cuts to the chase.
Usually a word is either lowercase, capitalised or uppercase. In more complex and rare cases this could be efficiently encoded bitwise (0 = keep, 1 = swap case), so HAMLEt becomes 011110 plus a pointer to Hamlet.
I wonder if any compression algorithm does this. Probably not, because the benefit is likely minimal at significantly increased de/compression time.
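For what it's worth, the scheme is easy to sketch. This is hypothetical toy code, one bit per letter, 0 = keep, 1 = swap case:

```python
def case_mask(reference, word):
    """Bit string encoding `word` relative to `reference`:
    0 = keep the reference letter's case, 1 = swap it."""
    assert len(reference) == len(word)
    return "".join("0" if r == w else "1" for r, w in zip(reference, word))

def apply_mask(reference, mask):
    """Reconstruct the word from the reference plus the bit mask."""
    return "".join(c.swapcase() if b == "1" else c
                   for c, b in zip(reference, mask))
```

So `case_mask("Hamlet", "HAMLEt")` is `"011110"`, matching the example above, and `apply_mask` inverts it; the question of whether the 6 extra bits ever pay for themselves is exactly the objection raised below.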
I suspect the encoding you propose would actually increase the size of the average plaintext, simply because people rarely will type "Hamlet" and then later type "HaMlEt" in the same document. I think a far more common case would be "Hamlet" (beginning of sentence) and "hamlet" (inline), where you can reuse the substring "amlet". (Either that, or they're consistent with the usage of all-caps, like MICROSOFT(R).)
Besides, if you know you're doing English, you can save 12.5% simply by using 7-bit ASCII instead of 8-bit ASCII.
Since a lot of the Julia core is written in Julia, it's also possible to have Julia working on Julia in Julia.
Gzip is a Unix utility; LZ77 is an algorithm. This distinction is not pedantic.
This is what happens when you go to "hacker school" before regular CS school.
It's basically the definition of pedantic.
Gzip uses the LZ77 algorithm, so it's a simplified demonstration of what gzip does. It's also a simplified demonstration of LZ77.
aortega is quite right that this video actually explains LZ77, which is only one part of how gzip compresses. gzip (or the DEFLATE algorithm) also uses Huffman coding to compress the alphabet and pointers after LZ77 compression. I explain a bit more about the phases at http://jvns.ca/blog/2013/10/23/day-15-how-gzip-works
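You can watch both stages pay off with Python's zlib module, which implements DEFLATE: repetitive text (which the LZ77 stage loves) shrinks dramatically, while random bytes barely compress at all.

```python
import os
import zlib

repetitive = b"once upon a time " * 100   # 1700 bytes of pure repeats
random_bytes = os.urandom(1700)           # 1700 bytes of noise

small = zlib.compress(repetitive, 9)      # level 9 = max compression
big = zlib.compress(random_bytes, 9)

# Back-references plus Huffman coding crush the repeats...
assert len(small) < len(repetitive) // 10
# ...but can do essentially nothing with random data.
assert len(big) > len(random_bytes) * 0.9
```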
And if you really want to know how it works, you should read http://www.infinitepartitions.com/art001.html, which is the reference I used for writing my implementation of gunzip. It includes C source code.
The dig at hacker schoolers was unnecessary, though :)
Actually I like dumbed-down explanations as an introduction to something, but I also believe they should include a description like this one at the end, for the people that want to know more.
Sorry if the hacker school bashing sounded too harsh, but I believe that a little constructive criticism once in a while helps. Also, consider it a part of becoming a hacker :)