Yes, compression is definitely a rabbit hole - essentially an infinite rabbit hole. Optimal compression requires perfect understanding of the underlying material. So to compress English text well, you need to understand English very well - its grammar, morphology, semantics, etc.
I crawled very deep into this particular rabbit hole. I built text compressors that exploited linguistic structure. I was able to quantify improvements in my understanding of text by observing improvements in the compression rate. The beautiful aspect of this project is its rigor: lossless compression is such a demanding challenge that you cannot possibly deceive yourself. If you have a bug in your code, you will either decode the wrong result or fail to get a codelength reduction. Using this methodology I was able to build a statistical parser without using any labeled training data.
Unfortunately for me, the advent of GPT and other DNN-based LLMs made this research obsolete. It may still be interesting for humans to build theories of syntax and grammar, but GPT has this information encoded implicitly in its neural weights in a much more sophisticated way than linguistic theories can achieve.
Let's say your text is "I like red cats" and the model has seen "I like red" so far. Say the model thinks "cats" is 25% likely as the next token, and "dogs" is 50% likely. So rather than the 1 bit (= -log2 0.5) it would spend to encode "dogs", the coder spends 2 bits (= -log2 0.25) to store "cats" in the encoded output. Totally lossless. At decode time you're reading the code, so the decoder just picks the "cats" token anyway, even though it's not the likeliest: it's not randomly sampling here, it's just using the underlying probabilistic model to do entropy coding. Nothing is lossy, since we're not letting the model/sampling process freely choose what the text is made of.
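A rough sketch of that arithmetic in Python (the 25%/50% numbers are just the made-up example above):

```python
import math

# Hypothetical next-token distribution from the example above.
probs = {"dogs": 0.50, "cats": 0.25}

# An ideal entropy coder spends -log2(p) bits on a token of probability p.
for token, p in probs.items():
    print(f"{token!r}: {-math.log2(p):.1f} bits")
# 'dogs': 1.0 bits
# 'cats': 2.0 bits
```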
Arithmetic coding is one of the classic information theory methods that produce essentially optimal results. As everywhere else in information theory, the probability distribution is central: given one, arithmetic coding gives you information-theoretically optimal lossless compression (to the degree that the distribution describes the data well), by treating the message as a number and, intuitively, as the other post says, only keeping diffs.
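Here is a minimal, illustrative sketch of the interval-narrowing idea, using exact fractions and a made-up three-symbol distribution. A real coder emits bits incrementally with renormalized integer arithmetic instead of keeping one huge fraction, but the principle is the same:

```python
from fractions import Fraction

# Toy model: fixed symbol probabilities (an assumption for illustration).
model = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

def cumulative(model):
    # Map each symbol to its (low, high) slice of [0, 1).
    ranges, low = {}, Fraction(0)
    for sym, p in model.items():
        ranges[sym] = (low, low + p)
        low += p
    return ranges

def encode(symbols, model):
    # Narrow [low, high) once per symbol; any number in the final
    # interval identifies the whole message.
    ranges = cumulative(model)
    low, high = Fraction(0), Fraction(1)
    for sym in symbols:
        span = high - low
        s_low, s_high = ranges[sym]
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2  # any value in [low, high) works

def decode(code, n, model):
    ranges = cumulative(model)
    out, low, high = [], Fraction(0), Fraction(1)
    for _ in range(n):
        span = high - low
        for sym, (s_low, s_high) in ranges.items():
            if low + span * s_low <= code < low + span * s_high:
                out.append(sym)
                low, high = low + span * s_low, low + span * s_high
                break
    return out

msg = list("abacab")
assert decode(encode(msg, model), len(msg), model) == msg
```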
By running the decoder during encoding. If the decoder already produces the right output, which it will most of the time, then you don’t need to store anything in the file. If it produces the wrong output, store a correction, then continue from there. The final file only needs to include the corrections.
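A rough sketch of that scheme, where predict_next is a hypothetical stand-in for whatever deterministic predictor both sides share:

```python
def predict_next(prefix):
    """Hypothetical deterministic predictor shared by encoder and decoder."""
    raise NotImplementedError

def encode(tokens):
    corrections = []  # (position, actual_token) pairs where the prediction was wrong
    for i in range(1, len(tokens)):
        if predict_next(tokens[:i]) != tokens[i]:
            corrections.append((i, tokens[i]))
    return tokens[0], len(tokens), corrections

def decode(first, length, corrections):
    fixes = dict(corrections)
    out = [first]
    for i in range(1, length):
        # Use the stored correction if there is one, otherwise trust the predictor.
        out.append(fixes.get(i, predict_next(out)))
    return out
```

The better the predictor, the fewer corrections end up in the file.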
In those datasets you could improve compression by adding a priori knowledge about their context in the real world - if the compression program knows about historical S&P500 prices, it will be pretty good at compressing all historical stock info. Kolmogorov complexity is indistinguishable from AI.
They are not random (e.g. weather data should have a daily or seasonal pattern). And the fact that random-looking data exists doesn't refute the original claim, because the AI can instead conclude that the data has close-to-maximal entropy (i.e. is "incompressible").
Being able to compress well is closely related to intelligence as explained below. While intelligence is a slippery concept, file sizes are hard numbers. Wikipedia is an extensive snapshot of Human Knowledge. If you can compress the first 1GB of Wikipedia better than your predecessors, your (de)compressor likely has to be smart(er). The intention of this prize is to encourage development of intelligent compressors/programs as a path to AGI.
The Task:
Losslessly compress the 1GB file enwik9 to less than 114MB. More precisely:
- Create a Linux or Windows compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that S:=S1+S2 < L := 114'156'155 (previous record).
- If run, archive.exe produces (without input from other sources) a 10^9 byte file that is identical to enwik9.
- If we can verify your claim, you are eligible for a prize of 500'000€×(1-S/L). Minimum claim is 5'000€ (1% improvement).
- Restrictions: Must run in ≲50 hours using a single CPU core and <10GB RAM and <100GB HDD on our test machine.
> compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that S:=S1+S2 < L
That criterion is rather more complicated than just taking the size S2 of archive.exe
There is no logical meaning to the sum of the size of a compressor and its output that I can see.
Disregarding the compressor size would make this contest easier to understand as simply trying to determine the Kolmogorov Complexity (i.e. information content) of enwik9.
I looked for the motivation of including compressor size and only found this in the FAQ:
> By just measuring L(D)+L(A), one can freely hand-craft large word tables (or other structures) used by C and D, and place them in C and either D or A. By counting both, L(C) and L(D), such tables become 2-3 times more expensive, and hence discourages them.
Discouraging word tables seems like a weak and somewhat arbitrary justification for complicating the measure of merit. I don't think the contest would be any less interesting if the nature of the compression would be disregarded. One could even argue that compression of Wikipedia taking way more resources than decompression is justified by having to perform it only once, while its result could be decompressed millions of times.
I can't follow. If you disregarded the size of the decompressor, you could encode as much data as you want in there, e.g. assign a number to each article and include the corresponding text in the decompressor, or assign a number to each sentence occurring in Wikipedia, or just dump the 1GB of data in there verbatim, etc.
That would make the size of the compressed file meaningless.
Disregarding the time complexity of the compression is interesting, even ignoring the space complexity during compression or the size of the compressor.
But the size of the decompressor can't be ignored when trying to "measure" the Kolmogorov complexity of the source data.
> There is no logical meaning to the sum of the size of a compressor and its output that I can see.
There is if you have a time limit. Otherwise you could spend a week computing a better result and embed that directly into the compressor - for use only when you're compressing that Wikipedia file.
And it doesn't matter if all of Wikipedia is embedded in the compressor — if you can reconstruct the original using only the decompressor and the compressed data, the process can be as expensive as you like.
That's the exciting part.
There's also a vague connection to what's exciting about LLMs, namely the amount of information they can reproduce relative to the comparatively small size of their weights. But since decompression must be lossless, it's unclear to me if this approach could really help here.
> And it doesn't matter if all of Wikipedia is embedded in the compressor
That's not what I said. I said you can embed a better result. As in, make an algorithm that performs ok in general case, but bake in a super-optimised result that specifically applies to Wikipedia, even if it takes months to precompute.
> Why is a time limit needed?
Would you like to judge my compressor against other ones? It will take 2000 years to get the result, but it's totally worth it, I promise. It's a random search over binaries that produce the Wikipedia file, and it will stop once I'm over the winning threshold.
Thanks for replying. I see a bit more of your point now I guess.
I still tend to consider the needed complexity to produce that binary as irrelevant, because the challenge is not interested in practical compression, only in space complexity and the relation between information entropy and intelligence, no?
If such a highly specialized binary could theoretically be produced for other corpora of the same size as well, that would be very interesting! Regardless of how practical their computation is.
And if it's not possible for other corpora, the complexity has to be somehow embedded in the decompressor, not the compressor, or am I missing something? The only alternative would be that the compression exploits some data inherent to the machine architecture that is not part of the executable. Still, this only shifts the problem around.
The idea reminds me of an optimized ordering of Gödel numbers.
The pigeonhole principle limits what a general lossless compressor can do, but we don't know how the minimum size of a compressed file scales with the output size, because this problem is related to the halting problem. Even if we assume infinite compute, we can't know the exact limits for calculating a function c(n) that specifies the minimum length of a program that computes an arbitrary output of size n. If I'm not totally ignorant and lacking information here.
> since decompression must be lossless, it's unclear to me if this approach could really help here.
All lossy compression can be made lossless by storing a residual. A modern language transformer is well suited to compression because they output a ranked list of probabilities for every token, which is straightforward to feed into an entropy encoder.[0]
On the other hand, LLMs expose the flaw in the Hutter prize - Wikipedia is just too damn small, and the ratio of "relevant knowledge" to "incidental syntactic noise" too poor, for the overhead of putting "all human knowledge" in the encoder to be worth it. A very simple language model can achieve very good compression already, predicting the next word almost as well as a human can in absolute terms. The difference between "phone autocomplete" and "plausible AGI" is far into the long tail of text prediction accuracy.
Probably a state of the art Hutter prize winning strategy would simply be to train a language model on Wikipedia only, with some carefully optimal selection of parameter count. The resulting model would be useless for anything except compressing Wikipedia though, due to overfit.
[0] A practical difficulty is that large language models often struggle with determinism even disregarding the random sampling used for ChatGPT etc, when large numbers of floating point calculations are sharded across a GPU and performed in an arbitrary order. But this can easily be overcome at the expense of efficiency.
> On the other hand, LLMs expose the flaw in the Hutter prize
I'm not sure why it would be a flaw — isn't this more a sign of how universally interesting the rules are?
Out of curiosity (I didn't dive deeply into the previous winners' implementations): are there NNs (+ an error correction matrix, which is why I mentioned the FAQ in the other comment) among them?
Regurgitating something verbatim seems less like intelligence vs summarizing the important details. Additionally, a human can even choose the requirements for lossy compression in the first place - do I want a 1 sentence summary of what I read, a 1 paragraph summary, a synopsis, a description of the themes, how the themes relate to other works by the author/creator, etc etc.
So I fail to see the fundamental connection between intelligence and lossless compression unless you just choose to define intelligence in some way that connects it to lossless compression.
I once had to run a program hundreds of times and capture all the stack traces to help the author catch a stochastic bug. I zipped up all the stack traces and sent the resulting file to the author. The compressed file was only a few dozen MB, and the uncompressed stack traces all together were only a few hundred MB... on my ZFS filesystem with compression enabled. When the author unzipped the file on their end, they were surprised to find around 10 GB of very repetitive stack traces.
My favorite uncompressed file size surprise comes from the world of Dreamcast game piracy.
Pirates would publish game disc images in a ZIP file that was, say, 100MB. When uncompressed, the ISO file would be 750MB. The reason for the huge compression ratio was that the ISO contained a 600MB empty file, which compressed down to almost nothing.
The purpose was so that when the image was burned to a CD-R, the game data would be burned on the outer part of the physical media and the empty file on the inner. The goal was to speed up read times when playing a game.
I think the author should study probability and information theory if he's serious about compression. It is mathematically deep, and goes into interesting tangents like information geometry and quantum information theory.
Brotli itself has a pre-filled dictionary, which was designed specifically for whatever was most common for internet traffic at the time. So lots of stuff to make XML headers tiny, along with a lot of common English words.
Nice curiosity-driven exploration! I share the positive sentiment towards such project-based learning as well. It was a good read James, thank you. One thing you might consider including is more quantitative data on the compression progress for the methods you applied.
I'm interested in how the future exploration of the methods you suggested at the end would work out, especially treating text compression like image compression.
> got the number 1024, I had a problem. Space could be saved if coffee was 1.
Sounds like OP is encoding numbers as ASCII digits? Seems like there should be space to be saved by encoding actual numeric bytes instead, although of course since you will need more than 256 numbers, you'll have to have an encoding for variable-width numbers similar to what UTF-8 does. Or if you don't need more than 65K, maybe just two-byte-wide numbers would still save you significant space over ASCII digits (where anything over 9 is already two bytes!).
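For illustration, a small sketch of the UTF-8-style variable-width idea (7 payload bits per byte, high bit meaning "more bytes follow"):

```python
def encode_varint(n: int) -> bytes:
    # Little-endian base-128: low 7 bits per byte, high bit = continuation flag.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            return n

# "1024" costs 4 bytes as ASCII digits but only 2 as a varint.
assert len(encode_varint(1024)) == 2
assert decode_varint(encode_varint(1024)) == 1024
```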
I don't entirely understand the exercise. Like, are the size of the compressing software and its data (such as the dictionary) taken into account too? And it seems unlikely this kind of naive reinventing-the-wheel approach is going to beat, say, gzip in the first place (or am I really wrong there?).
But it is fun and educational to invent new DIY compressions!
So has anyone tried lossless compression with an LLM?
Method: create a list of next-token predictions, ordered by descending probability, and encode the actual token as its position in the ordered list (so, e.g., the most likely token is encoded as 0, the second most likely as 1, etc).
If the LLM is good, this should provide an excellent encoding.
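A sketch of that method, with ranked_predictions as a hypothetical function returning the model's candidate next tokens sorted by descending probability (assumed to cover the whole vocabulary); the encoder and decoder must run the model identically:

```python
def ranked_predictions(prefix):
    """Hypothetical: the LLM's next-token candidates, most probable first."""
    raise NotImplementedError

def encode_ranks(tokens):
    # Replace each token by its rank in the model's prediction list.
    return [ranked_predictions(tokens[:i]).index(tok)
            for i, tok in enumerate(tokens)]

def decode_ranks(ranks):
    tokens = []
    for r in ranks:
        tokens.append(ranked_predictions(tokens)[r])
    return tokens
```

A good model makes most ranks zero or close to it, so the rank stream is heavily skewed and compresses well when fed to an entropy coder.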
> If the LLM is good, this should provide an excellent encoding.
For the purposes of compression "contests" like this one, the size of the model would count against your compressed data size. You're going to have a hard time fitting a useful model into <100 MB.
The thing is, we typically have three parts of a compression scheme: the data, the model (dictionary, Huffman tree, etc), and the program. The model is typically learned from the data at compression time.
The program itself is general, however, and as far as I know, doesn't count against the total. Ie, the size of the gzip binary isn't part of the total for the purposes of the contest.
So if an LLM is genuinely useful without fine tuning on the target data, you should make it part of the program.
The data-specific model could be a self-prompt produced from reading and summarizing the data, which would help with the initial context-free predictions...
{Edit} Ha, looking at the competition, it does indeed include the full size of the gzip binary in the metric. Weird.
> {Edit} Ha, looking at the competition, it does indeed include the full size of the gzip binary in the metric. Weird.
The competition is specific to one corpus of test data; it doesn't include testing on other data. If the size of the program weren't included, you could hard-code all of the test data into it and claim victory with an infinite compression ratio.
Fun fact: the idea of using an AI language model as compression has existed since at least 1988. It was mentioned in passing by Vernor Vinge in his short story The Blabber.
Of course it can. A typical structure for a lossless compressor is a predictor feeding into an entropy encoder. Your predictor tells you how likely each next token is, and the entropy encoder uses that information to encode more likely tokens with fewer bits. As long as the compressor and decompressor make the same predictions, you’ll be able to reconstruct the original data exactly. And if the predictions are good, you’ll compress the data well.
An LLM (specifically the type people usually mean, like ChatGPT) uses a big transformer network to assign probabilities* to all possible next tokens. To generate text, you randomly pick from the most likely next tokens, then repeat. But you can just leave out that second part and use the transformer as your predictor for compression.
* Actually the log of the probability, and in modern, RLHF-tuned models they don't quite represent probabilities anymore, but anyway.
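To make that concrete, here is a hedged sketch of the compressor side, with next_token_logits as a hypothetical stand-in for the transformer (returning a mapping from token to raw score). You take the softmax probability of the token that actually occurs - no sampling - and charge the entropy coder -log2(p) bits for it:

```python
import math

def next_token_logits(prefix):
    """Hypothetical: the transformer's raw scores for every possible next token."""
    raise NotImplementedError

def codelength_bits(tokens):
    # Estimate of the compressed size an ideal entropy coder would achieve
    # (the first token would need a separate convention; ignored here).
    total = 0.0
    for i in range(1, len(tokens)):
        logits = next_token_logits(tokens[:i])
        # Softmax turns logits into a probability distribution.
        m = max(logits.values())
        z = sum(math.exp(v - m) for v in logits.values())
        p = math.exp(logits[tokens[i]] - m) / z
        # Charge for the token that actually occurs, not a sampled one.
        total += -math.log2(p)
    return total
```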
How much research has been done on lossy text compression? As in, would it be possible to rewrite text with fewer words that conveys the same meaning? I'd imagine AI could do it.
This is one of the most common reactions, and the answer is, there's no real difference: lossy compression is the same thing as lossless. You simply store some extra bits to correct the errors or lost data (typically in an arithmetic encoding framework where you can plug in arbitrary predictors or combinations of predictors), and now it's back to the lossless setting. So all real compression algorithms are effectively 'lossy' already, and making the distinction gains you nothing. You instead spend your time thinking about how to predict better at all.