
Especially compression algorithms that use arithmetic coding, with interval weights adjusted based on a prediction of what is likely coming next, are very similar. They adjust the arithmetic coding (https://en.wikipedia.org/wiki/Arithmetic_coding) based on the context of the byte/bit to be predicted, so the more accurate the predicted continuation, the more efficient the encoding. The task is very similar to that of transformers like GPT. A perfect prediction has almost no additional storage cost, because the arithmetic interval barely shrinks and thus almost no bits get stored - but either way, you have to count the size of the decompressor to get a fair benchmark.
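A rough sketch of that coupling (my own illustration, not from the comment): an arithmetic coder charges roughly -log2(p) bits for a symbol the predictor assigned probability p, so a near-perfect prediction (p close to 1) costs close to zero bits. Here the predictor is just an adaptive order-0 byte model standing in for whatever context model (or LLM) you'd actually plug in:

    import math
    from collections import Counter

    def ideal_coded_bits(data: bytes) -> float:
        """Ideal arithmetic-coding cost under a simple adaptive order-0 byte model."""
        counts = Counter({b: 1 for b in range(256)})  # Laplace-smoothed symbol counts
        total = 256
        bits = 0.0
        for b in data:
            p = counts[b] / total      # predictor's probability for the byte that occurs
            bits += -math.log2(p)      # arithmetic coding charges ~ -log2(p) bits for it
            counts[b] += 1             # update the model only after coding the byte
            total += 1
        return bits

    print(f"{ideal_coded_bits(b'abababababababab'):.1f} bits vs {8 * 16} raw bits")

A stronger predictor in place of the counts table means higher p for the bytes that actually occur, and therefore fewer bits.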


How is it similar?

There’s been a lot of work going in the other direction - using neural networks to aid the entropy prediction in classical compression algorithms - but I’m not seeing the conceptual link between how transformer/attention models work internally and how gzip works internally, beyond “similar words are easy to compress”.

I’m not seeing it because GPT representations are just vectors of fixed, not varying, size


An LLM or, well, any statistical model, is about prediction. As in, given some preceding input, what comes next?

One way to measure the accuracy of the model, as in its "intelligence", is to use its predictions to turn the input into just the differences from what was predicted; if the model is good at predicting, there will be fewer differences and the input will compress better.

So seeing how well your model can compress some really big chunk of text is a very good objective measure of its strength, and a way to compare it to the strength of other models.
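As a rough sketch of that measurement (illustrative only; the model and its predict_prob() interface here are hypothetical, not any particular library): the ideally compressed size of a text is just the model's total negative log-likelihood, so ranking models by compressed size ranks them by prediction quality:

    import math

    def compressed_size_bits(model, tokens) -> float:
        """Bits an ideal entropy coder needs, given the model's next-token predictions."""
        bits = 0.0
        for i, tok in enumerate(tokens):
            p = model.predict_prob(tokens[:i], tok)  # P(next token | everything before it)
            bits += -math.log2(p)                    # a confident correct prediction costs ~0 bits
        return bits

Lower total bits on the same big chunk of text means a stronger model, as long as you also charge for the decompressor/weights.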

So a competition is born! :)


Good summary.

The LLM vs. a static tree comparison has some interesting tensions. A static tree, as emitted by a compression algorithm, will probably beat an LLM much of the time, because it has full knowledge of the whole stream (or, in gzip's case, of the current window). So it can look back and say 'hm, the tree I spat out was not that good, let me build a better one'. Whereas an LLM does not really have that beforehand knowledge. Using a pre-cooked LZW tree for all inputs would be more akin to using an LLM.
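A rough sketch of the distinction being drawn (my framing, not the commenter's code): the static coder gets to count symbols over the whole stream first, before committing to probabilities, while an online predictor has to code each symbol with whatever it has learned so far:

    import math
    from collections import Counter

    def static_two_pass_bits(data: bytes) -> float:
        """Bits needed when symbol probabilities come from the finished stream."""
        counts = Counter(data)
        n = len(data)
        return sum(-math.log2(counts[b] / n) for b in data)

An online predictor (an LLM seeing the stream for the first time) would instead pay -log2(p) with p estimated only from what it has seen before each byte, so the early part of the stream is coded against a poorly informed model.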


I would envisage the LLM is allowed to train on each and every input token. So, to begin with, it knows nothing; but to predict the very last token, it has internalised the whole preceding stream.

Now I wouldn't expect it to be particularly competitive in enwik8 or enwik9, but the question would be: is there any max-model-size and input-length for which it would right now pull ahead and become the best known or at least competitive predictor?


I would expect it to be 'ok'. Basically as if you used only a pre-trained LZW table and shipped that along with each stream; the results would be mixed. A compressor has the advantage of both foresight and hindsight, whereas an LLM would only have hindsight, since any input stream would basically be at the mercy of the previous streams fed into it, and those may or may not be optimal.

It is an interesting hypothesis, but my gut feeling is that an LLM would perform worse on average. Competitive? Yes, but still worse. It is something I am sure someone will test.

From the hundreds of different compressor models I have made for myself over the years: believe it or not, the compressed data is usually the best part. It is the decode tree/table/key/whatever that usually ends up crowding out the savings on the compressed data. In this case it would be the LLM weights, or whatever the LLM spits out to serve as the tree/decoder.
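A back-of-the-envelope way to put it (my illustration, with made-up numbers): the fair size of a scheme is the payload plus everything the decoder also has to receive, whether that is a Huffman table, an LZW dictionary, or LLM weights:

    def total_cost_bits(payload_bits: int, decoder_side_bits: int) -> int:
        """Fair accounting: the payload plus whatever the decoder must also receive."""
        return payload_bits + decoder_side_bits

    # Shaving the payload to 4,000 bits doesn't help if the table needed to
    # decode it costs 10,000 bits (or the weights cost gigabytes).
    print(total_cost_bits(payload_bits=4_000, decoder_side_bits=10_000))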


gzip uses LZ and Huffman coding and not arithmetic coding with a predictor, so yes, these are not similar.



