
Hutter Prize: Compress a 100MB file to less than the current record of 16 MB - tosh
http://www.hutter1.net/prize/index.htm
======
abetusk
The author, Marcus Hutter, created the "Hutter Prize" [1], which offers a
€50,000 prize to anyone who can compress a 100MB Wikipedia file to under
15MB.

From the prize site:

    This compression contest is motivated by the fact
    that being able to compress well is closely
    related to acting intelligently, thus reducing
    the slippery concept of intelligence to hard file
    size numbers.

[1] [http://www.hutter1.net/prize/index.htm](http://www.hutter1.net/prize/index.htm)

~~~
baddox
I might be missing something here. Where does intelligence come in? On any
given deterministic computer platform, there is a finite number of 15MB
executables. And there is a finite number of 100MB text files. But the second
number is much larger than the first, and thus not every 100MB text file has a
15MB executable that outputs it.
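
A quick back-of-the-envelope version of the counting argument (plain
arithmetic, nothing specific to the prize rules):

    # Pigeonhole sketch: there are at most 2**(8 * 15e6 + 1) programs of
    # 15MB or less, versus 2**(8 * 100e6) possible 100MB files, so almost
    # no 100MB file has a 15MB program that outputs it.
    program_bits = 8 * 15_000_000
    file_bits = 8 * 100_000_000
    print(file_bits - program_bits)  # files outnumber programs by ~2**680000000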

Of course, maybe we can find a 15MB executable that outputs the specific
Wikipedia file he’s asking for, and there might be clever ways to search for
or construct that executable on a specific computer platform, but it doesn’t
strike me as some particularly generalizable sense of compression requiring
intelligence.

~~~
Scarblac
He has a book, "Universal Artificial Intelligence", that defines what he
considers AI and then solves it. The only catch is that it relies on
Kolmogorov complexity, which is incomputable.

The Kolmogorov complexity of a string is (loosely) the size of the smallest
possible Turing machine that will output that string when run. So a string
with 1000x "a" has a lower complexity than a random string of length 1000.

He sees AGI as a system that, given all its input up to now (from sensors and
the like) and some value-function that defines the goal we're trying to reach,
takes the action that has the highest expected value of the value-function.
It's hard to give a better definition of intelligence in a completely general
sense: given all input, choose the most likely best action.

That requires some guessing about the future: given all this input so far, for
all possible next inputs in the next time step, what is the likelihood of
each.

And his answer is: the likelihood is tied to the Kolmogorov complexity of the
sequence of all inputs so far plus the possible future input we're currently
considering. The idea is that there is information in the inputs so far, and
the continuation that best uses it has the lowest Kolmogorov complexity and is
the most likely to occur. If we've seen 999 "a"s, we expect another one. If
we've seen a car-shaped object slide left to right on previous camera images,
we expect it to continue its movement. If the input so far is completely
random, then every possible future is equally likely. Et cetera.
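
For reference, the formal version (again from memory, so treat it as a
sketch) is Solomonoff's universal prior, where U is a universal prefix
machine and K is prefix Kolmogorov complexity:

    % probability of string x: programs are weighted by 2^-length,
    % and the shortest program that outputs x... dominates the sum
    M(x) = \sum_{p \,:\, U(p) = x*} 2^{-|p|} \approx 2^{-K(x)}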

Given that, it becomes possible to calculate the value function for every
combination of possible action and possible next sensor input, and the AI can
know what to do.

Except of course for the detail that this is (wildly) incomputable.

But what is somewhat correlated to Kolmogorov complexity? _Compressibility_.

So that is why Hutter is interested in better compression algorithms.

All this is very loosely stated from memory; I was interested in this stuff
over a decade ago...

~~~
0-_-0
Why aren't you interested anymore? We're getting closer to actually doing what
you just outlined.

~~~
Scarblac
I got a wife and kids so have no free time anymore, and nowadays I feel there
are far more important problems than AI in the world (climate change, huge
loss of biodiversity, cheap very effective propaganda).

------
dang
Thread from 2014:
[https://news.ycombinator.com/item?id=7405129](https://news.ycombinator.com/item?id=7405129)

2008:
[https://news.ycombinator.com/item?id=143704](https://news.ycombinator.com/item?id=143704)

------
asperous
"as a path to AGI" anyone else disagree with this? Creating a highly efficient
compressor seems very much not AGI

~~~
gwern
All something like GPT-2 is, is a text compressor, a model for an arithmetic
encoder specifically. It predicts the probability of the next token
conditional on a history. You can directly measure the 'bits per
character'/'perplexity' as a measure of (compression) performance, and that is
typically how these language models are evaluated: eg
[http://nlpprogress.com/english/language_modeling.html](http://nlpprogress.com/english/language_modeling.html)
[https://paperswithcode.com/task/language-modelling](https://paperswithcode.com/task/language-modelling)
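
A minimal sketch of that equivalence, with hand-picked probabilities standing
in for a real model's conditional predictions:

    from math import log2

    # A predictor that assigns probability p to the character that actually
    # occurs needs -log2(p) bits for it under an ideal arithmetic coder.
    text = "aaab"
    predicted_prob = {"a": 0.7, "b": 0.1}  # assumed per-character predictions

    bits = sum(-log2(predicted_prob[c]) for c in text)
    print(f"{bits / len(text):.2f} bits per character")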

The Hutter dataset is included, incidentally, and the top is a Transformer-XL
(277M parameters) at 0.94 BPC. Does all of that seem 'very much not AGI'?

~~~
doctorpangloss
Yes, it does not at all seem like AGI! You’ve explained succinctly why the
GPT-2 result is maybe not on the road to AGI.

~~~
gwern
Language models trained in a predictive fashion achieve ever closer to human
compression rates (humans would only get ~0.7-0.8 BPC on Hutter), and are
responsible for not just the creepiest & most realistic text generation to
date but also setting SOTAs across all language-related AI benchmarks like
SuperGLUE, designed to test language understanding and real-world reasoning,
or chatbots. If that has nothing whatsoever to do with progress towards AGI,
we'll just have to agree to disagree.

------
devwastaken
Whenever new compression algorithms come up, I can't help but think they're
cheating with their dictionaries by pre-defining the most common words and
character sequences, such as HTML tags. I wonder if it would be more ideal for
human language to simply have the best dictionary, perhaps even with a
negotiation on both ends to define what that dictionary is. If browsers had an
efficient dictionary of all English words and phrases, the compressed output
would seem tiny.

~~~
terrelln
In this competition, and in similar competitions, the size of the binary used
to decompress is taken into account. If you wanted to use a dictionary, you
would need to pay for it in binary size. In this competition, the file must be
self-decompressing.

Dictionaries are powerful tools when compressing small data. But once the data
is large enough they stop mattering so much. See the dictionary compression
section of [https://engineering.fb.com/core-data/zstandard/](https://engineering.fb.com/core-data/zstandard/).

------
sagebird
I have a concept I have explored a bit but has yet to yield a positive
response (I have not experimented long enough to come to any conclusion):

Can you add text to a string of text to make the compressed string shorter?

I suppose you could call it “compressor hinting” and it would be specific to
the compressing algorithm. The added text would be tagged through an escape
sequence, so it could be removed after the decompressing stage.

My naive approach is to add randomly generated hints at randomly chosen
locations and then gzip/ungzip. I haven’t had success yet. I think the
potential is limited by the expressiveness of the compressor’s “instruction
set” - i.e., whether it can understand generalized hints.
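
For what it's worth, the naive loop could look something like this (a sketch
assuming a hypothetical input file sample.txt, and skipping the
escape-sequence bookkeeping needed to strip hints after decompression):

    import gzip
    import random

    data = open("sample.txt", "rb").read()  # hypothetical input
    baseline = len(gzip.compress(data, 9))

    for _ in range(1000):
        hint = bytes(random.randrange(256) for _ in range(8))
        pos = random.randrange(len(data) + 1)
        hinted = data[:pos] + hint + data[pos:]
        if len(gzip.compress(hinted, 9)) < baseline:
            print("shrinking hint", hint, "at position", pos)

In practice random insertions almost always make the gzip output larger,
which matches the lack of success so far.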

~~~
waterhouse
There are some compressors (zstd is the one I'm thinking of) that accept
"dictionaries", which are meant to be produced from datasets similar to what
you're going to compress; I would guess they contain something resembling
frequency tables.
[https://github.com/facebook/zstd](https://github.com/facebook/zstd) has some
description but doesn't explain precisely what the dictionary contains.

~~~
terrelln
Zstd's dictionaries contain two things:

1\. Predefined statistics, based on the training data, for literals (bytes we
couldn't find matches for), literal lengths, match lengths, and offset codes.
These allow us to use tuned statistics without the cost of putting the tables
in the headers, which saves us 100-200 bytes.

2\. Content: unstructured excerpts from the training data that are very
common. This gets "prefixed" to the data before compression and
decompression, to seed the compressor with some common history.

Dictionaries are very powerful tools for small data, but they stop being
effective once you get to 100KB or more.
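
To make both points concrete, here is a minimal sketch using the
python-zstandard bindings (the record format and sizes are made up for
illustration):

    import zstandard as zstd

    # Many small, similar records: the regime where dictionaries shine.
    samples = [b'{"user": %d, "status": "ok"}' % i for i in range(1000)]
    dict_data = zstd.train_dictionary(4096, samples)

    record = b'{"user": 4242, "status": "ok"}'
    plain = zstd.ZstdCompressor().compress(record)
    primed = zstd.ZstdCompressor(dict_data=dict_data).compress(record)
    print(len(plain), len(primed))  # the dictionary version should be smaller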

------
zelon88
I'll have to give this a go. I'll probably fail miserably but it sounds like
fun. Related: [1] Here is my (sophomoric) attempt at a compression algorithm,
and [2] here is the blog post I wrote about it.

[1] [https://github.com/zelon88/xPress](https://github.com/zelon88/xPress)

[2] [https://www.honestrepair.net/index.php/2019/03/08/xpress-an-...](https://www.honestrepair.net/index.php/2019/03/08/xpress-an-experiment-in-data-compression/)

~~~
mruts
Have you run it on the competition dataset? What’s the size of the executable
+ compressed data?

~~~
zelon88
I'm not sure. Currently the compression variables are somewhat inefficient.
I've compressed similar files down to about 17MB using hardcoded variables,
but I've yet to stumble across values that will reliably compress a variety
of data. I was hoping to test it enough to come across patterns that would
let me develop heuristics, changing the variables according to different
input file characteristics. This is simply extremely time consuming. I have a
non-open version of a listener for server applications that I've mostly moved
my testing over to since I created the client compressor/extractor above.

------
rasz
The prize page hasn't been updated in a while; there are better results
(14,838,332 bytes) noted here:
[http://mattmahoney.net/dc/text.html](http://mattmahoney.net/dc/text.html)

~~~
hinkley
Looks like it shaved off half a percent total size but takes 7 times as long
to run.

~~~
rasz
And reading the compressor description, it seems to be a hand-written version
of the GPT-2 mentioned somewhere above, with a hard-coded English language
model as one of the heuristics.

------
tombert
A few years ago this was trending in hacker circles [1]. The repo is
obviously a joke, but I always wondered if it could be taken more seriously
and made into something cooler. Assuming that Pi is a normal number, every
potential 1kb chunk of a file would be found somewhere in it.

I tried programming this idea very briefly a couple years ago but I got pulled
away for something else.

[1] [https://github.com/philipl/pifs](https://github.com/philipl/pifs)

~~~
krastanov
I imagine the BigInt index that points to the starting digit for your data
would take more space than the data itself. Here is an example in Julia:

        julia> setprecision(BigFloat, 10_000) do
                   findfirst("123", "$(BigFloat(pi))")
               end
        1926:1928

I.e. the string "123" starts at index 1926.

On the other hand, this type of "encoding" is always fun to think about. I
would suggest reading "Permutation City" by Greg Egan if you are amused by
this.

~~~
earenndil
> I imagine the BigInt index that points to the starting digit for your data
> would take more space than the data itself

Assuming optimal conditions (i.e. optimal randomness), you can expect as many
pieces of data to have pointers smaller than themselves as have pointers
bigger; you can't necessarily win that way, only in some cases.

But maybe if you can find two 'opposite' algorithms, such that for any data
where the pointer is larger with the first, it's smaller in the second...

(From what I know of information theory, the extra bit you have to use to
specify which algorithm to use will outweigh all savings, but it's still fun
to think about.)

~~~
krastanov
I doubt that... You are comparing a random variable (the size of the data) to
a sum of random variables (the index of the data), so I would expect the
index to be greater than the size: for a specific k-digit string, the
expected position of its first occurrence in random digits is around 10^k, so
writing the index down takes about as many digits as the data itself... But I
am fairly uncertain.

------
benbristow
That's easy. Just need to use Pied Piper's middle-out algorithm.

------
CamperBob2
More interesting would be a semantic compressor, one that didn't necessarily
return the exact words in the corpus but that would turn a small amount of
compressed data into a document with very similar meaning.

E.g., "Because trees tend to have a low albedo, removing forests would tend to
increase albedo and thereby cool the planet" is a sentence from the file that
would need to be compressed. It should be acceptable for the decompressed text
to read, "Eliminating large groups of trees would improve surface reflectivity
and provide support for a planetary cooling trend," or to employ any number of
similar phrasings.

Any form of AGI arrived at through Hutter's exercise would be more akin to the
intelligence shown by an eidetic person when they recite the telephone book,
and computers are already good at that.

~~~
londons_explore
Given enough time and compute, an AGI could enumerate all possible ways to
say the sentence in your example, and then store a number indicating which of
the enumerated sentences exactly matches the input.

That number will be much smaller than the text of the sentence.
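
A toy version of the scheme, with a tiny hypothetical grammar standing in for
"all possible ways to say the sentence":

    from itertools import product
    from math import ceil, log2

    # Both sides enumerate the same candidates in the same order, so the
    # message is just an index into that shared list.
    slots = [("Removing", "Eliminating"),
             ("forests", "large groups of trees"),
             ("would raise", "would improve"),
             ("albedo", "surface reflectivity")]
    sentences = [" ".join(words) for words in product(*slots)]

    target = "Eliminating forests would raise albedo"
    index = sentences.index(target)  # the "compressed" form
    print(index, "of", len(sentences), "->", ceil(log2(len(sentences))), "bits")
    assert sentences[index] == target  # reconstruction is exact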

~~~
CamperBob2
That number would have to specify one of all possible sentences, though, not
just one from the subset in question. I'd assume it would be longer in that
case.

------
MilesTeg
I thought of a potential loophole: what about a non-deterministic extractor?
Let's say you can make your program smaller by having only a 1% chance of
extracting perfectly. Just submit the program an expected 100 times and claim
your prize.

~~~
svantana
Through the magic of information theory, I can tell you that this particular
trick will save you at most log2(100) ≈ 6.6 bits: that's the cost of
specifying which of the ~100 runs is the one that extracts correctly.
Probably not enough to make a difference.

~~~
MilesTeg
I see. You are right.

------
JyrkiAlakuijala
Alexander Rhatushnyak -- the first, second, third and fourth winner of the
Hutter prize -- is the main contributor to JPEG XL lossless mode. Perhaps we
all will eventually get practical benefit from his ability to build amazing
compression solutions.

------
jerrre
Since it's about compressing one specific sequence of bytes, and compression
time/cost doesn't matter - only decompression - wouldn't finding a set of
[neural network inputs + weights + processing code + error correction] (I
think in the direction of GANs or auto-encoders) be a way to have a high
chance of finding improvements? Not 100% sure it would be in spirit of the
contest, and if the cost of training would offset the reward...

~~~
gwern
A fully NN solution (as opposed to using a light sprinkling like all
competitive solutions do) requires big binaries for the most part, while an
arbitrary program can easily memorize exact sequences to find repetition while
using a tiny NN for some finetuning of predictions. A pure NN solution like a
Transformer-XL _does_ turn in the record-setting performance on natural
language datasets including WP... unfortunately, the required NN model sizes
alone tend to be larger than the entire uncompressed WP corpus here, and so
have zero chance of ever being the minimum of model+compression. (An
observation which I think supports the idea that the Hutter Prize has long
outlived its intended use for measuring progress towards AGI; it's now just
sort of a 'demo scene' version of AI benchmarking, testing intelligence within
extreme constraints of sample efficiency/compute, rather than a true useful
benchmark.)

The fact that small compressors like gzip or zpaq do so well at small
data/compute but then can't compete as you scale up to tens or hundreds of
gigabytes of text (amortizing the cost of those fancy NN models) can be
considered a version of the Bitter Lesson:
[http://www.incompleteideas.net/IncIdeas/BitterLesson.html](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)

~~~
imtringued
The bitter lesson seems to be striving for the wrong goal. Replicating human
performance will at best give you human performance. Humans are incredibly
inefficient; they require multiple orders of magnitude more computational
resources just to match a machine. Since we do not have an abundance of
computational resources, trying to replicate humans will always produce
results inferior to simply running conventional algorithms.

~~~
gwern
Why then do systems following the bitter lesson like AlphaZero strongly
outperform humans? (Hint: it involves scaling.)

------
Flow
Maybe this test should not focus on the output being byte-identical to the
input file, but on it being structurally identical. It's an XML file with
another semi-structured language inside the text nodes. There's an
opportunity to exploit that fact.

------
shultays
I can compress it to 0 bytes! With a decompressor that will be around 100MB.

~~~
CamouflagedKiwi
Unfortunately this seems to have been foreseen; the decompressor size is
included in the measurement.

~~~
hinkley
On a more serious note, I think this can still be gamed a little bit.

For instance, DEFLATE has default variable-width encoding tables built in,
based on letter frequencies of the English language as measured some time
ago. If you use an algorithm with any kind of defaults, those count against
the cost, but it doesn't necessarily cost _more_ to tune them for a solitary
input.

~~~
krasin
DEFLATE compressed data consists of blocks. Each block can be either
uncompressed, or compressed with a pre-defined Huffman table (the variant
mentioned in the parent comment), or compressed with a dynamic Huffman table.
Many DEFLATE compressors would consider all three options (at higher -j
values) and choose the best. So, one does not need to tune DEFLATE to get the
advantage of dynamic Huffman tables. And any kind of arithmetic encoding (or
the new hotness, Assymetric Numeral Systems) would do a better job.
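
The fixed-versus-dynamic trade-off is easy to see with zlib's strategy knob
(a sketch assuming a zlib build that exposes Z_FIXED, which forces the
pre-defined tables):

    import zlib

    data = b"the quick brown fox jumps over the lazy dog " * 2000

    def deflate(payload, strategy):
        # Raw DEFLATE stream (wbits=-15) at maximum effort.
        comp = zlib.compressobj(9, zlib.DEFLATED, -15, 9, strategy)
        return comp.compress(payload) + comp.flush()

    print(len(deflate(data, zlib.Z_FIXED)))             # fixed tables only
    print(len(deflate(data, zlib.Z_DEFAULT_STRATEGY)))  # dynamic tables allowed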

For a competition like this, all simple approaches have probably been tried
already. The current leader has been tuning their approach for 5 years:
[http://mattmahoney.net/dc/text.html#1159](http://mattmahoney.net/dc/text.html#1159)

------
machrider
Do you think he would notice if the test machine downloaded a ~100MB file from
the web?

~~~
raphaelj
Can't he just disable the Internet interface?

~~~
imtringued
It's not written in the restrictions, so it wouldn't be fair.

~~~
throwawaywego
> Programs must run without input from other sources (files, _network_,
> dictionaries, etc.) under Windows or Linux without additional installations.

------
nihilarian
Right click, create shortcut, 4kb

~~~
cvs268
Reminds me of the time a girl at school copied a shortcut to a DOC file onto
a floppy disk and submitted it.

The teacher had a hard time explaining shortcuts to her; the girl kept
insisting that the PC at school was broken, since the shortcut worked on her
machine at home.

~~~
sokoloff
What I hope really happened is that the savvy student bought herself an extra
day to complete the assignment!

