
The $5000 Compression Challenge - igul222
http://www.patrickcraig.co.uk/other/compression.htm
======
haberman
Poetic justice. I love this quotation from Mike:

"I think that I did not make a mistake except to believe that you truly
intended to attempt to compress the data I was sending you."

Mike is happily preying upon people by taking their money to enter what he
believes is an impossible contest, but the moment he is outsmarted he appeals
to morals and calls into question whether Patrick was acting honestly.

I'm guessing there is some backstory on this newsgroup involving people who
would claim to invent compression algorithms that do the impossible. I'm
imagining that one day Mike thought "time to get these people to put their
money where their mouth is."

Still, it's pretty low to pose the challenge without saying "I'm taking this
bet because it's highly unlikely that you can actually succeed. This is
explained in the FAQ. Are you sure you want to give me $100?"

~~~
svantana
Actually, it is most likely possible, just very difficult. For example, it is
likely that the data file was generated by a deterministic computer program,
and that program is most likely a lot smaller than the data it outputs.
Therefore, one could use that very program as the "decompressor" - voila!

The problem is that it would take astronomical computational resources to find
that program, even though there is a very straightforward algorithm for doing
so - just test every possible program starting from 0, 1, 2, 3, and so on...
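
As a toy illustration of that search (in Python, purely hypothetical): suppose
the generator had been a known PRNG with a small integer seed. Then the seed
alone would serve as the "compressed file". The real search space is every
possible program, and candidate programs need not even halt, which is what
makes the general search astronomically expensive:

    import random

    def find_seed(target: bytes, max_seed: int = 10**6):
        # Brute-force one tiny corner of "all possible programs":
        # every small seed of Python's Mersenne Twister.
        for seed in range(max_seed):
            if random.Random(seed).randbytes(len(target)) == target:
                return seed
        return None

    # Only works because we cheated and know how `original` was produced.
    original = random.Random(42).randbytes(3 * 2**20)  # 3 MB of pseudo-random bytes
    print(find_seed(original, max_seed=100))           # -> 42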

~~~
praptak
The data was taken from random.org, whose source of randomness is a physical
process, highly unlikely to be "algorithmizeable" (is that even a word?)

------
Nate75Sanders
Mike Goldman does not need to ever issue a challenge again.

It is WIDELY agreed upon by people who engage in this activity regularly
(known as "prop betting") that the spirit of the law does not matter one bit.

He was way out of his element here and Patrick was well within his rights to
do what he did.

~~~
dustyleary
A great set of stories from an amazing life: [http://www.amazon.com/Amarillo-
Slim-World-Full-People/dp/B00...](http://www.amazon.com/Amarillo-Slim-World-
Full-People/dp/B000GG4GS0)

Amarillo Slim engaged in this sort of prop bet all the time. Some of the more
interesting stories in which he "abuses the rules":

* He bet that he could hit a golf ball a mile, "as long as bounces and rolls are allowed... flat ground of course"... When the wager was accepted, he teed off at a frozen pond, where it wasn't that difficult to accomplish.

* Two variations on beating a champion at their own game:

- Slim played Minnesota Fats at pool, "as long as I can provide the cue
sticks"... and then beat Fats when they both had to play with broomsticks
instead of real pool cues.

- Slim bet that he could beat a very good ping pong player, "as long as I get
to pick the rackets"... When the wager was accepted, Slim brought two
identical cast iron skillets and let his opponent pick which "racket" he
wanted. Without having practiced, his opponent didn't stand a chance. Later,
someone else approached Slim and wanted to make the same bet. They had heard
of the original bet, and had an Asian ping pong champion as a "ringer". (It
seems pitifully obvious to me that Slim wouldn't pull the same trick after the
story was out, but apparently this gambler had his ringer practice with a
skillet and thought he would be "safe".) For the second bet, Slim used
"standard" Coca-Cola bottles, and won again.

------
ars
The pigeonhole principle only applies to a _specific_ compression algorithm.

If you get to custom-write the algorithm for the data, you can arrange things
so that for this data file it will be smaller, yet larger for any other file,
and still satisfy the principle.

i.e. this is not actually impossible. If you can find ANY redundancy in the
data, and you can code a very small decompressor specifically for that, you
can probably win.

And due to the nature of randomness, there are _always_ numeric streaks, and
other patterns in the data. The larger the datafile the better the chance of
finding some sort of pattern or streak.

The decompressor does not have to be large either - it would be perfectly
legal to use a perl script, for example (i.e. so you don't have to waste space
writing I/O handling code).
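
As a concrete (and purely hypothetical) sketch of that idea in Python rather
than perl: cut out the single longest run of a repeated byte and prepend a
tiny header saying where to splice it back in. It only saves space if the file
happens to contain a run longer than the ~15-byte header, which 3 MB of good
random data almost certainly does not:

    import re

    def compress(data: bytes) -> bytes:
        # Find the longest run of one repeated byte and remove it from the data.
        run = max(re.finditer(rb'(.)\1*', data, re.DOTALL), key=lambda m: len(m.group()))
        start, end = run.span()
        header = b'%d:%d:%s' % (start, end - start, run.group()[:1])
        return header + data[:start] + data[end:]   # smaller only if the run beats the header

    def decompress(blob: bytes) -> bytes:
        start, length, rest = blob.split(b':', 2)
        start, length = int(start), int(length)
        byte, body = rest[:1], rest[1:]
        return body[:start] + byte * length + body[start:]  # splice the run back in

    data = b'some random-ish bytes with a ruuuuuuuuuuuuuuuuuuun in the middle'
    assert decompress(compress(data)) == data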

Edit: I suspect I may be wrong here.

~~~
jgeralnik
No.

Let's say you have a file of size X. You want to write a program of size Y
which will decompress an (X-Y-1)-sized file back to the original file. Does
such a program and file exist for every file of size X?

Using the pigeonhole principle, we can see that it does not. There are 2^X
files of size X, but only 2^(X-1) inputs of total length Y + (X-Y-1) = X-1.
Therefore, not every file can be compressed this way. It doesn't matter what
your decompressor is written in or how, you still can't do it.
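
A toy check of that counting for a tiny X (sizes in bits), counting every
possible shorter input rather than just those of length X-1, which is the even
stronger version of the same argument:

    X = 12                                              # file size in bits
    files_of_size_x = 2 ** X                            # 4096 distinct X-bit files
    all_shorter_inputs = sum(2 ** k for k in range(X))  # 4095 strings shorter than X bits
    print(files_of_size_x, all_shorter_inputs)
    # 4096 > 4095: even counting *every* shorter string there aren't enough
    # pigeonholes, so some X-bit file has no shorter decompressor+data pair.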

~~~
czr80
This is correct for the general case, but I think his argument is that for any
specific file you can (almost certainly) find some redundancy and so find an
algorithm to compress that specific file.

~~~
ars
That is exactly my argument.

But now I think I'm wrong, because what if you take the new data (i.e. the
decompressor plus the compressed data) and try to compress that?
~~~
bloaf
You can't compress indefinitely because presumably your algorithm does not
shrink in size. If your algorithm was 200 bytes, you could never successfully
use it to compress a 199-byte file.

Even if repeated compression were possible for a while, there would be a
minimum file size below which further compression is impossible.

------
CJefferson
This is an old story, but a fun one.

To me it is clear that Mike Goldman should have paid up. Patrick clearly
completed the challenge placed on him.

While Patrick didn't obey the spirit of the challenge, he obeyed the letter of
it, and Mike knew full well that the spirit of his challenge was impossible!

~~~
emn13
Also, I'm not sure it was very wise of Mike to assume the data was truly
incompressible. It's unlikely someone would manage to compress it (certainly
for just 5k), but if you truly believe you have a source of billions or
trillions of bits, and that those bits have exactly as much entropy as their
length - well, you're taking a lot on faith.

Indeed, the fact that the content length was preselected implies some amount
of regularity that might be abused.

Oh well.

------
splicer
Here's another approach:

Let the compressed file contain a hash (say, SHA1) of the original file. The
decompression program then generates random files of the chosen size. If a
generated file's hash doesn't match the desired hash, delete it. Now run the
program for a very long time. The program is _likely_ to eventually reproduce
the original file (along with a bunch of files that happen to have the same
hash), and you win :)
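
A minimal sketch of that "decompressor" in Python (the hash plus the file
length are the entire "compressed file"; it is hopeless in practice, since the
expected number of attempts is astronomical and many different files share the
same hash):

    import hashlib, os

    def decompress(sha1_hex: str, size: int) -> bytes:
        # Blind search: try random files of the right size until one matches
        # the stored SHA-1. Correct only "eventually" -- the expected number
        # of attempts is on the order of 2**160.
        while True:
            candidate = os.urandom(size)
            if hashlib.sha1(candidate).hexdigest() == sha1_hex:
                return candidate

    original = os.urandom(32)
    compressed = (hashlib.sha1(original).hexdigest(), len(original))
    # decompress(*compressed) would run essentially forever, and might return
    # a different 32-byte file that merely collides with the hash.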

~~~
Beltiras
Let's say you have a file of n bits and a hash of log(n) bits. That means on
average about 2^n / n different files collide with (match) any given hash
value. I made a similar suggestion on another HN thread linking that same
story (I suggested using two radically different hashing functions; one hash
result would not necessarily have an evenly distributed result from the other
hashing function when you vary the plaintext). Even while posting I thought
that I _must_ be wrong, I just didn't _feel_ wrong about it. Now I do.

~~~
splicer
The purpose of using a hash is just to avoid running out of disk space. With
enough disk space, you don't even need to hash/delete files.

------
vanni
Previous discussion:

<http://news.ycombinator.com/item?id=4616704>

------
algorias
Even without doing any filesystem tricks, the challenge seems quite possible.
While it's true that writing a compressor that compresses all input data is
impossible, this barrier is removed because the compressor only needs to work
on a single file.

~~~
UnoriginalGuy
But you're forgetting what the challenge actually is. It is file+decompressor
< input.

So by the very nature of it, if your decompressor is non-zero in length, you
have to find some efficiencies/tricks to make the file smaller.

Realistically the overhead on the decompressor will be 10 KB and that is
before any kind of actual logic, so you are looking at shaving a minimum of 50
KB off of the original file (which is completely random).

~~~
algorias
I'm not forgetting that. It's just that any specific random file will (with
high probability) have patterns that can be exploited. It's not guaranteed to
work on all files, of course. The theoretical limit (by a counting argument)
is that at most half the strings of length n can be losslessly represented by
strings of length n-1, because there are half as many strings of length n-1
as of length n.

So if, for example, 2% of files can actually be reduced in size (I don't know
what the actual number is, or whether it is even computable), that's still
positive EV in a $100 vs. $5000 bet.

~~~
jgeralnik
The file is not really random, though, since Mike generated it and had time to
make sure it was a "good" file before sending it on.

That is, if the file were truly random, it would have a chance of 1/(2^X) of
being all zeros, where X is the size of the file in bits. But since Mike would
reject that file, the chance is actually 0.

Same for all files with easily exploitable patterns - for example, I am sure
that Goldman checked that the file could not be compressed with gzip before
sending it.

So the EV is probably not positive, even if there is a small chance that a
random file of size X can be compressed.

------
RyanZAG
It is definitely possible to win this challenge though.

Consider an arbitrarily long series of integers. Somewhere within this series
of integers, there will be some kind of randomly created pattern, since this
is a property of an infinite set. E.g. somewhere within the data set there
could be the values [1, 2, 3, ... 10] or [1, 3, 9, ... 27] or
[1, 2, 4, 8, ... 32] - it does not matter which of these patterns exists, only
that there does exist some mathematical pattern in the data.

The chance of there being no pattern in a big enough set of random data is
essentially zero, as there is a finite number of possible data combinations
for bytes [1..256][1..256] etc. I guess a data set of 256^256 bytes would
guarantee a pattern, but I'm sure there is a far smaller number that would
give 99% confidence.

Once you find a pattern in the data, you can remove that pattern and replace
it with code that will recreate the pattern using a data offset. ie. you
remove the pattern from the data completely, and replace it with a smaller
piece of code to recreate that pattern exactly and insert it into the correct
position.

The key here is that once the data has been generated, it is no longer 'random
data' but a fixed input. E.g. you cannot compress a random unknown string of
bytes, but you can compress the string [1, 2, 4, 8, ...]

The output data would have all possible mathematical patterns removed from it,
and the decompression code would be just a list of mathematical functions and
data offset points.

~~~
lifthrasiir
You are recreating Kolmogorov complexity there. Kolmogorov complexity is
defined as the size of the minimal program (in some programming language) that
recreates the given string. By definition, your strategy will compress the
data no better than its Kolmogorov complexity. In this sense, Kolmogorov
complexity measures the data's true randomness. If the random data has been
generated correctly, however, its complexity should be near its length, so
your strategy won't work. (And there is enough evidence that this was the
case.) It does not matter whether the data is already known or not; it is the
data's inherent randomness that prevents you from compressing it.

To be sure, Patrick Craig did not compress it in the pure information-theoretic
sense. Mike Goldman, however, failed to make the goal of the challenge
equivalent to that information-theoretic notion of compression.

~~~
RyanZAG
If you pipe a random process into an image viewer, eventually you will get an
image that can be compressed with PNG compression. If you let the random
process run, you will eventually get enough of these PNG-compressible images
to save more bytes than the size of a PNG decompression library. This may only
occur after a ridiculous number of bytes, but it is guaranteed to occur
eventually, by the properties of randomness.

So the method will work, but it may take a very large amount of data before it
does. If the method did not work, it would imply that a random process cannot
generate an image that can be compressed with PNG - and that is definitely
false.

~~~
jgeralnik
But you also have to store the index at which the image occurs, _and the index
is bigger than the gains from the compression_.

This is similar to the argument that since pi is (probably) normal, any
sequence of characters appears somewhere in it, so we could simply store
indexes into pi instead of storing the data itself.

The problem is that, given a file of X bytes, that sequence of X bytes
probably first occurs somewhere around 256^X places into pi, so the index
costs at least as much to store as the file.
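
Rough arithmetic for that: to name the position where a k-byte sequence
typically first shows up in a normal number's digit stream, you need an index
around 256^k, which itself takes about k bytes to write down:

    import math

    k = 100                              # bytes we hoped to "compress away"
    typical_first_occurrence = 256 ** k  # rough scale of the first match position
    index_bytes = math.ceil(math.log2(typical_first_occurrence) / 8)
    print(index_bytes)                   # 100 -- the pointer is as big as the data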

------
DanBC
If you're interested in efficient compression of English text you might like
the Hutter Prize (<http://prize.hutter1.net/>) - a €50,000 prize if you can
compress the 100MB file _enwik8_ to less than the current record of about
16MB. They claim it's an AI problem. I guess it could be if you're compressing
a bunch of different English texts, but this challenge has a single text. A
program that provides very efficient compression of this text might not be so
great for other texts.

Mike's challenge took a specially selected random file. As other people said,
this was an attempt to show compression kooks that some files are not
compressible (if you include the size of the decompressor).

------
pavanky
Makes me wonder if a "decompressor" generating hundreds of decompressed files
by brute force (of which the original file is one) would be a valid solution.
You could store a certain section of the original as the "compressed file".

------
kyle4211
I can't find anything which solidly points to this challenge being 'likely'
impossible.

None of this is as simple as pointing at Kolmogorov complexity. Randomness is
not so much inherent as relative to the machine on which you're running your
program.

[http://rjlipton.wordpress.com/2011/06/02/how-powerful-are-
ra...](http://rjlipton.wordpress.com/2011/06/02/how-powerful-are-random-
strings/)

------
xentronium
Why even bother with file concatenation? Just use a 3-million-character
filename on an empty file to "compress" 3 MB of data.

~~~
jlgreco
DOS I assume.

------
kunil
I don't get it - why didn't he just write a simple algorithm and get the
prize? It looks like easy money.

~~~
CJefferson
You mean, why didn't he write a simple compression algorithm which actually
shrunk the data?

The short answer is, it was (almost certainly) impossible. You cannot compress
random data. The only reason files you compress normally seem to compress is
that almost every file you come across in practice contains some structure,
and therefore some redundancy.

~~~
kunil
It is not random; the input file is constant. Surely you can write an
algorithm to compress a specific file.

~~~
ghshephard
But the sum of your decompression program and compressed file would then be
larger than the original data file.

~~~
haberman
Mike's contest didn't limit the total size of the (random) input data file, so
you could make it arbitrarily large to make the relative cost of the
decompression program arbitrarily small.

~~~
ghshephard
If the data file is truly random, then there is no decompression program +
compressed file that would be smaller than the original random file. That's
the entire point.

~~~
haberman
I'm not saying you could! I am only rebutting your earlier line of reasoning.

 _IF_ you could write an algorithm that would compress a specific block of
random data by any non-zero percentage, and _IF_ you can make the original
random data arbitrarily large, _THEN_ the overall size of the code to
implement this algorithm would not matter, because you could amortize it over
an arbitrarily large random data file.
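
Illustrative (made-up) numbers: even a tiny fractional saving dwarfs a
fixed-size decompressor once the file is large enough.

    file_size = 2 ** 30                        # a 1 GiB challenge file
    fraction_saved = 0.0001                    # hypothetical 0.01% saving
    decompressor_size = 5_000                  # a ~5 KB script
    savings = int(file_size * fraction_saved)  # ~107 KB
    print(savings > decompressor_size)         # True: the fixed cost is amortized away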

However, I am _not_ claiming that such an algorithm to compress a specific
block of random data exists! Other people, though, are arguing that this is
indeed theoretically possible: <http://news.ycombinator.com/item?id=5025527>

------
splicer
Just post the original data file on the web, then have your "decompression"
program download the file. The "compressed" file could simply contain a URL.

