
The $5000 Compression Challenge - jpablo
http://www.patrickcraig.co.uk/other/compression.htm
======
jpablo
I wonder if Mike Goldman ever made a response after everything was published.
I found it very distasteful when he threatened to accuse Patrick of fraud.

He should have recognized that the rules were vaguely stated and at the very
least offered an apology. Flatly refusing to download all of the files, when
he had expressly accepted that multiple files were OK, was kind of weird.

The Google Groups link in the page no longer works, so I wonder if that's
archived elsewhere.

~~~
hermannj314
> when he threatened to accuse Patrick of fraud

Exactly. He was the one committing the fraud.

To quote Mike Goldman (fta): "I cannot fathom that someone would read the
comp.compression faq first and then want to participate after understanding
the futility of the effort. On the other hand, some people just don't
understand information theory too well."

He took $100 from this guy knowing he was defrauding him! It was like watching
a game of nerd 3-card monte.

~~~
jd
Exactly, I consider Mike to be acting in bad faith here. Mike was willing to
accept $100 (which in most of the world constitutes many hours of labor) not
just once, but every time people wanted to play again. He was willing to
accept real money for something that he believed was literally _impossible_.
Mike didn't hint at the impossibility in any way. E.g. if he had said "Have
you read the FAQ, and are you still sure you want to play?" I would view it in
a different light. But no, Mike seemed all too eager to accept the money.

So yes, when a taker appears he's either not very bright (and therefore you
shouldn't take advantage of him) or he's trying to out-hustle you (and playing
legalese with the terms is then fair game). After Mike got beaten at his own
game he then continued to (a) weasel out of it, (b) threaten legal action, and
(c) attempt to deflect.

Not exactly a class act by Mike.

------
hermannj314
Just to clarify, the guy made a contest that he knew, statistically, could not
be won except if the random data file he produced just happened to be
compressible (due to luck alone).

In other words, he knowingly accepted money in a contest of chance not a
contest of skill. So this contest (assuming either participant was in the US
at the time) violated federal and state laws against unlicensed gambling
operations, correct? [Assuming Mike Goldman did not possess a license to
operate an online casino in 2001]

------
mm_alex
surprised all the comments here are about whether he achieved compression or
not. Point = Missed!

the entire thing was an ingenious trolling exercise, as he says, to 'out trick
the trickster'. his goal was to exploit the loose wording of the challenge,
and prove a point (possibly winning $5k in the process); I found it an
entertaining story along those lines.

it proves nothing (new) about compression, or lack of it; he even states that
the consensus at the time was 'unanimous' that no compression had taken place.
that's not where the interest in this link/story lies.

thanks for the original link, OP.

------
adamtj
The argument is over whether the challenge is identical with its spirit or
with its statement. Patrick clearly failed the spirit of the challenge. He
tried to exploit a loophole in the statement of the challenge. So which is the
actual challenge?

Actually, it doesn't matter. Patrick failed to complete the challenge _as
stated_, which is important because he was exploiting a loophole in the
statement. Technically, his files were smaller. But technically, he was
required to _send_ Mike his files and decompressor. Instead, he put them on
his server and sent Mike their location. Mike then had to _get_ the files. It
goes both ways.

Finding loopholes, while clever, isn't terribly interesting.

~~~
jd
> But technically, he was required to send Mike his files and decompresser.
> Instead, he put them on his server and sent Mike their location. Mike then
> had to get the files. It goes both ways.

In that case Patrick could just argue he hadn't sent the files at all (since
Mike refused to accept them by refusing to download them) and that therefore
the bet was still on, so he can still send the files by email or whatever.

------
kevinpet
Look at the relevant part of the exchange:

> > I meant can I send you a compressor and several compressed files whose
> > total file size is less than the original uncompressed file and from which
> > I can regenerate the original uncompressed file
> >
> > Patrick

> Sure -- but you send me a _decompressor_, I don't need the compressor.

Was this "compressor" vs. "decompressor" actually a slip up, or intentional to
distract from the importance of what he was asking?

------
jere
I read through the compression FAQ and found it quite interesting. I had never
heard the counting argument before, but it's very intuitive.

Part of the FAQ is a collection of crazies claiming that they have created
novel compression techniques. One entry is about Jules Gilbert, but it ends
abruptly. I googled him and found that the crazy continues....

> I am afraid that I am not a fan of Google. They are gay-friendly and also
> they suppressed lot's of stuff about Obama before he was nominated, and
> again, before he was elected.

<http://scientopia.org/blogs/goodmath/2010/08/13/969/>

------
DaveLister
Obviously Patrick met the requirements of the loosely worded challenge.

Though I wonder if the inverse would be acceptable. Take 100 files, all the
same size but smaller than the filesystem block size. These files total
n * 100 bytes, but take up blocksize * 100 bytes on the disk. Then, using tar
(which itself does not compress, just tacks files end to end with metadata in
between), 'compress' them into a single file that takes up fewer disk blocks
than before.
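
A rough sketch of that arithmetic (assuming a hypothetical 4096-byte block
size and 100-byte files; real tar also adds a 512-byte header per member,
which this ignores):

    
    # Illustrative arithmetic only: on-disk block usage before and after
    # concatenating many small files, assuming a 4096-byte filesystem block.
    import math
    
    BLOCK = 4096       # assumed block size
    n = 100            # bytes per file (smaller than a block)
    count = 100        # number of files
    
    before = count * BLOCK                          # each tiny file occupies a full block
    after = math.ceil(count * n / BLOCK) * BLOCK    # one concatenated file
    print(before, after)                            # 409600 vs 12288
    

The saving is only in allocated blocks, though; the logical byte count that
the challenge measured is unchanged.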

------
zwdr
I don't think he completed the challenge. He found a loophole, so _maybe_ you
could say he was successful, in exploiting the rules that is, not in solving
the problem that Goldman stated. Still a clever solution, but obviously not
what was asked for.

~~~
raldi
Every technique is "just a loophole" when it's first discovered.

~~~
zwdr
But he didn't actually succeed in getting the desired result; the problem was
that the statement of the problem was flawed. There obviously is a difference
between what Goldman meant and what he said.

------
yk
Perhaps a stupid question, but is the challenge not also solvable with
'honest' comp.sci? The way I think about it at the moment is that I take a
family of functions, C_k, which map the space of n-element strings to the
space of [1, 2n]-element strings (the union of the m-element strings for
0 < m < 2n) in an essentially random way. The output is then in
[1, n - sizeof(C_k) - 1] with some probability p, and if I try long enough, I
find a compression scheme which compresses the given input.

~~~
geoffschmidt
No, it is definitely not possible. If it were, you could do it over and over
again and compress any file down to an arbitrarily small size. Two things to
search for are "pigeonhole principle" and "Kolmogorov complexity."

Essentially the problem is that any encoding of a string either preserves the
length of all possible inputs, or shortens some inputs and lengthens others,
or lengthens all inputs. There is no encoding that shortens all strings,
because of the pigeonhole principle. To apply this to the challenge, see that
the challenge is to encode the string as a program that, when executed,
produces the string. Just because the encoding language is Turing complete
does not mean that it can evade the pigeonhole principle.

~~~
unconed
But the challenge isn't to compress random strings, the challenge is to
compress _a_ specific random string. This means that given enough time, you
should be able to come up with an algorithm or transformation that turns that
specific string into something of lower entropy. I'm not saying this is very
feasible, but it doesn't violate the pigeonhole principle.

For example, there is a fixed-size algorithm to generate the digits of pi,
even though those digits look statistically random.

~~~
im3w1l
Yes, it does. Let's say for simplicity that you are coding your solution in
Python, that your output should be written to standard out, and that the
output file should be 10^6 bytes. Consider the function that takes a source
file of at most 10^6 bytes to the standard output of said program when
interpreted. The set from which the files you need to output are drawn has
cardinality 2^(10^6). The set of Python programs (possibly invalid) that are
strictly shorter has cardinality 2^(10^6-1). If all files were possible to
compress, then the Python interpreter (one program gives at most one output,
possibly none if the program has errors) would have to map the smaller set of
shorter programs onto the larger set of files. This is impossible. Therefore
there is at least one file (actually a lot of them, but this is beside the
point) which cannot be compressed. Mike could send this one.

EDIT: Messed up my bits and bytes and fenceposts: size 1: 256^(10^6). Size 2:
sum (i=1 to 10^6) 256^(10^6-i)
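
To make the edit's corrected formula concrete, here is a toy version of the
count with a tiny hypothetical file length (N = 2 bytes instead of 10^6):

    
    # Toy version of the counting argument. 'files' counts possible outputs of
    # length exactly N; 'shorter' counts all strictly shorter byte strings
    # (the corrected formula from the EDIT above).
    N = 2
    files = 256 ** N
    shorter = sum(256 ** (N - i) for i in range(1, N + 1))
    print(files, shorter, shorter < files)   # 65536 257 True
    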

~~~
im3w1l
Praptak: > Only assuming he'd be able to generate one. Is there a known
algorithm for generating a provably Kolmogorov-uncompressible string?

In the theoretical case he has almost a 255/256 chance of doing it if he
generates a string in a truly random fashion. In the practical case his odds
are much better: many potential programs will a) refuse to run or b) give the
same output as another program.

~~~
praptak
> In the theoretical case he has almost a 255/256 chance of doing it if he
> generates a string in a truly random fashion.

An interesting result. Do you have a link to the proof?

~~~
Dylan16807
Proof:

There are k strings of byte length n.

There are k * 256 strings of byte length n+1.

k input strings can decompress to at most k output strings.

Therefore only k / (k * 256) = 1/256 of the length n+1 strings can be
compressed by a byte.

The reason for 'almost' is the minuscule number of strings that will compress
by more than one byte.

~~~
praptak
Ah right, I totally missed the fact that compression by 1 bit doesn't beat the
challenge!

------
sp332
_I think anyone would agree that if you sent me an uncompressed file of size
1000 bytes and I sent you a decompressor of size 400 bytes and a compressed
file of size 599 bytes, I would have met the challenge in spite of the fact
that these two files will take up more disk space than the original file._

I don't agree: since the decompressor can be tuned to the data, you could just
move some of the data into the "decompressor" and end up with an arbitrarily
small "compressed" file.

~~~
anonymouz
Read again: The sum total of the file sizes of the "decompressor" and
"compressed file" had to be less than the original file. So moving data into
the decompressor would do nothing.

He did "win" on a technicality: The challenge did not count the size of the
filesystem metadata towards the sum, so he hid information there.

Also tuning the decompressor to the specific data-set was entirely allowed,
even intended. Otherwise it would be obvious that the challenge is impossible,
in this way it depends on a sufficiently random data set.

~~~
sp332
I'm responding specifically to the part that I quoted, which would have
allowed the decompressor + compressed data to be larger than the original
data.

~~~
raldi
The part you quoted doesn't say that.

Regardless, it boils down to this: Do you feel the rules of the challenge
forbid shifting an arbitrary amount of information from the compressed file
into the decompressor?

~~~
loumf
Shifting from the compressed file to the decompressor is totally allowed. What
is not allowed is shifting it into the filesystem. For example, he could have
named the decompressor file with the first 512 bytes of data and then had the
decompressor concatenate the filename with the rest of the data. However, the
filesystem is still storing more data than it was before.

------
jules
Since he offers $5000 but you only need to pay $100 to participate, it is
quite easy to win money with the challenge. Ask for a file of 2 bits in
length. If it is "11" then compress it to "1". Otherwise compress it to "0X"
where X is the contents of the file. Now you have a 1 in 4 chance of winning
the challenge. Since you pay $100 but win $5000, this earns good money. Plus
it would be hard to argue that this is cheating.

The problem is that you need to supply the decompressor as well. It would
depend on how exactly he wants the decompressor to be encoded, but the same
strategy may still apply with modifications (although it would not work if the
required encoding is as a bash script, since lg(5000/100) gives you only about
5.6 bits to play with). [Or you could ask him to modify the challenge such
that you are allowed to send him an encrypted decompressor _before_ he sends
you the data, which doesn't count towards the total file size (encrypted
because otherwise he'll see your strategy and send you a file that it
_doesn't_ work on).]

tl;dr: although you can't compress random data with probability > 1/2, you can
compress random data with probability > 100/5000.
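
As a rough sanity check of the economics (ignoring the decompressor-encoding
caveat above):

    
    # Expected profit of the 2-bit strategy: win $5000 with probability 1/4,
    # pay $100 per attempt.
    entry_fee = 100
    prize = 5000
    p_win = 1 / 4                        # the requested 2-bit file is "11" with probability 1/4
    print(p_win * prize - entry_fee)     # 1150.0 dollars per attempt, on average
    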

~~~
lurker14
Yeah, so you could win a different incorrectly posed challenge from the
original incorrectly posed challenge.

~~~
jules
As I said, you can also win the original challenge with this method. The
original challenge is not fully specified. In particular it is not specified
how the decompressor is to be encoded. There are encodings for the
decompressor for which the strategy that I described does work, and encodings
for which it does not work.

The interesting attribute of this method is that it does genuinely compress
the file: it doesn't store data in hidden attributes like the method of the OP
(which stores information in the file lengths).

------
debacle
I wonder what the probability is that a file is not compressible.

Even more, I wonder if the likelihood that a file is compressible increases,
decreases, or is constant with relation to file size.

My assumption would be that 50% of files are not compressible at a given size,
and that compressibility doesn't vary with file size.

~~~
praptak
> My assumption would be that 50% of files are not compressible at a given
> size, and that compressibility doesn't vary with file size.

This would mean that say, for a 2 MB random file there's 75% chance that at
least one of its halves is compressible but only 50% chance that the file as a
whole is compressible. Not impossible but suspicious if you ask me.

~~~
debacle
No, it would mean for a 2mb random file, there's a 50% chance that at least
one of its halves is compressible.

~~~
praptak
Do you also state that there's only 50% chance of having at least one tails
when throwing two (edit:unbiased) coins?

------
benjaminwootton
I guess I'm missing something here :-)

    
    
      ls -la /bin/tar
      230464 Feb 25  2010 /bin/tar
    

If tar can compress and decompress an arbitrary length file by more than
230464 bytes then doesn't it meet the challenge?

Even if I've misunderstood, can anyone explain in plain English why this is a
futile task?

~~~
bradleyjg
Lossless compression works by exploiting redundancy in the data.

If I want to compress a text file that contains the word
'uncharacteristically' 1000 times, I can simply replace every occurrence with
a certain short unused byte sequence and record the mapping in the header.

Similarly in an image file, if there is a block of 50 pixels that are all the
same color, I can replace that block with a special byte sequence and record
the coordinates and color in the header.

But if I have truly random data, there are no patterns to exploit. The
shortest means of representing random data is the data itself.

This is provable using information theory (which in a way is a branch of
thermodynamics).
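
A minimal sketch of that substitution idea (the token byte and the text are
made up for illustration):

    
    # Toy dictionary substitution: replace a frequently repeated word with a
    # short unused byte, storing the mapping in a tiny "header".
    text = b"uncharacteristically " * 1000
    token = b"\x01"                      # a byte that never occurs in the text
    
    header = b"\x01=uncharacteristically\n"
    compressed = header + text.replace(b"uncharacteristically", token)
    restored = compressed[len(header):].replace(token, b"uncharacteristically")
    
    print(len(text), len(compressed))    # 22000 vs 2024
    assert restored == text
    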

~~~
lurker14
nitpick: There is no such thing as a truly random piece of data. "Truly
random" is only meaningful in the limit as the file size goes to infinity. The
string "000000" could be generated by a truly random process just as well as
any other.

~~~
bradleyjg
I meant Kolmogorov randomness which is defined via incompressibility.

~~~
lurker14
Fair enough, but Kolmogorov randomness is an existence concept, not a
constructive concept, so I find it weird to talk about having such a string in
this case, where a real person sets about generating a concrete file.

Can you generate a Kolmogorov-random file on your particular Unix system? That
requires specifying what "your Unix system" means rather rigorously, which
seems unpleasant.

------
ealloc
If you just have to decompress a single file, I don't see why this problem is
impossible since you can tune your decompressor to that file.

For example, files containing the digits of pi or e will have a uniform
distribution of bytes (and therefore cannot be compressed by exploiting
redundancy), but it is very easy to write a "decompressor" for them, as there
is a simple algorithm for generating those digits.

Is there a simple algorithm for every possible number? Because then it just
becomes a question of finding the algorithm that generates that number. My
intuition tells me "yes" but that it may be very hard to find it. Plus, you
don't need to find an algorithm for the whole file, just some subsection of
it.

~~~
romaniv
Think about it as identifying patterns. You can always remove chunks of the
original file and then re-add them with an instruction like: "Add pattern
23989898989 at the 234th, 387th and 1028th positions of this file." If your
pattern is big relative to the file size and repeats often, you will save
space. However, if the file is truly random, the larger the pattern, the less
frequently you will see it. Very large patterns will be seen only once. Very
small patterns will require more information to encode their positions than
you save by removing them from the file.

~~~
ealloc
Yes, I see why a repetition/pattern approach wouldn't work, but I'm proposing
something else: That the file may be generated by an algorithm, in the way one
generates the digits of pi using an algorithm.

It's similar to how if I suspected he had generated his random file with a
mersenne twister I could simply guess his seed and reproduce the file from it.
(Though, I realize now this is an NP-hard problem)

------
Tipzntrix
Tsk tsk... hackers and a coding challenge are the last things you should
expect to be devoid of loopholes and tricks. Always be extra careful making
them, and don't bet more than you can afford, or you'll come out with a rep
like Mike's.

------
Beltiras
    
    def decompress(filesize, sha1hash, openingbits, closingbits):
        for candidate in bits(filesize, openingbits):
            if sha1(candidate) == sha1hash:
                if candidate[-(len(closingbits)):] == closingbits:
                    return candidate
    

~~~
StavrosK
That'll only take about 10^585 years, assuming you can do a billion hashes a
second. Good luck!

~~~
Beltiras
Nobody said it would need to be quick.

~~~
gelisam
I thought of using hashes too, but then I realized that after thousands of
years, the program is only going to output a collision, and not the original
data.

Checking the first and last few bits reduces the number of collisions, but
there are still 2^unchecked_bits/2^160 cyphers which map to the same hash. You
could store the number of collisions to skip along with the hash, but that
number will be about (unchecked_bits - 161) bits long, so that won't work
either.

~~~
Beltiras
Adding md5 (or whichever other hashing function), I have a feeling (but no
direct proof) that it's not just equivalent to adding bits to the sha hash.
Can anyone with a better sense for hashing than me explain why my intuition is
wrong?

~~~
gelisam
Whether you have two 128-bit hashes or one 256-bit hash, you're still trying
to fit 2^n pigeons into (2^128)*(2^128) = 2^256 pigeonholes.

Maybe your intuition is telling you that two hashes must be more secure
because if one of the two hash functions turns out to be easy to break, then
at least you will still be protected by the other?

~~~
Beltiras
No, what I was thinking was more like coverage. For any given hash, you do NOT
have the same coverage for the other hash function. Let's make a bucket of all
the originating bit strings of length n that deliver a collision for one hash
function. You are not guaranteed that the other hashing function distributes
that bucket uniformly. I still think I am wrong, but you did not convince me.

------
bagosm
Somebody please enlighten me. Is it that hard to find a pattern (for example
Fibonacci) that only has zeros up to F100, then remove those zeros?

You don't have to start from the first bit, you can start from bit 11234
(information stored in the decompressor).

So for example I skip 11234 bits, then run F(i) bits to the right and add a
zero, then run F(i + 1) bits to the right and add a new zero, etc. The pattern
has to be just long enough to cover the decompressor's size, which shouldn't
be too much if scripting is allowed!

I believe that discovering such a pattern shouldn't be too hard in a single
file, arbitrarily large (but 3 MB should be enough too).

~~~
macrael
You are welcome to try! This really is an application of number theory, and
the likelihood of there being a sufficiently long pattern in truly random data
is actually pretty low.

------
bo1024
This brings up some cool questions about compression. Can any experts help me
out here?

How many n-bit strings can be compressed? (I think this question is how many
have a Kolmogorov complexity of less than n.)

Also, what is the difference between the statements "no program can compress
every string" and "not every string can be compressed by some program". I
think both are true, but they're referring to different strings, so it's kind
of odd....

~~~
IanCal
> Also, what is the difference between the statements "no program can compress
> every string" and "not every string can be compressed by some program".

The first is true, and you can show this with the pigeonhole principle. The
second isn't true. The reason for this is that you can put different
information in each program.

To use the example from recursive below:

> You could do it like this: If the first bit is a 0, remove it.

There's a program that'll work for half of the strings. For the other half,
you could use the program "If the first bit is a 1, remove it.".
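
A tiny sketch of that pair of programs (hypothetical helper names; bit strings
represented as Python strings):

    
    # Two complementary "compressors": each shortens exactly half of all bit
    # strings (the ones starting with its chosen bit), so together they cover
    # every string, even though neither works on everything.
    def strip_leading(bit):
        def compress(s):
            return s[1:] if s.startswith(bit) else None   # None = "can't help here"
        return compress
    
    drop_zero = strip_leading("0")
    drop_one = strip_leading("1")
    
    print(drop_zero("0110"), drop_one("0110"))   # 110 None
    print(drop_zero("1011"), drop_one("1011"))   # None 011
    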

------
tocomment
Can anyone explain what this section is talking about? I'm completely not
following it:

    
    cat $1
    printf "X"
    cat $0
    exit
    

~~~
quasse
Say I have an original file: AAAAAAAAXAAAAAAA

If I split it into two files, $1 and $2, that look like this:

    
    $1 - AAAAAAAA
    $2 - AAAAAAA
    

I have now "saved" one byte by dropping the X.

The quoted script will combine those two files and reinsert the X, thus
"decompressing" the data. Basically, he has replaced the byte "X", which
counted towards the total size, with the EOF marker on the filesystem, which
technically didn't count.

No space was actually saved, because he basically moved the information from
the X character into metadata (the file splits).
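
A rough sketch of the trick in Python (the data is made up; the point is that
the pieces hold one byte less than the original, while the dropped X is now
encoded in where the file boundary falls):

    
    # Toy version of the split-on-X trick.
    original = b"AAAAAAAAXAAAAAAA"
    
    pieces = original.split(b"X")             # [b"AAAAAAAA", b"AAAAAAA"]
    stored_bytes = sum(len(p) for p in pieces)
    
    restored = b"X".join(pieces)              # the "decompressor": rejoin with X
    print(len(original), stored_bytes)        # 16 vs 15
    assert restored == original
    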

~~~
tocomment
but the script is longer than the one byte that he saves, right? So the total
is greater than the original file.

That's where I'm getting confused.

~~~
im3w1l
He won one byte per file.

------
tbirdz
What he should've done is just create a new file whose name is a text encoding
of the data in the original file, but whose contents are only 1 byte long.
Then to decompress, just take the name of the file and use that to recreate
the data.

~~~
anonymouz
If you read the linked page, you'll see that he considered this, but dismissed
it as cheating.

Also, in the end he did hide information in the file metadata, just in a much
more subtle way.

------
jimmybrite
Let's all send him $100, jackass.

------
lurker14
This was the day that Mike learned the difference between theory and practice:
In theory, there is no difference between theory and practice; but in
practice, there is.

When I read the challenge, and saw that arbitrary Unix programs were allowed,
my first idea for a compressor was to scan the OS software for snippets of
data that match the source data.

Of course, that was 10 years ago. Today, in the Cloud world, I'd just write a
decompressor that connects to my system over the network and retrieves the
source data.

~~~
recursive
You can be pretty sure that you wouldn't find the target in the OS software.
For a well constructed output file, that has about the same likelihood as a
randomly generated cryptographic hash collision.

As for your cloud solution, that could be defeated by running the decompressor
offline.

~~~
mcherm
You wouldn't find the entire target, but you would find files in the operating
system that contain subsets of the target file.

~~~
meanguy
Sigh. Claude Shannon is rolling in his grave.

~~~
shimon
Why is this an invalid approach? If you have an agreed-upon OS installation,
you've got at least several hundred megabytes of binary data from which you
can extract sequences for a lookup dictionary. To compress, you locate the
largest substrings of the input that also appear somewhere in the OS data. For
example, you might find that bytes 1000-2000 of the input data are the same as
bytes 4528-5528 of /bin/ls. Then you just need a program that can extract this
byte range from /bin/ls and fill it in at the appropriate place.

Of course, it isn't a given that you will be able to find a sufficiently large
overlapping substring in the OS binary data. And it may not be allowed to
assume an agreed-upon OS installation, although it was OK to assume that
gunzip was available. And finally, doing this sort of
largest-overlapping-substring search could be very, very slow.

The real key here is that the challenge is about compressing one specific
random file. It's clearly impossible to write a compressor that can represent
ANY sequence of N bytes in fewer than N bytes including the decompressor. But
this challenge is constrained to reconstructing a single input file, not
handling arbitrary inputs. If the input files are random, and you keep asking
for them, there's a tiny chance you'll get one with enough repetition that
even gzip could compress it. If you can key against a large set of reference
data, your odds improve significantly.

So, I don't know if this is a practical approach given the restrictions of the
contest, the cost of entering, and the expected payoff, but I would say that
if it is plausible to use a large body of OS files for lookup, this could be a
winnable challenge.
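
A rough sketch of that reference-dictionary idea (the reference path, run
length and payload are placeholders; whether a long enough match actually
exists in practice is exactly the open question):

    
    # Toy reference-dictionary "compressor": represent runs of the input that
    # also occur verbatim in an agreed-upon reference blob as (offset, length)
    # pointers, and everything else as literal bytes.
    from pathlib import Path
    
    reference = Path("/bin/ls").read_bytes()   # assumed shared reference data
    MIN_RUN = 16                               # a run must beat the pointer size to pay off
    
    def encode(data):
        out, i = [], 0
        while i < len(data):
            chunk = data[i:i + MIN_RUN]
            pos = reference.find(chunk) if len(chunk) == MIN_RUN else -1
            if pos >= 0:
                out.append(("ref", pos, MIN_RUN))   # "copy MIN_RUN bytes from reference at pos"
                i += MIN_RUN
            else:
                out.append(("raw", data[i:i + 1]))  # literal byte
                i += 1
        return out
    
    def decode(tokens):
        out = bytearray()
        for tok in tokens:
            if tok[0] == "ref":
                _, pos, length = tok
                out += reference[pos:pos + length]
            else:
                out += tok[1]
        return bytes(out)
    
    payload = b"whatever bytes the challenge file contains"   # hypothetical input
    assert decode(encode(payload)) == payload
    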

~~~
romaniv
_Of course, it isn't a given that you will be able to find a sufficiently
large overlapping substring in the OS binary data._

It seems that it's a given that you will _not_ find that substring. The larger
it is, the less likely you will find it and the less likely it will repeat in
the uncompressed file.

<http://news.ycombinator.com/item?id=4617983>

~~~
shimon
Unlikely but not impossible. Note that the substring does NOT need to repeat
in the uncompressed file; it just needs to be possible to say "copy bytes X
through Y of /bin/ls here" in fewer than Y-X bytes. I'd love to know what the
odds really are, but it seems like you have some good options for improving
them (e.g. ask for a really huge input file).

~~~
recursive
> it just needs to be possible to say "copy bytes X through Y of /bin/ls here"
> in fewer than Y-X bytes

It is very unlikely that any matching sequence will be longer than the space
required to store X. Your plan is similar to saying "The input could contain
long repeated segments containing only zeroes. Compress those using scheme Z.
The longer the file, the more likely this is to occur." Information theory
allows us to prove that this cannot always work. In fact it is very unlikely
to work for a specific randomly generated target file.

~~~
jwecker
So by saying 'cannot _always_ work' and 'very unlikely to work', you are
implicitly conceding the point: that it is possible. The question then
becomes, given the size of the OS environment space (and the quality of the
randomness in the originals), could one in 50 attempts actually succeed in
shaving off a byte? I sincerely doubt that Mike, despite his superior
knowledge of information theory, actually took the time to figure out just how
good or bad the odds were. He assumed that it was fundamentally impossible,
when it is in fact not impossible, just unlikely (to some unknown [to me]
degree).

~~~
recursive
Let me just put it this way. It's more likely that you could win using gzip.
It's similar to saying it's "possible" that the target file could have
contained all zeroes.

Assuming the target file is competently constructed, the chances of winning
wouldn't make it a rational bet for $0.01

~~~
shrughes
I'm so sorry you had to suffer under these postings.

------
lurker14
This was a great social engineering hack: The original challenge, poorly
stated as it was, was not beaten. Craig started up a conversation with Mike,
and tricked him into changing the rules. If Mike were careful, he would have
paused to think about whether the proposed rule change would open the door to
a hack.

It's a reminder to all of us that correctness is an ongoing process, not a
delivered product.

Mike started the thread as a smart-alec Randi-esque elder, putting up money to
shut up cranks. But he got blinded by his hubris.

~~~
im3w1l
The original rules would have been beatable too. Just put the last 200 bytes
in the filename.

~~~
lurker14
The rules were certainly unclear on that point.

Mike didn't guarantee to preserve the file metadata. It seems eminently
defensible for him to say that he was dealing in files, not filesystems, so he
could rename the file upon receipt. In fact, the EOF-hack in Craig's solution
turned on the assumption that the filesystem could be hacked by the
challengee, so it seems fair to allow the challenger to hack the filesystem as
well.

------
lurker14
Mike wrote a document (the challenge) and offered a $5000 bug bounty. Patrick
found a bug; Mike should have paid up and fixed the bug in the spec, just like
Knuth pays for typos and Google pays for security bug reports.

