
The $5000 Compression Challenge - vq
http://www.patrickcraig.co.uk/other/compression.htm
======
paulsecwhatt
I absolutely believe Mike should have paid Patrick.

On the simple premise that since Mike was hosting a bet that he KNEW was
impossible (i.e. under no circumstances, ever, would he have to pay the
$5000), literally the only point of the game is to find loopholes. Otherwise
it's just Mike preying on unsuspecting victims.

If you design an impossible game, the only possible thing for anyone to do is
to break it. If you then complain that THAT is cheating, you're a pedantic
idiot - one of those annoying kids in middle school who loses a bet and then
tries every possible way to weasel himself out.

To add fuel to the fire, his obnoxious replies, such as "I tried running the
first two files and it didn't work", make my blood boil; they're a clear
attempt to belittle the contestant.

~~~
brownbat
> If you design an impossible game, the only possible thing for anyone to do
> is to break it. If you then complain that THAT is cheating, you're a
> pedantic idiot

I feel the same way when casinos bust card counters. You use math to take
money from suckers. When other people use your rules and better math to take
money from you, that's suddenly deeply immoral.

~~~
titanomachy
I don't think anyone feels that card-counting is deeply immoral. It's more
like, "we lose money when people do this, so we will do everything in our
power to prevent it."

------
kazinator
Mike Goldman originally wrote the challenge such that it calls for one file
and one decompressor.

However, when subsequently asked whether there can be multiple files, he
agreed; thereby he was arguably duped. He didn't say "okay, but there will be
a 256 byte size penalty per additional file", he just plainly agreed.

This means that the original formula for adding the size of the solution
applies: just the file sizes added together.

Goldman should accept that he foolishly rushed into a careless amendment of
his original challenge and pay the money.

That said, it obviously is cheating to have the archive format or file system
hide the representation of where the removed bytes are! If a single file is
produced, it has to include a table of where to insert the bytes that were
taken out. If multiple files are produced, the archive format or file system
stores that information for you at considerable expense.

If both people are wrong, the contest should be declared invalid and Goldman
should return the $100.

If only Goldman is wrong, he should pay $5000.

Under no interpretation is Goldman strictly right and the contestant strictly
wrong.

So he is wrong to keep the $100 in any case.

~~~
gamblor956
Nope, it's perfectly acceptable for Goldman to keep the $100. Craig tried to
play a game of semantics and lost, because semantics let Goldman weasel out
by pointing to OS metadata.

The challenge clearly, repeatedly, stated that Goldman would give $5000 to
anyone who could _compress_ a datafile such that the combined size of file and
decompressor was smaller than the original datafile.

Craig did not compress the stored data. He split the file up into multiple
smaller files, using "5" as the boundary. However, by creating 218 files, he
increased the amount of space required to store the data. Ergo, the combined
size of the "compressed" files and the decompressor exceeded the size of the
original file. It is only by excluding the space taken up by the metadata
that the "compressed" files are smaller. Furthermore, elsewhere it was noted
that: "The FAQ clearly outlines that filesystem exploits are not true
compression."

~~~
okatzzz
This interpretation is inconsistent with Goldman's own statement about the
original data that "the file size is 3145728". He didn't say "the file size is
3145728 plus some file system overhead", so by file size he was thinking of
the number of bytes in the file ... until he was outsmarted.

It's hardly a filesystem exploit if - again by Goldman's own statement -
gunzip is allowable.

~~~
kazinator
I suspect that if the challenge had been solved with a single file, Goldman
would try to get out of paying by claiming that the program's size should
include the size of the interpreter for its language, and the libraries linked
to that, the size of the command line needed to invoke it (including the
pointer vector and null termination), not to mention the underlying kernel ...

~~~
emn13
I don't know Goldman, and I bet you don't either - but there's a pretty big
difference between this solution (which clearly cheats the aim of the
challenge) and a solution that actually compresses. People hate to reward
cheaters, even if it's a fun kind of cheat from the outside. But that doesn't
mean he wouldn't have paid out for a real solution, which likely would have
been quite interesting (and not quite as impossible as it's being made out to
be, since we don't know whether his random source is truly random).

~~~
Dylan16807
What does "actually" compressing mean?

Replacing every "5" with EOF is apparently bad.

What if he replaced every "5z" with EOF? Fewer bytes there.

What if he had a variant of LZ77 doing dictionary encoding followed by a range
encoder that outputs symbols in the range -1 through 255? Even counting the
EOF as a character, this would give an output 2K characters smaller. Sounds
like compression to me. It's finding common sequences and uncommon sequences
and rescaling them based on probability to remove redundancy.

------
nothrabannosir
Quote from Mike:

> Rather, you simply split the file into 218 parts ending with the character
> "5" and then stripped that final character from each part. Thus the
> "decompressor" is nothing more than a reassembler, concatenating the parts
> and reappending the character "5" after each.

Well, that's exactly the definition of lossless compression. Look at e.g. how
js crunch works: you create a dictionary of common sequences, split the file
on those sequences recursively and then reassemble it by joining in reverse.
Gzip, bzip2, &c, &c, it's all the same thing. Split the file by a common
sequence and reassemble it by that. Patrick just created a customized
compressor that went only 1 level deep.

Normally you'd need a delimiter to separate those chunks, a delimiter that
doesn't occur in the chunks e.g. through padding or escaping. That, in turn,
increases the filesize, and now you're in trouble. What Patrick did was to use
EOF as a new fresh "delimiter" that doesn't occur anywhere, and at a cost of
zero bytes, no less.
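
For concreteness, here's a minimal Python sketch of that scheme (my own
reconstruction from the description above, not Patrick's actual code):

    def compress(data: bytes) -> list[bytes]:
        # cut after every "5" and strip it; each file's EOF marks the cut
        parts, start = [], 0
        for i, b in enumerate(data):
            if b == ord("5"):
                parts.append(data[start:i])  # the "5" itself is not stored
                start = i + 1
        parts.append(data[start:])           # tail after the last "5"
        return parts                         # each part becomes its own file

    def decompress(parts: list[bytes]) -> bytes:
        return b"5".join(parts)              # re-insert "5" at every boundary

The parts' sizes sum to the original size minus one byte per "5"; the
boundary information has simply moved into the filesystem.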

Cheating, or inventive.

~~~
barrkel
The EOF is not at a cost of zero bytes; it costs as much as storing the length
of each constituent file. The extra space used is in the file system
accounting.

~~~
Dylan16807
It's at a cost of 0 competition score bytes. Mike screwed up by allowing an
alphabet of 257 symbols and then only counting 256 of them. Pretty much any
compression or repacking algorithm could have been used at that point.

~~~
cnvogel
Dylan16807, that's a very concise way to put it, thanks for making that
comment.

------
snowwrestler
The point of the challenge was to tempt people who do not understand
compression as well as Mike does into putting themselves in a position for
Mike to mock and/or shame them. In that respect, it seems to me like it was a
trick.

In my experience, people who set up such tricks do not usually respond well
when the tables are turned.

There are some people in the world who take it personally when other people
don't understand their area of expertise as well as they do. They get angry
and offended at naive questions, and seek to punish the idiots. This is a
great way to take an interesting subject and ruin it for everyone.

One of the best aspects of the HN culture is that experts here tend to incline
more toward teaching and less toward chastising. It's a nice change from
Usenet.

~~~
x0x0
Why do you think that, rather than that Mike is genuinely interested in novel
compression methodologies and willing to pay some money to get interested
people to attempt to discover them?

~~~
pgaddict
Because that's pretty much what he says in his post to comp.compression, where
he announces that someone accepted the challenge. Let me quote:

> Before naming the individual and giving additional details of our
> correspondence, I would like to give him some time to analyze the data I
> will be sending him. It would be very easy to point out to him the
> impossibility of his task, but far more interesting to see how long he will
> struggle with the problem before realizing it for himself.
>
> I am supposing that one of his fellow co-workers probably referred him to
> my challenge, as I cannot fathom that someone would read the
> comp.compression faq first and then want to participate after understanding
> the futility of the effort. On the other hand, some people just don't
> understand information theory too well.
>
> I'll try to give him a complete explanation of his error after a week or
> so, I guess. :)

So Mike is just smug about how clever he is and how stupid the other person
is, never suspecting there might be a loophole in the challenge.

I see no sign of interest in learning what the other person is up to, or even
admitting that there might be something to learn.

------
compbio
A trick similar to the recursive Barf compressor (add information to the
filename).
[http://mattmahoney.net/dc/barf.html](http://mattmahoney.net/dc/barf.html)

A longer-running challenge is [http://www.drdobbs.com/architecture-and-
design/the-enduring-...](http://www.drdobbs.com/architecture-and-design/the-
enduring-challenge-of-compressing-ra/240049914). No entry fee, $100 prize,
and just as unfair.

A completely serious compression challenge with serious consequences for AI
and NLP: [http://prize.hutter1.net/](http://prize.hutter1.net/) - up to
$50,000 in prize money, but severe restrictions on memory and running time.

You cannot beat Goldman's trollish challenge (certainly not if the rules are
retroactively clarified in favor of the organizer). You could, however, try
to put the challenge in limbo by creating a decompressor which brute-forces a
solution, 'till some hashes match or the final heat death of the universe or
the halting problem is solved, whichever comes first. Goldman will never be
able to verify your solution, and if he somehow does (theoretically it is not
impossible), it means you win.

Or, instead of the above Schrödinger's Compressor, you can send a good random
number generator back as your solution. If Goldman wants his file to be
random, any random file should do. Why does he want exactly his own random
file? Why does he want to do a diff between two random files, is he perhaps
looking for order where there is none? But that's the same foolishness he
accuses his participants of.

~~~
bo1024
> A longer running challenge is [http://www.drdobbs.com/architecture-and-
> design/the-enduring-...](http://www.drdobbs.com/architecture-and-design/the-
> enduring-..). No entry fee, $100 prize, and just as unfair.

At the 10-year scale, a whole new set of tricks opens up. Invent a
sufficiently popular programming language, or contribute a lot to the Linux
kernel, and start surreptitiously hiding bits of the file on his OS (the
easiest would be for your language to have a builtin function that spits out
a small part of the file).

~~~
emn13
That was explicitly forbidden :-)

------
thomasahle
AFAIK, information theory requires the _expected_ size of a 'compressed' file
to be at least as large as the original.

So we could create an encoding that compresses N/50 of the possible strings
down to lg(N/50) = lg(N) - lg(50) bits. That would save us lg(50) > 5 bits
with a 2% chance. In this game we have 50 tries ($5000/$100), so we'd be
pretty likely to win.

The correct price for this game is probably closer to $200.
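
A quick sanity check of those odds in Python (assuming the 2%-per-try figure
above):

    # chance of winning at least once across 50 independent tries at 2% each
    p_win = 1 - (1 - 1/50) ** 50
    print(p_win)        # ~0.636

    # expected payout per try: 2% of $5000 = $100, exactly the entry fee, so
    # $100 is break-even even before counting the decompressor's size
    print(0.02 * 5000)  # 100.0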

~~~
Dylan16807
The problem is that even the most trivial Linux-compatible decompressor will
add a few bytes, and there's almost no chance of saving that many bytes. But
sure, it wouldn't hurt to require that the result be 100+ bytes smaller.

------
bdcs
I think Patrick sums it up quite well. Perhaps the interesting thing is
imagining how this would go down 15 years later in 2015: Patrick asks if the
bet is available; it is. They enter into a 2-of-3 bitcoin transaction with
3rd-party escrow. Patrick and Mike sign the terms (probably written in
pseudocode or python) using their sending bitcoin addresses (or GPG keys).
Filesharing is an order-of-magnitude easier than setting up FTPs with personal
IP addresses. The bet is promptly won by Patrick.

Ah, how things have changed in 14 years.

~~~
qopp
They could have used an escrow service 15 years ago and the challenge terms
could have been defined as a Python program since Python is 24 years old.

~~~
nadaviv
Getting someone with domain expertise on the matter to provide escrow is non-
trivial. Escrow trust accounts are heavily regulated in most parts of the
world and require licensing, bonds and lawyers. I highly doubt they would
find someone willing to go through all this. The cost of (legally) operating
an escrow is probably higher than the entire bet... (He could also do this
without licensing and take on the legal risk, I guess. That might go under
the radar for small things like this, but it doesn't really work at scale.)

Bitcoin improves on that by not requiring a trust account - their trusted
third party would simply hold one key in a 2-of-3 multi-signature scheme,
giving him the authority to resolve disputes and adjudicate between them, but
without holding any funds under his full control.

I find the legal implications of Bitcoin smart contracts very exciting - this
significantly lowers the entry barriers for providing many kinds of financial
services and opens up these markets for competition in a way that was simply
impossible before. There's lots of room for innovation and disruption with
that.

Disclaimer: standard IANAL/TINLA apply, but I'm the founder at a startup that
facilitates exactly that
([https://www.bitrated.com/](https://www.bitrated.com/)) and received
extensive legal guidance on the matter.

~~~
leereeves
Bitcoin makes the technical process easier but doesn't help with the hard
problem (as you said): finding a third party with domain expertise, whom they
both trust, who is willing to adjudicate at very low cost.

It sounds like your startup is trying to solve that and create a "Trust
Marketplace". Godspeed. Establishing trust between strangers is a very hard
problem.

And while you may find a way to innovate around existing laws, new laws will
be written.

~~~
nadaviv
> finding a third party with domain expertise, whom they both trust

I would say that in this specific instance, they could've quite easily agreed
on a reputable user from the comp.compression newsgroup whom they both trust.

> adjudicate at very low cost.

It makes sense that arbitrators for niche markets with considerable domain
expertise could charge a premium for their services. Say, $50-$150 for that
specific case doesn't seem like a stretch.

Also, note that the fee could be charged only in case of dispute. (and indeed,
many trust agents on Bitrated offer their services for free or nearly-free
when there's no dispute and no work on their part. 0.1% base fee + 2% for
disputes seems to be a popular fee structure.)

> Establishing trust between strangers is a very hard problem.

Indeed! This is the main problem we're trying to tackle, which is arguably
much harder than providing the technological platform for payments.

> And while you may find a way to innovate around existing laws, new laws will
> be written.

The thing is, we aren't taking advantage of some "loophole" or anything like
that. The escrow regulations exist for a reason and make a lot of sense -
holding funds on behalf of others should have strict regulations attached to
it. Escrow providers are trusted to keep the funds safe from thieves, not to
"run away" with them, and to resist the temptation to invest user funds to
make a (potentially quite high) profit while holding them (which could result
in losing them, even in relatively "safe" investments).

With multi-signature, none of that risk exists, so it makes sense that the
regulation won't either. Even when new laws get written to address this,
they're likely to be much less strict.

The legal situation with multi-signature is very similar to a binding
arbitration clause, so we anticipate regulations to be based off of that (and
arbitration is significantly less strictly regulated than escrow).

------
yk
Previous discussions:

[https://news.ycombinator.com/item?id=5025211](https://news.ycombinator.com/item?id=5025211)

[https://news.ycombinator.com/item?id=4616704](https://news.ycombinator.com/item?id=4616704)

------
BenderV
"I still think I compressed the data in the original file. It's not my fault
that a file system uses up more space storing the same amount of data in two
files rather than a single file."

I don't really agree with that, given the fact that he used the information
about the size of the files.

------
cpks
Mike Goldman is now immortalized on the internet as someone who welches on
bets...

~~~
GhotiFish
It's not so clear cut. The filesystem really was acting as a "table" of index
values showing where to reinsert the characters that were removed.

------
progrn
Can someone explain why this is not possible? I understand why sending a
decompressor beforehand is not possible for all inputs. I don't understand
this formulation of the problem, where it only needs to work for one input
that you get before you need to create the decompressor.

~~~
raverbashing
Some simple explanation

Compression exploits redundancy in a data stream (basically). You basically
get "all symbols" (and how you define this varies according to your
compression method: you could do all letters in the case of text, or even text
snippets that repeat, etc) and reassemble them in a way that the ones that
repeat the most take less space (and you also need to start from a basic
dictionary known by all uncompressors or ship it with your compressed file)

One simple analogy is writing with abbreviations, but if you write e.g. the
reader has to know what "e.g." means or you have to put in the beginning "e.g.
= example" (and this also takes space)

Now, a randomly generated file ideally has all symbols repeating with the same
frequency, (we say all symbols have the same entropy - I'm not sure about this
exact wording), hence you can't take a symbol that repeats more or less and
make it take less space in your compressed file
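
A quick Python illustration of that point: gzip shrinks redundant text
dramatically, but makes uniformly random bytes slightly bigger.

    import gzip, os

    redundant = b"the quick brown fox " * 50000  # ~1 MB of repeating text
    random_bytes = os.urandom(len(redundant))    # ~1 MB of random bytes

    print(len(gzip.compress(redundant)))     # a few KB: huge savings
    print(len(gzip.compress(random_bytes)))  # slightly LARGER than the input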

~~~
s369610
what if instead of using your own dictionary, you use an index into an
existing dictionary? such as an index into a subsequence of pi. Couldn't you
then find a sequence of bytes in the file in which the index into pi takes
less bytes and then replace them all with the index? If you couldn't find any
in pi use e or another such number? What am I missing

~~~
raverbashing
In this case your dictionary either doesn't have everything or to adequately
point to it you take as much space as not using it.

While Pi has all pairs of 2 digits, your index would take more space than
storing the pairs itself (because you might need to go beyond position 99)

For one situation you might "get lucky" and find a coincidence, but this won't
scale generically

~~~
gbl08ma
See also: [https://github.com/philipl/pifs](https://github.com/philipl/pifs)
and
[https://github.com/philipl/pifs/issues/23](https://github.com/philipl/pifs/issues/23)

I admit this fooled me for a bit. Good news is, I won't be fooled again by
something similar :)

------
TillE
In theory, I quite like the solution mentioned in the earlier threads: request
a file that's a few kilobytes, then get two or three different hashes of the
file, and write a "decompressor" that generates random files and checks the
hashes.

It's just a shame that the heat death of the universe will probably occur
before your program finishes.
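
A sketch of what such a "decompressor" might look like (the digest and length
below are placeholders, not values from the actual challenge):

    import hashlib
    from itertools import product

    TARGET_SHA256 = "<digest shipped instead of the data>"
    LENGTH = 4096  # known length of the original file, in bytes

    def brute_force_decompress():
        # enumerate all 256**4096 candidate files; heat death arrives first
        for candidate in product(range(256), repeat=LENGTH):
            data = bytes(candidate)
            if hashlib.sha256(data).hexdigest() == TARGET_SHA256:
                return data  # first match; a second hash would cut the
                             # risk of returning a colliding file instead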

~~~
piannucci
Sorry, but no. There are far more files with a given hash than just the one,
if the file is longer than the hash. And having multiple hashes doesn't help
until the hashes exceed the length of the file.

~~~
compbio
Chances of getting a collision is higher than getting a good solution, but
having multiple hashes does help in increasing the chance at a good solution.
With a hash 1 bit less than the length of the file, we put two pigeons inside
one hole, and have a 50% chance at picking the right pigeon. The fewer/smaller
hashes, the more we get "sorry, but no".

~~~
Someone
It's easier to just drop the final F bits from the N-bit input stream and, at
decompression time, guess what they are than to go through this exercise of
generating hashes that have N-F bits in total and hunt for bit streams having
those hashes.

------
jheriko
yeah, he should have been smart enough to spot what was coming when he was
asked about multiple files... or at least to have asked some more directed
questions than 'what do you think you have that will solve this problem'

~~~
gruntled
Or insisted it was a single file, tar files allowed.

~~~
SeoxyS
Tar files have a decent amount of overhead. Source: I wrote a streaming
untarring library in C for a streaming video product. You would definitely add
WAY more than 1 byte of overhead per file, which is what is required for this
trick to work.
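
For a sense of scale, a quick Python check with 218 one-byte members
(mirroring Patrick's file count):

    import io, tarfile

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(218):
            info = tarfile.TarInfo(name=f"comp.{i}")
            info.size = 1
            tar.addfile(info, io.BytesIO(b"x"))

    # each member costs a 512-byte header plus data padded to a 512-byte
    # block, so 218 one-byte files balloon to roughly 225 KB
    print(len(buf.getvalue()))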

~~~
gruntled
But it rules out other kinds of tricks like storing info in file names. If all
the metadata is counted in the length of the tar file, these tricks don't
stand a chance. There's way more than 1 byte of overhead per file in the file
system and Mike needed a rule that counts all of them.

------
hyperpallium
The file ordering contains information (done here with the filenames,
_comp.$i_, but other file metadata could be used).

Mike can escape his unthinking agreement to multiple files via the rules
forbidding information in filenames.

------
phkahler
Here's a riskier solution. Choose an arbitrary large file size. Have the
decompressor search the local file system for a file of that specific size
and make a copy of it as the output. This presumes he's going to have the
uncompressed file on the system to verify the output of the decompressor.
That may turn out to be a false assumption, but what if...

~~~
log_n
Or just have the decompressor log onto a server and download the original
file. Only risk then is internet connection.

------
dvirsky
You should re-read this and imagine Goldman's messages being read in Vizzini's
voice.

------
yummybear
I remember a "fractal compression" hoax one time. It would compress a file
ridiculously (like a 1MB file down to 100 bytes) and decompress it
flawlessly. Of course, it just moved the file to some other place on the
hard drive, created a "compressed file" full of junk, and restored the
original file on decompress. Good one...

------
brownbat
If you want to hear some stories from the master of proposition bets, there's
an old autobiographical article in SI by Titanic Thompson:

[http://www.si.com/vault/1972/10/09/618832/soundings-from-
tit...](http://www.si.com/vault/1972/10/09/618832/soundings-from-titanic)

(The risk being that half of it is made up, but he definitely had a reputation
for this sort of thing.)

"You might wonder why, if I was the best golfer in the world, like I say I
was, I didn't turn pro and win all the championships? Well, you were liable to
win a golf bag if you won a tournament in those days. A top pro wouldn't win
as much in a year as I would in a week as a hustler. People would get to know
a pro, and I wanted to keep my skill a secret as far as possible. I didn't
care about championships. I wanted the cash."

------
prettyrandom100
I didn't see anything in the challenge that mentions run-time. I think a good
compression challenge should constrain run-time. Theoretically, it would be
possible to hash parts of the file and then brute-force the hashes in the
decompressor. This would take a lot of time but would work.

~~~
brongondwana
Except if the hash is smaller than the source data, then with a good hash,
there will be multiple source datasets that hash to the same result, which
makes your decompression program unreliable. You could well bruteforce the
wrong answer.

~~~
prettyrandom100
Do you mean that there could be two 3K files with the same sha256 hash, and
that the probability of hitting a collision is greater than the probability
of finding the correct preimage of the hash? Let's divide the 3MB file into
1000 parts, each ~3K in size. Take the sha256 hash of all 1000 parts and the
sha256 hash of the whole file. These hashes take up less space than the
actual file and leave ample room for a decompressor. Now start brute-forcing,
assuming we have unlimited run-time and power. Wouldn't the probability of a
collision happening in all 1000 parts be very low then? Given a good hashing
function, could it be so low that we can disregard it? If 1000 is not enough,
can we increase the splits to 2000 or 3000?

~~~
andrewaylett
It's not that there _might_ be a collision, but that collisions are
_guaranteed_. How many 3MB files are there? 2^(3M × 8) = 2^25165824. How many
distinct sets of hashes are there? 2^(256 × 1001) = 2^256256.

By the pigeonhole principle, we can't fit 2^25165824 objects into 2^256256
holes; indeed, each file will have on average 2^24909568 other files that
share the same set of 1001 sha256 hashes.
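
Spelling the arithmetic out:

    file_bits = 3 * 1024 * 1024 * 8  # bits in a 3MB file: 25165824
    hash_bits = 256 * 1001           # bits in 1001 sha256 hashes: 256256
    print(file_bits - hash_bits)     # 24909568 surplus bits, i.e. on average
                                     # 2^24909568 files per set of hashes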

The reason that we are able to work with hashes, and that compression works
in practice, is that we _don't_ have to deal with _every_ 3MB file. Most of
these files are gibberish and will never be seen, and finding two files that
match even one hash is incredibly difficult. But once we start talking about
brute-forcing, we start encountering problems -- and having to dedicate an
awful lot of processing power to the problem isn't the biggest one...

~~~
prettyrandom100
There is just one 3MB file, and we divide that file into 1000 parts. I agree
there will be collisions per hash (for each part). But I'm skeptical that all
1000 hashes will produce bits such that the reassembled file causes a
collision on the hash of the original file (remember, we do have the hash of
the original file). If the final hash does not match the hash of the original
file, we recompute all the hashes again by randomly generating the bits for
each of the 1000 file-parts. Do you mean to say that collisions are
guaranteed, and that a collision inside any file-part will also cause a
collision in the original file's hash when the parts are combined?

~~~
andrewaylett
Not every collection of 1000 correctly hashed parts will make a correctly
hashed whole, but there are an awful lot of different collections of parts
that will hash correctly (2^24909824 permutations of them) and of those, one
in 2^256 will also match the full-file hash.

------
mrfusion
I'm confused about the challenge. Why wouldn't simply using gzip work? I must
be missing something obvious.

~~~
btown
If you're given perfectly random data, gzipping it will (almost) never reduce
the size so much that you could fit the gunzip binary in the reduced space. In
the extremely rare occurrence that the generated random data has, say, a
repeated string longer than the gunzip binary, the challenger could be on
guard for that and just regenerate random data until that's not the case.

To be more formal, the challenger is finding what he believes to be a string
that is Kolmogorov-random, and betting (quite safely) that the challenged
party can't prove him wrong.

[http://en.wikipedia.org/wiki/Kolmogorov_complexity#Kolmogoro...](http://en.wikipedia.org/wiki/Kolmogorov_complexity#Kolmogorov_randomness)

~~~
function_seven
> that you could fit the gunzip binary in the reduced space

What's funny is that Patrick (the challenger) asked if a bash script
consisting solely of a call to gunzip would suffice as the decompressor. In
other words, all he had to do was compress the file by more than the size of
the _script_. Mike allowed it, knowing that even a tiny "decompressor" that
really just called out to the real thing would still be larger than the
compression achievable on a well-crafted random blob.

------
fishnchips
I am actually surprised that Patrick did not compress the original data to 0
bytes by keeping all the data in filenames. That would be the ultimate troll
;)

~~~
xorcist
I disagree. You could legitimately say he just stored the data in the metadata
fields. But files have a size, even as a stream, completely regardless of
metadata. I think this is a more clever hack.

~~~
fishnchips
Maybe, I don't know. One way or another you're storing _some_ data (chunk
ordering in Patrick's case, all data in my case) in file names.

------
softbuilder
> It's not my fault that a file system uses up more space storing the same
> amount of data in two files rather than a single file.

Even without a filesystem - just sending data over the wire - you have to be
able to delimit files in some way, and there's going to be overhead associated
with that.

Another way to think of this is that any particular volume could be viewed as
a single big file. How much space in that big file is he taking up?

~~~
ThrustVectoring
That just invites more ways of implicitly sharing and hiding data. Like, have
each of 230 hosts have one part of the file. Instructions for running the
program are "Go to %part1url. Download the file as `foo.1`. Go to %part2url.
Download the file as `foo.2`." and so forth. That's exploiting the fact that
the instructions for running the program aren't counted as part of the program
size.

------
logicallee
In terms of behavior, we know Mike acted in bad faith: before he saw the
approach, he had agreed that the challenger could use multiple files. But
once the challenger had posted them, he proceeded to download only a single
file to verify its functionality, not touching the others. Choosing to
ignore the other files shows bad faith on Mike's part.

By the way, in a theoretical sense Mike lost when he said he would allow
multiple files and count their sizes: this is because [] is not the same as
[[][][]], which consists of 3 empty lists. You can theoretically encode a
file into just a bunch of 0-byte files, without using the order of the files
or their names. He shouldn't have agreed to count only their sizes.

For the theoretical encoding into 0-sized files, you can simply interpret the
input file as a binary number, and then create that many empty files.

This is not a practical solution, of course - you can only compress two bytes
down to 0-byte files this way, as two bytes is already up to 2^16-1 = 65535
empty files. For three bytes it's up to 16,777,215 files.

If you wanted to store 9 bytes in unary as the number of empty files, you
would need up to 2^72-1 ≈ 4.7 sextillion (a million quadrillion) files.
Obviously that is not actually possible. But even 9 bytes is hardly enough
room for a program that counts the files and writes the number out in binary
again. (Unless you can somehow get Mike to agree to count only the
invocation - since the decompression program itself doesn't need to store any
information and theoretically could be 0, 1 or 2 bytes.) But theoretically
you don't need anything other than what Mike foolishly agreed to: allowing
multiple output files and counting only their sizes.
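
A toy Python sketch of that unary scheme (workable only for tiny inputs, as
noted):

    import os

    def compress(data: bytes, outdir: str) -> None:
        # interpret the input as a number N and create N empty files; the
        # file sizes sum to zero, which is all Mike agreed to count
        n = int.from_bytes(data, "big")
        os.makedirs(outdir, exist_ok=True)
        for i in range(n):
            open(os.path.join(outdir, str(i)), "w").close()

    def decompress(outdir: str, nbytes: int = 2) -> bytes:
        # neither filenames nor ordering carry information: just count
        n = len(os.listdir(outdir))
        return n.to_bytes(nbytes, "big")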

2.

There is also another theoretical way to make money off of Mike, but it is
not practical. (It doesn't work.) If we were not limited to bytes but could
use bits, you could shave 5 bits off of any input file by stripping its first
5 bits and decompressing by always prepending the bits 00000. (Theoretically
there is only 1 pigeonhole per decompression, so you do not need to store any
information in the decompression algorithm and it has no minimum size.)

If Mike is using a random source for the files, this would result in a correct
decompression in 1/32 of cases. But Mike is giving 49:1 odds (risk $100, get
$5000), which is better than 31:1. So you could simply repeat the game with
Mike thousands of times, always using the same decompression algorithm, until
you have all of Mike's money. This works better if Mike is a computer, of
course.
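
The arithmetic, for what it's worth:

    p = 1 / 32             # chance a random file really starts with 00000
    print(p * 5000 - 100)  # expected profit per $100 game: +$56.25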

And it doesn't work at all on any actual systems, as a nibble is less than a
byte and would not count as savings even if you could encode the decompression
algorithm into a 0-byte (or up to 3 bit) invocation.

------
hackhat
He also never states that the output must be right on the first decompression
attempt. Using this, you could encode a few extra bits in the number of
tries: generate various wrong outputs, knowing that after 1,000,000 tries he
would get a correct decompressed file.

------
Intermernet
Does the decompressor have to be wholly hosted locally? Could it just be a
shim that pulls a more complex program from the net?

There are grey areas here. Does a decompressor that depends on linked
libraries count? Do things like libc count towards the total decompressor
size?

I know this was written 14 years ago, but we had the net then, and shared
libraries aren't exactly a new thing. Where do you draw the line? Can _any
decompression code call an external dependency_ and not be disqualified in the
same way?

I'd say that using the filesystem to "hide" bytes is the least of the
possible loopholes with this challenge, if you were being pedantic about the
rules.

~~~
function_seven
> Does the decompressor have to be wholly hosted locally? Could it just be a
> shim that pulls a more complex program from the net?

The judge can disconnect his computer from the Internet, attempt to run the
submitted decompressor and file, and declare failure when it doesn't work.
Nothing in the challenge guarantees Internet connectivity on the machine.

> Does a decompressor that depends on linked libraries count? Do things like
> libc count towards the total decompressor size?

Nope. In the email exchange, Mike said that just a script that called out to
gunzip would be fine, and that he'd only count the size of the script as the
decompressor size.

> Where do you draw the line? Can any decompression code call an external
> dependency and not be disqualified in the same way?

Yup, as long as the dependency is already on the machine I suppose.

That's what's so enlightening about this challenge. Even a tiny, tiny script
that calls to other programs still can't compress a "pathologically" random
file by more than a few bytes.

~~~
Intermernet
>Yup, as long as the dependency is already on the machine I suppose.

That's exactly what I was getting at.

If gzip was allowed, then any common decompression utility should be allowed.

If that's the case, and said utility relies on an external library, should
that library be included as part of the size of the decompression utility?

It's arguable that the "fabric" over which the data is delivered shouldn't
matter. It shouldn't really matter whether the linked library comes from the
same SSD as the executable or from a server on the other side of the world.

We currently define "the machine" as the internals of a box. Would this count
if you were running the executable from a removable drive? If not, why does
"network storage" trigger the disqualification, and not "USB storage"?

I understand the original premise of Mike's challenge, but considering a
loophole is being discussed here, I'd like to know where people see that the
boundaries of similar loopholes lie.

------
hurin
(2001) tag?

------
gayprogrammer
If you printed the files each on a sheet of paper, it would 'use more trees'
than printing the original file (paper being the 'filesystem').

I agree that it shouldn't matter what the filesystem does to store it, if the
rules state that file size is determined by a specific command to count all
the inodes, then he lost. If the command is 'du' or 'wc', then he won.

------
kennywinker
My money is on Hooli winning this one.

------
im3w1l
I don't think it is 100% foolproof, even if no filesystem trickery is used.
The random number generator used is likely not perfect and so the data should
be compressible. I mean it would probably be quite difficult, but maybe
possible.

~~~
a_e_k
From random.org, used as a source for the data:

> RANDOM.ORG offers true random numbers to anyone on the Internet. The
> randomness comes from atmospheric noise, which for many purposes is better
> than the pseudo-random number algorithms typically used in computer
> programs.

~~~
im3w1l
Ah missed that. Yeah then it may be good enough randomness.

~~~
emn13
But yeah - it may well be not entirely random, and if someone actually managed
to build a compressor based off any slight bias, I think they deserve the 5000
:-).

------
thezilch
By Patricks thinking, I could just store the last N bytes of each file in the
filename! Clearly he can see this isn't compression?

------
skatenerd
Can't he just send you a Kolmogorov-random file? The definition of randomness
(in Kolmogorov sense) basically corresponds directly to his challenge.

Also, Kolmogorov-random sequences vastly outnumber non-random sequences in
general, so with a long-enough file, I wonder how certain he can be that he
has generated such a file.

[http://en.wikipedia.org/wiki/Kolmogorov_complexity#Kolmogoro...](http://en.wikipedia.org/wiki/Kolmogorov_complexity#Kolmogorov_randomness)

~~~
mappu
How do you determine whether the file is kolmogorov-random?

The only approach is to try a perfect kolmogorov compressor, which doesn't
exist (well, excepting brute force over the space of possible turing
machines).

~~~
skatenerd
I bet that with longer strings, the odds of drawing a kolmogorov-random string
are high enough that he's guaranteed to make money from his Challenge

------
Bentota
Why wouldn't binary run-length encoding work here? E.g. "compressing"
11100110 to 30020?

~~~
DanBC
Compression relies on entropy. There's not enough entropy in the random file
your your run-length encoding to work.

I think the data is available so you can always try to beat the bet.

~~~
et1337
I may be completely misinformed, but I think you meant to say there's _too
much_ entropy.

A binary string of all ones followed by all zeroes has very low entropy, while
a purely random binary string has high entropy. (I think. I'm skimming the
Wikipedia article on entropy now)

~~~
DanBC
Yes, sorry!!

------
ac29
See also section [9] of the comp.compression FAQ for more on the history of
compression of random data:

[http://www.faqs.org/faqs/compression-
faq/part1/](http://www.faqs.org/faqs/compression-faq/part1/)

