The $5000 Compression Challenge (patrickcraig.co.uk)
323 points by vq on Mar 8, 2015 | 168 comments

I absolutely believe Mike should have paid Patrick.

On the simple premise that since Mike was hosting a bet that he KNEW was impossible (i.e. under no circumstance, ever, would he have to pay the $5000), literally the only point of the game is to find loopholes. Otherwise it's just Mike preying on unsuspecting victims.

If you design an impossible game, the only possible thing for anyone to do is to break it. If you then complain that THAT is cheating, you're a pedantic idiot - one of those annoying kids in middle school who loses a bet and then tries every possible way to weasel himself out.

To add fuel to the fire, his obnoxious replies such as "I tried running the first two files and it didn't work", make my blood boil, as it's a clear attempt to try and belittle the contestant.

> If you design an impossible game, the only possible thing for anyone to do is to break it. If you then complain that THAT is cheating, you're a pedantic idiot

I feel the same way when casinos bust card counters. You use math to take money from suckers. When other people use your rules and better math to take money from you, that's suddenly deeply immoral.

I don't think anyone feels that card-counting is deeply immoral. It's more like, "we lose money when people do this, so we will do everything in our power to prevent it."

Mike's responses could have been better, but in the correspondence I see no guarantee that the files will be presented in order, or with the same file names, or in an otherwise empty directory.

That sounds close to cheating, but I think it's exactly in the spirit of both the original challenge by Mike and the response by Patrick. After all, the original challenge didn't mention multiple files (so it's not surprising that this limitation isn't mentioned), and the subsequent alteration to the rules was initiated by Patrick, who intentionally tried to inject a loophole, but failed to specify the need to keep the files in order. His rules; he should have to live by them.

Also, I suspect that if Patrick had mentioned the need to leave the file names unaltered or in order, there's a good chance that Mike would have smelled a rat. After all, it is precisely in that side-channel where the information gain is to be had.

The "compressed" files include a header file which records the original file name and the number of parts which make up the compressed file, so it doesn't depend on a directory listing to re-combine the parts.

Yes, it does. It records the original file name, not the name of the parts, and it needs the names+metadata of those parts (specifically in this case their order) to reconstruct the original file.

Here's the script:

    f=`head -1 $1`
    i=0
    n=`tail -1 $1`
    rm -f $f
    while [ $i != $n ]; do
        cat $1.$i >> $f
        i=`expr $i + 1`
        if [ $i != $n ]; then printf "5" >> $f; fi
    done
The iteration is done with a while loop based on metadata from the "compressed" file (the number of parts, $n) and the name of the archive ($1). The order of the parts is determined by incrementing $i up to $n, not from, e.g., the order of the files in the file system, which seems to be what was implied in the comment earlier.
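To make the information flow concrete, here's a minimal Python re-creation of the trick (my own sketch, not Patrick's actual code):

```python
# Hypothetical re-implementation of the split-on-"5" trick (illustration only).
data = b"ab5cd55e"            # stands in for the 3 MB random file
parts = data.split(b"5")      # each part "ends with" a 5 that gets dropped
# On disk, the parts total len(data) - (len(parts) - 1) bytes: "smaller".
# The decompressor rejoins them, re-inserting "5" between consecutive parts:
restored = b"5".join(parts)
assert restored == data
# The removed bytes aren't gone; they are encoded in the part boundaries,
# i.e. in filesystem metadata (the individual file sizes).
```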

> If you design an impossible game, the only possible thing for anyone to do is to break it.

Which is the Kobayashi Maru in a nutshell. In that fictional case, Kirk was disqualified but also received a commendation.


I don't know if the novel is canon, but it ought to be.

Kirk beat it by cheating, Scotty beat it legitimately, then proved that what he did only worked in the simulation.

> If you design an impossible game, the only possible thing for anyone to do is to break it. If you then complain that THAT is cheating, you're a pedantic idiot - one of those annoying kids in middle school who loses a bet and then tries every possible way to weasel himself out.

The purpose of the impossible game was to act as a deterrent to people going to Usenet's comp.compression, announcing their brilliant new algorithm, and claiming it could compress any file better than any other algorithm.

Obnoxiousness just means this challenge becomes interesting to code-golfers who will also try to obey the rules but produce a broken result.

He did offer the guy his money back. I think the challenge is all in good fun and the intent is ultimately that no money ends up changing hands.

Mike Goldman originally wrote the challenge such that it calls for one file and one decompressor.

However, when subsequently asked whether there can be multiple files, he agreed; thereby he was arguably duped. He didn't say "okay, but there will be a 256 byte size penalty per additional file", he just plainly agreed.

This means that the original formula for adding the size of the solution applies: just the file sizes added together.

Goldman should accept that he foolishly rushed into a careless amendment of his original challenge and pay the money.

That said, it obviously is cheating to have the archive format or file system hide the representation of where the removed bytes are! If a single file is produced, it has to include a table of where to insert the bytes that were taken out. If multiple files are produced, the archive format or file system stores that information for you at considerable expense.

If both people are wrong, the contest should be declared invalid and Goldman should return the $100.

If only Goldman is wrong, he should pay $5000.

Under no interpretation is Goldman strictly right and the contestant strictly wrong.

So he is wrong to keep the $100 in any case.

> He didn't say "okay, but there will be a 256 byte size penalty per additional file", he just plainly agreed.

Exactly. IMO the challenge-setter is the one being more unfair here, since his metadata-based reason to reject the solution applies to submissions made under the original rules too. He accepted an amendment without counter-amending to cover that loophole, so if he stands by his word, he has to pay up.

It casts doubt on whether he would pay if somebody won by the original rules with some luck. Maybe he has an implicit allowance, that merely two files can't "encode" a big enough advantage, but again, then it's his fault for not addressing that in the new rules.

It's effectively a prop bet. If you're a world class table tennis champion and you bet someone you can beat them at table tennis on the proviso that they get to pick the bats, you can't really complain if you find out later that they have spent six months practicing with frying pans. You just have to pay up.

If you don't see the loophole before you agree, you pay. Actually there's a great little bit about this in the movie Guys And Dolls (1955), and what to do if someone bets you that he can make a jack of spades squirt cider in your ear.

Note that he never guaranteed to leave the files with the same file names, and without that, Patrick's trick wouldn't have worked.

Given the fact that Patrick made the rules concerning multiple files, and that they were intentionally tricky, I think it's fair to interpret them in that spirit - as a trick.

Notably, even without the ordering there is still some information leakage purely in terms of file sizes, but that's a little harder to exploit (likely still possible with a 3 MB file). Of course, the rules don't explicitly state the directory must be otherwise empty...

Frankly, I think it's pretty reasonable that Patrick "lost" the bet, even if Mike's reasoning wasn't sound. For Patrick to have won that bet would have required quite a list of requirements for how those "other" files were to be passed to the decompressor. Such requirements were not presented, and if they had been they would have been obviously fishy.

He could have ordered the files by size pretty easily. Trick still would have worked. No fancy requirements needed. Just making the files available in any manner.

And yes there needs to be a way to know which file(s) are the compressed data, that loophole would disqualify any entry.

Yep, file sizes alone leak sufficient information to be an issue, but it's a bit of a hypothetical because that's not what actually happened. Clearly, Mike shouldn't have allowed multiple files (or should have specified some kind of overhead for multiple files), and clearly there are other tricks Patrick could have used, but with these unfortunate rules and this submission nothing seems to require keeping file order intact.

It's a good point about needing to know which files are compressed. On a philosophical level, I'm not sure whether that means there actually should be a way to know which file(s) are the compressed data, or that the rules are simply broken...

What I'm trying to argue is that any justifiable objection Mike could have made would have failed.

Unjustifiable objections could have been made but screw those, because they would cause legitimate compression to fail.

Unjustifiable objections will pretty much always exist if the rules are written in English. That doesn't mean the rules are broken, it means you use judgement and follow the purpose of the rules.

Mike's fate was sealed when he allowed multiple files without an overhead penalty. He could have removed all the metadata possible, simply having the files exist as separate entities was enough to ruin him.

Clever; he'd likely be skipping many 5's towards the end, but it still compresses.

Another scheme: chop it up into files of increasing length, the difference being 1..256, encoding a byte (ignore the last difference).

EDIT more efficient to increase in length by 1 or 2, encoding a bit.
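That bit-level variant could be sketched like this (my own illustration; the function names are made up): each successive part grows by 1 or 2 bytes, encoding one bit per boundary, and the decoder only needs the file sizes.

```python
# Sketch of the length-delta scheme (hypothetical, not an actual submission).
def split_encoding_bits(data, bits):
    parts, pos, size = [], 0, 0
    for bit in bits:
        size += 1 + bit              # next part is 1 (bit 0) or 2 (bit 1) bytes longer
        parts.append(data[pos:pos + size])
        pos += size
    parts.append(data[pos:])         # leftover tail carries no bits
    return parts

def read_bits(parts):
    # the decoder needs only the part sizes, i.e. filesystem metadata
    bits, prev = [], 0
    for p in parts[:-1]:
        bits.append(len(p) - prev - 1)
        prev = len(p)
    return bits

parts = split_encoding_bits(b"x" * 100, [1, 0, 1, 1])
assert b"".join(parts) == b"x" * 100
assert read_bits(parts) == [1, 0, 1, 1]
```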

> So he is wrong to keep the $100 in any case.

Regardless of the results of the challenge, at the end of everything Goldman offered in good faith to return the $100 and Craig said he could keep it.

Nope, perfectly acceptable for Goldman to keep the $100. Craig tried to play a game of semantics, and lost because semantics let Goldman weasel out by pointing to OS metadata.

The challenge clearly, repeatedly, stated that Goldman would give $5000 to anyone who could compress a datafile such that the combined size of file and decompressor was smaller than the original datafile.

Craig did not compress the data stored. He split the file up into multiple smaller files, using "5" as the boundary. However, by creating 218 files, he increased the amount of space required to store this data. Ergo, the combined size of the "compressed" files and the decompressor exceeded the size of the original file. It is only by excluding the space taken up by the metadata that the "compressed" files are smaller. Furthermore, elsewhere it was noted that: "The FAQ clearly outlines that filesystem exploits are not true compression."

This interpretation is inconsistent with Goldman's own statement about the original data that "the file size is 3145728". He didn't say "the file size is 3145728 plus some file system overhead", so by file size he was thinking of the number of bytes in the file ... until he was outsmarted.

It's hardly a filesystem exploit if - again by Goldman's own statement - gunzip is allowable.

I suspect that if the challenge had been solved with a single file, Goldman would try to get out of paying by claiming that the program's size should include the size of the interpreter for its language, and the libraries linked to that, the size of the command line needed to invoke it (including the pointer vector and null termination), not to mention the underlying kernel ...

I don't know Goldman, and I bet you don't either - but there's a pretty big difference between this solution (which clearly cheats the aim of the challenge) and a solution that actually compresses. People hate to reward cheaters, even if it's a fun kind of cheat from the outside. But that doesn't mean he wouldn't have paid out for a real solution, which likely would have been quite interesting (and not quite as impossible as it's being made out to be, since we don't know whether his random source is truly random).

What does "actually" compressing mean?

Replacing every "5" with EOF is apparently bad.

What if he replaced every "5z" with EOF? Fewer bytes there.

What if he had a variant of LZ77 doing dictionary encoding followed by a range encoder that outputs symbols in the range -1 through 255? Even counting the EOF as a character, this would give an output 2K characters smaller. Sounds like compression to me. It's finding common sequences and uncommon sequences and rescaling them based on probability to remove redundancy.

"compress a datafile such that ..." appears to give a self-contained definition of what it means to compress in the context of the challenge.

In any case, no compression algorithm makes all inputs smaller. So, just because a data encoding doesn't make a particular input case smaller doesn't imply that it's not a compression algorithm.

I admit I'm surprised to see this comment is still the top comment. I wrote it off when it was first written as being far too pedantic to be at all meaningful, but I guess other people are falling for the same trick.

This is pure pedantry, and it's bad pedantry at that. Your argument is that Mike agreed that multiple files could be submitted, and from that you're drawing the completely baseless conclusion that the multiple files would be considered purely based on their filesize. According to a straightforward reading of the challenge, only the filesize of the decompressor and compressed file would matter. But when Mike agreed to multiple compressed files, nowhere did anyone state that only the filesize of the additional compressed files would matter.

And it should be pretty obvious to anyone that you cannot simply accept just the filesize of the multiple compressed files. Because the only point in using multiple files is to try and hide extra data in either the number of files or their filenames. And that obviously violates the entire point of the challenge.

If you want to take Patrick's multiple files approach to its logical conclusion, you may as well submit a bunch of zero-sized files where all of the data is stored in the filename (perhaps with an integral prefix for sorting purposes) and have your decompressor just be some variant on `echo` that strips the prefix and echoes the rest of the filename (without space separation). That way you could claim infinite compression!
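For concreteness, that reductio could look like this in Python (purely hypothetical; nobody actually submitted this, and the filenames are my own invention):

```python
# Zero-byte files whose *names* carry all the data: "infinite compression".
import os
import tempfile

data = b"hello random world"
d = tempfile.mkdtemp()
for i in range(0, len(data), 8):
    # sortable integral prefix + hex-encoded payload; the file itself is empty
    name = f"{i // 8:04d}_{data[i:i + 8].hex()}"
    open(os.path.join(d, name), "w").close()

# The "decompressor" is a glorified echo: strip the prefix, decode the rest.
restored = b"".join(
    bytes.fromhex(n.split("_", 1)[1]) for n in sorted(os.listdir(d))
)
assert restored == data
# Total size of the "compressed" submission, as Patrick's rules would count it:
assert sum(os.path.getsize(os.path.join(d, n)) for n in os.listdir(d)) == 0
```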


If you're still not convinced, then how about this: nowhere did Mike agree that he would not rename the decompressor or compressed file(s) prior to running the program. And a trivial renaming would have broken Patrick's "decompressor". That alone should cause his entry to obviously be a failure.

> completely baseless conclusion that the multiple files would be considered purely based on their file size

You mean, like...

> I meant can I send you a compressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file.

(to which Mike agreed)?

Since this is still a pedantry argument: Mike agreed that Patrick could send multiple files, but Mike never agreed that these files would be accepted as a winning submission. Furthermore, Mike never agreed that the filenames of the files would be considered to be meaningful data (without contributing to the filesize calculation).

> Mike agreed that Patrick could send multiple files, but Mike never agreed that these files would be accepted as a winning submission

Then what is "agreeing"? These two sentences appear logically inconsistent.

Huh? Patrick asked if he could send multiple files. Mike said yes. Again, since the context of this is pedantic arguments, agreeing that Patrick could send multiple files is not the same as agreeing that Mike would consider the multiple files to be a winning submission.

I have to logically agree with you. Indeed, Mike never issued a statement like, "I am amending the contest rules so that multiple files decoded by a single program are admitted as a valid solution."

Only, this trickery was not Mike's intent; otherwise, of course, he would have played this card immediately. For instance:

"I said you could send that to me because that is a true statement with which I therefore agree. It is undeniably true that you can send me anything you like. You can send me an e-greeting card for my birthday, and you can send me a Britney Spears MP3 as a MIME attachment. You can send me questions asking about what you can send me. Obviously, though, the only material which you can send me which also happens to conform to the contest rules is a single program and a single data file. Please re-read the contest statement; I will gladly explain any portion of it that is not clear."

Mike had no need to complain that the program isn't a decompressor. He should have played it as above.

Instead, he demonstrated acceptance of the multi-file structure of the solution by complaining that the algorithm in the program file isn't a form of compression.

In fact this solution doesn't use either the number of files or their filenames to encode additional information - it uses the sizes of those files to encode the additional information.

It does use the filenames, for ordering. If you renamed the files so they sort differently, they would not decompress correctly.

We probably agree that any program should be allowed to uniquely identify the files making up the input data. And if Mike would have spotted the (other) more obvious loophole, which is to store input data in only the filenames, he could have given this (possibly the only) sensible answer: "Yes, multiple files whose summed up size + the size of the decompressor are smaller than the original file are ok. But the files must be uniquely named inputfile.0 ... inputfile.NNN." Which is exactly what Patrick delivered.

But that's not what happened. Patrick, not Mike, changed the rules, and he intentionally introduced a loophole. But in doing so, he failed to specify that file names were to be left unaltered. I think it's fair that if you make intentionally tricky rules, you not be surprised when your own loopholes are exploited.

Providing multiple files with the (deceptive) intention of storing data via filesystem metadata, but then failing to specify how filesystem metadata is to be treated is just asking for it.

This is exactly why such challenges are always entirely unsatisfactory and nothing but a pointless distraction. There might be some point in providing entertainment to both parties if the pedantry war weren't entered into, but the nature of the challenge all but guarantees that will happen.

Mike will always be able to find an excuse not to pay. If Patrick had imposed the 'no file renaming' condition, Mike could have found another way to break his submission with a counter-loophole.

This type of challenge should never be made or accepted, if financial incentives are involved, without an impartial adjudicator.

It's only a pointless distraction when people see these challenges and treat them as a game to be tricked. Mike already made it clear the challenge is not about financial compensation (and offered to give Patrick the $100 back even though Patrick did not win, under the very generous assumption that Patrick innocently misunderstood the challenge instead of deliberately tried to subvert it).

The point of a challenge like this is the same as the point of the Randi challenge: to provide a means to demonstrate that impossible claims are impossible (in this case, that of being able to compress truly random data, and in the case of the Randi challenge, supernatural/magical abilities such as ESP). After all, if any such claim were valid, then the claimant would be able to defeat the challenge, assuming that the stated rules are fair (and they are).

Which is to say, the point of a challenge like this is not actually to have anyone enter. It's just to be.

> This type of challenge should never be made or accepted, if financial incentives are involved, without an impartial adjudicator.

Did you read the whole page? Mike did offer an impartial adjudicator:

> I would gladly submit any such submission to an impartial ombudsman to determine the question of whether data compression has occurred.

I'm not sure I'd go that far, but it's definitely not a surprise to see it fizzle out. It's particularly uninteresting here because a real solution is either impossible, or only possible because Mike's random source isn't entirely random (which is kinda boring).

Then again, the purpose apparently was to make crackpots put their money where their mouth is rather than spam the newsgroup, so perhaps fairness isn't the best success criterion.

I don't think that's relevant - in a sense the ordering information is only required because using individual files has otherwise destroyed some information (by allowing reordering within the output data).

The same process would work with a single stream input to the decompressor, as long as the length of each block where an additional '5' must be output is available - the list of file sizes - so that's where the additional information is stashed.

Quote from Mike:

> Rather, you simply split the file into 218 parts ending with the character "5" and then stripped that final character from each part. Thus the "decompressor" is nothing more than a reassembler, concatenating the parts and reappending the character "5" after each.
Well, that's exactly the definition of lossless compression. Look at e.g. how js crunch works: you create a dictionary of common sequences, split the file on those sequences recursively and then reassemble it by joining in reverse. Gzip, bzip2, &c, &c, it's all the same thing. Split the file by a common sequence and reassemble it by that. Patrick just created a customized compressor that went only 1 level deep.

Normally you'd need a delimiter to separate those chunks, a delimiter that doesn't occur in the chunks e.g. through padding or escaping. That, in turn, increases the filesize, and now you're in trouble. What Patrick did was to use EOF as a new fresh "delimiter" that doesn't occur anywhere, and at a cost of zero bytes, no less.

Cheating, or inventive.

The EOF is not at a cost of zero bytes; it costs as much as storing the length of each constituent file. The extra space used is in the file system accounting.

It's at a cost of 0 competition score bytes. Mike screwed up by allowing an alphabet of 257 symbols and then only counting 256 of them. Pretty much any compression or repacking algorithm could have been used at that point.
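Back-of-the-envelope arithmetic for that framing (my own numbers, not from the thread): a free 257th symbol gives each position a sliver of extra capacity, and over 3 MB that adds up.

```python
import math

# Extra capacity per symbol from a 257-symbol alphabet priced as 256 symbols:
extra_bits_per_symbol = math.log2(257) - math.log2(256)   # about 0.0056 bits
# Over Mike's 3145728-byte file:
total_extra_bits = 3145728 * extra_bits_per_symbol
# Roughly 17.7k bits, i.e. about 2.2 KB of slack, which is consistent with
# the ~2K-character margin mentioned earlier in the thread.
```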

Dylan16807, that's a very concise way to put it, thanks for making that comment.

In return, Patrick screwed up in wanting to use the extra filesystem metadata, but he failed to actually ensure that it would be left unaltered. Only the file contents were to be left untouched.

It's inventive, but what if instead of doing it at the byte level, we did it at the bit level? Instead of splitting files on every "5" encountered, we split them at every 1 in binary, which gives about 50% of "compression".

That would essentially be a kind of run-length encoding. I doubt if a 50% compression ratio is possible, though.

I wonder how you got a downvote for that, it's quite accurate. A version of simplistic token replacement even has a wikipedia page http://en.wikipedia.org/wiki/Byte_pair_encoding

The point of the challenge was to tempt people who do not understand compression as well as Mike into putting themselves into a position for Mike to mock and/or shame them. From that respect, it seems to me like it was a trick.

In my experience, people who set up such tricks do not usually respond well when the tables are turned.

There are some people in the world who take it personally when other people don't understand their area of expertise as well as they do. They get angry and offended at naive questions, and seek to punish the idiots. This is a great way to take an interesting subject and ruin it for everyone.

One of the best aspects of the HN culture is that experts here tend to incline more toward teaching and less toward chastising. It's a nice change from Usenet.

Why do you think that, rather than that Mike is genuinely interested in novel compression methodologies and willing to pay some money to get interested people to attempt to discover them?

Because that's pretty much what he says in his post to comp.compression, where he announces that someone accepted the challenge. Let me quote:

> Before naming the individual and giving additional details of our correspondence, I would like to give him some time to analyze the data I will be sending him. It would be very easy to point out to him the impossibility of his task, but far more interesting to see how long he will struggle with the problem before realizing it for himself.

> I am supposing that one of his fellow co-workers probably referred him to my challenge, as I cannot fathom that someone would read the comp.compression faq first and then want to participate after understanding the futility of the effort. On the other hand, some people just don't understand information theory too well.

> I'll try to give him a complete explanation of his error after a week or so, I guess. :)

So Mike is just smug about how clever he is, how stupid the other person is, not even suspecting there might be a loophole in the challenge.

I see no sign of interest in learning what the other person is up to, or even admitting that there might be something to learn.

Because information theory says it's impossible.

I would say the update he posted to the news group rules this out.

There was a problem in comp.compression (or whatever Usenet newsgroup) of arrogant challengers announcing their brilliant new compression algorithm. This bet is a way of demonstrating at least the pigeon hole problem to other people.

A trick similar to the recursive Barf compressor (add information to the filename). http://mattmahoney.net/dc/barf.html

A longer running challenge is http://www.drdobbs.com/architecture-and-design/the-enduring-... No entry fee, $100 prize, and just as unfair.

A completely serious compression challenge with real consequences for AI and NLP: http://prize.hutter1.net/ offers up to $50,000 in prize money, but with severe restrictions on memory and running time.

You can not beat Goldman's trollish challenge (certainly not if the rules are retroactively clarified in favor of the organizer). You could, however, try to put the challenge in limbo by creating a decompressor which brute-forces a solution until some hashes match, or the final heat death of the universe, or the halting problem is solved, whichever comes first. Goldman will not be able to ever verify your solution, and when he does (theoretically it is not impossible), it means you win.

Or, instead of above Schrödinger's Compressor you can send a good random number generator back as your solution. If Goldman wants his file to be random, any random file should do. Why does he want exactly his own random file? Why does he want to do a diff between two random files, is he perhaps looking for order where there is none? But that's the same foolishness he accuses his participants of.

> A longer running challenge is http://www.drdobbs.com/architecture-and-design/the-enduring-.... No entry fee, $100 prize, and just as unfair.

At the 10-year scale, a whole new set of tricks opens up. Invent a sufficiently popular programming language, or contribute a lot to the Linux kernel, and start surreptitiously hiding bits of the file on his OS (the easiest would be for your language to have a builtin function that spits out a small part of the file).

That was explicitly forbidden :-)

Bruteforcing a solution until hashes match doesn't work. If you try to bruteforce X bits and use a hash of size Y for validation, you will get 2^(X-Y) possible solutions.

(I made the same mistake some time ago when I tried to bruteforce a 4 byte RC4 key with 3 known bytes in the plain text; I found 256 solutions.)
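A toy demonstration of that counting argument (my own example, scaled down to X = 16 input bits and a Y = 8-bit hash):

```python
import hashlib

# Keep only the first byte of SHA-256 as an 8-bit "validation hash".
def h8(v: int) -> int:
    return hashlib.sha256(v.to_bytes(2, "big")).digest()[0]

target = h8(1234)                 # pretend 1234 is the secret input
# Brute-force all 2^16 inputs; the hash alone cannot single out the answer.
matches = sum(1 for v in range(2 ** 16) if h8(v) == target)
# The count clusters around 2^(16-8) = 256 candidate "solutions".
```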

Yes, if you want to be sure that your solution is correct, you must run the compressor yourself. Then you count the number of collisions it takes to happen upon the correct solution and feed this counter to your decompressor. But then you place the burden of solving the halting problem on yourself, and then you have more serious problems than compressing random data.

But your counter will need (X-Y) bits of storage, so you'll need to store Y + (X-Y) bits, or a total of X bits, and you have saved nothing.

> If Goldman wants his file to be random, any random file should do. Why does he want exactly his own random file?

There is no such thing as a random file. Randomness is not a property of a single file. You can't look at a byte sequence and say that "it's random". It can be random looking; you can run some statistical analysis on the byte sequence and say "it's probably generated by some good random algorithm", but even here "probably" doesn't correspond to any probability in [0,1].

Also there are cases in practice where you expect back the same random byte sequences even if they are randomly generated. Think about public key authentication and symmetric session key's exchange.

Randomness can be tied to the algorithm which generates a byte sequence. This information is not stored in the file in any way, it's just the mere result of an algorithm. Randomness is the "color of the bits" [1].

[1] http://ansuz.sooke.bc.ca/entry/23

> Randomness is not a property of a single file. You can't look at a byte sequence and say that "it's random".

Interestingly, this is less true than you might think. The mathematics of Kolmogorov complexity[1] formalizes the idea that a given finite fixed string is "random". In this framework, you can have a byte sequence that definitely is random. (However, your second sentence I quoted is still technically true here as well because it is algorithmically undecidable how random the string is.)

(In fact one of the definitions of a random string is one that is incompressible, i.e. there is no algorithm with shorter description than the string that produces the string.)

[1] http://en.wikipedia.org/wiki/Kolmogorov_complexity

> The mathematics of Kolmogorov complexity[1] formalizes the idea that a given finite fixed string is "random".

I read through the wiki article and this approach is certainly interesting. However, this complexity depends on the description language, so there is no unique way to determine if a string is random.

Right. However, the idea is that for any two description languages (say Java and Haskell), there is a constant such that the two complexity measures agree on all strings up to that constant. So it's quite robust and seems to capture something fundamental or deep about randomness and information...

> If Goldman wants his file to be random, any random file should do. Why does he want exactly his own random file?

Maybe because the file was XOR'd with a secret one time pad?

Blast! That would require our solution to break all known encryption. Then it would be easier to just target Goldman's setup.

A file with output from AES-256("sekrit") would not be random to Goldman. It would appear random to us, until we crack it with a plausible password. Key to randomness is unpredictability. The moment Goldman saves some atmospheric noise on his computer and makes a short pointer to it on his filesystem, is the moment this file loses its claimed unpredictability. Goldman knows this file deeply: He himself has compressed meaningless random data into predictable information that has meaning and purpose, a feat he set out to prove impossible. By simply entering the challenge with real meaningless randomness you make Goldman solve it for you.

> Why does he want exactly his own random file?

He's running a model and needs to use the same random data each time.

Then he can wait for the RNG to produce this same random data. Eventually it will produce a file which matches. A dynamic solution of sorts, because he would have to be quick to diff, before the file starts changing again.

I feel that the compressor for a true random stream is a true random generator. If I quickly show you a screen of black-and-white unpredictable noise, and ask you what it was, you'd compress/understand/recall that as "generate_noise()". I do not feel that this is lossy compression, for what did you lose? The ordering of a random file? Random files have no order to lose.

When you're running a reproducible science experiment, it's probably a good idea to include the actual data. For random noise this could be the generator and seed (plus some good hashes), but only if the experimenter actually used a generator and seed. If the experimenter just grabbed noise from somewhere and used it, you want the actual data as part of reproducibility.

>If Goldman wants his file to be random, any random file should do. Why does he want exactly his own random file?

Because not all random files are equally random? DUH! He had to ensure there's no bias in the distribution, use a good random generator, check for skewed data, etc...

AFAIK, information theory requires the _expected_ size of a 'compressed' file to be at least as large as the original.

So we could create an encoding that compresses N/50 of the N possible strings down to lg(N/50) = lg(N) - lg(50) bits each. That would save us lg(50) > 5 bits, with a 2% chance of the given file being one of them. In this game we effectively get 50 tries ($5000/$100), so we'd have a roughly 64% chance of winning at least once.

The correct price for this game is probably closer to $200.
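As a sanity check on that claim, a minimal sketch assuming 50 independent tries, each with a 2% chance that the file lands in our compressible subset:

```python
# Hypothetical expected-value sketch for the repeated challenge.
p_win = 1 / 50          # assumed chance a given random file is compressible by us
entry_fee = 100
prize = 5000

# Expected value of a single entry: at these odds the bet is exactly fair.
ev_single = p_win * prize - entry_fee
print(ev_single)

# Probability of at least one win across 50 independent tries.
p_at_least_one = 1 - (1 - p_win) ** 50
print(round(p_at_least_one, 3))   # ~0.636: a decent chance, not a sure thing
```

So even granting the 2% assumption, 50 tries only gets you to about a 64% chance of ever collecting.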

The problem is that even the most trivial linux-compatible decompressor will add a few bytes, and there's almost no chance of saving that many bytes. But sure it wouldn't hurt to require it be 100+ bytes smaller.

I think Patrick sums it up quite well. Perhaps the interesting thing is imagining how this would go down 15 years later in 2015: Patrick asks if the bet is available; it is. They enter into a 2-of-3 bitcoin transaction with 3rd-party escrow. Patrick and Mike sign the terms (probably written in pseudocode or python) using their sending bitcoin addresses (or GPG keys). Filesharing is an order-of-magnitude easier than setting up FTPs with personal IP addresses. The bet is promptly won by Patrick.

Ah, how things have changed in 14 years

They could have used an escrow service 15 years ago and the challenge terms could have been defined as a Python program since Python is 24 years old.

Getting someone with domain expertise on the matter to provide escrow is non-trivial. Escrow trust accounts are heavily regulated in most parts of the world, and require licensing, bonds and lawyers. I highly doubt they would find someone willing to go through all this. The cost for (legally) operating an escrow is probably higher than the entire bet... (he could also do this without licensing, and take on the legal risk, I guess. this might go under the radar for small things like that, but doesn't really work at scale.)

Bitcoin improves on that by not requiring a trust account - their trusted third party would simply hold one key in a 2-of-3 multi-signature scheme, giving him the authority to resolve disputes and adjudicate between them, but without holding any funds under his full control.

I find the legal implications of Bitcoin smart contracts very exciting - this significantly lowers the entry barriers for providing many kinds of financial services and opens up these markets for competition in a way that was simply impossible before. There's lots of room for innovation and disruption with that.

Disclaimer: standard IANAL/TINLA apply, but I'm the founder at a startup that facilitates exactly that (https://www.bitrated.com/) and received extensive legal guidance on the matter.

Bitcoin makes the technical process easier but doesn't help with the hard problem (as you said): finding a third party with domain expertise, whom they both trust, who is willing to adjudicate at very low cost.

It sounds like your startup is trying to solve that and create a "Trust Marketplace". Godspeed. Establishing trust between strangers is a very hard problem.

And while you may find a way to innovate around existing laws, new laws will be written.

> finding a third party with domain expertise, whom they both trust

I would say that in this specific instance, they could've quite easily agreed on a reputable user from the comp.compression newsgroup whom they both trust.

> adjudicate at very low cost.

It makes sense that arbitrators for niche markets with considerable domain expertise could charge a premium for their services. Say, $50-$150 for that specific case doesn't seem like a stretch.

Also, note that the fee could be charged only in case of dispute. (and indeed, many trust agents on Bitrated offer their services for free or nearly-free when there's no dispute and no work on their part. 0.1% base fee + 2% for disputes seems to be a popular fee structure.)

> Establishing trust between strangers is a very hard problem.

Indeed! This is the main problem we're trying to tackle, which is arguably much harder than providing the technological platform for payments.

> And while you may find a way to innovate around existing laws, new laws will be written.

The thing is, we aren't taking advantage of some "loophole" or anything like that. The escrow regulations exists for a reason and makes a lot of sense - holding funds on behalf of others should have strict regulations attached to it. Escrow providers are trusted to keep the funds safe from thieves, not to "run away" with them, and to resist the temptation to invest user funds to make a (potentially quite high) profit while holding them (which could result in losing them, even in relatively "safe" investments).

With multi-signature, none of that risk exists, and so it makes sense that the regulation won't either. Even when new laws gets written to address that, they're likely to be much less strict.

The legal situation with multi-signature is very similar to a binding arbitration clause, so we anticipate regulations to be based off of that (and arbitration is significantly less strictly regulated than escrow).

The Greeks could have kickstarted the industrial revolution early if they'd used Hero's engine. They just didn't think about the world in that way at that time.

"I still think I compressed the data in the original file. It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file."

I don't really agree with that, given the fact that he used the information about the size of the files.

Mike Goldman is now immortalized on the internet as someone who welches on bets...

It's not so clear cut. The filesystem really was acting as a "table" of index values for where to reinsert the characters that were removed.

Can someone explain why this is not possible? I understand why sending a decompressor beforehand is not possible for all inputs. I don't understand this formulation of the problem, where it only needs to work for one input that you get before you need to create the decompressor.

A simple explanation:

Compression exploits redundancy in a data stream. You take all the symbols (how you define a symbol varies with the compression method: it could be all letters in the case of text, or even text snippets that repeat, etc.) and re-encode them so that the ones that repeat the most take the least space (you also need to start from a basic dictionary known to all decompressors, or ship one with your compressed file).

One simple analogy is writing with abbreviations: if you write "e.g.", the reader either has to know what "e.g." means, or you have to state at the beginning that "e.g. = example" (and this also takes space).

Now, an ideal randomly generated file has all symbols appearing with the same frequency (we say the file has maximal entropy), hence you can't pick a symbol that repeats more often and make it take less space in your compressed file.
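To make that concrete, here's a rough sketch that estimates per-byte entropy from symbol frequencies; uniformly random bytes sit near the 8-bit maximum, leaving nothing for a frequency-based coder to exploit:

```python
import math
import os
from collections import Counter

def entropy_bits_per_byte(data):
    """Shannon entropy of the byte-frequency distribution, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# 1 MB of OS randomness: essentially 8 bits/byte, nothing to squeeze.
print(entropy_bits_per_byte(os.urandom(1 << 20)))
# Only two symbols, equally frequent: exactly 1 bit/byte, highly compressible.
print(entropy_bits_per_byte(b"aaaabbbb" * 1000))
```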

What if, instead of using your own dictionary, you use an index into an existing dictionary, such as an index into a subsequence of pi? Couldn't you then find a sequence of bytes in the file for which the index into pi takes fewer bytes, and replace it with the index? If you couldn't find any in pi, use e or another such number. What am I missing?

In this case your dictionary either doesn't have everything or to adequately point to it you take as much space as not using it.

While pi contains every pair of 2 digits, your index would take more space than storing the pairs themselves (because you might need to go beyond position 99)

For one situation you might "get lucky" and find a coincidence, but this won't scale generically

See also: https://github.com/philipl/pifs and https://github.com/philipl/pifs/issues/23

I admit this fooled me for a bit. Good news is, I won't be fooled again by something similar :)

In theory, I quite like the solution mentioned in the earlier threads: request a file that's a few kilobytes, then get two or three different hashes of the file, and write a "decompressor" that generates random files and checks the hashes.

It's just a shame that the heat death of the universe will probably occur before your program finishes.

Sorry, but no. There are far more files with a given hash than just the one, if the file is longer than the hash. And having multiple hashes doesn't help until the hashes exceed the length of the file.

The chances of getting a collision are higher than the chances of getting a good solution, but having multiple hashes does help increase the chance of a good solution. With a hash 1 bit shorter than the file, we put two pigeons inside one hole and have a 50% chance of picking the right pigeon. The fewer/smaller the hashes, the more we get "sorry, but no".

It's easier to just drop the final F bits from the N-bit input stream and, at decompression time, guess what they are than to go through this exercise of generating hashes that have N-F bits in total and hunt for bit streams having those hashes.

Also, I guess the hash algorithm could be specifically tested to work for this particular solution, couldn't it? (Although only after having spent the $100, of course.) The problem, I guess, is to have a script that takes up less space than the hashes do while still reproducing the particular data without a collision. Even with the smallest possible program or script, this probably isn't going to work.

You could maybe use something like pi-fs :) https://github.com/philipl/pifs

Yes, I had wanted to learn about multithreading in Python at the same time I had previously read this story. I built it, it worked, but took a reaaaaly long time, it was fun! https://github.com/abemassry/crazip

yeah, he should have been smart enough to spot what was coming when he was asked about multiple files... or at least asked some more directed questions than 'what do you think you have that will solve this problem'

Or insisted it was a single file, tar files allowed.

True. Maybe he was smugly thinking 'if he thinks using multiple files will help, he must be really dumb' :)

Tar files have a decent amount of overhead. Source: I wrote a streaming untarring library in C for a streaming video product. You would definitely add WAY more than 1 byte of overhead per file, which is what is required for this trick to work.

But it rules out other kinds of tricks like storing info in file names. If all the metadata is counted in the length of the tar file, these tricks don't stand a chance. There's way more than 1 byte of overhead per file in the file system and Mike needed a rule that counts all of them.

The file ordering contains information (done with filenames, comp.$i, but could use other file metadata).

Mike can escape his unthinking agreement to multiple files via a rule forbidding information in filenames.

Here's a riskier solution. Choose an arbitrarily large file size. Have the decompressor search the local file system for a file of that specific size and make a copy of it as output. This presumes he's going to have the uncompressed file on the system to verify the output of the decompressor. That may turn out to be a false assumption, but what if...

Or just have the decompressor log onto a server and download the original file. Only risk then is internet connection.

You should re-read this and imagine Goldman's messages being read in Vizzini's voice.

I remember a "fractal compression" hoax one time. It would compress a file ridiculously (like a 1 MB file down to 100 bytes) and decompress it flawlessly. Of course, it just moved the file somewhere else on the hard drive, created a "compressed file" full of junk, and restored the file on decompress. Good one...

If you want to hear some stories from the master of proposition bets, there's an old autobiographical article in SI by Titanic Thompson:


(The risk being that half of it is made up, but he definitely had a reputation for this sort of thing.)

"You might wonder why, if I was the best golfer in the world, like I say I was, I didn't turn pro and win all the championships? Well, you were liable to win a golf bag if you won a tournament in those days. A top pro wouldn't win as much in a year as I would in a week as a hustler. People would get to know a pro, and I wanted to keep my skill a secret as far as possible. I didn't care about championships. I wanted the cash."

I didn't see anything in the challenge that mentions run-time. I think a good compression challenge should specify a run-time limit. Theoretically, it would be possible to hash parts of the file and then brute-force the hashes in the decompressor. This would take a lot of time but would work.

There are an unbounded number of inputs which may result in the same hash digest. Saving just the hashes (digests) and then finding the preimages would not guarantee the same result. You would find collisions but not necessarily the right collision.

In fact, as the data being hashed increases in size, the amount of data required to identify which preimage is the correct preimage must be greater than or equal to the difference in size between the digest and the preimage.

For example, let's say you are using a perfect hash function with a 64-bit digest. If you feed data into the hash function in 64-bit chunks and then try to brute-force the results, each preimage you find will be the right one, and you can reassemble the original file, but you have exactly as many bytes as when you started!

Now let's say you feed in 65 bits to the perfect hash function, saving 1/65 of the space in the resulting list of digests. But unfortunately, there are two 65-bit preimages that result in each of your stored digests, so you need one extra bit to decide which is correct.

And so on...
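A toy illustration of that counting argument, using a made-up bijective 8-bit "hash" (not a real cryptographic one): applying an ideal 8-bit digest to 9-bit inputs leaves exactly two preimages per digest, so one disambiguating bit is always owed:

```python
from collections import Counter

def toy_hash(x):
    # Bijective on 8-bit values (167 is coprime to 256), so it behaves
    # like an ideal collision-free 8-bit digest for 8-bit inputs.
    return (x * 167 + 13) % 256

# Hash every 9-bit input and count preimages per digest.
preimages = Counter(toy_hash(x) for x in range(2 ** 9))
print(set(preimages.values()))   # {2}: two 9-bit preimages per 8-bit digest
```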

No it wouldn't work! Hashes do not defy information theory, they lose information. Such brute forcing would only find hash collisions and "decompress" to a different text than the original.

I endorse the creativity in your approach, but you are mistaken; this will not (deterministically) work.

Given a hash function that maps an input I of N arbitrary bits to a digest D of fixed size M bits, then for each output digest D_0 there will be on average 2^(N-M) inputs that hash to D_0. How will you tell which is the "right" one?

Except if the hash is smaller than the source data, then with a good hash, there will be multiple source datasets that hash to the same result, which makes your decompression program unreliable. You could well bruteforce the wrong answer.

Do you mean that there could be two 3 KB files with the same sha256 hash, and that the probability of hitting a collision is greater than the probability of finding the correct preimage? Let's divide the 3 MB file into 1000 parts, each ~3 KB in size. Take the sha256 hash of each of the 1000 parts, plus the sha256 hash of the whole file. These hashes total far less than the actual file and leave ample room for a decompressor. Now start brute-forcing, and assume we have unlimited computing power and time. Would the probability of a collision happening in all 1000 parts be very low then? Given a good hash function, could it be so low that we can disregard it? If 1000 is not enough, can we increase the splits to 2000 or 3000?

It's not that there might be a collision; collisions are guaranteed. How many 3 MB files are there? 2^(3×2^20×8) = 2^25165824. How many distinct sets of hashes are there? 2^(256×1001) = 2^256256.

By the pigeonhole principle, we can't fit 2^25165824 objects into 2^256256 holes; indeed each file will have on average 2^24909568 other files that share the same set of 1001 sha256 hashes.
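A quick check of the exponents quoted above, for a 3 MB file hashed as 1000 parts plus one whole-file sha256 digest:

```python
# Bits in a 3 MB file vs. bits in 1001 sha256 digests.
file_bits = 3 * 2 ** 20 * 8     # 3 MB in bits
hash_bits = 256 * 1001          # 1001 digests of 256 bits each

print(file_bits)                # 25165824
print(hash_bits)                # 256256
print(file_bits - hash_bits)    # 24909568: the exponent of the surplus
```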

The reason that we are able to work with hashes, and that compression works in practice, is that we don't have to deal with every 3M file. Most of these files are gibberish and will never be seen, and finding two files that match even one hash is incredibly difficult. But once we start talking about brute-forcing, we start encountering problems -- and having to dedicate an awful lot of processing power to the problem isn't the biggest one...

There is just one 3 MB file, and we divide that file into 1000 parts. I agree there will be collisions per hash (for each part). But I'm skeptical that all 1000 hashes will produce bits such that the reassembled file also collides on the hash of the original file (remember, we do have the hash of the original file). If the final hash doesn't match the hash of the original file, we recompute by randomly generating new bits for each of the 1000 file parts. Do you mean to say that collisions are guaranteed, and that a collision inside any file part will also produce a collision in the original file's hash when the parts are combined?

Not every collection of 1000 correctly hashed parts will make a correctly hashed whole, but there are an awful lot of different collections of parts that will hash correctly (2^24909824 permutations of them) and of those, one in 2^256 will also match the full-file hash.

Hashes suffer from the same pigeonhole problem as compressed data.

I'm confused about the challenge. Why wouldn't simply using gzip work? I must be missing something obvious.

If you're given perfectly random data, gzipping it will (almost) never reduce the size so much that you could fit the gunzip binary in the reduced space. In the extremely rare occurrence that the generated random data has, say, a repeated string longer than the gunzip binary, the challenger could be on guard for that and just regenerate random data until that's not the case.

To be more formal, the challenger is finding what he believes to be a string that is Kolmogorov-random, and betting (quite safely) that the challenged party can't prove him wrong.
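For intuition, a quick experiment (assuming Python's standard gzip module) showing that gzip makes random data larger, not smaller:

```python
import gzip
import os

# 1 MB of OS-provided randomness: gzip finds no redundancy to exploit,
# so header and block framing overhead make the output bigger.
data = os.urandom(1024 * 1024)
compressed = gzip.compress(data)
print(len(compressed) > len(data))   # True
```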


> that you could fit the gunzip binary in the reduced space

What's funny is that Patrick (the challenger) asked if a bash script consisting solely of a call to gunzip would suffice as the decompressor. In other words, all he had to do was compress the file by more than the amount of the script. Mike allowed it, knowing that even a tiny "decompressor" that really just called out to the real thing, would still be larger than the compression achievable on a well-crafted random blob.

GZIP would not apply very much compression, at all, to purely random data.

So the compressed data + gzip decompressor would very likely be a greater size than the original data.

I am actually surprised that Patrick did not compress the original data to 0 bytes by keeping all the data in filenames. That would be the ultimate troll ;)

I disagree. You could legitimately say he just stored the data in the metadata fields. But files have a size, even as a stream, completely regardless of metadata. I think this is a more clever hack.

Maybe, I don't know. One way or another you're storing some data (chunk ordering in Patrick's case, all data in my case) in file names.

>It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.

Even without a filesystem - just sending data over the wire - you have to be able to delimit files in some way, and there's going to be overhead associated with that.

Another way to think of this is that any particular volume could be viewed as a single big file. How much space in that big file is he taking up?

That just invites more ways of implicitly sharing and hiding data. Like, have each of 230 hosts have one part of the file. Instructions for running the program are "Go to %part1url. Download the file as `foo.1`. Go to %part2url. Download the file as `foo.2`." and so forth. That's exploiting the fact that the instructions for running the program aren't counted as part of the program size.


Did you skip the part where Mike agreed to new terms that would allow multiple files?

In terms of behavior, we know Mike acted in bad faith: before he saw the approach, he had agreed that the challenger could use multiple files. But once the challenger had posted them, he proceeded to download only a single file to verify its functionality, not touching the others. It shows bad faith on the part of Mike when he chose to ignore the other files.

By the way, in a theoretical sense Mike lost when he said he would allow multiple files and count their sizes: an empty directory [] is not the same as a directory of empty files [[][][]], which consists of 3 empty files. You can theoretically encode a file into just a bunch of 0-byte files, without using the order of the files or their names. He shouldn't have agreed to count only their sizes.

For the theoretical encoding into 0-sized files, you can simply interpret the input file as a binary number, and then create that many empty files.

This is not a practical solution of course - you can only compress two bytes down to 0-byte files this way, as 2^16-1 is already up to 65535 empty files. For three bytes it's up to 16,777,215 files.

If you wanted to store 9 bytes in unary as the number of empty files, you would need 2^72-1 = 4.7 sextillion (million quadrillion) files. Obviously that is not actually possible. But even 9 bytes is hardly enough to interpret the files as binary again. (Unless you can somehow get Mike to agree to the invocation - since the decompression program itself doesn't need to store any information and theoretically could be 0, 1 or 2 bytes.) But theoretically you don't need anything other than what Mike foolishly agreed to: allowing multiple output files counting their sizes.
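For concreteness, the blow-up of that unary empty-file encoding matches the numbers quoted above:

```python
# Storing n bytes as a count of empty files needs up to 2**(8*n) - 1 files.
for n in (1, 2, 3, 9):
    print(n, 2 ** (8 * n) - 1)
# 1 -> 255, 2 -> 65535, 3 -> 16777215, 9 -> ~4.7 sextillion files
```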


There is also another theoretical way to make money off of Mike, but it is not practical. (It doesn't work.) If we were not limited to bytes but could use bits, you could shave up to 5 bits off of every input file, if you figured out a way to decompress it by always prepending the bits 00000. (Theoretically there is only 1 pigeonhole to decompression, so you do not need to store any information in the decompression algorithm and it has no minimum size).

If Mike is using a random source for the files, this would result in a correct decompression in 1/32 of cases. But Mike is giving 49:1 odds (risk $100, get $5000), which is better than 31:1. So you could simply repeat the game with Mike thousands of times, always using the same decompression algorithm, until you have all of Mike's money. This works better if Mike is a computer, of course.

And it doesn't work at all on any actual systems, as a nibble is less than a byte and would not count as savings even if you could encode the decompression algorithm into a 0-byte (or up to 3 bit) invocation.

He also doesn't state that the output should be right on the first decompression try. Exploiting this, you could encode extra bits in the number of attempts: generate various wrong outputs, knowing that after (say) 1,000,000 tries he would get a correct decompressed file.

Does the decompressor have to be wholly hosted locally? Could it just be a shim that pulls a more complex program from the net?

There are grey areas here. Does a decompressor that depends on linked libraries count? Do things like libc count towards the total decompressor size?

I know this was written 14 years ago, but we had the net then, and shared libraries aren't exactly a new thing. Where do you draw the line? Can any decompression code call an external dependency and not be disqualified in the same way?

I'd say that using the filesystem to "hide" bytes is the least of the possible loop-holes with this challenge, if you were being pedantic about the rules.

> Does the decompressor have to be wholly hosted locally? Could it just be a shim that pulls a more complex program from the net?

The judge can disconnect his computer from the Internet, attempt to run the submitted decompressor and file, and declare failure when it doesn't work. Nothing in the challenge guarantees Internet connectivity on the machine.

> Does a decompressor that depends on linked libraries count? Do things like libc count towards the total decompressor size?

Nope. In the email exchange, Mike said that just a script that called out to gunzip would be fine, and that he'd only count the size of the script as the decompressor size.

> Where do you draw the line? Can any decompression code call an external dependency and not be disqualified in the same way?

Yup, as long as the dependency is already on the machine I suppose.

That's what's so enlightening about this challenge. Even a tiny, tiny script that calls to other programs still can't compress a "pathologically" random file by more than a few bytes.

>Yup, as long as the dependency is already on the machine I suppose.

That's exactly what I was getting at.

If gzip was allowed, then any common decompression utility should be allowed.

If that's the case, and said utility relies on an external library, should that library be included as part of the size of the decompression utility?

It's arguable that the "fabric" over which the data is delivered shouldn't matter. It shouldn't really matter whether the linked library comes from the same SSD as the executable or from a server on the other side of the world.

We currently define "the machine" as the internals of a box. Would this count if you were running the executable from a removable drive? If not, why does "network storage" trigger the disqualification, and not "USB storage"?

I understand the original premise of Mike's challenge, but considering a loophole is being discussed here, I'd like to know where people see that the boundaries of similar loopholes lie.

(2001) tag?

If you printed the files each on a sheet of paper, it would 'use more trees' than printing the original file (paper being the 'filesystem').

I agree that it shouldn't matter what the filesystem does to store it, if the rules state that file size is determined by a specific command to count all the inodes, then he lost. If the command is 'du' or 'wc', then he won.

My money is on Hooli winning this one.

I don't think it is 100% foolproof, even if no filesystem trickery is used. The random number generator used is likely not perfect and so the data should be compressible. I mean it would probably be quite difficult, but maybe possible.

From random.org, used as a source for the data:

> RANDOM.ORG offers true random numbers to anyone on the Internet. The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs.

Give me a file the size of the universe, and my decompressor will re-sim the universe until the point of the file creation date and read the atmospheric noise from that point.

Ah missed that. Yeah then it may be good enough randomness.

But yeah - it may well be not entirely random, and if someone actually managed to build a compressor based off any slight bias, I think they deserve the 5000 :-).

From the link I posted in my other comment, a thought experiment... Yes, some random files can be compressed by a given program, but not all random files. The proof is fairly simple, once you think it through:

Theorem: No program can compress without loss all files of size >= N bits, for any given integer N >= 0.

Proof: Assume that the program can compress without loss all files of size >= N bits. Compress with this program all the 2^N files which have exactly N bits. All compressed files have at most N-1 bits, so there are at most (2^N)-1 different compressed files [2^(N-1) files of size N-1, 2^(N-2) of size N-2, and so on, down to 1 file of size 0]. So at least two different input files must compress to the same output file. Hence the compression program cannot be lossless.
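The counting step of the proof can be checked exhaustively for small N:

```python
# For every N there are 2**N inputs of exactly N bits, but only
# 2**N - 1 binary strings of length 0..N-1 to compress them into.
for N in range(0, 20):
    shorter = sum(2 ** k for k in range(N))   # count of all shorter strings
    assert shorter == 2 ** N - 1              # always one short of enough
print("pigeonhole holds for all N checked")
```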

The challenge was to provide a decompressor for one file, not any or all files.

This proof of the impossibility of a compressor that can compress any file has been known for decades.

Yes, but you don't get to see the file until after you've sent in the $100. So in order for this to be a good bet, you'd need a method that can compress a random file with at least 2% ($100/5000) probability. That's still quite difficult.

An example of an algorithm that does compress 2% (actually 1/32 ~= 3%) of random files: just drop the initial five bits of the file, and replace them with zeros upon decompression. There's a 1/32 chance they were all zeros, in which case you win. The difficult part is encoding the decompression procedure into a program of less than five bits. :-)
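Enumerating all 16-bit inputs confirms the exact 1/32 success rate of that scheme:

```python
# The scheme wins exactly when the five leading bits are all zero.
n_bits = 16
wins = sum(1 for x in range(2 ** n_bits) if x >> (n_bits - 5) == 0)
print(wins / 2 ** n_bits)   # 0.03125, i.e. exactly 1/32
```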

>you'd need a method that can compress a random file with at least 2% ($100/5000) probability

No, you'd need to be able to create a custom algo for that file with that probability. Quite a difference.

The "custom algorithm" thing is a red herring. You'll need to use some method to come up with your custom algorithm, once you see the input file, and whatever method you choose is itself a compression procedure, targeted at a fixed decoder which is essentially a Bash shell with a C compiler (* ). If you intend to treat different inputs differently (e.g., "if it contains the string "AB", I'll use this algorithm, if it has more zeros than ones, I'll use this other one, etc"), this just means that your compression procedure contains some 'if' statements. The laws of information theory don't care whether your compression procedure is actually a computer program or just implicitly encoded into human actions; it's still impossible to reliably represent random data with fewer than the usual number of bits.

(* ) There's a little bit of leeway here because the challenge as stated isn't actually precise about the execution environment. You could probably sneak in a small number of illicit bits by negotiating out-of-band whether the decompressor is to be a C program, ELF binary, Perl script, etc. It's not obvious how to make this useful though.

> If you intend to treat different inputs differently (e.g., "if it contains the string "AB", I'll use this algorithm, if it has more zeros than ones, I'll use this other one, etc"), this just means that your compression procedure contains some 'if' statements. The laws of information theory don't care whether your compression procedure is actually a computer program or just implicitly encoded into human actions; it's still impossible to reliably represent random data with fewer than the usual number of bits.

While you're obviously correct about the information theory aspect of this, in this particular challenge, the contestant enjoys the privilege of pruning his decompressor to match just the one file. All the other "if" statements can be thrown away. So now a lossy algorithm can be used, that just happens not to be lossy on that one particular file.

I understand that the contestant gets to specify a "decompressor"; my point is that this is not information-theoretically relevant. The real decompressor in this challenge is the Linux/C runtime environment. The "decompression" program you send is really just part of the compressed payload, and whatever steps you, as a human, take to produce the compressed payload (encompassing both the code and data portions) are ultimately governed by information-theoretic limitations.

Another way to say this is that although you have the "privilege" of designing a custom decompressor for this one file, this is not really a benefit, since your custom decompressor counts against the space limit (in contrast to the typical information-theoretic setting, where the channel can have an arbitrary pre-specified decoder).

Yeah, it's not a benefit, except if the challenge-giver makes a mistake and offers a file that happens to have nicely compressible characteristics. Basically you're spending $100 on the hope that the file you're tasked with compressing just happens to have low enough entropy that it can be compressed by more than the decompressor's size.

Not a good bet to take.

> my point is that this is not information-theoretically relevant.

Exactly. This whole post is about the challenger trying to be clever with the rules and trying to sidestep the core information theory.

Continuing that logic:

You could get to 6% if you also do the same at the end of the file: there's a 1 in 16 chance that a random sequence of data will start or end with 5 zeros. The same trick can be applied to 1s, bringing it up to 12%. In other words, about 12% of sequences will either begin or end with five 1s or five 0s.

Since you can inspect the data before writing the decompressor, if you can figure out how to pad the beginning or end of a byte sequence with five 1s or five 0s in 5 bytes or fewer, your expected outcome is $625 (minus the $100 to play). Using 6 bytes gives an expected outcome of $312.50, and 7 bytes gives $156.25.
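A quick sanity check of the 12% and $625 figures above (a sketch: it assumes independent uniform bits and uses a union bound over the four rare events; the function names are mine, not from the thread):

```python
# Probability that a long random bit string begins or ends with a run of
# k identical bits, by a union bound over the four cases:
# start-with-0s, start-with-1s, end-with-0s, end-with-1s.
def win_probability(k):
    # each individual case has probability 2**-k
    return 4 * 2.0 ** -k

def expected_payout(k, prize=5000):
    return win_probability(k) * prize

print(win_probability(5))   # 0.125, i.e. the "12%"
print(expected_payout(5))   # 625.0
```

The union bound slightly overcounts (the events can co-occur), but for k = 5 the error is negligible at this precision.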


That's 7 bytes. It's not really an executable decompressor, but it does capture the steps required to decompress.

Unfortunately you're confusing bits and bytes (I did the same thing too at first). Fitting a decompressor in five bits is a lot harder. :-)

I think there are two types of communication we need to distinguish: pre-file, the communication we do before getting the file, and post-file, the communication we do after receiving it.

Pre-file we can communicate as much as we want without being penalized. We can say, for instance, that we want our bitstring to be interpreted as an ELF binary to be run on Ubuntu 14 with specific libraries installed. Or we can say that it will be a zip file. Or we can say that it should be pre-concatenated with 0s. We can specify hundreds of thousands of lines of code that should be compiled now and then later be executed with the bitstring as input.

Then we get the file, and now every bit we communicate is penalized. If we have asked for a Turing-complete format, we may now send a decoder specific to the file.

As pointed out, this mixes up bits and bytes, but I think the overall idea here is interesting. It is "cheating" because the implicit "start of file" and "EOF" markers let us forgo indexing into the data. This "cheat" is similar to the technique used in the article.

Similar cheats would be to leverage hardware instructions (imagine if x86 had a "pad with N zeros" instruction that fit in 7 bits). That's still cheating because the information is hiding in the architecture.

Sure. It's possible to win, and it's also possible to lose, depending on the specific file provided. This is much better than reformulating the problem into an impossible one (compressing any file, instead of this one specific file).

Actually even that is not feasible:

Assume that you split an m-bit string into a k-bit decompressor and l bits of input data. If k + l < m, that is, you actually compress, then there are only 2^(k+l) possible output strings for that split. And since

    sum_{i=0}^{m-1} 2^i = 2^m - 1

is less than the 2^m strings of length m, there exists at least one m-bit string which cannot be compressed by any algorithm.
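The counting argument can be checked by brute force on tiny strings; a minimal sketch in Python (over bits, matching the formula above):

```python
# Pigeonhole check: there are 2^m bit strings of length m, but only
# 2^m - 1 strings of any shorter length (including the empty string),
# so no injective "compressor" can shorten every length-m string.
def count_strings_shorter_than(m):
    # sum_{i=0}^{m-1} 2^i = 2^m - 1
    return sum(2 ** i for i in range(m))

m = 10
print(2 ** m, count_strings_shorter_than(m))  # 1024 1023
```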

Yes; as the challenge-giver, knowing which strings have this property is the hard part.

The problem wasn't well formulated, since it didn't clearly define what can or cannot be done to the decompressor based on knowledge of the output file.

But consider a more strongly formulated problem, where you must provide the decompressor and then a file is randomly selected to be put through it. As long as it's still a matter of compressing one particular file, it is statistically possible to win (especially at $5000:$100). The idea is that you can reinterpret data to make certain outputs cheaper and others more expensive.

It was well formulated; the whole point is that you can do whatever you want to the decompressor.

Aren't cryptographically secure PRNGs supposed to produce outputs indistinguishable from truly random numbers?

CSPRNG-generated sequences can be trivially compressed: you just have to know the seed and the length to reproduce the sequence. However, recovering the seed from the sequence alone is infeasible, though theoretically possible by brute force.
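As an illustration (a sketch using Python's non-cryptographic random module for simplicity; a CSPRNG seeded the same way behaves identically in principle):

```python
import random

# A PRNG sequence is "compressed" to just (seed, length).
SEED, LENGTH = 42, 10_000

def generate(seed, length):
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(length))

data = generate(SEED, LENGTH)
# The whole 10,000-byte sequence is reproducible from a few bytes of state:
assert data == generate(SEED, LENGTH)
print(len(data))  # 10000
```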

By Patrick's thinking, I could just store the last N bytes of each file in the filename! Surely he can see this isn't compression?

Can't he just send you a Kolmogorov-random file? The definition of randomness (in Kolmogorov sense) basically corresponds directly to his challenge.

Also, Kolmogorov-random sequences vastly outnumber non-random sequences in general, so with a long-enough file, I wonder how certain he can be that he has generated such a file.


Read the article: this is circumvention rather than compression, although if you define Kolmogorov-randomness appropriately, such a definition would work.

But modern OSes have non-determinism available: take a Kolmogorov-random file of size n which can be generated by a program of size n+1. Then replace part of that program (totalling k bits, and also delete 2 extra bits) with a call that draws k+2 random bits from the OS. With probability 2^-(k+2), the draw matches and you win.

How do you determine whether the file is Kolmogorov-random?

The only approach is to try a perfect Kolmogorov compressor, which doesn't exist (well, excepting brute force over the space of possible Turing machines).

I bet that with longer strings, the odds of drawing a Kolmogorov-random string are high enough that he's all but guaranteed to make money from his challenge.

Why wouldn't binary run-length encoding work here? E.g. "compressing" 11100110 to 30020?

Compression relies on entropy. There's not enough entropy in the random file for your run-length encoding to work.

I think the data is available so you can always try to beat the bet.

I may be completely misinformed, but I think you meant to say there's too much entropy.

A binary string of all ones followed by all zeroes has very low entropy, while a purely random binary string has high entropy. (I think. I'm skimming the Wikipedia article on entropy now)
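One way to see the difference concretely is to feed both kinds of string to an off-the-shelf compressor (a sketch using zlib as a stand-in entropy estimator; the variable names are mine):

```python
import random
import zlib

random.seed(0)
structured = b'\xff' * 500 + b'\x00' * 500                 # all ones, then all zeros
noise = bytes(random.getrandbits(8) for _ in range(1000))  # high-entropy bytes

# zlib squeezes the structured string down to a few dozen bytes but
# cannot shrink the random one at all (it even adds a little overhead).
print(len(zlib.compress(structured)))  # small, well under 100 bytes
print(len(zlib.compress(noise)))       # ~1000 bytes or slightly more
```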

Yes, sorry!!

How are you storing that 3 in 1s and 0s? ;)
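A quick experiment makes the point above concrete: in uniformly random bits the average run length is only about 2, so even a very cheap per-run length field already costs as much as the raw data (a sketch, not tied to any particular RLE format):

```python
import random

random.seed(1)
bits = [random.getrandbits(1) for _ in range(100_000)]

# Count maximal runs of identical bits.
runs = 1
for prev, cur in zip(bits, bits[1:]):
    if prev != cur:
        runs += 1

# Uniform random bits have expected run length 2, so ~50,000 runs here.
# Any per-run length field of 2+ bits already matches the raw size;
# RLE only pays off when runs are long.
print(runs, len(bits) / runs)
```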

See also section [9] of the comp.compression FAQ for more on the history of compression of random data:


