The $5000 Compression Challenge (patrickcraig.co.uk)
104 points by jpablo on Oct 5, 2012 | 119 comments



I wonder if Mike Goldman ever made a response after everything was published. I found it very distasteful when he threatened to accuse Patrick of fraud.

He should have recognized that the rules were vaguely stated and at the very least offered an apology. Flatly refusing to download all of Patrick's files, when he had expressly accepted multiple files as being OK, was kind of weird.

The Google Groups link on the page no longer works, so I wonder if it's archived elsewhere.


> when he threatened to accuse Patrick of fraud

Exactly. He was the one committing the fraud.

To quote Mike Goldman (fta): "I cannot fathom that someone would read the comp.compression faq first and then want to participate after understanding the futility of the effort. On the other hand, some people just don't understand information theory too well."

He took $100 from this guy knowing he was defrauding him! It was like watching a nerd game of three-card monte.


Exactly, I consider Mike to be acting in bad faith here. Mike was willing to accept $100 (which in most of the world constitutes many hours of labor) not just once, but every time people wanted to play again. He was willing to accept real money for something that he believed was literally impossible. Mike didn't hint at the impossibility in any way. E.g. if he had said "Have you read the FAQ, and are you still sure you want to play?" I would view it in a different light. But no, Mike seemed all too eager to accept the money.

So when a taker appears, he's either not very bright (and therefore you shouldn't take advantage of him) or he's trying to out-hustle you (and playing legalese with the terms is then fair game). After Mike got beaten at his own game, he proceeded to (a) weasel out of it, (b) threaten legal action, and (c) attempt to deflect.

Not exactly a class act by Mike.


> The Google Groups link on the page no longer works, so I wonder if it's archived elsewhere.

Here you go: https://groups.google.com/d/msg/comp.compression/sSC04FEoAqE...


The rules weren't vague, but they relied on a rational (as opposed to legalistic) understanding of what it means to compress data.


He said he would accept multiple files; he did not however agree to change the rules of the game, which unambiguously required "a decompressor and a compressed file."

A compressed file, as in a single file, which has been compressed.

Patrick returned roughly 200 files, none of which were compressed. Thus, he did not complete the challenge, technically or legally.

EDIT: This is the crucial part: the files were not compressed by gzip. They were merely collected into a single archive (no compression), and some of the data from each file was moved into the file's name (no compression). IOW, the archive size of each chunk of data was the same as the filesystem size.


Your rejection based on the rule change makes no sense. Mike very clearly agreed to a change in the rules of the game. "can I send you a [de]compressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file." "Sure"

Also your rejection based on the files not being compressed is not going to work. I can get around it in only a few seconds of effort:

Let's say I have a theory that gzip works more efficiently when you don't have too many 1 bits in a row. I make a script to extract all the 0xFF bytes and compress each chunk (I might have to strip headers or use a compression algorithm that's better at dealing with random data without expansion, but that's not a huge hurdle). Is this compression somehow not good enough? It's definitely a meaningful transform of the original data, with the chunks coming out wildly differently.

In fact allowing both a decompressor and a compressed file is a bad idea in the first place. If you get a sufficiently compact language and a large file, you can achieve compression on any file by the original wording. Split the data between the two files; this gives you some free bits (roughly log2(n) of them, as the sketch below quantifies) based on where you split. Then make your decompressor, as part of some process subjectively qualifying as 'decompression', incorporate these extra bits.

Mike should have simply said "a .tar with the compressed file and decompressor, which is smaller than the original file".
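A quick back-of-the-envelope sketch of those free bits (my own illustration; the 3,000,000-byte size is the example file size used elsewhere in the thread):

    import math

    n = 3_000_000   # size of the original file in bytes
    # Choosing where to split the data into two pieces is itself information:
    # there are n + 1 possible split points, i.e. about log2(n + 1) "free" bits.
    print(math.log2(n + 1))   # ~21.5 bits, only a couple of bytes at best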


Just to clarify, the guy made a contest that he knew, statistically, could not be won except if the random data file he produced just happened to be compressible (due to luck alone).

In other words, he knowingly accepted money in a contest of chance not a contest of skill. So this contest (assuming either participant was in the US at the time) violated federal and state laws against unlicensed gambling operations, correct? [Assuming Mike Goldman did not possess a license to operate an online casino in 2001]


surprised all the comments here are about whether he achieved compression or not. Point = Missed!

the entire thing was an ingenious trolling exercise, as he says, to 'out trick the trickster'. his goal was to exploit the loose wording of the challenge, and prove a point (possibly winning $5k in the process); I found it an entertaining story along those lines.

it proves nothing (new) about compression, or lack of it; he even states that the consensus at the time was 'unanimous' that no compression had taken place. that's not where the interest in this link/story lies.

thanks for the original link, OP.


The argument is over whether the challenge is identical with its spirit or with its statement. Patrick clearly failed the spirit of the challenge. He tried to exploit a loophole in the statement of the challenge. So which is the actual challenge?

Actually, it doesn't matter. Patrick failed to complete the challenge as stated, which is important because he was exploiting a loophole in the statement. Technically, his files were smaller. But technically, he was required to send Mike his files and decompressor. Instead, he put them on his server and sent Mike their location. Mike then had to get the files. It goes both ways.

Finding loopholes, while clever, isn't terribly interesting.


> But technically, he was required to send Mike his files and decompressor. Instead, he put them on his server and sent Mike their location. Mike then had to get the files. It goes both ways.

In that case Patrick could just argue he hadn't sent the files at all (since Mike refused to accept them by refusing to download them) and that therefore the bet was still on, so he can still send the files by email or whatever.


Look at the relevant part of the exchange:

> > I meant can I send you a compressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file

> > Patrick

> Sure -- but you send me a decompressor, I don't need the compressor.

Was this "compressor" vs. "decompressor" actually a slip up, or intentional to distract from the importance of what he was asking?


I read through the compression FAQ and found it quite interesting. I had never heard the counting argument before, but it's very intuitive.

Part of the FAQ is a collection of crazies claiming that they have created novel compression techniques. One entry is about Jules Gilbert, but it ends abruptly. I googled him and found that the crazy continues....

>I am afraid that I am not a fan of Google. They are gay-friendly and also they suppressed lot's of stuff about Obama before he was nominated, and again, before he was elected.

http://scientopia.org/blogs/goodmath/2010/08/13/969/


Obviously Patrick met the requirements of the loosely worded challenge.

Though I wonder if the inverse would be acceptable. Take 100 files all the same size, but smaller than the filesystem block size. These files total n*100 bytes, but take up blocksize*100 bytes on the disk. Then, using tar (which itself does not compress, just tacks files end to end with metadata in between), 'compress' them into a single file that takes up fewer disk blocks than before.
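A small experiment along those lines (my own sketch; the file counts and sizes are arbitrary, and st_blocks is Unix-specific):

    import os, tarfile, tempfile

    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(100):                     # 100 small files of 100 random bytes each
            p = os.path.join(d, f"f{i:03d}")
            with open(p, "wb") as f:
                f.write(os.urandom(100))
            paths.append(p)
        apparent = sum(os.path.getsize(p) for p in paths)
        on_disk = sum(os.stat(p).st_blocks * 512 for p in paths)  # st_blocks is in 512-byte units
        archive = os.path.join(d, "all.tar")
        with tarfile.open(archive, "w") as t:                     # plain tar, no compression
            for p in paths:
                t.add(p, arcname=os.path.basename(p))
        tarred = os.stat(archive).st_blocks * 512
    print(apparent, on_disk, tarred)
    # e.g. 10000 bytes apparent, ~409600 bytes allocated as 100 separate files on a
    # 4 KiB-block filesystem, while the single .tar typically occupies far fewer blocks.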


I don't think he completed the challenge. He found a loophole, so _maybe_ you could say he was successful: successful in exploiting the rules, that is, not in solving the problem that Goldman stated. Still a clever solution, but obviously not what was asked for.


Every technique is "just a loophole" when it's first discovered.


But he didn't actually succeed in getting the desired result; the problem was that the statement of the problem was flawed. There obviously is a difference between what Goldman meant and what he said.


I'm sure Rivest, Shamir, and Adleman would disagree. So would Lempel and Ziv.


Perhaps a stupid question, but is the challenge not also solvable with 'honest' comp. sci.? The way I think about it at the moment is that I take a family of functions, C_k, which map the space of n-element strings to the space of strings of length 1 to 2n (the union of the m-element strings for 1 <= m <= 2n) in an essentially random way. The output then has length in [1, n - sizeof(C_k) - 1] with some probability p, and if I try long enough, I find a compression scheme which compresses the given input.


No, it is definitely not possible. If it were, you could do it over and over again and compress any file down to an arbitrarily small size. Two things to search for are "pigeonhole principle" and "Kolmogorov complexity."

Essentially the problem is that any encoding of a string either preserves the length of all possible inputs, or shortens some inputs and lengthens others, or lengthens all inputs. There is no encoding that shortens all strings because of the pigeonhole principle. To apply this to the challenge, see that the challenge is to encode the string as a program that when executed produces the string. Just because the encoding language is Turing complete does not mean that it can evade the pigeonhole principle.
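A brute-force illustration of that counting argument for tiny strings (my own sketch, not from the thread):

    from itertools import product

    def all_bitstrings(length):
        return ["".join(bits) for bits in product("01", repeat=length)]

    n = 8
    n_bit = len(all_bitstrings(n))                           # 2^8 = 256 strings of length 8
    shorter = sum(len(all_bitstrings(k)) for k in range(n))  # 2^8 - 1 = 255 strictly shorter strings
    print(n_bit, shorter)   # 256 255: any lossless encoder must fail to shorten at least one of them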


But the challenge isn't to compress random strings, the challenge is to compress a specific random string. This means that given enough time, you should be able to come up with an algorithm or transformation that turns that specific string into something of lower entropy. I'm not saying this is very feasible, but it doesn't violate the pigeonhole principle.

For example, there is a fixed-size algorithm to generate the digits of pi, even though pi's digits look statistically random (an ordinary compressor can't shrink them).


Yes it does. Let's say for simplicity that you are coding your solution in Python, that your output is written to standard out, and that the output file should be 10^6 bytes. Consider the function that maps a source file of at most 10^6 bytes to the standard output of that program when interpreted. The set from which the file you need to output is drawn has cardinality 2^(10^6). The set of (possibly invalid) Python programs that are strictly shorter has cardinality 2^(10^6-1). If every file could be compressed, then the Python interpreter would provide a surjective map (one program gives at most one output, possibly none if the program has errors) from the smaller set onto the larger set. This is impossible. Therefore there is at least one file (actually a lot of them, but this is beside the point) which cannot be compressed. Mike could send this one.

EDIT: Messed up my bits, bytes and fenceposts: the first set has cardinality 256^(10^6), the second sum (i=1 to 10^6) 256^(10^6-i).


I really don't understand the mechanics of when a reply link appears.

Anyway:

> An interesting result. Do you have a link to the proof?

This is quite simple to understand. Just count how many files there are with size < n bytes (call this set A) versus size = n bytes (set B). |A| = sum (i=1 to n) 256^(n-i) ~= 256^(n-1); |B| = 256^n. If you start pairing files from A up with files from B (equivalent to a compression scheme), only about 1/256 of the files from B will be paired up. The remaining 255/256 of the files won't be paired up, meaning they are not mapped to a smaller file and cannot be compressed.


Praptak:

> Only assuming he'd be able to generate one. Is there a known algorithm for generating a provably Kolmogorov-uncompressible string?

In the theoretical case he has almost a 255/256 chance of doing it if he generates a string in a truly random fashion. In the practical case his odds are much better: many potential programs will (a) refuse to run and (b) give the same output.


> In the theoretical case he has almost a 255/256 chance of doing it if he generates a string in a truly random fashion.

An interesting result. Do you have a link to the proof?


Proof:

There are k strings of bytelength n.

There are k * 256 strings of bytelength n+1.

k input strings can decompress to at most k output strings

Therefore only k / (k * 256) = 1/256 of the length-(n+1) strings can be compressed by a byte.

The reason for 'almost' is the minuscule number of strings that will compress by more than one byte.


Ah right, I totally missed the fact that compression by 1 bit doesn't beat the challenge!


> Mike could send this one.

Only assuming he'd be able to generate one. Is there a known algorithm for generating a provably Kolmogorov-uncompressible string?


That's a tricky question. I'll start with a couple ground limitations.

First off, you can't have an algorithm to make incompressible strings of an arbitrary length, because then you could tell it to make a string larger than your algorithm, aka larger than its own complexity.

You also can't measure the complexity of arbitrary strings or you could loop over all possible strings and use it to build a generator.

But if you use the rules of your execution environment you can make sufficiently small incompressible strings. For example, if we used Python, "abc" can't be compressed. There just aren't any operations that we can use to shrink it. But "abcabc" could be written as "abc"*2.

Any algorithm you use is going to have to depend highly on whatever environment you choose, and it won't be able to produce very long strings, so I don't know if anyone has really bothered.

In comparison, it's really easy to use random numbers such that the chance of being able to remove n bits is 1/2^n. If you have even a few bytes of overhead, as shell scripting has, you're safe from luck.


> But the challenge isn't to compress random strings, the challenge is to compress a specific random string.

I agree, and I think that this challenge is vulnerable: very difficult, but not impossible, at least not via the pigeonhole principle.

To be clear, the pigeonhole principle says that for a given compression algo and inputs of size n, the output file must get bigger than n for some inputs of size n, i.e. it cannot compress all possible permutations of size n.

However, the way the challenge is set up, it is saying: given a specific random file of size n, find an algo that can compress the file so that the output plus decompressor is smaller than the input file. The pigeonhole principle does not say this is impossible. Granted, finding the algo that does this may mean solving P=NP or may take 10^32 lifetimes of the universe in computation time, but in principle it is not impossible.

I admit, my understanding may be wrong and would really appreciate it if someone can explain to me my error here.


How exactly does the pigeonhole principle help here? It only proves that you cannot compress all inputs.

For the challenge to be unbeatable you need to prove something much stronger - namely that for each file size most of the strings are random in the Kolmogorov sense. Has this been proven?

Edit: The above is too strict - obviously it would be sufficient to prove that for any size you can effectively compute a Kolmogorov-uncompressible string of this size. Anyway my point is that such proof would not be trivial.


I think I agree with you, please see my reply to unconed above; http://news.ycombinator.com/item?id=4618495

I'd love to know if I am wrong.


I think the reason most people don't see this intuitively is that the pigeonhole-principle explanation doesn't deal with the mechanism of compression they have in mind. It's just a general statement of impossibility. To give them intuition about it, you would need to show why exactly the compression approach they have in mind won't work.


I believe the pigeonhole principle does not really apply here, since we are looking for an algorithm that encodes just one file. For example, suppose we have two program-plus-input pairs, (C, n) and (D, n), which can both be encoded in n-1 bits, and (C, n) maps into the set of n-bit strings with a leading 0 while (D, n) maps into the set with a leading 1. Then, counting the number of bits, I end up with 2^(n-1) possible states for each program but 2^n for the output. Effectively one bit is hidden in the choice of C or D.


I think anyone would agree that if you sent me an uncompressed file of size 1000 bytes and I sent you a decompressor of size 400 bytes and a compressed file of size 599 bytes, I would have met the challenge in spite of the fact that these two files will take up more disk space than the original file.

I don't agree: since the compressor can be tuned to the data, you could just move some of the data into the "decompressor" and end up with an arbitrarily small "compressed" file.


Read again: The sum total of the file sizes of the "decompressor" and "compressed file" had to be less than the original file. So moving data into the decompressor would do nothing.

He did "win" on a technicality: The challenge did not count the size of the filesystem metadata towards the sum, so he hid information there.

Also tuning the decompressor to the specific data-set was entirely allowed, even intended. Otherwise it would be obvious that the challenge is impossible, in this way it depends on a sufficiently random data set.


I'm responding specifically to the part that I quoted, which would have allowed the compressor + compressed data to be larger than the original data.


The part you quoted doesn't say that.

Regardless, it boils down to this: Do you feel the rules of the challenge forbid shifting an arbitrary amount of information from the compressed file into the decompressor?


Shifting from the compressed file to the decompressor is totally allowed. What is not allowed is shifting it into the filesystem. For example, he could have named the decompressor file with the first 512 bytes of data and then had the decompressor concatenate the filename with the rest of the data. However, the filesystem is still storing more data than it was before.


The part you quoted says that decompressor + compressed data must be smaller than the original data. You can make the compressor arbitrarily large, but it will not allow you to "cheat" in the way you stated (by moving data into the decompressor), because you'd also have to increase the decompressor size for that.


I agree with this. The key question is whether any additional information is stored in the file name, lengths, number of files, etc.

If you could hold 400 bytes of executable code in memory without any additional information, feed it 599 bytes of memory and it outputs 1000 bytes - that's compression.

Anything else is just an abstraction around this.

His solution does not work because he is storing extra information in the lengths of the files, i.e., where to split the files and write a "5" byte. Preserving this in memory would require extra storage.


This has been done! You can use the BARF utility to compress any file by at least one byte... assuming your filesystem allows filenames long enough to handle the encoding :) http://cs.fit.edu/~mmahoney/compression/barf.html


This. Reading to the end, I was left with the impression that a better test would have the contestant compress 3 different data files with only one decompressor, and on top of that show a similar level of effectiveness for each.


So what? The challenge was for the total size of decompressor + compressed file to be less than the original file.


Since he offers $5000 but you only need to pay $100 to participate, it is quite easy to win money with the challenge. Ask for a file of 2 bits in length. If it is "11" then compress it to "1". Otherwise compress it to "0X" where X is the contents of the file. Now you have a 1 in 4 chance of winning the challenge. Since you pay $100 but win $5000, this earns good money. Plus it would be hard to argue that this is cheating.

The problem is that you need to supply the decompressor as well. It would depend on how exactly he wants the decompressor to be encoded, but the same strategy may still apply with modifications (although it would not work if the required encoding is a bash script, since -lg(100/5000), about 5.6 bits, is all you have to play with). [Or you could ask him to modify the challenge such that you are allowed to send him an encrypted decompressor before he sends you the data, which doesn't count towards the total file size (encrypted because otherwise he'll see your strategy and send you a file that it doesn't work on).]

tl;dr: although you can't compress random data with probability > 1/2, you can compress random data with probability > 100/5000.
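A quick sanity check of that strategy's expected value (my own sketch; the $100 fee and $5000 prize are from the thread, and this ignores the decompressor-size problem noted above):

    # Strategy from the comment above: ask for a 2-bit file and win iff it is "11".
    p_win = 1 / 4
    entry_fee, prize = 100, 5000
    print(p_win * prize - entry_fee)   # expected profit: 1150.0 dollars per attempt
    print(entry_fee / prize)           # break-even win probability: 0.02, i.e. 1 in 50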


That doesn't work. You have to include the size of the decompressor.


Yeah, so you could win a different incorrectly posed challenge from the original incorrectly posed challenge.


As I said, you can also win the original challenge with this method. The original challenge is not fully specified. In particular it is not specified how the decompressor is to be encoded. There are encodings for the decompressor for which the strategy that I described does work, and encodings for which it does not work.

The interesting attribute of this method is that it does genuinely compress the file: it doesn't store data in hidden attributes like the method of the OP (which stores information in the file lengths).


I wonder what the probability is that a file is not compressible.

Even more, I wonder if the likelihood that a file is compressible increases, decreases, or is constant with relation to file size.

My assumption would be that 50% of files are not compressible at a given size, and that compressibility doesn't vary with file size.


> My assumption would be that 50% of files are not compressible at a given size, and that compressibility doesn't vary with file size.

This would mean that, say, for a 2 MB random file there's a 75% chance that at least one of its halves is compressible, but only a 50% chance that the file as a whole is compressible. Not impossible, but suspicious if you ask me.


Ah, but let me go through the possibilities.

Neither side compressed: okay the file is the original length, let's move on

Both sides compressed: okay you split it in half and decompress each side, you found the lucky one in four two-bit savings

One side compressed: uh oh, you don't know which half compressed. If you want to decompress, you're going to need extra info to know where the first half ends.

You could add an extra bit to store which side compressed but then you lose the compression. You could decide to only accept situations where side A compressed, but then you're back to only half the files compressing. etc. There is no way to take advantage of splitting random data into variable-sized chunks because storing the boundaries needs just as many bits.


No, it would mean for a 2mb random file, there's a 50% chance that at least one of its halves is compressible.


Do you also state that there's only a 50% chance of getting at least one tails when throwing two (edit: unbiased) coins?


The bigger the file, the larger the chance that it has been generated in a pseudorandom way, meaning it is in theory compressible. On the flip side, in a larger file, finding the patterns arising from that pseudorandomness could take longer.



I guess I'm missing something here :-)

  ls -la /bin/tar
  230464 Feb 25  2010 /bin/tar
If tar can compress an arbitrary-length file by more than 230464 bytes (its own size) and decompress it again, then doesn't it meet the challenge?

Even if I've misunderstood, can anyone explain in plain english why this is a futile task?


You can't compress any arbitrary file, but there are lots of files you can compress.

Proof for the first part (aka the Pigeonhole Principle): look at the universe of files that are 100 bits long. Apply your compressor to each one, so that you have reduced the length of each file to 99 bits. But the number of 99-bit long files is half the number of 100-bit long files, which means that for some of your compressed files, you can't tell exactly which 100-bit long file it corresponds to.

Now, you can perform run length encoding on some files. That looks for runs of repeated bits, replaces them with a special marker, and writes a dictionary to translate the marker to the run. You can look for patterns. This will work... on some files. Not on all.

A good compressor will try several different encoding schemes to do its work, and is going to succeed on files with embedded patterns. For instance, if the file is ASCII text, lots of bit patterns will never occur, and lots will occur repeatedly. If the file is English ASCII text, even more redundancy can be exploited. If the file is a structured data set, there's almost certainly a bunch of repeating patterns.

But for purely random data, there won't be enough patterns repeated enough to exploit.
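A toy run-length encoder along those lines (my own illustrative sketch, not from the thread): it wins hugely on repetitive input and expands random input.

    import os

    def rle_encode(data: bytes) -> bytes:
        # Encode as (count, byte) pairs; run lengths are capped at 255.
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out += bytes([run, data[i]])
            i += run
        return bytes(out)

    repetitive = b"A" * 1000
    random_data = os.urandom(1000)
    print(len(rle_encode(repetitive)))    # 8 bytes: a huge win
    print(len(rle_encode(random_data)))   # ~2000 bytes: expansion, as expected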


Lossless compression works by exploiting redundancy in the data.

If I want to compress a text file that contains the word 'uncharacteristically' 1000 times, I can simply replace every occurrence with a certain short unused byte sequence and record the mapping in the header.

Similarly in an image file, if there is a block of 50 pixels that are all the same color, I can replace that block with a special byte sequence and map the coordinates and color in the header.

But if I have truly random data, there are no patterns to exploit. The shortest means of representing random data is the data itself.

This is provable using information theory (which in a way is a branch of thermodynamics).
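A minimal sketch of that dictionary substitution (my own illustration; the marker byte and helper names are made up):

    def dict_compress(text: bytes, word: bytes, marker: bytes = b"\x01") -> bytes:
        # Assumes the marker byte never occurs in the input.
        assert marker not in text
        header = word + b"\n"                      # record the mapping in a tiny header
        return header + text.replace(word, marker)

    def dict_decompress(blob: bytes, marker: bytes = b"\x01") -> bytes:
        header, body = blob.split(b"\n", 1)
        return body.replace(marker, header)

    text = b"uncharacteristically " * 1000
    packed = dict_compress(text, b"uncharacteristically")
    assert dict_decompress(packed) == text
    print(len(text), len(packed))                  # 21000 vs 2021 bytes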


nitpick: There is no such thing as a truly random piece of data. "Truly random" is only meaningful in the limit as the file size goes to infinity. The string "000000" could be generated by a truly random process just as well as any other.


I meant Kolmogorov randomness which is defined via incompressibility.


Fair enough, but Kolmogorov randomness is an existence concept, not a constructive concept, so I find it weird to talk about having such a string in this case where a real person sets about to generate a concrete file.

Can you generate a Kolmogorov random file on your particular Unix system? That requires specifying what "your Unix system" is rather rigorously, which seems unpleasant.


It's the contest organizer who gets to choose the file contents. Random noise is practically non-compressible.

Btw, tar by itself does not compress at all.


"Random noise is practically non-compressible."

It has to be actually random though. I was hoping the solution to this would be that Patrick had figured out that Mike used a specific random number generation engine with a specific seed, and thus his decompressor was just a reverse-engineered version of the original data generator and the file it "decompressed" was 0 bytes long.


It says the source was random.org - probably the closest thing to 'true' random you're going to find freely available.


Here's a very simple example of compression. It explains what other people have said about using redundancy to achieve compression.

(https://sites.google.com/site/usingaiforbettercompression/)

(And then it has a few quotes from different places about why AI is linked to compression.)

(Google sites is sub-optimal, and when I've finished it I'll host it elsewhere.)

(Also, submitted it here. (http://news.ycombinator.com/item?id=4616442))

EDIT: I'll just paste it here. I need to work on language and tone, apologies for being patronising.

Imagine we have a language with just 4 letters: A, B, C, and D. In this language the frequency probabilities are A = 0.5, B = 0.3, C = 0.15, and D = 0.05

A short sentence in this language may look something like this, after we strip out the spaces. There are 20 characters.

    ACABBAABABAACBAADCAB
Now let's represent those letters with carefully chosen numerical values.

    A =   0
    B =  10
    C = 110
    D = 111
Our sentence becomes (first with spaces, and then without)

    A C   A B  B  A A B  A B  A A C   B  A A D   C   A B
    0 110 0 10 10 0 0 10 0 10 0 0 110 10 0 0 111 110 0 10

    0110010100010010001101000111110010
These values were chosen because they are unambiguous, even when you remove the spaces. You cannot turn the above collection of 1's and 0's into anything other than the correct sequence.

This uses 34 characters to represent our sentence. But there's a further step.

We create another table.

    00 = q
    01 = w
    10 = z
    11 = x
And again we represent the sentence with the new letters.

    0110010100010010001101000111110010

    01 10 01 01 00 01 00 10 00 11 01 00 01 11 11 00 10
    w  z  w  w  q  w  q  z  q  x  w  q  w  x  x  q  z

    wzwwqwqzqxwqwxxqz
This is 17 characters. We have saved 3 characters. Obviously this isn't great, but it's just an example to demonstrate the process. You can see from this example why compressing random data can be hard. One feature of randomness is an equal distribution of each possible character. We cannot use redundancy to create our dictionary tables.
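For the curious, a small Python sketch of those same two steps (my own code, using the tables above), which round-trips the example sentence:

    # Prefix code from the example above: shorter codes for the more frequent letters.
    CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
    PACK = {"00": "q", "01": "w", "10": "z", "11": "x"}

    def encode(sentence):
        bits = "".join(CODE[ch] for ch in sentence)
        bits += "0" * (len(bits) % 2)   # pad to an even length (harmless for this example)
        return "".join(PACK[bits[i:i + 2]] for i in range(0, len(bits), 2))

    def decode(packed):
        unpack = {v: k for k, v in PACK.items()}
        rev = {v: k for k, v in CODE.items()}
        bits = "".join(unpack[ch] for ch in packed)
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in rev:              # the code is prefix-free, so this is unambiguous
                out.append(rev[buf])
                buf = ""
        return "".join(out)

    packed = encode("ACABBAABABAACBAADCAB")
    print(packed, len(packed))          # wzwwqwqzqxwqwxxqz 17
    assert decode(packed) == "ACABBAABABAACBAADCAB"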


Generally, you can't compress every n-bit file into an (n-1) bit program because there are twice as many n-bit files as (n-1) bit programs. So the average n-bit file is not compressible.

On the other hand, there are clearly some files that are highly compressible, and many that are at least a little bit compressible, so it's not as impossible as the challenge issuer made it sound. I'm not sure how you'd calculate this, but if more than 1 in 50 random files were somewhat compressible then it would be a bad bet to issue.


If you were able to trim just 1 byte off an arbitrary file by running it through a compressor, then you could reduce any file to a single byte by repeatedly running it through the same compressor. Clearly, that's not possible.


Actually it is possible - the problem is that you will move the information into your compressor.

I don't know how "universal" this compressor would be though...


If you just have to decompress a single file, I don't see why this problem is impossible since you can tune your decompressor to that file.

For example, files containing the digits of pi or e will have a uniform distribution of bytes (and are therefore incompressible by exploiting redundancy) but it is very easy to write a "decompressor" for them as there is a simple algorithm to obtain these numbers.

Is there a simple algorithm for every possible number? Because then it just becomes a question of finding the algorithm that generates that number. My intuition tells me "yes" but that it may be very hard to find it. Plus, you don't need to find an algorithm for the whole file, just some subsection of it.


Think about it as identifying patterns. You can always remove chunks of the original file and then simply re-add them with an instruction like: "Add pattern 23989898989 at the 234th, 387th and 1028th positions of this file." If your pattern is big relative to the file size and repeats often, you will save space. However, if the file is truly random, the larger the pattern, the less frequently you will see it. Very large patterns will be seen only once. Very small patterns will require more information to encode their place than you're saving by removing them from the file.


Yes, I see why a repetition/pattern approach wouldn't work, but I'm proposing something else: That the file may be generated by an algorithm, in the way one generates the digits of pi using an algorithm.

It's similar to how if I suspected he had generated his random file with a Mersenne Twister I could simply guess his seed and reproduce the file from it. (Though, I realize now this is an NP-hard problem.)


>If you just have to decompress a single file, I don't see why this problem is impossible since you can tune your decompressor to that file.

It isn't guaranteed to be impossible; it's just very, very likely to be impossible.


It may be that finding that algorithm is theoretically possible but impractical. I'm going to guess that for a random bit stream that it's impractical and would require huge computing resources, cost, and time.


Tsk tsk... hackers and a coding challenge are the last things you should expect to be devoid of loopholes and tricks. Always be extra careful making them, and don't bet more than you can afford, or you'll come out with a rep like Mike.


    import hashlib

    def decompress(filesize, sha1hash, openingbits, closingbits):
        # bits() is a hypothetical enumerator of every candidate file of the given
        # size that starts with openingbits; brute-force the rest.
        for candidate in bits(filesize, openingbits):
            if hashlib.sha1(candidate).digest() == sha1hash:
                if candidate[-len(closingbits):] == closingbits:
                    return candidate


That'll only take about 10^585 years, assuming you can do a billion hashes a second. Good luck!


Nobody said it would need to be quick.


I thought of using hashes too, but then I realized that after thousands of years, the program is only going to output a collision, and not the original data.

Checking the first and last few bits reduces the number of collisions, but there are still 2^unchecked_bits / 2^160 candidates which map to the same hash. You could store the number of collisions to skip along with the hash, but that number will be about (unchecked_bits - 161) bits long, so that won't work either.


Adding MD5 (or whichever other hashing function), I have a feeling (but no direct proof) that it's not just equivalent to adding bits to the SHA hash. Can anyone with a better sense for hashing than me explain why my intuition is wrong?


Whether you have two 128-bit hashes or one 256-bit hash, you're still trying to fit 2^n pigeons into (2^128)*(2^128) = 2^256 pigeonholes.

Maybe your intuition is telling you that two hashes must be more secure because if one of the two hash functions turns out to be easy to break, then at least you will still be protected by the other?


No, what I was thinking was more like coverage. For any given hash, you do NOT have the same coverage for the other hash function. Let's make a bucket of all the originating bits of length n that will deliver a collision for one hash function. You are not guaranteed that the results of another hashing function has a uniform distribution. I still think I am wrong but you did not convince me.


Can't make the indentation behave. Don't know why.


Indent everything with four spaces:

    like so


Somebody please enlighten me. Is it that hard to find a pattern (for example, Fibonacci positions) that only has zeros up to F(100), and then remove those zeros?

You don't have to start from the first bit; you can start from bit 11234 (information stored in the decompressor).

So, for example, I skip 11134 bits, then move +F(i) bits to the right and add a zero, then move +F(i+1) bits to the right and add a new zero, etc. The pattern has to be just long enough to cover the decompressor's size, which shouldn't be too much if scripting is allowed!

I believe that discovering such a pattern shouldn't be too hard in a single, arbitrarily large file (though 3 MB should be enough too).


You are welcome to try! This really is an application of number theory, and the likelihood of there being a sufficiently long pattern in truly random data is actually pretty low.


This brings up some cool questions about compression. Can any experts help me out here?

How many n-bit strings can be compressed? (I think this question is how many have a Kolmogorov complexity of less than n.)

Also, what is the difference between the statements "no program can compress every string" and "not every string can be compressed by some program". I think both are true, but they're referring to different strings, so it's kind of odd....


> Also, what is the difference between the statements "no program can compress every string" and "not every string can be compressed by some program".

The first is true, and you can show this with the pigeonhole principle. The second isn't true. The reason for this is that you can put different information in each program.

To use the example from recursive below:

> You could do it like this: If the first bit is a 0, remove it.

There's a program that'll work for half of the strings. For the other half, you could use the program "If the first bit is a 1, remove it.".


At most half of them.

You could do it like this: If the first bit is a 0, remove it.

It could never be more than half, by the pigeonhole principle.


Can anyone explain what this section is talking about? I'm completely not following it:

    cat $1
    printf "X"
    cat $0
    exit


If you asked for a big enough file of random bytes, eventually, chances are that the file would just happen to contain the substring "Xcat $1\nprintf X\ncat $0\nexit\n". You then do the following:

1. Copy the part of the file preceding that substring to a file named "foo".

2. Take the rest of the file, including all of the substring EXCEPT for the initial X character, and put it in an executable file called "decompressor".

When you run "./decompressor foo > out", you will then recreate the original file. The length of decompressor + foo will be one byte shorter than the original file.


Say I have an original file: AAAAAAAAXAAAAAAA

If I split it into two files, $1 and $2, that look like this:

    $1 - AAAAAAAA
    $2 - AAAAAAA

I have now "saved" one byte by dropping the X.

The script "cat $1 printf "X" cat $0 exit" will combine those two files and reinsert the X, thus "decompressing" the data. Basically, he has replaced the byte "X", which counted towards the total size, with the EOF marker on the filesystem, which technically didn't count.

No space was actually saved, because he basically moved the information from the X character to metadata (the file splits).


but the script is longer than the one byte that he saves, right? So the total is greater than the original file.

That's where I'm getting confused.


He saved one byte per split, so with a couple of hundred files the savings more than cover the short script.


Suppose the random file is 3,000,000 bytes long and happens to contain 135 "X" bytes. Split the file on the letter "X" into 136 smaller files, each with no X in it. Then, return the files along with a "decompressor" script which says something like:

    return(join("X", files 1 to 136))
Since the script takes up fewer bytes than the 136 "X" bytes would, the total size of the 136 files + decompressor script is less than the size of the original file.
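A minimal Python sketch of that split-and-rejoin trick (my own illustration; the chunk file names, sizes, and the use of Python instead of a shell script are assumptions). The byte savings come entirely from encoding the split positions in the filesystem:

    import os, tempfile

    def split_on_marker(data: bytes, directory: str, marker: bytes = b"X"):
        # Each marker byte is dropped; its position is now encoded by a file boundary.
        for i, chunk in enumerate(data.split(marker)):
            with open(os.path.join(directory, f"chunk{i:05d}"), "wb") as f:
                f.write(chunk)

    def rejoin(directory: str, marker: bytes = b"X") -> bytes:
        names = sorted(os.listdir(directory))
        return marker.join(open(os.path.join(directory, n), "rb").read() for n in names)

    data = os.urandom(30_000)                  # small stand-in for the 3,000,000-byte file
    with tempfile.TemporaryDirectory() as d:
        split_on_marker(data, d)
        total = sum(os.path.getsize(os.path.join(d, n)) for n in os.listdir(d))
        assert rejoin(d) == data
        print(len(data) - total)               # equals the number of "X" bytes dropped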


What he should've done is just create a new file whose name is a text encoding of the data in the original file but whose contents are only 1 byte long. Then to decompress, just take the name of the file and use that to create the data.


If you read the linked page, you'll see that he considered this, but dismissed it as cheating.

Also, in the end he did hide information in the file metadata, just in a much more subtle way.


Let's all send him 100$, jackass.


This was the day that Mike learned the difference between theory and practice: In theory, there is no difference between theory and practice; but in practice, there is.

When I read the challenge, and saw that arbitrary Unix programs were allowed, my first idea for a compressor was to scan the OS software for snippets of data that match the source data.

Of course, that was 10 years ago. Today, in the Cloud world, I'd just write a decompressor that connects to my system over the network and retrieves the source data.


You can be pretty sure that you wouldn't find the target in the OS software. For a well constructed output file, that has about the same likelihood as a randomly generated cryptographic hash collision.

As for your cloud solution, that could be defeated by running the decompressor offline.


You wouldn't find the entire target, but you would find files in the operating system that contain subsets of the target file.


Yes, you would find some matches with an exhaustive search, but they'd be pretty small and now you have to encode the path to get those bytes back in a form that is smaller than the byte run that matches. And good luck with that.


Sigh. Claude Shannon is rolling in his grave.


Why is this an invalid approach? If you have an agreed-upon OS installation, you've got at least several hundred megabytes of binary data from which you can extract sequences for a lookup dictionary. To compress, you locate largest substrings of the input that also appear somewhere in the OS data. For example, you might find that bytes 1000-2000 of the input data are the same as bytes 4528-5528 of /bin/ls. Then you just need a program that can extract this byte range from /bin/ls and fill it in at the appropriate place.

Of course, it isn't a given that you will be able to find a sufficiently large overlapping substring in the OS binary data. And it may not be allowed to assume an agreed-upon OS installation, although it was OK to assume that gunzip was available. And finally, doing this sort of largest-overlapping-substring search could be very very slow.

The real key here is that the challenge is about compressing one specific random file. It's clearly impossible to write a compressor that can represent ANY sequence of N bytes in fewer than N bytes including the decompressor. But this challenge is constrained to reconstructing a single input file, not handling arbitrary inputs. If the input files are random, and you keep asking for them, there's a tiny chance you'll get one with enough repetition that even gzip could compress it. If you can key against a large set of reference data, your odds improve significantly.

So, I don't know if this is a practical approach given the restrictions of the contest, the cost of entering, and the expected payoff, but I would say that if it is plausible to use a large body of OS files for lookup, this could be a winnable challenge.


>Why is this an invalid approach?

Because of information theory. It's believed that the binary expansion of pi contains all possible finite bit sequences. A program that expands pi is relatively small. Assuming that hypothesis is true, does that mean we could write a compressor that simply breaks input into blocks and then encodes the output as indexes into pi?

And the answer is that we certainly could write that program. And the result would almost always be a larger output than input. Why? Because you'd have to search so far into pi that the index would contain more bits than the input blocks. In short, having a library of arbitrary strings to draw from does not help you to compress data.

More generally, compression is impossible in the general case. Every lossless compression algorithm will, on average, produce an equal or larger output than its input.
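A small experiment along these lines (my own sketch; the sizes are arbitrary stand-ins for an OS's binaries and a slice of the challenge file): the longest match you find in a random 'library' is far too short to pay for the offset and length needed to reference it.

    import os

    library = os.urandom(100_000)        # stand-in for "OS binary data"
    target = os.urandom(256)             # stand-in for a slice of the challenge file

    def longest_match(target: bytes, library: bytes) -> int:
        # Length of the longest substring of target that appears anywhere in library.
        k = 0
        while k < len(target):
            grams = {library[i:i + k + 1] for i in range(len(library) - k)}
            if not any(target[i:i + k + 1] in grams for i in range(len(target) - k)):
                break
            k += 1
        return k

    # Referencing a match needs ~3 bytes of offset into a 100 KB library plus a length
    # byte, so a match would have to be longer than ~4 bytes to save anything.
    print(longest_match(target, library))   # typically around 3: no savings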


> Of course, it isn't a given that you will be able to find a sufficiently large overlapping substring in the OS binary data.

It seems that it's a given that you will not find that substring. The larger it is, the less likely you will find it and the less likely it will repeat in the uncompressed file.

http://news.ycombinator.com/item?id=4617983


Unlikely but not impossible. Note that the substring does NOT need to repeat in the uncompressed file; it just needs to be possible to say "copy bytes X through Y of /bin/ls here" in fewer than Y-X bytes. I'd love to know what the odds really are, but it seems like you have some good options for improving them (e.g. ask for a really huge input file).


> it just needs to be possible to say "copy bytes X through Y of /bin/ls here" in fewer than Y-X bytes

It is very unlikely that the space required to store X will be smaller than the length of the sequence you find. Your plan is similar to saying "The input could contain long repeated segments containing only zeroes. Compress those using scheme Z. The longer the file, the more likely this is to occur." Information theory allows us to prove that this cannot always work. In fact it is very unlikely to work for a specific randomly generated target file.


So by saying 'cannot _always_ work' and 'very unlikely to work', you are implicitly conceding the point: that it is possible. The question then becomes, given the size of the OS environment space (and the quality of the randomness in the originals), could one in 50 attempts actually succeed in shaving off a byte? I sincerely doubt that Mike, despite his superior knowledge of information theory, actually took the time to figure out just how good or bad the odds were. He assumed that it was fundamentally impossible, when it is in fact not impossible, just unlikely (to some unknown [to me] degree).


Let me just put it this way. It's more likely that you could win using gzip. It's similar to saying it's "possible" that the target file could have contained all zeroes.

Assuming the target file is competently constructed, the chances of winning wouldn't make it a rational bet for $0.01


I'm so sorry you had to suffer under these postings.


> you might find that bytes 1000-2000

On average, even if you have the specific file to work with, encoding those numbers, namely 1000 and 2000, will cost at least as much space as the length of the sequence you find. So it would be possible to do this, but the sequences would be so short, and the offsets so large, that you'd end up losing.


> Why is this an invalid approach?

One way to structure a compressed file is as a lookup dictionary. To decompress the file, you'll need to agree on which file is to be decompressed. If you need to have an agreed OS installation to act as a lookup dictionary before decompression can happen, by all rights you should include the size of the OS in your total of how large the compressed file is.

If you want to use an arbitrary OS, the "decompressor" would need to be able to identify the correct fragments. If you can do that in less space than the original files, you've already managed to compress the data!


>When I read the challenge, and saw that arbitrary Unix programs were allowed, my first idea for a compressor was to scan the OS software for snippets of data that match the source data.

No, this would not work. It is very simple to see this. There are more files of size n than there are of all smaller sizes combined. A decompressor would therefore have to be a surjective function from a smaller set onto a larger one, which is impossible. This applies no matter how you construct your (de)compressor. You can have it access word lists, files on the hard drive, etc. There will be at least one file it won't be able to compress and then decompress.


I was slightly vague in my original post. I didn't mean "match the source data completely", I meant "match a part of the source data", as a standard rsync/Exchange-style block deduplicator would do.

Per the challenge, it need not work on every file, only the one the challenger chose to provide. If the challenger is not suspecting this attack, he won't know to make sure to pick one of the many source files that are resistant to the attack.

Of course the compression is impossible in general, if the task is sufficiently rigorously described, which it wasn't in this case.


A randomly generated file will in a crushing majority of cases not be compressible in this way.


This was a great social engineering hack: The original challenge, poorly stated as it was, was not beaten. Craig started up a conversation with Mike and tricked him into changing the rules. If Mike had been careful, he would have paused to think about whether the proposed rule change would open the door to a hack.

It's a reminder to all of us that correctness is an ongoing process, not a delivered product.

Mike started the thread as a smart-alec Randi-esque elder, putting up money to shut up cranks. But he got blinded by his hubris.


The original rules would have been beatable too. Just put the last 200 bytes in the filename.


The rules were certainly unclear on that point.

Mike didn't guarantee to preserve the file metadata. It seems eminently defensible for him to say that he was dealing in files, not filesystems, so he could rename the file upon receipt. In fact, the EOF-hack in Craig's solution turned on the assumption that the filesystem could be hacked by the challengee, so it seems fair to allow the challenger to hack the filesystem as well.


Mike wrote a document (the challenge) and offered a $5000 bug bounty. Patrick found a bug; Mike should have paid up and fixed the bug in the spec, just like Knuth pays for typos and Google pays for security bug reports.




