
Hutter Prize for Compressing Human Knowledge - cpeterso
http://www.hutter1.net/prize/
======
throwaway_yy2Di
Matt Mahoney has a neat page compiling text-compression benchmarks:

[http://mattmahoney.net/dc/text.html](http://mattmahoney.net/dc/text.html)

[https://news.ycombinator.com/item?id=6001489](https://news.ycombinator.com/item?id=6001489)

It's linked in the OP but easy to overlook.

Here's how the prize winner compares with generic tools on en-wiki text (from
Mahoney's page):

    
    
                  Compression
                  Ratio (in/out)  Time (ns/byte)
        decomp8   6.27            324,000
        xz -9 -e  4.72            2,482
        bzip2 -9  3.94            379
        gzip -9   3.10            101
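
For a rough idea of how those two columns are measured, here's a small sketch
using Python's standard-library codecs on an arbitrary sample file (this is
not Mahoney's en-wiki setup, and lzma's preset 9 is only an approximation of
xz -9 -e):

    import bz2, gzip, lzma, time
    
    # Rough sketch of measuring ratio (in/out) and time (ns/byte) with the
    # Python standard library; "sample.txt" is an arbitrary placeholder, not
    # the en-wiki benchmark data used on Mahoney's page.
    
    with open("sample.txt", "rb") as f:
        data = f.read()
    
    codecs = [
        ("gzip -9",  lambda d: gzip.compress(d, compresslevel=9)),
        ("bzip2 -9", lambda d: bz2.compress(d, compresslevel=9)),
        ("xz -9",    lambda d: lzma.compress(d, preset=9)),
    ]
    
    for name, compress in codecs:
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        print(f"{name:9s}  ratio {len(data) / len(out):5.2f}  "
              f"{elapsed / len(data) * 1e9:9.0f} ns/byte")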

------
Geee
This competition is flawed in the context of artificial intelligence. Most of
the information in the Wikipedia text is in the language, not in the 'actual'
knowledge. It would be a much more interesting competition if it allowed the
decompressed text to say the same things in different words, although there's
then the problem of how to verify that it says the same thing.

~~~
Houshalter
Compression is closely related to intelligence. In order to compress something
you have to understand the pattern behind it. The better you can learn the
pattern, the better you can compress.

~~~
psb217
Lossy compression, yes. Lossless compression, not so much. And, the parent
post seems to have been hinting at this. I.e. being able to usefully "tl;dr"
all of Wikipedia would require a rather intelligent system. But,
recapitulating all of Wikipedia's content word-for-word and pixel-for-pixel
(including page layout for the truly pedantic) would not meaningfully test the
sort of abstract pattern recognition that you seem to be confusing with
lossless compression.

In other words, though often conflated, information and meaning are not
equivalent. "Meaning", as people typically use the word, seems to include some
concept of the utility of a bit of information. Most information in typical
text/images/sounds/etc. is really quite useless. Being able to discard the
useless while keeping the useful actually seems like one of the key traits
exhibited by intelligent systems, and this ability is not measured (and
perhaps even selected against) by lossless compression performance.

~~~
Houshalter
Lossless compression may not be as _useful_ as a summarizing algorithm, but it
is definitely a test of intelligence. Lossless compression requires
identifying the pattern that produced an input as perfectly as possible. If
you have an algorithm that can predict the next letter with high probability,
you only need a few bits to store a string of text. And making good
predictions on real-world data is a good test of AI and machine learning.
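
As a minimal sketch of that link (with a toy model, not a serious predictor):
an ideal entropy coder spends about -log2(p) bits on a letter the model
predicted with probability p, so a predictor that is usually nearly right
makes the whole string cost only a few bits per letter.

    from math import log2
    
    ALPHABET = "abcdefghijklmnopqrstuvwxyz "
    
    def toy_predictor(context, letter):
        # Order-0 model with add-one smoothing: letter frequencies seen so
        # far in the context (purely illustrative, not a strong model).
        counts = {c: 1 for c in ALPHABET}
        for c in context:
            if c in counts:
                counts[c] += 1
        return counts[letter] / sum(counts.values())
    
    def ideal_code_length(text, predict):
        # Sum of -log2(p) over the model's probability for each actual letter.
        return sum(-log2(predict(text[:i], ch)) for i, ch in enumerate(text))
    
    text = "the better the predictions the fewer the bits"
    print(ideal_code_length(text, toy_predictor) / len(text), "bits per letter")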

~~~
psb217
"Lossless compression requires identifying the pattern that produced an input
as perfectly as possible": no, it requires identifying it absolutely
perfectly. This includes all vacuous information as well. E.g., in the
Wikipedia example, if there were a 2d scatter plot of a sample from a
bivariate uniform distribution, lossless compression would require
"memorizing" all of the plotted points.

Predicting perfectly is much, much different from predicting well. Machine
learning is about the latter, while lossless compression is about the former.

~~~
Houshalter
This is not true. If you have a good predictor, you only need a few bits to
store a piece of information. One way is just to record the places where your
prediction is wrong. The ideal way would be to split the set of possible
sequences so that exactly half of the probability mass falls on each side;
every bit then tells you which half to go down.

So instead of using 64 bits to specify the x y coordinates of every point on
the plot, you could use a much smaller number to represent how far it
diverges from its predicted location. You narrow down the possible locations
of the point by half with each bit, so you only need bits to specify the
likely possibilities, not all of them.
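
Here's a toy sketch of that halving idea (essentially arithmetic coding),
with an assumed fixed three-symbol model rather than a learned predictor:
each symbol narrows a probability interval, and the output is just enough
bits to name a binary fraction inside the final interval, so likely symbols
cost fewer bits.

    from fractions import Fraction
    
    # Assumed toy model: symbol probabilities must sum to 1.
    MODEL = [("a", Fraction(1, 2)), ("b", Fraction(1, 4)), ("c", Fraction(1, 4))]
    
    def encode(text, model=MODEL):
        # Narrow the interval [low, high) once per symbol.
        low, high = Fraction(0), Fraction(1)
        for ch in text:
            span = high - low
            cum = Fraction(0)
            for sym, p in model:
                if sym == ch:
                    low, high = low + span * cum, low + span * (cum + p)
                    break
                cum += p
        # Follow the binary expansion of the interval's midpoint until the
        # emitted prefix pins down a sub-interval inside [low, high).
        target = (low + high) / 2
        bits, value, step = [], Fraction(0), Fraction(1)
        while not (low <= value and value + step <= high):
            step /= 2
            if value + step <= target:
                bits.append("1")
                value += step
            else:
                bits.append("0")
        return "".join(bits)
    
    print(encode("aaab"))   # likely symbols -> "00010", 5 bits (1+1+1+2 ideal)
    print(encode("cbcb"))   # less likely symbols -> more bits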

------
wglb
Looks like a fun challenge. However, I must say that it reminds me of the
story Ms Fnd in a Lbry:
[http://en.wikipedia.org/wiki/MS_Fnd_in_a_Lbry](http://en.wikipedia.org/wiki/MS_Fnd_in_a_Lbry)

~~~
javajosh
Forgetting is a valuable skill.

------
dalek2point3
It's going to be hard to beat "the inventor", who's won it a few times
already...

[http://www.linkedin.com/in/theinventor](http://www.linkedin.com/in/theinventor)

------
johnwatson11218
Would it be considered cheating if you found a pseudo-random generator that
took a small seed and expanded into the executable of the current first-place
winner?

~~~
nabla9
No. It would just mean that you are a God.

I mean, if there were a God and he wanted to prove that he really is all
powerful, he would have to be able to pull off that kind of trick to convince me.

~~~
Eliezer
I think that's a bit strict. The Judeo-Christian God, if functioning exactly
as advertised, shouldn't be able to do that.

~~~
tedks
That depends on the advertising you decide to accept as valid; I'm sure none
of my Christian friends would agree with you, assuming I could explain the
scenario to them.

~~~
gatehouse
If you take it for granted that the premise of the question is an analogy for
the creation of the universe, and that the consequence of the ability is
knowledge of all that occurs, then that isn't how omniscience was explained to
me: a key aspect of Catholicism is free will, and such an ability would let
God sort out the sinners without running the "experiment", viz. the universe.

I have to mention that this is only according to my meagre and disinterested
understanding.

------
TerraHertz
Some other problems with the prize task as defined:

* Wikipedia is light on images (due to copyright issues, I suppose), but the wider challenge of compressing 'human knowledge' really has to deal with the vast masses of paper books, which are fundamentally images (before they are OCR'd and compressed). Also, since this is really an exercise in preservation of the historical record, you really want to preserve the exact original visual appearance, blemishes and all, which means the true image must be retained along with the OCR text. Additionally there are other recording media - photos, film, etc. - to capture.

* When you include analog original formats (of which ink on paper is one) then the whole idea of 'lossless' compression is moot. What you're really after is compression with no _perceptible_ content loss.

* Which means that the issue of 'acceptable quality' is crucial. And deciding what is acceptable quality loss for different forms of source material is something that will require very good AI.

For instance, images printed with a mix of offset-printed screening
(those tiny dot patterns) and solid ink edges (text, hard lines, etc.) are very
difficult to compress. You can't just blur everything to reduce the screening
dots to an even gradient, because that ruins the edge definition of the text
and lines. You can't scan at a lower or similar resolution to the dots, since
that produces horrible moire patterning. You can't scan and save the image at
a high enough resolution to capture the dots exactly, since that makes the
filesize HUGE.

So your AI has to actually 'understand' the image to some extent, and smooth
out just the screened areas, with precise but hard to identify mask edges. If
you've ever done this by hand in photoshop, you'll know how hard it is.
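
To make that concrete, the naive version of that idea might look like the
rough, untested OpenCV sketch below (the filename and all thresholds are
placeholders) - and the reason it's hard is that the dot pattern itself
triggers the edge detector, so the mask needs far more care than this:

    import cv2
    import numpy as np
    
    # Load the scanned page as grayscale ("scan.png" is a placeholder name).
    page = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
    
    # Heavy blur flattens the offset-screening dot pattern into smooth shading.
    smoothed = cv2.GaussianBlur(page, (9, 9), 0)
    
    # Edge map marks text and hard lines; dilating it protects a small margin
    # around each stroke from the blur.
    edges = cv2.Canny(page, 50, 150)
    protect = cv2.dilate(edges, np.ones((5, 5), np.uint8)) > 0
    
    # Composite: original pixels where edges were found, blurred elsewhere.
    descreened = np.where(protect, page, smoothed)
    cv2.imwrite("descreened.png", descreened)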

That is actually a problem I'm stuck on with some historic technical manuals
I'm trying to make digital copies of atm. If anyone knows of an automated way
to blur offset screening (but not other printed elements on the page) to
evenly shaded tones, please say so. See:
[http://everist.org/archives/scans/query/image_processing_que...](http://everist.org/archives/scans/query/image_processing_query.htm)

Another question: are there any freeware compression utils that can do the
RARbook trick? I.e. pass them a file.jpg, and they'll ignore the type extension
and just scan through the file for a valid archive header, then unpack from
there on, as WinRar does. So frustrating that WinZip, ALzip etc. don't do this.
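
For ZIP payloads at least, that scan-for-a-header behaviour is easy to sketch
with Python's standard library (the filename is a placeholder, and a real RAR
file would need its own signature plus an unrar tool instead of zipfile):

    import io
    import zipfile
    
    # Ignore the extension: scan the raw bytes for a ZIP local-file-header
    # signature and unpack from that offset onward.
    with open("file.jpg", "rb") as f:
        data = f.read()
    
    offset = data.find(b"PK\x03\x04")      # ZIP local-file-header magic
    if offset == -1:
        raise SystemExit("no embedded ZIP archive found")
    
    with zipfile.ZipFile(io.BytesIO(data[offset:])) as archive:
        archive.extractall("unpacked")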

------
TTPrograms
Can anyone confirm that this prize is still alive?

