

An Exercise in Species Barcoding - rayvega
http://norvig.com/ibol.html

======
ced
_Aside: Dawkins's Information Challenge_

Yikes, I'm wary of contradicting Norvig, but using Lempel-Ziv (or any
traditional compression algo) is a terrible idea. By that measure, a string of
random bits would have much higher information content than a human genome of
the same length, since the genome is full of repetitive elements (SINEs and
LINEs).
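
To make that concrete, here's a rough Python sketch -- zlib's DEFLATE standing
in for Lempel-Ziv, and a repeated string as a caricature of a repeat-heavy
genome; the exact byte counts will vary:

    import os, zlib

    def lz_info(data):
        # Crude "information" score: size in bytes after compression
        return len(zlib.compress(data, 9))

    n = 100_000
    random_bytes = os.urandom(n)      # maximal-entropy noise
    genome_like = b"ACGT" * (n // 4)  # repeat-heavy stand-in for a genome

    print(lz_info(random_bytes))      # ~100,000: essentially incompressible
    print(lz_info(genome_like))       # a few hundred bytes: almost pure redundancy

By this score the noise "contains" hundreds of times more information than the
repetitive string, which is exactly the problem.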

The "amount of information in the genome" is a fundamentally bad concept.
There isn't any inherently useful information in there without the context of
the genotype->phenotype conversion.

As with programming, we could define the "size of the smallest possible DNA
sequence that would result in the same animal". That would be interesting to
measure. One day, maybe there will be DNA programmers just as there are C++
programmers now.

~~~
gjm11
That "fundamentally bad concept" is unfortunately a popular one with
creationists: they claim that mutations always destroy "information" and never
create it, or that there's a "law of conservation of information" that says
information can't ever be created other than by intelligent agents -- and they
are always curiously reluctant to say exactly what they mean by "information".

Norvig's little experiment shows (crudely and unreliably, to be sure) that,
_with the usual mathematical definition of "information"_, mutations
typically increase information. Of course this is old news, and won't be any
use in dealing with creationists because they aren't using the usual
mathematical definition of "information" in the first place. Or any particular
definition, for that matter.
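
The experiment is easy to reproduce in spirit, at least. A rough Python
sketch, with zlib standing in for whatever compressor Norvig actually used and
a repetitive string standing in for a genome:

    import random, zlib

    seq = bytearray(b"ACGT" * 25_000)           # redundant stand-in for a genome
    before = len(zlib.compress(bytes(seq), 9))

    for _ in range(1_000):                      # a thousand random point mutations
        seq[random.randrange(len(seq))] = random.choice(b"ACGT")

    after = len(zlib.compress(bytes(seq), 9))
    print(before, after)                        # "after" is almost always larger

The mutations destroy redundancy, so the string compresses less well, so its
"information" (in this sense) goes up.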

With that definition, there's nothing at all wrong with the fact that a string
of 2N random bits contains more "information" than a human genome with N
base-pairs. There's redundancy in the human genome, and (so far as we can
tell) a lot of repetitive junk, and that reduces the amount of "information"
it contains.

A maximally concise version of any substantial body of text, or genome, or
software, or whatever, "looks" random almost everywhere according to almost
any simple test: if it didn't, we could use that fact to compress it further.
This being HN, I'll add that this is one reason why I am skeptical of Paul
Graham's claim that "conciseness is power" in programming languages; a
_really_ maximally concise language would also be maximally incomprehensible.
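
(One quick way to see this: run a compressor on its own output. The second
pass gains essentially nothing, because the first pass's output already
"looks" random. A Python sketch:)

    import zlib

    data = b"tomorrow and tomorrow and tomorrow " * 3_000
    once = zlib.compress(data, 9)    # first pass squeezes out the redundancy
    twice = zlib.compress(once, 9)   # second pass finds nothing left to exploit

    print(len(data), len(once), len(twice))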

~~~
ced
_Norvig's little experiment shows..._

Norvig's experiment doesn't show anything at all. Flipping bits at random in
_any string at all_, be it DNA or Shakespeare, will increase the amount of
"information" towards the maximum, that is, towards a completely random string.

DNA is code. Imagine that I compressed mygame.cpp and mygame.lisp with LZW and
claimed: "Ah-ha! The C++ version is more complex because it compresses to more
bytes!"

And then I'd change a random character in the code and claim that the
information content had increased.

Nonsense!

~~~
gjm11
Not "any string at all". Do it to a maximal-entropy string (e.g., a genuinely
random one) and you won't see an increase.
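
Here's a quick Python sketch of that, with zlib as the compressor and
os.urandom supplying the maximal-entropy string:

    import os, random, zlib

    data = bytearray(os.urandom(100_000))       # already maximal-entropy
    before = len(zlib.compress(bytes(data), 9))

    for _ in range(1_000):                      # random point "mutations"
        data[random.randrange(len(data))] = random.randrange(256)

    after = len(zlib.compress(bytes(data), 9))
    print(before, after)                        # essentially identical: no headroom left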

You're using "information" in the colloquial sense, where random junk is not
information. Norvig is using it in the information-theoretic sense, where
random junk has more information than anything else of the same length. The
information-theoretic sense is not "nonsense"; it's just not the same as the
colloquial one.

(Motivation for the terminology: the "information" in a string is the minimal
number of bits -- i.e., the minimal amount of information -- it takes you to
tell me what the string is.)

------
icey
To be honest, I'm not really interested in the subject matter, but the article
is worth it for the section "Note: On Java Verbosity" and below, including the
comments (so far, at least).

~~~
jwilliams
There are probably a dozen easier Java answers to that problem... e.g., Java
has a String.split method.

