Yikes, I'm wary of contradicting Norvig, but using Lempel-Ziv (or any traditional compression algo) is a terrible idea. By that measure, a string of random bits would have much higher information content than a human genome of the same length (re: SINEs and LINEs)
The "amount of information in the genome" is a fundamentally bad concept. There isn't any inherently useful information in there without the context of the genotype->phenotype conversion.
Similar to programming, we could define a "size of the smallest possible DNA sequence that would result in the same animal". That would be interesting to measure. One day, maybe there will be DNA programmers just as there are C++ programmers now.
That "fundamentally bad concept" is unfortunately a popular one with creationists: they claim that mutations always destroy "information" and never create it, or that there's a "law of conservation of information" that says information can't ever be created other than by intelligent agents -- and they are always curiously reluctant to say exactly what they mean by "information".
Norvig's little experiment shows (crudely and unreliably, to be sure) that, with the usual mathematical definition of "information", mutations typically increase information. Of course this is old news, and won't be any use in dealing with creationists because they aren't using the usual mathematical definition of "information" in the first place. Or any particular definition, for that matter.
With that definition, there's nothing at all wrong with the fact that a string of 2N random bits contains more "information" than the human genome with N base-pairs. There's redundancy and repetitive (so far as we can tell) junk in the human genome, and that reduces the amount of "information" it contains.
A maximally concise version of any substantial body of text, or genome, or software, or whatever, "looks" random almost everywhere according to almost any simple test; because if it didn't, we could use the fact to compress it further. This being HN, I'll add that this is one reason why I am skeptical of Paul Graham's claim that "conciseness is power" in programming languages; a really maximally concise language would also be maximally incomprehensible.
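A quick way to see this is to run a compressor over its own output. This is only a rough sketch, using zlib as the "simple test" (any general-purpose compressor would do):

    import zlib

    text = b"To be, or not to be, that is the question. " * 200
    once = zlib.compress(text, 9)
    twice = zlib.compress(once, 9)
    # The compressed output already looks random to zlib, so a second
    # pass gains nothing (it usually grows a bit from header overhead).
    print(len(text), len(once), len(twice))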
Norvig's experiment doesn't show anything at all. Flipping bits at random in any string whatsoever, be it DNA or Shakespeare, will push the amount of "information" towards the maximum, which is a completely random string.
DNA is code. Imagine that I compressed mygame.cpp and mygame.lisp with LZW and claimed: "Ah-ha! The C++ version is more complex because it has more bytes!"
And then I'd change a random character in the code, and claim that the information content has increased.
Not "any string at all". Do it to a maximal-entropy string (e.g., a genuinely random one) and you won't see an increase.
You're using "information" in the colloquial sense, where random junk is not information. Norvig is using it in the information-theoretic sense, where random junk has more information than anything else of the same length. The information-theoretic sense is not "nonsense"; it's just not the same as the colloquial one.
(Motivation for the terminology: the "information" in a string is the minimal number of bits -- i.e., the minimal amount of information -- it takes you to tell me what the string is.)
I think one of the limitations of the traditional formulation of information theory is that the entropy of a signal doesn't obviously tell you how much useful information is in the signal.
This paper provides an interesting perspective on the issue: "The Information Bottleneck Method" by Tishby, Pereira, and Bialek.
The "amount of information in the genome" is a concept whose only purpose cast doubt on evolution, that is why it is a bad concept.
The mechanism through which the random mutations become inherently useful is the process of natural selection.
The refutation of the challenge is simply that random mutations do add extra information.
The consequence is that, in the rare case where such extra complexity is beneficial, it may be transmitted through reproduction. The existence of extra complexity isn't claimed as an inherent benefit by any sane person anywhere.
That's one of the problems I have with information theoretic approaches (Shannon's entropy) being used in discussion of evolution. You're really measuring two different things.
Also, the creationist sophists love misusing information theory for faux arguments against evolution.
Perhaps Kolmogorov complexity would be a much better measure for this.
Kolmogorov complexity has the exact same problem (if it is a problem) as length-after-LZ-compression: what maximizes the complexity for a given length of string is random junk, which seems a long way from what people normally mean when they talk about "information".
What creationists are trying to get at when they make claims like "mutations never add information" would, I think, be better expressed by globally replacing "information" with something like "usefulness". Their claims are still wrong after that change: mutations usually don't add usefulness, but occasionally they do, and natural selection filters that to produce a steady increase in usefulness. And often the use of pseudo-information-theoretical arguments is just flimflam. But the Instant Refutation of saying "mutations almost always add information", or "duplications almost always add information", or whatever, doesn't really engage with what the more honest ones are trying to say.
You could say that at the DNA level a mutation can contain exactly the same amount of information as the unmutated sequence, if you use Shannon's measure with each base modelled as an equally likely symbol from a four-letter alphabet (so any six-base string encodes the same 12 bits), e.g.:
GTTACA
to
GAATCA
would be the same. But of course there is a vast difference in terms of the phenotype that would result from this mutation (if it happens in a relevant region).
Of course, strictly speaking, a mutation can also add information; the most trivial example is:
AAAAAA
to
ACAAAA
Obviously the entropy is higher in the second string.
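For the sceptical, a minimal sketch that computes the per-symbol Shannon entropy of each string from its own letter frequencies:

    from collections import Counter
    from math import log2

    def empirical_entropy(s: str) -> float:
        # Per-symbol Shannon entropy of the string's letter frequencies, in bits.
        n = len(s)
        return -sum(c / n * log2(c / n) for c in Counter(s).values())

    print(empirical_entropy("AAAAAA"))  # 0.0 bits/symbol (may print as -0.0)
    print(empirical_entropy("ACAAAA"))  # ~0.65 bits/symbol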
All these examples show, however, that the laws of information entropy (like those of thermodynamic entropy) don't really have much to do with evolution.
There isn't going to be a logical or semantic trick to "disprove" evolution either, because it isn't a proposed mathematical theorem; it doesn't live in a-priori space. It's an inference that flows from evidence. No one is claiming that evolution is a logical necessity: it's merely a theory (just as gravity is merely a theory), and it's the one best supported by the evidence.
How would one disprove evolution? To quote an evolutionary biologist: find something along the lines of fossil rabbits in the Precambrian. Until that is found, evolution through natural selection is what best fits the evidence.
To be honest, I'm not really interested in the subject matter, but the article is worth it for the section "Note: On Java Verbosity" and below (including the comments (so far at least)).