
Is "junk DNA" essential for evolution? - pg
http://www.murdoch.edu.au/News/Shaking-up-the-theory-of-evolution/?cid=rss
======
timr
It's fairly well-established that non-coding DNA regions are important for
biological function. From regulatory elements that work via DNA and RNA
bending, to the catalytic RNA molecules that are essential for genome
maintenance, the portions of the genome that don't actually code for genes are
full of interesting stuff. So the title is a bit sensational, really.

That said, transposable elements are really not the more interesting parts of
the non-coding stuff -- we understand them well, and they don't _do_ much of
anything except disrupt genes during reproduction. To draw a rough analogy: if
you have a chunk of code with lots of changes in whitespace, it's probably
also under heavy development. You would be wrong to conclude that _"whitespace
is essential for software evolution"_ , but you might well be able to quantify
the software development progress by tracking only the whitespace.

~~~
Retro
>the portions of the genome that don't actually code for genes are full of
interesting stuff.

That's true, but if you estimate that 2% of the genome is genes and perhaps a
further 1% is regulatory elements and catalytic RNA, that still leaves a heck
of a lot of 'junk'.
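
For anyone who wants the back-of-the-envelope version, here is a rough sketch
of that arithmetic. The 2% and 1% figures are the estimates above; the genome
size is an assumed round number (about 3.1 billion base pairs for human), not
something taken from the article.

```python
# Rough arithmetic only. GENOME_BP is an assumed round figure for the
# human genome; the fractions are the estimates quoted in the comment.
GENOME_BP = 3_100_000_000
fractions = {"genes": 0.02, "regulatory elements + catalytic RNA": 0.01}

accounted = sum(fractions.values())       # share with a known role
remainder = 1.0 - accounted               # everything else ("junk")
remainder_bp = round(GENOME_BP * remainder)

print(f"accounted for: {accounted:.0%}")
print(f"unaccounted:   {remainder:.0%} (~{remainder_bp / 1e9:.2f} billion bp)")
```

Even if the 1% guess is off by a factor of several, the remainder stays well
above 90%.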

------
jballanc
Ok, while I understand that this sort of article might be interesting,
honestly it's only scratching the surface of the current state of the art (and
the idea of TE's acting as fodder for evolution isn't exactly new).

Short reply -- Get Martin Nowak's book _Evolutionary Dynamics_

Long reply -- Even Nowak's work is just the beginning (disclaimer: my current
Ph.D. thesis work is building off of Nowak's start). Let's start with this
notion that TE's are vital for evolution. Real answer: sort of. You see,
there's an awful lot of species that don't have an overabundance of "junk"
DNA, yet are still able to adapt and evolve. The reason that TE's at least
_seem_ to be more important in higher order plants and animals is because of
how such organisms organize their genomes.

Ok, before I get carried away on a long rant, here's the heart of the matter
for the hacker audience:

Proteins fold into domains. These are compact structures with a rough upper
limit to their size that typically can carry out one biochemical function. In
_E. coli_ , most genes code for proteins with only one domain. So, _E. coli_
can evolve by grabbing new genes, turning genes on or off, or in some
circumstances, evolve new functionality from existing domains through random
mutation. (In reality, random mutation takes a long time to produce anything
useful, and the creation of novel domains appears not to occur. There's a
theory growing in prominence that all of the protein domain folds that exist
today were present when life began, and may even represent independent life
originating events...but I'm getting off track.)

In humans, most of our proteins are multi-domain. Not all of these domains,
however, are catalytic. That is, in a human protein with 5 domains, maybe only
one actually has biochemical activity. The function of the rest is to modulate
that activity or localize the protein to one part of the cell or another.
Also, the mechanisms controlling when genes get turned on and off in humans are
much more complex than in bacteria. Therefore, evolution in higher order
plants and animals is much more likely to occur through a "shuffling" of
domains and regulatory elements. Because TE's are good at "shuffling" DNA,
it's not surprising that having a healthy dose of this sort of "junk" DNA is
advantageous. Of course, that's not all...there's also neutral evolution and
pseudo-genes and epigenetic inheritance, etc. Biology is really on the cusp of
exploding (oh, and I'm writing a book about that too ;-).

tl;dr -- The genetics of higher order plants and animals is not unlike a
program which relies heavily on many libraries. If you swap out one XML parser
for a better XML parser, you'll get better performance (and more customers!).
Transposable elements function (sort of) like biology's equivalent of a
linker, and can help organisms swap in and out libraries/protein domains.

~~~
timr
_"There's a theory growing in prominence that all of the protein domain folds
that exist today were present when life began, and may even represent
independent life originating events"_

Who in the world is advocating for _that_ theory? It sounds silly, on its
face...one has only to spend an afternoon browsing the SCOP or PFAM databases
to see that there's been a huge amount of recent evolution at the fold level,
and that the domains that we know about have appeared over a very long time.

~~~
jballanc
Gah! Ok, apologies. I have to learn to stop over-oversimplifying!

What I meant there was this: Mathematical modeling of the evolution of protein
folds points toward the impossibility of convergently evolving folds (but not
convergently evolving functionality). This implies that two proteins with the
same fold can reasonably be assumed to have evolved from a common ancestor
protein, even if that conclusion cannot be reached from sequence information.

Unfolded proteins (at least, above a certain length) cause problems for living
things. Thus, the ability of evolution to freely explore fold-space is
constrained. There is some interesting work going on in this space looking at
the possibility of networks of interrelated folds that don't pass through
unfolded intermediates, but I think it's too soon to say, for certain, that
these networks are sufficient to generate truly "novel" folds.

As for the SCOP classification system, my personal view is that it tends to be
on the restrictive side. Of course, that's the point of SCOP: to robustly
categorize folds. As for PFAM, it's been a while, but last I looked they still
don't consider any 3D structural information in their classifications. I guess
what I'm trying to say here is that, whether or not "novel" protein folds are
actively appearing depends on your definition of "novel". If I mutate a
residue in the middle of a helix that breaks the helix in two, is that a novel
fold? If I then insert a few more amino acids and turn that one helix into a
helix-turn-helix, is that a novel fold?

The theory I alluded to is not my own, but I can admire the logic behind it.
The idea is that it is possible to group many folds through sequence and other
(e.g. threading) means into derived folds. However, even when you do this, you
don't arrive at a single rooted tree. Instead, you find that there are somewhere
(depending on who you ask) between 800 and 1300 "roots" to the fold family
tree.
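
To make the "many roots" picture concrete, here is a toy sketch of counting
the roots of a fold family forest. The fold names and the "derived from"
links are entirely made up for illustration; real groupings come from
sequence and threading comparisons, which this does not attempt.

```python
# Hypothetical "derived from" links between folds (child -> parent).
# All names are invented; this only illustrates counting roots.
derived_from = {
    "fold_B": "fold_A",
    "fold_C": "fold_A",
    "fold_D": "fold_C",
    "fold_F": "fold_E",
}

def find_root(fold, parents):
    """Follow parent links until reaching a fold with no known ancestor."""
    seen = set()
    while fold in parents:
        if fold in seen:
            raise ValueError(f"cycle detected at {fold}")
        seen.add(fold)
        fold = parents[fold]
    return fold

# fold_G has no detectable relatives at all, so it is its own root.
all_folds = set(derived_from) | set(derived_from.values()) | {"fold_G"}
roots = {find_root(f, derived_from) for f in all_folds}
print(len(roots), sorted(roots))
```

A root here only means "no ancestor found with the links we have", which is
exactly why the count depends on who you ask.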

Presumably, these roots represent novel abiogenesis events. At the very least,
these root folds must have existed before "modern" biology (i.e. the sort that
cannot tolerate unfolded states) began. Whether they all have a common
ancestor or not is very much up for debate, but so far I don't know that we
have conclusive evidence one way or the other.

To be clear, though, I am guilty of oversimplification in the line you've
picked out.

~~~
timr
Nice post. (For the record, the simplification is understandable...I got my
PhD in this stuff, so I'm not exactly the target audience.)

I agree that convergent evolution of protein folds seems unlikely to have
happened frequently, but I'm not willing to dismiss it as impossible over the
course of evolutionary time-frames. More importantly, I'm not willing to
extend that idea to the conclusion that all protein folds must have been
extant at the beginning of evolutionary history. There are simply too many
ways for new protein folds to be produced -- and not necessarily by stepwise
point mutations through stable intermediates.

You're right that PFAM isn't a structural classification system _per se_ ;
they do what you describe a bit later, and build massive sequence profiles to
detect homologies. But at the sequence identity levels used to build PFAM, you
can safely assume that any sequence within a family that has a solved
structure will probably share that structure's overall fold. The details will
be different, but the fold will be conserved. That's really the point of PFAM
(and why it has a symbiotic relationship with SCOP).

But here's the problem with drawing too many structural conclusions from
sequence analysis: with state-of-the-art algorithms, we can detect structural
homologies out to about 25-30% sequence identity. Beyond that, we just don't
know how to call two structures "similar" or "dissimilar", without having the
structures themselves.
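
As a minimal sketch of what that threshold looks like in practice, the
snippet below computes percent identity for two sequences that are assumed to
be already aligned (gaps written as "-"). The 25-30% band is the one quoted
above; the function names, the example sequences, and the choice of
denominator (gap-free aligned columns) are my own assumptions, and other
definitions of identity give somewhat different numbers.

```python
TWILIGHT_LOW, TWILIGHT_HIGH = 25.0, 30.0  # band quoted in the comment above

def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

def homology_call(identity):
    """Very coarse interpretation of an identity value (illustrative only)."""
    if identity >= TWILIGHT_HIGH:
        return "likely same fold"
    if identity >= TWILIGHT_LOW:
        return "twilight zone"
    return "cannot tell from sequence alone"

# Made-up aligned fragments, purely for illustration.
identity = percent_identity("MKTAYIAKQR", "MKSAYLAK-R")
print(f"{identity:.1f}% -> {homology_call(identity)}")
```

Note that the last branch says "cannot tell", not "unrelated": below the
band, sequence alone simply stops being informative.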

Point being, you can't really say that those 800-1300 "roots" of the
evolutionary tree are independent starting points. All you can conclude is
that our tools aren't good enough (or there isn't enough data) to trace back
the evolutionary tree to the point where those "roots" may have converged.

For the record, I'm not trying to be pedantic or argumentative. This is one
of those few fun debates that makes the field interesting. ;-)

~~~
jballanc
Thanks. I'm always game for a good debate. As I mentioned, my
thesis-in-progress (T-minus 3 months...fingers crossed!) is in a related field
(evolutionary dynamics/computational biochemistry), but I used to be more
involved in protein structure/bioinformatics back in the day (worked, for a
time, in the lab that maintains ecogene.org).

The last debate I had regarding convergent evolution of protein folds got
rather heated. This is one of those classic problems that can't be approached
without some amount of hand-waving, and depending on which way you wave, you
can arrive at different results. In some respects it boils down to Levinthal's
paradox, except with evolutionary moves in place of topological ones. The one
big unknown that you would need to find before you could make any sort of
educated approach at the issue is what fraction of all possible protein chains
of a certain size have stable, fast-folding structural minima. If that number
is high, then short hops from one to the other could very likely result in
convergence.
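
Here is a rough sketch of the numbers involved. The first half is the
textbook Levinthal-style estimate (100 residues, 3 conformations each, a very
generous sampling rate); all of those constants are conventional illustrative
assumptions. The second half is a deliberately crude model of the
evolutionary version: given an assumed fraction of sequences that fold
stably, how many of a sequence's single-mutation neighbours would be expected
to fold, if stability were independent from one sequence to the next (it
almost certainly is not).

```python
import math

# Levinthal-style estimate; every constant here is an illustrative assumption.
RESIDUES = 100
STATES_PER_RESIDUE = 3
SAMPLES_PER_SECOND = 1e13
SECONDS_PER_YEAR = 3.15e7

conformations = STATES_PER_RESIDUE ** RESIDUES
years = conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR
print(f"conformations: ~10^{math.log10(conformations):.1f}")
print(f"exhaustive search: ~10^{math.log10(years):.1f} years")

def expected_stable_neighbours(length, stable_fraction, alphabet=20):
    """Expected number of stably folding single-point mutants, assuming
    each neighbour folds independently with probability stable_fraction."""
    neighbours = length * (alphabet - 1)  # one substitution at one position
    return neighbours * stable_fraction

# The stable fraction is the big unknown, so just scan a few guesses.
for fraction in (1e-6, 1e-3, 1e-1):
    n = expected_stable_neighbours(RESIDUES, fraction)
    print(f"stable fraction {fraction:g}: ~{n:g} stable one-step neighbours")
```

The point of the scan is only that the conclusion flips depending on that one
parameter, which is where the hand-waving comes in.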

As for the SCOP/PFAM part of the story, the 25-30% "twilight zone" for
sequences yielding related structures has a counterpart with structures that
are topologically related but with low (or I've even seen cases of essentially
no) sequence identity. That is, if you look at a group in PFAM, then take all
of the members with structures in that group, gather all of the corresponding
SCOP groups, add the members from the SCOP groups, then for each of those look
for related PFAMs or other sequences...essentially, what you're doing is a
structure-informed PSI-BLAST (sort of what Expresso is to T-COFFEE).

Now, if you attempt to fill the gaps from lack of structures by running each
of your sequences through something like Skolnick's TASSER threading algorithm,
and using the best predicted folds to grow your group, this is how you can use
distant sequence and structure homologies to construct "master" fold groups.
This is, roughly speaking, how the 800-1300 number is arrived at.
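
The group-growing loop described above can be sketched roughly like this. It
is a toy: the family names, group names, and memberships are all invented,
"related" is reduced to "shares at least one member", and the threading step
is left out entirely. It only shows the iterate-until-nothing-new-appears
shape of the procedure.

```python
from collections import deque

# Hypothetical toy data: members of sequence families and structural groups.
sequence_families = {
    "PF_one": {"protA", "protB"},
    "PF_two": {"protC", "protD"},
    "PF_three": {"protE"},
}
structure_groups = {
    "SCOP_x": {"protB", "protC"},
    "SCOP_y": {"protE", "protF"},
}

def grow_group(seed, seq_fams, struct_groups):
    """Starting from one family, repeatedly pull in any family or structural
    group that shares a member with something already collected."""
    groupings = {**seq_fams, **struct_groups}
    visited, members = set(), set()
    queue = deque([seed])
    while queue:
        name = queue.popleft()
        if name in visited:
            continue
        visited.add(name)
        members |= groupings[name]
        for other, proteins in groupings.items():
            if other not in visited and proteins & groupings[name]:
                queue.append(other)
    return visited, members

groups, proteins = grow_group("PF_one", sequence_families, structure_groups)
print(sorted(groups), sorted(proteins))
```

Starting from a different seed can land in a separate "master" group, and the
number of such disconnected groups is the kind of count the 800-1300 figure
refers to.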

Admittedly, the more structures we solve and genomes we sequence, the better
this sort of technique will get. In many ways, this is biology's analog to
cosmology and the big bang: we can look further and further out into space,
but at some point, we can't know what happened any earlier than some small time
after the big bang without recreating those conditions. Likewise, we can
sequence and solve more structures, but I don't doubt that at some point we
will hit a wall and need to start attempting to recreate abiogenesis.

Interesting times!

