
Exploring DNA with Deep Learning - ReDeiPirati
https://blog.floydhub.com/exploring-dna-with-deep-learning/
======
comnetxr
no examples of "unreasonable effectiveness" or even "effectiveness" are given;
just a semi-plausible technique and some questions that might be worth
answering. I hope the term doesn't get diluted with more examples like this.

also not clear why a 2D local representation is being used for 1D data. there
is no meaning to the ordering of the rows (different individuals who represent
the samples in the genome), so it doesn't make much sense to encode that into
the image. I would presume that not much meaning comes from neighboring
mutations that are separated by a long string of no mutations, so the in-row
locality should be broken into chunks. Neither of these basic considerations
is motivated in the text either.

I would guess there is no effectiveness of CNN at all on this data set and a
different statistical technique should be used on this data set...

~~~
dekhn
This is the paper with the title:
[https://academic.oup.com/mbe/article/36/2/220/5229930](https://academic.oup.com/mbe/article/36/2/220/5229930)

Almost every article I've seen that uses that naming trope fundamentally
misunderstands Wigner's point in the original
([https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.htm...](https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html)).
What Wigner was hinting at (he never really comes out and says it) is that he
thinks math and science are two totally different domains and it's surprising
to think that mathematical models of physical systems would be able to make
generalized out-of-scope predictions. Ultimately, the best thing any theory
can do is predict something we didn't expect from the previous models, and
then have an experimentalist go and show the new theory's prediction is more
consistent with natural observations. Both relativity and QM have done that
repeatedly, although Wigner was surprised at that, as he believed that math
was an independent domain untethered to physics (many today assume that the
universe is effectively a physical embedding of a mathematical structure, and
our mathematical theories are simplified approximations of that structure, so
it's not super surprising that a good math model would make good physical
predictions), and I think this article was him basically hinting at that new
idea without coming out and saying it.

As for why CNNs are useful here... in the generalized genotype-to-phenotype
problem, where you are trying to take a list of a person's mutations and
predict their physical attributes, there are some phenotypes/traits which are
absolutely and totally explained by a single mutation in one gene. In those
cases you could train classifiers using simple binary features
("has_mutation_TtoAatPosition37OfChromosome1") and make pretty good
predictions.
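
A minimal sketch of that single-mutation case, with made-up toy data and a
scikit-learn logistic regression standing in for whatever classifier you'd
actually use:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one binary feature per person,
# e.g. has_mutation_TtoAatPosition37OfChromosome1 (1 = present, 0 = absent)
X = np.array([[1], [1], [1], [0], [0], [0], [1], [0]])
# Phenotype label (1 = trait present, 0 = absent)
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1], [0]]))  # trait predicted directly from the single mutation
```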

But most traits are only predictable by making complex non-linear models that
take into account more locations, and the interactions between those locations. In some cases,
it's 1-2 mutations in a single gene near each other, in other cases, it's 100
different mutations spread throughout the genome, and in other cases, many
thousands (the variance of height in humans is a good example where a large
number of effects combine non-linearly). CNNs are great for dealing with non-
linear data with non-local interactions.

Sequence models also work well (I always find this funny because you're doing ML
sequence models on DNA sequences) because so much of the signal can be found
in the neighboring bases. For example, with transcription factors, where a
protein recognizes a short chunk of DNA, the recognized window (10-20 base
pairs) has significant internal predictability.
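
A sketch of what that looks like as a 1D convolution over one-hot encoded
bases; the kernel size of 15 roughly matches those 10-20 bp windows, and all
shapes and layer sizes here are just illustrative:

```python
import numpy as np
import tensorflow as tf

def one_hot(seq):
    """One-hot encode a DNA string into a (length, 4) array (A, C, G, T)."""
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        arr[i, lookup[base]] = 1.0
    return arr

seq_len = 200  # arbitrary fixed window around a candidate binding site
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, 4)),
    # Each filter learns a ~15 bp motif, much like a position weight matrix
    tf.keras.layers.Conv1D(32, kernel_size=15, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),  # "does the motif occur anywhere?"
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. bound / not bound
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```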

~~~
Real_S
Interesting point about Wigner, but about the article...

This study does not examine phenotypes, though its approach might be applied
to them. Instead:

>we use simulation to show that CNNs can leverage images of aligned sequences
to accurately uncover regions experiencing gene flow between related
populations/species, estimate recombination rates, detect selective sweeps,
and make demographic inferences

I believe this works well because the sorting of the data (Fig. 2) introduces
phylogenetic information into the image to be analyzed. This reminds me of
neighbor joining [0], but has some differences. Without this ordering, their
method does not work as well.

0)[https://en.wikipedia.org/wiki/Neighbor_joining](https://en.wikipedia.org/wiki/Neighbor_joining)
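
If I'm reading it right, the sorting step amounts to ordering the haplotype
rows so that similar rows sit next to each other before the matrix is treated
as an image. A rough sketch of one way to do that (greedy nearest-neighbour
ordering on Hamming distance; the paper's exact sorting may well differ):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sort_rows_by_similarity(geno):
    """Greedily order haplotype rows so that similar rows end up adjacent.

    geno: (n_individuals, n_sites) binary matrix of mutation presence/absence.
    """
    dist = squareform(pdist(geno, metric="hamming"))
    order = [0]
    remaining = set(range(1, len(geno)))
    while remaining:
        nxt = min(remaining, key=lambda j: dist[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    return geno[order]

geno = np.random.randint(0, 2, size=(20, 48))  # toy "alignment image"
image = sort_rows_by_similarity(geno)
```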

~~~
longtom
> sorting of the data

Perhaps the main advantage is that it can filter out non-adaptive, non-
functional (noisy) mutations this way by simply averaging similar genomes? In
that case, the rows of the learned M x N kernels should be nearly identical
and one could have simply averaged M data rows at a time and fed it to the 1D
CNN.
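
If that were the case, the 2D kernels would add nothing over something like
this toy "average M rows, then 1D CNN" pipeline (M, the layer sizes, and the
random data are all made up):

```python
import numpy as np
import tensorflow as tf

M = 4  # hypothetical number of genome rows averaged together
geno = np.random.randint(0, 2, size=(100, M * 8, 48)).astype(np.float32)

# Average every block of M rows, collapsing the image into per-block 1D profiles
profiles = geno.reshape(geno.shape[0], -1, M, geno.shape[2]).mean(axis=2)
profiles = profiles.transpose(0, 2, 1)  # (samples, sites, channels) for Conv1D

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=profiles.shape[1:]),
    tf.keras.layers.Conv1D(16, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```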

What other phylogenetic information could possibly be inferred?

Edit: It could also be thought as data augmentation as it effectively creates
novel inputs each time. IIRC there was also a technique for hardening against
adversarial examples which simply fed the network averaged datapoints along
with the original data.

------
Mizza
The real problem with this technique is, of course, normalizing and labeling
data.

I have worked on a project which has harmonized and labelled all public RNA
samples: [https://www.refine.bio](https://www.refine.bio)

------
lkjhdcba
TL;DR: a very entry-level article about the field of popgen ("what is a
genome?") and how the "breakthrough" is representing multiple sequence
alignments as compact binary matrices. There's very little explanation or
actual examples beyond that, so unless you're a complete layman the article
probably won't satiate you.

Applying deep learning to genomic data is something of a fad these days - the
bioinformatics world has caught up with the DL hype of the early 2010s and is
trying to use DL on nearly anything that moves for easy papers.

The main issue with DL frameworks in the context of genomics is the format of
input data. You pretty much want all your data to be a matrix of fixed size
(if you want to use CNNs at least, and that's what everyone is interested in
anyway), but that's just not how genomics data works. Sequences vary in length
(I see the problem of nucleotide gaps, let alone short indels, is left
unanswered), alignments are not absolute (they are very much aligner-dependent
and secondary alignments are a thing), the alignments themselves may stem from
different data sources (long reads cover more stretches of DNA but are less
reliable than short ones), there is no mention of how ploidy is handled
(especially in plants!) and somehow you're supposed to transform all of that
into a neat 48x48 array to feed to Keras. Wait, thousands of them. Did I
mention the human or plant genomes are often billions of basepairs long?
Waiting for bwa to be done mapping on your cluster is the xkcd equivalent of
"can't do work, compiling!"

So yeah, sorry to put a damper on this but I'm waiting for something within
the reach of practical workability (and believe me the standards of
bioinformaticians for workable stuff are _low_) before getting hyped.

------
longtom
> throwing away all the non-mutated locations in the genome because they carry
> no meaningful information

How can they claim this with certainty? Maybe the context of any given
mutation contains useful information, similar to how a word embedding vector
encodes information such as relatedness to other words.

~~~
mschuster91
The non-differing parts are worthless as you are looking for _differences_.
For example, take a group of five people with highly sensitive hearing and a
hundred people with normal hearing. You don't care what the sensitives have
in common with the normals; you care about whether you can find a mutation
present in all sensitives but in none of the normals. Therefore it makes sense
to only look at the differences and throw away the equal parts; keeping them
just wastes computational resources.
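
In code that filtering step is basically a column mask (toy numpy sketch with
made-up group sizes):

```python
import numpy as np

n_sites = 1000
sensitives = np.random.randint(0, 2, size=(5, n_sites))   # the 5 sensitive-hearing people
normals = np.random.randint(0, 2, size=(100, n_sites))    # the 100 normal-hearing people

# Keep only sites where every sensitive carries the mutation and no normal does
candidate = sensitives.all(axis=0) & ~normals.any(axis=0)
print(np.flatnonzero(candidate))  # positions worth keeping; everything else is discarded
```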

~~~
longtom
But isn't genetic code transcribed sequentially and in the context of
previously decoded structures? So it is similar to reading a text that
references parts elsewhere in the text. If you only keep the changes between
multiple subtly mutated texts, then it becomes an enigma what these changes
actually mean and, what's worse, the changes now appear in the context of
nearby changes, which may create completely different meanings.

