
Machine learning spots natural selection at work in human genome - myinnerbanjo
https://www.nature.com/articles/d41586-018-07225-z
======
nestorD
> To find these patterns, a growing number of geneticists are turning to a
> form of machine learning called deep learning. Proponents of the approach
> say that deep-learning algorithms incorporate fewer explicit assumptions
> about what the genetic signatures of natural selection should look like than
> do conventional statistical methods.

In the days when Sussman was a novice, Minsky once came to him as he sat
hacking at the PDP-6. “What are you doing?”, asked Minsky. “I am training a
randomly wired neural net to play Tic-Tac-Toe” Sussman replied. “Why is the
net wired randomly?”, asked Minsky. “I do not want it to have any
preconceptions of how to play”, Sussman said. Minsky then shut his eyes. “Why
do you close your eyes?”, Sussman asked his teacher. “So that the room will be
empty.” At that moment, Sussman was enlightened.

~~~
paraschopra
I increasingly feel that deep learning needs to incorporate more ideas from
evolution. Not just for parameter optimization but architecture discovery
itself.

Imagine pitting neural networks against each other in an adversarial
environment (just like the real world). Under competitive pressure for limited
food (computational resources), evolved neural networks could start
approaching optimal architectures that do the job but have no superfluous,
pre-conceived notions of the (modeled) world. In fact, such evolved
architectures could directly encode relevant notions of the world, which we
could then learn about by reverse-engineering them.

This is closely related to ideas from predictive processing, which ties
survival to prediction (to predict your future states is to avoid getting
dissipated). So I'm anticipating that notions of evolution and survival will
come up in a big way in ML/deep learning in the future.
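The comment above can be sketched as a toy neuroevolution loop. Everything here is an illustrative assumption: the "genome" is just a tuple of hidden-layer widths, and `fitness` is a stand-in function (not a trained network) that rewards capacity up to the task's needs while charging for "food" (compute), so leaner architectures win ties.

```python
import random

random.seed(0)

def fitness(genome):
    # Stand-in for task performance: the task only needs so much capacity,
    # and every unit of width and every layer costs limited "food" (compute).
    capacity = min(sum(genome), 32)
    compute_cost = 0.5 * sum(genome) + 2 * len(genome)
    return capacity - compute_cost

def mutate(genome):
    g = list(genome)
    op = random.random()
    if op < 0.3 and len(g) > 1:
        g.pop(random.randrange(len(g)))          # drop a layer
    elif op < 0.6:
        g.append(random.choice([4, 8, 16, 32]))  # add a layer
    else:
        i = random.randrange(len(g))
        g[i] = max(1, g[i] + random.choice([-8, -4, 4, 8]))  # resize a layer
    return tuple(g)

def evolve(pop_size=20, generations=50):
    pop = [tuple(random.choice([4, 8, 16, 32])
                 for _ in range(random.randint(1, 4)))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # limited food: half survive
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=fitness)

best = evolve()
```

Under this toy fitness the optimum is a single layer of width 32 (fitness 14.0): exactly the "do the job with nothing superfluous" outcome, and the architecture itself is the artifact you'd reverse-engineer.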

~~~
ovi256
>approaching optimal architectures that do the job but have no superfluous,
pre-conceived notions of the (modeled) world

Usually, modelling efficiency is tied to the amount of prior knowledge built
in. The more (correct) priors a model has, the faster it learns, and from less
data.

A prior-poor model just spends data learning those correct priors in the first
place.

There are countless examples I could give to support this intuition. There
isn't, AFAIK, any theoretical work on it yet.

That the priors are correctly aligned with reality is vital. A model with
fixed bad priors cannot unlearn them. If the priors are bad but not fixed,
you're just spending time and data to correct those priors.
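A minimal sketch of this trade-off, using a Beta-Bernoulli model as the example (my choice, not something from the thread): estimating a coin's bias (truly 0.9) from ten flips, under a correct informative prior, a flat prior, and a wrong-but-revisable prior.

```python
# Posterior mean of a coin's bias theta under a Beta(alpha, beta) prior,
# after observing `heads` and `tails` (standard conjugate update).
def posterior_mean(alpha, beta, heads, tails):
    return (alpha + heads) / (alpha + beta + heads + tails)

true_theta = 0.9
heads, tails = 9, 1      # ten flips, fixed to match the true bias exactly

flat       = posterior_mean(1, 1, heads, tails)   # prior-poor model
correct    = posterior_mean(9, 1, heads, tails)   # prior aligned with reality
wrong_soft = posterior_mean(1, 9, heads, tails)   # wrong, but not fixed

# With ~100x more data the wrong-but-soft prior is eventually paid off:
wrong_soft_later = posterior_mean(1, 9, 900, 100)
```

From the same ten flips the correct prior lands on 0.9 exactly, the flat prior gets ~0.83, and the wrong prior sits at 0.5; only after spending a lot more data does the wrong-prior model catch up. A *fixed* bad prior (a point mass) would never recover, matching the comment above.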

~~~
paraschopra
Yes, and evolution builds priors into organisms.

------
a_bonobo
There was a very fun preprint a few weeks ago on biorxiv:
[https://www.biorxiv.org/content/early/2018/10/22/336073](https://www.biorxiv.org/content/early/2018/10/22/336073)

The same trick as DeepVariant, if you encode your genetic variants as images
and give that to a CNN you get reasonably good results without doing much
extra work!

~~~
klmr
DeepVariant didn’t actually explicitly encode the variants (nor the raw data)
as images. Press reports (including from Google Research itself) suggested
that this was the case, but nothing in the original publication said so, and
the researchers themselves have disputed it.

It just so happens that the tensor representation lends itself well to
visualisation as (multi-channel) images. But that wasn’t the intent, it’s just
a nice side-effect. In reality the data is laid out in tensors that follow
naturally from how they were generated (i.e. via alignment of many
stochastically generated DNA fragments to a reference sequence).

~~~
a_bonobo
Interesting!! I did not know that, so I was wrong - but why does the GitHub
repo then talk of pileup images?

[https://github.com/google/deepvariant](https://github.com/google/deepvariant)

Edit: I think I get it now, this notebook shows it:
[https://github.com/google/deepvariant/blob/r0.7/docs/visuali...](https://github.com/google/deepvariant/blob/r0.7/docs/visualizing_examples.ipynb)
The channels are easily representable as images, but the network doesn't use
the images directly.

~~~
klmr
Yeah, the way they talk about this is admittedly confusing. The best way for
me to think about it (from [1]) is to mentally add quotation marks around the
word “image” whenever the methods mention it. Mathematically the data is
stored in higher-dimensional tensors with one dimension representing genomic
coordinates, another dimension representing the sequence depth (i.e. one row
per sequence read), and additional dimensions representing features of the
sequence (such as nucleotide identity, read direction, error probability,
match/mismatch with the reference, etc). I use almost the same kinds of
tensors for work that has no tangible relation to images or visualisation.
It’s simply the most straightforward way of representing this data as a
tensor.

But the similarity to images is so tantalising that even the paper’s methods
[2] fall prey to this, even though the dimensionality and minor details are
wrong. Furthermore, the term “pileup image” refers to a common way of
visualising genome–read alignments [3]. The DeepVariant tensor is _not_ a
pileup image but it is very close. And the tensor can be converted into an
image [4], but as mentioned this requires some transformations (splitting the
channels and rescaling the values).
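A minimal numpy sketch of the kind of layout described above. All names, shapes, and channel choices here are illustrative assumptions, not DeepVariant's actual dimensions: one axis for read depth, one for genomic position, one for per-base features.

```python
import numpy as np

# Illustrative shapes: a window of 221 genomic positions, up to 100 reads
# deep, 4 feature channels per base (assumed, not DeepVariant's real layout).
WINDOW, MAX_DEPTH, CHANNELS = 221, 100, 4
BASE, QUALITY, STRAND, MATCHES_REF = range(CHANNELS)

rng = np.random.default_rng(0)
pileup = np.zeros((MAX_DEPTH, WINDOW, CHANNELS), dtype=np.float32)

n_reads, read_len = 40, 50
for row in range(n_reads):                       # one row per aligned read
    start = rng.integers(0, WINDOW - read_len)
    span = slice(start, start + read_len)
    pileup[row, span, BASE] = rng.integers(1, 5, read_len)     # A/C/G/T as 1..4
    pileup[row, span, QUALITY] = rng.uniform(0.5, 1.0, read_len)
    pileup[row, span, STRAND] = rng.integers(0, 2)             # whole-read flag
    pileup[row, span, MATCHES_REF] = rng.integers(0, 2, read_len)

# "Converting to an image" is just slicing one channel and rescaling to 0..255:
base_image = (pileup[..., BASE] / 4 * 255).astype(np.uint8)
```

The tensor follows naturally from the alignment, as the comment says; the image is a view you derive from it afterwards, not the representation itself.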

[1]
[https://bioinformatics.stackexchange.com/q/4098/29](https://bioinformatics.stackexchange.com/q/4098/29)

[2] I initially claimed the paper didn’t mention this. That’s wrong;
apologies.

[3]
[https://www.google.com/search?tbm=isch&q=genome+pileup](https://www.google.com/search?tbm=isch&q=genome+pileup)

[4]
[https://github.com/google/deepvariant/blob/r0.7/docs/visuali...](https://github.com/google/deepvariant/blob/r0.7/docs/visualizing_examples.ipynb)

~~~
a_bonobo
Thank you for the explanation, it makes perfect sense - just as an RGB picture
is represented as a three-dimensional array of numbers for red, green, and
blue, you're not constrained to those channels; you might as well throw
anything in there.

------
tmaly
I would be interested to hear more about how you handle cases where you do not
have enough samples to train on. This seems like a useful area to improve
upon.

------
mactrey
So is the University of Oregon in Portland or Eugene?

