
CS 522: Machine Learning Approaches to Decode the Human Genome - casawa
https://cs522.stanford.edu/notes/anshulkundaje.html
======
hateful
Direct link to video (via link on top of site):
[https://www.youtube.com/watch?list=PLeEhPBsiwmeQPNgi1iHb4Bi4...](https://www.youtube.com/watch?list=PLeEhPBsiwmeQPNgi1iHb4Bi4WO3NuYnX9&time_continue=3&v=lX76DzZdjvQ)

------
casawa
If of interest, other notes are available at cs522.stanford.edu and more will
be available shortly!

~~~
indescions_2018
Yeah, definitely interested ;)

Any chance the course has room for the other side of the coin? Namely, how
neuroevolution and genetic strategies inform deep reinforcement learning?

Am also interested in learning about the state-of-the-art in cloud based
packages. I noticed recently Google released a tool called DeepVariant for use
on their genomics platform.

[https://github.com/google/deepvariant](https://github.com/google/deepvariant)

Creating a universal SNP and small indel variant caller with deep neural
networks

[https://www.biorxiv.org/content/early/2018/01/09/092890](https://www.biorxiv.org/content/early/2018/01/09/092890)

~~~
chillee
Answer: it doesn't.

Genetic algorithms and the like are pretty much all terrible. They're ways of
approximating your gradient, and fall prey to the curse of dimensionality. The only
reason Uber and OpenAI published their papers on evolutionary strategies
(something pretty different from what people think of as genetic algorithms)
is that current policy gradient methods are really bad as well, allowing what
is effectively random search to do well.
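
To make the "approximating your gradient" point concrete, here is a minimal sketch of a perturbation-based gradient estimate in the general spirit of evolution strategies. It is not the method from the Uber or OpenAI papers; the names `es_gradient` and `objective`, the toy quadratic objective, and the values of `sigma` and `n_samples` are all assumptions chosen for illustration.

```python
import random

def es_gradient(f, theta, sigma=0.1, n_samples=2000, seed=0):
    """Estimate the gradient of f at theta using only function evaluations.

    Each sample draws a Gaussian direction eps and compares f at
    theta + sigma*eps and theta - sigma*eps (an antithetic pair).
    """
    rng = random.Random(seed)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        # Finite difference along eps, averaged over all samples.
        scale = (plus - minus) / (2.0 * sigma * n_samples)
        for i in range(dim):
            grad[i] += scale * eps[i]
    return grad

def objective(x):
    # Hypothetical toy objective with a known gradient of -2*x.
    return -sum(v * v for v in x)

theta = [1.0, -2.0, 0.5]
estimate = es_gradient(objective, theta)
exact = [-2.0 * t for t in theta]
print("estimate:", estimate)
print("exact:   ", exact)
```

The estimate is noisy, and for a fixed number of samples the noise grows with the number of parameters, which is one way to read the curse-of-dimensionality remark above.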

It's kinda like how Bayesian hyperparameter optimization is pretty terrible
and 2x random search almost always beats it easily.

------
JepZ
Does anybody know why we can't just write a cell simulator and start experimenting
with DNA manipulation that way? I mean, I have no idea what the first
nucleotides are for, but with a simulator I could try changing them and
see what happens.

This may sound like a naive approach, but is there anything special hindering
us from building such a simulator or is it just that scientists will not find
it useful as it will take too long before we know which nucleotides are
important for our goals?

~~~
viewtransform
To build the cell simulator you would have to model all the biochemical
pathways in a cell. This is a bit of a problem

[http://biochemical-pathways.com/#/map/1](http://biochemical-pathways.com/#/map/1)

[http://biochemical-pathways.com/#/map/2](http://biochemical-pathways.com/#/map/2)

~~~
JepZ
Thanks, cool links. I didn't even know you could use Leaflet for something other
than maps.

~~~
constantlm
Didn't even notice Leaflet there, nice spot. TIL

------
proc0
> Learning the DNA regulatory code of the genome

This is so interesting. I cannot imagine what kind of language evolution chose
to build on top of DNA. There must be some paradigm it maps to, and it's going
to be incredibly interesting to see if a compiler/interpreter can be made for
DNA, along with higher level languages that compile down to it.

~~~
shaki-dora
If you’re interested, much of what you’re asking about is actually known:
(some) DNA sequences map directly to protein sequences. Because DNA has a 4
letter alphabet (G, C, T, A), it takes three letters (a codon) to specify one
of the 20 standard amino acids that make up proteins: two letters give only 16
combinations, three give 64. Proteins, in turn, are the “machines” that
work in cells.
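
As a small illustration of that mapping, the sketch below builds the standard genetic code table and translates a short DNA string. The function name `translate` and the example sequence are made up for illustration; the 64-letter string is the standard code with codons enumerated in T, C, A, G order, where `*` marks a stop codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' is a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def translate(dna):
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(len(CODON_TABLE))                 # number of possible codons
print(len(set(AMINO) - {"*"}))          # number of distinct amino acids encoded
print(translate("ATGGCCTAA"))           # hypothetical example sequence
```

Because 64 codons cover only 20 amino acids plus stop signals, most amino acids are encoded by more than one codon.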

DNA also contains regulatory segments, errors, “dead” code left over (and
carried through generations) long after it stopped being transcribed, DNA once
inserted by viruses (both active and inactive) etc etc...

If you want an analogy with computer code: it’s most like a spaghetti-code
hairball of assembly coding its own compiler, IDE, vim, a few games, and a
neural network ten orders of magnitude the size of anything TensorFlow can do. It does
it all on hardware that works only probabilistically. And it is constantly
starved for resources, leading to hacks such as DNA sequences that code for
two completely different functioning proteins depending on reading it either
forward or backward, or starting to read at an offset (what’s called a
“reading frame”).
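
A rough sketch of the reading-frame idea: the same stretch of DNA can be split into codons in six ways, three offsets on each strand. This assumes "backward" means reading the opposite strand, i.e. the reverse complement; the helper names and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read in its own forward direction."""
    return dna.translate(COMPLEMENT)[::-1]

def codons(dna, offset):
    """Split dna into complete 3-letter codons, starting at the given offset."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]

def six_frames(dna):
    """Map a frame label such as '+0' or '-2' to the codons read in that frame."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset}"] = codons(seq, offset)
    return frames

for label, frame in six_frames("ATGGCCTAA").items():
    print(label, frame)
```

Each frame yields a different codon sequence, which is how one piece of DNA can, in principle, encode more than one protein.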

There’s a thick book on molecular biology by Alberts et al. It’s the most
fantastic deep dive into this, and many other, insanities. I believe Larry
Page used to recommend it to all new Googlers.

~~~
madhadron
I've stopped recommending Alberts. It's a great cartoon guide to a mythical
average eukaryotic cell, but it abstracts much farther than the data can bear
and leaves the reader without the intellectual tools to work with the material
in it. And so you get computer scientists thinking about assembly language and
compiling and physicists building little stochastic models of state
transitions without knowing the biological considerations that lead those
efforts astray.

Not that I have a book to recommend instead...

~~~
entee
I disagree, it's a decent overview that introduces the major concepts and how
they work. It's not intended to be at the cutting edge or to provide all
caveats. By definition all textbooks in biochemistry, biology and the like are
out of date by the time they go to press.

The overview MBOC provides is enough of a basic understanding of molecular
biology to move on to more advanced work, including papers that start to get
into the nitty-gritty details. Note that many people haven't even been exposed
to the basics!

~~~
madhadron
I'm not worried about it being out of date. I worry about the misconceptions
it leaves in the minds of those who learn from it. They can parrot the words,
but the mental models that result from studying it regularly lead people
astray. At least, that's my anecdotal observation from my time as a research
biologist.

------
davidgl
For anyone who missed it, also see this great link on HN recently:
[https://news.ycombinator.com/item?id=16233644](https://news.ycombinator.com/item?id=16233644)
- DNA seen through the eyes of a coder (2017)

------
zitterbewegung
Can I give you my genome? Or will I be able to reproduce your methods from
this site?

~~~
maxander
This is a course on a fairly novel field of research, not a method to do
anything remotely consumer-facing.

~~~
nextos
I disagree! It's a fantastic domain for startups! Nanopore is readily
available. I have sequenced stuff in my kitchen, and I'm a computer scientist.

~~~
maxander
_DNA sequencing_ is mature and, if anything, almost getting to be late in the
game for new startup entries. _Determining gene function through ML methods_
is a whole different thing, and that's what the OP and I were talking about.
Are we all responding to the same article here?

