
Deep learning can debug biology - saurabh20n
http://20n.com/blog.html#most-recent
======
posterboy
I guess this is deliberately written to be cryptic, as it's an advertisement.
Terms are used before they are introduced, but maybe I'm not in the target
group. What's called prediction would technically be inference. Hand-wavy
explanations of machine learning used for differentiation of chemical analysis
data seem to be the sell here.

~~~
saurabh20n
Agree that this is not written with the rigor of a journal paper. Our
intention was to communicate the simple wins we've had employing deep learning
in comparison to the tools of the prior generation. XCMS is the most widely
used library:
[http://www.bioconductor.org/packages/release/bioc/html/xcms....](http://www.bioconductor.org/packages/release/bioc/html/xcms.html).
It requires very painful parameter tuning. Internally, we had also built our
own custom targeted analysis. In the targeted pipeline, we had to pre-specify
"acetaminophen, shikimate, chorismate...". After building this deep learning
workflow, we have exclusively switched over to it: no chemicals pre-specified,
no parameter tuning. With about 185 engineered yeast strains to analyze, plus
replicates, feeding conditions, and controls, these simplifications have been
helpful.

We are getting easy wins over microbial data. Human data is noisier and we're
testing over that now. More later.

If you have microbial data, or have used XCMS in the past and would like to
compare, happy to chat. email me at saurabh@20n.

~~~
sndean
I should read some of the related papers first, but would some of these
techniques potentially be useful for LC-MS/MS proteomic data?

I'll probably email you with more specific questions.

------
karmel
It might be the lack of detail in the piece, but it's unclear to me why this
isn't a hammer to kill a fly-- that is, why wouldn't a much simpler peak-
finding algorithm be appropriate here? What is the NN doing that's more than
just peak finding over many molecules? Is there some interdependency that I am
missing, or is this just signal-processing over millions of independent
traces?

~~~
dre85
As far as I understand, NN is used here to find patterns which discriminate
between sample cohorts (healthy vs disease). Peak finding gives you a list of
peaks, but it doesn't tell you which of them discriminate between cohorts.
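A minimal sketch of that idea in plain NumPy (all data synthetic and invented for illustration): fit a simple discriminative model on per-sample peak intensities, then read the learned weights to see which peak separates the cohorts. Logistic regression stands in for the network here, but the point carries over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic peak table, invented for illustration: 200 samples x 50 peaks,
# where only peak 7 differs between the "healthy" and "disease" cohorts.
n_samples, n_peaks, marker = 200, 50, 7
y = np.repeat([0, 1], n_samples // 2)        # cohort labels
X = rng.normal(size=(n_samples, n_peaks))    # peak intensities (noise)
X[y == 1, marker] += 3.0                     # the disease cohort's shifted peak

# Logistic regression by gradient descent stands in for the network here;
# the point is that the learned weights, not the raw peak list, say which
# peaks discriminate the cohorts.
w, b = np.zeros(n_peaks), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(disease)
    w -= 0.5 * X.T @ (p - y) / n_samples
    b -= 0.5 * (p - y).mean()

# The largest-magnitude weight points at the discriminating peak.
print(int(np.argmax(np.abs(w))))
```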

------
dre85
I find this very interesting. As a related topic, would it be possible to use
deep learning to classify samples based on the quantities of pre-identified
chemicals? If so, how would this work roughly? Does anybody have any ideas?
Traditionally people use linear discriminant analysis, PCA, PLS, etc. I can't
really wrap my head around the use of multiple neural network layers for such
problems.

~~~
saurabh20n
One possibility is as an extension of the untargeted analysis: run the
analysis over different kinds of samples. The output for each sample is the
list of major peaks (and intensities). Use this as the "image" to train a
(shallow) network.

You might even get away without specifying pre-identified chemicals. Adding
that list would only help.
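A rough sketch of the suggestion above (plain NumPy; every name, range, and marker peak below is fabricated for illustration): bin each sample's (m/z, intensity) peak list onto a fixed m/z grid so every sample becomes the same-shaped "image" vector, then train a one-hidden-layer ("shallow") network on those vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Each sample's untargeted output is a list of (m/z, intensity) peaks;
# binning them onto a fixed m/z grid turns every sample into the same-
# shaped "image" vector. The ranges and bin count are invented.
def peaks_to_vector(peaks, mz_min=50.0, mz_max=550.0, bins=100):
    v = np.zeros(bins)
    for mz, intensity in peaks:
        j = int((mz - mz_min) / (mz_max - mz_min) * bins)
        v[min(max(j, 0), bins - 1)] += intensity
    return v

# Two fake sample types that differ only in one characteristic peak.
def fake_sample(kind):
    peaks = [(rng.uniform(50, 550), rng.uniform(0, 1)) for _ in range(20)]
    if kind == 1:
        peaks.append((300.0, 5.0))  # marker peak only type-1 samples carry
    return peaks_to_vector(peaks)

X = np.stack([fake_sample(k) for k in (0, 1) * 100])
y = np.array([0, 1] * 100)

# A single hidden layer, trained by vanilla gradient descent.
hidden = 16
W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=hidden);               b2 = 0.0
for _ in range(1000):
    H = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))
    d = (p - y) / len(y)                    # cross-entropy gradient at output
    dH = np.outer(d, W2) * (1.0 - H ** 2)   # backprop through tanh
    W2 -= 0.5 * (H.T @ d);  b2 -= 0.5 * d.sum()
    W1 -= 0.5 * (X.T @ dH); b1 -= 0.5 * dH.sum(axis=0)

H = np.tanh(X @ W1 + b1)
acc = (((1.0 / (1.0 + np.exp(-(H @ W2 + b2)))) > 0.5) == y).mean()
```

Since the fake marker peak makes the classes nearly separable, even this tiny network should reach high training accuracy; the binning step is the part that matters, since it removes the need to pre-specify chemicals.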

------
100ideas
Who's working on ChemStructure2Vec? Could the Word2Vec approach be used to
predict novel structures with functions in-between desired sets of known
chemicals?

~~~
cing
People are working on such things!
[https://arxiv.org/abs/1610.02415](https://arxiv.org/abs/1610.02415)

~~~
100ideas
Wow! Thanks for the link.

Can we map the latent chemical space directly into a word space? (Can we
ensure grammatical correctness? Semantic correctness?)

> "We report a method to convert discrete representations of molecules to and
> from a multidimensional continuous representation... We train deep neural
> networks on hundreds of thousands of existing chemical structures to
> construct two coupled functions: an encoder and a decoder. The encoder
> converts the discrete representation of a molecule into a real-valued
> continuous vector, and the decoder converts these continuous vectors back to
> the discrete representation from this latent space..."
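The encoder/decoder round trip in that quote can be sketched in miniature: one-hot a SMILES string, compress it to a two-dimensional continuous vector, and decode back. The paper uses deep variational networks; a linear autoencoder's optimum is just a truncated SVD, which is enough to show the shape of the idea (all choices below are mine, not the paper's).

```python
import numpy as np

# The two example molecules from this thread.
SMILES = ["O=C(O)C(N)C", "C([C@@H](C(=O)O)N)O"]   # alanine, serine
CHARS = sorted(set("".join(SMILES))) + [" "]       # " " pads short strings
MAXLEN = max(len(s) for s in SMILES)

def encode_discrete(s):
    """One-hot matrix: one row per position, one column per character."""
    x = np.zeros((MAXLEN, len(CHARS)))
    for i, c in enumerate(s.ljust(MAXLEN)):
        x[i, CHARS.index(c)] = 1.0
    return x.ravel()

X = np.stack([encode_discrete(s) for s in SMILES])

# Encoder: project onto the top-k right singular vectors of X; the
# decoder is the transpose. This is the linear-autoencoder optimum.
k = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                 # (features x k)

latent = X @ W               # continuous 2-D vector per molecule
recon = latent @ W.T         # back toward the discrete representation

def decode_discrete(row):
    """Pick the most likely character at each position."""
    grid = row.reshape(MAXLEN, len(CHARS))
    return "".join(CHARS[j] for j in grid.argmax(axis=1)).rstrip()

roundtrip = [decode_discrete(r) for r in recon]
```

With only two molecules and a rank-2 latent space the round trip is exact; the interesting behavior in the paper comes from decoding *new* points of the latent space, which a linear toy like this can't do meaningfully.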

Consider the set of all chemical structures composed of between 0-6 carbon
atoms, 0-4 oxygen atoms, 0-12 hydrogen atoms, 0-4 nitrogen atoms, and 0-1
"functional groups", expressed in the SMILES chemical nomenclature; for
instance, Alanine could be represented as O=C(O)C(N)C, and Serine as
C([C@@H](C(=O)O)N)O.

Can we create a correspondence function that takes a given SMILES chemical
structure, such as O=C(O)C(N)C, and returns a sequence of English words that
encodes the same structure, such that, read left to right:

    
    
      - each atom is represented by a noun or adjective that starts with the same letter(s) as the atomic symbol; (pronouns don't count)
      - double or triple bonds are represented with prepositions 
      - charge is represented numerically
      - branches are represented by subordinating conjunctions
      - verbs and articles and other parts of speech (non-subordinating conjunctions, pronouns) can be used freely for grammatical correctness
    

It's hard to select word strings that are semantically meaningful sentences,
but haiku-like forms are easier.

So Alanine ( O=C(O)C(N)C ) might be:

    
    
      (1)  O-Adjective/Noun preposition preposition C-Adjective/Noun subordinating-conjunction O-Adjective/Noun; 
      (2)  C-Adjective/Noun, subordinating-conjunction N-Adjective/Noun; 
      (3)  C-Adjective/Noun.
    
      (1)  Oranges blossom under and above Cherries where the Orangutans roam; 
      (2)  Conscious, unless Nocturnal; 
      (3)  Change coming.
    

Clearly many interesting structures could be expressed with semantically-
invalid sentences.

But conversely, how often are "interesting" sentences chemically pointless or
invalid?

Could we bias sentence construction to be more interesting by constraining it
with the semantic vector space of a big work of literature, such as phrases
found in Infinite Jest or the collected works of Shakespeare?
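For what it's worth, the first rule alone (each atom becomes a word sharing the atomic symbol's initial letter) is easy to prototype; the word table below is invented, and bonds, branches, and charge are ignored:

```python
import re

# Toy word table for rule 1; any noun/adjective with the right initial works.
WORDS = {"C": "cherries", "O": "oranges", "N": "nocturnal", "H": "hawks"}

def atoms_to_words(smiles):
    # Two-letter element symbols (e.g. Cl, Br) would need a longer pattern;
    # this handles the C/H/O/N alphabet used in the examples above.
    atoms = re.findall(r"[A-Z]", smiles)
    return " ".join(WORDS[a] for a in atoms)

print(atoms_to_words("O=C(O)C(N)C"))
# -> oranges cherries oranges cherries nocturnal cherries
```

Everything after this, picking prepositions for the bonds and conjunctions for the branches so the result parses as a sentence, is where a language-model prior over a corpus would have to come in.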

