
Keras-based molecular autoencoder - frisco
https://github.com/maxhodak/keras-molecules
======
duvenaud
One of the authors of the original paper
[https://arxiv.org/abs/1610.02415](https://arxiv.org/abs/1610.02415) here.

From a machine learning point of view, we simply glued together two
techniques: text autoencoders and Bayesian optimization. That is, we trained
an autoencoder to transform a text representation of chemicals (SMILES) to and
from continuous vectors. Then “chemical design” is just maximizing a function
of a continuous variable, something that we already know a lot about.
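As a rough sketch of that "glue" (everything here is a toy stand-in: a stubbed encoder instead of the trained neural net, a made-up property function, and random hill climbing instead of Bayesian optimization), the idea looks something like:

```python
# Toy sketch: map a SMILES string to a continuous vector, then treat
# "chemical design" as ordinary continuous optimization over that space.
# The real paper trains a neural autoencoder and uses Bayesian optimization;
# every function below is an illustrative placeholder.
import random

random.seed(0)
LATENT_DIM = 4

def encode(smiles):
    """Stub encoder: hash characters into a fixed-size continuous vector."""
    z = [0.0] * LATENT_DIM
    for i, ch in enumerate(smiles):
        z[i % LATENT_DIM] += (ord(ch) % 17) / 17.0
    return z

def property_score(z):
    """Stand-in for a property predictor over latent vectors (peak at 0.5s)."""
    return -sum((zi - 0.5) ** 2 for zi in z)

def optimize(z, steps=200, step_size=0.05):
    """Random hill climbing: accept a perturbed vector only if it scores better."""
    best, best_score = z[:], property_score(z)
    for _ in range(steps):
        cand = [zi + random.uniform(-step_size, step_size) for zi in best]
        s = property_score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

z0 = encode("CCO")                 # ethanol, as a SMILES string
z_opt, score = optimize(z0)
print(property_score(z0), score)   # optimized score is never worse than the start
```

The optimized vector would then be fed back through the decoder to get a candidate molecule.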

We also showed off some of the nice things that one can do with continuous
latent representations, such as interpolation. This had already been done for
images by many people, and for text in
[https://arxiv.org/abs/1511.06349](https://arxiv.org/abs/1511.06349)
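Interpolation itself is trivial once molecules live in a continuous space. A minimal sketch, with hand-picked latent vectors (not real encodings) and the decoder elided:

```python
# Latent-space interpolation: walk linearly between two latent codes.
# The vectors are illustrative; in the real model each intermediate point
# would be passed through the trained decoder to get a SMILES string.
def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors at fraction t in [0, 1]."""
    return [a + t * (b - a) for a, b in zip(z_a, z_b)]

z_a = [0.0, 1.0, -0.5]   # latent code of molecule A (made-up values)
z_b = [1.0, -1.0, 0.5]   # latent code of molecule B

path = [lerp(z_a, z_b, i / 4) for i in range(5)]
for z in path:
    print(z)             # each z would be decoded to an intermediate molecule
```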

Of course, our paper is just a proof of concept. For instance, instead of
encoding to and from SMILES, it would be much better to encode to and from
graphs directly. We know how to encode graphs into vectors, but I don’t know
of a good way to decode a vector into a graph.

Another open problem is that it’s hard to know what to optimize. Our initial
experiments optimizing for specific chemical properties produced suggested
molecules with crazy structures, such as giant rings. Human chemists have a
great intuition for what is easily synthesizable or stable, and it’s hard to
articulate all the properties we want the molecule to have programmatically.
Alternatively, we could constrain the optimizer to only consider molecules
similar to ones we’ve already seen, but this is unsatisfying too - after all,
the best result of exploration is when you find something unlike what you’ve
seen before.

~~~
momeara
I think one of the biggest potential uses for molecular autoencoders is
generating inputs for virtual high-throughput screening campaigns to predict
new drugs. The idea would be to train models to propose compounds that can
then be evaluated with more physically realistic molecular docking simulations
--> in vitro activity assays --> animal models --> and finally clinical trials
as a candidate moves through the pipeline.

Here is an example from our lab using virtual screening to develop PZM21 to
treat pain [1], where we screened 3M compounds. We would have liked to screen
10^6-fold more compounds to cover easily synthesizable chemical space, in this
as well as other campaigns, but that is currently computationally infeasible.
If molecular autoencoders could help us screen this space more efficiently, it
would be huge.

I'm co-organizing a free, 1-day workshop on deep learning for chemoinformatics
at Stanford on Nov 11th. We've got ~75 mostly computational chemistry
researchers coming, and I would love to have more machine learning researchers
come as well. The website is deepchemworkshop.docking.org, or PM me if you
think you may be interested.

[1] Manglik, et al. Structure-based discovery of opioid analgesics with
reduced side effects (doi:10.1038/nature19112)

~~~
pizza
Someone with PZM21 knowledge! Intriguing! Would there be any point to using
low-granularity approximations of 'disjoint-class-ish' molecular backbones and
building upon best candidates, doing some kind of low-res hill-climbing in
effect, before increasing granularity with the best ones?

Granularity might be some kind of "well we need some kind of phenol or phenol-
derived ring here, why not just replace that with some sort of representation
of 'phenol-like-ring-here'" or something to that effect.

Also, about PZM21 -- will it ever experience the same fate of U47700, or the
likes of the orphaned opioids resurfacing from their watery graves?

~~~
momeara
In theory I think you are right--there should be a tower of representations
from low-res/fast to high-res/slow. In practice, though, it has been hard to
make multi-resolution modeling work together. For example, for proteins, where
the backbone is much more regular than in small molecules, Rosetta has a
"centroid mode" and a "full atom mode". There are also QM/MM models, where
just the active site is modeled with a higher level of theory.

For virtual screening it is possible to speed things up by say not taking into
account receptor flexibility or ignoring explicit interactions with water.

As for lower-resolution representations of small molecules, there is ROCS[1]
and friends, which represent small molecules as a set of Gaussians.

One of the challenges with low-resolution representations is that the aim of
virtual screening is often to find novel backbones that may interact with the
protein. So any low-resolution representation should mix different backbones
into the same cluster, but finding such a representation is difficult given
the diversity of small molecules.

As for U47700, finding the mechanism of action for drugs that treat complex
processes like pain is quite difficult. Also, small molecules often interact
with numerous targets, so deconstructing how one works is nontrivial. Part of
the motivation for PZM21 is to try to separate out the downstream effects of
hitting the mu-opioid receptor as a "biased" ligand. I think PZM21, with its
new scaffold, will help disentangle the effects of classical opioids.

[1] [https://www.eyesopen.com/rocs](https://www.eyesopen.com/rocs)

~~~
pizza
Any concerns that PZM21 will be an even 'better' designer drug than U47700,
O-DSMT, MPP, hell even heroin? Especially due to the adversarial nature of
clandestine chemists and their respective nations' law enforcement agencies.
Then again, taking a peek at PZM21's shape, good luck out there to all the
non-sigma aldrich tier chemists who want to make their own, lol.

------
jostmey
This is the future of computational chemistry. The field of molecular dynamics
could easily be swept aside by machine learning. That is why I switched from
running molecular dynamics simulations as a graduate student to modelling
genomic data with TensorFlow as a postdoc.

~~~
cing
How do you suppose the questions you've answered working on ion channels would
be addressed with machine learning? I get the impression by "swept aside" that
you feel the field of molecular biophysics is not meaningful in the study of
health and medicine.

~~~
jostmey
The neural network community uses proper "training"/"validation"/"test"
datasets to assess model performance, and has developed algorithms to fit
complex models to large amounts of data using relatively little computing
power. I think it is possible to build completely new and accurate models of
biophysical systems with neural networks with relative ease.

Using models like variational autoencoders, it might be possible to draw
_truly_ independent samples in a single step, instead of relying on many MD
steps to find novel conformations.
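The contrast can be sketched in a few lines. Everything here is illustrative: "MD" is a stand-in random walk rather than a real force-field integrator, and the "VAE" side only draws from the latent prior (a trained decoder would map that draw to a conformation):

```python
# Toy contrast: many small, correlated MD-like steps versus one independent
# draw from a VAE's latent prior. No real physics or trained model involved.
import random

random.seed(1)

def md_step(x, dt=0.01):
    """One tiny correlated move, like an MD integrator: x(t+dt) stays near x(t)."""
    return [xi + random.gauss(0.0, dt) for xi in x]

def vae_sample(dim):
    """One-shot draw from the latent prior z ~ N(0, I); each sample is
    independent of the last, unlike consecutive MD frames."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

x = [0.0, 0.0, 0.0]
for _ in range(1000):          # many small, correlated steps...
    x = md_step(x)
z = vae_sample(3)              # ...versus a single independent sample
print(x, z)
```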

I could go on, but I have to get to bed now :-)

~~~
cing
I like the idea, but I wouldn't call it "relative ease". Are we talking about
making better force fields, or training a recurrent neural network for
generative dynamics that obeys the laws of physics (i.e. can be used to
calculate ensemble averages) for arbitrary-length proteins/folds? How do you
construct a VAE for protein dynamics when you only have a single structure?
There's a serious lack of data that prevents these things.

------
pizza
It might be interesting if there were some routine that could 'reverse-decode'
drug-like properties, i.e. produce Lipinski's rule of 5 from nothing but a
training set labeled "drug-like" and "not-drug-like".

~~~
rgbombarelli
We did something like that for a more continuous property, like logP, which is
easy to predict with cheminformatics. We are working on metrics that reflect
drug-likeness better.

~~~
pizza
Interesting. If I'm not mistaken, logP is the type of property where it is
easy to look up each constituent's "contribution" for the purpose of
predicting logP under substitutions/alterations of an example molecule.

~~~
rgbombarelli
Exactly, we have a good understanding that logP is additive (as a matter of
fact, one predicts it using group contributions). The AE noticed this and
started adding halogens to molecules that already had high logP.

The interesting part is that it stops before going totally crazy and
perhalogenating the molecule. This is probably because it has an intuition
about what molecules look like and hasn't really seen that kind of
substitution pattern.
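A toy picture of that additivity (the per-group values below are invented for illustration, not real fitted contributions like Crippen's):

```python
# Group-contribution logP in miniature: the estimate is just a sum over
# fragments, so swapping in a halogen bumps the total -- the pattern the
# autoencoder exploited. Contribution values are made up for illustration.
CONTRIB = {          # hypothetical per-group logP contributions
    "CH3": 0.55,
    "CH2": 0.50,
    "OH": -1.12,
    "Cl": 0.71,
}

def logp(groups):
    """Group-contribution estimate: logP modeled as a sum over fragments."""
    return sum(CONTRIB[g] for g in groups)

ethanol_like = ["CH3", "CH2", "OH"]
chlorinated = ["CH3", "CH2", "Cl"]          # swap the OH for a halogen

print(round(logp(ethanol_like), 2))   # -0.07
print(round(logp(chlorinated), 2))    # 1.76
```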

~~~
Houshalter
Wait, what is logP? I assumed you meant log probability, since that is the
standard error metric for text prediction tasks like this.

~~~
rgbombarelli
It's the octanol-water partition coefficient, a basic molecular descriptor
related to the physiological distribution of a drug. It is one of the multiple
targets one aims to optimize in a novel drug-like compound.

[https://en.wikipedia.org/wiki/Partition_coefficient](https://en.wikipedia.org/wiki/Partition_coefficient)

------
dkural
The author of the GitHub repo doesn't seem to be an author on the arXiv
preprint. Anyone know why?

~~~
rgbombarelli
One of the authors here. Mostly he was just faster than us! Most co-authors
are busy right now starting jobs at new places (plus trying to get published
in an old-school journal for the comfort of the chemists out there).

Max has done a very good job writing a neat implementation of our autoencoder.
I encourage everyone to go invent some new molecules!

------
ecesena
The chart in the readme makes the text completely unreadable on mobile. I'd
suggest putting it at the top or bottom of the text.

------
michaelmwangi
I did Computer-Aided Drug Design for my undergrad. I wish I had known about
this.

