Keras-based molecular autoencoder (github.com/maxhodak)
127 points by frisco on Oct 24, 2016 | 41 comments

One of the authors of the original paper https://arxiv.org/abs/1610.02415 here.

From a machine learning point of view, we simply glued together two techniques: text autoencoders and Bayesian optimization. That is, we trained an autoencoder to transform a text representation of chemicals (SMILES) to and from continuous vectors. Then “chemical design” is just maximizing a function of a continuous variable, something that we already know a lot about.
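A toy sketch of that glue, with the encoder/decoder and property predictor stubbed out and plain random-search hill climbing standing in for Bayesian optimization (the names `property_score`, `optimize_latent`, and `LATENT_DIM` are illustrative, not from the paper's code):

```python
import numpy as np

# "Chemical design" as continuous optimization: encode molecules to vectors,
# then maximize a property over the latent space. The property predictor is
# a stub here; only the optimization loop is real.

rng = np.random.default_rng(0)
LATENT_DIM = 8  # illustrative size

def property_score(z):
    # Stub for "predicted property of decode(z)": a smooth function
    # with a known maximum at z = (1, ..., 1).
    return -float(np.sum((z - 1.0) ** 2))

def optimize_latent(n_iters=2000, step=0.1):
    # Plain random-search hill climbing; the paper uses Bayesian
    # optimization, but the interface is the same: propose a latent
    # vector, score it, keep the best.
    z = rng.standard_normal(LATENT_DIM)
    best = property_score(z)
    for _ in range(n_iters):
        cand = z + step * rng.standard_normal(LATENT_DIM)
        s = property_score(cand)
        if s > best:
            z, best = cand, s
    return z, best

z_opt, score = optimize_latent()
```

With a trained autoencoder, `z_opt` would be decoded back to a SMILES string; here it should simply land near the stub objective's optimum.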

We also showed off some of the nice things that one can do with continuous latent representations, such as interpolation. This had already been done for images by many people, and for text in https://arxiv.org/abs/1511.06349
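Interpolation reduces to walking a straight line between two latent codes and decoding each point. A minimal numpy sketch, with the two latent vectors hard-coded as stand-ins for real encoder outputs:

```python
import numpy as np

# z_a and z_b are stand-ins for the encoder outputs of two molecules.
z_a = np.array([0.0, 0.0, 0.0])
z_b = np.array([1.0, 2.0, -1.0])

def interpolate(z_a, z_b, n_steps=5):
    # Linear interpolation; spherical interpolation (slerp) is often
    # preferred for Gaussian latent spaces, but linear keeps this short.
    return [(1 - a) * z_a + a * z_b for a in np.linspace(0.0, 1.0, n_steps)]

path = interpolate(z_a, z_b)
# With a real decoder, each path[i] would decode to an intermediate molecule.
```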

Of course, our paper is just a proof of concept. For instance, instead of encoding to and from SMILES, it would be much better to encode to and from graphs directly. We know how to encode graphs into vectors, but I don’t know of a good way to decode a vector into a graph.

Another open problem is that it’s hard to know what to optimize. Our initial experiments optimizing for specific chemical properties produced suggested molecules with crazy structures, such as giant rings. Human chemists have great intuition for what is easily synthesizable or stable, and it’s hard to articulate programmatically all the properties we want the molecule to have. Alternatively, we could constrain the optimizer to only look at molecules similar to ones we’ve already seen, but this is unsatisfying too - after all, the best result of exploration is when you find something unlike anything you’ve seen before.

Hey David, author of the linked repo here. I thought the paper was pretty neat and I'm a fan of the graph convolution work being done by you and by Patrick Riley's and Vijay Pande's groups. A few questions:

- In the paper you trained the property decoder separately, but mention in passing that it could be a good idea to train the whole thing jointly. I haven't implemented the property decoder part yet, and it seems like a pretty easy extension to just add that to the same loss function. What's your thinking on independent vs. joint training of the autoencoder/property prediction network now? Do you think non-obvious tricks will be required to get nice smoothness of the latent representation for things more complex than logP?

- Though learning directly to/from SMILES is weird, it's also kind of awesome. Do you think the future is an "inverse weave" module or something of that kind, or going further on learning more complex input representations?

- How do you feel about the synthetic accessibility score you used in the paper overall? Obviously it fell short on carbon ring sizes, but do you think we're close on this, or do we basically need to start over? What datasets would you think about using for "learning" synthetic accessibility, as you imply in your post here?

Thanks! My vague intuition had been that this is a domain where I'd expect physics-based simulation to work better than machine learning, but I have to say that the recent literature is changing my perspective.

Thank you very much, Max! To answer your questions:

- As you say, jointly training the autoencoder for reconstruction and prediction accuracy would be easy to code, but it might be tricky to get the tradeoff right since we usually have relatively little labeled data. A massively multi-task approach probably makes sense here. I'm not sure how hard it will be to get smoothness w.r.t. the latent representation, but the logP result was mildly encouraging.

- I agree that it's probably time to bring some of the recent augmented recurrent network architectures to bear on graph decoders. What makes it hard is that it requires differentiating through a cascading series of discrete choices, but people are making progress on this sort of problem.

- I do think we need to completely start over on synthetic accessibility. There are just too many ways for molecules to be weird to try to write an explicit function for it. In fact, I'd go so far as to say that this is the 'missing half' of this method.

- I view this method as a complement to physics-based simulation. A neural net is almost always going to be cheaper to evaluate than a physical simulation, so it's not a bad way to compile all the simulation results together.
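The joint training mentioned in the first point boils down to one weighted objective. A hedged numpy sketch, in which the per-example losses and the labeled/unlabeled mask are toy stand-ins and `prop_weight` is the tradeoff knob that is hard to set with little labeled data:

```python
import numpy as np

# One weighted objective: reconstruction error for every molecule, plus
# property-prediction error for the (small) labeled subset only. The mask
# zeroes the property term where no label exists; all arrays are toy values.

def joint_loss(recon_err, prop_err, labeled_mask, prop_weight=0.1):
    # recon_err: per-example reconstruction loss (e.g. SMILES cross-entropy)
    # prop_err:  per-example property error, valid only where mask == 1
    # prop_weight: the reconstruction-vs-prediction tradeoff knob
    recon = recon_err.mean()
    n_labeled = labeled_mask.sum()
    prop = (prop_err * labeled_mask).sum() / max(n_labeled, 1.0)
    return recon + prop_weight * prop

recon_err = np.array([0.5, 0.2, 0.8, 0.4])
prop_err = np.array([1.0, 0.0, 3.0, 0.0])      # meaningless where unlabeled
labeled_mask = np.array([1.0, 0.0, 1.0, 0.0])  # only 2 of 4 examples labeled

loss = joint_loss(recon_err, prop_err, labeled_mask)  # 0.475 + 0.1 * 2.0
```

In a Keras implementation this would just be a second output head with its own loss weight; the hard part, as noted above, is choosing that weight.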

I think one of the biggest potentials for molecular autoencoders is that they can be used to generate inputs for virtual high-throughput screening campaigns to predict new drugs. The idea would be to train models to propose compounds that can then be evaluated with more physically realistic molecular docking simulations --> in vitro activity assays --> animal models --> and finally clinical trials as a candidate moves through the pipeline.

Here is an example from our lab using virtual screening to develop PZM21 to treat pain [1], where we screened 3M compounds. We would have liked to screen 10^6-fold more compounds to cover easily synthesizable chemical space in this as well as other campaigns, but that is currently computationally infeasible. If molecular autoencoders could help us screen this space more efficiently, it would be huge.

I'm co-organizing a free, 1-day workshop on deep learning for chemoinformatics at Stanford on Nov 11th. We've got ~75 mostly computational chemistry researchers coming. I would love to have more machine learning researchers come as well. The website is deepchemworkshop.docking.org, or PM me if any of you think you may be interested.

[1] Manglik, et al. Structure-based discovery of opioid analgesics with reduced side effects (doi:10.1038/nature19112)

Several people have asked for background material for the workshop--

(Wallach, 2015, http://arxiv.org/pdf/1510.02855.pdf) AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery

(Duvenaud, 2015, http://papers.nips.cc/paper/5954-convolutional-networks-on-g...) Convolutional Networks on Graphs for Learning Molecular Fingerprints

(Kearnes, 2016, http://arxiv.org/abs/1606.08793v1) Modeling Industrial ADMET Data with Multitask Networks

(Gómez-Bombarelli, 2016, doi:10.1038/nmat4717) Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach

and of course

(Gómez-Bombarelli, 2016, https://arxiv.org/abs/1610.02415) Automatic chemical design using a data-driven continuous representation of molecules

Someone with PZM21 knowledge! Intriguing! Would there be any point in using low-granularity approximations of 'disjoint-class-ish' molecular backbones and building upon the best candidates, doing some kind of low-res hill climbing in effect, before increasing granularity with the best ones?

Granularity might be some kind of "well we need some kind of phenol or phenol-derived ring here, why not just replace that with some sort of representation of 'phenol-like-ring-here'" or something to that effect.

Also, about PZM21 -- will it ever experience the same fate of U47700, or the likes of the orphaned opioids resurfacing from their watery graves?

In theory I think you are right--there should be a tower of representations from low-res/fast to high-res/slow, though in practice it has been hard to make multi-resolution modeling work together. For example, for proteins, where the backbone is much more regular than in small molecules, Rosetta has "centroid mode" and "full atom mode". There are also QM/MM models where just the active site is modeled with a higher level of theory.

For virtual screening it is possible to speed things up by, say, not taking receptor flexibility into account or by ignoring explicit interactions with water.

As for lower-resolution representations of small molecules, there is ROCS[1] and friends, which represent small molecules as a set of Gaussians.

One of the challenges with low-resolution representations is that the aim of virtual screening is often to find novel backbones that may interact with the protein. So any low-resolution representation should mix different backbones into the same cluster, but finding such a representation is difficult given the diversity of small molecules.

As for U47700, finding the mechanism of action for drugs that treat complex processes like pain is quite difficult. Also, small molecules often interact with numerous targets, so deconstructing how one works is nontrivial. Part of the motivation for PZM21 is to try to separate out the downstream effects of hitting the mu-opioid receptor as a "biased" ligand. I think PZM21, with its new scaffold, will help disentangle the effects of classical opioids.

[1] https://www.eyesopen.com/rocs

Any concerns that PZM21 will be an even 'better' designer drug than U47700, O-DSMT, MPP, hell even heroin? Especially given the adversarial nature of clandestine chemists and their respective nations' law enforcement agencies. Then again, taking a peek at PZM21's shape, good luck out there to all the non-Sigma-Aldrich-tier chemists who want to make their own, lol.

Hey are there any datasets related to this stuff publicly available? It would be awesome to put this up on kaggle and let people compete to find the best model.

There are some, but more effort is needed.

http://deepchem.io/ is trying to set up standard data sets for chemoinformatics/machine learning.

ChEMBL and PubChem are the big public repositories though some care must be taken in curating data from these for machine learning.

SMILES strings are not unique; each molecule can be encoded as a SMILES string in multiple ways. As a sanity check, you could pass multiple representations of a molecule into the autoencoder to see if they generate the same continuous representation. Otherwise, you may end up essentially training to the peculiarities of whatever algorithm generated your SMILES strings.

You are right about SMILES. We trained on the canonical SMILES output by RDKit, so as far as our AE is concerned, there is only one way to write the SMILES for a given molecule (http://www.rdkit.org/docs/GettingStartedInPython.html#writin...). Of course, the choice of canonical SMILES is somewhat arbitrary.
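The sanity check suggested above can be written as a small model-agnostic helper. `encode` and `canonicalize` are injected stand-ins here (with RDKit, `canonicalize(s)` would be `Chem.MolToSmiles(Chem.MolFromSmiles(s))` and `encode` the trained encoder); the stub implementations below exist only to make the example runnable:

```python
# Check that equivalent SMILES spellings map to the same latent vector.

def encodings_consistent(smiles_variants, encode, canonicalize, tol=1e-6):
    # All variants must canonicalize to one string AND encode to latent
    # vectors within `tol` of the first variant's vector.
    if len({canonicalize(s) for s in smiles_variants}) != 1:
        return False
    ref = encode(smiles_variants[0])
    return all(
        max(abs(a - b) for a, b in zip(encode(s), ref)) <= tol
        for s in smiles_variants[1:]
    )

# Stubs: two spellings of benzene, a lookup-table "canonicalizer", and an
# "encoder" that only looks at the canonical form (hence consistent).
CANON = {"c1ccccc1": "c1ccccc1", "C1=CC=CC=C1": "c1ccccc1"}
stub_canonicalize = CANON.__getitem__
stub_encode = lambda s: [float(len(stub_canonicalize(s))), 0.0]

ok = encodings_consistent(["c1ccccc1", "C1=CC=CC=C1"],
                          stub_encode, stub_canonicalize)
```

An encoder trained only on canonical SMILES may well fail this check on non-canonical spellings, which is exactly the point of running it.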

Are you or any of your co-authors looking for Master's (M2) interns on this topic? I am a machine learning MSc candidate at Paris-Sud University looking for an internship applying ML to other scientific fields like physics or chemistry. I think generative models have great potential for system design. Web CV: https://laurentcetinsoy.net/cv

You can contact me if you want a more formal version : )

Kind regards

What's the point of the autoencoder? Why use those instead of CNNs?

Autoencoders are unsupervised learning, whereas CNNs are typically trained with supervision. Learning the input space can be thought of as a form of regularization when training data is scarce. http://www.deeplearningbook.org/ is a wonderful resource for learning more about why and when to use different architectures.
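To make the unsupervised point concrete, here is a minimal linear autoencoder in plain numpy: no labels anywhere, yet gradient descent on reconstruction error alone recovers low-dimensional structure (all sizes and the learning rate are arbitrary toy choices):

```python
import numpy as np

# 4-D inputs that secretly lie on a 2-D subspace are squeezed through a
# 2-unit bottleneck; training minimizes reconstruction error only.

rng = np.random.default_rng(1)
basis = rng.standard_normal((2, 4))
X = rng.standard_normal((200, 2)) @ basis     # 200 samples, rank-2 structure

W_enc = 0.3 * rng.standard_normal((4, 2))     # encoder: 4 -> 2
W_dec = 0.3 * rng.standard_normal((2, 4))     # decoder: 2 -> 4

def recon_loss(X, W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

loss_before = recon_loss(X, W_enc, W_dec)
lr = 0.05
for _ in range(300):
    Z = X @ W_enc                        # latent codes
    G = 2.0 * (Z @ W_dec - X) / X.size   # d(loss)/d(reconstruction)
    g_dec = Z.T @ G
    g_enc = X.T @ (G @ W_dec.T)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
loss_after = recon_loss(X, W_enc, W_dec)
```

The molecular autoencoder is the same idea with a sequence model over SMILES characters in place of the linear maps.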

This is the future of computational chemistry. The field of "Molecular Dynamics" could easily be swept aside by machine learning. It is why I switched from running molecular dynamics simulations as a graduate student to modelling genomic data with TensorFlow as a postdoc.

Who says you cannot do deep learning and molecular dynamics? There are a couple of recent examples of neural networks being used to predict energies and gradients, allowing faster computation and larger time steps in MD simulations.


I agree. I did molecular dynamics simulations for my PhD. In the last year I spent some time exploring feature representations that capture atomic environments up to their symmetry invariants (rotations, reflections, permutations of identical atoms, etc.). The use of machine learning for MD and quantum chemistry has exploded in the last year and a half. I'm a bit sad that I finished my degree right as this kind of work is heating up.

h-suppressed graphs?

VSEPR graph stuff?

compressive sampling / measuring of typical states?

ligand complexes?

ligand transport?

membrane dynamics?

predictive toxicology?

cheaper Hartree-Fock approximations?

quark shit???

time to daydream

How do you suppose the questions you've answered working on ion channels would be addressed with machine learning? I get the impression from "swept aside" that you feel the field of molecular biophysics is not meaningful in the study of health and medicine.

The neural network community uses proper training/validation/test dataset splits to assess model performance, and has developed algorithms to fit complex models to large amounts of data using relatively little computing power. I think it is possible to build completely new and accurate models of biophysical systems with neural networks with relative ease.

Using models like variational autoencoders, it might be possible to draw truly independent samples in a single step, instead of relying on lots of MD steps to find novel conformations.
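That contrast can be shown with a 1-D toy: a random-walk "trajectory" whose successive samples are strongly correlated, versus independent draws from a prior passed through a stub decoder:

```python
import numpy as np

# Random-walk "MD trajectory" vs. i.i.d. draws through a (stub) decoder.

rng = np.random.default_rng(0)
N = 10_000

md = np.empty(N)                      # each frame is a step from the previous
md[0] = 0.0
for t in range(1, N):
    md[t] = md[t - 1] + 0.1 * rng.standard_normal()

decode = lambda z: z                  # stub; a VAE would map z to coordinates
iid = decode(rng.standard_normal(N))  # every sample drawn independently

def lag1_autocorr(x):
    x = x - x.mean()
    return float(x[:-1] @ x[1:] / (x @ x))

ac_md = lag1_autocorr(md)    # near 1: frames are far from independent
ac_iid = lag1_autocorr(iid)  # near 0: each draw is fresh
```

The caveat, raised in the replies below about data scarcity, is that the generative model has to be trained on something before it can emit those independent samples.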

I could go on, but I have to get to bed now :-)

I like the idea, but I wouldn't call it "relative ease". Are we talking about making better force fields, or training a recurrent neural network for generative dynamics that obeys the laws of physics (can be used to calculate ensemble averages) for arbitrary length proteins/folds? How do you construct a VAE for protein dynamics when you only have a single structure? There's a serious lack of data that prevents these things.

Molecular dynamics simulations can be used to answer a range of structural biology questions, but abstractly many of them can be phrased as evaluating the difference in free energy between different conformational states. In molecular dynamics this is done by thermodynamic integration of the energy over the state-space volume for each of the conformational states.

An alternative approach is to directly map conformational states to their free energy. This leads to a problem of searching for candidate conformational states (e.g. the folded state, transition states etc.) and scoring them. Usually for a given computational budget there is a trade off between better conformational sampling or higher accuracy energy scoring.

Historically, searching and scoring methods have been designed separately. For example, [1] improves sampling while [2] improves energetics. This is done because they historically involved different aspects of the simulation and each is a lot of work. But searching and scoring are not really separable, in that the deeper one samples, the more challenging the task of the scoring function becomes--discriminating stable from unstable conformations.

Another application that can be thought of as searching and scoring is the game of Go. My impression is that one of the major breakthroughs with AlphaGo is that they were able to integrate the models for searching and scoring and learn them simultaneously. It would be awesome if similar architectures could be applied to molecular modeling.

A remaining challenge in applying Go-style models to molecular biology is that while the representation and scoring rules for Go are fixed and quite simple, the ground truth for molecular simulations comes from heterogeneous experimental data (X-ray crystal structures, small-molecule activities, directed-evolution antibody screens, etc.) and higher-level-of-theory QM simulations, which have their own challenges. However, I think the principles carry over--complicated scoring functions (e.g. free energy) over large state spaces (e.g. protein conformation space or chemical space) can be learned by combining models for searching and scoring. I think deep learning is poised to tackle these problems.

[1] (Conway, et al., 2013, DOI: 10.1002/pro.2389) Relaxation of backbone bond geometry improves protein energy landscape modeling

[2] (Park, 2016, PMID: 27766851) Simultaneous optimization of biomolecular energy function on features from small molecules and macromolecules.

I'm in this field, so I'm quite familiar with the search/score situation, but thanks for clearly mapping out the challenge and identifying where you think neural networks will be most beneficial. I just think the particulars make this an immensely different story than Go, and not just the point you describe as the "remaining challenge".

The search space is vast in Go, but it inevitably shrinks over the game, whereas in MD simulations it does not shrink; proteins can fold and unfold. There is a fixed set of legal positions in Go, but the set of legal moves for a protein conformation fluctuates wildly, is governed by physics (which you would need to relearn), and is likely to be much larger than Go's since it's continuous. In simulations you care about successive moves, whereas AlphaGo does not care about time-dependent properties (there are also kinetic observables, like folding rates, that seem non-intuitive to evaluate without simulations). Even if you sampled enough conformations on some pathway, perhaps some sort of allosteric change, how would you know how fast it happens? In Go you always play the same game, but in simulations you often play different games, i.e., you don't want your protein unfolding when you are studying ligand binding. In a similar vein, imagine a single-point mutation that causes protein misfolding. It seems to me that you'd need to retrain your search/score algorithm for each new protein sequence, which doesn't seem like you're saving much time/complexity. There is also a huge problem of scale. We're talking about proteins varying from hundreds to hundreds of thousands of atoms/dihedrals/contacts, not to mention sampling water in the active sites of druggable proteins.

I think it could work in principle, but a physics-based approach sure seems elegant by comparison.

Hi Chris,

You bring up very good issues and perhaps I'm being too optimistic. I definitely agree that there isn't going to be one single mapping of sequence --> energy landscape any time soon or even ever.

But I think there are subproblems that are easier because the search space is more limited[1] or the chemistry is easier (e.g. avoiding chemical reactions or interactions with high-energy fields). I think often the major modeling challenge is identifying when it is feasible to take advantage of problem constraints or when lower levels of theory can be used. For example, there is a range of "enhanced sampling methods" for molecular dynamics that e.g. constrain the simulation to a reaction coordinate or assume Markov transitions between states so they can be computed on a distributed cluster.
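The "assume Markov transitions" idea can be sketched as a tiny Markov state model: transition counts pooled from many short, independently run trajectories (the 3-state count matrix below is made up) give a transition matrix whose stationary distribution estimates long-timescale equilibrium populations:

```python
import numpy as np

# Toy 3-state Markov state model from made-up transition counts.
counts = np.array([[90., 10.,  0.],
                   [10., 80., 10.],
                   [ 0., 20., 80.]])

T = counts / counts.sum(axis=1, keepdims=True)  # P(state j | state i)

def stationary(T, n_iter=1000):
    # Power iteration toward the pi with pi @ T == pi.
    pi = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(n_iter):
        pi = pi @ T
    return pi / pi.sum()

pi = stationary(T)  # estimated equilibrium population of each state
```

The point of the construction is exactly the one above: the short trajectories can run on a distributed cluster, and only the counts need to be aggregated.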

Taking advantage of these opportunities often requires a fair amount of engineering to build appropriate representations. I wonder to what extent these representations can be learned?

Please, feed me more specifics while I salivate. In as well-defined a problem/mission statement/objective as possible, what are you doing (and with what / how), and where do you intend to reach?

So DE Shaw Research is barking up the wrong tree?

What kind of genomic data out of curiosity?

Is that actually working? I tried rolling my own neural nets and running genomic data through them and instantly realized the problem was low N, high autocorrelation in the data. Bayesian forests seem like a better choice, but that's me.

There have been some recent papers in subsets of genomics with a bit more data (Transcription factor binding, for example [1]). You're correct though, definitely depends on your problem.

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908339/pdf/btw...

It might be interesting if there were some routine that could 'reverse-decode' drug-like properties, i.e. produce Lipinski's rule of 5 from nothing but a "drug-like"/"not-drug-like" labeled training set.

We did something like that for a more continuous property, like logP, which is easy to predict with cheminformatics. We are working on metrics that reflect drug-likeness better.

Interesting. If I'm not mistaken, logP is a type of property where it is easy to look up constituents' "contribution" for the purpose of predicting logP with substitutions/alterations of the example molecule.

Exactly, we have a good understanding that logP is additive (as a matter of fact, one predicts it using group contributions). The AE noticed this and started adding halogens to already high logP molecules.

The interesting part is that it stops before going totally crazy and perhalogenating the molecule. This is probably because it has an intuition about what molecules look like and hasn't really seen that kind of substitution pattern.
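A toy group-contribution calculator makes the additivity concrete. The per-fragment values below are hypothetical (real schemes such as Crippen's fit them to data), but they show why every added halogen bumps the estimate up:

```python
# Hypothetical per-fragment logP contributions; the estimate is just their
# weighted sum, so each added halogen raises it by a fixed amount.

CONTRIB = {"aromatic_C": 0.3, "H": 0.1, "Cl": 0.7}  # made-up values

def toy_logp(fragment_counts):
    return sum(CONTRIB[f] * n for f, n in fragment_counts.items())

benzene       = {"aromatic_C": 6, "H": 6}
chlorobenzene = {"aromatic_C": 6, "H": 5, "Cl": 1}

logp_benzene = toy_logp(benzene)        # 6*0.3 + 6*0.1 = 2.4
logp_chloro  = toy_logp(chlorobenzene)  # 6*0.3 + 5*0.1 + 0.7 = 3.0
```

An optimizer pointed at a score like this has an obvious exploit: keep swapping H for Cl, which is essentially the behavior described above.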

Wait, what is logP? I assumed you meant log probability, since that is the standard error metric for text prediction tasks like this.

It's the octanol-water partition coefficient, a basic molecular descriptor of a drug's physiological distribution. It is one of the multiple targets one aims to optimize in a novel drug-like compound.


The author of the GitHub repo doesn't seem to be an author on the arXiv preprint. Does anyone know why?

One of the authors here. Mostly he was just faster than us! Most co-authors are busy right now starting jobs at new places (plus trying to get published in an old-school journal for the comfort of the chemists out there).

Max has done a very good job writing a neat implementation of our autoencoder. I encourage everyone to go invent some new molecules!

I'm not part of the group that wrote the paper. I'm just a guy on the internet.

The chart in the readme makes the text completely unreadable on mobile. I'd suggest putting it above or below the text.

I did computer-aided drug design for my undergrad. I wish I had known about this.
