
This is the future of computational chemistry. The field of "Molecular Dynamics" could easily be swept aside by machine learning. It is why I switched from running "Molecular Dynamics" simulations as a graduate student to modelling genomic data with TensorFlow as a Postdoc.

Who says you cannot do deep learning and molecular dynamics? There are a couple of recent examples of neural networks used to predict energies and gradients, allowing faster computation and larger time steps in MD simulations.
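The idea can be sketched in a few lines: a network maps coordinates to a scalar energy, and the forces come from its (analytic) negative gradient. The tiny random-weight MLP below is a stand-in for a trained model; a real NN potential would be fit to quantum-chemistry data.

```python
import numpy as np

# Toy neural-network potential: a small MLP maps atomic coordinates to a
# scalar energy; the force is the analytic negative gradient of that energy.
# The random weights here are hypothetical stand-ins for a trained model.
rng = np.random.default_rng(0)
n_atoms, hidden = 5, 16
W1 = rng.normal(size=(3 * n_atoms, hidden))
W2 = rng.normal(size=(hidden, 1))

def energy_and_forces(coords):
    x = coords.ravel()                      # (3N,) flattened coordinates
    h = np.tanh(x @ W1)                     # hidden activations
    energy = float(h @ W2)                  # scalar energy
    # backprop by hand: dE/dx = W1 @ ((1 - h^2) * W2)
    grad = W1 @ (W2.ravel() * (1.0 - h ** 2))
    forces = -grad.reshape(coords.shape)    # force = -dE/dx
    return energy, forces

coords = rng.normal(size=(n_atoms, 3))
E, F = energy_and_forces(coords)
```

Because the gradient is exact (not finite-differenced), each evaluation costs roughly one extra forward pass, which is what makes these models fast enough to drive an MD integrator.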


I agree. I did molecular dynamics simulations for my PhD. In my last year I spent some time exploring feature representations that capture atomic environments up to their symmetry invariants (rotations, reflections, permutations of identical atoms, etc.). The use of machine learning for MD and quantum chemistry has exploded in the last year and a half. I'm a bit sad that I just finished my degree right as this kind of work is heating up.
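One very simple example of such a symmetry-invariant representation is the sorted list of pairwise distances, which is unchanged under rotation, reflection, translation, and permutation of identical atoms. (Real descriptors such as symmetry functions or SOAP are much richer, but the invariance idea is the same.)

```python
import numpy as np

# Sorted pairwise distances: invariant under rotation, reflection,
# translation, and permutation of the atoms.
def descriptor(coords):
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(len(coords), k=1)  # upper triangle: unique pairs
    return np.sort(dists[iu])

rng = np.random.default_rng(1)
coords = rng.normal(size=(4, 3))

# Rotate about z and permute the atoms: the descriptor is unchanged.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
perm = [2, 0, 3, 1]
transformed = (coords @ R.T)[perm]
```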

h-suppressed graphs?

VSEPR graph stuff?

compressive sampling / measuring of typical states?

ligand complexes?

ligand transport?

membrane dynamics?

predictive toxicology?

cheaper hartree-fock approximations?

quark shit???

time to daydream

How do you suppose the questions you've answered working on ion channels would be addressed with machine learning? I get the impression by "swept aside" that you feel the field of molecular biophysics is not meaningful in the study of health and medicine.

The neural network community uses proper "training"/"validation"/"test" datasets to assess model performance, and has developed algorithms to fit complex models to large amounts of data using relatively little computing power. I think it is possible to build completely new and accurate models of biophysical systems with neural networks with relative ease.
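The bookkeeping behind that train/validation/test discipline is small: fit on the training split, tune hyperparameters on the validation split, and report once on the untouched test split. A minimal sketch:

```python
import numpy as np

# Shuffle indices once, then carve out disjoint train/validation/test sets.
# Only the train set is fit on; validation guides tuning; test is touched
# exactly once, for the final performance number.
rng = np.random.default_rng(42)
n = 100
idx = rng.permutation(n)
train, val, test = idx[:60], idx[60:80], idx[80:]
```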

Using models like variational autoencoders it might be possible to draw truly independent samples in a single step, instead of relying on many MD steps to find novel conformations.
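Concretely, once a VAE is trained, a new conformation is drawn by sampling a latent vector z ~ N(0, I) and pushing it through the decoder in one shot, with no chain of MD steps in between. The linear `decode` below is a hypothetical stand-in for a trained decoder network:

```python
import numpy as np

# One-step sampling from a (hypothetical) trained VAE decoder:
# draw z ~ N(0, I), decode to coordinates. A real decoder would be a
# nonlinear network fit to simulation or experimental data.
rng = np.random.default_rng(7)
latent_dim, n_coords = 8, 30                  # e.g. 10 atoms x 3 coordinates
W = rng.normal(size=(latent_dim, n_coords))   # stand-in decoder weights

def decode(z):
    return z @ W                              # trained decoder would be nonlinear

samples = decode(rng.normal(size=(100, latent_dim)))  # 100 independent draws
```

Each row of `samples` is statistically independent of the others, which is exactly the property long MD trajectories struggle to deliver.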

I could go on, but I have to get to bed now :-)

I like the idea, but I wouldn't call it "relative ease". Are we talking about making better force fields, or training a recurrent neural network for generative dynamics that obeys the laws of physics (can be used to calculate ensemble averages) for arbitrary length proteins/folds? How do you construct a VAE for protein dynamics when you only have a single structure? There's a serious lack of data that prevents these things.

Molecular dynamics simulations can be used to answer a range of structural biology questions, but abstractly many of them can be phrased as evaluating the difference in free energy between different conformational states. In molecular dynamics this is done by thermodynamic integration of the energy over the state-space volume for each of the conformational states.
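A toy version of thermodynamic integration makes the recipe concrete: switch a 1D harmonic well from spring constant k0 to k1 via U_lam = 0.5*k(lam)*x^2 with k(lam) = (1-lam)*k0 + lam*k1, average dU/dlam over Boltzmann samples at each lam, and integrate over lam. For this system the exact answer is (kT/2)*ln(k1/k0), so the estimate can be checked.

```python
import numpy as np

# Thermodynamic integration on a 1D harmonic well whose spring constant
# is switched from k0 to k1. dU/dlam = 0.5*(k1 - k0)*x^2; its thermal
# average, integrated over lambda, gives the free-energy difference.
kT, k0, k1 = 1.0, 1.0, 4.0
rng = np.random.default_rng(3)
lams = np.linspace(0.0, 1.0, 21)
means = []
for lam in lams:
    k = (1 - lam) * k0 + lam * k1
    # exact Boltzmann samples for a harmonic well: x ~ N(0, kT/k)
    x = rng.normal(scale=np.sqrt(kT / k), size=200_000)
    means.append(0.5 * (k1 - k0) * np.mean(x ** 2))

means = np.array(means)
# trapezoidal rule over lambda
dF = np.sum(np.diff(lams) * (means[:-1] + means[1:]) / 2)
exact = 0.5 * kT * np.log(k1 / k0)   # analytic free-energy difference
```

In a real simulation the Gaussian sampler is replaced by an MD or Monte Carlo trajectory at each lambda, which is where nearly all of the cost goes.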

An alternative approach is to directly map conformational states to their free energy. This leads to a problem of searching for candidate conformational states (e.g. the folded state, transition states etc.) and scoring them. Usually for a given computational budget there is a trade off between better conformational sampling or higher accuracy energy scoring.

Historically, searching and scoring methods have been designed separately. For example [1] improves sampling while [2] improves energetics. This is done because they historically involved different aspects of the simulation and each is a lot of work. But searching and scoring are not really separable, in that the deeper one samples the more challenging the task of the scoring function becomes--discriminating stable from unstable conformations.

Another application that can be thought of as searching and scoring is the game of GO. My impression is that one of the major breakthroughs with AlphaGo is that they were able to integrate models for searching and scoring together and learn the models simultaneously. It would be awesome if similar architectures could be applied to molecular modeling.

A remaining challenge in applying GO models to molecular biology is that while the representation and scoring rules for GO are fixed and quite simple, the ground truth for molecular simulations comes from heterogeneous experimental data (X-ray crystal structures, small molecule activities, directed evolution antibody screens etc.) and from QM simulations at higher levels of theory, which have their own challenges. However, I think the principles carry over--complicated scoring functions (e.g. free energy) over large state spaces (e.g. protein conformation space or chemical space) can be learned by combining models for searching and scoring. I think deep learning is poised to tackle these problems.

[1] (Conway, et al., 2013, DOI: 10.1002/pro.2389) Relaxation of backbone bond geometry improves protein energy landscape modeling

[2] (Park, 2016, PMID: 27766851) Simultaneous optimization of biomolecular energy function on features from small molecules and macromolecules.

I'm in this field so I'm quite familiar with the search/score situation, but thanks for clearly mapping out the challenge and identifying where you think neural networks will be most beneficial. I just think the particulars make this an immensely different story than GO, and not just the point you describe as the "remaining challenge".

The search space is vast in GO, but it inevitably shrinks over the game, whereas in MD simulations it does not shrink: proteins can fold and unfold. There is a fixed number of possible legal positions to play in GO, but the set of legal moves for a protein conformation fluctuates wildly, is governed by physics (which you would need to relearn), and is likely to be much larger than GO's since it's continuous. In simulations, you care about successive moves, whereas AlphaGo does not care about time-dependent properties (there are also kinetic observables, like folding rates, that seem non-intuitive to evaluate without simulations). Even if you sampled enough conformations on some pathway, perhaps some sort of allosteric change, how would you know how fast it happens? In GO, you always play the same game, but in simulations, you often play different games, i.e., you don't want to be unfolding your protein when you are studying ligand binding. In a similar vein, imagine a single-point mutation that causes protein misfolding. It seems to me that you'd need to retrain your search/score algorithm for each new protein sequence, which doesn't seem like you're saving much time/complexity. There is also a huge problem of scale. We're talking about proteins varying from hundreds to hundreds of thousands of atoms/dihedrals/contacts, not to mention sampling water in the active sites of druggable proteins.

I think it could work in principle, but a physics-based approach sure seems elegant by comparison.

Hi Chris,

You bring up very good issues and perhaps I'm being too optimistic. I definitely agree that there isn't going to be one single mapping of sequence --> energy landscape any time soon or even ever.

But I think there are subproblems that are easier because the search space is more limited[1] or the chemistry is easier (e.g. avoiding chemical reactions or interactions with high energy fields). I think often the major modeling challenge is identifying when it is feasible to take advantage of problem constraints or when lower levels of theory can be used. For example there are a range of "enhanced sampling methods" for molecular dynamics that e.g. constrain the simulation to a reaction coordinate or assume Markov transitions between states so they can be computed on a distributed cluster.
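The Markov-transition assumption mentioned above is the core of Markov state models, and the basic machinery fits in a few lines: discretize a trajectory into states, count transitions, row-normalize to get a transition matrix, and read off equilibrium populations from its leading left eigenvector. The short discrete trajectory below is made up for illustration.

```python
import numpy as np

# Minimal Markov state model: count observed transitions between discrete
# states, normalize rows to get a transition matrix T, and recover the
# stationary (equilibrium) distribution as the left eigenvector of T
# with eigenvalue 1. The trajectory here is a made-up toy example.
traj = [0, 0, 1, 1, 1, 2, 1, 0, 0, 1, 2, 2, 1, 1, 0, 0, 0, 1, 1, 2]
n_states = 3
C = np.zeros((n_states, n_states))
for a, b in zip(traj[:-1], traj[1:]):
    C[a, b] += 1
T = C / C.sum(axis=1, keepdims=True)    # row-stochastic transition matrix

w, v = np.linalg.eig(T.T)               # left eigenvectors of T
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()                      # normalize to a probability vector
```

The appeal for distributed computing is that the counts in `C` can come from many short independent trajectories rather than one long one.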

Taking advantage of these opportunities often requires a fair amount of engineering to build appropriate representations. I wonder to what extent these representations can be learned?

Please, feed me more specifics while I salivate. In as well-defined a problem/mission statement/objective as possible, what are you doing (and with what / how), and where do you intend to reach?

So DE Shaw Research is barking up the wrong tree?

What kind of genomic data out of curiosity?

Is that actually working? I tried rolling my own neural nets and running genomic data through them and instantly realized the problem was low N and high autocorrelation in the data. Bayesian forests seem like a better choice, but that's me.

There have been some recent papers in subsets of genomics with a bit more data (Transcription factor binding, for example [1]). You're correct though, definitely depends on your problem.

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908339/pdf/btw...
