End-to-end differentiable learning of protein structure (biorxiv.org)
91 points by tepal on Feb 18, 2018 | 26 comments

It would be cool if machine learning researchers started participating in CASP and CAPRI. If you crack Go, you get fame, but if you crack protein structure prediction, you get a Nobel Prize and completely revolutionize biochemistry and medicine.

Edit: Why is there no XPRIZE for protein folding?

DeepMind and others are trying. "Hassabis said the company is now planning to apply an algorithm based on AlphaGo Zero to other domains with real-world applications, starting with protein folding." [1]

[1] https://www.bloomberg.com/news/articles/2017-10-18/deepmind-...

That doesn't make any sense, unless I'm missing something. A0 is suited to a completely different problem than protein folding...

The AlphaZero algorithm (Monte Carlo tree search with a value estimator trained by reinforcement learning) works on any environment you can simulate during play time, single player or not.

Any environment with finite action and state spaces.

No, the key requirement which makes it difficult to use on real-world tasks is that you must be able to do a forward rollout of your environment in your decision-making process.
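
To make the "forward rollout" requirement concrete, here is a minimal sketch in Python. It uses plain random rollouts, not AlphaZero's tree search with a learned value estimator, and `CountdownEnv`, `rollout_value`, and `plan` are hypothetical names invented for this illustration. The point is only that the planner has to copy the environment and step the copy forward before committing to an action.

```python
import copy
import random


class CountdownEnv:
    """Hypothetical toy single-player environment: land exactly on 0."""

    def __init__(self, start=7):
        self.value = start

    def legal_actions(self):
        return [1, 2, 3]

    def step(self, action):
        """Subtract `action`; the episode ends at or below 0, reward 1.0 only at exactly 0."""
        self.value -= action
        done = self.value <= 0
        reward = 1.0 if self.value == 0 else 0.0
        return reward, done


def rollout_value(env, rng, max_depth=20):
    """Play random actions on a copy of `env` and return the final reward."""
    sim = copy.deepcopy(env)
    for _ in range(max_depth):
        reward, done = sim.step(rng.choice(sim.legal_actions()))
        if done:
            return reward
    return 0.0


def plan(env, rng, n_rollouts=200):
    """Pick the action whose simulated futures score best on average."""
    best_action, best_score = None, -1.0
    for action in env.legal_actions():
        total = 0.0
        for _ in range(n_rollouts):
            sim = copy.deepcopy(env)  # the real environment is never touched
            reward, done = sim.step(action)
            total += reward if done else rollout_value(sim, rng)
        score = total / n_rollouts
        if score > best_score:
            best_action, best_score = action, score
    return best_action


env = CountdownEnv(start=3)
print(plan(env, random.Random(0)))
```

If `copy.deepcopy(env)` and `sim.step(...)` are not available, or are too expensive or too inaccurate, this style of planning has nothing to work with. That is the difficulty being described for real-world tasks.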

FWIW, AlphaGo-like algorithms have already been applied to this domain; see AlphaChem.

I used to work on protein structure about ten years ago.

Back then, the mood kind of changed from “solve this and you have a Nobel waiting”: the general opinion was that progress was both significant and piecemeal, making it unlikely that a Nobel would be awarded, because “cracking it” would end up being too hard to attribute to any three people.

lol what? I was looking at the list of people who do this, and in fact a lot of them ARE machine learning researchers... including some in my department!

I do think, however, that protein folding is very much understudied in the ML community, relative to, say, the big three of vision, NLP, and speech. The lack of standardized data sets and benchmarks, not to mention the need for domain knowledge, has made it difficult to get into the field.

At the risk of offending the NLP, vision, and speech people, I just think those tasks are 'easier' in a variety of ways.

CASP is a pretty nice dataset, and so is all of the PDB.

The PDB represents the best we have, but I wouldn't call it a great dataset for learning. The 150,000 known structures are a drop in the ocean when it comes to the space of possible sequences/structures.
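
A rough back-of-the-envelope version of "drop in the ocean", as a sketch only. The chain length of 100 is an assumption picked for illustration, and counting every possible sequence as equally relevant overstates the gap, since natural proteins occupy a tiny, highly structured part of sequence space.

```python
AMINO_ACIDS = 20            # standard amino-acid alphabet
KNOWN_STRUCTURES = 150_000  # figure quoted in the comment above
LENGTH = 100                # assumed chain length, for illustration only

# Number of distinct sequences of that length.
possible_sequences = AMINO_ACIDS ** LENGTH

# Upper bound on coverage if every known structure were a distinct
# sequence of exactly this length (it is not; this is deliberately generous).
fraction_covered = KNOWN_STRUCTURES / possible_sequences

exponent = len(str(possible_sequences)) - 1
print(f"possible sequences of length {LENGTH}: on the order of 10^{exponent}")
print(f"fraction covered by known structures: about {fraction_covered:.1e}")
```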

It's happening.

I would guess that there have been many attempts to use ML for protein folding. It's one of the most obvious ways to approach the problem.

I work on protein structure, albeit not from a computational standpoint, and it struck me as odd that none of the work from the Baker group (University of Washington), e.g. Rosetta (https://www.rosettacommons.org/), was mentioned. Rosetta can be used to predict tertiary structure from amino acid sequence. Does anyone familiar with the field know how the methods used by software like Rosetta differ from those presented in this paper?

Hi! I’m the author of the paper. Not sure why you say Rosetta isn’t mentioned? It’s extensively referenced throughout the paper, discussed in the discussion section, and is one of the top 5 CASP servers compared against in the results section.

As for how it differs from what’s described in the paper, that’s the topic of the paper’s introduction. Rosetta uses both fragment assembly and co-evolution methods.

Oops, I seem to have skimmed it a bit too quickly. Thank you for the kind reply.

This is a very interesting approach. Clearly there is a lot more work to do, but robust prediction of protein structure from sequence would be an absolute game changer for biomedical science, so I hope that this opens up new strategies.

What are the real-world applications of protein folding (preferably, some specific example)? I always hear that it's really important for drug design and biotechnology but have a hard time imagining something concrete.

Re drug discovery: often in “rational” drug design, medicinal chemists try to make small molecules that bind snugly into a binding pocket on the protein. Having the structure of the protein aids greatly in that process.

Yes! And I'd also add that there are other applications that come into play...

* Elucidating function by identifying similarity to other known structures
* Finding novel signaling mechanisms (see work on PHinder)
* Modeling co-receptor/ligand dynamics
* Identifying function of orphan receptors
* Working with ancestral genes by identifying descendant structure
* Classifying and clustering proteins based on solved structure
* Learning new biochemical mechanisms through active vs inactive state structures


Cool method! Are you planning to participate in the next CASP? Do you plan to open source the code?

Yes! Certainly on the source code, and hopefully on CASP13 too.

Thanks for the answer! I hope to see you in CASP then (and CAMEO too; it is a great tool to test and refine your method). I was discussing the paper with a co-worker of mine (we also work on PSP, on RBO Aleph). We had a hard time pinpointing the thing that made your method finally work. You mentioned in your blog post that you have been working on it for years now, and I guess a lot of other people had the idea of using deep learning for PSP. But what was the insight that made it all work? Was it using LSTMs, or was it many small refinements and hacks?

I would say the biggest thing is obviously the architecture: coupling LSTMs with the geometric units that spit out the actual 3D structure, which can then be directly optimized via the dRMSD loss function. That's the biggest point of distinction from everything else out there (no contact map prediction, etc.). So it really is about end-to-end differentiability IMO, which hasn't been done before.

As for why it took so long, it is and it is not fine-tuning. Getting RGNs to train _at all_ was a rather difficult process, and required a lot of fiddling around. But since I got them working, I haven't actually spent all that much time fine-tuning them, and so I expect there to be a lot of low-hanging fruit in terms of optimizing performance (starting from the baseline I found).
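
For readers who haven't met dRMSD: it compares two structures through their internal pairwise distances, so no superposition step is needed. Below is a minimal pure-Python sketch of one common definition (root mean square over all atom pairs). It is not the paper's code, the exact normalization used there may differ, and the coordinates and the helper `rotate_z` are made up for the demonstration.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Return the distances between all pairs i < j of 3D points."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def drmsd(predicted, target):
    """Root-mean-square difference between the two sets of pairwise distances."""
    if len(predicted) != len(target):
        raise ValueError("structures must have the same number of atoms")
    d_pred = pairwise_distances(predicted)
    d_true = pairwise_distances(target)
    if not d_pred:
        return 0.0
    squared = sum((p - t) ** 2 for p, t in zip(d_pred, d_true))
    return math.sqrt(squared / len(d_pred))


def rotate_z(coords, angle):
    """Rotate points about the z-axis; used only to demonstrate invariance."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


# Made-up coordinates, roughly spaced like consecutive C-alpha atoms.
target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 1.0)]
moved = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in rotate_z(target, 0.7)]
stretched = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in target]

print(drmsd(moved, target))      # close to 0: a rigid motion keeps all distances
print(drmsd(stretched, target))  # positive: the internal distances differ
```

Every step here is ordinary arithmetic on the coordinates, which is why the same expression, written in an autodiff framework, can pass gradients back to whatever produced the coordinates. One known caveat is that a distance-based measure like this cannot tell a structure from its mirror image.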
