Improved protein structure prediction using potentials from deep learning (nature.com)
71 points by lawrenceyan 39 days ago | 15 comments

Predicting protein structure from a given DNA/RNA sequence has been a field of study for quite a while now. Two primary methodologies have been explored. One is to simulate the actual physical dynamics of a given system at the molecular/atomic level. At places like D.E. Shaw or with Folding@Home, you'll see approaches like this being taken, with relative success. With purely physics-based solutions, though, you quickly run into exponentially growing simulation time scales, as well as a lack of accuracy due to a currently insufficient understanding of molecular mechanics.

The other approach treats the problem purely as a translation problem: ignore simulation of the intermediate steps and go directly from sequence to folded protein target.

With the advent of deep learning and a massive repository of data in the Protein Data Bank (PDB) and similar resources, this approach has become increasingly popular, and in protein structure competitions like CASP it has quickly become the state of the art. DeepMind's recent breakthrough with AlphaFold in the paper above is just another solid step in the right direction.

PDB is not a massive repository. It's a very biased, tiny dataset (~20K structures) and an enormous amount of data cleaning has to be done to do anything related to big-data machine learning on it.

What's far more important is evolutionary data: for example, making alignments of many similar proteins and computing correlated variations across them. Those variations are often the best structural clues, better and cheaper to obtain than protein structures.
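To make "correlated variations" concrete, here is a minimal sketch that scores how strongly two alignment columns vary together using plain mutual information. The toy alignment and the function name are made up for illustration; real pipelines use far deeper alignments and more careful statistics than this, so treat it as a cartoon of the idea rather than anyone's actual method.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Hypothetical toy alignment, one row per related protein:
# columns 0 and 3 change together, column 1 is conserved,
# and column 2 varies independently of column 0.
toy_msa = [
    "AGKD",
    "AGRD",
    "EGKK",
    "EGRK",
]

for i, j in [(0, 3), (0, 2), (0, 1)]:
    print(f"columns {i},{j}: MI = {mutual_information(toy_msa, i, j):.3f} bits")
```

A high score between two columns hints that the two positions constrain each other, which is the kind of signal that can suggest they sit close together in the folded structure.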

I wouldn't really call DM's work a "breakthrough"; other groups were exploring similar ideas. DM executed well (they're a games company and understand the rules of the competition) and had a huge amount of compute, which handles a lot of the challenges of optimizing a process like this.

My summary is pretty generalized, aimed more towards a layman audience, and so I definitely am missing pieces. Co-evolutionary couplings between different protein sequences provide a very rich source of information, and are definitely very important!

For those of you who are curious about what these couplings represent, the basic idea is this: because proteins are a product of evolution, there might be a large amount of conserved structure from one protein to another. Co-evolutionary coupling is just an attempt at quantifying this relationship in a rigorous statistical manner.

I mainly don't want ML folks to suddenly think protein folding is easy because the PDB is a good training set. It's not.

Why frame things in such a pessimistic manner? It seems like it would only be a net benefit to have more people become interested in protein folding as a field of study. Is there really a need for this type of gatekeeping here?

Yes, personally I think there is. I've spent a tremendous amount of my career watching computer scientists misunderstand how to work on protein folding and waste a lot of people's time. Because the concept of protein folding is so unbelievably complex, most CS and ML folks get the basic talk: "nearly all proteins fold reversibly to a global minimum energy structure which is completely defined by the sequence of the protein", which isn't remotely true (basically a weak form of Anfinsen's dogma and Levinthal's paradox). It's easy to explain, and CS and ML people get excited and go off to work on the problem. This led to a lot of publications that focused on rapidly finding heuristics that could sample enough space to find an approximation of the lowest-energy structure. These methods typically failed to make good predictions, although eventually methods like Rosetta did start making good predictions around 15-20 years ago (amusingly, the author of Rosetta told me: "the larger the PDB (training data set) gets, the worse the predictions we make").

But people who spend a long time getting a biological education know why this is true: most proteins don't fold to their energetic minimum, they fold to a collection of kinetically accessible states, rarely finding their true minimum (some small proteins do fold quickly, and we typically can predict their structure). And, many of the physical approximations that are used lead to inaccuracies (for example, some variables are constrained to specific values to save time, but making good predictions requires them to be unconstrained).
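A cartoon of the "kinetically accessible state vs. true minimum" point, as a sketch only: the one-dimensional energy function below is invented for illustration and has nothing to do with a real protein force field. Plain downhill search settles in whichever well is reachable from the starting point, which is not necessarily the deepest one.

```python
def energy(x):
    # Made-up double-well landscape: a deep well near x = -1.06
    # and a shallower well near x = +0.93.
    return x**4 - 2 * x**2 + 0.5 * x

def gradient(x):
    return 4 * x**3 - 4 * x + 0.5

def descend(x, steps=5000, lr=0.01):
    """Follow the energy downhill from x with fixed-step gradient descent."""
    for _ in range(steps):
        x -= lr * gradient(x)
    return x

trapped = descend(1.5)    # starts on the right, settles in the shallow well
deepest = descend(-1.5)   # starts on the left, settles in the deep well

print(f"from x=+1.5: x={trapped:.3f}, energy={energy(trapped):.3f}")
print(f"from x=-1.5: x={deepest:.3f}, energy={energy(deepest):.3f}")
```

The search that starts on the right never sees the deeper well on the left, because getting there would mean going uphill first. That is the toy version of a protein ending up in a state it can reach rather than in its global minimum.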

Some of my work made significant contributions to changing these beliefs, and I'm very thankful for the CS and ML folks who contributed to that, but all of them spent a lot of time learning about proteins before they were useful contributors.

I myself have had to "unlearn" a lot of the early things that were explained to me when I was a layperson, because when you're first learning something, if somebody gives you a simplified view, it can be really hard to move on to the more subtle and nuanced details in the field (for example, many people learn Mendelian genetics and then spend years struggling to understand why most traits don't follow Mendelian statistics).

My goal here is to prevent wasted time on behalf of the experienced contributors in the field. I do appreciate good attempts at explaining the field to laypeople, but want to set the expectations on contributing accurately.

So what you're saying is: https://xkcd.com/1831/, except that CS/ML practitioners have a negative impact by trying to contribute without understanding the nuance. I think the next logical question is: how many years of education should you have in order to contribute? 10 years? We'll all be killed by a virus by then :)

Hah, I forgot that one. Part of the negative impact is the time spent explaining stuff (like Brooks's Mythical Man-Month). Another negative impact is that ML folks have gotten really good at hype: a paper with a slick web page, a press release, etc., but the results don't stand up to the claims.

I generally recommend PhD-level study in biology (that's 7 years on top of undergrad) but I think a really smart person could learn most of what's required in 2-3 years if they are in a good lab.

No, we will not all be killed by a virus in the next 10 years; that's just media alarmism. Remember, even if coronavirus becomes a worldwide pandemic, some fraction of people will survive who will be genetically immune. We're much more at risk of wars, climate change, and driving cars.

So, is there a (collection of) book that would save everyone’s time? Asking for a friend.

Personally, I prefer the classic textbook approach, so I recommend Principles of Biochemistry (Lehninger), Biological Sequence Analysis (Durbin, Eddy, Krogh, Mitchison), which is sadly pretty dated now, a general text such as Biology (Campbell), and finally, if you really want to dive down the rabbit hole of a complex biological problem with huge health applications, The Biology of Cancer (Weinberg).

I've had this argument with folks before and some people seem fine learning in other ways, but I really prefer the textbook approach, especially textbooks which are basically just summaries of the current understanding of the field, with direct links to the detailed review articles.

How many authors on the DeepMind paper had biology PhDs? Are they really just gaming things in an unfair way?

The CEO of DeepMind is an author on the paper; his PhD is in biology (but a totally different field, cognitive neuroscience). The rest of the authors include all the ingredients you'd expect from a modern successful quantitative scientific collaboration: a university professor of Bioinformatics with extensive prior knowledge of computer-aided protein folding (http://www0.cs.ucl.ac.uk/staff/D.Jones/), several postdoc or post-postdoc level bio/protein experts with knowledge of physical simulation (the method they used ultimately works as distance and angle constraints on the protein structure), as well as a bunch of world-class machine learning/computer science folks.
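For readers wondering what "distance constraints on the protein structure" can look like in practice, here is a rough sketch. It is an assumption-heavy toy, not the paper's method: it uses only pairwise distances (no angles), a simple quadratic penalty, four points instead of a protein, and plain gradient descent. All names and numbers below are invented for illustration.

```python
import math

def restraint_energy(coords, targets):
    """Sum of squared deviations between actual and target pairwise distances."""
    return sum(
        (math.dist(coords[i], coords[j]) - d0) ** 2
        for (i, j), d0 in targets.items()
    )

def minimize(coords, targets, steps=2000, lr=0.01):
    """Move the points downhill on the restraint energy; returns new coordinates."""
    coords = [list(p) for p in coords]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for (i, j), d0 in targets.items():
            d = math.dist(coords[i], coords[j])
            if d == 0.0:
                continue  # direction undefined for coincident points
            scale = 2.0 * (d - d0) / d
            for k in range(3):
                g = scale * (coords[i][k] - coords[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        for point, grad in zip(coords, grads):
            for k in range(3):
                point[k] -= lr * grad[k]
    return coords

# Hypothetical "true" structure, used only to generate target distances.
reference = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
targets = {
    (i, j): math.dist(reference[i], reference[j])
    for i in range(4) for j in range(i + 1, 4)
}

# Arbitrary starting guess that does not satisfy the targets.
start = [(0.0, 0.0, 0.0), (0.5, 0.1, 0.0), (0.2, 0.7, 0.3), (0.9, 0.4, 0.8)]
folded = minimize(start, targets)

print("energy before:", restraint_energy(start, targets))
print("energy after: ", restraint_energy(folded, targets))
```

In this picture a predictor supplies the target distances and the optimizer does the rest. Note that distances alone cannot tell a structure from its mirror image, which is one reason angle information is useful.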

They're not gaming things. DeepMind is good at games, and CASP is a competition, but everybody who does well at CASP is already doing the same sorts of things that DeepMind did to score well. And they really did come up with a good system that was demonstrably better (I want to give them credit, I just don't think 'breakthrough' is really correct). But one thing I know about CASP (I competed one year) is that after 2 years, whatever the previous winning team did is duplicated by the other top teams, and 2 years after that, everybody can do it.

I think ML is making protein folding competitions like CASP move faster now, because you can put your code, training data generator (much of the hard work in protein folding is coming up with good training data), a materialized copy of the training data the generator generates, and a trained model checkpoint on GitHub, so after 2 years, everybody will be able to do what DM did at the previous competition. I think this has been one of the really important improvements in the last few years in protein folding: the computational infrastructure (the training data, the systems to train on, and the tools to do training) has gotten much better, and lots of people have gotten good at using it. That's a really promising sign and I hope it takes over more quantitative science.

Paywalled, can anyone with access to the text expand upon the ML algorithm? Specifically, is AlphaFold based on AlphaGo MCTS? Or is the name similarity incidental?

Literally no similarity; it's an IBM Watson-like branding move.

Not paywalled, try this link: https://rdcu.be/b0mtx

Just a naming similarity; the common part is a pun that is a popular Google naming scheme, e.g. alpha-bet.
