
Improved protein structure prediction using potentials from deep learning - lawrenceyan
https://www.nature.com/articles/s41586-019-1923-7
======
lawrenceyan
Predicting protein structure from a given DNA/RNA sequence has been a field
of study for quite a while now. Two primary methodologies have been explored.
One is to simulate the actual physical dynamics of a given system at a
molecular/atomic level. Places like D.E. Shaw and Folding@Home take this
approach, with relative success. With purely physics-based solutions, though,
you quickly run into exponentially growing simulation time scales, as well as
a lack of accuracy due to a currently insufficient understanding of molecular
mechanics.

The other approach has been to take the problem and look at it purely as a
translation problem, ignoring simulation of intermediary steps, to go directly
from sequence to folded protein target.

With the advent of deep learning and a massive repository of data from the
Protein Data Bank (PDB) and similar resources, this approach has become
increasingly popular, and in protein structure competitions like CASP it has
quickly become state of the art within the field. DeepMind's recent
breakthrough with AlphaFold in the paper above is just another solid step in
the right direction.

~~~
dekhn
PDB is not a massive repository. It's a very biased, tiny dataset (~20K
structures) and an enormous amount of data cleaning has to be done to do
anything related to big-data machine learning on it.

What's far more important is evolutionary data: for example, making alignments
of many similar proteins and computing correlated variations across them.
Those variations are often the best structural clues, better and cheaper to
obtain than protein structures.

I wouldn't really call DM's work a "breakthrough"; other groups were exploring
similar ideas. DM executed well (they're a games company and understand the
rules of the competition) and had a huge amount of compute resources, which
handles a lot of the challenges of optimizing a process like this.

~~~
lawrenceyan
My summary is pretty generalized, aimed more towards a layman audience, and so
I definitely am missing pieces. Co-evolutionary couplings between different
protein sequences provide a very rich source of information, and are
definitely very important!

For those of you who are curious what these couplings represent, the basic
idea is that, since proteins are a product of evolution, there might be a
large amount of conserved structure from one protein to another, so residues
that interact tend to vary together across related sequences. Co-evolutionary
coupling is just an attempt at quantifying this relationship in a rigorous
statistical manner.
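
As a rough illustration of what quantifying correlated variation across an
alignment can look like, here is a minimal Python sketch that scores pairs of
alignment columns by mutual information. The toy alignment and the function
name are made up for this example; real pipelines use far larger alignments
and more careful statistics (for instance, corrections for phylogenetic bias),
and this is not the method used in the AlphaFold paper.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `msa` is a list of equal-length aligned sequences (hypothetical input).
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)            # residue counts, column i
    count_j = Counter(seq[j] for seq in msa)            # residue counts, column j
    count_ij = Counter((seq[i], seq[j]) for seq in msa)  # joint counts
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Toy alignment (made up): column 0 is fully conserved, columns 1 and 3
# always change together (K with E, R with D), column 2 varies independently.
toy_msa = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "AKIE",
    "ARID",
]

for i, j in [(1, 3), (1, 2), (0, 1)]:
    print(f"columns {i},{j}: MI = {mutual_information(toy_msa, i, j):.3f} bits")
```

A high score for a pair of columns is only a hint that the two positions may
be in contact; a fully conserved column carries no co-variation signal at all.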

~~~
dekhn
I mainly don't want ML folks to suddenly think protein folding is easy because
the PDB is a good training set. It's not.

~~~
lawrenceyan
Why frame things in such a pessimistic manner? It seems like it would only be
a net benefit to have more people become interested in protein folding as a
field of study. Is there really a need for this type of gatekeeping here?

~~~
dekhn
Yes, personally I think there is. I've spent a tremendous amount of my career
watching computer scientists misunderstand how to work on protein folding and
waste a lot of people's time. Because the concept of protein folding is so
unbelievably complex, most CS and ML folks get the basic talk: "nearly all
proteins fold reversibly to a global minimum energy structure which is
completely defined by the sequence of the protein", which isn't remotely true
(basically a weak form of Anfinsen's dogma and Levinthal's paradox). It's easy
to explain, and CS and ML people get excited and go off to work on the
problem. This led to a lot of publications that focused on rapidly finding
heuristics that could sample enough space to find an approximation of the
lowest energy structure. These methods typically failed to make good
predictions, although eventually methods like Rosetta did start making good
predictions around 15-20 years ago (amusingly, the author of Rosetta told me:
"the larger the PDB (training data set) gets, the worse the predictions we
make").

But people who spend a long time getting a biological education know why this
is true: most proteins don't fold to their energetic minimum, they fold to a
collection of kinetically accessible states, rarely finding their true minimum
(some small proteins do fold quickly, and we typically can predict their
structure). And, many of the physical approximations that are used lead to
inaccuracies (for example, some variables are constrained to specific values
to save time, but making good predictions requires them to be unconstrained).
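
A toy picture of the difference between a kinetically accessible state and
the true minimum: on a made-up one-dimensional double-well "energy" (nothing
like a real force field), a simple downhill search ends up in whichever well
it starts closest to. The function, step size, and starting points below are
assumptions chosen only for illustration.

```python
def energy(x):
    # Hypothetical tilted double well: a shallow minimum near x = +0.96
    # and a deeper (global) minimum near x = -1.04.
    return (x * x - 1.0) ** 2 + 0.3 * x

def gradient(x):
    # Analytic derivative of energy(x).
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x, step=0.01, iterations=5000):
    # Plain gradient descent: always moves downhill, never over a barrier.
    for _ in range(iterations):
        x -= step * gradient(x)
    return x

trapped = descend(1.5)    # starts on the right, settles in the shallow well
relaxed = descend(-1.5)   # starts on the left, reaches the deeper well
print(f"trapped: x={trapped:.3f}, E={energy(trapped):.3f}")
print(f"relaxed: x={relaxed:.3f}, E={energy(relaxed):.3f}")
```

The run that starts on the right never finds the lower-energy well, even
though it exists; which state you reach depends on where you start and what
paths are accessible, not only on which state is lowest.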

Some of my work made significant contributions to changing these beliefs, and
I'm very thankful for the CS and ML folks who contributed to that, but all of
them spent a lot of time learning about proteins before they were useful
contributors.

I myself have had to "unlearn" a lot of the early things that were explained
to me when I was a layperson, because when you're first learning something, if
somebody gives you a simplified view, it can be really hard to move on to the
more subtle and nuanced details in the field (for example, many people learn
Mendelian genetics and then spend years struggling to understand why most
traits don't follow Mendelian statistics).

My goal here is to prevent wasted time on behalf of the experienced
contributors in the field. I do appreciate good attempts at explaining the
field to laypeople, but want to set the expectations on contributing
accurately.

~~~
cing
So what you're saying is: [https://xkcd.com/1831/](https://xkcd.com/1831/),
except that CS/ML practitioners have a negative impact by trying to contribute
without understanding the nuance. I think the next logical question is: how
many years of education should you have in order to contribute? 10 years?
We'll all be killed by a virus by then :)

~~~
dekhn
Hah, I forgot that one. Part of the negative impact is the time spent
_explaining_ stuff (like Brooks's Mythical Man-Month). Another negative impact
is that ML folks have gotten really good at hype: a paper with a slick web
page, press release, etc., but results that don't stand up to the claims.

I generally recommend PhD-level study in biology (that's 7 years on top of
undergrad) but I think a really smart person could learn most of what's
required in 2-3 years if they are in a good lab.

No, we will not all be killed by a virus in the next 10 years; that's just
media alarmism. Remember, even if coronavirus becomes a worldwide pandemic,
some fraction of people who are genetically immune will survive. We're much
more at risk from wars, climate change, and driving cars.

~~~
mynegation
So, is there a book (or collection of books) that would save everyone’s time?
Asking for a friend.

~~~
dekhn
Personally, I prefer the classic textbook approach, so I recommend Principles
of Biochemistry (Lehninger); Biological Sequence Analysis (Durbin, Eddy,
Krogh, Mitchison), which is sadly pretty dated now; a general biology text,
Biology (Campbell); and finally, if you really want to dive down the rabbit
hole of a complex biological problem with huge health applications, The
Biology of Cancer (Weinberg).

I've had this argument with folks before and some people seem fine learning in
other ways, but I really prefer the textbook approach, especially textbooks
which are basically just summaries of the current understanding of the field,
with direct links to the detailed review articles.

------
allovernow
Paywalled. Can anyone with access to the text expand upon the ML algorithm?
Specifically, is AlphaFold based on AlphaGo MCTS? Or is the name similarity
incidental?

~~~
deepnotderp
Literally no similarity; it's an IBM Watson-like branding move.

