
A watershed moment for protein structure prediction - ColinWright
https://www.nature.com/articles/d41586-019-03951-0
======
DrScientist
As the summary says, the basic approach ( evolutionary data -> positional
couplings -> distance restraints -> structure solver ) is actually quite old,
with the key paper back in 2011.

In fact, the distance-constraints-to-3D-structure part is very old - I was
calculating structures from experimentally determined distances 30 years ago.
You need a surprisingly small number of fairly weak distance constraints
( e.g. these atoms are between 3 and 5 Angstroms ) to determine a 3D
structure, provided a decent number of them are long range.
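The constraints-to-structure step above can be sketched as a toy distance-geometry solver. Everything here is illustrative - a handful of made-up points, loose NMR-style bounds, and a crude greedy search - not any production refinement code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" structure: 20 points in 3D standing in for atoms.
true_xyz = rng.normal(size=(20, 3))

# A sparse set of loose restraints, NMR-style: "these two atoms are
# between lo and hi apart", for only a subset of all pairs.
pairs = [(i, j) for i in range(20) for j in range(i + 1, 20)]
chosen = rng.permutation(len(pairs))[:120]
restraints = []
for t in chosen:
    i, j = pairs[t]
    d = np.linalg.norm(true_xyz[i] - true_xyz[j])
    restraints.append((i, j, d - 0.2, d + 0.2))  # loose bounds around truth

def violation(xyz):
    """Total amount by which coordinates violate the distance bounds."""
    v = 0.0
    for i, j, lo, hi in restraints:
        d = np.linalg.norm(xyz[i] - xyz[j])
        v += max(0.0, lo - d) + max(0.0, d - hi)
    return v

# Crude solver: greedy single-point moves that never increase the violation.
xyz = rng.normal(size=(20, 3))
v_start = v_cur = violation(xyz)
for _ in range(5000):
    k = rng.integers(20)
    trial = xyz.copy()
    trial[k] += rng.normal(scale=0.05, size=3)
    v_trial = violation(trial)
    if v_trial <= v_cur:
        xyz, v_cur = trial, v_trial

print(f"violation: {v_start:.2f} -> {v_cur:.2f}")
```

With enough long-range restraints the violated bounds pin down the overall shape; the toy solver just drives the total violation toward zero.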

What they have done is execute better.

However, the problem they are working on,

sequence -> structure

though a long-term 'holy grail', is practically not that useful!

The models typically aren't quite good enough, the approach doesn't predict
interactions, and experimental methods for determining structures have also
moved on in leaps and bounds.

As the article briefly mentions, what you really want to do is go the other
way:

Designed novel structure -> protein sequence to make it.

One way to do that, if you have a function going the other way ( like
AlphaFold - ignoring limitations for now, e.g. does a knowledge-based approach
work well for completely novel folds? ), is some sort of heuristic search.
However, the search space is huge and a step size of hours isn't going to cut
it.
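The heuristic search over sequence space can be sketched with a toy stand-in for the predictor. `fold_score` below is a hypothetical placeholder - in a real loop this call would be something like AlphaFold, and its hours-per-evaluation cost is exactly the problem:

```python
import random

random.seed(1)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Stand-in for "predict the structure, score the match to the design goal".
# A real loop would call a slow structure predictor here.
def fold_score(seq, target):
    """Toy score: higher means the sequence 'folds' closer to the target."""
    return sum(a == b for a, b in zip(seq, target))

target = "".join(random.choices(AMINO_ACIDS, k=50))  # hypothetical design goal

# Greedy hill climb over sequences: mutate one residue, keep improvements.
seq = "".join(random.choices(AMINO_ACIDS, k=50))
best = fold_score(seq, target)
for _ in range(5000):
    pos = random.randrange(50)
    cand = seq[:pos] + random.choice(AMINO_ACIDS) + seq[pos + 1:]
    score = fold_score(cand, target)
    if score > best:
        seq, best = cand, score

print(f"best score: {best}/50")
```

Even this trivial 50-residue toy needs thousands of evaluations; at hours per evaluation for a real predictor, the arithmetic stops working.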

~~~
thaumasiotes
> However the problem they are working on

> sequence -> structure

I recall reading about a result a couple of years ago supposedly demonstrating
that "synonymous" DNA codons were in fact not synonymous, because the ribosome
took systematically different amounts of time to process them, and the
difference in construction time resulted in different folding for the protein.

This would imply that the problem "sequence -> structure" is not well defined,
at least if the sequence in question is the sequence of peptides making up the
protein and not the sequence of codons making up the gene that codes for the
protein.

Do you know anything about this? Am I just making it up?

~~~
dekhn
I think it's much more likely that the different codon usage leads to
different rates of synthesis, not differently folded proteins. But that's a
complex area. Many proteins do not fold to their native structure
spontaneously, and there are other proteins that refold them "correctly".

It's not clear to me that you could experimentally demonstrate substantially
different folding due to codon usage in a way that supports a general
statement about all proteins.

~~~
thaumasiotes
> Many proteins do not fold to their native structure spontaneously, and there
> are other proteins that refold them "correctly".

This would also seem to imply that "sequence -> structure" isn't quite the
right problem.

~~~
dekhn
Well, not sure what you mean by "sequence -> structure". Historically, people
have used the fact that small globular proteins refold spontaneously and
rapidly to their native state to support the idea that there is a single,
unique structure encoded by a specific sequence. That's a helpful if
ultimately limited view (as we observe many proteins that don't fold rapidly
to a single native structure).

That's a reason that evolution-based methods, which use statistics about
families of related proteins to estimate distances between pairs of amino
acids (in 3D space), are more effective - many times in biology we can use
evolutionary relationships between proteins to infer things that would be hard
to determine through experiments or rigorous, thorough simulations.
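A minimal sketch of that statistical idea, using mutual information between alignment columns as a co-evolution signal. The tiny alignment below is made up, and real methods use far larger families plus careful corrections for phylogeny and sampling:

```python
import math
from collections import Counter

# Toy multiple sequence alignment: columns 0 and 3 co-vary perfectly
# (as if those positions were in contact and mutated together), while
# the other columns vary independently.
msa = [
    "AKLDE", "AKVDE", "CKLGE", "AKMDE",
    "CRLGE", "ARVDF", "CKIGE", "AKLDE",
]

def column(i):
    return [row[i] for row in msa]

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(i, j):
    """MI between alignment columns i and j; high MI suggests coupling."""
    joint = list(zip(column(i), column(j)))
    return entropy(column(i)) + entropy(column(j)) - entropy(joint)

# The co-varying pair stands out against an independent pair.
print(mutual_information(0, 3), mutual_information(1, 2))
```

Pairs of columns with high coupling scores are the candidates for spatial contacts that feed the distance restraints.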

But it's important to appreciate that a large number of proteins don't fold
rapidly to a single unique structure - there are many ways this can happen,
and many biologically relevant behaviors depend on these properties. The tools
from CASP are much less useful for proteins that violate the assumptions of
Anfinsen's dogma; although the evolutionary data is helpful there too, it can
often be a lot more challenging to deconvolute the signal.

Ultimately, what is "the right problem"? The one that makes the most money?
Produces the most "useful" scientific result? Is accessible with today's
technology? For now, there's plenty of value in these sorts of competitions.

Personally I think the "right problem" is: "given a collection of diseases,
use experimentally derived data and clever math to discover biological
treatments that reduce the total suffering from those diseases, subject to
monetary and ethical constraints". That's what pharma attempts to do, although
not particularly well. Others might say simply solving interesting problems
like protein folding is inherently valuable as the right problem.

------
choeger
I am not an ML engineer (except when I program in ML, of course ;)). But this
sounds a lot like the following:

1. We had a model that worked in principle, but the search space was
practically infeasible.

2. We made an observation that a different model might exist that makes the
search space irrelevant.

3. We threw ML at it.

4. Now we _might_ have a model that fulfills (2) but we cannot be sure
because we used a black-box approach.

5. Somehow the results are exciting. Better results would be _really_
exciting.

6. We hope that more data yields these better results.

Is that correct? Am I the only one to lament these black-box approaches?
Should there not be a bunch of people now studying the learned models to
figure out if much better results can actually exist?

~~~
ArtWomb
>>> Am I the only one to lament these black-box approaches?

Far from it. Prediction looks like one tool in the arsenal for better
understanding. One still has to correlate the structure with the complex
interactions in vivo. Even using AI in classification mode - where we can
segment a large atlas of tumor cells and identify a dozen or so classes of
cell anomalies - may lead to faster breakthroughs in immunotherapy.

What I am trying to wrap my head around is the synthesis problem. Say
AlphaFold generates a promising candidate. One that does not exist naturally.
You still need the DNA or mRNA transcription sequence to synthesize the
protein, right? Won't some candidates simply be too complex and unstable to
reliably produce using existing mammalian or baculovirus platforms?

~~~
ovi256
> Won't some candidates simply be too complex and unstable to reliably produce
using existing mammalian or baculovirus platforms?

You can add that to the objective function your training procedure is
optimizing, ensuring model output is not "too complex or unstable to reliably
produce".
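A hedged sketch of that idea - a design loss with a producibility penalty folded in. `structure_score` and `instability` are hypothetical stand-ins for "how well does the candidate match the designed fold" and "how hard is it to express", not anything from AlphaFold's actual code:

```python
# lam trades off structural fit against producibility; both scoring
# functions here are hypothetical placeholders.
def design_loss(candidate, structure_score, instability, lam=0.5):
    """Lower is better: structural mismatch plus weighted producibility cost."""
    return (1.0 - structure_score(candidate)) + lam * instability(candidate)

# Toy comparison: equal structural fit, different expression difficulty.
score = lambda c: 0.9
easy = design_loss("seq_a", score, lambda c: 0.1)
hard = design_loss("seq_b", score, lambda c: 0.8)
print(easy, hard)
```

With the penalty in the objective, an optimizer is steered away from candidates that would be impractical to synthesize, even when they fit the target fold equally well.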

------
cs702
For background, read this fantastic blog post by the same author, Mohammed
AlQuraishi at Harvard Medical School, from a year ago:

https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/

~~~
wpasc
Thanks for posting, it was a great read.

I especially enjoy the segments where he directly addresses "an indictment of
academic science" and "an indictment of pharma". He pulls no punches in saying
how embarrassing it is for pharma and academia to be so thoroughly outclassed
by DeepMind.

A great quote:

"If you think I’m being overly dramatic, consider this counterfactual
scenario. Take a problem proximal to tech companies’ bottom line, e.g. image
recognition or speech, and imagine that no tech company was investing research
money into the problem. (IBM alone has been working on speech for decades.)
Then imagine that a pharmaceutical company suddenly enters ImageNet and blows
the competition out of the water, leaving the academics scratching their heads
at what just happened and the tech companies almost unaware it even happened."

~~~
dekhn
Nobody is embarrassed here. Pharma doesn't work on protein folding prediction.
Now they can take the published results and code and use them, but protein
fold prediction has not been, is not, and probably never will be the
rate-limiting step in novel drug discovery and development.

------
RocketSyntax
"The resulting algorithm outperformed all entrants at the most recent blind
assessment of methods used to predict protein structures, generating the best
structure for 25 out of 43 proteins, compared with 3 out of 43 for the next-
best method."

~~~
KKKKkkkk1
This is remarkable. Teams of researchers all over the world have taken part in
the CASP competitions for decades. Many attempts using machine learning and
ANNs have been made. What is it about DeepMind that allowed them to make such
a breakthrough? Do they have expertise in deep learning that does not exist in
academia? Incredible amounts of compute that academia cannot afford?

~~~
dekhn
The techniques DM used are popular in academia right now, too. Using
evolutionary data to shortcut hard problems has been key to advancement in
protein research for decades. DM just executed better - a combination of smart
people, some good ideas, and lots of experimentation. Never underestimate the
ability of a company that exists to win games to win competitions.

~~~
natechols
And never underestimate the amount of money that a big tech company can throw
at a random problem. DeepMind probably blew through the equivalent of multiple
R01 grants writing that paper.

~~~
robocat
Big biotech can throw big amounts too.

And I read that the size of the team was 10 people - that's not a big number.

The compute power applied was not why they had this outcome.

~~~
natechols
If their salaries are anything like what Bay Area companies are shelling out
for top AI engineers, each one of those 10 people is probably costing as much
as 10 grad students in any of the other labs working on this problem. Big
Biotech does not usually have the money to get into a bidding war for
engineering talent with companies like Google.

~~~
robocat
"There are dozens of academic groups, with researchers likely numbering in the
(low) hundreds, working on protein structure prediction. We have been working
on this problem for decades, with vast expertise built up on both sides of the
Atlantic and Pacific, and not insignificant computational resources when
measured collectively. For DeepMind’s group of ~10 researchers, with primarily
(but certainly not exclusively) ML expertise, to so thoroughly route everyone
surely demonstrates the structural inefficiency of academic science."

"What is worse than academic groups getting scooped by DeepMind? The fact that
the collective powers of Novartis, Pfizer, etc, with their hundreds of
thousands (~million?) of employees, let an industrial lab that is a complete
outsider to the field, with virtually no prior molecular sciences experience,
come in and thoroughly beat them on a problem that is, quite frankly, of far
greater importance to pharmaceuticals than it is to Alphabet. It is an
indictment of the laughable “basic research” groups of these companies, which
pay lip service to fundamental science but focus myopically on target-driven
research that they managed to so badly embarrass themselves in this episode."

From: https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/

~~~
natechols
I completely disagree with his interpretation. It would be surprising if a
group that concentrates some of the top expertise in AI weren't able to make a
big impact on a well-defined optimization problem that has been studied for
decades.

I think a lot of the commentary is missing two essential points:

1. Protein structure prediction is to a large extent a solved problem _for
small-ish, soluble targets_. AlphaFold is a significant improvement on the
current state of the art, but the state of the art was already far enough
along that the best computational models in 2007 were good enough to bootstrap
experimental structure determination
([https://www.ncbi.nlm.nih.gov/pubmed/17934447](https://www.ncbi.nlm.nih.gov/pubmed/17934447)).
In other words, it's not like the entire academic community was stumbling
around helplessly in the dark.

2. The value of these predictions to pharmaceutical companies is extremely
marginal. Having a high-accuracy model is very helpful but it's rare that the
researchers have so little information available that a completely de-novo
prediction is necessary. And when they really don't have much information at
all, it's usually because the target is sufficiently messy to defy traditional
structure determination methods - which means it's almost certainly more than
AlphaFold can handle too.

------
suhaildawood
For those unaware of cryo-EM (cryogenic electron microscopy), I highly
recommend reading into it. A structural biology renaissance is upon us.

------
nl
Code and neural network weights for AlphaFold:
https://github.com/deepmind/deepmind-research/tree/master/alphafold_casp13

------
madengr
I had the protein folding project running on my computer for a few years. Can
these deep learning models be distributed like that, or do they require
tightly coupled processors? It seems the latter, as there was a recent IEEE
article on a wafer-scale array of CPUs for deep learning.

~~~
mkagenius
At least those (obsolete?) bitcoin miners can be put to this use.

~~~
derision
If you're referring to ASICs, not at all. Those things can only compute
hashes.
~~~
OJFord
(By design, because that's the 'A' (application) in 'ASIC' (Application-
Specific Integrated Circuit).)

------
s_dev
Will this have any impact on the Folding@Home efforts?

~~~
jcoffland
Folding@home is working on a different problem. Folding@home finds the path a
protein takes to arrive at its final structure, not just the final structure.
One of the main goals is to understand protein misfolding or where on the path
things go astray to result in disease.

After all these years, we are still at it. New methods are regularly evaluated
and the simulation software is being refined all the time.

------
synthmeat
Is there, or will there be, an AlphaFold CASP14 entry too?

