The other approach has been to treat the problem purely as a translation problem: skip simulation of the intermediary steps and go directly from sequence to folded protein structure.
With the advent of deep learning and the massive repository of data in the Protein Data Bank (PDB) and similar databases, this approach has become increasingly popular, and in protein structure competitions like CASP it has quickly become the state of the art. DeepMind's recent result with AlphaFold in the paper above is another solid step in that direction.
What's far more important is evolutionary data: for example, making alignments of many similar proteins and computing correlated variations across them. Those variations are often the best structural clues, and they're far cheaper to obtain than experimentally determined protein structures.
I wouldn't really call DM's work a "breakthrough"; other groups were exploring similar ideas. DM executed well (they're a games company and understand the rules of the competition) and had a huge amount of compute resources, which handles a lot of the challenges of optimizing a process like this.
For those of you who are curious about what these couplings represent: the basic idea is that, since proteins are a product of evolution, there is likely to be a large amount of conserved structure from one protein to another. Co-evolutionary coupling is just an attempt to quantify this relationship in a rigorous statistical manner.
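To make that concrete, here is a minimal sketch of the idea using mutual information between columns of a multiple sequence alignment. This is a deliberately simplified stand-in for the direct-coupling / Potts-model methods actually used in the field (which correct for phylogenetic bias and indirect correlations); the `column_mi` function and the toy alignment are illustrative, not from any real pipeline.

```python
from collections import Counter
from itertools import combinations
from math import log2

def column_mi(msa, i, j):
    """Mutual information between alignment columns i and j.

    High MI means the residues at the two positions co-vary across
    homologs, which can hint that they are in contact in the fold.
    """
    n = len(msa)
    pi = Counter(seq[i] for seq in msa)   # marginal counts, column i
    pj = Counter(seq[j] for seq in msa)   # marginal counts, column j
    pij = Counter((seq[i], seq[j]) for seq in msa)  # joint counts
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Toy alignment: columns 0 and 2 co-vary perfectly (A<->T), column 1 is noise.
msa = ["AGT", "ACT", "TGA", "TCA", "AGT", "TCA"]
scores = {(i, j): column_mi(msa, i, j) for i, j in combinations(range(3), 2)}
best = max(scores, key=scores.get)
print(best)  # (0, 2): the perfectly coupled pair scores highest
```

Real methods work the same way in spirit, just on alignments of thousands of sequences and with much better statistics.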
But people who spend a long time getting a biological education know why this is true: most proteins don't fold to their energetic minimum; they fold to a collection of kinetically accessible states, rarely finding their true minimum (some small proteins do fold quickly, and we typically can predict their structures). And many of the physical approximations used introduce inaccuracies (for example, some variables are constrained to specific values to save time, but making good predictions requires leaving them unconstrained).
Some of my work made significant contributions to changing these beliefs, and I'm very thankful for the CS and ML folks who contributed to that, but all of them spent a lot of time learning about proteins before they were useful contributors.
I myself have had to "unlearn" a lot of the early things that were explained to me when I was a layperson, because when you're first learning something, a simplified view can make it really hard to move on to the more subtle and nuanced details of the field (for example, many people learn Mendelian genetics and then spend years struggling to understand why most traits don't follow Mendelian statistics).
My goal here is to prevent wasted time on the part of experienced contributors in the field. I do appreciate good attempts at explaining the field to laypeople, but I want to set expectations about contributing accurately.
I generally recommend PhD-level study in biology (that's 7 years on top of undergrad) but I think a really smart person could learn most of what's required in 2-3 years if they are in a good lab.
No, we will not all be killed by a virus in the next 10 years; that's just media alarmism. Remember, even if coronavirus becomes a worldwide pandemic, some fraction of people who are genetically immune will survive. We're at much greater risk from wars, climate change, and driving cars.
I've had this argument with folks before, and some people seem fine learning in other ways, but I really prefer the textbook approach, especially textbooks that are basically summaries of the field's current understanding, with direct links to the detailed review articles.
They're not gaming things. DeepMind is good at games, and CASP is a competition, but everybody who does well at CASP is already doing the same sorts of things that DeepMind did to score well. And they really did come up with a good system that was demonstrably better (I want to give them credit, I just don't think 'breakthrough' is really correct). But one thing I know about CASP (I competed one year) is that after 2 years, whatever the previous winning team did is duplicated by the other top teams, and 2 years after that, everybody can do it.
I think ML is making the cycle in protein folding competitions like CASP faster, because you can put your code, your training data generator (much of the hard work in protein folding is coming up with good training data), a materialized copy of the generated training data, and a trained model checkpoint on GitHub, so after 2 years everybody will be able to do what DM did at the previous competition. I think this has been one of the really important improvements in protein folding over the last few years: the computational infrastructure (the training data, the systems to train on, and the tools for training) has all gotten much better, and lots of people have gotten good at using it. That's a really promising sign, and I hope it takes over more of quantitative science.
Just a naming similarity: the common part is a pun on a popular Google naming scheme, e.g. Alpha-bet.