AlphaFold's database grows over 200x to cover nearly all known proteins (wandb.ai)
152 points by OnlineInference on Aug 1, 2022 | 50 comments

Much discussion a few days ago: https://news.ycombinator.com/item?id=32262856

The top comment there (from a structural biologist) is worth reading. Here's my opinion, as a computer scientist who worked in this area.

A protein sequence is analogous to a computer program, but the "machine" is a mostly-water solution, and the instructions are interpreted by summing up all the intermolecular forces at play as the sequence is squirted out of a little extruder as a string (as in the stuff your clothes are made of, not text). Bits of the string repel and attract each other and it globs up in some biologically useful structure. The folding problem is the problem of predicting that structure from the string.

Unlike the halting problem, there is no way to generate an execution trace saying a particular glob would be formed in reality. In fact, there is no way to perform a polynomial time check of the result, so we've already escaped the land of P and NP.

Also, things like temperature and proximity to other proteins mean that there might not be a unique fold for a given sequence. Therefore, like the halting problem, we have unknown inputs, and we need to figure out which states an arbitrary program can reach.

When someone claims to have "solved" folding, you should be as skeptical as you would be if someone claimed to have solved the halting problem for arbitrary machine code, and that they don't need any extra information about the machines that run that code. Although their program runs on conventional computers, it also works on programs written for quantum computers.

(Edit: That's not to say this work isn't useful, or that this press release overclaims. I've been hearing some pretty wild claims about this work elsewhere...)

For an unrelated reason, shortly after making that comment, I put 31 genes from a viral genome (the whole genome, assuming we have the reading frames correct and nothing else funky is going on) through AlphaFold. We're getting ready to do some proteomics to see what's in the capsid, and I wanted to inform the proteomics by doing some sequence analysis. Only three genes of the 31 came back with any sort of confidence. Two of the three were crystallized and solved by my group a few years back.

Is the AlphaFold team winning Folding@home? (which started at Washington University in St. Louis, home of the Human Genome Project)


FWIU, Folding@home has additional problems for AlphaFold, if not the AlphaFold team;

> Install our software to become a citizen scientist and contribute your compute power to help fight global health threats like COVID19, Alzheimer’s Disease, and cancer. Our software is completely free, easy to install, and safe to use. Available for: Linux, Windows, Mac

> which started at Washington University in St. Louis, home of the Human Genome Project

While Wash U was a contributor, I am confused about why you call it the home of the Human Genome Project. The Project seems a lot more strongly linked to the Whitehead/MIT in terms of press and the site of key figures.

If we're just sharing links, I have one too:


Together, these teams have achieved a very significant cost reduction: the link I shared cites a sub-$1K cost to sequence a genome today; a cost savings of millions of dollars per genome.

Both projects tackle related problems, but each is trying to answer a different question: https://news.ycombinator.com/item?id=32264059

> Folding@home answers a related but different question. While AlphaFold returns the picture of a folded protein in its most energetically stable conformation, Folding@home returns a video of the protein undergoing folding, traversing its energy landscape.

Is there any NN architectural reason that AlphaFold could not learn and predict the Folding@home protein folding interactions as well? Is there yet an open implementation?

I think it would be much harder to do that, since it probably requires modelling physics at some level, while AlphaFold is really just mining statistical correlations of structures and sequences.

Yes, there are open implementations of nearly-AlphaFold at this point.

FWIU there's no algorithmic reason that AlphaZero-style self play w/ rules could not learn the quantum chemistry / physics. Given the infinite monkey theorem, can an e.g. bayesian NN learn quantum gravity enough to predictively model multibody planetary orbits given an additional solar mass in transit through the solar system? (What about with "try Bernoulli's on GR and call it superfluid quantum gravity" or "the bond yield-curve inversion is a known-good predictor, with lag" as Goal-programming nudges to distributedly-partitioned symbolic EA/GA with a cost/error/survival/fitness function?)

E.g. re-derivations of Lean Mathlib would be the strings to evolve.

RIP Folding@home on Playstation 3! Bring it back!

Pretty much everything you said doesn't make any sense. Folding@Home started at Stanford, not WashU. WashU was also not "the home of the human genome project", that was a distributed effort. AlphaFold doesn't contribute to Folding@Home, it's an entirely different problem.

Disclaimer: I'm a professional (computational) structural biologist. My opinion is slightly different than the researcher that commented on the linked post.

I didn't see any claim by DeepMind that protein structure prediction is a solved problem. I think these guys are pretty diligent when it comes to communicating their science. What you may have seen is a non-scientist reporter making inaccurate claims.

The problem with structure prediction is not a loss/energy function problem: even if we had an accurate model of all the forces involved, we'd still not have an accurate protein structure prediction algorithm.

Protein folding is a chaotic process (similar to the 3 body problem). There's an enormous number of interactions involved - between different amino acids, solvent and more. Numerical computation can't solve chaotic systems because floating point numbers have a finite representation, which leads to rounding errors and loss of accuracy.

Besides, short-range electrostatic and van der Waals interactions are pretty well understood, and before AlphaFold many algorithms (like Rosetta) were pretty successful in a lot of protein modeling tasks.

Therefore, we need a *practical* way to look at protein structure determination that is akin to AlphaFold2.
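To make the chaos point above a bit more concrete, here is a minimal sketch of sensitivity to initial conditions. It uses the logistic map as a toy stand-in (my choice for illustration, not a model of protein folding), and the names `logistic_orbit`, `gaps`, and `first_big` are made up for this example. Two trajectories that start 1e-12 apart end up in completely different places after a few dozen iterations.

```python
def logistic_orbit(x0, r=4.0, steps=100):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two starting points that differ by a tiny perturbation (assumed value).
a = logistic_orbit(0.2)
b = logistic_orbit(0.2 + 1e-12)

# Distance between the two trajectories at every step.
gaps = [abs(x - y) for x, y in zip(a, b)]

# First step at which the trajectories are clearly different.
first_big = next(i for i, g in enumerate(gaps) if g > 0.1)

print(f"initial gap: {gaps[0]:.3e}")
print(f"gap exceeds 0.1 at step {first_big}")
```

The same effect is why tiny rounding errors in a long numerical simulation of a chaotic system eventually dominate the result.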

Now I really want to read a long-form book like this comment: 'A Computer Scientist's Guide to an Intuitive Understanding of Biochemistry'

I’ve found it extremely hard to have a casual understanding of biology, unlike math where I feel like I have a solid high level sampling of the field. I’ve done a few bio and chemistry courses and books but it’s so deep and ill suited for a programmer who is used to asking how things work underneath at every level (you have to constantly stop yourself from asking why something does what it does and just go with it until it starts to connect later, which is more of a commitment than I could give).

Anyway thanks for your comment

I would suggest carefully reading a deep textbook on biology like Molecular Biology of the Cell. You can't get a casual but realistic understanding of biology without a significant effort. That's a big problem in modern society. Biology is subtle and yet ever-important to us earth-bound organisms. The vast majority of people have only the most trivial understanding of biology, but scientifically we have a rather complete perspective and mental model that, due to its recent development, hasn't yet become common.

Great book suggestion! Absolutely agree as someone in the field

Biology and biochemistry are unbelievably complicated and difficult to grasp without truly going deep into the fundamentals

Slightly OT, but I am a computational chemist (PhD) looking to learn more about molecular biology (to, say, an undergraduate or beginning graduate level). I am looking to learn more to see ways in which advances in computational chemistry tools could be applicable outside of our usual domains.

I am looking at Molecular Biology of the Cell (Alberts) and Cell Biology (Pollard). Both were recommended to me, but wondering what the pros and cons of each are (if you are familiar with both of them).

I'm not familiar with Cell Biology by Pollard but MBoC has incredible diagrams and flow charts that make pathways and other concepts incredibly easy to understand

I would suggest taking MIT's Secret of Life course on edX. It's taught by Eric Lander, who was a key figure in the Human Genome Project and was a mathematician beforehand, so he follows an axiomatic approach that is much different from the way other schools teach biology


Alternatively, Harvard Extension School has some great biology courses you can sign up and get credit for. Though those are mostly for pre-med career changers

Two recs:

- There is a (short) book called "A Computer Scientist's Guide to Cell Biology" by William Cohen which is a little pricey but very dense and helpful with a lot of concepts.

- Combine that with David Goodsell's "The Machinery of Life" which has a lot of great illustrations and practical examples.

Only way to truly learn biology imo is to read and do experiments. The feedback loop between those two things is what actually gives someone real intuition.

Is it possible or likely that the folding process is more procedurally deterministic than it seems? (given sequence, temperature etc) The degrees of freedom perhaps seem intractable because we don't know what steps the structure takes between the linear extrusion and final fold. AlphaFold, if I understand correctly, doesn't attempt to solve this problem. Your comment implies we should be skeptical of it because it's solving a potentially-intractable problem; perhaps it's both tractable, and AlphaFold doesn't solve it.

Let's say you have a car (or lego set etc). The number of possible ways the parts could go together are astronomical! Does that mean it's not possible to figure out how it fits together, or how you might build one?

Yes, if you have a Lego set, or a series of car parts, there are many ways to put them together to make something. What AF is doing as far as I understand is essentially looking at a catalog of all Lego sets ever produced, or all car models ever produced, and choosing one that most closely matches the pieces it is seeing.

But there is no reason to expect this process to produce the right end-result for a Lego set that has never been seen before.

Didn't AlphaFold win a competition based on folding proteins that had a secret result?

Yes, but that competition is using lots of proteins that are similar to other known proteins, as far as I understand.

There is also a lot of sub-structure that helps - similar parts of proteins tend to fold in similar ways, so even if you don't have real predictive power on unknown sequences, you may do quite well for a protein that is 90% the same as one in the training set - you will be quite correct on ~90% of the folds, even if you're pretty far off on the remaining 10%.

Note that all of this is not to minimize the success of what AlphaFold achieved. I am just trying to explain how you can do well at this problem without having discovered some deeper deterministic structure in protein folds.

Yes, but many proteins can be boiled down to basically two classes - the folded portion, and the unfolded portion. The folded portions are typically shared (shared is a loose term, there's a lot of leeway) among almost all proteins.

So, I can pull a protein out of thin air and there's a good chance it'll have an overall fold similar to another protein that's got a structure. Unfortunately, the devil is almost always in the details. An amino acid here or there, a short extension here or there, a missing charged residue or an extra glycine and now you have a different target and entirely different behavior in a biological system.

One cool thing I found, actually, was that a protein in an archaeal virus had no known homology a few years ago, but when I checked the other day, it now matches most closely to an (otherwise thought to be) entirely synthetic protein out of David Baker's lab at UW. Which means this archaeal virus and David Baker converged on the same fold somehow (likely because it was "stable").

Given how quantum physics and chemistry works, highly unlikely.

>When someone claims to have "solved" folding, you should be as skeptical as you would be if someone claimed to have solved the halting problem for arbitrary machine code

That's absurd. The halting problem is provably impossible with either conventional computers or Quantum computers.

This is clearly not true for protein folding, although it is possible that it is computationally intractable with a conventional computer.

I think the parent comment is saying that it's impossible to arrive at a specific folding endpoint because that state is dependent on continuously changing environmental variables.

Take a look at the configs for Amber (molecular dynamics simulation -- https://ambermd.org). QC might help map the space of inputs that would converge, but it probably couldn't identify a hypothetical 'done folding' state for any given protein.

I don't think it's super valuable to spend time thinking about the computational class protein folding (or structure prediction) is in. It's clear now that approaches that approximate the expensive physics and extended sampling using every bit of additional information available are going to be much more successful in providing data that people need from structures.

I propose this as a thought experiment: Nature has solved this. How? Some lines of reasoning:

#1: The quantum interactions of electrons that are the basis for chemical bonds behave in ways our computers and intuition are incapable of simulating

#2 It's a matter of degree, not kind, and nature is more sophisticated than our computers, reasoning and thought processes.

#3 Nature is magic, whatever you define that to be

#4 When stipulating the degrees of freedom involved (ie from dihedral angles), the possibility of additional information we haven't discovered is being overlooked. Is there a recipe or algorithm that could help?

#5 Proteins don't fold in isolation. We know some proteins need chaperone proteins to fold, for instance. Others form part of a complex. The problem can't be solved in the general case just based on the sequence of the protein you want to know the structure of. That's also a problem experimentally -- we don't know if the structure of a crystallized protein is really the biologically meaningful form.

I'd go with #1. Especially considering that there are quantum approaches to protein folding.

But nature hasn't really "solved" the problem, it is just doing its thing, but the way it does things is completely different from what our computers do.

It is like trying to reproduce a guitar sound using a synthesizer. A guitar solves the problem of sounding like a guitar, but that doesn't mean it is more sophisticated than a synthesizer. In fact, a synthesizer can do much more; it is just that the process by which the guitar makes sounds is hard to simulate.

Could it not just be brute force? Nature operates on a much bigger temporal scale than us.

Could be! Are you thinking thermodynamic fluctuations from surrounding water molecules jostling things around into many combinations? In this view, do you think the final protein would be found by chance, or through intermediate assemblies?
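For a rough sense of scale on the brute-force idea, here is a back-of-the-envelope sketch in the spirit of Levinthal's paradox. Every number in it is an assumption chosen for illustration (100 residues, 2 backbone dihedral angles per residue, 3 coarse states per angle, 1e13 conformations sampled per second), and `levinthal_estimate` is a hypothetical helper name, not something from the thread or from any library.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_estimate(residues=100, angles_per_residue=2,
                       states_per_angle=3, samples_per_second=1e13):
    """Return (number of conformations, years to enumerate them all).

    All defaults are illustrative assumptions, not measured values.
    """
    conformations = states_per_angle ** (angles_per_residue * residues)
    years = conformations / samples_per_second / SECONDS_PER_YEAR
    return conformations, years


conformations, years = levinthal_estimate()
print(f"conformations: {conformations:.3e}")
print(f"years to enumerate: {years:.3e}")
```

Under these assumptions an exhaustive search would take vastly longer than the age of the universe, while real proteins typically fold in microseconds to seconds, which suggests that pure random search is not what is happening and that the energy landscape guides the chain through intermediates.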

Isn't #1 the most likely, given that most quantum interactions take exponential time to simulate on classical computers with any known algorithm?

Why did we escape the land of NP?

NP problems are ones whose positive solutions are verifiable in polynomial time.

For example, the problem "is there a route in this graph that visits all nodes and has length <= L" can be quickly verified with a classical computer, as long as you're given a "yes" answer accompanied by such a route. Finding the answer from scratch might be much slower, but checking it is quick.
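A minimal sketch of what "quickly verified" means for that route example. The graph representation (a dict mapping each node to a dict of neighbor -> edge length) and the name `verify_route` are assumptions made for illustration. The check runs in time proportional to the length of the route, even though finding such a route from scratch may be much slower.

```python
def verify_route(graph, route, limit):
    """Check a claimed "yes" certificate in polynomial time.

    graph: dict mapping node -> {neighbor: edge_length} (assumed format)
    route: list of nodes in the order they are visited
    limit: maximum allowed total length L
    """
    # The route must visit every node and nothing outside the graph.
    if set(route) != set(graph):
        return False
    total = 0
    for u, v in zip(route, route[1:]):
        # Every consecutive pair must be joined by an actual edge.
        if v not in graph[u]:
            return False
        total += graph[u][v]
    return total <= limit


# Tiny example graph (hypothetical data).
graph = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 2},
    "c": {"a": 4, "b": 2},
}
print(verify_route(graph, ["a", "b", "c"], limit=3))
print(verify_route(graph, ["a", "c", "b"], limit=3))
```

The point of the original comment is that no comparably cheap check is known for "this is the structure the protein really adopts".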

As a consumer of all things science but who will never be more than a fan cheering from the stands, one thing I read somewhere stuck in my mind. It went something like:

"We are close to understanding the genome, which will lead us to understanding the proteome. Understanding the proteome will lead us to understanding the metabolome. Once we understand the metabolome, we will have everything: eternal youth, no disease, and everyone can have the perfect body of their choosing."

I doubt any of us will live to see it, but it's a nice thought.

Given that we already had Alphafold... And we already had a big database of proteins... Running Alphafold on that database seems... unremarkable?

Was it a massive amount of computation that is inaccessible to anyone else? Is there a good reason to run it on the whole database, not just the handful of proteins you want to investigate?

To me the announcement is a bit like saying "we've run prime number analysis on every phone number in the United States!". Great... anyone could have done that... And nobody has done it yet because it wasn't super important to anyone.

"ML model predicts data from the training set" doesn't make a very good headline does it?

edit: wrote it in a sarcastic way, but I would love to understand if that's what it actually is or if there is an actual difference between this and what AlphaFold does.

The result is available for download.

Researchers often search those databases for features to find interesting proteins.

It was indeed a huge amount of computation. On Colab it takes about 2-3 hours to fold a single protein. So the resources required to fold 200 million are far beyond what most organizations have. Just as importantly, making the results easy to search and download will really impact a lot of biology.
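A naive worked version of that arithmetic, assuming the 2-3 hours per protein Colab figure applied uniformly to all 200 million sequences on a single machine. That is an upper-bound style assumption for illustration only (a production pipeline on dedicated hardware would presumably be much faster per protein), and `single_machine_years` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 24 * 365


def single_machine_years(proteins, hours_per_protein):
    """Years one machine would need at a fixed per-protein cost."""
    return proteins * hours_per_protein / HOURS_PER_YEAR


proteins = 200_000_000
for hours in (2, 3):
    years = single_machine_years(proteins, hours)
    print(f"{hours} h/protein -> about {years:,.0f} machine-years")
```

Even if the real per-protein cost were a hundred times lower, this is still hundreds of machine-years, so it only becomes practical with a very large fleet running in parallel.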

I tried looking up some interesting enzymes involved in various industrial approaches to carbon capture and sequestration. AlphaFold has uploaded a lot of these to the UniProt database. Compare and contrast, for a carbonic anhydrase enzyme (catalyzes the H2O + CO2 <-> HCO3- + H+ reaction in a thermophilic bacterium):

X-ray crystal structure uploaded to UniProt database (this is a trimer, i.e. the crystal unit contains 3 copies of the protein, while the AlphaFold model is the single protein):


AlphaFold result uploaded to UniProt database:


One thing to notice is that the X-ray data (structure from 2000) contains a lot more information than the AlphaFold model, such as the location of the metal-binding active site. There's also a fair amount of uncertainty about one of the helix elements in the AlphaFold model.

However, structural protein biochemistry is a very complicated field. Take a look at this (full open access) paper as an example, which uses a variety of modern computational techniques:

"(2021) In Silico Investigation of Potential Applications of Gamma Carbonic Anhydrases as Catalysts of CO2 Biomineralization Processes: A Visit to the Thermophilic Bacteria Persephonella hydrogeniphila, Persephonella marina, Thermosulfidibacter takaii, and Thermus thermophilus"


> "All γ-CAs are structurally similar, adopting a homo-trimeric architecture dominated by β-sheets and an α-helical C-terminal. These CAs are only functional in their trimeric form because each active site is located between two monomers, with two coordinating residues coming from one monomer and the third from the neighboring one."

I'm guessing AlphaFold will be used more as a primary tool for getting a rough idea of the structure of unknown sequences and that X-ray crystallization and NMR dynamics methods for understanding the details, particularly when it comes to catalytic proteins, will continue to be absolutely necessary techniques, along with a wide variety of other computational techniques like molecular dynamics simulations, etc.

It’s amazing watching them wash away/make irrelevant decades of academic effort

This is really more of a tool that allows researchers to take protein sequences and get pretty good predictions of what the crystallized protein will look like. It might be an aid to crystallization as well, since you can look at crystallization conditions of similar proteins. However, claims that real-world data is no longer needed are certainly overblown.

Another issue is that a complete analysis of a protein's function, particularly for catalytic proteins, requires that you understand their dynamics, which are not reflected in static protein crystals. Other methods like NMR are used for that purpose.

Could you elaborate on this? I have no idea what you're talking about.

If I had to take a guess I'd say that in the past it took a ton of work to record and catalog just a small number of these and now they can brute force find and record them so quickly that it makes the past efforts look like a waste of time.

Although I suspect that this new thing wouldn't have been possible without the old thing.

