Hacker News
AlphaFold Protein Structure Database (ebi.ac.uk)
315 points by matejmecka on July 22, 2021 | 58 comments


I'm impressed and grateful that DeepMind released this resource; it will save a lot of compute for labs trying to replicate an entire exome for themselves. While some structures look great, there are still some misses here. Important proteins like BRCA1 (a well-studied breast-cancer-associated protein) are just structures for the BRCT and RING domains surrounded by a low-confidence string of amino acids, likely shaped to be globular: https://alphafold.ebi.ac.uk/entry/P38398

Maybe I was wrong for expecting the impossible here, but I was excited to see this specific structure and it appears that there is still work to do. Nevertheless, kudos to DeepMind on their amazing achievement and contributions to the field!


Everything between the BRCT and RING domains of BRCA1 is an intrinsically unstructured region, which DeepMind correctly predicts: https://pubmed.ncbi.nlm.nih.gov/15571721/

Another famous one would be the R-domain of CFTR, which was not resolved in experimental structure determination, and AlphaFold models correctly show disorder there. Nothing to be done in those cases except perform molecular simulation or other experiments to assess dynamic ensembles: https://alphafold.ebi.ac.uk/entry/P13569


A curious non-biologist here: how valuable are these low-confidence predictions for biologists? In other words, is it a hard-to-predict but easy-to-check situation, as with, say, prime numbers in mathematics?


The medium-confidence predictions are great for grounding or sourcing intuition. If you're trying to divide up a protein for an experiment and you have to choose where to divvy it up, you'd like to use even a bad prediction to help weight an otherwise completely random approach. AND there are great methods to help with this, but they're often custom, time-consuming, and out-of-field for most. So being able to very quickly spot-check using a uniform state-of-the-art method, for any arbitrary protein, makes it actually pretty useful for certain kinds of pre-experimental guidance.


Some are valuable for the reasons the other person responding noted, but some of the low confidence predictions may also be high confidence predictions of a disordered class of protein that doesn't have a standard rest state. So it's useful work one way or the other.
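
As a rough illustration of how the low-confidence stretches discussed above could be located: AlphaFold DB model files store the per-residue confidence score (pLDDT, 0 to 100) in the B-factor column of each ATOM record. The sketch below reads that column from PDB-format text and groups consecutive residues that fall below a cutoff. The function names and the cutoff of 50 are assumptions for illustration, and a low score is only a hint of disorder, not proof of it.

```python
def plddt_per_residue(pdb_text):
    """Return {residue_number: pLDDT} read from the CA atoms of PDB-format text."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-column PDB layout: atom name in columns 13-16, residue number
        # in columns 23-26, B-factor (pLDDT in AlphaFold models) in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_segments(scores, cutoff=50.0):
    """Group consecutive residues with pLDDT < cutoff into (start, end) runs."""
    segments = []
    start = prev = None
    for resnum in sorted(scores):
        if scores[resnum] < cutoff:
            if start is None:
                start = resnum
            elif resnum != prev + 1:
                # Numbering gap: close the current run and open a new one.
                segments.append((start, prev))
                start = resnum
            prev = resnum
        elif start is not None:
            segments.append((start, prev))
            start = None
    if start is not None:
        segments.append((start, prev))
    return segments
```

Calling `low_confidence_segments(plddt_per_residue(text))` on a downloaded model would list candidate flexible linkers, which is one way to use even a weak prediction when choosing where to cut a construct.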


As an ex-biomedical researcher I was trying to think what protein I should enter and see, and couldn't come up with a protein that I know of that didn't have a structure already (at least a crude one). That is, we roughly know what most known important proteins look like. This is an amazing tool, and will be indispensable in labs (I'll expect any lab to use this site at least once a year?). But it's not as transformative as some might think.


https://www.embl.org/news/science/alphafold-potential-impact...

> A discussion of the applications that AlphaFold DB may enable and the possible impact of the resource on science and society


Do we really know the structure of every protein that assembles into a human cell?


From their abstract:

---

After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally-determined structure. Here we dramatically expand structural coverage by applying the state-of-the-art machine learning method, AlphaFold2, at scale to almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence.

https://www.nature.com/articles/s41586-021-03828-1

---

The metric they use (residues) is a bit unusual (I would have used number of proteins instead), but I assume they wanted to account for ambiguity (such as proteins with partial structures).
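
To make the difference between the two metrics concrete, here is a small sketch with made-up numbers (the four proteins and the function name are hypothetical, not data from the paper). Counting proteins treats a partially solved protein as fully covered, while counting residues does not.

```python
def coverage(proteins):
    """proteins: list of (length, residues_with_structure) pairs.

    Returns (residue_fraction, protein_fraction).
    """
    total_residues = sum(length for length, _ in proteins)
    covered_residues = sum(covered for _, covered in proteins)
    residue_fraction = covered_residues / total_residues
    # Protein-level metric: a protein counts as covered if any part of it
    # has a structure, which overstates coverage for partial structures.
    with_any_structure = sum(1 for _, covered in proteins if covered > 0)
    protein_fraction = with_any_structure / len(proteins)
    return residue_fraction, protein_fraction


# Hypothetical proteome: one domain solved in a long protein, one small
# protein fully solved, one with nothing, one with a short fragment.
toy_proteome = [(1000, 100), (200, 200), (800, 0), (500, 50)]
residue_fraction, protein_fraction = coverage(toy_proteome)
```

With these toy numbers the protein-level figure is 75% while the residue-level figure is 14%, which is why the per-residue number is the more conservative one when many structures are partial.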


One of the reasons we don't have them all is that individual genes can encode multiple protein isoforms through alternative splicing. AlphaFold was only run on one. Otherwise, there are lots of important biochemical/biophysical processes that impact structure, as cells are only about 50% protein by weight.


Definitely not.


Anyone else getting a 403 Forbidden?

If so it might be better to link to the paper instead: https://www.nature.com/articles/s41586-021-03828-1


Works fine for me. Must have been a temporary glitch.


Didn't see this post so posted it also. Also relevant: https://www.embl.org/news/science/alphafold-potential-impact...


This is a fabulous convenience! The reach of this ready-to-go data will be much larger (in some directions) than the model and CASP results themselves.


I used to do some RNA molecular dynamics simulations in college which were both computationally expensive and difficult to replicate. Having the ability to reasonably predict protein structure is an incredible scientific achievement - however I am curious if anyone here who is better informed has takes on the following.

1. How likely is it that AlphaFold learned to accurately predict protein structure only in the narrow domain of proteins that have been experimentally synthesized and whose structure has been measured? In other words, will AlphaFold's results generalize to proteins which cannot yet be synthesized in the laboratory?

2. If AlphaFold's accuracy holds, what type of commercial applications does this open up?


There's a lot of news about AlphaFold lately but what about RoseTTAFold? Wasn't it more accurate and much faster?


I believe slightly less accurate but significantly faster is where it stands.


Running a sequence against both seems like a good idea. If they agree the certainty will go way up.
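
One cheap way to put a number on "they agree" is a distance-matrix RMSD, which compares all pairwise CA-CA distances in the two models and therefore needs no superposition step. This is only a sketch under assumptions: both models must cover the same residues in the same order, coordinates are CA positions in ångströms, and `distance_rmsd` is a name made up here. Note that this measure cannot tell a structure from its mirror image.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """RMS difference between the pairwise distance matrices of two models.

    coords_a, coords_b: equal-length lists of (x, y, z) tuples, one per residue.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("models must have the same number of residues")
    pairs = list(combinations(range(len(coords_a)), 2))
    if not pairs:
        raise ValueError("need at least two residues")
    squared = sum(
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(squared / len(pairs))


# Toy example: a straight three-residue chain, the same chain rotated and
# shifted, and a bent chain.
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
moved = [(1.0, 1.0, 1.0), (1.0, 4.8, 1.0), (1.0, 8.6, 1.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
```

Here `distance_rmsd(straight, moved)` is essentially zero because only the pose changed, while `distance_rmsd(straight, bent)` is above 1 Å because the shape changed.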


This is awesome! When they announced CASP results a few months ago, I was wondering if AlphaFold will be accessible as an API, where you can submit a protein id or a sequence and get back a 3D structure. This database is basically that, except it's free & open to the public. Major props!
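
For the "submit a protein id" part, a minimal sketch: the entry links quoted elsewhere in this thread all have the form `https://alphafold.ebi.ac.uk/entry/<UniProt accession>`, so a lookup URL can be built from an accession. That URL pattern is inferred only from those links, and the helper name is made up; the regular expression is the accession format UniProt documents.

```python
import re

# Accession format as documented by UniProt (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def alphafold_entry_url(accession):
    """Build the AlphaFold DB entry page URL for a UniProt accession.

    The URL pattern is an assumption based on the links quoted in the thread.
    """
    accession = accession.strip().upper()
    if not UNIPROT_ACCESSION.match(accession):
        raise ValueError("not a valid UniProt accession: %r" % accession)
    return "https://alphafold.ebi.ac.uk/entry/" + accession
```

For example, `alphafold_entry_url("P38398")` gives the BRCA1 page linked at the top of the thread.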


From the abstract[1]:

> After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally-determined structure. Here we dramatically expand structural coverage by applying the state-of-the-art machine learning method, AlphaFold2, at scale to almost the entire human proteome (98.5% of human proteins).

[1] https://www.nature.com/articles/s41586-021-03828-1


Basically they are saying that decades of distributed protein folding was useless and everyone would have had more utility mining cryptocurrency if it existed several years earlier

But at least it inspired someone to make and release this


You're conflating two different disciplines: distributed protein folding studies the biophysical process of proteins folding over time, while protein structure prediction makes a single static prediction of what is believed to be the final structure adopted by the protein in the folding process.

I think many people believe that given infinite computer time the protein folding simulations would produce the same output as the static prediction (modulo a number of complex details) but use far, far more computer time to get there.

The fundamental observation from the DM AF2 paper that I've been able to glean (which I kind of sort of already believed) is that careful multiple sequence alignments of 30-100 evolutionarily related proteins are enough to produce coarse distance constraints that can be used to guide a structure prediction to a good answer quickly. And that depended on new ML technology that didn't exist before.
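
To show why an alignment can carry distance information at all, here is a sketch of a much older and simpler idea than what AlphaFold2 does: mutual information between alignment columns. Positions that touch in the folded protein tend to mutate together, so their columns covary. AlphaFold2 learns from the alignment with a neural network rather than computing this statistic; the four-sequence alignment and the function name below are made up for illustration.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    msa: list of equal-length aligned sequences.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        # Ratio of the observed joint frequency to the frequency expected
        # if the two columns varied independently.
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 2 always change together (A with D, E with K),
# while column 1 varies independently of both.
toy_msa = ["AKD", "ARD", "EKK", "ERK"]
```

On this toy alignment the score for columns 0 and 2 is 1 bit and the score for columns 0 and 1 is 0, so the covarying pair stands out as a candidate contact. Real alignments need many more sequences and corrections for phylogeny and indirect coupling.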


Thanks for that explanation!


Just in case you're not joking, it's worth noting that the majority of distributed molecular simulation (past and present) is spent studying "folded proteins" to discover structures of proteins that are often hidden from methods like AlphaFold (currently). For example, https://www.nature.com/articles/s41557-021-00707-0


I don't know if you know, but doctors spent 1,300 YEARS using the wrong anatomy book. A few years of compute time isn't the end of the world. I'm sure Oracle's DB2 test suite has burned more carbon than protein folding labs have.


A third way in which you are wrong is that AlphaFold derives a lot of its power by referring to previously-solved protein structures, or parts of them. It doesn't fold the proteins from scratch in an "alpha-zero" way.


So it's more like protein folding was useless until an AI could make sense of the 17% solved variations and use that for the other 83% of proteins found in humans?

> After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally-determined structure. Here we dramatically expand structural coverage by applying the state-of-the-art machine learning method, AlphaFold2, at scale to almost the entire human proteome (98.5% of human proteins).

I just don't actually understand the quote from the article if it isn't comparing the same thing


> experimentally-determined structure

refers to structures determined by means of physical examination, such as crystallography, not to attempts at predictive computational analysis prior to AlphaFold, which were not accurate compared to AlphaFold.


Quick question, please excuse my ignorance, but is there a way to extrapolate sequence from structure? In other words, can we design proteins and calculate the sequence required to make it?


It's hard but people do it! This is the field of "protein engineering".


Interesting that they're porting it to other organisms. Different organisms have variations in ribosomes, post-translational modifications and even tRNA repertoire. So it's not a guarantee that two identical DNA sequences will give identical proteins in two different organisms.


Shouldn't matter? Protein folding is based on the laws of physics after all. If the same sequence folds differently in different organisms, then an external factor is missing.


While the laws of physics remain the same, the folding machinery between species varies to some degree. Protein folding is determined by the unique environment/machinery of a cell. A concrete example is disulphide bonds (S-S, e.g. cysteine-cysteine) that require a certain pH to form. The primary pathways of disulphide-bond formation are localized in the endoplasmic reticulum (ER) of eukaryotic cells and the periplasmic space of prokaryotic cells. So there are two completely different mechanisms to end up with the same bond (protein structure) depending on the organism.


Outside of missing post-translational modifications, can you give a concrete example of a protein that is known to fold differently in different species, not counting, say, stuff getting sent to the garbage bin of inclusion bodies due to the stress of overexpression? My understanding (7 years of grad school researching protein folding in the ER) is that, outside of some rare corner and disease-state cases, folding is pretty much a binary event. If it weren't, then in most cases the low delta G difference between isoforms would be just as easily overcome over the course of environmental changes in a single individual as "between different species"; namely, having a deterministic outcome is important for through-time robustness.


??? Unless you jump from eukaryotes to archaea these are not real concerns. Most PTM markers are very conserved.


I'd say the jump from eukaryotes to prokaryotes is a realistic scenario in recombinant DNA technology.

I have some experience with recombinant yeast and PTMs. The degree of glycosylation actually varies a lot depending on the strain used and has a huge effect on protein activity. And of course these PTMs affect the crystal structure.


I happen to be working on a database for folds as well. But RNA folds not protein folds. I’m not a bio guy but my gf is and if I understand correctly this is not the same. I hope they are different because it would suck to be me lol.

This is my first big boy project and I’m driving solo so it takes me a while to make any progress. But at least now I have this db and genbank to model after


yikes, this doesn't even do some basic stuff like trim off pre-protein segments for secreted proteins... Without this, you could get some very incorrect structures.


[flagged]


1. Gain of function is not as easy as you think. 2. Such bio-weapons are not likely because any virus released in the wild will mutate over time, and also because you cannot target "races" in the way you describe. Phenotypic traits span across geographical borders, and any attempt to do such a thing is likely to backfire.


I think if the CCP were successful in creating race targeted bioweapons it would be in their interest to convince the world that they didn't exist.

Insults, character assassination campaigns and politicizing the existence of these bioweapons would be a good way to do that. Just copy paste some of the comments here and change the name to insult anyone who thinks they exist. They could then go and kill millions and not receive any retaliation whatsoever with people praising them for their effective program of keeping the disease epidemic they created under control. Even if you got the guy who discovered AIDS and won the Nobel prize for it to say that these were gain of function viruses that incorporated HIV protein parts, you could just launch a big propaganda campaign to attack his character.[1] Much cheaper than having to fight a war.

[1]https://www.gmanetwork.com/news/scitech/science/736458/frenc...


...and many doctors will use it to attach pharmaceuticals to receptor sites of particular cancers.


I'm thinking that the problem is that it is much harder to develop drugs that only kill cancers very efficiently and don't harm the rest of the body than to tweak viruses that just have to keep the person alive long enough to spread the virus.


I 100% agree your point is valid. The counterargument is "Yes, people can do bad things with protein data, just as they can do bad things with a telephone, like use it to discuss a bank robbery."


The crazy part is a bioweapons program is really cheap compared to a nuclear weapons program, and now with these new tools it's even cheaper. Before, it was vastly more expensive to do the cycle of creating a new viral protein and testing a bioweapon on human cell culture. Now that process is sped up millions of times with this technology because that can all take place inside a computer.

This is similar to the change with drone weaponry. Before, you had to have large cruise missiles to get pinpoint strikes. Now small countries like Azerbaijan can buy a whole fleet of drone weapons and get the benefits of having a modern air force with pinpoint strikes and even stealth for vastly less money.


Is this a correct summary of your statements:

Because it -might- make things slightly easier for a state actor with nigh-unlimited resources to enact a doomsday scenario, which they might or might not be pursuing, medical researchers should not publish otherwise helpful research?


I think it's great that the Wuhan institute published all their gain of function research. They even said who paid for it. It's a clear trail back to them, but apparently taking any action to acknowledge that this is a bad thing and something fishy might be going on is a completely politicized issue now that apparently gets as many downvotes as arguing about hot button political topics now.

What I'm saying is there should at least be an open and frank discussion of what the whole world is getting itself into right now with all this.


It's a bit of a Gish gallop to ignore my point and say something orthogonal that only is weakly related to your original assertion.

Gain of function research is simultaneously concerning (aside from bioweapons use, there is also the possibility of deliberate release) and important (to understand how pathogenesis happens and to better combat future pandemics and bioweapons). It, however, has approximately nothing to do with your assertions that A) China is preparing race-targeted pathogens, or B) that publishing a protein folding database does anything significant to assist China with A.

> ... at least be an open and frank discussion of ...

OK, then you need to be open and frank, rather than engaging in these dishonest argument tactics.


> Ridiculous fookin’ idjit and compulsive liar M.T.Greene came out of a classified briefing recently and announced that the CCP is hard at work on race specific bioweapons.

Fixed that for you. By the way, you shouldn’t pay attention to that clown.


Citation factory, that's what it is.


Resources as useful as this are bound to be. We do cite our sources after all.


I’m sorry but why don’t they just release the ability for a user to enter a known real-world sequence’s accession number from Genbank / GISAID, and generate the protein structure from that? Why do they have to abstract the user from the process by only exposing a completed database of the protein structures the AlphaFold researchers decided would be worth producing?


You can use the open-source code, and we also have a Colab notebook for that: https://bit.ly/alphafoldcolab

More info: https://deepmind.com/blog/article/putting-the-power-of-alpha...


Thanks for that - I can see why my comment was downvoted now, as the posted article's FAQ lists these links for those who would like to study their favorite sequenced-but-unmodeled protein. I'm glad AlphaFold is as open source as it is, and I recognize that it didn't have to be so.

I think I was primed for a knee-jerk reaction because when AlphaFold's results were announced back in Dec. 2020, with expressions of what a boon it would be for researchers around the globe, I anticipated there would be a timeline announced for exposing a tool or for the open-sourcing. (The GitHub repo has only just been released about 6 days ago ...)

With all the work on SARS-CoV-2's 'interactome', as well as human proteins & enzymes involved in pharmacology of antiviral drugs under development / repurposing, it's easy to imagine that drug developers would have liked to exercise AlphaFold as soon as it was announced. (I myself have wanted a structure for human enzyme OATP1A2 that wasn't available on the PDB for such a drug pharmacology study - quite glad it is available at hand now. :) )

Anyway I'm sure good arguments will be made about the need to really 'get it right' before releasing, or internal deliberations on how much to open up vs charging for it.

But 7 months lead time during a pandemic is a long time...

In all cases thanks again for this innovation's availability now. :)


A little addendum (for posterity as this is now an old article post)

• RoseTTAFold came out in academic paper form as well as an open-source GitHub repo slightly prior to AlphaFold. Was AlphaFold's decision to open up motivated by RoseTTAFold's publishing / opening-up activity? So I feel that the extent of AF's altruism in this (while high) deserves some scrutiny, as perhaps it falls short of the extent that several of us commenting interpreted a couple days ago.

• While RoseTTAFold allows online execution to the best of their spec, they currently have a backlog queue of ~3000 jobs going back to mid-July. This is telling about the processing power required for folding. AlphaFold has access to a lot of processing power (even if it could ultimately & reasonably end up being on a charge-per-job basis).


I'd guess the ad-hoc simulation of the structure is computationally quite expensive and takes a while, though that's just a guess and I haven't read the original paper yet.


In fact a cost of $1-$4 for the preferred implementation:

https://news.ycombinator.com/item?id=27894060

The Colab provides a slightly-less-accurate version that operates in the cloud. For the real McCoy it seems one must set up one’s own environment and leverage the git repo.


DeepMind has already released the open source code and model parameters. The database makes it easier to access the predictions.

