So, we're talking like what? Maybe $100K to $300K of hardware? Wet biology labs often have multiple pieces of $100K+ equipment at their disposal. Why shouldn't computational labs have the same?
The cost of computing will also go down over time, and to government funders this sounds like a drop in the ocean when they are building multi-billion-dollar particle accelerators.
I guess the main reason it hasn't been done is that depreciation is still huge due to chip advancements.
All the experimentation and fine-tuning probably meant thousands of trials, which may have been at significantly bigger scale before they got the model optimised...
Sounds like an oxymoron these days.
I wrote a thesis on protein structure prediction in 1995. We weren't very good at it then. Amazing to see this.
In hindsight though, we were so far off, from both an algorithmic perspective and a hardware perspective, from actually achieving meaningful results. I am glad that, 20 years later, it seems real progress is being made. I haven't really followed the folding@home project in many years, but it's not clear to me that much came out of it that was all that useful, at least not in practical terms.
I don't believe this new approach runs on their distributed compute network, but it's cool to see some good competition.
This can be applied to just about any area of biology. You could design novel antigens to combat disease, and then easily mass-produce them. Or just inject the RNA to have the body produce them.
But the applications are boundless, from genetically modifying crops, to anti-aging, and more.
It is also one of the key pathways to molecular nanotechnology, where instead of building arbitrary structures out of amino acids, we increase the range of arbitrary molecules we can design, build, and produce in quantity.
Specific structures are useful in all manner of ways, from cleaving a DNA molecule at a specific point, enzymes for breaking apart molecules, etc.
Very, very useful.
Just to frame it a particular way: biological systems are basically solved nanotechnology, extremely good, self-sustaining, resilient little machines that have spent a long time optimizing to be better and better. But all the designs are preset. If we can crack the code and design our own little machines, then amazing things like a more plastic-like cellulose could be made, and all sorts of problems suddenly become far easier to solve. But a lot of new problems emerge too, ones that weren't even imaginable before, since the code being cracked is a big chunk of the code of life itself. So, y'know, playing God and all; there will probably be some negative consequences of this too.
Generally speaking molecular nanotechnology will solve all the "intractable" problems we as a society face today: climate change, poverty, biological death from old age / disease / cancer, and more.
We could also create tools of destruction so vast they can be hard to contemplate.
- better predict drug binding to proteins (massive benefits if accurate)
- better understand the functional outcomes of missense mutations on proteins
- study protein-protein interactions
- and in general, just gain a better understanding of biology (which is driven by proteins and their reactions/interactions)
A group of your competitors then trashes your proposal in a group and if you've properly massaged the right backs, you get a pittance, which permits you to struggle to keep up with all your promises.
To paraphrase Derek Lowe a lot (see, e.g., https://blogs.sciencemag.org/pipeline/archives/2021/03/19/ai...), there are several hard problems in biology, and the kind of progress embodied in AlphaFold isn't progress towards the rate-limiting problems. And many of the things that make drugs hard to develop are going to carry over into making bioweapons hard to develop.
This is a question we should remember when we feel like condemning big corporations for monopolizing AI. HuggingFace lists 12,257 models in its zoo, many coming from FAANG. You can start one in 3 lines of Python, or fine-tune it with a little more effort.
A lot of people will say "Unless you opensource my work and that of my colleagues, I quit".
When faced with all your best people threatening to quit, you might just opensource that work. It turns out you still have an advantage by being ~1 year ahead on applying it to anything, and having all the people who know how it works on your staff.
1. It is in line with the organization's vision/mission of advancing science.
2. It differentiates them from OpenAI, which, despite the name, is not really big on open source.
It's pretty clear at this point that the work led to a large improvement in protein structure prediction (PSP) scores, but there's literally nothing else groundbreaking about it; I don't mean that in a bad way, except to criticize all the breathless press about applications and pharma.
What I would maintain is that AlphaFold 2 does not solve the protein folding problem: how a protein folds, as opposed to what it folds into.
Maybe according to the current definition of the term, which has drifted over the years. Homology modeling and "ab initio" structure prediction have been drifting toward each other for a long time. These days, the categories are separated by (an essentially arbitrary) sequence identity threshold. If you have a protein sequence with high homology to some other protein with a structure, then you're homology modeling. If you have no matches at all, you're doing "ab initio". In the middle, you have a gray area where you can mix the approaches and call it whatever you like.
This is not a pedantic point. If your method requires homology -- however distant and fragmented -- in order to work, then you're always limited to the knowledge in the database. Maybe we've sampled enough of protein space to get the major folds, but certainly, the databases don't have enough information to get the small details right.
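To make the threshold concrete, here is a minimal sketch (my own illustration, with made-up toy sequences, not real proteins) of what a percent-identity cutoff actually measures over a pairwise alignment:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over a pairwise alignment of equal length.

    Positions where either sequence carries a gap ('-') are ignored,
    which is one common (but not the only) convention.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for x, y in aligned if x == y)
    return 100.0 * matches / len(aligned)

# Toy aligned fragments: 6 matches out of 7 ungapped columns
print(percent_identity("MKT-LLVA", "MKSELLVA"))  # ~85.7% identity
```

Whether a target counts as "homology modeling" or "ab initio" then comes down to where you draw the line on this single number, which is exactly why the boundary feels arbitrary.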
I have never been a huge believer in the idea that we can go directly from protein sequence to protein structure simply using a mathematical model of physics, but that is the original meaning of "ab initio structure prediction", and if you could do it, it would be far more valuable than alphafold. At risk of making a trivially nerd-snipable metaphor, it's kind of like the difference between google translate and a theoretical model of human intelligence that understands concepts and can generate language. The latter is obviously immensely more capable than the former.
Ab initio means "from the beginning": at most, you're allowed physically inspired force fields, not sequence similarity to known structures. I put a lot of effort into improving the state of the art in that area, but ultimately concluded it made more sense to concentrate experimental structure determination where it was most useful: on proteins that had unknown folds or no known homology (see https://scholar.google.com/citations?view_op=view_citation&h... for some previous work I did in this area).
The category is given the name, not the methods. People can use any method they like to solve the structures. The organizers are not zealots.
The ab initio portion of CASP consists of proteins that the organizers know have low sequence identity to anything in the existing databases. They represent proteins that are "difficult" to solve using what any practitioner might call homology modeling. That doesn't mean that you can't use a method that takes into account the biological databases -- and essentially all of the good methods do!
For example, the Rosetta method has competed in both the homology modeling and the ab initio categories for many years. They mix a bit of both -- using homology models to get the fold, and fragment insertion to model the floppy bits.
I haven't paid close attention to CASP in a long time, but I assume the competitor list still has tons of entries from people who cling tightly to the purist vision of ab initio modeling. They don't tend to do very well.
"They mix a bit of both -- using homology models to get the fold, and fragment insertion to model the floppy bits."
That's the best description of what I believe AF2 is doing, and yet AF2 is being marketed as not depending on any sequence similarity.
If the CASP folks really are saying "if you have 20% sequence identity and use the structure from that alignment it's ab initio"... that's really just totally misleading.
Of course, even ab initio methods are parameterized on biological information; for example, I used AMBER to do MD simulations, and many of the force field terms were determined using spectroscopic data from fragments of biological molecules. That, however, is still ab initio, because nothing even as large as a single amino acid is parameterized.
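For readers unfamiliar with what a "force field term" is, here is a sketch of the simplest one, a harmonic bond-stretch term of the kind AMBER-family force fields use. The constants below are illustrative round numbers of my choosing, not values from any published parameter set:

```python
def harmonic_bond_energy(r: float, k: float, r0: float) -> float:
    """Bond-stretch term E = k * (r - r0)**2, the simplest force field term.

    The stiffness k (kcal/mol/A^2) and equilibrium length r0 (A) are
    exactly the kind of parameters fit to spectroscopic data on small
    molecular fragments, rather than to whole proteins.
    """
    return k * (r - r0) ** 2

# Illustrative numbers only: stiffness 340 kcal/mol/A^2, r0 = 1.09 A
print(harmonic_bond_energy(1.10, 340.0, 1.09))  # small penalty for a 0.01 A stretch
```

A full force field sums many such terms (bonds, angles, torsions, non-bonded interactions), but each is parameterized at this small-fragment level, which is the sense in which the method stays "ab initio".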
I'm not saying there's anything wrong with homology modelling, or that the purist vision of ab initio is right. For practical purposes, exploiting subtle structural information through sequence alignment is a very nice way to save enormous amounts of computer time.
OK, great. Me too. I'm not saying anything controversial here. Right from the top of the "ab initio" tab on predictioncenter.org:
"Modeling proteins with no or marginal similarity to existing structures (ab initio, new fold, non-template or free modeling) is the most challenging task in tertiary structure prediction."
My understanding is no, they did the equivalent of template modelling, which uses sequence/structure relationships (that are more subtle than the ones you get from homology modelling).
I'm less interested in reconciling my internal mental model of PSP with CASP's than I am in understanding whether AF2 is somehow able to get all the necessary structural constraints through co-evolution of amino-acid pairs, entirely without some (direct or indirect) learned relationship derived from sequence similarity to known structures (be it even short fragments like helices).
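As a concrete (toy) illustration of the co-evolution signal being discussed: the crudest way to detect co-varying column pairs in an MSA is mutual information between columns. This sketch is my own illustration, not anything AF2 actually computes; real contact predictors use far more sophisticated statistics (e.g. direct coupling analysis):

```python
from collections import Counter
from math import log2

def column_mi(msa: list[str], i: int, j: int) -> float:
    """Mutual information (in bits) between alignment columns i and j.

    High MI means the residues at the two positions co-vary across
    sequences, a crude proxy for the structural-contact signal.
    """
    n = len(msa)
    pair_counts = Counter((s[i], s[j]) for s in msa)
    pi = Counter(s[i] for s in msa)
    pj = Counter(s[j] for s in msa)
    return sum(
        (c / n) * log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
        for (a, b), c in pair_counts.items()
    )

# Toy 4-sequence MSA (made up): columns 0 and 1 always change together
msa = ["AKLV", "AKIV", "GRLV", "GRIV"]
print(column_mi(msa, 0, 1))  # 1.0 bit: perfectly co-varying pair
print(column_mi(msa, 0, 2))  # 0.0 bits: independent columns
```

The open question in the thread is whether signals like this, extracted from deep MSAs, suffice on their own, or whether structural templates are still doing the heavy lifting.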
If they really did do that, and nobody did it before, that's great, and I will happily promote the DeepMind work, as it supports what I said when I did CASP: ML and MD will eventually win, although in a way that exploits the rich evolutionary sequence information we have, rather than predominantly by having an accurate force field and good sampling methods.
If I'm mistaken about this then I'll happily take back what I said, but there's no way that AF2 could work without MSAs; therefore, it is not ab initio.
Ah, OK, I checked the paper again. They're working in the "template" category, which means there is structure-sequence information... maybe the CASP organizers consider this ab initio? The paper never mentions anything about ab initio predictions. Is that what you're saying, that template methods are ab initio?
The commonly accepted definition of homology modelling implies using a known structure ("template") as a scaffold to model the protein's topology. Since there are many CASP14 targets without appropriate templates, AlphaFold 2 simply cannot "just do homology modelling".
I do take the point that the correct term is "free modelling" (it does not have, or does not use, any good structure as a template), and not "ab initio modelling" (it uses physics to fold the protein), though. A deep enough MSA is generally a requirement.
IE, any MSAs would always include alignments to known protein structures. Are you saying their MSAs don't include alignments to known protein structures?
(The reason I'm asking all this is because if I'm mistaken, then AF2 did do something "interesting", but everything in the paper says that everything they did is template based. If they are just folding proteins using MSAs without alignments to protein structures, that's far more interesting. I don't think they did that.)
edit: I've now reread the paper again, and I believe their claim of making predictions where there is no structural homology is incorrect from a technical perspective. I've communicated this to both the CASP organizers (whom I know) and DeepMind.
It would help if you could point to one of the alignments they made that has no underlying structural support (not even a template fragment).
I reread the methods section, https://static-content.springer.com/esm/art%3A10.1038%2Fs415...
They train jointly on the results of genetic search and template search. Can you show an example of a prediction made using only genetic search and not template search? Those templates are FASTAs made from PDB files, which, while not homology modelling, is definitely not "ab initio".
I'm going to be a bit skeptical, but if that's the case, then it really is a significant improvement. I'm glad to see that, with just the idea, the academic community was able to reach near parity in a short time, demonstrating there was nothing unique to DeepMind except their huge amount of compute, storage, and talent, and that this would have happened at the next CASP anyway.
Requiring bioinformatics people to stray a long way from their core competency and learn a scripting language from the '80s to write glue code seems... suboptimal. How many hours of expert time have been wasted figuring out how to split a string in bash?
Can we software people build a better tool to eliminate the need for this?
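For what it's worth, the specific pain point named above, splitting a delimited record, is a one-liner in Python. The record below is a made-up example, not real data:

```python
def split_record(line: str, sep: str = "\t") -> list[str]:
    """Split one delimited line into fields: the Python equivalent of
    bash's IFS gymnastics or a `cut -f` call."""
    return line.rstrip("\n").split(sep)

# Hypothetical tab-separated gene record:
fields = split_record("BRCA1\tchr17\t43044295\t43125483\n")
print(fields)  # ['BRCA1', 'chr17', '43044295', '43125483']
```

Of course, as the replies below point out, the shell's advantage was never string splitting; it's how cheaply you can chain external tools.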
Sadly, this is very uncommon in the community. In a bioinformatics meeting, the sentence "I spent X days setting up Y software" will not raise many eyebrows.
I look with much jealousy over at the computer science field where papers often include code, multiple versions under version control, automated tests, setup/docker scripts, and demonstration workflows and interfaces.
They probably could have achieved the same by invoking things in Python, but it would have been slower and not achieved a lot, other than “not using shell scripts”.
And once you go down the path of optimizing this enough, you’ll end up reinventing shell scripts altogether.
I think we could create something like a GitHub of bio projects that need help, where people assist in the hope of getting their names on a paper.
I find it astonishing how bad Python is as a bash replacement.
I'd often rather write an argument parser in bash than use Python if I have to invoke a bunch of commands.
> explaining to people how to use the Python subprocess module to launch different apps and capture their output
There's no shame in using `os.system`.
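To be fair to the standard library, `subprocess.run` covers the common "launch a tool, capture its output" case in a few lines. A minimal sketch, using `echo` as a stand-in for a real bioinformatics tool:

```python
import subprocess

# Unlike os.system, subprocess.run hands back stdout, stderr and the
# exit code as one structured result object.
result = subprocess.run(
    ["echo", "hello"],     # argv list: no shell quoting pitfalls
    capture_output=True,   # collect stdout and stderr
    text=True,             # decode bytes to str
    check=True,            # raise CalledProcessError on non-zero exit
)
print(result.stdout.strip())  # hello
```

The friction people complain about is that a one-line pipeline like `tool_a | tool_b | sort` still takes several such calls (or `shell=True`, with its quoting hazards) to express in Python.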
Seriously please tell me if you are founding this company so I can invest.
Most probably not. Bash is currently the sweet spot; it is actually the best tool for this job. Any other option comes with increased complexity and will make the whole thing less stable.
I also use Bash and AWK for preprocessing a lot.
Probably a good many more than would be needed to learn how to split a string in Python.