Wet-lab innovations will lead the AI revolution in biology (substack.com)
60 points by lysozyme 40 days ago | 25 comments



> people unacquainted with biology have a false perception of how low-throughput biology experimentation is. In many ways, it can be. But the underlying physics of microbiology lends itself very well to experiments that could allow one to collect tens-of-thousands, if not millions, of measurements in a singular experiment. It just needs to be cleverly set up.

I think this passage gets at the fundamental disagreement in perspective between those focused purely on computational advances and those innovating in wet-lab techniques.

Why? Because years of people's careers have been wasted waiting on promises from molecular biologists claiming they will make these "clever" high-throughput experiments work. In my experience, they'll spend months to years concocting a Rube Goldberg machine of chained molecular biology steps, each of which has (at best) a 90% success rate. You don't have to chain many of these together before your "clever" setup has a ~0% probability of successfully gathering data.
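To make the compounding concrete, here's a toy calculation (assuming independent steps and a flat 90% per-step success rate; real protocols are messier):

    p = 0.9  # assumed per-step success rate
    for n in (5, 10, 20, 30):
        # probability that all n chained steps succeed
        print(f"{n:2d} steps -> {p**n:.1%} chance of end-to-end success")

Twenty chained steps already leaves you with roughly a one-in-eight chance of the whole pipeline delivering data.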


You have just very eloquently expressed why I left a career in biochemistry behind after undergrad. Realistically I had no business doing that degree in the first place: I simply don't have the patience for the lab work grind.


Same reason I left Chemistry. Gotta grind for 80 hours a week for a decade for a chance at a good position. And hope you don’t get stuck under a PI that steals all the credit.


Seems like I am not alone - I still tried for 4 years to make it work in pharma but realised that spending 3 days growing primary cells just to get irreproducible results was not going to be satisfying.


The old joke by Sydney Brenner: "low input, high throughput, no output" science.


NGS is the counterexample here. We will also have single-cell proteomics in the near future. Robots can do lots of things like vary the initial conditions across a spectrum of variables... something no grad student would want to do.
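For illustration, here's a minimal full-factorial sweep of conditions you might hand to a liquid handler (the factors and levels are made up for the example):

    from itertools import product

    # Hypothetical factors and levels - purely illustrative
    temperatures_c = [25, 30, 37]
    ph_values = [6.5, 7.0, 7.5]
    inducer_um = [0, 10, 100, 1000]

    worklist = [
        {"well": i, "temp_c": t, "pH": ph, "inducer_uM": dose}
        for i, (t, ph, dose) in enumerate(product(temperatures_c, ph_values, inducer_um))
    ]
    print(len(worklist), "conditions")  # 3 x 3 x 4 = 36

No grad student wants to pipette 36 conditions in triplicate every day; a robot doesn't mind.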


> Get those linguists out of here, more data will replace whatever insights they have! It’s a fun and increasingly popular stance to take. And, to a degree, I agree with it. More data will replace domain experts, the bitter lesson is as true in biology as it is in every other field.

I think it’s fundamentally shifting how people approach R&D in all physical fields. The power of “the ML way” is almost a self-fulfilling prophecy. Once you see ML upend the standard approach in one area, the question is not if but when it will upend your area, and the natural next step is to ask, “how can I massively increase data collection rates so I can feed ML?” It just completely flips all branches of science on their heads, from carefully investigating and building first-principles theory to saying “screw it, I really just wanted to map this design space so I can accurately predict outcomes, why don’t I just build a machine to do that?”

It then becomes a question of how easy it actually is to build an ML-feeding machine (not easy, very problem-specific), ergo the pendulum now swings to physical lab automation.


In grad school (I was in chemical engineering) I took a molecular biology course. We read/reviewed a number of papers in different areas. For my review I proposed a series of experiments to answer questions raised by the paper. It was very logical and well thought out. Problem was it would have amounted to 3 grad students full time for at least a year. Once you see the effort involved you can see why the ML approach is exciting.


How exactly would ML speed up something that takes 3 grad students full time?

Listen. A lot of this shit gets discovered for crazy reasons. For two years these two postdocs were throwing away one fraction of their size exclusion chromatography step. I got into a really heated six-hour argument where I insisted that the postdocs did not understand that in size exclusion chromatography, big shit comes out first (they thought that big shit comes out last). The next day the postdoc apologized, since I was correct.

Oddly, a month or so later, they stopped to take a look at the fraction they were throwing out and it turned out that their molecule was self-assembling into cages. This is important for how the molecule is supposed to work. They got some very important papers out of it. I wasn't even thanked.

ML is not going to accelerate this sort of stuff.


A hypothetical good AI would have reviewed the experimental design and pointed out the misconception.


How would the AI know to look there? Trash in the "void volumes" at the ends of a chromatography run is common.

Anyways, what is your training set? Probably upwards of 50% of papers are trash. Will the AI have the intuition to know which ones are good? Does the AI listen at the water cooler to grad students griping about Corey yields?

(It's not on the internet. It's the general sentiment that yields reported by the E.J. Corey lab are inflated).


In biology, the most important step is finding the right thing to measure. Biological systems are highly contextual, so the second most important step is finding the second thing to measure in relationship to the first thing.

In the case of AlphaFold, measuring crystal structures is the most important thing (molecular phenotype). The second most important thing is measuring many genomes. Multiple sequence alignments allow evolution (variation under selection) to tell you about the important bits of the structure. The distance from aligned DNA sequences to protein structure isn't a bridge too far.
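As a minimal sketch of the idea that co-variation in an alignment points at the important positions - a toy alignment and plain mutual information here, not the coupling models the structure predictors actually build on:

    from collections import Counter
    from itertools import combinations
    from math import log2

    # Toy MSA (made-up aligned sequences); real alignments have thousands of rows
    msa = ["MKVALAT", "MKIALAS", "MRVAIAT", "MKVALGT"]

    def entropy(symbols):
        n = len(symbols)
        return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

    def mutual_information(col_i, col_j):
        # MI = H(i) + H(j) - H(i, j); high MI means the two columns co-vary
        joint = [a + b for a, b in zip(col_i, col_j)]
        return entropy(col_i) + entropy(col_j) - entropy(joint)

    cols = [[seq[i] for seq in msa] for i in range(len(msa[0]))]
    scores = {(i, j): mutual_information(cols[i], cols[j])
              for i, j in combinations(range(len(cols)), 2)}
    for (i, j), mi in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
        print(f"columns {i} and {j}: MI = {mi:.2f}")

Positions that mutate together tend to touch in the folded structure, which is exactly the evolutionary signal the alignments feed into the models.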

Unfortunately, biology has been misled by the popularity of transcriptomics, which the post touches on briefly (limits of single-cell approaches). Transcriptomics generates lots of data (relatively) cheaply, but isn't really the right thing to measure most of the time because it is too far removed causally from the organismal phenotype, the thing we generally care about in biomedicine. Although gene expression has provided some insights, we've exhausted most of its value by now and I doubt ML will rescue it (speaking from personal experience).


The flip side of this is that progress in ML for biology is always going to be _slower_ than progress in ML for natural languages and images [1].

Humans are natural machines capable of sensing and verifying the correctness of a piece of text or an image in milliseconds. So if you have a model that generates text or images, it’s trivial to see if they’re any good. Whereas for biology, the time to validate a model’s output is measured more in weeks. If you generate a new backbone with RFDiffusion, and then generate some protein sequences with LigandMPNN, and then want to see if they fold correctly … that takes a week. Every time. Use ML to solve _that_ problem and you’ll be rich.
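The standard cheap pre-filter here is in-silico self-consistency: predict the structure of the designed sequence and check how far it lands from the designed backbone before committing a week of lab time. A rough sketch with Biopython, assuming you already have same-length designed and predicted structures as PDB files (design.pdb and predicted.pdb are placeholder names):

    from Bio.PDB import PDBParser, Superimposer

    parser = PDBParser(QUIET=True)
    design = parser.get_structure("design", "design.pdb")        # designed backbone
    predicted = parser.get_structure("pred", "predicted.pdb")    # structure predictor output

    def ca_atoms(structure):
        # C-alpha trace of the first model
        return [res["CA"] for res in structure[0].get_residues() if "CA" in res]

    fixed, moving = ca_atoms(design), ca_atoms(predicted)
    sup = Superimposer()
    sup.set_atoms(fixed, moving)  # the two atom lists must be the same length
    print(f"CA RMSD: {sup.rms:.2f} angstroms")  # low RMSD = the design is at least self-consistent

It doesn't tell you whether the protein expresses or functions, but it keeps the weekly wet-lab cycle from being spent on obvious failures.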

TFA mentions the difficulty of performing biological assays at scale, and there are numerous other challenges, such as the number of different kinds of assays required to get the multimodal data needed to train the latest models like ESM-3 (which is multimodal, in this context meaning primary sequence, secondary structure, tertiary structure, as well as several other tracks). You can’t just scale a fluorescent product plate reader assay to get the data you need. We need sequencing tech, functional assays, protein-protein interaction assays, X-ray crystallography, and a dozen others, all at scale.

What I’d love to see companies like A-Alpha and Gordian and others do is see if they can use ML to improve the wet lab tech itself. Make the assays better, faster, cheaper with ML - like how Nanopore sequencers use ML to translate the electrical signal of DNA passing through the pore into a sequence. So many companies have these sweet assays that are very good. In my opinion, if we want transformative progress in biology, we should spend less time fitting the same data with different models, and spend more time improving and scaling wet lab assays using ML. Can we use ML to make the assays better, make our processes better, to improve the amount and quality of data we generate? The thesis of TFA (and experience) suggests that using the data will be the easy part.

1. https://alexcarlin.bearblog.dev/why-is-progress-slow-in-gene...


I read a great piece from Michael Bronstein about this very topic earlier this year.

https://towardsdatascience.com/the-road-to-biology-2-0-will-...

I think an important point raised here is the distinction between good data and the "relative" data present in a lot of biology. As examples from the article, a protein structure or genome/protein sequence data is good data, but data like RNA-seq or mass spectrometry data is relative (and subject to sensitivity / noise etc). The way I like to think of it is that sequence and structural data are looking at the actual thing, whereas the relative data only gets you a sliver of a snapshot of a process. Therefore it's easier to build models that capture relationships between representations of real things than models where you can't really distinguish between signal and noise. I spend a fair amount of time these days trying to figure out how to take advantage of good data to gain insights into things where we have relative data.


Not only is the lab incredibly low throughput, but most of the experiments also look at a single modality. Take a cell viability or FACS assay - while some additional measurements could be taken or analysed, most of the time the scientist will look at a single parameter. In a separate assay, the cells (another passage/day) will undergo a different readout, resulting in nearly incomparable data.

The solution: multimodal data and getting more info on the experimental setup (often a bit of voodoo and not written down properly).


Clonal DNA synthesis has increased in price over the past 6 years (even when accounting for inflation). On that metric, we’re actually regressing in our ability to modify the natural world. It’s even worse than stagnation.

Or even look at lab robotics - in 2015, you could buy a new Opentrons for $2,500. Now it’s about $10,000 - the only way to rival the old pricing is to scrounge around used-equipment sales.

Enzyme prices haven’t dropped in basically forever. Addgene increased plasmid prices a little while ago.

I feel like computer hackers can’t even imagine how bad it is over here.


This is super interesting, I had no idea this was happening! I assumed things were trending cheaper given that sequencing prices were dropping - why are things rising elsewhere?


For synthesis, it’s mainly because there is a monopoly and the challenge (DNA assembly) is super boring in comparison to sexy things like DNA synthesis.

For robotics, it’s because Opentrons started scaling and needed to make more $$$ rather than staying small (fair to them, SoftBank dollars are attractive)

Generally speaking, biology is hard and uncertain, so the market isn’t as competitive as you’d imagine. Sequencing prices stagnated a bit for a while in comparison to where they could be because Illumina didn’t have competition.


This is a low-quality piece about an overhyped subject.


What specifically did you find to be low quality about it? Why do you think the subject is overhyped?


apologies will do better


No, I should apologize. I was just grumpy when I commented and had no good reason to write that.


<3


Theranos 2.0: "we will do it all for you, give us money," aimed at old boomer investors in a decrepit, dying empire.


It's awesome when someone else writes your Series A pitch for you. www.molecularReality.comr



