Nucleotide Transformer: building robust foundation models for human genomics (nature.com)
57 points by bookofjoe | 9 comments



Cool! I don’t understand what genomic tasks it solves, though. What can it actually do?

Also, can we train this same model on regular language data so we can converse about the genomes? I suppose a normal multimodal model can talk in English about what it sees in images. Could we have a similar thing with genomes? I.e., DNA is just another modality in a multimodal model.


> Also can we train this same model on regular language data so we can converse about the genomes?

Yes! That is what has been done in ChatNT [1], where you can ask natural language questions like "Determine the degradation rate of the human RNA sequence @myseq.fna on a scale from -5 to 5." and ChatNT will answer with "The degradation rate for this sequence is 1.83."

> My biggest point of confusion is what type of practical things these models can do.

See for example this notebook [2], where the Nucleotide Transformer is finetuned to classify genomic sequences into two of the most basic genomic motif types: promoters and enhancers.
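
A minimal sketch of the shape of that finetuning, for the curious (the checkpoint name and toy data here are my assumptions; the notebook [2] is the authoritative version):

  # Hedged sketch: binary promoter classification with a Nucleotide
  # Transformer checkpoint (name assumed; see [2] for the real pipeline).
  import torch
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  ckpt = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # assumption
  tokenizer = AutoTokenizer.from_pretrained(ckpt)
  model = AutoModelForSequenceClassification.from_pretrained(
      ckpt, num_labels=2)  # 0 = non-promoter, 1 = promoter

  seqs = ["ATTCTG" * 50, "GCCGGA" * 50]  # toy sequences, not real data
  labels = torch.tensor([0, 1])

  batch = tokenizer(seqs, padding=True, return_tensors="pt")
  loss = model(**batch, labels=labels).loss
  loss.backward()  # one step of an ordinary finetuning loop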

Disclaimer: I work at InstaDeep but was not involved in either of the above projects.

[1] https://www.biorxiv.org/content/10.1101/2024.04.30.591835v2

[2] https://github.com/huggingface/notebooks/blob/main/examples/...


Possibly a dumb question, but are these models useful for homology finding? If you have two homologous genes, do they have similar embeddings?

The reason I ask is that I have a bunch of genes where I can’t get much better than a 1:many orthology mapping, and if this method can capture related promoters/intronic regions etc. per gene and tell me whether they are related, that would be a huge help (assuming this works on eukaryotic genomes).
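
To make that concrete, the test I have in mind is roughly this: mean-pool each sequence’s final hidden states into one vector and compare cosine similarity (the checkpoint name is a guess on my part):

  # Sketch: do two sequences get similar embeddings? (checkpoint assumed)
  import torch
  from transformers import AutoTokenizer, AutoModel

  ckpt = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # assumption
  tokenizer = AutoTokenizer.from_pretrained(ckpt)
  model = AutoModel.from_pretrained(ckpt)

  def embed(seq):
      # Mean-pool the last hidden layer into one vector per sequence.
      toks = tokenizer(seq, return_tensors="pt")
      with torch.no_grad():
          hidden = model(**toks).last_hidden_state  # (1, n_tokens, dim)
      return hidden.mean(dim=1).squeeze(0)

  a = embed("ATGGCCAAAGCTTGGCAT")  # toy sequences; use real orthologs
  b = embed("ATGGCTAAAGCTTGGCAC")
  print(torch.cosine_similarity(a, b, dim=0).item())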


I’ve been trialing a bunch of these models at work. They basically learn where the DNA has important functions, and what those functions are. It’s very approximate, but up to now that’s been very hard to do from just the sequence and no other data.


> It’s very approximate, but up to now that’s been very hard to do from just the sequence and no other data.

The Syn 1.0 synthetic genome project used a promoter search algorithm written in COBOL by one of the leaders, and one of the professors on the project had a WordPerfect macro that found protein sequences. Point being, they weren’t the best programmers in the world. I would hardly say it’s been "very hard".


It depends on what sort of model you're implementing.

There's a big implementation (and result-quality) difference between direct string searching (fixed pattern matching) and probabilistic methods (everything from simple profile methods to hidden Markov models). Finding direct matches is the same as the "string.find()" method, while probabilistic methods usually involve dynamic programming, heuristic approximations, floating-point matrices, etc.
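
To illustrate with a toy example (made-up motif and weights, not any particular tool):

  import math

  seq = "TTGACAATTTAAGCTATAATGG"

  # 1) Fixed pattern matching: exact hits only, like str.find().
  print(seq.find("TATAAT"))  # -> 14, or -1 if the motif isn't exact

  # 2) Probabilistic profile: a position weight matrix (toy numbers)
  #    scores every window, so near-matches still get partial credit.
  pwm = [{"A": .05, "C": .05, "G": .05, "T": .85},
         {"A": .85, "C": .05, "G": .05, "T": .05},
         {"A": .05, "C": .05, "G": .05, "T": .85},
         {"A": .60, "C": .10, "G": .10, "T": .20},
         {"A": .60, "C": .10, "G": .10, "T": .20},
         {"A": .05, "C": .05, "G": .05, "T": .85}]

  def score(window):  # log-odds vs. a uniform 0.25 background
      return sum(math.log(col[b] / 0.25) for col, b in zip(pwm, window))

  best = max(range(len(seq) - 5), key=lambda i: score(seq[i:i + 6]))
  print(best, seq[best:best + 6])  # highest-scoring window, exact or not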

But more importantly, techniques like Nucleotide Transformers are much less supervised than existing search techniques. Previously, people had to do a fair amount of labelling and QC work to identify the patterns underlying general sequence categories; these methods learn them spontaneously from the data. I could imagine building an entire transformer model in COBOL, although it would be cumbersome; building one with a WordPerfect macro would be extremely challenging if not impossible. Even a profile-based method would be painful (I don't know if WP macros are Turing-complete/general-purpose).
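
For a concrete sense of the "less supervised" part: pretraining needs no labels at all, because the masking objective manufactures its own (a sketch of the idea, not the paper's exact recipe):

  import random

  def mask_for_mlm(tokens, mask_rate=0.15, mask_token="<mask>"):
      # Hide random tokens; the model's only job is to guess them back,
      # so the "labels" come from the raw sequence itself.
      inputs, targets = [], []
      for tok in tokens:
          if random.random() < mask_rate:
              inputs.append(mask_token)  # what the model sees
              targets.append(tok)        # what it must predict
          else:
              inputs.append(tok)
              targets.append(None)       # no loss on unmasked positions
      return inputs, targets

  print(mask_for_mlm(list("ATGGCCATTGTAATGGGCCGC")))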

I don't think it's particularly fair or nice to imply that the work being done here is the same sort of work that was being done with a promoter search algorithm; I'm an expert in this area and you're being unnecessarily dismissive. The field has come a long way.


> from just the sequence and no other data

This is my real question with these... we already have a ton of other data in genomics, so many of the important regions are already known and studied. And really, the functional importance of any given region/sequence is highly context/cell-type specific. Given this, what are the use cases? What kind of hypothesis generation can these models lead to that we aren't currently doing in genomics?


The whole idea of unsupervised learning is to find patterns in the data that people wouldn't have easily found by manually looking for categories/labels. So far most of the categories we've identified and manually clustered (to build statistical models that find more of them) have taken extensive discovery biology and curation efforts.


That’s really cool. Can you share any insights the models have given you? My biggest point of confusion is what type of practical things these models can do.

(Or use the email in my profile if you can’t share publicly.)



