MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all intermediate values associated with the same intermediate key. We built a system around this programming model in 2003 to simplify construction of the inverted index for handling searches at Google.com. Since then, more than 10,000 distinct programs have been implemented using MapReduce at Google, including algorithms for large-scale graph processing, text processing, machine learning, and statistical machine translation. The Hadoop open-source implementation of MapReduce has been used extensively outside of Google by a number of organizations.
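For readers unfamiliar with the model, here is a minimal single-process word-count sketch in Python. It is illustrative only, not Google's implementation (which shards the work across thousands of machines and handles failures); it just shows the shape of the map and reduce functions the model requires:

from collections import defaultdict

def map_fn(doc_id, text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Merge all intermediate values associated with the same intermediate key.
    return word, sum(counts)

def run_mapreduce(documents, map_fn, reduce_fn):
    # Toy single-process driver: group intermediate values by key, then reduce.
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

docs = {"a.txt": "the cat sat", "b.txt": "the dog sat"}
print(run_mapreduce(docs, map_fn, reduce_fn))  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}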
interesting observation and experience. must have made thesis development complex, assuming the realization dawned on you during the phd.
what do you trust more than NMR?
AF's dependence on MSAs also seems sub-optimal; curious to hear your thoughts?
that said, it's understandable why they used MSAs, even if it seems to hint at prioritizing winning CASP over developing a generalizable model.
arguably, MSA-dependence is the wise choice for early prediction models, as demonstrated by the widespread accolades and adoption, i.e., it's an MVP with known limitations as they build toward more sophisticated approaches.
My realizations happened after my PhD. When I was writing my PhD I still believed we would solve the protein folding and structure prediction problems using classical empirical force fields.
It wasn't until I started my postdocs, where I started learning about protein evolutionary relationships (and competing in CASP), that I changed my mind.
I wouldn't frame it so much as "multiple sequence alignments"; those are just tools to express protein relationships in a structured way.
If AlphaFold now, or in the future, requires no evolutionary relationships based on sequence (UniProt) and can work entirely by training on just the proteins in the PDB (many of which are evolutionarily related) while still being able to predict novel folds, it will be very interesting times. The one thing I have learned is that evolutionary knowledge makes many hard problems really easy, because you're taking advantage of billions of years of nature and an easy readout.
this is very astute, not only about deepmind but about science and humanity overall.
what CASP did was narrowly scope a hard problem, provide clear rules and metrics for evaluating participants, and offer a regular forum in which candidates can showcase skills -- they created a "game" or competition.
in doing so, they advanced the state of knowledge regarding protein structure.
how can we apply this to cancer and deepen our understanding?
specifically, what parts of cancer can we narrowly scope that are still broadly applicable to a complex heterogeneous disease and evaluate with objective metrics?
[edited to stress the goal of advancing cancer knowledge, not to "gamify" cancer science but to create structures that invite more ways to increase our understanding of cancer.]
this is correct. it's clear that smoking elevates cancer risk, but why doesn't it cause cancer in all smokers?
to develop a cure, we must better understand the causal mechanisms.
this starts with acknowledging what we know and don't know about a devilishly complex disease that is arguably better conceptualized as a broad category rather than one monolith -- similar to how the flu, cold, and covid could be grouped under one mega classification, but are better identified as distinct conditions.
I think we understand the causal mechanism pretty well. It's just that laypeople struggle with binary thinking vs. probability.
The reason why only 20% of smokers get cancer is similar to why a person doesn't get cancer after 1 cigarette.
This is only counterintuitive if your default thinking is that smoking = cancer. In reality, there are a lot of variable chemical and biological processes involved, but it ultimately boils down to cumulative risk, not a guarantee.
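To make the probability point concrete, here is a minimal sketch using a purely hypothetical per-cigarette probability (not an epidemiological estimate) just to show how per-exposure risk compounds into cumulative risk without ever becoming a guarantee:

# Hypothetical per-exposure probability of an oncogenic event -- purely
# illustrative, NOT a real epidemiological estimate.
p_per_cigarette = 1e-7

def cumulative_risk(p, n_exposures):
    # Probability that at least one exposure "hits", assuming independent exposures.
    return 1 - (1 - p) ** n_exposures

print(cumulative_risk(p_per_cigarette, 1))               # one cigarette: ~1e-7
print(cumulative_risk(p_per_cigarette, 20 * 365 * 30))   # pack a day for 30 years: a few percent, still not certainty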
full tweet since truncation was required for submission:
Does #RAG/web search solve #LLM hallucinations?
We find that even with RAG, 45% of responses by #GPT4 to medical queries are not fully supported by retrieved URLs. The problem is much worse for GPT-4 w/o RAG, #Gemini and #Claude arxiv.org/pdf/2402.02008…
not fungal, but pathogen-related: stanford researchers in 2022 identified how epstein-barr virus (EBV) could be one cause of MS. essentially, EBV proteins may mimic a human protein and induce the immune system to mistakenly attack the body’s nerve cells.
could you elaborate on the mechanical/physical limitations that cause SOTA actuators to lag behind muscles, and if there's an equivalent "moore's law" that might predict when this gap closes appreciably, if ever?