This seems closely related to the "Mixtral" approach of a mixture-of-experts transformer [1]... I'm not claiming the approach isn't original; the comparison just helped me understand what was going on.
Consider a case of two "experts" or two "value parameter tokens."
The mixture of experts has a "router" network that provides a weight to each expert (through a softmax) conditional on an input. The output is a (sparse) weighted sum of the outputs of the experts.
The TokenFormer has an "attention" layer that combines the input token with the key parameters to provide a weight for each "value parameter" token. A(B+C) = AB + AC definitionally, so this is like applying a weighted sum of distinct transformations.
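To make the two-"expert" case concrete, here is a tiny numpy illustration of my own (not code from either paper) showing that a softmax-weighted sum of two linear "experts" equals applying a single softmax-weighted combination of their matrices:
import numpy as np
rng = np.random.default_rng(0)
d = 4
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # two linear "experts" / value-parameter tokens
x = rng.normal(size=(d,))  # an input token
scores = rng.normal(size=2)  # stand-in for router / attention scores
w = np.exp(scores) / np.exp(scores).sum()  # softmax weights
out_mixture = w[0] * (x @ W1) + w[1] * (x @ W2)  # weighted sum of expert outputs
out_combined = x @ (w[0] * W1 + w[1] * W2)  # one weighted-sum transformation, since A(B+C) = AB + AC
print(np.allclose(out_mixture, out_combined))  # True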
I think the differences are: a) where the non-linearity hits (the above description doesn't consider an activation function), b) this attention softmax is not (necessarily) sparse, c) that "mixtral" networks only replace the feed-forward components of the layer, and d) that extending a "mixtral" approach would require re-training the "router" layers.
It seems like (d) is maybe the nicest feature here... my intuition would think (a) doesn't matter much, (b) is debatable (how close a sparse-MoE can approximate a dense-MoE), (c) has probably been tried (guessing the ffwd limitation was just "more-bang-for-buck-given-parameters" not an oversight)...
... I wonder, though, if there might be diminishing returns here (I believe Mixture-of-Experts tends to struggle with imbalanced "winner-take-all" dynamics, since "early" winners get more gradient signal to improve their weights), and how different this would have been from going from a 3x7B to an 8x7B to a 24x7B training approach (with a "retrain routing networks" step).
This code seems to somewhat re-create the effect, except that the noise is symmetric on both sides (also true of the xkcd-style plot in this article, actually). Plot: https://imgur.com/a/dGZyylf
It seems a crucial piece of context is that there is some correlation between perceived and actual performance. There is also a global optimistic bias: apparently, across experiments, people perceive their rank to be around the 66th percentile on average. This universal effect leaves more room for error among the under-performers, while the high performers end up closer to that 66th percentile almost tautologically, being bounded above by a smaller amount of space.
import numpy as np
from scipy.linalg import cholesky
from scipy.stats import norm
from matplotlib import pyplot as plt
np.random.seed(seed=12345)
#Draw correlated random variables
#Ref here: https://scipy-cookbook.readthedocs.io/items/CorrelatedRandomSamples.html
num_samples = 125 * 4 #divisible by four so the quartile split below works out exactly
x = norm.rvs(size=(2, num_samples)) # uncorrelated random normal variables
expected_correlation = np.array([[1.0, 0.19], [0.19, 1.0]])
c = cholesky(expected_correlation, lower=True) #there's a slight correlation of R = 0.19 between actual and perceived scores (according to Ackerman, 2002)
trans_x = np.dot(c, x)
#perceived / actual readings with R=0.19
perceived = trans_x[0, :]
actual = trans_x[1, :]
#Sort both variables by actual scores.
sort_by_actual = sorted(range(num_samples), key = lambda idx: actual[idx])
perceived_by_actual = [perceived[i] for i in sort_by_actual]
actual_by_actual = [actual[i] for i in sort_by_actual]
quartile_indices = [i * (num_samples // 4) for i in range(5)] #note: depends on divisibility by four
x_coords = [(start + end) // 2 for (start, end) in zip(quartile_indices[:-1], quartile_indices[1:])] #quartile mid-points, just for plotting
perceived_means = [np.mean(perceived_by_actual[start:end]) for (start, end) in zip(quartile_indices[:-1], quartile_indices[1:])]
actual_means = [np.mean(actual_by_actual[start:end]) for (start, end) in zip(quartile_indices[:-1], quartile_indices[1:])]
#Plot
fig = plt.figure()
ax1 = fig.add_subplot(111)
plt.title("Dunning-Kruger")
ax1.scatter(x_coords, perceived_means, marker="s", label="perceived")
ax1.scatter(x_coords, actual_means, marker="o", label="true")
ax1.legend()
plt.show()
Summary, because the title overstates the significance of a cool paper a bit: most sequencing today is done on Illumina machines, which basically break DNA into small pieces (on the order of a hundred letters/bases), then use fluorescence/imaging to read the sequence.
This paper applies to a new technology, nanopore-based sequencing, which pulls much longer pieces of DNA (record length: 2.3 million base pairs, but most end up shorter, on the order of thousands of bases) through a microscopic molecular channel, providing real-time outputs of voltage that can be somewhat noisily mapped to nucleotide sequences (since different DNA sequences have different voltage outputs when run through a channel).
This technology is very cool, and the fact that it's real-time opens up an interesting idea: you can potentially give real-time feedback to the sequencer as the run is operating about whether a given piece of DNA it's reading is interesting to you. If it's not, the channel can spit out the piece of DNA and start reading in a new one.
So, for example, let's say you want to sequence the viral sequences present in a human tissue sample. Naively, if you just collect all the DNA in the sample, most of it is going to be from the human genome (the human genome is ~100,000x bigger than viral genomes, and likely not all cells are infected). This approach maps the DNA as it's being read to some reference (in this case, the human genome) and avoids fully sequencing pieces of DNA that map to the reference (by getting the channel to spit out the piece of DNA it's reading and wait for a new one to come in). Your sequencing results are therefore enriched for the actual sequences of interest.
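As a back-of-the-envelope illustration of why ejecting reference-mapping reads enriches the output, here is a toy simulation of my own (the specific numbers, e.g. a ~450-base decision prefix, 5 kb reads, and 99% host DNA, are made up for illustration and are not from the paper):
import random
random.seed(0)
def simulate(n_molecules=10_000, host_fraction=0.99, read_len=5_000, prefix=450):
    # If a molecule maps to the host reference after a short prefix, eject it;
    # otherwise sequence it to the end.
    host_bases = target_bases = 0
    for _ in range(n_molecules):
        if random.random() < host_fraction:
            host_bases += prefix  # ejected early
        else:
            target_bases += read_len  # kept
    return target_bases / (host_bases + target_bases)
print(f"fraction of sequenced bases from targets: {simulate():.2f}")  # ~0.10, vs ~0.01 with no ejection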
In terms of the contributions of this paper: nanopore sequencing is still a growing area, and this real-time aspect has been a key motivation for some time. The paper improves the feasibility of the approach through better algorithms for mapping between real-time voltage readings and reference sequences. This has to be fast to be effective, since DNA is read quite quickly and in parallel across many channels, and previous methods apparently weren't fast enough to provide real advantages.
The caveats are that this only applies to cases where you don't want to read the majority of DNA present (an important use case but not universal), nanopore sequencing still has issues with high error rates which makes it a bit less attractive than Illumina sequencing, and the amount of DNA you can read through nanopore is still less than what you can do with Illumina. So it's a cool step on the way to a future where we can do some really exciting "interactive" real-time sequencing work but it's still a part of a developing technology suite.
The problem of experienced professionals vs grad students seems to be a real problem for science. Academia could really do with a build-out of experienced research scientists able to make careers out of consistently building experience and knowledge.
Instead, most labs are oriented around a PI who has little time to look into the details of most projects due to grant-writing pressures, post-docs who are not expected to stay in the lab long-term but whose modal career-advancement expectation is the incredibly difficult "land a tenure-track position" option, and graduate students who leave as soon as they've accrued enough scientific knowledge to be truly useful.
It's purely a cost-structure decision on a per-lab basis, but on a societal level it's costly to have most science spearheaded by inexperienced graduate students (who mess up experiments in ways more experienced researchers would not), and costly to train so many people in niche scientific fields only for them to leave and join unrelated tech companies. Salary may play a part in this decision, but many scientists are passionate about their field of study, and it seems like the real deal-breaker is the lack of career options after the "researcher-in-training" phase.
A problem that I was surprised to find not well solved is the study of paths (not represented as graphs). Trying to cluster trajectories through real coordinates over time, for example, has some prior work (there's a somewhat unintuitive and costly metric called the Frechet distance that can be used), but it is not the solved problem I would have expected for such a fundamental task, one that would seem to come up a lot in human-movement applications. Answering a question like "what are the most commonly used paths to get to work" without resorting to discretizing steps into nodes on a graph seems tricky as far as I can tell.
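For what it's worth, the discrete variant of the Frechet distance is a short dynamic program, but it costs O(n*m) time and memory per pair of trajectories, which is part of why clustering many long paths gets expensive. A minimal numpy sketch of my own:
import numpy as np
def discrete_frechet(p, q):
    # p, q: trajectories as arrays of shape (n, 2) and (m, 2)
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pointwise distances
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
    return ca[n - 1, m - 1]
a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
print(discrete_frechet(a, b))  # 1.0; a pairwise matrix over many paths can then feed any metric-based clustering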
The posted article isn't particularly fascinating, but for a bit of fun, there's an OpenAI project where they demonstrate that due to the non-linear rounding of Float32 values you can actually train "non-linear" linear networks: https://openai.com/blog/nonlinear-computation-in-linear-netw...
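A quick illustration of the underlying trick (my own toy example, not the OpenAI code): float32 arithmetic stops being linear near the smallest representable values, because products can underflow to zero:
import numpy as np
x = np.float32(1e-38)
c = np.float32(1e-8)
print((x * c) / c)  # 0.0 -- x*c underflowed, so scaling down then back up is not invertible
print((x / c) * c)  # ~1e-38 -- the "same" linear ops in a different order recover x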
The DNA/RNA that encodes the proteins can itself be structured in a way that might be disrupted by synonymous amino acid changes. In particular, recent work in the field has shown that changing codons near the start of the gene can disrupt transcriptional/translational machinery.
These effects are often minor, but this bacterium has undergone hundreds of millions of years of optimization via natural selection, and some researchers have come along and disrupted 18,000 sites. Probably the slower growth and length abnormalities just mean the bacterium is a little miscalibrated and displaying minor symptoms of malaise.
The Decameron's first section, "The Plague of Florence," has a contemporaneous account of the plague in Florence that has always stuck with me (you can see it here: http://faculty.sgc.edu/rkelley/The%20Decameron.pdf ).
The whole section is worth reading but a fragment following descriptions of people that exhibited extreme temperance out of fear, or extreme partying out of nihilism is, "Of the adherents of these divers opinions not all died, neither did all escape; but rather there were, of each sort and in every place, many that sickened, and by those who retained their health were treated after the example which they themselves, while whole, had set, being everywhere left to languish in almost total neglect. Tedious were it to recount, how citizen avoided citizen, how among neighbours was scarce found any that shewed fellow-feeling for another, how kinsfolk held aloof, and never met, or but rarely; enough that this sore affliction entered so deep into the minds of men and women, that in the horror thereof brother was forsaken by brother, nephew by uncle, brother by sister, and oftentimes husband by wife; nay, what is more, and scarcely to be believed, fathers and mothers were found to abandon their own children, untended, unvisited, to their fate, as if they had been strangers. Wherefore the sick of both sexes, whose number could not be estimated, were left without resource but in the charity of friends (and few such there were), or the interest of servants, who were hardly to be had at high rates and on unseemly terms, and being, moreover, one and all men and women of gross understanding, and for the most part unused to such offices, concerned themselves no farther than to supply the immediate and expressed wants of the sick, and to watch them die; in which service they themselves not seldom perished with their gains."
This passage always struck me as a particularly living and breathing account of what it might have been like to live through that time in human history.
Worth noting that there has been a fair bit of good research in causal machine learning in the last year or so, for example "Implicit Causal Models for Genome-wide Association Studies" (https://arxiv.org/pdf/1710.10742.pdf).
The key point of this paper is that neural networks really are very good at "curve fitting" and that this curve fitting in the context of variational inference has advantages for causal reasoning, too.
Neural networks can be slotted into a wide variety of model structures, and those structures tend to benefit from the inclusion of powerful, trainable non-linear function approximators. In this sense, deep learning will continue to be a powerful tool despite some limitations in how it is currently used.
I think Pearl, who has obviously remained very influential for many practitioners of machine learning, knows the value of "curve fitting". However, it's hard in a brief interview to sit down and have a real conversation about the state of the art of an academic field, and the "Deep Learning is Broken" angle is a bit more attractive.
It's worth considering that anywhere in graphical models where coefficients of any sort are learned, those coefficients can be augmented by neural networks (as in the last decade of natural language processing, where the state of the art on almost every problem has been successfully neuralized).
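As a toy sketch of what "neuralizing" a coefficient looks like (my own illustration, not from Pearl or the paper): take a structural equation model where each mechanism would normally be a fixed coefficient, and swap in a small non-linear function approximator instead:
import numpy as np
rng = np.random.default_rng(0)
def neural_mechanism(d_in, hidden=16):
    # a tiny (here untrained) MLP standing in for a structural coefficient
    W1, b1 = rng.normal(size=(d_in, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)
    return lambda v: np.tanh(v @ W1 + b1) @ W2 + b2
f_x = neural_mechanism(1)  # mechanism for z -> x
f_y = neural_mechanism(2)  # mechanism for (x, z) -> y
z = rng.normal(size=(500, 1))
x = f_x(z) + 0.1 * rng.normal(size=(500, 1))
y = f_y(np.hstack([x, z])) + 0.1 * rng.normal(size=(500, 1))  # data generated from the "neuralized" SCM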
I wonder if deep belief networks and that flavor of generative model, which seem closer in nature to Pearl's PGMs, have a chance to bridge the gap.
Edit, as an aside: given the enormously high dimensionality of personal genomes and the incredibly small sample sizes, for over a decade I've failed to put any trust in GWAS studies, and I've found my suspicion supported on a number of occasions, with the difficulty of reproducing results likely brought about by the above problem. Is there any reason to think that improved statistical methods can surmount the fundamental problem of limited sample size and high dimensionality?
Numerous important biomedical findings have resulted from GWAS. Most GWAS today are inherently reproducible, since their hits usually come from multi-stage designs with independent samples. Sample sizes are no longer "incredibly small" either; large GWAS often have on the order of hundreds of thousands of participants, and some have over a million.
I suppose the most important idea is that GWAS aren't really supposed to show causality. "Association" is in the name. GWAS are usually hypothesis generating (e.g., identification of associated variants) and then identified variants can be probed experimentally with all of the tools of molecular biology.
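To make the hypothesis-generating step concrete, here is a toy sketch with simulated data (my own, not a real GWAS pipeline) of the per-variant association testing that produces candidate hits:
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
n_people, n_snps = 2_000, 5_000
genotypes = rng.binomial(2, 0.3, size=(n_people, n_snps))  # 0/1/2 allele counts per person per SNP
phenotype = 0.5 * genotypes[:, 42] + rng.normal(size=n_people)  # SNP 42 is truly associated
pvals = np.array([stats.linregress(genotypes[:, j], phenotype).pvalue for j in range(n_snps)])
print(np.where(pvals < 5e-8)[0])  # candidate variants to follow up experimentally (should recover SNP 42)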
In summary, GWAS have their problems, but I think your statement is a bit too strong.
Mendelian randomization is a good technique to start thinking about causality for epidemiological studies.
This is a good paper that demonstrates the approach: https://www.nature.com/articles/srep16645
Millard, Louise AC, et al. "MR-PheWAS: hypothesis prioritization among potential causal effects of body mass index on many outcomes, using Mendelian randomization." Scientific reports 5 (2015): 16645.
Thousands of samples and millions of dimensions still doesn’t strike me as an easy problem, but it makes sense to me that downstream molecular biology can verify putative associations.
Thank you for weighing in.
[1] https://arxiv.org/abs/2401.04088