I really love the work in this series; it feels like they are getting close to uncovering a periodic table of features/concepts that are common across all models.
I'm glad you've enjoyed it! If you like the idea of a periodic table of features, you might like the Early Vision article from the original Distill circuits thread: https://distill.pub/2020/circuits/early-vision/
We've had a much harder time isolating features in language models than in vision models (especially early vision), so I think we have a clearer picture on the vision side. And it seems remarkably structured! My guess is that language models are just making very heavy use of superposition, which makes it much harder to tease the features apart and develop a similar picture. Although we did get a tiny bit of traction here: https://transformer-circuits.pub/2022/solu/index.html#sectio...
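If the superposition framing is unfamiliar, here's a rough numpy sketch of the basic geometric intuition (my own toy illustration, not code from any of the papers): in high dimensions you can pack far more nearly-orthogonal directions than you have dimensions, so sparse features can share a space with only modest interference.

    import numpy as np

    rng = np.random.default_rng(0)
    n_dims, n_features = 128, 1024        # far more features than dimensions

    # Random unit vectors as feature directions.
    W = rng.standard_normal((n_features, n_dims))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    # Interference between features: off-diagonal dot products are small
    # (roughly 1/sqrt(n_dims) on average) but not zero.
    G = W @ W.T
    off_diag = np.abs(G[~np.eye(n_features, dtype=bool)])
    print("mean interference:", off_diag.mean())
    print("max interference: ", off_diag.max())

    # A sparse activation (only a few features on at once) can still be
    # read back out fairly well after being squeezed into 128 dims.
    active = rng.choice(n_features, size=5, replace=False)
    x = np.zeros(n_features)
    x[active] = 1.0
    hidden = W.T @ x                      # compress 1024 features -> 128 dims
    readout = W @ hidden                  # noisy estimate of all 1024 features
    recovered = np.argsort(readout)[-5:]
    print("recovered", len(set(recovered) & set(active)), "of 5 active features")

The point is just that when features are sparse, a model can tolerate a little interference in exchange for representing many more features than it has dimensions.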
I should mention, I've been a reader of Hacker News for years but never bothered to create an account or comment. These articles piqued my interest enough to finally get me to register and comment :)
Thank you for sharing these; I will definitely check them out! The concept of superposition here is new to me, but the way it's described in these articles makes it very clear. The connection to compressed sensing and the Johnson–Lindenstrauss lemma is fascinating. I am very intrigued by your toy model results, especially the mapping out of the double-descent phenomenon. Trying to understand what is happening to the model in this transition region feels very exciting.
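For anyone else reading along: the Johnson–Lindenstrauss connection clicked for me after a quick sanity check like the following (my own toy, not from the articles). A random linear map down to far fewer dimensions roughly preserves pairwise distances:

    import numpy as np

    rng = np.random.default_rng(1)
    n_points, d_high, d_low = 200, 10_000, 300

    X = rng.standard_normal((n_points, d_high))                # points in high dim
    P = rng.standard_normal((d_high, d_low)) / np.sqrt(d_low)  # random projection
    Y = X @ P                                                  # points in low dim

    def pairwise_dists(Z):
        # Squared pairwise distances via the Gram matrix, then sqrt.
        sq = np.sum(Z ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
        return np.sqrt(np.maximum(d2, 0.0))

    iu = np.triu_indices(n_points, k=1)
    ratio = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]
    # Distances shrink/stretch only modestly even after dropping 10k dims -> 300.
    print("distortion: min %.3f, max %.3f" % (ratio.min(), ratio.max()))

As I understand it, this is part of why superposition is even possible: many sparse features can live in a much lower-dimensional space without stepping on each other too badly.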
My best guess at the middle regime is that there are _empirical correlations between features_ due to the limited data. That is, even though the features are truly independent, at some dataset size a few of them will, by happenstance, start to look correlated. So instead of representing a single feature, the model can represent something a bit more general, something like a "principal component". But it's all an illusion created by the limited data, and so it leads to terrible generalization!
This isn't something I've dug into. The main reason I suspect it is that, if you look at the start of the generalizing regime, you'll see that each feature has a few other features slightly embedded in the same direction as it. These seem to be features with slight empirical correlations. So that's suggestive about the transition regime. But this is all speculation -- there's lots we don't yet understand!
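To make the "spurious correlations from limited data" part of that guess concrete, here's a quick toy check (just an illustration, not the actual toy-model setup): features that are independent by construction still look somewhat correlated in a small sample, and the effect vanishes as the dataset grows.

    import numpy as np

    rng = np.random.default_rng(2)
    n_features, sparsity = 50, 0.1

    for n_samples in (200, 2_000, 200_000):
        # Independent sparse binary features: each fires with probability `sparsity`.
        X = (rng.random((n_samples, n_features)) < sparsity).astype(float)
        C = np.corrcoef(X, rowvar=False)
        off = np.abs(C[~np.eye(n_features, dtype=bool)])
        print(f"n = {n_samples:>7}: max sample |correlation| = {off.max():.2f}")

At small dataset sizes, some pairs of truly independent features look correlated enough that sharing a direction for them could plausibly help on the training set, but that structure isn't real, which would fit with the terrible generalization in the middle regime.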