Thanks for this; it's an intriguing approach to thinking about transformers (or predictors in general)!
For extracting the fractal from the residual stream, did I understand it correctly as follows: you repeatedly sample the transformer, each time recording both the actual internal state of the HMM and the (higher-dimensional) residual stream activations. Then you perform a linear regression to obtain a projection matrix from the residual stream vector to the HMM state vector.
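Concretely, here's the kind of pipeline I'm picturing, just as a minimal sketch; the shapes, the choice of layer/position, and the use of sklearn's LinearRegression are my own assumptions, not something stated in the article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-ins for data collected by running the trained transformer on many
# sequences sampled from the HMM (shapes are illustrative):
#   resid:   (n_samples, d_model)   residual-stream activations at some layer/position
#   beliefs: (n_samples, n_states)  the corresponding belief state over HMM hidden states
n_samples, d_model, n_states = 10_000, 64, 3
resid = np.random.randn(n_samples, d_model)                        # placeholder for real activations
beliefs = np.random.dirichlet(np.ones(n_states), size=n_samples)   # placeholder for real belief states

# Fit a linear map from residual stream to the belief simplex.
reg = LinearRegression().fit(resid, beliefs)

# Project the residual stream through the fitted map and inspect the resulting
# points in the (n_states - 1)-dimensional simplex for the fractal structure.
projected = reg.predict(resid)
```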
If so, doesn't that risk "finding" something that isn't necessarily there? I agree that the mixed-state structure is clearly represented in the transformer in this case, but in general I don't think that, strictly speaking, finding a particular kind of structure when projecting transformer "state" onto known world "state" proves that the transformer models the world states, and its beliefs about them, in that same way. Think "correlation is not causation". Maybe this is splitting hairs (in effect, what does it matter how exactly the transformer "works" when we can "see" the expected mixed-state structure inside it?), but I am slightly concerned that we introduce our own knowledge of the world through the linear regression.
For example, consider a world with two indistinguishable states (among others), and a predictor that (noisily) models those two with just one equivalent state. Wouldn't the linear regression/projection of predictor states onto world states risk "discovering" the two world states in the predictor, even though they don't actually exist there in isolation at all?
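To make that worry concrete, here is a toy sketch (entirely my own construction, not the article's setup): the "predictor state" below carries no information at all about which of the two world states is active, yet an in-sample regression in a high-dimensional space can still appear to separate them by fitting noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_test, d = 200, 200, 512   # few samples, high-dimensional "residual stream"

def make_data(n):
    # World state is 0 or 1, but the predictor's internal state is drawn from
    # the same distribution either way: it genuinely cannot tell them apart.
    world = rng.integers(0, 2, size=n)
    predictor_state = rng.normal(size=(n, d))
    return predictor_state, world

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

reg = LinearRegression().fit(X_train, y_train)

# In-sample, the regression "separates" the two world states almost perfectly
# (d > n, so it can fit the training noise exactly); on held-out data it does
# no better than chance.
print("train R^2:", reg.score(X_train, y_train))   # ~1.0
print("test  R^2:", reg.score(X_test, y_test))     # ~0 or negative
```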
Again, I'm not doubting the article's conclusions about how this particular transformer models this particular world. I am only hypothesizing that, for more complex examples with "messier" worlds, looking for the best projection onto the known world states is dangerous: it presupposes that the world states span a genuine subspace of the residual stream (or something equivalent).
Would be happy to be convinced that there is something deeper that I'm missing here. :)