Theoretically, HMMs are "models of the world", and transformers trained on HMM output are approximating the belief-state updates an HMM performs in the "forward" algorithm.
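To make that concrete, here's a minimal sketch of one belief-state update from the forward algorithm, with a hypothetical 3-state HMM (the transition matrix `T` and emission matrix `E` are made up for illustration, not taken from the article):

```python
import numpy as np

# Hypothetical 3-state, 2-symbol HMM (illustrative numbers only).
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])  # T[i, j] = P(next state j | current state i)
E = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])       # E[i, k] = P(emit symbol k | state i)

def belief_update(belief, obs):
    """One normalized step of the forward algorithm: predict with T,
    reweight by the likelihood of the observed symbol, renormalize."""
    b = (belief @ T) * E[:, obs]
    return b / b.sum()

belief = np.ones(3) / 3          # start from a uniform belief
for obs in [0, 0, 1, 1]:
    belief = belief_update(belief, obs)

print(belief)  # a probability distribution over the 3 hidden states
```

The claim, as I read it, is that a transformer predicting the HMM's output well must implicitly track something equivalent to this `belief` vector.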
Something seems sus about how the linear projection ended up in exactly the same shape as their prediction. Also, their projection seems to stay the same shape throughout training. Typically, projections look like they "spin around" as they move from a random point cloud to the separated shapes, but I haven't run experiments on transformers, and it's unclear what they mean by "projection".
Yes, the projection is plausibly why it looks like a simplex/triangle: a belief state over 3 hidden states is a probability distribution, so it lives on the 2-simplex, and any linear projection of that set is a triangle.
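A quick sketch of that geometric point, assuming nothing about the article's actual projection (the matrix `P` below is an arbitrary illustrative choice): any distribution over 3 states is a convex combination of the three one-hot beliefs, so a linear projection sends the whole set into the triangle spanned by the three projected one-hot vectors.

```python
import numpy as np

# Where the three one-hot ("certain") beliefs land under a fixed linear
# projection to 2D; these vertices are an arbitrary equilateral choice.
P = np.array([[0.0, 1.0],
              [-np.sqrt(3) / 2, -0.5],
              [np.sqrt(3) / 2, -0.5]])  # row i = 2D image of basis vector i

rng = np.random.default_rng(0)
beliefs = rng.dirichlet(np.ones(3), size=1000)  # random points on the 2-simplex
points = beliefs @ P                            # linear projection to 2D

# Every projected belief is a convex combination of the three vertices,
# so the whole cloud necessarily sits inside the triangle.
print(points.shape)
```

So the triangle shape alone isn't surprising; the interesting question is whether the *interior* structure (the fractal arrangement of belief states) is really there.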
Another individual seems to have asked the same question in the comment section of that article, and after a lot of back and forth they wrote a follow-up article with the author: