A lot of people didn't seem to get it when it was discussed on HN. A GPT had _only_ ever seen Othello transcripts like: "E3, D3, C4 ..." and NOTHING else. It knows nothing of the board. It doesn't even know that there are two players. It learned Othello like it was a language, and was able to play an OK game of it, making legal moves 99.99% of the time. Inside its 'mind', by looking for correlations between its internal state and what they knew the 'board' would look like at each step in the games, they found 64 nodes that seemed to represent the 8x8 Othello board and a representation of the two different colours of counters.
And this is the key bit: They reached into its mind and flipped bits on that internal representation (to change white pieces to black for example) and it responded in the appropriate way when making the next move. And by doing this they were able to map out its internal model in more detail, by running again and again with different variations of each move.
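To be concrete about what "only transcripts" means: the model's entire input is a stream of move tokens, something like the sketch below (mine, purely illustrative, not the paper's code). There is no board anywhere in the input.

```python
# Illustrative only, not the paper's code: the model's whole world is a
# sequence of integer token IDs for moves like "E3"; no board is ever given.

# Hypothetical vocabulary: one token per square (the real paper's vocabulary
# is slightly smaller, since the four centre squares are filled at the start
# and never appear as moves).
MOVES = [f"{col}{row}" for col in "ABCDEFGH" for row in range(1, 9)]
TOKEN_ID = {move: i for i, move in enumerate(MOVES)}

def encode_transcript(transcript):
    """Turn 'E3, D3, C4' into the list of token IDs the GPT is trained on."""
    return [TOKEN_ID[m.strip()] for m in transcript.split(",")]

print(encode_transcript("E3, D3, C4"))  # -> [34, 26, 19] under this mapping
```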
I agree this is an incredibly interesting paper. I am not a practitioner, but I interpreted the Gradient article differently. They didn't directly find 64 nodes (activations) that represented the board state, as I think you imply. They trained “64 independent two-layer MLP classifiers to classify each of the 64 tiles”. I interpret this to mean all activations are fed into a 2-layer MLP with the goal of predicting a single tile (white, black, empty), and then they do that 64 times, once for each tile (64 separately trained networks).
As much as I want to be enthusiastic about this, it’s not entirely clear to me that it is surprising that such a feat can be achieved. For example, it may be possible to train a 2 layer MLP to predict the state of a tile directly from the inputs. It may be that the most influential activations are closer to the inputs than the outputs, implying that Othello-GPT itself doesn’t have a world model, instead showing that you can predict board colors from the transcript. Again, not a practitioner but once you are indirecting internal state through a 2 layer MLP it gets less obvious to me that the world model is really there. I think it would be more impressive if they were only taking “later” activations (further from the input), and using a linear classifier to ensure the world model isn’t in the tile predictor instead of Othello-GPT. I would appreciate it if somebody could illuminate or set my admittedly naive intuitions straight!
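To make my worry concrete, here is roughly the difference I have in mind, as a sketch of my own (the hidden size and class names are made up, not the authors' code): the two-layer probe below has capacity of its own, whereas the linear variant can only read out structure that is already linearly present in the activations.

```python
import torch.nn as nn

class NonlinearTileProbe(nn.Module):
    """Two-layer MLP probe for one tile, as the article describes."""
    def __init__(self, d_model=512, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 3),  # logits for empty / black / white
        )

    def forward(self, activation):
        return self.net(activation)

class LinearTileProbe(nn.Module):
    """The stricter variant I'm suggesting: a single linear layer, so the
    probe itself can't be doing the board-reconstruction work."""
    def __init__(self, d_model=512):
        super().__init__()
        self.net = nn.Linear(d_model, 3)

    def forward(self, activation):
        return self.net(activation)

# 64 independent probes, one per tile, each trained with cross-entropy
# against the true tile state recovered by replaying the transcript.
probes = [NonlinearTileProbe() for _ in range(64)]
```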
That said, I am reminded of another OpenAI paper [1] from way back in 2017 that blew my mind. Unsupervised “predict the next character” training on 82 million Amazon reviews, then use the activations to train a linear classifier to predict sentiment. And it turns out they found that a single neuron's activation is responsible for the bulk of the sentiment!
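The probing step in that paper is roughly the following shape (my sketch; the language model is assumed already trained, and the arrays here are placeholders standing in for its hidden states and the review labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder activations and labels, standing in for the real thing:
# one hidden-state vector per review, plus a 0/1 sentiment label.
hidden_states = np.random.randn(1000, 4096)
labels = np.random.randint(0, 2, size=1000)

# An L1 penalty drives most weights to zero, which is how a single
# "sentiment neuron" can end up carrying most of the signal.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(hidden_states, labels)

top_unit = int(np.argmax(np.abs(clf.coef_[0])))
print("most influential unit:", top_unit)
```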
Right, so the 64 probes are able to look at Othello-GPT's internals and are trained using the known board-state-to-Othello-GPT-internals data. The article says:
> It turns out that the error rates of these probes are reduced from 26.2% on a randomly-initialized Othello-GPT to only 1.7% on a trained Othello-GPT. This suggests that there exists a world model in the internal representation of a trained Othello-GPT.
I take that to mean that the 64 trained probes are then shown other Othello-GPT internals and can tell us what the state of their particular 'square' is 98.3% of the time. (We know what the board would look like, but the probes don't.)
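In other words, the evaluation is something like this (my own sketch; `get_activations` and `replay_othello` are assumed helpers, not real APIs): the probes only ever see activations, and their predictions are scored against a board state we compute ourselves by replaying the transcript.

```python
# Sketch of the evaluation as I understand it; the two helpers are assumed.
def probe_error_rate(model, probes, games):
    wrong, total = 0, 0
    for moves in games:
        activations = model.get_activations(moves)  # assumed helper
        true_board = replay_othello(moves)          # assumed helper: 64 tile states
        for tile in range(64):
            pred = probes[tile](activations).argmax()
            wrong += int(pred != true_board[tile])
            total += 1
    return wrong / total

# Reportedly ~26.2% on a randomly-initialized Othello-GPT vs ~1.7% on a trained one.
```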
As you say "Again, not a practitioner but once you are indirecting internal state through a 2 layer MLP it gets less obvious to me that the world model is really there."
But then they go back and actually mess around with Othello-GPT's internal state (using the probes to work out how), changing black counters to white and so on, and this directly affects the next move Othello-GPT makes. They even do this for impossible board states (e.g. two unlinked sets of discs) and Othello-GPT still comes up with legal next moves for the altered board.
So surely this proves that the probes were actually pointing to an internal model? Because when you mess with that model in a way that should affect the next move, it changes Othello-GPT's behaviour in the expected way?
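As I understand it, the intervention is roughly the following shape (my own sketch, not the authors' exact procedure; the model/probe interfaces are assumptions): nudge the activation until the probe for one tile reads the flipped colour, then let the remaining layers run on the edited activation and see which move comes out.

```python
import torch
import torch.nn.functional as F

def flip_tile(activation, probe, target_state, steps=100, lr=0.1):
    """Edit `activation` until `probe` classifies its tile as `target_state`."""
    x = activation.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    target = torch.tensor([target_state])
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(probe(x).unsqueeze(0), target)
        loss.backward()
        opt.step()
    return x.detach()

# Then, with assumed interfaces:
#   edited = flip_tile(acts[layer], probes[tile], target_state=WHITE)
#   next_move_logits = model.run_from_layer(layer, edited)
# and you check whether the predicted move is legal for the edited board.
```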
An MLP isn't a synonym for NNs. It's one specific NN architecture, consisting of an input layer, an output layer, and a number of hidden layers in between. It's feed-forward and fully-connected, as you said.
> Inside its 'mind', by looking for correlations between its internal state and what they knew the 'board' would look like at each step in the games, they found 64 nodes that seemed to represent the 8x8 Othello board and a representation of the two different colours of counters.
Is that really surprising though?
Take a bunch of sand and throw it on an architectural relief: even though the process is seemingly random for each grain, the distribution of final positions for the grains will represent the underlying art piece. In the same way, a seemingly random set of strings (as "seen" by the GPT), given a seemingly random process (next move), will have some distribution that corresponds to some underlying structure, and through the process of training that structure will emerge in the nodes.
We are still dealing with function approximators, after all.
It's not surprising, but it answers the question "Do Large Language Models learn world models or just surface statistics?" - Othello-GPT is not using some weird trick to come up with the next move "G4". You can imagine some sort of shortcut trick where you say "use a letter that's near the middle of the bell curve of letters you've seen so far, and a number that's a bit to the left of the bell curve" or something. It's not using a weird trick; it's actually modelling the board, the counters, and the rules about where the black and white discs are allowed to go, and keeping track of the game state. It derived all that from the input.
But the point is that Othello notation is basically 64 tokens which map 1:1 to positions on an Othello board, and the "grammar" of whether one token is a valid continuation is basically how the previous sequence of moves updates game state, so surface statistics absolutely do lead inexorably towards a representation of the game board. Whether a move is a suitable continuation or not absolutely is a matter of probability contingent on previous inputs (some moves common, some moves uncommon, many other moves not in training set due to impossibility). Translating inputs into an array of game state has a far higher accuracy rate than "weird tricks" like outputting the most common numbers and letters in the set, so it's not surprising an optimisation process involving a large array converges on that to generate its outputs. Indeed I'd expect a dumb process involving a big array of numbers to be more likely to converge on that solution from a lot of data than a sentient being with a priori ideas about bell curves of letters...
I think some of the stuff ChatGPT can actually do, like reject the possibility of Magellan circumnavigating my living room, is much more surprising than a specialist NN learning how to play Othello from a DSL that provides a perfect representation of Othello games. But there's still a big difference between acquiring through training a very basic model of time periods and the relevance of verbs to them (such that it can conclude an assertion of the form "it was impossible for X to have [Verb]ed Y because X lived in V and Y lived in Q" is a suitable continuation) and having a high-fidelity, well-rounded world model. It has some sort of world model, but it's tightly bound to syntax and approval and only very loosely bound to the actual world. The rest of the world doesn't have a neat 1:1 mapping to sentence structure the way Othello maps to Othello notation, which is why LLMs appear to have quite limited and inadequate internal representations even of things which computers can excel at (and humans can be taught with considerably fewer textbooks), like mathematics, never mind being able to deduce what it's like to have an emotional state from tokens typically combined with the string "sad".
> "E3, D3, C4 ..." and NOTHING else. It knows nothing of the board. It doesnt event know that there are two players.
Yeah, just as languages have grammar rules, games also have rules, and in both cases an LLM can learn those rules. It's the same with many other structured chains of actions/tokens - you could also model actions from different domains and use them as a language. It seems a lot of emergent behaviours of LLMs are what you could call generalized approximated algorithms for certain tasks. If we could distill only these patterns, extract them, and maybe understand them (if possible, as some of these are HUGE), then based on that knowledge maybe we could create traditional algorithms that would solve similar problems.
Knowledge distillation for transformers is already a thing and it is still actively researched since the potential benefits of not having to run these gigantic models are enormous.
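For anyone unfamiliar, the usual recipe (à la Hinton et al.'s distillation paper) is to train a small student to match the big teacher's softened output distribution; a minimal sketch of that objective:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix of soft-target matching (teacher) and the ordinary hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```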
Imagine I painted an Othello board in glue, then I threw a handful of sawdust on the "painting", then gave it a good shake. Ta-da! My magic sawdust made an Othello board!
That's what's happening here.
The model is a set of valid game configurations, and nothing else. The glue is already in the right place. Is it any mystery the sawdust resembles the game board? Where else can it stick?
What GPT does is transform the existing relationships between repeated data points into a domain. Then, it stumbles around that domain, filling it up like the tip of a crayon bouncing off the lines of a coloring book.
The tricky part is that, unlike my metaphors so far, one of the dimensions of that domain is time. Another is order. Both are inherent in the structure of writing itself, whether it be words, punctuation, or game moves.
Something that project didn't bother looking at is strategy. If you train on a specific Othello game strategy, will the net ever diverge from that pattern, and effectively create its own strategy? If so, would the difference be anything other than noise? I suspect not.
While the lack of divergence from strategy is not as impressive as the lack of divergence from game rules, both are the same pattern. Lack of divergence is itself the whole function of GPT.
The way Othello works, playing a legal game requires understanding how the symbols map to the geometry of the board, at least as far as knowing that there are two orthogonal axes on which the tokens are ordered. Playing an "E3" might change the colour of discs along any neighbouring extent of the 3 rank, the E file, or the diagonals through E3. If it's playing a legal game, it's difficult to see an alternative explanation that doesn't map to "it's got an internal representation consistent with an 8x8 Othello board", especially if you directly reach in and make changes to that representation and it subsequently makes moves consistent with those changes.
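For reference, the rule the model would have to be tracking is just standard Othello flipping logic, something like this (my own sketch of the standard game rules, not anything from the paper):

```python
# Standard Othello flip rule: a new disc flips every run of opponent discs
# that it brackets against an existing friendly disc, in any of 8 directions.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def play(board, row, col, player):
    """board: 8x8 list of lists with 0 (empty), 1 (black), -1 (white)."""
    board[row][col] = player
    for dr, dc in DIRECTIONS:
        run = []
        r, c = row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == -player:
            run.append((r, c))
            r, c = r + dr, c + dc
        # Only flip if the run is closed off by one of our own discs.
        if run and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            for rr, cc in run:
                board[rr][cc] = player
    return board
```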
> And this is the key bit: They reached into its mind and flipped bits on that internal representation (to change white pieces to black for example) and it responded in the appropriate way when making the next move.
Excuse my ignorance, but how is this useful? This seems to indicate only that they found the "bits" in the internal state.
> Excuse my ignorance, but how is this useful? This seems to indicate only that they found the "bits" in the internal state.
Right, they found the bits in the internal state that seem to correspond to the board state. This means the LLM is building an internal model of the world.
This is different from the LLM just learning that [sequence of moves] is usually followed by [move]. It's learning that [sequence of moves] results in [board state] and then that [board state] should be followed by [move]. They're testing this by giving it [sequence of moves], then altering the bits of the internal state that model the board and checking to see what move it makes. If the bits they altered weren't really the ones modelling the board, the resulting move wouldn't be one you'd expect to make sense for the altered board.
I see, thanks. I guess it means that if there was only a statistical model of [moves]->[next move], this would be impossible (or extremely unlikely) to work.
Yeah, exactly. I think it's a really interesting approach to answering the question of what these things might be doing.
You can still try and frame it as some overall statistical model of moves -> next move (I think there are discussions on this in the comments that I don't fancy getting into) but I think the paper does a good job of discussing this in terms of surface statistics:
> From various philosophical [1] and mathematical [2] perspectives, some researchers argue that it is fundamentally impossible for models trained with guess-the-next-word to learn the “meanings” of language and their performance is merely the result of memorizing “surface statistics”, i.e., a long list of correlations that do not reflect a causal model of the process generating the sequence.
On the other side, it's reasonable to think that these models can learn a model of the world but don't necessarily do so. And sufficiently advanced surface statistics will look very much like an agent with a model of the world until it does something catastrophically stupid. To be fair to the models, I do the same thing. I have good models of some things and others I just perform known-good actions and it seems to get me by.