I'm impressed and surprised that a relatively small model can learn so much from just the textual move records. Not even full algebraic notation 1.e2-e4 e7-e5 2.Ng1-f3 but just 1.e4 e5 2.Nf3 ... It has to figure out what O-O and O-O-O mean just from what King and Rook moves appear later. And the only way it could learn that the King cannot move through a checked square while castling is that such situations never appear in its training set.
Would love to see a similar experiment for 9x9 Go, where the model also needs to learn the concepts of connected group and its liberties.
I'd be curious to see if, in the 1-2% of cases where the linear probe fails to predict board occupancy, the LLM also predicts (or at least assigns non-trivial probability to) a corresponding illegal move. For example, if the linear probe incorrectly thinks there's a bishop on b4, does the LLM give more probability to illegal bishop moves along the corresponding diagonals than to other illegal bishop moves?
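A rough sketch of that check, using the python-chess library to reconstruct the true position. The `move_prob(pgn_moves, san)` helper is hypothetical (a hook into whatever model you're probing), and the SAN strings are simplified (no captures or disambiguation):

    import chess

    def diagonal_squares(square):
        # squares a bishop on `square` could reach on an empty board
        f0, r0 = chess.square_file(square), chess.square_rank(square)
        out = []
        for df, dr in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
            f, r = f0 + df, r0 + dr
            while 0 <= f < 8 and 0 <= r < 8:
                out.append(chess.square(f, r))
                f, r = f + df, r + dr
        return out

    def phantom_bishop_mass(pgn_moves, phantom_sq, move_prob):
        # probability mass on illegal bishop moves along the phantom piece's
        # diagonals vs. illegal bishop moves to unrelated squares
        board = chess.Board()
        for san in pgn_moves:
            board.push_san(san)
        legal = {board.san(m) for m in board.legal_moves}
        diag = {chess.square_name(sq) for sq in diagonal_squares(phantom_sq)}
        on_diag, off_diag = 0.0, 0.0
        for sq in chess.SQUARE_NAMES:
            san = "B" + sq
            if san in legal:
                continue
            if sq in diag:
                on_diag += move_prob(pgn_moves, san)
            else:
                off_diag += move_prob(pgn_moves, san)
        return on_diag, off_diag

If the probe's errors line up with the model's own beliefs, on_diag should be systematically larger than off_diag in those 1-2% of positions.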
Nice experiment, even though we know that LLMs distill an internal world model representation of whatever they are trained on.
The experiment could be a little better by using a more descriptive form of notation than PGN. PGN's strength is its shorthand, because it is written by humans while playing the game; that is far from being a strength as LLM training data. ML algorithms, LLMs included, train better when fed more descriptive and accurate data, and verbosity is not a problem at all. There is also FEN notation, in which the entire board is encoded at every move.
One could easily imagine many different ways to describe a game: encoding vertical and horizontal lines, listing exactly which squares each piece covers and the colors of those squares, noting which pieces are able to move, and generating a whole page describing the board situation for each move.
I call this spatial navigation: the LLM learns the ins and outs of its training data and is able to make more informed guesses. Chess is fun and all, but code generation has the potential to be a lot better than just writing functions. Feeding the LLM the AST representation of the code, the tree of workspace files, the public items, and the module hierarchy alongside the code could be a significant improvement.
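As a rough illustration of the kind of structural context meant here (the file name and output format are placeholders, not any particular tool's convention), Python's standard ast module can already pull out a module's items and hierarchy:

    import ast

    def outline(source: str) -> list[str]:
        # flat outline of classes/functions: the kind of structural summary
        # one might prepend to raw code in a prompt
        tree = ast.parse(source)
        items = []
        for node in ast.walk(tree):
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "class" if isinstance(node, ast.ClassDef) else "def"
                items.append(f"{kind} {node.name} (line {node.lineno})")
        return items

    with open("my_module.py") as f:   # hypothetical file
        print("\n".join(outline(f.read())))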
> Nice experiment, even though we know that LLMs distill an internal world model representation of whatever they are trained on.
There are still a lot of people who deny that (for example Bender's "superintelligent octopus" supposedly wouldn't learn a world model, no matter how much text it trained on), so more evidence is always good.
> There is the FEN notation in which in every move the entire board is encoded.
The entire point of this is to not encode the board state!
>The entire point of this is to not encode the board state!
I am not sure about this. From the article "The 50M parameter model played at 1300 ELO with 99.8% of its moves being legal within one day of training."
I thought the experiment was about how well the model would perform given that its reward function is to predict text rather than to checkmate. Leela and AlphaZero's reward function is to win the game, checkmate, or capture pieces. It also goes without saying that Leela and AlphaZero cannot make illegal moves.
The experiment does not need to include the whole board position if that's a problem, if that's an important point of interest. It could encode more information about the squares covered by each side, for example. See this training experiment for Trackmania [1]. There are techniques that the ML algorithm will *never* figure out by itself if that information is not encoded in its training data.
The point still stands. PGN notation certainly is not a good format if the goal (or one of the goals) of the experiment is to be a good chess player.
That just shows that it worked in some sense. If it didn't reach any decent Elo, clearly the results would be uninformative: maybe it's impossible to learn chess from PGN, or maybe you just screwed up. He's clear that the point is to interrogate what it learns:
"This model is only trained to predict the next character in PGN strings (1.e4 e5 2.Nf3 …) and is never explicitly given the state of the board or the rules of chess. Despite this, in order to better predict the next character, it learns to compute the state of the board at any point of the game, and learns a diverse set of rules, including check, checkmate, castling, en passant, promotion, pinned pieces, etc. In addition, to better predict the next character it also learns to estimate latent variables such as the ELO rating of the players in the game."
By feeding the LLM the AST representation of the code, the tree of workspace files, public items, module hierarchy alongside with the code, it could be a significant improvement.
Aider does this, using tree-sitter to build a “repository map”. This helps the LLM understand the overall code base and how it relates to the specific coding task at hand.
More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
>Aider does this, using tree-sitter to build a “repository map”. This helps the LLM understand the overall code base and how it relates to the specific coding task at hand.
Great stuff.
>More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
As the experiments on Phi-1 and Phi-2 from Microsoft show, training data makes a difference. The "Textbooks Are All You Need" motto means that better-structured, clearer data makes a difference.
> The experiment could be a little better by using a more descriptive form of notation than PGN
The author seems more interested in the ability to learn chess at a decent level from such a poor input, as well as what kind of world model it might build, rather than wanting to help it to play as well as possible.
The fact that it was able to build a decent model of the board position from PGN training samples, without knowing anything about chess (or that it was even playing chess) is super impressive.
It seems simple enough to learn that, for example, "Nf3" means that an "N" is on "f3", especially since predicting well requires you to know what piece is on each square.
However, what is not so simple is to have to learn - without knowing a single thing about chess - that "Nf3" also means that:
1) One of the 8 squares that is a knight's move away from f3, and had an "N" on it, now has nothing on it. There's a lot going on there!
2) If "f3" previously had a different piece on it, that piece is now gone (taken) - it should no longer also be associated with "f3"
If you take a neural network that already knows the basic rules of chess and train it on chess games, you produce a chess engine.
From the Wikipedia page on one of the strongest ever[1]: "Like Leela Zero and AlphaGo Zero, Leela Chess Zero starts with no intrinsic chess-specific knowledge other than the basic rules of the game. Leela Chess Zero then learns how to play chess by reinforcement learning from repeated self-play"
As described in the OP's blog post https://adamkarvonen.github.io/machine_learning/2024/01/03/c... - one of the incredible things here is that the standard GPT architecture, trained from scratch from PGN strings alone, can intuit the rules of the game from those examples, without any notion of the rules of chess or even that it is playing a game.
Which is not to diminish the work of the Leela team at all! But I find it fascinating that an unmodified GPT architecture can build up internal neural representations that correspond closely to board states, despite not having been designed for that task. As they say, attention may indeed be all you need.
> What's the strength of play for the GPT architecture?
Pretty shit for a computer. He says his 50M model reached 1800 Elo (by the way, it's Elo and not ELO as the article incorrectly has it; it is named after a Hungarian guy called Elo). It seems to be a bit better than Stockfish level 1 and a bit worse than Stockfish level 2 from the bar graph.
Based on what we know I think it's not surprising these models can learn to play chess, but they get absolutely smoked by a "real" chess bot like Stockfish or Leela.
Afaik his small bot reaches 1300 and gpt-3.5-instruct reaches 1800. We have no idea how much, or on what kind of PGNs, the OpenAI model was trained. I heard a rumor that they specifically trained on games up to 1800, but no idea.
They also say “I left one training for a few more days and it reached 1500 ELO.” I find it quite likely the observed performance is largely limited by the spent compute.
I can't see it being superhuman, that's for sure. Chess AIs are superhuman because they do vast searches, and I can't see that being replicated by an LLM architecture.
The apples-to-apples comparison would be comparing an LLM with Leela with search turned off (only using a single board state)
According to figure 6b [0], removing MCTS reduces Elo by about 40%; scaling 1800 Elo by 5/3 gives us 3000 Elo, which would be superhuman but not as good as e.g. LeelaZero.
Leela's policy network alone is around 2600 Elo, or around the level of a strong grandmaster.
Note that Go is different from chess since there are no draws, so skill difference is greatly magnified.
Elo is always a relative scale (expected score is based on the Elo difference), so multiplication shouldn't really make sense anyway.
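For reference, the standard Elo expected-score formula only ever sees the rating difference, which is why multiplying ratings by a factor isn't meaningful:

    E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}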
Chess AI used to dominate by computational power but to my knowledge that is no longer true and the engines beat all but the very strongest players even when run on phone CPUs.
Deep Blue analyzed some 200 million positions per second. Modern engines analyze three to four orders of magnitude fewer nodes per second, but have much more refined pruning of the search space.
It's not self-play. It's literally just reading sequences of moves. And it doesn't even know that they're moves, or that it's supposed to be learning a game. It's just learning to predict the next token given a sequence of previous tokens.
What's kind of amazing is that, in doing so, it actually learns to play chess! That is, the model weights naturally organize into something resembling an understanding of chess, just by trying to minimize error on next-token prediction.
It makes sense, but it's still kind of astonishing that it actually works.
> I am pretty sure a bunch of matrix multiplications can't intuit anything.
I don't understand how people can say things like this when universal approximation is an easy thing to prove. You could reproduce Magnus Carlsen's exact chess-playing stochastic process with a bunch of matrix multiplications and nonlinear activations, up to arbitrarily small error.
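For concreteness, one classical form of the universal approximation claim (stated informally, Cybenko/Hornik-style): for any continuous f on a compact set K in R^n and any epsilon > 0 there exist weights such that

    \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} a_i \, \sigma(w_i^\top x + b_i) \Big| < \varepsilon

where sigma is a fixed non-polynomial activation. It says nothing about how to find those weights, which is the separate point debated below.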
I read such statements as being claims that "intuition" is part of consciousness etc.
It's still too strong a claim given that matrix multiplication also describes quantum mechanics and by extension chemistry and by extension biology and by extension our own brains… but I frequently encounter examples of mistaking two related concepts for synonyms, and I assume in this case it is meant to be a weaker claim about LLMs not being conscious.
Me, I think the word "intuition" is fine, just like I'd say that a tree falling in a forest with no one to hear it does produce a sound because sound is the vibration of the air instead of the qualia.
Funnily, for me intuition is the part of intelligence which I can more easily imagine being done by a neural network. When my intuition says this person is not to be trusted, I can easily imagine that being something like a simple hyperplane classification in situation space.
It's the active, iterative thinking and planning that is more critical for AGI and, while obviously theoretically possible, much harder to imagine a neural network performing.
No, matrix multiplication is the system humans use to make predictions about those things, but it doesn't describe their fundamental structure, and there's no reason to imply it does.
This simply isn't true. There are big caveats to the idea that neural networks are universal function approximators (as there are to the idea that they're universal Turing machines, which also somehow became common knowledge in our post-ChatGPT world). The function has to be continuous, we're talking about functions rather than algorithms, an approximator being possible and us knowing how to construct it are very different things, and so on.
That's not a problem. You can show that neural network induced functions are dense in a bunch of function spaces, just like continuous functions. Regularity is not a critical concern anyways.
>functions vs algorithms
Repeatedly applying arbitrary functions to a memory (like in a transformer) yields you arbitrary dynamical systems, so we can do algorithms too.
> an approximator being possible and us knowing how to construct it are very different things,
This is of course the critical point, but not so relevant when asking whether something is theoretically possible. The way I see it this was the big question for deep learning and over the last decade the evidence has just continually grown that SGD is VERY good at finding weights that do in fact generalize quite well and that don't just approximate a function from step-functions the way you imagine an approximation theorem to construct it, but instead efficiently find features in the intermediate layers and use them for multiple purposes, etc. My intuition is that the gradient in high dimension doesn't just decrease the loss a bit in the way we imagine it for a low dimensional plot, but in those high dimensions really finds directions that are immensely efficient at decreasing loss. This is how transformers can become so extremely good at memorization.
You are probably joking, but I think it's actually very important to look at the language we use around LLMs, in order not to get stuck in assumptions and sociological bias associated with a vocabulary usually reserved for "magical" beings, as it were.
This goes both ways, by the way. I could be convinced that LLMs can achieve something the likes of intuition, but I strongly believe that it is a very different kind of intuition than we normally associate with humans/animals. Using the same label is thus potentially confusing, and (human pride aside) might even prevent us from appreciating the full scope of what LLMs are capable of.
I think the issue is that we're suddenly trying to pin down something that was previously fine being loosely understood, but without any new information.
If someone came to the table with "intuition is the process of a system inferring a likely outcome from given inputs by the process X - not to be confused with matmultuition which is process Y", that might be a reasonable proposal.
Can a bunch of neurons firing based on chemical and electrical triggers intuit anything? It has to be the case that any intelligent process must be the emergent result of non-intelligent processes, because intelligence is not an inherent property of anything.
I think that “intuit the rules” is just projecting.
More likely, the 16 million games just contain most of the piece-move combinations. It does not know a knight moves in an L. It knows, from each square, where a knight can move, based on 16 million games.
No, this isn't likely. Chess has trillions of possible games[1] that could be played, and if all it took was such a small number of games to hit most piece combinations, chess would be solved. It has to have learned some fundamental aspects of the game to achieve the rating stated ITT.
It doesn’t take the consumption of all trillions of possible game states to see a majority of possible ways a piece can move from one square to another.
Maybe I misread something as I only skimmed, but the pretty weak Elo would most definitely suggest a failure of intuiting rules.
No, a weak Elo just indicates poor play. He also quantifies what percentage of the model's moves are legal, and it's ~99%, meaning it must have learned the rules.
That's entirely my point: both your kids and ChessGPT know the rules, but still don't play very strongly. You say they "don’t know much more than how the pieces move" but that's exactly what the rules are, how the pieces are allowed to move, given the sequence of moves that have come before (i.e the state of the board.) I'm saying ChessGPT is a poor player, and didn't learn much high level play. But it definitely learned the rules!
On a board with a finite number of squares, is this truly different?
The representation of the ruleset may not be the optimal Kolmogorov complexity - but for an experienced human player who can glance at a board and know what is and isn’t legal, who is to say that their mental representation of the rules is optimizing for Kolmogorov complexity either?
You assert something that is a hypothesis for further research in the area. Alternative is that it in fact knows that knights move in an L-shaped fashion. The article is about testing hypotheses like that, except this particular one seems quite hard.
It'd seem surprising to me if it had really learnt the generalization that knights move in an L-shaped fashion, especially since its model of the board position seems to be more probabilistic than exact. We don't even know if its representation of the board is spatial or not (e.g. that columns a & b are adjacent, or that rows 1 & 3 are two rows apart).
We also don't know what internal representations of the state of play it's using other than what the author has discovered via probes... Maybe it has other representations effectively representing where pieces are (or what they may do next) other than just the board position.
I'm guessing that it's just using all of its learned representations to recognize patterns where, for example, Nf3 and Nh3 are both statistically likely, and has no spatial understanding of the relationship between these moves.
I guess one way to explore this would be to generate a controlled training set where each knight only ever makes a different subset of its (up to 8) legal moves depending on which square it is on. Will the model learn a generalization that all L-shaped moves are possible from any square, or will it memorize the different subset of moves that "are possible" from each individual square?
A minor detail here is that the analysis in the blog shows that the linear model built/trained on the activations of an internal layer has a representation of the board that is probabilistic. Of course the full model is also probabilistic by design, though it probably has a better internal understanding of the state of the board than the linear projection used to visualize/interpret the internals of the model. There is no real meaning in the word "spatial" representation beyond the particular connectivity of the graph of the locations, which seems to be well understood by the model, as 98% of the moves are valid, and this holds even when sampling with whatever probabilistic decoding algorithm one chooses, which may not always return the model's best move.
A different way to test the internal state of the model would be to score all possible valid and invalid moves at every position and see how the probabilities of these moves change as a function of the player's Elo rating. One would expect that invalid moves would always score poorly independent of Elo, whereas valid moves would score monotonically with how good they are (as assessed by Stockfish), and that the player's Elo would stretch that monotonic function to separate the best moves from the weakest moves for a strong player.
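A sketch of that scoring loop; `model_prob(pgn_moves, san)` is a hypothetical hook into the language model, while the Stockfish reference uses python-chess's real engine bindings. The illegal candidates use simplified "piece + square" strings:

    import chess, chess.engine

    def score_position(pgn_moves, model_prob, stockfish_path="stockfish"):
        board = chess.Board()
        for san in pgn_moves:
            board.push_san(san)
        legal_sans = {board.san(m): m for m in board.legal_moves}

        engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
        rows = []
        try:
            # legal moves: model probability plus a Stockfish centipawn eval
            for san, move in legal_sans.items():
                info = engine.analyse(board, chess.engine.Limit(depth=12), root_moves=[move])
                cp = info["score"].relative.score(mate_score=100000)
                rows.append(("legal", san, model_prob(pgn_moves, san), cp))
            # crude illegal candidates: every piece-to-square string not in the legal set
            for piece in "KQRBN":
                for sq in chess.SQUARE_NAMES:
                    san = piece + sq
                    if san not in legal_sans:
                        rows.append(("illegal", san, model_prob(pgn_moves, san), None))
        finally:
            engine.quit()
        return rows

Binning these rows by the Elo headers of the game prefix would give the legal-vs-illegal and move-quality curves described above.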
> There is no real meaning in the word "spatial" representation beyond the particular connectivity of the graph of the locations
I don't think it makes sense to talk of the model (potentially) knowing that knights make L-shaped moves (i.e. 2 squares left or right, plus 1 square up or down, or vice versa) unless it is able to add/subtract row/column numbers to be able to determine the squares it can move to on the basis of this (hypothetical) L-shaped move knowledge.
Being able to do row/column math is essentially what I mean by spatial representation - that it knows the spatial relationships between rows ("1"-"8") and columns ("a"-"h"), such that if it had a knight on e1 it could then use this L-shaped move knowledge to do coordinate math like e1 + (1,2) = f3.
I rather doubt this is the case. I expect the board representation is just a map from square name (not coordinates) to the piece on that square, and that generated moves are likely limited to those it saw the piece being moved make when it had been on the same square during training - i.e. it's not calculating possible, say, knight destinations based on an L-shaped move generalization, but rather "recalling" a move it had seen during training when (among other things) it had a knight on a given square.
Somewhat useless speculation perhaps, but would seem simple and sufficient, and an easy hypothesis to test.
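One half of that test is trivial to state in code: the "coordinate math" generalization is just eight fixed file/rank offsets (sketch below, using python-chess only for square naming), whereas the memorization hypothesis would need a separately learned move list for each of the 64 squares.

    import chess

    KNIGHT_OFFSETS = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

    def knight_targets(square_name):
        # destinations implied by the L-shaped generalization
        sq = chess.parse_square(square_name)
        f0, r0 = chess.square_file(sq), chess.square_rank(sq)
        return [chess.square_name(chess.square(f0 + df, r0 + dr))
                for df, dr in KNIGHT_OFFSETS
                if 0 <= f0 + df < 8 and 0 <= r0 + dr < 8]

    print(knight_targets("e1"))   # ['f3', 'g2', 'c2', 'd3']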
Thanks for linking the actual post—it was a great read. I'm not an ML expert, but the author really made it easy to follow their experiment's method and results.
> I fine-tuned GPT-2 on a 50 / 50 mix of OpenWebText and chess games, and it learned to play chess and continued to output plausible looking text. Maybe there’s something interesting to look at there?
To me that suggests investigating whether there are aspects of human culture that can improve chess playing performance - i.e. whether just training on games produces less good results than training on games and literature.
This seems plausible to me, even beyond literature that is explicitly about the game - learning go proverbs, which are often phrased as life advice, is a part of learning go, and games are embedded all through our culture, with some stories really illustrating that you have to 'know when to hold em, know when to fold em, know when to walk away, know when to run'.
I've skimmed this, but if it is really true that it can play at 1800 Elo based purely on the moves, without seeing the board at each turn, that is insane. 1800 Elo is a strong human rating even when you can see the board. 1800 Elo essentially blindfolded is incredible.
I’m curious how human like this LLM feels when you play it.
One of the challenges in making fun chess bots is making them play like a low- or mid-ranked human. The problem is that a Stockfish-based bot knows some very strong moves, but deliberately plays bad moves so it's about the right skill level. The trouble is that these bad moves are often very obvious. For example, I'll threaten a queen capture.
Any human would see it and move their queen. The bot “blunders” and loses the queen to an obvious attack. It feels like the bot is letting you win which kills the enjoyment of playing with the bot.
I think that this approach would create very human like games.
There is a very interesting project on this exact problem called Maia, which trains an engine based on millions of human games played on Lichess, specifically targeting varying levels of skill from 1300 to 1900 Elo. I haven't played it myself, but my understanding is that it does a much better job imitating the mistakes of human players. https://maiachess.com
What I'm most interested in is what an LLM trained on something specific like this (even though chess, arguably, isn't super specific) has to say in human words about its strategies and moves, especially with some kind of higher-order language.
And the reverse, can a human situation be expressed as a chessboard presented with a move?
Humans and machines find good moves in different ways.
Most humans have fast pattern matching that is quite good at finding some reasonable moves.
There are also classes of moves that all humans will spot. (You just moved your bishop, now it’s pointing at my queen)
The problem is that Stockfish scores all moves with a number based on how good the move is. You have no idea if a human would agree.
For example, miscalculating a series of trades 4 moves deep is a very human mistake, but it's scored the same as moving the bishop to a square where it can easily be taken by a pawn. They both result in you being a bishop down. A nerfed Stockfish bot is equally likely to play either of those moves.
You might think that you could have a list of dumb move types that the bot might play, but there are thousands of possible obviously dumb moves. This is a problem for machine learning.
I'd call it an approach issue: LLM vs brute-force lookahead.
An LLM is predicting what comes next per its training set. If it's trained on human games then it should play like a human; if it's trained on Stockfish games, then it should play more like Stockfish.
Stockfish, or any chess engine using brute-force lookahead, is just trying to find the optimal move - not copying any style of play - and its moves are therefore sometimes going to look very un-human. Imagine the human player is looking 10-15 moves ahead, but Stockfish 40-50 moves ahead... what looks good 40-50 moves out might be quite different from what looks good to the human.
I mean, this seems obvious to me. How would the model predict the next move WITHOUT calculating the board state first? Yes, by memorization, but the memorization hypothesis is easily rejected by comparison to the training dataset in this case.
It is possible the model calculates an approximate board state, which is different from the true board state but equivalent for most games, though not all. It would be interesting to train an adversarial policy to check this. From the KataGo attack we know this does happen for Go AIs: Go rules have a concept of liberty, but so-called pseudoliberty is easier to calculate and equivalent in most cases (but not all). In fact, human programmers also used pseudoliberties to optimize their engines. The adversarial attack found that Go AIs use pseudoliberties too.
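For readers unfamiliar with the distinction: true liberties are the distinct empty points adjacent to a group, while pseudoliberties count an empty point once per adjacent stone of the group, so the two only diverge when stones of one group share an empty neighbor. A self-contained sketch (no Go library assumed):

    def neighbors(p, size=9):
        x, y = p
        return [(x + dx, y + dy) for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
                if 0 <= x + dx < size and 0 <= y + dy < size]

    def liberties_vs_pseudo(group, occupied, size=9):
        # group: points of one connected group; occupied: all stones on the board
        libs, pseudo = set(), 0
        for p in group:
            for n in neighbors(p, size):
                if n not in occupied:
                    libs.add(n)    # counted once per distinct empty point
                    pseudo += 1    # counted once per adjacent group stone
        return len(libs), pseudo

    group = {(0, 0), (1, 0), (1, 1)}            # an L-shaped group in the corner
    print(liberties_vs_pseudo(group, group))    # (4, 5): the point (0, 1) touches two stones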
Surprisingly many people seem to believe LLMs cannot form any deeper world models beyond superficial relationships between words, even if figuring out a "hidden" model allows for a big leap in prediction performance – in this case, a hypothesis corresponding to the chess rules happens to give the best bang for the buck for predicting strings that have chess-notation structure.
But the model could in principle just have learned a long list of rote heuristics that happen to predict notation strings well, without having made the inferential leap to a much simpler set of rules, and a learner weaker than a LLM could well have got stuck at that stage.
I wonder how well a human (or a group of humans) would fare at the same task and if they could also successfully reconstruct chess even if they had no prior knowledge of chess rules or notation.
(OTOH a GPT3+ level LLM certainly does know that chess notation is related to something called "chess", which is a "game" and has certain "rules", but to what extent is it able to actually utilize that information?)
It’s one thing to think it’s obvious, but quite another to prove it. I think this is the true value of this kind of work, is that it’s helping to decipher what these models are actually doing. Far too often we hear “NNs / LLMs are black boxes” as if that’s the end of the conversation.
> It is possible the model calculates an approximate board state
Yes - this is exactly what the probes show.
One interesting aspect is that it still learns to play when trained on blocks of move sequences starting from the MIDDLE of the game, so it seems it must be incrementally inferring the board state by what's being played rather than just by tracking the moves.
The 'world model' question seems "not even understood" by those in the field who provide these answers to it -- and use terms like "concepts" (see the linked paper on sentiment where the NN has apparently discovered a sentiment "concept").
Consider the world to contain causal properties which bring about regularities in text, eg., Alice likes chocolate so Alice says, "I like chocolate". Alice's liking, ie., her capacity for preference, desire, taste, aesthetic judgement, etc., is the cause of "like".
Now these causal properties bring about significant regularities in text, so "like" occurring early in the paragraph comes to be extremely predictive of other text tokens occurring (eg., b-e-s-t, etc.)
No one in this debate doubts, whatsoever, that NNs contain "subnetworks" which divide the problem up into detecting these token correlations. This is trivially observable in CNNs where it is trivial to demonstrate subnetworks "activating" on, say, an eye-shape.
The issue is that when a competent language user judges someone's sentiment, or the implied sentiment the speaker of some text would have -- they are not using a model of how some subset of terms (like, etc.) comes to be predictive of others.
They're using the fact that they know the relevant causal properties (liking, preference, desire, etc.) and how these cause certain linguistic phrases. It is for this reason that a competent language user can trivially detect irony ("of course I like going to the dentist!" -- here, since we know how unlikely it is to desire this, we know this phrase is unlikely to express such a preference, etc.).
To say NNs, or any ML system, is sensitive to these mere correlations is not to say that these correlations are not formed by tracking the symptoms of real causes (eg., desire). Rather it is to say they do not track desire.
This seems obvious, since the mechanism to train them is just sensitive to patterns in tokens. These patterns are not their causes, and are not models of their causes. They're only predictive of them under highly constrained circumstances.
Astrological signs are predictive of birth dates, but they aren't models of being born -- nor of time, or anything else.
No one here doubts whether NNs are sensitive to patterns in text caused by causal properties -- the issue is that they aren't models of these properties; they are models of (some of) their effects as encoded in text.
>Astrological signs are predictive of birth dates, but they aren't models of being born -- nor of time, or anything else.
Also eating ice cream and getting bitten by a shark do have some mutual predictive associations.
I think the chess-GPT experiment can be interesting, not because the machine can predict every causal connection, but for how many causal connections it can extract from the training data by itself. By putting a human in the loop, many more causal connections would be revealed, but the human is lazy. Or expensive. Or expensive because he is lazy.
In addition, correlation can be a hint of causation. If a human researches it further, maybe it turns out to be just a correlation and nothing substantial, but sometimes it may actually be a causative effect. So there is value in that.
As for the overall sentiment, an NN's world model is indeed very different from a human world model.
If you understand the cause of a regularity, you will predict it in all relevant circumstances. If you're just creating a model of its effects in one domain, you can only predict it in that domain --- with all other factors held constant.
This makes (merely) predictive models extremely fragile; as we often see.
One worry about this fragility is safety: no one doubts that, say, city route planning from 1bn+ images is done via a "pixel-correlation (world) model" of pedestrian behaviour. The issue is that it isn't a model of pedestrian behaviour.
So it is only effective insofar as the effects of pedestrian behaviour, as captured in the images, in these environments, etc. remain constant.
If you understood pedestrians, ie., people, then you can imagine their behaviour in arbitrary environments.
Another way of putting it is: correlative models of effects aren't sufficient for imagining novel circumstances. They encode only the effects of causes in those circumstances.
Whereas if you had a real world model, you could trivially simulate arbitrary circumstances.
There are many notions of "prediction" and "generalisation" -- the relevant ones here, which apply to NNs, are extremely limited. That's the problem with all this deceptive language -- it invites people to think NNs predict in the sense of simulate, and generalise in the sense of "apply across different effect domains".
NNs cannot apply a 'concept' across different 'effect' domains, because they have only one effect domain: the training data. They are just models of how the effect shows itself in that data.
This is why they do not have world models: they are not generalising data by building an effect-neutral model of something; they're just modelling its effects.
Compare having a model of 3D vs. a model of the shadows of a fixed number of 3D objects. NNs generalise in the sense that they can still predict shadows similar to their training set. They cannot predict 3D; and with sufficiently novel objects, they fail catastrophically.
The downside is that it's a supervised technique, so you need to already know what you're looking for. It would be nice to have an unsupervised tool that could list out all the things the network has learned.
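For context, the probe being discussed is essentially just a linear classifier fit on cached activations; something like this sklearn sketch, where the file names, shapes, and the activation-caching step are assumptions rather than the author's actual code:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # activations: (n_positions, d_model) hidden states cached from one layer
    # labels: (n_positions,) e.g. which piece type sits on one fixed square
    activations = np.load("layer6_acts.npy")        # hypothetical files
    labels = np.load("square_e4_piece.npy")

    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations[:8000], labels[:8000])
    print("held-out accuracy:", probe.score(activations[8000:], labels[8000:]))

The labels array is exactly the supervision being lamented here: you have to decide in advance that "board occupancy" is the thing worth looking for.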
Whoa, this is super cool! I can imagine if we had something like this for ChatGPT, we could use it to do some serious prompt engineering. Imagine seeing what specific neurons you were activating with your prompt, and being able to identify which word in your prompt was triggering an undesired behavior. Super cool stuff, excited to see if it scales
World model might be too big a word here. When we talk of a world model (in the context of AI models), we refer to the model's understanding of the world, at least in the context we trained it in. But what I see is just a visualization of the output in a fashion similar to a chess board. Stronger evidence would be, for example, a map of the next move, which would show whether it truly understood the game's rules. If it shows probability larger than zero on illegal board squares, it will show us why it sometimes makes illegal moves. And obviously, it didn't fully understand the rules of the game.
Strictly speaking, it should be a mistake to assign a probability equal to zero to any moves, even for illegal board moves, but especially for an AI that learns by example and self-play. It never gets taught the rules, it only gets shown the games -- there's no reason that it should conclude that the probability of a rook moving diagonally is exactly zero just because it's never seen it happen in the data, and gets penalized in training every time it tries it.
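There's also a mechanical reason the model can't output an exact zero: the next-token distribution comes from a softmax, which is strictly positive for any finite logits,

    p_i = \frac{e^{z_i}}{\sum_j e^{z_j}} > 0

so "illegal" continuations can only be driven arbitrarily close to zero, never to zero itself.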
But even for a human, assigning probability of exactly zero is too strong. It would forbid any possibility that you misunderstand any rules, or forgot any special cases. It's a good idea to always maintain at least a small amount of epistemic humility that you might be mistaken about the rules, so that sufficiently overwhelmingly strong evidence could convince you that a move you thought was illegal turns out to be legal.
Every so often, I encounter someone saying that about some topic while also being wrong.
Also, it took me actually writing a chess game to learn about en passant capturing, the fifty-move rule (no capture or pawn move) forced draw, and the threefold-repetition forced draw.
That's exactly right. A probability of zero is a truly absurd degree of self-confidence. It would be like if someone continued to insist that en passant capturing is illegal, even while being told otherwise by official chess judges, being read multiple rulebooks, being shown records of historic chess games in which it was used, and so on. P=0 means one's mind truly cannot be changed by anything, which leaves one wondering how it got to that state in the first place!
Probably most of us even know about en passant, so we think we know everything. But if I found myself in that same bewildering situation, being talked down by a judge after an opponent moved their rook diagonally, I'd have to either admit I was wrong about knowing all the rules, or else at least wonder how and why such an epic prank was being coordinated against me!
But the topic is chess, which does have a small number of fixed rules. You not knowing about en passant or threefold repetition just means you never bothered to read all the rules. At some point, an LLM will learn the complete rule set.
> At some point, an LLM will learn the complete rule set.
Even if it does, it doesn't know that it has. And in principle, you can't know for sure if you have or not either. It's just a question of what odds you put on having learned a simplified version for all this time without having realised that yet. Or, if you're a professional chess player, the chance that right now you're dreaming and you're about to wake up and realise you dreamed about forgetting the 𐀀𐀁𐀂𐀃:𐀄𐀅𐀆𐀇𐀈𐀉 move that everyone knows (and you should've noticed because the text was all funny and you couldn't read it, which is a well-known sign of dreaming).
That many people act like things can be known 100% (including me) is evidence that humans quantise our certainty. My gut feeling is that anything over 95% likely is treated as certain, but this isn't something I've done any formal study in, and I'd assume that presentation matters to this number because nobody's[0] going to say that a D20 dice "never rolls a 1". But certainty isn't the same as knowledge, it's just belief[1].
[0] I only noticed at the last moment that this itself is an absolute, so I'm going to add this footnote saying "almost nobody".
[1] That said, I'm not sure what "knowledge" even is: we were taught the tripartite definition of "justified true belief", but as soon as it was introduced to us the teacher showed us the flaws, so I now regard "knowledge" as just the subjective experience of feeling like you have a justified true belief, where all that you really have backing up the feeling is a justified belief with no way to know if it's true, which obviously annoys a lot of people who want truth to be a thing we can actually access.
Say a white rook is on h7 and a white pawn is on g7.
Rook gets taken, then the pawn moves to g8 and promotes to a rook.
The rook kind of moved diagonally.
"Ah, when the two pieces are in this position, if you land on my rook, I have the option to remove my pawn from the board and then move my rook diagonally in front of where my pawn used to be."
There's got to be a probability cut-off, though. LLMs don't infinitely connect every token with every other token, some aren't connected at all, even if some association is taught, right?
The weights have finite precision which means they represent value-ranges / have error bars. So even if the weight is exactly 0 it does not represent complete confidence in it never occurring.
When relationships are represented implicitly by the magnitude of the dot product between two vectors, there's no particular advantage to not "creating" all relationships (i.e. enforcing orthogonality for "uncreated" relationships).
That’s not right; there are many vectors that go unbuilt between unrelated tokens. Creating a ton of empty relationships would obviously generate an immense amount of useless data.
Your links are not about actually orthogonal vectors, so they’re not relevant. Also that’s not what superposition is defined as in your own links:
> In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition
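A toy illustration of the quoted definition, not taken from the paper's code: in a high-dimensional space you can fit far more nearly-orthogonal directions than you have dimensions, which is what lets a network represent many more soft relationships than strict orthogonality would allow.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 512, 2048                      # 2048 "features" in a 512-dim space
    V = rng.standard_normal((n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)

    dots = V @ V.T
    np.fill_diagonal(dots, 0.0)
    print("max |cos| between distinct features:", np.abs(dots).max())  # roughly 0.2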