>That’s ignoring all the other more serious issues I raised.

The only other issue you raised doesn't make any sense. A world model is a representation/model of your environment that you use for predictions. Yes, an auto-encoder learns to model that data to some degree. To what degree is not well known. If we found out that it learned things like 'city x in country a is approximately distance b from city y, so just store where y is and unpack everything else when the need arises', then that would certainly qualify as a world model.
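
Something like this toy sketch, say (city names and numbers made up purely for illustration):

    # Toy version of "store one anchor city, derive the rest on demand".
    # City names and coordinates are made up for illustration.
    anchor = (52.5, 13.4)                                  # where city y is
    offsets = {"x": (-4.2, 2.3), "z": (1.1, -8.0)}         # learned relative positions

    def locate(city):
        dlat, dlon = offsets[city]
        return (anchor[0] + dlat, anchor[1] + dlon)

    print(locate("x"))   # unpacked only when needed: (48.3, 15.7)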


Linear regression also learns to model data to some degree. Using the term “world model” that expansively is intentionally misleading.

Besides that and the big red flag of not directly analyzing the performance of the predicted board state, I also said training a neural network to return a specific result is fishy, but that is a more minor point than the other two.


The degree matters. If we find auto-encoders learning surprisingly deep models then I have no problem saying they have a world model. It's not the gotcha you think it is.

>the big red flag of not directly analyzing the performance of the predicted board state I also said training a neural network to return a specific result is fishy

The idea that probes are some red flag is ridiculous. There are some things to take into account, but statistics is not magic. There's nothing fishy about training probes to inspect a model's internals. If the internals don't represent the state of the board then the probe won't be able to learn to reconstruct the state of the board. The probe only has access to the internals. You can't squeeze blood out of a rock.
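
For what it's worth, a probe in this sense is just a small classifier fit on frozen activations; a minimal sketch, assuming you've already dumped activations and true board states to disk (file names and the tile index below are placeholders):

    # Minimal linear-probe sketch. `acts` (hidden activations per position) and
    # `boards` (true tile states) are placeholders for data you'd extract yourself.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    acts = np.load("othello_gpt_activations.npy")    # shape (n_positions, d_model), hypothetical file
    boards = np.load("board_tile_states.npy")        # shape (n_positions, 64), values 0/1/2 per tile

    tile = 27                                         # probe one tile at a time
    X_tr, X_te, y_tr, y_te = train_test_split(acts, boards[:, tile],
                                              test_size=0.2, random_state=0)

    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out probe accuracy:", probe.score(X_te, y_te))
    # If the activations carry no board information, this stays near the
    # majority-class baseline: the probe can't invent signal that isn't there.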


I don’t know what makes a “surprisingly deep model”, but I specifically chose autoencoders to show that simply encoding the state internally can be trivial, which makes that definition of “world model” vacuous. If you want to add additional stipulations or some measure of degree, you have to make an argument for that.

In this case specifically “the degree” is pretty low since predicting moves is very close to predicting board state (because for one you have to assign zero probability to moves to occupied positions). That’s even if you accept that world models are just states, which, as mtburgess explained, is not reasonable.
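
A toy illustration of that overlap, with made-up logits rather than the model's actual output:

    # A legal-move predictor has to zero out occupied squares, so its output
    # already encodes part of the board. Logits and occupied squares are made up.
    import numpy as np

    logits = np.random.randn(64)                 # hypothetical next-move scores, one per square
    occupied = np.zeros(64, dtype=bool)
    occupied[[27, 28, 35, 36]] = True            # e.g. the four starting discs

    logits[occupied] = -np.inf                   # occupied squares get zero probability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Squares with (near-)zero probability reveal "occupied or not", i.e. a chunk
    # of board state can be read straight off the move distribution.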

Further, if you read what I wrote, I didn’t say internal probes are a big red flag (I explicitly called that the minor problem). I said not directly evaluating how well the putative internal state matches the actual state is. And you can “squeeze blood out of a rock”: it’s the multiple comparison problem and it happens in science all the time and it is what you are doing by training a neural network and fishing for the answer you want to see. This is a very basic problem in statistics and has nothing to do with “magic”. But again, all of this is the minor problem.


>In this case specifically “the degree” is pretty low since predicting moves is very close to predicting board state (because for one you have to assign zero probability to moves to occupied positions).

The depth/degree or whatever is not about what is close to the problem space. The blog above spells out the distinction between a 'world model' and 'surface statistics'. The point is that Othello GPT is not in fact playing Othello by 'memorizing a long list of correlations' but by modelling the rules and states of Othello and using that model to make a good prediction of the next move.

>I said not directly evaluating how well the putative internal state matches the actual state is.

This is evaluated in the actual paper with the error rates using the linear and non-linear probes. It's not a red flag that a precursor blog wouldn't have such things.

>And you can “squeeze blood out of a rock”: it’s the multiple comparison problem and it happens in science all the time and it is what you are doing by training a neural network and fishing for the answer you want to see.

The multiple comparison problem is only a problem when you're trying to run multiple tests on the same sample. Obviously don't test your probe on states you fed it during training and you're good.


> The point is that Othello GPT is not in fact playing Othello by 'memorizing a long list of correlations' but by modelling the rules and states of Othello and using that model to make a good prediction of the next move.

I don't know how you rule out "memorizing a long list of correlations" from the results. The big discrepancy in performance between their synthetic/random-data training and human-data training suggests to me the opposite: random board states are more statistically nice/uniform, which suggests that these are in fact correlations, not state computations.

> This is evaluated in the actual paper with the error rates using the linear and non-linear probes. It's not a red flag that a precursor blog wouldn't have such things.

It's the main claim/result! Presumably the reason it is omitted from the blog is that the results are not good: nearly 10% error per tile. Othello boards are 64 tiles, so the board-level error rate (assuming independent errors) is 99.88%.
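
The arithmetic, taking the ~10% per-tile error at face value and assuming independence:

    # Back-of-the-envelope for the figure above: ~10% per-tile error, 64 tiles,
    # errors assumed independent.
    per_tile_error = 0.10
    board_correct = (1 - per_tile_error) ** 64
    print(board_correct)         # ~0.0012  (chance the whole board is right)
    print(1 - board_correct)     # ~0.9988  -> the 99.88% board-level error rate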

> The multiple comparison problem is only a problem when you're trying to run multiple tests on the same sample. Obviously don't test your probe on states you fed it during training and you're good.

In practice, what is done is that you keep re-running your test/validation loop with different hyperparameters until the validation result looks good. That's "running multiple tests on the same sample".
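
A toy simulation of that selection effect (pure-noise data, a made-up "configuration" knob, nothing from the paper's actual setup):

    # The features are pure noise, so no configuration is genuinely better, yet
    # picking the best validation score out of many tries still looks good --
    # and doesn't hold up on fresh data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_tr,  y_tr  = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
    X_val, y_val = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)
    X_new, y_new = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)

    best_val, best_model, best_cols = -1.0, None, None
    for _ in range(50):                                   # 50 "hyperparameter" tries
        cols = rng.choice(20, size=10, replace=False)     # stand-in for a config knob
        m = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
        v = m.score(X_val[:, cols], y_val)
        if v > best_val:
            best_val, best_model, best_cols = v, m, cols

    print("best validation accuracy:", best_val)          # optimistically above 0.5
    print("same model on fresh data:", best_model.score(X_new[:, best_cols], y_new))  # ~0.5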
