Oh, I could imagine many things that would demonstrate this. The simplest evidence would be that the model is, in some mechanically plausible way, forming thoughts before (or even in conjunction with) the language to represent them. This is the opposite of how the vanilla transformer models work now: they exclusively model the language first, and then incidentally, the world.
nb., this is not the only way one could achieve this. I'm just saying this is one set of things that, if I saw it, it would immediately catch my attention.
Transformers, like other deep neural networks, have many hidden layers before the output. Are you certain that those hidden layers aren't modeling the world first before choosing an output token? Deep neural networks (including transformers) trained on board games have been found to develop an internal representation of the board state. (e.g., https://arxiv.org/pdf/2309.00941)
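In case it helps make that concrete: the usual way this gets established is with a probe. You freeze the trained network, record its hidden activations over games, and train a small linear classifier to read the board state back out of them. A rough PyTorch sketch, with random placeholder tensors standing in for the cached activations and labels (the shapes and the empty/mine/theirs labeling are my assumptions for illustration, not the paper's exact setup):

    import torch
    import torch.nn as nn

    # Placeholder shapes: d_model-dimensional hidden states cached from a frozen
    # game-playing transformer, plus per-square board labels (empty/mine/theirs).
    d_model, n_squares, n_classes = 512, 64, 3
    n_train = 10_000

    # Random stand-ins; in a real probe these come from running the frozen model
    # over game transcripts and recording the true board state at each move.
    hidden_states = torch.randn(n_train, d_model)                # activations at one layer
    board_labels = torch.randint(0, n_classes, (n_train, n_squares))

    # A *linear* probe: one logistic-regression head per square, no hidden layers,
    # so whatever accuracy it reaches reflects structure already in the activations.
    probe = nn.Linear(d_model, n_squares * n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(1_000):
        logits = probe(hidden_states).view(n_train, n_squares, n_classes)
        loss = loss_fn(logits.reshape(-1, n_classes), board_labels.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Held-out accuracy well above chance (and above a probe trained on shuffled
    # labels) is the usual evidence that the board is represented internally.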
On the contrary, it is clear to me they definitely ARE modeling the world, either directly or indirectly. I think basically everyone knows this; that is not the problem, to me.
What I'm asking is whether we really have enough evidence to say the models are "alignment faking." And my position, in response to the replies above, is that we do not have evidence strong enough to suggest this is true.
Oh, I see. I misunderstood what you meant by "they exclusively model the language first, and then incidentally, the world." But assuming you mean that they develop their world model incidentally through language, is that very different than how I develop a mental world-model of Quidditch, time-turner time travel, and flying broomsticks through reading Harry Potter novels?
The main consequence for the models is that whatever they want to learn about the real world has to be learned, indirectly, through an objective function that primarily models things that are mostly irrelevant, like English syntax. This is the reason why it is relatively easy to teach models new "facts" (real or fake) but empirically and theoretically harder to get them to reliably reason about which "facts" are and aren't true: a lot, maybe most, of the "space" in a model is taken up by information related to either syntax or polysemy (words that mean different things in different contexts), leaving very little left over for models of reasoning, or whatever else you want.
Ultimately, this could be mostly fine, except that the resources for representing what is learned are not infinite, and in a contest between storing knowledge about the "language" and anything else, the models "generally" (with some complications) will prefer to store knowledge about the language, because that's what the objective function requires.
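To spell out what I mean by "what the objective function requires": pretraining only ever scores next-token prediction,

    \mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t}),

so a piece of world knowledge only gets stored to the extent that it lowers that same token-level loss, in direct competition with syntax, spelling, and disambiguating word senses.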
It gets a little more complicated when you consider stuff like RLHF (which often rewards world modeling) and ICL (in which the model extrapolates from the prompt) but more or less it is true.
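And by ICL I just mean few-shot prompting; a toy illustration (with a made-up task) of the extrapolation-from-the-prompt I have in mind:

    # Toy illustration of in-context learning: nothing here is trained or fine-tuned.
    # The model is expected to pick up the pattern from the examples in the prompt
    # alone and extend it to the final, unanswered line at inference time.
    few_shot_prompt = "\n".join([
        "word: cold  -> opposite: hot",
        "word: tall  -> opposite: short",
        "word: early -> opposite: late",
        "word: heavy -> opposite:",   # left for the model to complete
    ])
    print(few_shot_prompt)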
That's a nicely clear ask but I'm not sure why it should be decisive for whether there's genuine depth of thought (in some sense of thought). It seems to me like an open empirical question how much world modeling capability can emerge from language modeling, where the answer is at least "more than I would have guessed a decade ago." And if the capability is there, it doesn't seem like the mechanics matter much.
I think the consensus is that the general-purpose transformer-based pretrained models like gpt4 are roughly as good as they’re going to get. o1 seems like it will be slightly better in general. So I think it’s fair to say the capability is not there, and even if it were, the reliability is not going to be there either, in this generation of models.
It might be true that pretraining scaling is out of juice - I'm rooting for that outcome to be honest - but I don't think it's "consensus". There's a lot of money being bet the other way.
It is the consensus. I can’t think of a single good researcher I know who doesn’t think this. The last holdout might have been Sutskever, and at NeurIPS he said pretraining as we know it is ending because we ran out of data and synthetics can’t save it. If you have an alternative proposal for how it avoids death, I’d love to hear it, but currently there are 0 articulated plans that seem plausible, and I will bet money that this is true of most researchers.