While it's true that language models are fundamentally based on statistical patterns in language, characterizing them as mere "probabilistic syllable generators" significantly understates their capabilities and functional intelligence.
These models can engage in multistep logical reasoning, solve complex problems, and generate novel ideas - going far beyond simply predicting the next syllable. They can follow intricate chains of thought and arrive at non-obvious conclusions. And OpenAI has now shown us that fine-tuning a model specifically to plan step by step dramatically improves its ability to solve problems that were previously the domain of human experts.
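To make the "plan step by step" idea concrete, here's a minimal sketch using plain chain-of-thought prompting with the official OpenAI Python SDK - to be clear, this is just prompting an off-the-shelf chat model to reason in steps, not the reinforcement fine-tuning behind o1, and the model name is only a placeholder:

```python
# Toy illustration of "plan step by step" via prompting (ordinary
# chain-of-thought prompting, not OpenAI's o1 training procedure).
# Assumes the official `openai` package and an OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()

question = "A train leaves at 14:10 and arrives at 17:45. How long is the trip?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model; this name is just an example
    messages=[
        {"role": "system",
         "content": "Work through the problem step by step before giving "
                    "a final answer on its own line."},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```

Even this crude prompted version surfaces intermediate reasoning; the point of o1 is training the model to do that planning well rather than relying on the prompt.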
Although there is no definitive evidence that state-of-the-art language models have a comprehensive "world model" in the way humans do, several studies and observations suggest that large language models (LLMs) may possess some elements or precursors of a world model.
For example, Gurnee and Tegmark [1] found that LLMs learn linear representations of space and time across multiple scales. These representations appear to be robust to prompting variations and unified across different entity types. This suggests that modern LLMs may learn rich spatiotemporal representations of the real world, which could be considered basic ingredients of a world model.
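For anyone who wants to see what "linear representation" means operationally, here is a minimal sketch of the probing setup in the spirit of that paper (not the authors' actual code) - the activation and coordinate arrays below are random placeholders standing in for real hidden states and real entity labels:

```python
# Minimal sketch of a linear probe in the spirit of Gurnee & Tegmark [1].
# `activations` stands in for hidden states (n_entities x d_model) taken
# from one transformer layer; `coords` stands in for known lat/lon labels
# of the same entities. Both arrays here are random placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 4096))    # placeholder hidden states
coords = rng.uniform(-90, 90, size=(1000, 2))  # placeholder lat/lon labels

X_tr, X_te, y_tr, y_te = train_test_split(activations, coords, random_state=0)

probe = Ridge(alpha=1.0).fit(X_tr, y_tr)  # a purely linear map W x + b
print("held-out R^2:", probe.score(X_te, y_te))
# On real activations, a high held-out R^2 from a *linear* probe is the
# evidence that the information sits in an (approximately) linear subspace
# of the model's residual stream, rather than being inferred by the probe.
```

The reason the probe is restricted to a linear map is precisely so that any predictive success can be attributed to the model's representation rather than to the probe doing the work itself.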
And even if we look at much smaller models like Stable Diffusion XL, it's clear that they encode a rich understanding of optics [2] within just a few billion parameters (3.5 billion to be precise). Generative video models like OpenAI's Sora clearly have a world model as they are able to simulate gravity, collisions between objects, and other concepts necessary to render a coherent scene.
As for AGI, the consensus on Metaculus is that it will arrive in the early 2030s. But consider that before GPT-4 arrived, the consensus was that full AGI would not come until 2041 [3]. The consensus arrival date for "weakly general" AGI (i.e., AGI without a robotic, physical-world component) is 2027 [4]. The best tool for achieving AGI is the transformer and its derivatives, and its scaling keeps going with no end in sight.
> Generative video models like OpenAI's Sora clearly have a world model as they are able to simulate gravity, collisions between objects, and other concepts necessary to render a coherent scene.
I won't expand on the rest, but this is simply nonsensical.
The fact that Sora generates output that matches its training data doesn't show that it has a concept of gravity, collisions between objects, or anything else. It has a "world model" the same way a photocopier has a "document model".
My suspicion is that you're leaving some important parts of your logic unstated, such as a belief in some magical property within humans called "understanding", which you don't define.
The ability of video models to generate novel video consistent with physical reality shows that they have extracted important invariants - physical law - out of the data.
It's probably better not to muddle the discussion with ill-defined terms such as "intelligence" or "understanding".
I have my own beef with the "AGI is nigh" crowd, but this criticism amounts to word play.
It feels like, if these image and video generation models were really recovering fundamental laws from the training data, they should at least be able to re-create an image from a different angle.
"Allegory of the cave" comes to mind, when trying to describe the understanding that's missing from diffusion models. I think a super-model with such qualifications would require a number of ControlNets in a non-visual domains to be able to encode understanding of the underlying physics. Diffusion models can render permutations of whatever they've seen fairly well without that, though.
I'm very familiar with the allegory of the cave, but I'm not sure I understand where you're going with the analogy here.
Are you saying that it is not possible to learn about dynamics in a higher dimensional space from a lower dimensional projection? This is clearly not true in general.
E.g., even though video models only ever see and output 2d data, they learn that objects have different sides, in a fashion that is consistent with our 3d reality.
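As a toy illustration that 2d observations can pin down higher-dimensional structure - my own example, not a claim about how video models are implemented - here is the classic rank argument behind Tomasi-Kanade structure-from-motion in a few lines of numpy:

```python
# Toy demonstration that 2-D projections of a rigid 3-D scene retain the
# scene's 3-D structure (rank-3 measurement matrix, as in Tomasi-Kanade).
# This illustrates the general point; it is not how video models work.
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(3, 50))          # 50 random 3-D points

def random_orthographic_view(pts, rng):
    # Random rotation via QR, then drop the depth axis (orthographic camera).
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return (q @ pts)[:2]                   # 2 x 50 image coordinates

# Stack 20 different 2-D views into a 40 x 50 measurement matrix.
views = [random_orthographic_view(points, rng) for _ in range(20)]
W = np.vstack(views)

singular_values = np.linalg.svd(W, compute_uv=False)
print(singular_values[:5].round(3))
# Only the first 3 singular values are non-negligible: the 2-D views jointly
# pin down a 3-D structure (up to an ambiguity), even though no single view
# contains depth.
```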
The distinction you (and others in this thread) are making is purely one of degree - how much generalization has been achieved, and how well - rather than one of category.
Not only are we within eyesight of the end, we're more or less there. o1 isn't just another 10x scale-up of parameter count to make GPT-5, because at this point on the scaling curve relating parameter count to model performance, that's no longer an effective approach.
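To put a rough number on the diminishing returns, here's a back-of-the-envelope sketch using the parameter-only power-law fit reported by Kaplan et al. (2020); the parameter counts below are placeholders, and frontier models certainly deviate from this simple fit:

```python
# Back-of-the-envelope: how much does another 10x in parameter count buy,
# under the parameter-only scaling fit from Kaplan et al. (2020)?
#   L(N) ~ (N_c / N) ** alpha_N,  alpha_N ~= 0.076, N_c ~= 8.8e13
# The parameter counts below are placeholders, not actual model sizes.
ALPHA_N = 0.076
N_C = 8.8e13

def loss(n_params: float) -> float:
    return (N_C / n_params) ** ALPHA_N

for n in (1.75e11, 1.75e12, 1.75e13):   # 175B, 1.75T, 17.5T parameters
    print(f"N = {n:.2e}  predicted loss ~ {loss(n):.3f}")
# Each 10x in parameters shaves only ~16% off this loss term
# (10 ** -0.076 ~= 0.84), which is why pure scale-ups get less attractive.
```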
I agree with the broader point: for all we know, it's consistent with current neuroscience that our brains are doing nothing more than predicting their next inputs in a broadly similar way, and any categorical distinction between AI and human intelligence seems quite hard to draw.
I disagree that we can draw a line from scaling current transformer models to AGI, however. A model that is great for communicating with people in natural language may not be the best for deep reasoning, abstraction, unified creative visions over long-form generations, motor control, planning, etc. The history of computer science is littered with simple extrapolations from existing technology that completely missed the need for a paradigm shift.
The fact that OpenAI created and released o1 doesn't mean they won't also keep scaling models up, or that they don't think scaling is their best hope. Plenty has been said implying that they do.
I definitely agree that AGI isn't just a matter of scaling transformers, and also, as you say, that they "may not be the best" for such tasks. (Vanilla transformers are extremely inefficient.) But the really important point is that transformers can abstract, reason, form world models and theories of mind, etc., to a significant degree - a much greater degree than virtually anyone would have predicted 5-10 years ago - and all of it learnt automatically. It shows these problems are actually tractable for connectionist machine learning, without the paradigm shift you and many others allege is needed. That is the part I disagree with. But more breakthroughs are needed.
To wit: OpenAI was until quite recently investigating having TSMC build a dedicated semiconductor fab to produce OpenAI chips [1]:
(Translated from Chinese)
> According to industry insiders, OpenAI originally negotiated actively with TSMC to build a dedicated wafer fab, but shelved that plan after evaluating the costs and benefits. Instead, OpenAI has strategically sought cooperation with American companies such as Broadcom and Marvell to develop its own ASIC chips, and it is expected to become one of Broadcom's top four customers.
Even if OpenAI doesn't build its own fab - a wise move, if you ask me - the investment required to develop an ASIC on the very latest node is eye-watering. Most people - even people in tech - just don't have a good understanding of how "out there" semiconductor manufacturing has become. It's basically a dark art at this point.
For instance, TSMC themselves [2] don't even know at this point whether the A16 node chosen by OpenAI will require the forthcoming High-NA lithography machines from ASML. The High-NA machines cost nearly twice as much as the already exceptionally expensive Extreme Ultraviolet (EUV) machines do - at close to $400M each, the price is simply staggering.
I'm sure some gurus here on HN have a more up-to-date picture of the situation around A16, but the fundamental point is this: if OpenAI doesn't think scaling will be needed to get to AGI, why would they be considering spending many billions on the latest semiconductor tech?
Citations:
[1] https://paperswithcode.com/paper/language-models-represent-s...
[2] https://www.reddit.com/r/StableDiffusion/comments/15he3f4/el...
[3] https://www.metaculus.com/questions/5121/date-of-artificial-...
[4] https://www.metaculus.com/questions/3479/date-weakly-general...