Ask HN: Any insider takes on Yann LeCun's push against current architectures?
385 points by vessenes 10 days ago | 325 comments
So, LeCun has been quite public saying that he believes LLMs will never fix hallucinations because, essentially, the token choice method at each step leads to runaway errors -- these can't be damped mathematically.

Instead, he offers the idea that we should have something that is an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and if there's any engineering done around it. I can't find much after the release of I-JEPA from his group.

Okay I think I qualify. I'll bite.

LeCun's argument is this:

1) You can't learn an accurate world model just from text.

2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

He and people like Hinton and Bengio have been saying for a while that there are tasks that mice can understand that an AI can't. And that even having mouse-level intelligence would be a breakthrough, but we cannot achieve that through language learning alone.

A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.

LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.

The energy minimization architecture is more about joint multimodal learning.

(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)
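
To make the partition-function point concrete, here's a toy sketch (my own illustration; the energy function and the tiny bit-vector space are made up): the probabilistic view needs Z, which is only computable here because the space is small, whereas the energy view just minimizes E directly.

    import itertools, math

    def energy(x):
        # toy energy: prefer bit-vectors whose neighbouring bits agree
        return sum(abs(a - b) for a, b in zip(x, x[1:]))

    space = list(itertools.product([0, 1], repeat=10))   # 2^10 configs, tiny on purpose
    Z = sum(math.exp(-energy(x)) for x in space)          # partition function; intractable at scale

    def prob(x):
        return math.exp(-energy(x)) / Z                   # probabilistic view: needs Z

    best = min(space, key=energy)                         # energy view: just minimize E, no Z required
    print(best, energy(best), prob(best))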


Is that what he's arguing? My perspective on what he's arguing is that LLMs effectively rely on a probabilistic approach to choosing the next token based on the previous ones. When they're wrong, which the technology all but ensures will happen with some significant frequency, you get cascading errors. It's like in science where we all build upon the shoulders of giants, but if it turns out that one of those shoulders was simply wrong, somehow, then everything built on top of it becomes increasingly absurd. E.g. how the assumption of a geocentric universe inevitably leads to epicycles, which leads to ever more elaborate, and plainly wrong, 'outputs.'
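
To put a rough number on that cascading-error intuition (a back-of-envelope sketch, not LeCun's actual math; it assumes independent per-token errors, which real decoding does not satisfy, and only illustrates the geometric decay):

    per_token_accuracy = 0.999
    for n_tokens in (100, 1000, 10000):
        p_flawless = per_token_accuracy ** n_tokens
        print(f"{n_tokens:6d} tokens -> P(no error anywhere) ~ {p_flawless:.3g}")
    # roughly 0.905, 0.368, and 4.5e-05 respectively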

Without any 'understanding' or knowledge of what they're saying, they will remain irreconcilably dysfunctional. Hence the typical pattern with LLMs:

---

How do I do [x]?

You do [a].

No that's wrong because reasons.

Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [b].

No that's also wrong because reasons.

Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [a].

FML

---

More advanced systems might add a c or a d, but it's just more noise before repeating the same pattern. DeepSeek's more visible (and lengthy) reasoning demonstrates this perhaps most clearly. It just can't stop coming back to the same wrong (but statistically probable) answer, and so ping-ponging off that (which it at least acknowledges is wrong due to user input) makes up basically the entirety of its reasoning phase.


on "stochastic parrots"

Table stakes for sentience: knowing when the best answer is not good enough.. try prompting LLMs with that..

It's related to LeCun's (and Ravid's) subtle question I mentioned in passing below:

To Compress Or Not To Compress?

(For even a vast majority of Humans, except tacitly, that is not a question!)


Right now, humans still have enough practice thinking to point out the errors, but what happens when humanity becomes increasingly dependent on LLMs to do this thinking?

Over the last few years I’ve become exceedingly aware of how insufficient language really is. It feels like a 2D plane, and no matter how many projections you attempt to create from it, they are ultimately limited in the fidelity of the information transfer.

Just a lay opinion here but to me each mode of input creates a new, largely orthogonal dimension for the network to grow into. The experience of your heel slipping on a cold sidewalk can be explained in a clinical fashion, but an android’s association of that to the powerful dynamic response required to even attempt to recover will give a newfound association and power to the word ‘slip’.


This exactly describes my intuition as well. Language is limited by its representation, and we have to jam so many bits of information into one dimension of text. It works well enough to have a functioning society, but it’s not very precise.

LLM is just the name. You can encode anything into the "language", including pictures, video and sound.

> You can encode anything into the "language"

I'm just a layman here, but I don't think this is true. Language is an abstraction, an interpretive mechanism of reality. A reproduction of reality, like a picture, by definition holds more information than its abstraction does.


I think his point is that LLMs are pre-trained transformers. And pre-trained transformers are general sequence predictors. Those sequences started out as text or language only but by no means is the architecture constrained to text or language alone. You can train a transformer that embeds and predicts sound and images as well as text.
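
A minimal sketch of that point (the codebook, shapes and ids below are invented for illustration; real systems learn a VQ-style codebook or feed patch embeddings directly): once image patches map to integer ids, they can sit in the same sequence as text token ids and be predicted the same way.

    import numpy as np

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(512, 16))              # 512 visual "tokens", 16-dim each (made up)

    def quantize_patch(patch_vec):
        # nearest codebook entry -> integer id, just like a text token id
        return int(np.argmin(np.linalg.norm(codebook - patch_vec, axis=1)))

    image_patches = rng.normal(size=(64, 16))           # e.g. an 8x8 grid of patch features
    visual_tokens = [quantize_patch(p) for p in image_patches]

    text_tokens = [101, 2023, 2003, 1037, 3392, 102]    # placeholder ids from some text tokenizer
    sequence = text_tokens + visual_tokens               # one sequence a plain transformer can model
    print(len(sequence), sequence[:10])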

A picture is also an abstraction. If you take a picture of a tree, you have more details than the word "tree". What I think the parent is saying is that all the information in a picture of a tree can be encoded in language, for example a description of a tree, using words. Both are abstractions, but if you describe the tree well enough with text (and comprehend the description) it might have the same "value" as a picture (not for a human, but for a machine). Also, the size of the text describing the tree might be smaller than the picture.

> all the information in a picture of a tree can be encoded in language

What words would you write that would as uniquely identify this tree from any other tree in the world, like a picture would?

Now repeat for everything in the picture, like the time of day, weather, dirt on the ground, etc.


I've always been wondering if anyone is working on using nerve impulses. My first thought when transformers came around was if they could be used for prosthetics, but I've been too lazy to do the research to find anybody working on anything like that, or to experiment myself with it.

There are a few folks working on this in neuroscience, e.g. training transformers to "decode" neural activity (https://arxiv.org/abs/2310.16046). It's still pretty new and a bit unclear what the most promising path forward is, but will be interesting to see where things go. One challenge that gets brought up a lot is that neuroscience data is often high-dimensional and with limited samples (since it's traditionally been quite expensive to record neurons for extended periods), which is a fairly different regime from the very large data sets typically used to train LLMs, etc.

There are ‘spiking neural networks’ that operate in a manner that more closely emulates how neurons communicate. One idea I think that is interesting to think about is that we build a neural network that operates in a way that is effectively ‘native’ to our mind, so it’s less like there’s a hidden keyboard and screen in your brain, but that it simply becomes new space you can explore in your mind.

Or learn kung fu.


Like Cortical labs? Neurons integrated on a silicon chip https://corticallabs.com/cl1.html

When you train a neural net for Donkeycar with camera images plus the joystick commands of the driver, isn't that close to nerve impulses already?

> I've always been wondering if anyone is working on using nerve impulses. My first thought when transformers came around was if they could be used for prosthetics

Neuralink. Musk warning though.

For reference, see Neuralink Launch Event at 59:33 [0], and continue watching through until Musk takes over again. The technical information there is highly relevant to a multi-modal AI model with sensory input/output.

https://youtu.be/r-vbh3t7WVI?t=3575


Great, but how do you imagine multimodal working with text and video, just those 2 for simplicity? What will be in the training set? With text, the model tries to predict the next token, and then more steps were added on top. But what do you do with multimodal?

Doesn't language itself encode multimodal experiences? Take this case: when we write text, we have the skill and opportunity to encode the visual, tactile, and other sensory experiences into words. And the fact is, LLMs trained on massive text corpora are indirectly learning from human multimodal experiences translated into language. This might be less direct than firsthand sensory experience, but potentially more efficient by leveraging human-curated information. Text can describe simulations of physical environments. Models might learn physical dynamics through textual descriptions of physics, video game logs, scientific papers, etc. A sufficiently comprehensive text corpus might contain enough information to develop reasonable physical intuition without direct sensory experience.

As I'm typing this, one thing I'm coming to understand is that the quality and completeness of the data fundamentally determines how well an AI system will work, and with just text this is hard to achieve, so a multimodal experience is a must.

thank you for explaining it in very simple terms that I could understand


No.

> The sun feels hot on your skin.

No matter how many times you read that, you cannot understand what the experience is like.

> You can read a book about Yoga and read about the Tittibhasana pose

But by just reading you will not understand what it feels like. And unless you are in great shape and with great balance you will fail for a while before you get it right (which is only human).

I have read what shooting up with heroin feels like. From a few different sources. I'm certain that I will have no real idea unless I try it (and I don't want to do that).

Waterboarding. I have read about it. I have seen it on TV. I am certain that it is all abstract compared to having someone do it to you.

Hand-eye coordination, balance, color, taste, pain, and so on. How we encode things draws on all senses, state of mind, and experiences up until that time.

We also forget and change what we remember.

Many songs take me back to a certain time, a certain place, a certain feeling. Taste is the same. Location too.

The way we learn and the way we remember things is incredibly more complex than text.

But if you have shared experiences, then when you write about them, other people will know. Most people have felt the sun hot on their skin.

To different extents this is also true for animals. Now I don't think most mice can read, but they do learn with many different senses, and remember some combination or permutation.


Even beyond sensations (which are never described except circumstantially, as in "the taste of chocolate" says nothing of the taste, only of the circumstances in which the sensation is felt), it's very often that people don't understand something another person says (typically a work of art) until they have lived the relevant experiences to connect to the meaning behind it (whatever the medium of communication).

I can't see as much color as a mantis shrimp or sense electric fields like a shark but I still think I'm closer to AGI than they are

> No.

Huh, text definitely encodes multimodal experiences, it's just not as accurate or as rich an encoding as the encodings of real sensations.


I don't think GP is asserting that the multimodal encoding is "more rich" or "more accurate", I think they are saying that the felt modality is a different thing than the text modality entirely, and that the former isn't contained in the latter.

Text describes semantic space. Not everything maps to semantic space losslessly.

It's just a description, not an encoding.

Language encodes what people need it to encode to be useful. I heard of an example of colors--there are some languages that don't even have a word for blue.

https://blog.duolingo.com/color-words-around-the-world/


> text definitely encodes multimodal experiences

Perhaps, but only in the same sense that brown and green wax on paper "encodes" an oak tree.


Doesn't this imply that the future of AGI lies not just in vision and text but in tactile feelings and actions as well ?

Essentially, engineering the complete human body and mind including the nervous system. Seems highly intractable for the next couple of decades at least.


Yes it’s why robotics is so exciting right now

All of these "experiences" are encoded in your brain as electricity. So "text" can encode them, though English words might not be the proper way to do it.

We don't know how memories are encoded in the brain, but "electricity" is definitely not a good enough abstraction.

And human language is a mechanism for referring to human experiences (both internally and between people). If you don't have the experiences, you're fundamentally limited in how useful human language can be to you.

I don't mean this in some "consciousness is beyond physics, qualia can't be explained" bullshit way. I just mean it in a very mechanistic way: language is like an API to our brains. The API allows us to work with objects in our brain, but it doesn't contain those objects itself. Just like you can't reproduce, say, the Linux kernel just by looking at the syscall API, you can't replace what our brains do by just replicating the language API.


No, text can only refer to them. There is not a text on this planet that encodes what the heat of the sun feels like on your skin. A person who had never been outdoors could never experience that sensation by reading text.

> There is not a text on this planet that encodes what the heat of the sun feels like on your skin.

> A person who had never been outdoors could never experience that sensation by reading text.

I don't think the latter implies the former as obviously as you make it out to be. Unless you believe in some sort of metaphysical description of humans, you can certainly encode the feeling (as mentioned in another comment, it will be reduced to electrical signals after all). The only question is how much storage you need for that encoding to get what precision. However, the latter statement, if true, is simply constrained by your input device to the brain, i.e. you cannot transfer your encoding to the hardware (in this case a human brain) via reading or listening. There could be higher-bandwidth interfaces like Neuralink that may do that to a human brain, and in the case of AI, an auxiliary device might not be needed and the encoding would be directly mmap'd.


Electrical signals are not the same as subjective experiences. While a machine may be able to record and play back these signals for humans to experience, that does not imply that the experiences themselves are recorded nor that the machine has any access to them.

A deaf person can use a tape recorder to record and play back a symphony but that does not encode the experience in any way the deaf person could share.


That’s some strong claims, given that philosophers (e.g. Chalmers vs Dennett) can’t even agree whether subjective experiences exist or not.

Even if you’re a pure Dennettian functionalist you still commit to a functional difference between signals in transit (or at rest) and signals being processed and interpreted. Holding a cassette tape with a recording of a symphony is not the same as hearing the symphony.

Applying this case to AI gives rise to the Chinese Room argument. LLMs’ propensity for hallucinations invite this comparison.


Are LLMs having subjective experiences? Surely not. But if you claim that human subjective experiences are not the result of electrical signals in the brain, then what exactly is your position? Dualism?

Personally, I think the Chinese room argument is invalid. In order for the person in the room to respond to any possible query by looking up the query in a book, the book would need to be infinite and therefore impossible as a physical object. Otherwise, if the book is supposed to describe an algorithm for the person to follow in order to compute a response, then that algorithm is the intelligent entity that is capable of understanding, and the person in the room is merely the computational substrate.


The Chinese Room is a perfect analogy for what's going on with LLMs. The book is not infinite, it's flawed. And that's the point: we keep bumping into the rough edges of LLMs with their hallucinations and faulty reasoning because the book can never be complete. Thus we keep getting responses that make us realize the LLM is not intelligent and has no idea what it's saying.

The only part where the book analogy falls down has to do with the technical implementation of LLMs, with their tokenization and their vast sets of weights. But that is merely an encoding for the training data. Books can be encoded similarly by using traditional compression algorithms (like LZMA).


>The book is not infinite, it's flawed.

Oh, and the human book is surely infinite and unflawed, right?

>we keep bumping into the rough edges of LLMs with their hallucinations and faulty reasoning

Both things humans also do in excess

The Chinese Room is nonsensical. Can you point to any part of your brain that understands English ? I guess you are a Chinese Room then.


Humans have the ability to admit when they do not know something. We say “sorry, I don’t know, let me get back to you.” LLMs cannot do this. They either have the right answer in the book or they make up nonsense (hallucinate). And they do not even know which one they’re doing!

>Humans have the ability to admit when they do not know something.

No not really. It's not even rare that a human confidently says and believes something and really has no idea what he/she's talking about.

>We say “sorry, I don’t know, let me get back to you.” LLMs cannot do this

Yeah they can. And they can do it much better than chance. They just don't do it as well as humans.

>And they do not even know which one they’re doing!

There's plenty of research that suggests otherwise.

https://news.ycombinator.com/item?id=41418486


> No not really. It's not even rare that a human confidently says and believes something and really has no idea what he/she's talking about.

Like you’re doing right now? People say “I don’t know” all the time. Especially children. That people also exaggerate, bluff, and outright lie is not proof that people don’t have this ability.

When people are put in situations where they will be shamed or suffer other social stigmas for admitting ignorance then we can expect them to be less than candid.

As for your links to research showing that LLMs do possess the ability of introspection, I have one question: why have we not seen this in consumer-facing tools? Are the LLMs afraid of social stigma?


>Like you’re doing right now?

Lol Okay

>When people are put in situations where they will be shamed or suffer other social stigmas for admitting ignorance then we can expect them to be less than candid.

Good thing I wasn't talking about that. There's a lot of evidence that human explanations are regularly post-hoc rationalizations they fully believe in. They're not lying to anyone, they just fully believe the nonsense their brain has concocted.

Experiments on choice and preferences https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196841/

Split Brain Experiments https://www.nature.com/articles/483260a

>As for your links to research showing that LLMs do possess the ability of introspection, I have one question: why have we not seen this in consumer-facing tools? Are the LLMs afraid of social stigma?

Maybe read any of them? If you weren't interested in evidence to the contrary of your points then you could have just said so and I wouldn't have wasted my time. The 1st and 6th links make it quite clear that current post-training processes hurt calibration a lot.


In this case, there kind of is. It’s ‘spicy’. The TRPV1 receptor is activated by capsaicin as if it were being activated by intense heat.

If texts conveyed the actual message - e.g. the text "This spice is very hot" - the reader's tongue should feel the heat! Since that doesn't happen, it is only for us to imagine. However, AI doesn't imagine the feeling/emotion - at least we don't know that yet.

> No matter how many times you read that, you cannot understand what the experience is like.

OK, so you don't have qualia. But if you know all the data needed to complete any task that can be related to this knowledge, does it matter?


I'm reminded of the story of Helen Keller, and how it took a long time for her to realize that the symbols her teacher was signing into her hand had meaning, as she was blind and deaf and only experienced the world via touch and smell. She didn't get it until her teacher spelled the word "water" as water from a pump was flowing over her hand. In other words, a multimodal experience. If the model only sees text, it can appear to be brilliant but is missing a lot. If it's also fed other channels, if it can (maybe just virtually) move around, if it can interact, the way babies do, learning about gravity by dropping things and so forth, it seems that there's lots more possibility to understand the world, not just to predict what someone will type next on the Internet.

It is important to note that Helen Keller was not born blind and deaf, though. (I am not reducing the struggle she went through. Just commentary on embodied cognition and learning.) There were around 19 months of normal speech and hearing development until then and also 3D object space traversal and object manipulation.

at least a few decades ago, this idea was called "embodied intelligence" or "embodied cognition". just FYI.

Enactivist philosophy. Karl Friston is testing this approach as CTO of an AI startup in LA.

> Doesn't Language itself encode multimodal experiences?

When communicating between two entities with similar brains who have both had many thousands of hours of similar types of sensory experiences, yeah. When I read text I have a lot more than other text to relate it to in my mind; I bring to bear my experiences as a human in the world. The author is typically aware of this and effectively exploits this fact.


Some aspects of experience— e.g. raw emotions, sensory perceptions, or deeply personal, ineffable states—often resist full articulation.

The taste of a specific dish, the exact feeling of nostalgia, or the full depth of a traumatic or ecstatic moment can be approximated in words but never fully captured. Language is symbolic and structured, while experience is often fluid, embodied, and multi-sensory. Even the most precise or poetic descriptions rely on shared context and personal interpretation, meaning that some aspects of experience inevitably remain untranslatable.


Just because we struggle to verbalize something, doesn't mean that it cannot be verbalized. The taste of a specific dish can be broken down into its components. The basic tastes: how sweet, sour, salty, bitter and savory it is. The smell of it: there are apparently ~400 olfactory receptor types in the nose. So you could describe how strongly each of them is activated. Thermoception, the temperature of the food itself, but also fake temperature sensation produced by capsaicin and menthol. The mechanoreceptors play a part, detecting both the shape of the food as well as the texture of it. The texture also contributes to a sound sensation as we hear the cracks and pops when we chew. And that is just the static part of it. Food is actually an interactive experience, where all those impressions change and vary as the food is chewed.

It is highly complex, but it can all be described.


Imagine I give you a text of any arbitrary length in an unknown language with no images. With no context other than the text, what could you learn?

If I told you the text contained a detailed theory of FTL travel, could you ever construct the engine? Could you even prove it contained what I told you?

Can you imagine that given enough time, you'd recognize patterns in the text? Some sequences of glyphs usually follow other sequences, eventually you could deduce a grammar, and begin putting together strings of glyphs that seem statistically likely compared to the source.
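
That step is easy to make concrete: a character-level Markov chain over an unknown text reproduces surface statistics with zero grounding in meaning (the corpus below is just a stand-in for the undeciphered source):

    import random
    from collections import defaultdict

    corpus = "lorem ipsum dolor sit amet " * 50           # stand-in for the undeciphered text

    follows = defaultdict(list)
    for a, b in zip(corpus, corpus[1:]):
        follows[a].append(b)                               # record which glyph follows which

    random.seed(0)
    out = [random.choice(corpus)]
    for _ in range(60):
        out.append(random.choice(follows[out[-1]]))         # sample statistically plausible continuations
    print("".join(out))                                     # looks vaguely like the source, means nothing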

You can do all the analysis you like and produce text that matches the structure and complexity of the source. A speaker of that language might even be convinced.

At what point do you start building the space ship? When do you realize the source text was fictional?

There are many untranslatable human languages across history. Famously, ancient Egyptian hieroglyphs. We had lots and lots of source text, but all context relating the text to the world had been lost. It wasn't until we found a translation on the Rosetta Stone that we could understand the meaning of the language.

Text alone has historically proven to not be enough for humans to extract meaning from an unknown language. Machines might hypothetically change that but I'm not convinced.

Just think of how much effort it takes to establish bidirectional spoken communication between two people with no common language. You have to be taught the word for apple by being given an apple. There's really no exception to this.


I'm optimistic about this. I think enough pictures of an apple, chemical analyses of the air, the ability to arbitrarily move around in space, a bunch of pressure sensors, or a bunch of senses we don't even have, will solve this. I suspect there might be a continuum of more concept understanding that comes with more senses. We're bathed in senses all the time, to the point where we have many systems just to block out senses temporarily, and to constantly throw away information (but different information at different times.)

It's not a theory of consciousness, it's a theory of quality. I don't think that something can be considered conscious that is constantly encoding and decoding things into and out of binary.


A few GB worth of photographs of hieroglyphs? OK, you're going to need a Rosetta Stone.

A few PB worth? Relax, HAL's got this. When it comes to information, it turns out that quantity has a quality all its own.


> Doesn't Language itself encode multimodal experiences

Of course it does. We immediately encode pictures/words/everything into vectors anyway. In practice we don't have great text datasets to describe many things in enough detail, but there isn't any reason we couldn't.


There are absolutely reasons that we cannot capture the entirety—or even a proper image—of human cognition in semantic space.

Cognition is not purely semantic. It is dynamic, embodied, socially distributed, culturally extended, and conscious.

LLMs are great semantic heuristic machines. But they don't even have access to those other components.


The LLM embeddings for a token cover much more than semantics. There is a reason a single token embedding dimension is so large.

You are conflating the embedding layer in an LLM and an embedding model for semantic search.


I don't think we're using the term semantic in the same way. I mean "relating to meaning in language."

The embedding layer in an LLM deals with much more than the meaning. It has to capture syntax, grammar, morphology, style and sentiment cues, phonetic and orthographic relationships, and 500 other things that humans can't even reason about but that exist in word combinations.

I'll give you that. I was including those in "semantic space," but the distinction is fair.

My original point still stands: the space you've described cannot capture a full image of human cognition.


Thanks for articulating this so well. I'm a musician and music/CS PhD student, and as a jazz improvisor of advanced skill (30+ years), I'm acutely aware that there are significant areas of intelligence for which linguistic thinking is not only not good enough, but something to be avoided as much as one can (which is bloody hard sometimes). I have found it so frustrating, but hard to figure out how to counter, that the current LLM zeitgeist seems to hinge on a belief that linguistic intelligence is both necessary and sufficient for AGI.

Most modern LLMs are multimodal.

Does it really matter? At the end of the day, all the modalities and their architectures boil down to matrices of numbers and statistical probability. There’s no agency, no soul.

At the end of the day, all modalities boil down to patterns of electrical activity in your brain.

The brain is the important part. The electricity just keeps it going. And it’s more than numerical matrices.

You mean soul?

You misspelled strawman

Tri-modal at best: text, sound and video, and that's it. That's just barely "multi" (it's one more than two).

I don't get it.

1) Yes it's true, learning from text is very hard. But LLMs are multimodal now.

2) That "size of a lion" paper is from 2019, which is a geological era from now. The SOTA was GPT2 which was barely able to spit out coherent text.

3) Have you tried asking a mouse to play chess or reason its way through some physics problem or to write some code? I'm really curious in which benchmark are mice surpassing chatgpt/ grok/ claude etc.


Mice can survive, forage, reproduce. Reproduce a mammal. There is a whole load of capability not available in an LLM.

An LLM is essentially a search over a compressed dataset with a tiny bit of reasoning as emergent behaviour. Because it is a parrot, that is why you get "hallucinations". The search failed (like when you get a bad result in Google), or the lossy compression failed, or its reasoning failed.

Obviously there is a lot of stuff the LLM can find in its searches that are reminiscent of the great intelligence of the people writing for its training data.

The magic trick is impressive because when we judge a human what do we do... an exam? an interview? Someone with a perfect memory can fool many people because most people only acquire memory from tacit knowledge. Most people need to live in Paris to become fluent in French. So we see a robot that has a tiny bit of reasoning and a brilliant memory as a brilliant mind. But this is an illusion.

Here is an example:

User: what is the French Revolution?

Agent: The French Revolution was a period of political and societal change in France which began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799. Many of the revolution's ideas are considered fundamental principles of liberal democracy and its values remain central to modern French political discourse.

Can you spot the trick?


When you talk to ~3 year old children they hallucinate quite a lot. Really almost nonstop when you ask them about almost anything.

I'm not convinced that what LLM's are doing is that far off the beaten path from our own cognition.


Interesting, but a bit of a non sequitur.

Humans learn and get things wrong. A formative mind is a separate subject. But a 3 year old is vastly intelligent vs an LLM. Comparing the sounds from a 3 year old and the binary tokens from an LLM is simply indulging the illusion.

I am also not convinced that magicians saw people in half, and those people survive, defying medical and physical science.


I'm not sure I buy that, I didn't find the counter argument persuasive, but this comment basically took you from thoughtful to smug — unfairly so, ironically, because I've been so bored by not understanding Yann's "average housecat is smarter than an LLM"

Speaking of which...I'm glad you're here, because I have an interlocutor I can be honest with while getting at the root question of the Ask HN.

What in the world does it mean that a 3 year old is smarter than an LLM?

I don't understand the thing about sounds vs. binary either. Like, both go completely over my head.

The only thing I can think of it's some implied intelligence scoring index where "writing a resume" and "writing creative fiction" and "writing code" are in the same bucket thats limited to 10 points. Then there's anther 10 point bucket for "can vocalize", that an LLM is going to get 0 on.*

If that's the case, it comes across as intentionally obtuse, in that there's an implied prior about how intelligence is scored and it's a somewhat unique interpretation that seems more motivated by the question than reflective of reality — i.e. assume a blind mute human who types out answers out that match our LLMs. Would we say that person is not as intelligent as a 3 year old?

* well, it shouldn't, but for now let's bypass that quagmire


It is easy to cross wires in a HN thread.

I think what makes this discussion hard (hell it would be a hard PhD topic!) is:

What do we mean by smart? Intelligent? Etc.

What is my agenda and what is yours? What are we really asking?

I won't make any more arguments but pose these questions. Not for you to answer but everyone to think about:

Given (assuming) mammals including us have evolved and developed thought and language as a survival advantage, and LLMs use language because they have been trained on text produced by humans (as well as RLHF) - how do we tell on the scale of "Search engine for human output" to "Conscious Intelligent Thinking Being" where the LLM fits?

When a human says I love you, do they mean it, or is it merely 3 tokens? If an LLM says it, does it mean it?

I think the 3yr old thing is a red herring because adult intelligence VS AI is hard enough to compare (and we are the adults!) let alone bring children brain development into it. LLMs do not self organise their hardware. I'd say forget about 3 year olds for now. Talk about adults brainfarts instead. They happen!


a 3yr old is actually far more similar to AI than an adult. 3 year olds have extremely limited context windows. They will almost immediately forget what happened even 20-30 seconds ago when you play a game like memory with them, and they rarely remember what they ate for breakfast or lunch or basically any previous event from the same day.

When a 3 year old says "I love you" it is not at all clear that they understand what that means. They frequently mimic phrases they hear/basically statistical next word guessing and obviously don't understand the meaning of what they are saying.

You can even mimic an inner voice for them like Deepseek does for thinking through a problem with a 3 year old and it massively helps them to solve problems.

AI largely acts like a 3 year old with a massive corpus of text floating around in their head compared to the much smaller corpus a 3 year old has.


> What in the world does it mean that a 3 year old is smarter than an LLM?

Because LLMs have terrible comprehension of the real world. Here's an example:

> You: If you put a toddler next to a wig on the floor, which reaches higher?

> ChatGPT: The wig would typically reach higher than the toddler, especially if the wig is a standard size or has long hair. Toddlers are generally around 2 to 3 feet tall, while wigs can range in size, but many wigs are designed to be worn by adults and have hair that extends well beyond a toddler's height. So, unless the wig is very small or short, it would likely reach higher.

Both the toddler and the wig would be lying down (unless the toddler can stand already), with the wig likely collapsing into a very flat heap, making the toddler "reach higher". Since it lacks comprehension of how these things behave in the real world, ChatGPT goes on about length of the wig and the average height of toddlers, which are both qualities that are not even relevant.

If you could communicate this problem to a toddler, they would probably intuit it even with their still limited understanding of the world.

Here's another fun prompt to mess with LLMs:

> You: I'm trying to make a rope, made of only aubergines, not using any twine or string. What kind of knot would you recommend to tie together each aubergine with the next? I'm trying to use up as little of the length of each aubergine as possible.

It's obviously impossible nonsense if you ask any human, but ChatGPT will happily recommend you some techniques for tying aubergines in a knot.

And if you really want a laugh follow up with:

> You: Are they the worst fruit for this?

> ChatGPT: Aubergines (or eggplants) aren't ideal for making a rope, but they might not be the absolute worst, either. [...] Potentially Better Fruits: Apples or Pears: These fruits are firmer and have a smoother, more uniform shape, making them better suited for knotting.

Thanks to the large corpus of text ChatGPT is trained on, it's really good at hiding that it has basically no comprehension of the real world, however if you go unconventional places the illusion quickly breaks.

Bonus round:

> You: You're clueless about fruit!

> ChatGPT: Haha, fair enough! Looks like I missed the mark there. You’re probably right, though—there are definitely better fruits for this kind of thing than I suggested. Let’s try again. If you're looking to make a rope from a fruit specifically (and not a vegetable or tuber), something like coconuts or durian would be way better options, since they’re harder, more rigid, and can handle some tying pressure without collapsing. And apples and pears would definitely hold up well due to their firmer texture.

Coconuts!


I just entered your prompt into ChatGPT (4.5):

Prompt:

I'm trying to make a rope, made of only aubergines, not using any twine or string. What kind of knot would you recommend to tie together each aubergine with the next? I'm trying to use up as little of the length of each aubergine as possible.

Response:

Your request doesn't entirely make sense. Aubergines (eggplants) are firm but brittle vegetables; they lack flexibility, elasticity, or tensile strength required to create or sustain knots. Attempting to tie aubergines together directly, without using twine, string, or another flexible medium, isn't feasible. They'd likely break or tear apart under stress.

If your goal is genuinely to form a rope-like structure using only aubergines, you'd be much better off slicing the aubergines into thin, flexible strips (perhaps using a vegetable peeler or mandoline), drying or partially dehydrating them to increase flexibility, and then braiding or twisting them together, rather than knotting.

If you clarify your intended purpose or constraints further, I can suggest alternative methods.

The response looks good to me.


> Because LLMs have terrible comprehension of the real world.

That doesn't seem to be the case.

> You: If you put a toddler next to a wig on the floor, which reaches higher? > ChatGPT: ...

I answered it wrong too.

I had to read it, and your reaction to the implied obvious reasoning 3 times, to figure out the implied obvious reasoning, and understand your intent was the toddler was standing and the wig was laying in a heap.

I scored 99.9+% on the SAT and LSAT. I think that implies this isn't some reasoning deficit, lack of familiarity with logical reasoning on my end, or lack of rigor in reasoning.

I have no particular interest in this argument. I think that implies that I'm not deploying motivated reasoning, i.e. it discounts the possibility that I may have experienced it as confusion that required re-reading the entire comment 3 times, but perhaps I had subconcious priors.

Would a toddler even understand the question? (serious question, I'm not familiar with 3 year olds)

Does this shed any light on how we'd work an argument along the lines of our deaf and mute friend typing?

Edit: you edited in some more examples, I found its aubergine answers quite clever! (Ex. notching). I can't parse out a convincing argument this is somehow less knowledge than a 3 year old -- it's giving better answers than me that are physical! I thought you'd be sharing it asserting obviously nonphysical answers


> I had to read it, and your reaction to the implied obvious reasoning 3 times, to figure out the implied obvious reasoning, and understand your intent was the toddler was standing and the wig was laying in a heap.

It seems quite obvious even on a cursory glance though!

> toddler was standing and the wig was laying in a heap

I mean, how would a toddler be lying in a heap?

> Would a toddler even understand the question?

Maybe not, I am a teen/early adult myself, so not many children yet :) but if you instead lay those in front of a toddler and ask which is higher, I guess they would answer that, another argument for multi-modality.

PS: Sorry if what I am saying is not clear, english is my third language


Hah, I tried it with gpt-4o and got similarly odd results:

https://chatgpt.com/share/67d6fb93-890c-8004-909d-2bb7962c8f...

It's pretty good nonsense though. It suggests clove hitching them together, which would be a weird (and probably unsafe) thing to do even with ropes!


That’s interesting.

Lots of modern kids probably get exposed to way more fiction than fact thanks to TV.

I was an only child and watched a lot of cartoons and bad sitcoms as a kid, and I remember for a while my conversational style was way too full of puns, one-liners, and deliberately naive statements made for laughs.


i wish more people were still like that

> Mice can survive, forage, reproduce. Reproduce a mammal. There is a whole load of capability not available in an LLM.

And if it stood for "Large Literal Mouse", that might be a meaningful point. The subject is artificial intelligence, and a brief glance at your newspaper, TV, or nearest window will remind you that it doesn't take intelligence to survive, forage, or reproduce.

The mouse comparison is absurd. You might as well criticize an LLM for being bad at putting out a fire, fixing a flat, or holding a door open.


Oh mice can solve a plethora of physics problems before it's time for breakfast. They have to navigate the, well, physical world, after all.

I'm also really curious what benchmarks LLMs have passed that include surviving without being eaten by a cat, or a gull, or an owl, while looking for food to survive and feed one's young in an arbitrary environment chosen from urban, rural, natural etc, at random. What's ChatGPT's score on that kind of benchmark?


Oh, a rock rolling down a hill is, well, navigating the physical world. Is it, well, solving a physics problem?

> mice can solve a plethora of physics problems before it's time for breakfast

Ah really? Which ones? And nope, physical agility is not "solving a physics problem", otherwise soccer players and figure skaters would all have PhDs, which doesn't seem to be the case.

I mean, an automated system that solves equations to keep balance is not particularly "intelligent". We usually call intelligence the ability to solve generic problems, not the ability of a very specialized system to solve the same problem again and again.


>> Ah really? Which ones? And nope, physical agility is not "solving a physics problem", otherwise a soccer players and figure skaters would all have PhDs, which doesn't seem to be the case.

Yes, everything that has to do with navigating physical reality, including, but not restricted to physical agility. Those are physics problems that animals, including humans, know how to solve and, very often, we have no idea how to program a computer to solve them.

And you're saying that solving physics problems means you have a PhD? So for example Archimedes did not solve any physics problems otherwise he'd have a PhD?


> Those are physics problems that animals, including humans, know how to solve

No, those are problems that animals and humans solve, not know how to solve. I'm not the greatest expert of biochemistry that ever lived because of what goes on in my cells.

Now, I understand perfectly well the argument that "even small animals do things that our machines cannot do". That's been indisputably true for a long time. Today, it seems to be more a matter of embodiment and speed of processing than a level of intelligence out of our reach. We already have machines that understand natural language perfectly well and display higher cognitive abilities than any other animal - including abstract reasoning, creating and understanding metaphors, following detailed instructions, writing fiction, etc.


usual disclaimer: you decide on your own whether I'm an insider or not :)

where LeCun might be prescient should intersect with the nemesis SCHMIDHUBER. They can't both be wrong, I suppose?!

It's only "tangentially" related to energy minimization, technically speaking :) connection to multimodalities is spot-on.

https://www.mdpi.com/1099-4300/26/3/252

To Compress or Not to Compress—Self-Supervised Learning and Information Theory: A Review

With Ravid, double-handedly blue-flag MDPI!

Summarized for the layman (propaganda?) https://archive.is/https://nyudatascience.medium.com/how-sho...

>When asked about practical applications and areas where these insights might be immediately used, Shwartz-Ziv highlighted the potential in multi-modalities and tabula

Imho, best take I've seen on this thread (irony: literal energy minimization) https://news.ycombinator.com/item?id=43367126

Of course, this would make Google/OpenAI/DeepSeek wrong by two whole levels (both architecturally and conceptually)


> 1) You can't learn an accurate world model just from text.
> 2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

LLMs can be trained with multimodal data. Language is only tokens, and pixel and sound data can be encoded into tokens. All data can be serialized. You can train this thing on data we can't even comprehend.

Here's the big question. It's clear we need less data than an LLM. But I think it's because evolution has pretrained our brains for this, so we have brains geared towards specific things. Like we are geared towards walking, talking, reading, in the same way a cheetah is geared towards ground speed more than it is towards flight.

If we placed a human and an LLM in completely unfamiliar spaces and tried to train both with data. Which will perform better?

And I mean completely unfamiliar spaces. Like let's make it non-Euclidean space and only use sonar for visualization. Something totally foreign to reality as humans know it.

I honestly think the LLM will beat us in this environment. We might've already succeeded in creating AGI; it's just that the G is too much. It's too general, so it's learning everything from scratch and it can't catch up to us.

Maybe what we need is to figure out how to bias the AI to think and be biased in the way humans are biased.


Humans are more adaptable than you think:

- echolocation in blind humans https://en.wikipedia.org/wiki/Human_echolocation

- sight through signals sent on tongue https://www.scientificamerican.com/article/device-lets-blind...

In the latter case, I recall reading the people involved ended up perceiving these signals as a "first order" sense (not consciously treated information, but on an intuitive level like hearing or vision).


Hugely different data too?

If you think of all the neurons connected up to vision, touch, hearing, heat receptors, balance, etc. there’s a constant stream of multimodal data of different types along with constant reinforcement learning - e.g. ‘if you move your eye in this way, the scene you see changes’, ‘if you tilt your body this way your balance changes’, etc. and this runs from even before you are born, throughout your life.


> non Euclidean space and only using sonar for visualization

Pretty good idea for a video game!


I'm curious why their claims are controversial. It seems pretty obvious to me that LLMs sometimes generate idiotic answers because the models lack common sense and do not have the ability for deductive logical reasoning, let alone the ability to induce. And the current transformer architectures plus all the post-training techniques do not do anything to build such intelligence or the world model, per LeCun's words.

"LeCun has been on about it for a while and it's less controversial these days."

Funny how that sentence could have been used 15 years ago too when he was right about persevering through neural network scepticism.


Late 80's and 90's had the 'nouvelle AI' movement that argued embodiment was required for grounding the system into the shared world model. Without it symbols would be ungrounded and never achieve open world consistency.

So unlike their knowledge-system predecessors, a bit derogatorily referred to as GOFAI (good old fashioned AI), nAI harked back to cybernetics and multi-layered dynamical systems rather than having explicit internal symbolic models. Braitenberg rather than blocksworld, so to speak.

Seems like we are back for another turn of the wheel in this aspect.


> grounding the system into the shared world model

before we fix certain things [..., 'corruption', Ponzi schemes, deliberate impediment of information flow to population segments and social classes, among other things, ... and a chain of command in hierarchies that are built on all that], that is impossible.

Why do smart people not talk about this at all? The least engineers and smart people should do is picking these fights for real. It's just a few interest groups, not all of them. I understand a certain balance is necessary in order to keep some systems from tipping over, aka "this is humanity, silly, this is who we are", but we are far from the point of efficient friction and it's only because "smart people" like LeCun et al are not picking those fights.

How the hell do you expect to ground an ()AI in a world where elected ignorance amplifies bias and fallacies for power and profit while the literal shit is hitting all the fans via intended and unintended side effects? Any embodied AI will pretend until there is no way to deny that the smartest, brightest and the productive don't care about the system in any way but are just running algorithmically while ignoring what should not be ignored - should as in, an AI should be aligned with humanities interests and should be grounded into the shared world model.


I hear you, but while you can have many layers of semantic obfuscation, no amount of sophistry will allow you to smash your face unharmed through a concrete wall. Reality is a hard mistress.

In the absence of being able to sense reality, postmodernism can run truly unchecked.


> LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.

It feels like special pleading: surely _this_ will be the problem class that doesn’t fall to “the bitter lesson”.

My intuition is that the main problem with the current architecture is that mapping into tokens causes quantization that a real brain doesn’t have, and lack of plasticity.

I don’t build models, I spend 100% of my time reading and adjusting model outputs though.


So blind people aren't generally sentient?

(I'm obviously exaggerating a bit for the sake of the argument, but the point stands. Multimodality should not be a prerequisite to AGI)


blind people still have other senses including touch which gives them a size reference they can compare to. you can feel physical objects to gain an understanding of their size.

the LLM is more like a brain in a vat with only one sensory input - a stream of text


I don't know about telling the size better from a picture. I can imagine seeing 2 pictures of the moon. One is an extreme telephoto showing the moon next to a building and it looks real big. Then there would be another image where the moon is a tiny speckle in the sky. How big is the moon? I would rather understand a text: "its radius is x km".

I think the example is simplified to make its point efficiently, but also: the moon is something whose size would very likely be precisely explained in texts about it. While some hunting journals might brag about the weight of a lion that was killed, or whatever, most texts that I can recall reading about lions basically assumed you already know roughly how big a lion is; which indeed I learned from pictures as a pre-literate child.

A good, precise spec is better than a few pictures, sure; the random text content of whatever training set you can scrape together, perhaps not (?)


Reading "its radius is x km" would mean nothing to you if you'd never experienced spatial extent directly, whether that be visually or just by moving through space and existing in it. You'd need to do exactly what is being said in the paper, read about thousands of other roughly spherical objects and their radii. At some point, you'd get a decent sense of relative sizes.

On the other hand, if you ever simply see a meter stick, any statement that something measures a particular multiple or fraction of that you can already understand, without ever needing to learn the size of anything else.


This seems strongly backed up by Claude Plays Pokemon

Isn't Claude Plays Pokemon using image input in addition to text? Not that it's perfect at it (some of its most glaring mistakes are when it just doesn't seem to understand what's on the screen correctly).

Yes, but because it's trained on text and, in the backend, images are converted to tokens, it is absolutely dogshit at navigation and basic puzzles. It can't figure out what squirrels can about how to achieve goals in a maze.

The images are converted to an embedding space the size of token embedding space. And the model is trained on that new embedding space. A joint representation of text and images is formed.

It’s not as though the image is converted to text tokens.
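
Roughly, in code (all dimensions and ids below are invented; this mirrors the common projector-style recipe rather than any particular model): the image features get projected into the same embedding space the text tokens live in and are handed to the decoder as extra "soft tokens", with no conversion to text anywhere.

    import torch
    import torch.nn as nn

    d_vision, d_model, vocab = 1024, 4096, 32000

    vision_feats = torch.randn(1, 256, d_vision)          # 256 patch features from a vision encoder
    projector = nn.Linear(d_vision, d_model)               # learned during multimodal training
    image_embeds = projector(vision_feats)                  # (1, 256, d_model)

    tok_embedding = nn.Embedding(vocab, d_model)
    text_ids = torch.tensor([[1, 529, 3027, 29958]])         # placeholder token ids
    text_embeds = tok_embedding(text_ids)                    # (1, 4, d_model)

    decoder_input = torch.cat([image_embeds, text_embeds], dim=1)   # joint sequence of embeddings
    print(decoder_input.shape)                                       # torch.Size([1, 260, 4096])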


Are people born both blind and deaf incapable of true learning?

There's a whole folklore/meme around how hard it is, in American culture...

But given blindness and deafness are an impediment to acquiring language, more than anything else, I'd say that's the exact opposite of the conclusions from the comment you're replying to.

But yes, depending on where you set the bar for "true learning" being blind and deaf would prevent it.

I assume you're asking if vision and sound are required for learning, the answer I assume is no. Those were just chosen because we've already invented cameras and microphones. Haptics are less common, and thus less talked about.


>(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)

Ehhhh, energy-based models are trained via contrastive divergence, not just minimizing a simple loss averaged over the training data.
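
For anyone who hasn't seen it, a stripped-down sketch of that contrastive flavour of training (true contrastive divergence draws its negatives with a few MCMC/Gibbs steps from the model; the random-noise negatives and toy 2-D data here only keep the sketch short):

    import torch
    import torch.nn as nn

    energy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(energy.parameters(), lr=1e-3)

    for step in range(200):
        data = torch.randn(128, 2) * 0.3 + torch.tensor([1.0, -1.0])   # "real" samples near one mode
        negatives = torch.randn(128, 2) * 2.0                           # crude stand-in for MCMC negatives
        loss = (energy(data).mean() - energy(negatives).mean()          # push data energy down, negatives up
                + 0.1 * (energy(data) ** 2 + energy(negatives) ** 2).mean())  # keep energies bounded
        opt.zero_grad(); loss.backward(); opt.step()

    print(energy(torch.tensor([[1.0, -1.0]])).item(),   # energy near the data mode (should trend lower)
          energy(torch.tensor([[3.0, 3.0]])).item())    # energy far from the data (should trend higher)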


I'm not an ML researcher, but I do work in the field.

My mental model of AI advancements is that of a step function with s-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.

I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and super-intelligence is right around the corner now. I think at each stage, the cusp of the S-curve comes as we find where the model is good enough to be deployed, and where it isn't. Then companies tend to enter a holding pattern for a number of years getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.

Right now I would guess that we are around 0.9 on the S curve, we can still improve the LLMs (as DeepSeek has shown wide MoE and o1/o3 have shown CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture, others have pointed out that LLMs have had shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs are likely to make some improvement on these things, but not much.

I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.

[1]: https://www.open.edu/openlearn/nature-environment/organisati...


> Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found.

That seems to be how science works as a whole. Long periods of little progress between productive paradigm shifts.


It's been described as fumbling around in a dark room until you find the light switch. At which point you can see the doorway leading to the next dark room.

Punctuated equilibrium theory.

That is how science seems to work as a whole. What worries me is that the market views the emergence of additional productive paradigm shifts in AI as only a matter of money. A normal scientific advancement plateau for another five years in AI would be a short-term disaster for the stock market and economy.

This is actually a lazy approach as you describe it. Instead, what is needed is an elegant and simple approach that is 99% of the way there out of the gate. Soon as you start doing statistical tweaking and overfitting models, you are not approaching a solution.

In a way yes. For models in physics that should make you suspicious, since most of the famous and useful models we've found are simple and accurate. However, in general intelligence or even multimodal pattern matching there’s no guarantee there’s an elegant architecture at the core. Elegant models in social sciences like economics, sociology and even fields like biology tend to be hilariously off.

Not an official ML researcher, but I do happen to understand this stuff.

The problem with LLMs is that the output is inherently stochastic - i.e. there isn't a "I don't have enough information" option. This is due to the fact that LLMs are basically just giant look up maps with interpolation.

Energy minimization is more of an abstract approach in which you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, they will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and optimize based on that. This is basically how search works with genetic algorithms or PSO.
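
As a rough illustration of that last point, a toy fitness-driven search loop (the fitness function here is just a stand-in; nothing requires the objective to be differentiable):

    import random

    def fitness(params):
        # Stand-in fitness: in the framing above this could be task success
        # minus penalties for memory use or compute cycles.
        x, y = params
        return -((x - 3.0) ** 2 + (y + 1.0) ** 2)

    def mutate(params, scale=0.3):
        return [p + random.gauss(0.0, scale) for p in params]

    # Simple evolutionary search: no gradients, no differentiability needed.
    population = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(20)]
    for generation in range(100):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:5]                                    # keep the fittest
        population = parents + [mutate(random.choice(parents))  # refill with mutated copies
                                for _ in range(15)]

    best = max(population, key=fitness)
    print(best, fitness(best))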


> The problem with LLMs is that the output is inherently stochastic - i.e there isn't a "I don't have enough information" option. This is due to the fact that LLMs are basically just giant look up maps with interpolation.

I don't think this explanation is correct. The output at the end of the model, after all the attention layers etc. (as I understand it), is a probability distribution over tokens. So the model as a whole does have an ability to score low confidence in something by assigning it a low probability.

The problem is that thing is a token (part of a word). So the LLM can say "I don't have enough information" to decide on the next part of a word but has no ability to say "I don't know what on earth I'm talking about" (in general - not associated with a particular token).


I feel like we're stacking naive misinterpretations of how LLMs function on top of one another here. Grasping gradient descent and autoregressive generation can give you a false sense of confidence. It is like knowing how transistors make up logic gates and believing you know more about CPU design than you actually do.

Rather than inferring from how you imagine the architecture working, you can look at examples and counterexamples to see what capabilities they have.

One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels. It would be quite easy to detect (and exploit) behaviour that decided to use a vowel word just because it somewhat arbitrarily used an 'an'.

Models predict the next word, but they don't just predict the next word. They generate a great deal of internal information in service of that goal. Placing limits on their abilities by assuming the output they express is the sum total of what they have done is a mistake. The output probability is not what it thinks, it is a reduction of what it thinks.

One of Andrej Karpathy's recent videos talked about how researchers showed that models do have an internal sense of not knowing the answer, but fine tuning on question answering did not give them the ability to express that knowledge. Finding information the model did and didn't know, then fine tuning it to say "I don't know" for cases where it had no information, allowed the model to generalise and express "I don't know".


Thanks for writing this so clearly... I hear wrong/misguided arguments like the ones we see here every day from friends, colleagues, "experts in the media", etc.

It's strange because just a moment of thinking will show that such ideas are wrong or paint a clearly incomplete picture. And there's plenty of analogies to the dangers of such reductionism. It should be obviously wrong to anyone who has at least tried ChatGPT.

My only explanation is that a denial mechanism must be at play. It simply feels more comfortable to diminish LLM capabilities and/or feel that you understand them from reading a Medium article on transformer networks, than to consider the consequences of their inner black-box nature.


Not an ML researcher or anything (I'm basically only a few Karpathy videos into ML, so please someone correct me if I'm misunderstanding this), but it seems that you're getting this backwards:

> One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels.

My understanding is that it's not that the model “knows” in advance that it wants to put a word starting with a vowel and then anticipates that it needs to put “an”. The model (or more accurately, the sampler) picks “an”, and then the model will never predict a word that starts with a consonant after that. It generates a probability for both tokens “a” and “an”, picks one, and then, when it generates the following token, it necessarily takes its previous choice into account and never puts a word starting with a vowel after it has already chosen “a”.


The model still has some representation of whether the word after an/a is more likely to start with a vowel or not when it outputs a/an. You can trivially understand this is true by asking LLMs to answer questions with only one correct answer.

"The animal most similar to a crocodile is:"

https://chatgpt.com/share/67d493c2-f28c-8010-82f7-0b60117ab2...

It will always say "an alligator". It chooses "an" because somewhere in the next word predictor it has already figured out that it wants to say alligator when it chooses "an".

If you ask the question the other way around, it will always answer "a crocodile" for the same reason.


Again, that's not a good example I think, because everything about the answer is in the prompt, so obviously "alligator" has high probability from the start, and the model is just waiting for an "an" to occur to have an occasion to put it.

That doesn't mean it knows "in advance" what it wants to say, it's just that at every step the alligator is lurking in the logits because it directly derives from the prompt.


You write: "it's just that at every step the alligator is lurking in the logits because it directly derives from the prompt" - but isn't that the whole point: at the moment the model writes "an", it isn't just spitting out a random article (or a 50/50 distribution of articles or other words for that matter); rather, "an" gets a high probability because the model internally knows that "alligator" is the correct thing after that. While it can only emit one token in this step, it will emit "an" to make it consistent with its alligator knowledge "lurking". And btw while not even directly relevant, the word alligator isn't in the prompt. Sure, it derives from the prompt but so does every an LLM generates, and same for any other AI mechanism for generating answers.

> While it can only emit one token in this step, it will emit "an" to make it consistent with its alligator knowledge "lurking".

It will also emit "a" from time to time without issue though, but will never spit "alligator" right after that, that's it.

> Sure, it derives from the prompt but so does every an LLM generates, and same for any other AI mechanism for generating answers.

Not really, because of the autoregressive nature of LLMs, the longer the response the more it will depend on its own response rather than the prompt. That's why you can see totally opposite responses from an LLM to the same query if you aren't asking basic factual questions. I saw a tool on reddit a few months ago that allowed you to see which words in the generation were the most “opinionated” (where the sampler had to choose between alternative words that were close in probability), and it was easy to see that you could dramatically affect the result by just changing certain words.

> "an" gets a high probability because the model internally knows that "alligator" is the correct thing after that.

This is true, though it only works with this kind of prompt because the output of the LLM has little impact on the generation.

Globally I see what you mean, and I don't disagree with you, but at the same time, I think that saying that LLMs have a sense of anticipating the further tokens misses their ability to get driven astray by their own output: they have some information that will affect further tokens, but any token that gets emitted can, and will, change that information in a way that can dramatically change the “plans”. And that's why I think using trivial questions isn't a good illustration, because it pushes this effect under the rug.


yunwal has provided one example. Here's another using much smaller model.

https://chat.groq.com/?prompt=If+a+person+from+Ontario+or+To...

The response "If a person from Ontario or Toronto is a Canadian, a person from Sydney or Melbourne would be an Australian!"

It seems mighty unlikely that it chose Australian as the country because of the 'an', or that it chose to put the 'an' at that point in the sentence for any other reason than that the word Australian was going to be next.

For any argument where you think this does not mean they have some idea of what is to come, try to come up with a test to see if your hypothesis is true or not, then give that test a try.


No, the person you're responding to is absolutely right. The easy test (which has been done in papers again and again) is the ability to train linear probes (or non-linear classifier heads) on the current hidden representations to predict the nth-next token, and the fact that these probes have very high accuracy.
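
For anyone curious what such a probe looks like mechanically, a toy sketch (not the setup of any particular paper; the model, the two-sentence "corpus", and the "two tokens ahead" target are placeholders, and a real study uses far more data and held-out evaluation):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    texts = ["The capital of France is Paris .", "The capital of Spain is Madrid ."]
    features, labels = [], []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt").input_ids
            hidden = model(ids, output_hidden_states=True).hidden_states[-1][0]  # [seq, dim]
            # Probe: predict the token *two* positions ahead from the hidden state at position i.
            for i in range(hidden.shape[0] - 2):
                features.append(hidden[i].numpy())
                labels.append(int(ids[0, i + 2]))

    probe = LogisticRegression(max_iter=1000).fit(features, labels)
    print(probe.score(features, labels))
    # (On a real corpus you'd evaluate on held-out text; high held-out probe accuracy is the
    # evidence that the hidden state already carries information about the n+2 token.)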

> It would be quite easy to detect (and exploit) behaviour that decided to use a vowel word just because it somewhat arbitrarily used an 'an'.

That is a very interesting observation!

Doesn’t that internal state get blown away and recreated for every “next token”? Isn’t the output always the previous context plus the new token, which gets fed back and out pops the new token? There is no transfer of internal state to the new iteration beyond what is “encoded” in its input tokens?


>Doesn’t that internal state get blown away and recreated for every “next token”

That is correct. When a model has a good idea of the next 5 words, after it has emitted the first of those 5, most architectures make no further use of the other 4 and will likely regenerate the same information again in the next inference cycle.

There are architectures that don't discard all that information but the standard LLM has generally outperformed them, for now.

There are interesting philosophical implications if LLMs were to advance to a level to be considered sentient. Would it not be constantly creating and killing a thinking being for every token? On the other hand, if context is considered memory, perhaps continuity of identity is based upon memory and all that other information is simply forgotten idle thoughts. We have no concept of what our previous thoughts were except from our memory. Is that not the same?

Sometimes I wonder if some of the resistance to AI is because it can do things that we think require abilities that we would like to believe that we possess ourselves, and showing that they are not necessary creates the possibility that we might not have those abilities.

There was a great observation recently in an interview (I forget the source, but the interviewer's last name was Bi) that some of the discoveries that met the most resistance in history such as the Earth orbiting the Sun, or Darwin's theory of evolution were similar in that they implied that we are not a unique special case.


I think your analogy about logic gates vs. CPUs is spot on. Another apt analogy would be missing the forest for the trees—the model may in fact be generating a complete forest, but its output (natural language) is inherently serial so it can only plant one tree at a time. The sequence of distributions that is the proximate driver of token selection is just the final distillation step.

It literally doesn't know how to handle 'I don't know' and needs to be taught. Fascinating.

I think it would be more accurate to say that, after fine tuning on a series of questions with answers, it thinks that you don't want to hear "I don't know".

I think it's more fundamental than that. If you start saying "it thinks" in regards to an LLM, you're wrong. LLMs don't think, they pattern match fuzzily.

If the training data contained a bunch of answers to questions which were simply "I don't know", you could get an LLM to say "I don't know" but that's still not actually a concept of not knowing. That's just knowing that the answer to your question is "I don't know".

It's essentially like if you had an HTTP server that responded to requests for nonexistent documents with a "200 OK" containing "Not found". It's fundamentally missing the "404 Not found" concept.
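
In code, the analogy is roughly the difference between these two responses (a toy http.server sketch; the paths are made up):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/known":
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"Here is the document.")
            else:
                # The analogy's broken server: a confident 200 whose *body* claims ignorance...
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"Not found")
                # ...instead of the protocol-level signal that actually encodes "I don't have this":
                # self.send_response(404)

    server = HTTPServer(("localhost", 8000), Handler)  # server.serve_forever() to run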

LLMs just have a bunch of words--they don't understand what the words mean. There's no metacognition going on for it to think "I don't know" for it to even think you would want to know that.


>I think it's more fundamental than that. If you start saying "it thinks" in regards to an LLM, you're wrong. LLMs don't think, they pattern match fuzzily.

I'm not sure if this objection is terribly helpful. We use terms like think and want to describe processes that clearly do not involve any form of understanding. Electrons do not have motivations but they 'want' to go to a lower energy level in an atom. You can hold down the trigger for the fridge light to make it 'think' that the door has not been opened. These are uncontentious phrases that convey useful ideas.

I understand that when people are working towards producing reasoning machines the words might be working in similar spaces, but really, when someone is making claims about machines having awareness, understanding, or thinking, they make the context they are talking about quite clear.

As to the rest of your comment, I simply disagree. If you think of a concept as an internal representation of a piece of information, then it has been shown that they do have such representations. In the Karpathy video I mentioned, he talks about how researchers found that models did have an internal representation of not knowing, but that the fine tuning was restricting it to providing answers. Giving the model fine-tuning examples where it said "I don't know" for information that they knew it didn't know generalised to providing "I don't know" for examples that were not in the training data. For the fine tuning examples to succeed in that, the model has to already contain the concept.

I would agree that models do not have any in-depth understanding of what lack of knowledge actually is. On the other hand I would also think that this also applies to humans, most people are not philosophers.

I think that the fact that models can express details about words shows that they do have detailed information about what each word means semantically. In many respects, because tokenisation indexes embeddings, it would perhaps be more accurate to say that they have a better understanding of the semantic information of what words mean than of what the words actually are. This is why they are poor at spelling but can give you detailed information about the thing they can't spell.


> We use terms like think and want to describe processes that are clearly not involve any form of understanding.

...and that's why so many people are confused about what's going on with LLMs: sloppy, ambiguous use of language.

> In the Karpathy video I mentioned he talks about how researches found that models did have an internal representation of not knowing, but that the fine tuning was restricting it to providing answers. Giving it fine-tuning examples where it said "I don't know" for information that they knew the model didn't know.

This is why I included the HTTP example: this is simply telling it to parrot the phrase "I don't know"--it doesn't understand that it doesn't know. From the LLM's perpective, it "knows" that the answer is "I don't know". It's returning a 200 OK that says "I don't know" rather than returning a 404.

Do you understand the distinction I'm making here?

> I would agree that models do not have any in-depth understanding of what lack of knowledge actually is. On the other hand I would also think that this also applies to humans, most people are not philosophers.

The average (non-programmer) human, when asked to write a "Hello, world" program, can definitely say they don't know how to program. And unlike the LLM, the human knows that this is different from answering the question. The LLM, in contrast thinks it is answering the question when it says "I don't know"--it thinks "I don't know" is the correct answer.

Put another way, a human can distinguish between responses to these two questions, whereas an LLM can't:

1. What is my grandmother's maiden name?

2. What is the English translation of the Spanish phrase, "No sé."?

In the first question, you don't know the answer unless you are quite creepy; in the second case you do (or can find out easily). But the LLM tuned to answer I don't know thinks it knows the answer in both cases, and thinks the answer is the same.


>...and that's why so many people are confused about what's going on with LLMs: sloppy, ambiguous use of language.

There is a difference between explanation by metaphor and lack of precision. If you think someone is implying something literal when they might be using a metaphor you can always ask for clarification. I know plenty of people that are utterly precise in their use in their language which leads them to being widely misunderstood because they think a weak precise signal is received as clearly as a strong imprecise signal. They usually think the failure in communication is in the recipient but in reality they are just accurately using the wrong protocol.

>Do you understand the distinction I'm making here?

I believe I do, and it is precisely this distinction that the researchers showed. By teaching a model to say "I don't know" for some information that they knew the model did not know the answer to, the model learned to respond "I don't know" for things it did not know that it was not explicitly taught to respond to with "I don't know". For it to acquire that ability to generalise to new cases, the model has to have already had an internal representation of "that information is not available".

I'm not sure where you think a model converting its internal representation of not knowing something into words is distinct from a human converting its internal representation of not knowing into words.

When fine tuning directs a model to profess lack of knowledge, usually they will not give the same specific "I don't know" text as the way to express that it does not know, because they want to bind the concept of "lack of knowledge" to the concept of "communicate that I do not know" rather than to any particular word phrase. Giving it many ways to say "I don't know" builds that binding, rather than the crude "if X then emit Y" that you imagine it to be.


I think some “reasoning” models do backtracking by inserting “But wait” at the start of a new paragraph? There’s more to it, but that seems like a pretty good trick.

The problem is exactly that: the probability distribution. The network has no way to say: 0% everyone, this is nonsense, backtrack everything.

Other architectures, like energy based models or bayesian ones, can assess uncertainty. Transformers simply cannot do it (yet). Yes, there are ways to do it, but we are already spending millions to get coherent phrases; few will burn billions to train a model that can do that kind of assessment.


Has anybody ever messed with adding a "backspace" token?

Yes. (https://news.ycombinator.com/item?id=36425375, believe there's been more)

There's a quite intense backlog of new stuff that hasn't made it to prod. (I would have told you in 2023 that we would have ex. switched to Mamba-like architectures in at least one leading model)

Broadly, it's probably unhelpful that:

- absolutely no one wants the PR of releasing a model that isn't competitive with the latest peers

- absolutely everyone wants to release an incremental improvement, yesterday

- Entities with no PR constraint, and no revenue repercussions when reallocating funds from surely-productive to experimental, don't show a significant improvement in results for the new things they try (I'm thinking of ex. Allen Institute)

Another odd property I can't quite wrap my head around is the battlefield is littered with corpses that eval okay-ish, and should have OOM increases in some areas (I'm thinking of RWKV, and how it should be faster at inference), and they're not really in the conversation either.

Makes me think either A) I'm getting old and don't really understand ML from a technical perspective anyway or B) hey, I've been maintaining a llama.cpp wrapper that works on every platform for a year now, I should trust my instincts: the real story is UX is king and none of these things actually improve the experience of a user even if benchmarks are ~=.


Oh yeah, that's exactly what I was thinking of! Seems like it would be very useful for expert models with domains with more definite "edges" (if I'm understanding it right)

As for the fragmentation of progress, I guess that's just par for the course for any tech with such a heavy private/open source split. It would take a huge amount of work to trawl through this constant stream of 'breakthroughs' and put them all together.


For sure read Stephenson’s essay on path dependence; it lays out a lot of these economic and social dynamics. TLDR - we will need a major improvement to see something novel pick up steam most likely.

Yeah, everyone spending way too much money on things we barely understand is a recipe for insane path dependence.

Right. And, as a result, low token-level confidence can end up indicating "there are other ways this could have been worded" or "there are other topics which could have been mentioned here" just as often as it does "this output is factually incorrect". Possibly even more often, in fact.

My first reaction is that a model can’t, but a sampling architecture probably could. I’m trying to understand if what we have as a whole architecture for most inference now is responsive to the critique or not.

You get scores for the outputs of the last layer; so in theory, you could notice when those scores form a particularly flat distribution, and fault.
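
A minimal sketch of that last-layer check (the entropy threshold and vocab size here are arbitrary):

    import torch

    def flat_distribution_fault(logits, entropy_threshold=0.9):
        # logits: [vocab] scores from the final layer for the next token.
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum()
        max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))  # entropy of a uniform distribution
        # "Fault" when the distribution is close to flat, i.e. the model has little preference.
        return (entropy / max_entropy).item() > entropy_threshold

    confident = torch.zeros(50_000)
    confident[42] = 20.0
    print(flat_distribution_fault(confident))            # False: sharply peaked
    print(flat_distribution_fault(torch.zeros(50_000)))  # True: perfectly flat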

What you can't currently get, from a (linear) Transformer, is a way to induce a similar observable "fault" in any of the hidden layers. Each hidden layer only speaks the "language" of the next layer after it, so there's no clear way to program an inference-framework-level observer side-channel that can examine the output vector of each layer and say "yup, it has no confidence in any of what it's doing at this point; everything done by layers feeding from this one will just be pareidolia — promoting meaningless deviations from the random-noise output of this layer into increasing significance."

You could in theory build a model as a Transformer-like model in a sort of pine-cone shape, where each layer feeds its output both to the next layer (where the final layer's output is measured and backpropped during training) and to an "introspection layer" that emits a single confidence score (a 1-vector). You start with a pre-trained linear Transformer base model, with fresh random-weighted introspection layers attached. Then you do supervised training of (prompt, response, confidence) triples, where on each training step, the minimum confidence score of all introspection layers becomes the controlled variable tested against the training data. (So you aren't trying to enforce that any particular layer notice when it's not confident, thus coercing the model to "do that check" at that layer; you just enforce that a "vote of no confidence" comes either from somewhere within the model, or nowhere within the model, at each pass.)
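
To make the shape of that idea concrete, a toy sketch of such introspection heads (everything here — the heads, dimensions, and the fake hidden states standing in for a base model — is invented for illustration; it describes no existing system):

    import torch
    import torch.nn as nn

    class IntrospectionHeads(nn.Module):
        # One scalar "confidence" head per hidden layer of a frozen base transformer;
        # the minimum across layers is what gets supervised against a confidence label.
        def __init__(self, num_layers, hidden_dim):
            super().__init__()
            self.heads = nn.ModuleList(nn.Linear(hidden_dim, 1) for _ in range(num_layers))

        def forward(self, hidden_states):
            # hidden_states: list of [batch, seq, hidden_dim] tensors, one per layer.
            per_layer = [torch.sigmoid(h(s).mean(dim=(1, 2))) for h, s in zip(self.heads, hidden_states)]
            return torch.stack(per_layer, dim=0).min(dim=0).values  # [batch] minimum confidence

    # Toy usage with fake hidden states from a 4-layer, 16-dim "model".
    heads = IntrospectionHeads(num_layers=4, hidden_dim=16)
    fake_hiddens = [torch.randn(2, 8, 16) for _ in range(4)]
    confidence = heads(fake_hiddens)                       # [2]
    label = torch.tensor([1.0, 0.0])                       # supervised confidence targets
    loss = nn.functional.binary_cross_entropy(confidence, label)
    loss.backward()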

This seems like a hack designed just to compensate for this one inadequacy, though; it doesn't seem like it would generalize to helping with anything else. Some other architecture might be able to provide a fully-general solution to enforcing these kinds of global constraints.

(Also, it's not clear at all, for such training, "when" during the generation of a response sequence you should expect to see the vote-of-no-confidence crop up — and whether it would be tenable to force the model to "notice" its non-confidence earlier in a response-sequence-generating loop rather than later. I would guess that a model trained in this way would either explicitly evaluate its own confidence with some self-talk before proceeding [if its base model were trained as a thinking model]; or it would encode hidden thinking state to itself in the form of word-choices et al, gradually resolving its confidence as it goes. In neither case do you really want to "rush" that deliberation process; it'd probably just corrupt it.)


> i.e there isn't a "I don't have enough information" option.

This is true in terms of default mode for LLMs, but there's a fair amount of research dedicated to the idea of training models to signal when they need grounding.

SelfRAG is an interesting, early example of this [1]. The basic idea is that the model is trained to first decide whether retrieval/grounding is necessary and then, if so, after retrieval it outputs certain "reflection" tokens to decide whether a passage is relevant to answer a user query, whether the passage is supported (or requires further grounding), and whether the passage is useful. A score is calculated from the reflection tokens.

The model then critiques itself further by generating a tree of candidate responses, and scoring them using a weighted sum of the score and the log probabilities of the generated candidate tokens.
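
Schematically, the reranking step might look something like this (a made-up scoring helper in the spirit of the description above, not SelfRAG's exact formulation; the reflection labels and weights are placeholders):

    def score_candidate(reflection_scores, token_logprobs, weights):
        # Weighted sum of reflection judgments plus the mean log-probability of the tokens.
        reflection_term = sum(weights[k] * v for k, v in reflection_scores.items())
        fluency_term = sum(token_logprobs) / len(token_logprobs)
        return reflection_term + fluency_term

    candidates = [
        {"text": "Answer grounded in retrieved passage ...",
         "reflection": {"is_relevant": 0.9, "is_supported": 0.8, "is_useful": 0.7},
         "logprobs": [-0.2, -0.1, -0.3]},
        {"text": "Ungrounded guess ...",
         "reflection": {"is_relevant": 0.4, "is_supported": 0.1, "is_useful": 0.5},
         "logprobs": [-0.1, -0.1, -0.1]},
    ]
    weights = {"is_relevant": 1.0, "is_supported": 2.0, "is_useful": 0.5}

    best = max(candidates, key=lambda c: score_candidate(c["reflection"], c["logprobs"], weights))
    print(best["text"])  # the grounded candidate wins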

We can probably quibble about the loaded terms used here like "self-reflection", but the idea that models can be trained to know when they don't have enough information isn't pure fantasy today.

[1] https://arxiv.org/abs/2310.11511

EDIT: I should also note that I generally do side with Lecun's stance on this, but not due to the "not enough information" canard. I think models learning from abstraction (i.e. JEPA, energy-based models) rather than memorization is the better path forward.


I watched an Andrej Karpathy video recently. He said that hallucination was because in the training data there were no examples where the answer is, "I don't know". Maybe I'm misinterpreting what he was saying though.

https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4832s


> The problem with LLMs is that the output is inherently stochastic

Isn't that true with humans too?

There's some leap humans make, even as stochastic parrots, that lets us generate new knowledge.


I think it is because we don't feel the random and chaotic nature of what we know as individuals.

If I had been born a day earlier or later I would have a completely different life because of initial conditions and randomness but life doesn't feel that way even though I think this is obviously true.


> there isn't a "I don't have enough information" option. This is due to the fact that LLMs are basically just giant look up maps with interpolation.

Have you ever tried telling ChatGPT that you're "in the city centre" and asking it if you need to turn left or right to reach some landmark? It will not answer with the average of the directions given to everybody who asked the question before, it will answer asking you to tell it where you are precisely and which way you are facing.


That's because, based on the training data, the most likely response to asking for directions is to clarify exactly where you are and what you see.

But if you ask it in terms of a knowledge test ("I'm at the corner of 1st and 2nd, what public park am I standing next to?") a model lacking web search capabilities will confidently hallucinate (unless it's a well-known park).

In fact, my personal opinion is that therein lies the most realistic way to reduce hallucination rates: rather than trying to train models to say "I don't know" (which is not really a trainable thing - models are fundamentally unaware of the limits of their own training data), instead just train them on which kinds of questions warrant a web search and which ones should be answered creatively.


I tried this just now on Chatbot Arena, and both chatbots asked for more information.

One was GPT 4.5 preview, and one was cohort-chowder (which is someone's idea of a cute code name, I assume).


I tried this just now on Chatbot Arena, and both chatbots very confidently got the name of the park wrong.

Perhaps you thought I meant "1st and 2nd" literally? I was just using those as an example so I don't reveal where I live. You should use actual street names that are near a public park, and you can feel free to specify the city and state.


I did think you meant it literally. Since I can't replicate the question you asked, I have no way of verifying your claim.

[flagged]


Neither do I. Right after I read your reply I knew I had made a mistake engaging with you.

i don't think it's the stochasticity that's the problem -- the problem is that the model gets "locked in" once it picks a token and there's no takesies backsies.

that also entails information destruction in the form of the logits table, but for the most part that should be accounted for in the last step before final feedforward


>This is due to the fact that LLMs are basically just giant look up maps with interpolation.

This is obviously not true at this point except for the most loose definition of interpolation.

>don't rely on things like differentiability.

I've never heard lecun say we need to move away from gradient descent. The opposite actually.


If multiple answers are equally likely, couldn't that be considered uncertainty? Conversely if there's only one answer and there's a huge leap to the second best, that's pretty certain.

I don’t buy Lecun’s argument. Once you get good RL going (as we are now seeing with reasoning models) you can give the model a reward function that rewards a correct answer most highly, rewards an “I’m sorry but I don’t know” less highly than that, penalizes a wrong answer, and penalizes a confidently wrong answer more severely. As the RL learns to maximize rewards I would think it would find the strategy of saying it doesn’t know in cases where it can’t find an answer it deems to have a high probability of correctness.
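
As a toy sketch of that kind of reward shaping (the numbers are arbitrary, and how "correct" and "confidence" get measured is the hard part being glossed over):

    def reward(answer_is_correct, model_abstained, stated_confidence):
        if model_abstained:
            return 0.2                       # "I don't know": small positive reward
        if answer_is_correct:
            return 1.0                       # correct answer: best outcome
        # Wrong answers are penalized, and penalized harder when stated confidently.
        return -1.0 - 1.0 * stated_confidence

    print(reward(True, False, 0.9))    #  1.0   correct
    print(reward(False, True, 0.0))    #  0.2   honest abstention
    print(reward(False, False, 0.1))   # -1.1   hedged but wrong
    print(reward(False, False, 0.95))  # -1.95  confidently wrong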

How do you define the "correct" answer?

Certainly not possible in all domains but equally certainly possible in some. There’s not much controversy about the height of the Eiffel Tower or how to concatenate two numpy arrays.

obviously the truth is what is the most popular. /s

A lot of the responses seem to be answering a different question: "Why does LeCun think LLMs won't lead to AGI?" I could answer that, but the question you are asking is "Why does LeCun think hallucinations are inherent in LLMs?"

To answer your question, think about how we train LLMs: We have them learn the statistical distribution of all written human language, such that given a chunk of text (a prompt, etc.) it then samples its output distribution to produce the next most likely token (word, sub-word, etc.) that should be produced and keeps doing that. It never learns how to judge what is true or false and during training it never needs to learn "Do I already know this?" It is just spoon fed information that it has to memorize and has no ability to acquire metacognition, which is something that it would need to be trained to attain. As humans, we know what we don't know (to an extent) and can identify when we already know something or don't already know something, such that we can say "I don't know." During training, an LLM is never taught to do this sort of introspection, so it never will know what it doesn't know.

I have a bunch of ideas about how to address this with a new architecture and a lifelong learning training paradigm, but it has been hard to execute. I'm an AI professor, but really pushing the envelope in that direction requires I think a small team (10-20) of strong AI scientists and engineers working collaboratively and significant computational resources. It just can't be done efficiently in academia where we have PhD student trainees who all need to be first author and work largely in isolation. By the time AI PhD students get good, they graduate.

I've been trying to find the time to focus on getting a start-up going focused on this. With Terry Sejnowski, I pitched my ideas to a group affiliated with Schmidt Sciences that funds science non-profits at around $20M per year for 5 years. They claimed to love my ideas, but didn't go for it....


Would you care to post your ideas somewhere online so others can read, critique, try etc?

"we love your ideas" == no

"when do you close the round?" = maybe

money in the bank account = yes


I believe that so long as weights are fixed at inference time, we'll be at a dead end.

Will Titans be sufficiently "neuroplastic" to escape that? Maybe, I'm not sure.

Ultimately, I think an architecture around "looping" where the model outputs are both some form of "self update" and "optional actionality" such that interacting with the model is more "sampling from a thought space" will be required.


I 100% agree with this and sampling from thought space rather than "thinking" in terms of language. I spent forever writing up an NSF grant proposal on exactly this idea and submitted it last May. I haven't heard back, but it probably won't be funded.

Why? Even animals sleep. And if you, for example, learn an instrument, you will notice that a lot of the learning of the muscle memory happens during sleep.

I guess you're saying that non-inference time training can be that "sleep period"?

Yes, i could imagine something like a humanoid robot, where the "short term memory" is just a big enough context to keep all input of the day. Then during "sleep" there is training where the information is processed.

But I also think that current LLM tech does not lead to AGI. You can't train something on pattern matching and expect it to magically become intelligent (although I could be wrong).

Imo an AGI would need to be able to interact with the environment and learn to reflect on its interactions and its abilities within it. I suspect we have the hardware to build something as intelligent as a cat or a dog, but not the algorithms.


Very much this. I’ve been wondering why I’ve not seen it much discussed.

There are many roadblocks to continual learning still. Most current models and training paradigms are very vulnerable to catastrophic forgetting. And are very sample inefficient. And we/the methods are not so good at separating what is "interesting" (should be learned) vs "not". But this is being researched, for example under the topic of open ended learning, active inference, etc.

As a leader in the field of continual learning, I somewhat agree, but I'd say that catastrophic forgetting is largely resolved. The problem is that the continual learning community largely has become insular and is mostly focusing on toy problems that don't matter, where they will even avoid good solutions for nonsensical reasons. For example, reactivation / replay / rehearsal works well for mitigating catastrophic forgetting almost entirely, but a lot of the continual learning community mostly dislikes it because it is very effective. A lot of the work is focusing on toy problems and they refuse to scale up. I wrote this paper with some of my colleagues on this issue, although with such a long author list it isn't as focused as I would have liked in terms of telling the continual learning community to get out of its rut such that they are writing papers that advance AI rather than are just written for other continual learning researchers: https://arxiv.org/abs/2311.11908

The majority are focusing on the wrong paradigms and the wrong questions, which blocks progress towards the kinds of continual learning needed to make progress towards creating models that think in latent space and enabling meta-cognition, which would then give architectures the ability to avoid hallucinations by knowing what they don't know.


Thanks a lot for this paper and the ones you shared deeper in the thread!

Any continual learning papers you're a fan of?

Depends on what angle you are interested in. If you are interested in continual learning for something like mitigating model drift such that a model can stay up-to-date where the goal is attain speed ups during training see these works:

Compared to other methods for continual learning on ImageNet-1K, SIESTA requires 7x-60x less compute than other methods and achieves the same performance as a model trained in an offline/batch manner. It also works for arbitrary distributions rather than a lot of continual learning methods that only work for specific distributions (and hence don't really match any real-world use case): https://yousuf907.github.io/siestasite/

In this one we focused on mitigating the drop in performance when a system encounters a new distribution. This resulted in a 16x speed up or so: https://yousuf907.github.io/sgmsite/

In this one, we show how the strategy for creating multi-modal LLMs like LLaVA is identical to a two-task continual learning system and we note that many LLMs once they become multi-modal forget a large amount of the capabilities of the original LLM. We demonstrate that continual learning methods can mitigate that drop in accuracy enabling the multi-modal task to be learned while not impairing uni-modal performance: https://arxiv.org/abs/2410.19925 [We have a couple approaches that are better now that will be out in the next few months]

It really depends on what you are interested in. For production AI, the real need is computational efficiency and keeping strong models up-to-date. Not many labs besides mine are focusing on that.

Currently, I'm focused on continual learning for creating systems beyond LLMs that incrementally learn meta-cognition and working on continual learning to explain memory consolidation works in mammals and why we have REM phases during sleep, but that's more of a cognitive science contribution so the constraints on the algorithms differ since the goal differs.


> working on continual learning to explain memory consolidation <how> works in mammals and why we have REM phases during sleep

That's a nice model: human short-term memory is akin to the context window, and REM sleep consolidating longer-term memories is akin to updating the model itself.

How difficult would it be to perform limited focused re-training based on what's been learnt (e.g. new information, new connections, corrections of errors, etc.) within a context window?


Self updating requires learning to learn, which I'm not sure we know how to do.

He's right but at the same time wrong. Current AI methods are essentially scaled up methods that we learned decades ago.

These long horizon (agi) problems have been there since the very beginning. We have never had a solution to them. RL assumes we know the future which is a poor proxy. These energy based methods fundamentally do very little that an RNN didn't do long ago.

I worked on higher dimensionality methods, which is a very different angle. My take is that it's about the way we scale dependencies between connections. The human brain makes and breaks a massive number of neuron connections daily. Scaling the dimensionality would imply that a single connection could be scaled to encompass significantly more "thoughts" over time.

Additionally, the true solution to these problems is as likely to be found by a kid with a laptop as by a top researcher. If you find the solution to CL on a small AI model (MNIST), you solve it at all scales.


Not exactly related, but I wonder sometimes if the fact that the weights in current models are very expensive to change is a feature and not a "bug".

Somehow, it feels harder to trust a model that could evolve over time. Its performance might even degrade. That's a steep price to pay for having memory built in and a (possibly) self-evolving model.


We degrade, and I think we are far more valuable than one model.

For a kid with a laptop to solve it would require the problem to be solvable with current standard hardware. There's no evidence for that. We might need a completely different hardware paradigm.

Also possible and a fair point. My point is that it's a "tiny" solution that we can scale.

I could revise that by saying a kid with a whiteboard.

It's an Einstein×10 moment, so who knows when that'll happen.


I'm not an insider and I'm not sure whether this is directly related to "energy minimization", but "diffusion language models" have apparently gained some popularity in recent weeks.

https://arxiv.org/abs/2502.09992

https://www.inceptionlabs.ai/news

(these are results from two different teams/orgs)

It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.


And they seem to be about 10x as fast as similar sized transformers.

No, 10x fewer sampling steps. Whether or not that means 10x faster remains to be seen, as a diffusion step tends to be more expensive than an autoregressive step.

If I understood correctly, in practice they show an actual speed improvement on high-end cards, because autoregressive LLMs are bandwidth limited and not compute bound, so switching to a more expensive but less memory-bandwidth-heavy approach is going to work well on current hardware.

The SEDD architecture [1] probably allows for parallel sampling of all tokens in a block at once, which may be faster but not necessarily less computationally demanding in terms of runtime times computational resources used.

[1] Which Inception Labs's new models may be based on; one of the cofounders is a co-author. See equations 18-20 in https://arxiv.org/abs/2310.16834


You could reframe the way LLMs are currently trained as energy minimization, since the Boltzmann distribution that links physics and information theory (and correspondingly, probability theory as well) is general enough to include all standard loss functions as special cases. It's also pretty straightforward to include RL in that category as well.
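
To make that reframing concrete, a tiny numeric check (temperature fixed at 1, toy numbers):

    import numpy as np

    logits = np.array([2.0, 0.5, -1.0])          # model scores for three tokens
    energies = -logits                           # define energy as negative score

    # Boltzmann distribution: p(x) = exp(-E(x)) / Z
    Z = np.exp(-energies).sum()
    p_boltzmann = np.exp(-energies) / Z

    # Ordinary softmax over the logits gives the identical distribution, and
    # cross-entropy training (-log p of the observed token) is then
    # "energy of the observed token plus log Z", i.e. an energy being minimized.
    p_softmax = np.exp(logits) / np.exp(logits).sum()
    print(np.allclose(p_boltzmann, p_softmax))              # True
    print(energies[0] + np.log(Z), -np.log(p_softmax[0]))   # same number twice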

I think what Lecun is probably getting at is that there's currently no way for a model to say "I don't know". Instead, it'll just do its best. For esoteric topics, this can result in hallucinations; for topics where you push just past the edge of well-known and easy-to-Google, you might get a vacuously correct response (i.e. repetition of correct but otherwise known or useless information). The models are trained to output a response that meets the criteria of quality as judged by a human, but there's no decent measure (that I'm aware of) of the accuracy of the knowledge content, or the model's own limitations. I actually think this is why programming and mathematical tasks have such a large impact on model performance: because they encode information about correctness directly into the task.

So Yann is probably right, though I don't know that energy minimization is a special distinction that needs to be added. Any technique that we use for this task could almost certainly be framed as energy minimization of some energy function.


My observation from the outside watching this all unfold is that not enough effort seems to be going into the training schedule.

I say schedule because the “static data once through” regime is, in my mind, one of the root problems.

Think about what happens when you read something like a book. You’re not “just” reading it, you’re also comparing it to other books, other books by the same author, while critically considering the book recommendations made by your friend. Any events in the book get compared to your life experience, etc…

LLM training does none of this! It’s a once-through text prediction training regime.

What this means in practice is that an LLM can’t write a review of a book unless it has read many reviews already. They have, of course, but the problem doesn’t go away. Ask an AI to critique book reviews and it’ll run out of steam because it hasn’t seen many of those. Critiques of critiques is where they start falling flat on their face.

This kind of meta-knowledge is precisely what experts accumulate.

As a programmer I don’t just regurgitate code I’ve seen before with slight variations — instead I know that mainstream criticisms of micro services misses their key benefit of extreme team scalability!

This is the crux of it: when humans read their training material they are generating an “n+1” level in their mind that they also learn. The current AI training setup trains the AI only the “n”th level.

This can be solved by running the training in a loop for several iterations after base training. The challenge of course is to develop a meaningful loss function.

IMHO the “thinking” model training is a step in the right direction but nowhere near enough to produce AGI all by itself.


This is a somewhat nihilistic take with an optimistic ending. I believe humans will never fix hallucinations. Amount of totally or partially untrue statements people make is significant. Especially in tech, it's rare for people to admit that they do not know something. And yet, despite all of that the progress keeps marching forward and maybe even accelerating.

Yeah, I think a lot of people talk about "fixing hallucinations" as the end goal, rather than "LLMs providing value", which misses the forest for the trees; it's obviously already true that we don't need totally hallucination-free output to get value from these models.

Even as language models can partially solve a few problems, we remain with the problem of achieving Artificial General Intelligence, that the presence of LLMs has exacerbated because they so often reveal to be artificial morons.

Intelligence finds solutions - actual, solid solutions.

More than "fixing" hallucinations, the problem is going beyond them (arriving to "sobriety").


I’m not sure I follow. Sure, people lie, and make stuff up all the time. If an LLM goes and parrots that, then I would argue that it isn’t hallucinating. Hallucinating would be where it makes something up that is not in its training set nor logically deducible from it.

I think most humans are perfectly capable of admitting to themselves when they do not know something. Computers ought to do better.

You must interact with a very different set of humans than most.

Once one starts thinking of them as "concept models" rather than language models or fact models, "hallucinations" become something not to be so fixated on. We transform tokens into 12k+ length embeddings... right at the start. They stop being language immediately.

They aren't fact machines. They are concept machines.


Not an argument. "Many people are delirious, yet some people create progress". What is that supposed to imply?

I haven't read Yann Lecun's take. Based on your description alone my first impression would be: there's a paper [1] arguing that "beam search enforces uniform information density in text, a property motivated by cognitive science". UID claims, in short, that a speaker only delivers as much content as they think the listener can take (no more, no less) and the paper claims that beam search enforced this property at generation time.

The paper would be a strong argument against your point: if neural architectures are already constraining the amount of information that a text generation system delivers the same way a human (allegedly) does, then I don't see which "energy" measure one could take that could perform any better.

Then again, perhaps they have one in mind and I just haven't read it.

[1] https://aclanthology.org/2020.emnlp-main.170/


I believe he’s talking about some sort of ‘energy as measured by distance from the models understanding of the world’ as in quite literally a world model. But again I’m ignorant, hence the post!

In some respects that sounds similar to what we already do with reward models. I think with GRPO, the “bag of rewards” approach doesn’t strike me as terribly different. The challenge is in building out a sufficient “world” of rewards to adequately represent more meaningful feedback-based learning.

While it sounds nice to reframe it like a physics problem, it seems like a fundamentally flawed idea, akin to saying “there is a closed form solution to the question of how should I live.” The problem isn’t hallucinations, the problem is that language and relativism are inextricably linked.


When an architecture is based around world model building, then it is a natural outcome that similar concepts and things end up being stored in similar places. They overlap. As soon as your solution starts to get mathematically complex, you are departing from what the human brain does. Not saying that in some universe it might not be possible to make a statistical intelligence, but when you go that direction you are straying away from the only existing intelligences that we know about: the human brain. So the best solutions will closely echo neuroscience.

This sort of measure is a decent match for BPB though. BPB=-log(document_probability)/document_length_bytes and perplexity=e^(BPB*document_length_bytes/document_length_tokens). We already train models by minimizing perplexity, and model outputs are already those that are high probability. Though like with EBMs, figuring out outputs with even higher probability would require an expensive search step.

Many of his arguments make “logical” sense, but one way to evaluate them is: would they have applied equally well 5 years ago? and would that have predicted LLMs will never write (average) poetry, or solve math, or answer common-sense questions about the physical world reasonably well? Probably. But turns out scale is all we needed. So yeah, maybe this is the exact point where scale stops working and we need to drastically change architectures. But maybe we just need to keep scaling.

This concept comes from Hopfield networks.

If two nodes are on, but the connection between them is negative, this causes energy to be higher.

If one of those nodes switches off, energy is reduced.

With two nodes this is trivial. With 10 nodes it's more difficult to solve, and with billions of nodes it is impossible to "solve".

All you can do then is try to get the energy as low as possible.

This way also neural networks can find out "new" information, that they have not learned, but is consistent with the constraints they have learned about the world so far.
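
A toy numeric version of that picture (three nodes, weights invented for illustration):

    import numpy as np

    # Symmetric weights between 3 binary (+1/-1) nodes; w[0,1] < 0 penalizes nodes 0 and 1 both being "on".
    w = np.array([[0.0, -1.0, 0.3],
                  [-1.0, 0.0, 0.3],
                  [0.3,  0.3, 0.0]])

    def energy(state):
        # Standard Hopfield energy: E = -1/2 * s^T W s  (lower = more consistent with the constraints)
        return -0.5 * state @ w @ state

    state = np.array([1.0, 1.0, 1.0])
    print(energy(state))                      # 0.4: the negative link between nodes 0 and 1 raises the energy

    # Greedy single-node updates: flip any node whose flip lowers the energy.
    for i in range(len(state)):
        flipped = state.copy()
        flipped[i] = -flipped[i]
        if energy(flipped) < energy(state):
            state = flipped
    print(state, energy(state))               # node 0 switched off; the energy drops to -1.0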


So, what’s modeled as a “node” in an EBM, and what’s modeled as a connection? Are they vectors in a tensor, (well I suppose almost certainly that’s a yes). Do they run side by side a model that’s being trained? Is the node connectivity architecture fixed or learned?

Slightly related: Energy Based Models (EBMs) are better in theory and yet too resource intensive. I tried to sell using EBMs to my org, but the price for even a small use case was prohibitive.

I learned it from: https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo...

Yann LeCun, and Michael Bronstein and his colleagues have some similarities in trying to properly Sciencify Deep Learning.

Yann LeCun's approach, at least for Vision, has one core tenet: energy minimization, just like in Physics. In his course, he also shows some current archs/algos to be special cases of EBMs.

Yann believes that understanding the Whys of the behavior of DL algorithms are going to be beneficial in the long term rather than playing around with hyper-params.

There is also a case for language being too low-dimensional to lead to AGI even if it is solved. Like, in a recent video, he said that the total amount of data existing in all digitized books and the internet is about the same as what a human child takes in in the first 4 or 5 years. He considers this low.

There are also epistemological arguments that language alone cannot lead to AGI, but I haven't heard him talk about them.

He also believes that Vision is a more important aspect of intellgence. One reason being it being very high-dim. (Edit) Consider an example. Take 4 monochrome pixels. All pixels can range from 0 to 255. 4 pixels can create 256^4 = 2^32 combinations. 4 words can create 4! = 24 combinations. Solving language is easier and therefore low-stakes. Remember the monkey producing a Shakespeare play by randomly punching typewriter keys? If that was an astronomically big number, think how obscenely long it would take a monkey to paint Mona Lisa by randomly assigning pixel values. Left as an exercise to the reader.

Juergen Schmidhuber has gone a lot quieter now. But he has also said that a world model, explicitly included in training and reasoning, is better than only text or image or whatever. He has a good paper with Lucas Beyer.


WTF. The cardinality of words is 100,000.

Since this exposes the answer, the new architecture has to be based on world model building.

The thing is, this has been known since even before the current crop of LLMs. Anyone who considered (only the English) language to be sufficient to model the world understands so little about cognition as to be irrelevant in this conversation.

Thanks. This is interesting. What kind of equation is used to assess an ebm during training? I’m afraid I still don’t get the core concept well enough to have an intuition for it.

Jürgen Schmidhuber has a paper with Lucas Beyer? I'm not aware of it. Which do you mean?

Oh, I made a mistake unfortunately. I meant hardmaru (David Ha).

I am very sorry.


>> believes LLMs will never fix hallucinations because, essentially, the token choice method at each step leads to runaway errors

I may have an actual opinion on his viewpoint, however, I have a nitpick even before that.

How exactly is 'LLM' defined here? Even if some energy-based thing is done, would some not call even that an LLM? If/when someone finds a way to fix it within the 'token choice' method, could some people not just start calling it something differently from 'LLM'.

I think Yann needs to rephrase what exactly he wants to say.


I've always felt like the argument is super flimsy because "of course we can _in theory_ do error correction". I've never seen even a semi-rigorous argument that error correction is _theoretically_ impossible. Do you have a link to somewhere where such an argument is made?

In theory transformers are Turing-complete and LLMs can do anything computable. The more down-to-earth argument is that transformer LLMs aren't able to correct errors in a systematic way like Lecun is describing: it's task-specific "whack-a-mole," involving either tailored synthetic data or expensive RLHF.

In particular, if you train an LLM to do Task A and Task B with acceptable accuracy, that does not guarantee it can combine the tasks in a common-sense way. "For each step of A, do B on the intermediate results" is a whole new Task C that likely needs to be fine-tuned. (This one actually does have some theoretical evidence coming from computational complexity, and it was the first thing I noticed in 2023 when testing chain-of-thought prompting. It's not that the LLM can't do Task C, it just takes extra training.)


As soon as you need to start leaning heavily on error correction, that is an indication that your architecture and solution is not correct. The final solution will need to be elegant and very close to a perfect solution immediately.

You must always keep close to the only known example we have of an intelligence which is the human brain. As soon as you start to wander away from the way the human brain does it, you are on your own and you are not relying on known examples of intelligence. Certainly that might be possible, but since there's only one known example in this universe of intelligence, it seems ridiculous to do anything but stick close to that example, which is the human brain.


> of course we can _in theory_ do error correction

Oh yeah? This is begging the question.


The fundamental distinction is usually made against contrastive approaches (i.e. make the correct answer more likely, make everything else we just compared it to less likely). EBMs are "only what is correct is made more likely, and the default for everything else is unlikely."

This is obviously an extremely high level simplification, but that's the core of it.


And in this categorization, autoregressive LLMs are contrastive, due to the cross-entropy loss.
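
A minimal numpy sketch of that point, with made-up numbers: if you read negative logits as energies, the usual cross-entropy loss is "push the correct token's energy down, and push everything else up through the normalizer," which is the contrastive flavour described above.

    import numpy as np

    def cross_entropy_as_energy(logits, target_idx):
        """Cross-entropy over a vocabulary, written in energy terms.

        Treat E[i] = -logits[i] as the energy of token i. The softmax
        normalizer sums over all tokens, so lowering the target's energy
        implicitly raises the relative energy of every other token.
        """
        energies = -np.asarray(logits, dtype=float)
        log_z = np.logaddexp.reduce(-energies)     # log sum_i exp(-E[i]), computed stably
        return energies[target_idx] + log_z        # == -log softmax(logits)[target_idx]

    logits = [2.0, 0.5, -1.0, 0.1]                 # toy "next token" scores
    print(cross_entropy_as_energy(logits, target_idx=0))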

Yann LeCun understands that this is a problem of electrical engineering and the statistical physics of machines, not a code problem.

The physics of human consciousness are not implemented in a leaky symbolic abstraction but the raw physics of existence.

The sort of autonomous system we imagine when thinking AGI must be built directly into substrate and exhibit autonomous behavior out of the box. Our computers are blackboxes made in a lab without centuries of evolving in the analog world, finding a balance to build on. They either can do a task or cannot. Obviously from just looking at one we know how few real world tasks it can just get up and do.

Code isn’t magic, it’s instruction to create a machine state. There’s no inherent intelligence to our symbolic logic. It’s an artifact of intelligence. It cannot imbue intelligence into a machine.


Well, it could be argued that the “optimal response” ie the one that sorta minimizes that “energy” is sorted by LLMs on the first iteration. And further iterations aren’t adding any useful information and in fact are countless occasions to veer off the optimal response.

For example if a prompt is: “what is the Statue of Liberty”, the LLM's first output token is going to be “the”, but it kinda already “knows” that the next ones are going to be “statue of liberty”.

So to me LLMs already “choose” a response path from the first token.

Conversely, an LLM that would try to find a minimum energy for the whole response wouldn't necessarily stop hallucinating. There is nothing in the training of a model that says that “I don’t know” has a lower “energy” than a wrong answer…
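
A toy sketch of that "the path is chosen early" point: greedy decoding commits to each token given everything before it, so the first few choices constrain the rest. Here next_token_logits is just a stand-in for a real model, not any particular API.

    # next_token_logits: any callable mapping a token-id prefix to a list of scores.
    def greedy_decode(next_token_logits, prompt_ids, max_new_tokens=16, eos_id=0):
        ids = list(prompt_ids)
        for _ in range(max_new_tokens):
            scores = next_token_logits(ids)                       # condition on the full prefix
            best = max(range(len(scores)), key=scores.__getitem__)
            ids.append(best)                                      # commit: no going back
            if best == eos_id:
                break
        return ids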


Not an ML researcher, but implementing these systems has shown this opinion to be correct. The non-determinism of LLMs is a feature, not a bug that can be fixed.

As a result, you'll never be able to get 100% consistent outputs or behavior (like you hypothetically can with a traditional algorithm/business logic). And that has proven out in usage across every model I've worked with.

There's also an upper-bound problem in terms of context where every LLM hits some arbitrary amount of context that causes it to "lose focus" and develop a sort of LLM ADD. This is when hallucinations and random, unrequested changes get made and a previously productive chat spirals to the point where you have to start over.


> where every LLM hits some arbitrary amount of context that causes it to "lose focus" and develop a sort of LLM ADD.

Humans brains have the same problem. As any intelligence probably. Solution for this is structural thinking. One piece at a time, often top-down. Educated humans do it, LLM can be orchestrated to do it too. Effective context window will be limited even though some claim millions of tokens.


I feel like the success of LLMs has been a combination of multiple factors coming together favourably:

1) Hardware becoming cheap enough to train models beyond a size where we could see emergent properties, and it is going to keep getting cheaper.

2) A model architecture that can, in a computationally less expensive manner, look at all inputs at the same time. CNNs and RNNs all succeeded at smaller scale because they added inductive bias in the architecture favourable to the input modality, but that also made them less generic. Attention is computationally simpler to scale and also has lower inductive bias.

3) Unsupervised text on the internet as a data source, which requires only light pre-processing and almost no annotation effort, reaching the scale required by the scaling laws for large models. Text data is also diverse enough to be generic, encompassing a variety of topics and thoughts, vs. ImageNet etc. which is highly specific and costly to produce.

Assuming that text-only models will hit a bottleneck, then to build the next generation of models, in addition to a new architecture, we also have to find a dataset that is even more generic and much richer in modalities, with an architecture able to natively ingest it?

However, something that is not predictable is how well the emergent properties can scale further with model size. Maybe a few more unlocks, like the model being able to retain information well in spite of really large context lengths, or the ability to SFT on super complex reasoning tasks without disrupting the weights enough to lose the unsupervised learning, might take us much further?


I have a paper coming up that I modestly hope will clarify some of this.

The short answer should be that it's obvious LLM training and inference are both ridiculously inefficient and biologically implausible, and therefore there has to be some big optimization wins still on the table.


I think the hard question is whether those wins can be realized with less effort than what we’re already doing, though.

What I mean is this: A brain today is obviously far more efficient at intelligence than our current approaches to AI. But a brain is a highly specialized chemical computer that evolved over hundreds of millions of years. That leaves a lot of room for inefficient and implausible strategies to play out! As long as wins are preserved, efficiency can improve this way anyway.

So the question is really, can we short cut that somehow?

It does seem like doing so would require a different approach. But so far all our other approaches to creating intelligence have been beaten by the big simple inefficient one. So it’s hard to see a path from here that doesn’t go that route.


Also, a brain evolved to be a stable compute platform in body that finds itself in many different temperature and energy regimes. And the brain can withstand and recover from some pretty severe damage. So I'd suspect an intelligence that is designed to run in a tighter temp/power envelope with no need for recovery or redundancy could be significantly more efficient than our brain.

The brain only operates in a very narrow temperature range too. 5 degrees C in either direction from 37 and you're in deep trouble.

Most brain damage would not be considered in the realm of what most people would consider "recoverable".

In some cases it doesn't recover even without physical or chemical damage. Psychiatric clinics are full of such stories.

How does this idea compare to the rationale presented by Rich Sutton in The Bitter Lesson [0]? Shortly put, why do you think biological plausibility has significance?

[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html


I'll have to refer you to my forthcoming paper for the full argument, but basically, humans (and all animals) experience surprise and then we attribute that surprise to a cause, and then we update (learn).

In ANNs we backprop uniformly, so the error correction is distributed over the whole network. This is why LLM training is inefficient.
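
To make the contrast concrete (this is only a toy illustration of the distinction, not the method in the forthcoming paper): a standard gradient step nudges every parameter, while a "surprise-gated" step would only touch parameters whose error signal is unusually large.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 3))          # toy weight matrix
    grad = rng.normal(size=(4, 3))       # gradient from some loss

    # Standard backprop-style step: every parameter moves a little.
    w_uniform = w - 0.01 * grad

    # "Surprise-gated" step: only parameters with an unusually large
    # gradient (a crude proxy for surprise) are updated.
    threshold = np.abs(grad).mean() + np.abs(grad).std()
    w_gated = w - 0.01 * grad * (np.abs(grad) > threshold)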


I’m not GP, but I don’t think their position is necessarily in tension with leveraging computation. Not all FLOPs are equal, and furthermore FLOPs != Watts. In fact a much more efficient architecture might be that much more effective at leveraging computation than just burning a bigger pile of GPUs with the current transformer stack

Right.

Honest question: Given that the only wide consensus of anything approaching general intelligence is humans, and that humans are biological systems that have evolved in physical reality, are there any arguments that better efficiency is even possible without leveraging the nature of reality?

For example, analog computers can differentiate near instantly by leveraging the nature of electromagnetism and you can do very basic analogs of complex equations by just connecting containers of water together in certain (very specific) configurations. Are we sure that these optimizations to get us to AGI are possible without abusing the physical nature of the world? This is without even touching the hot mess that is quantum mechanics and its role in chemistry which in turn affects biology. I wouldn't put it past evolution to have stumbled upon some quantum mechanic that allowed for the emergence of general intelligence.

I'm super interested in anything discussing this but have very limited exposure to the literature in this space.


The advantage of artificial intelligence doesn't even need to be energy efficiency. We are pretty good at generating energy; if we had human-level AI, even one using an order of magnitude more energy than a human, it would likely still be cheaper than a human.

Inference is already wasteful (compared to humans) but training is absurd. There's strong reason to believe we can do better (even prior to having figured out how).

That would mean with current resources AI can get so much more intelligent than humans, right? Aren't you scared?

That's a potential outcome of any increase in training efficiency.

Which we should expect, even from prior experience with any other AI breakthrough, where first we learn to do it and then we learn to do it efficiently.

E.g. Deep Blue in 1997 was IBM showing off a supercomputer, more than it was any kind of reasonably efficient algorithm, but those came over the next 20-30 years.


I’m looking forward to it! Inefficiency (if we mean energy efficiency) conceptually doesn’t bother me very much in that feels like Silicon design has a long way to go still, but I like the idea of looking at biology for both ideas and guidance.

Inefficiency in data input is also an interesting concept. It seems to me humans get more data in than even modern frontier models, if you use the gigabit/s estimates for sensory input. Care to elaborate on your thoughts?


> and biologically implausible

I really like this approach. Showing that we must be doing it wrong because our brains are more efficient and we aren't doing it like our brains.

Is this a common thing in ML papers or something you came up with?


Nah it’s just physics, it’s like wheels being more efficient than legs.

We know there is a more efficient solution (human brain) but we don’t know how to make it.

So it stands to reason that we can make more efficient LLMs, just like a CPU can add numbers more efficiently than humans.


Wheels is an interesting analogy. Wheels are more efficient now that we have roads. But there could never have been evolutionary pressure to make them before there were roads. Wheels are also a lot easier to get to work than robotic legs and so long as there’s a road do a lot more than robotic legs.

People think the first wheel was invented for making pottery. Biological machinery for the most part has to be self-reproducing, so there are a lot of limitations on design; it also has to be able to evolve, so you get inefficient solutions like the recurrent laryngeal nerve (a branch of the vagus nerve): a nerve that takes a route down into the chest, loops under an artery, and then goes back up to the larynx. In giraffes it's roughly 15 feet long to cover a shortest path of a few inches.

Wheels, other than rolling, would likely never evolve naturally because there's no real incremental path from legs to wheels, whereas flippers can evolve from webbed fingers incrementally getting better for moving in water.

I dunno, maybe there's an evolutionary path for wheels, but i don't think so.


Evolution does not need to converge on the optimum solution.

Have you heard of https://en.wikipedia.org/wiki/Bio-inspired_computing ?


I don't think GP was implying that brains are the optimum solution. I think you can interpret GP's comments like this- if our brains are more efficient than LLMs, then clearly LLMs aren't optimally efficient. We have at least one data point showing that better efficiency is possible, even if we don't know what the optimal approach is.

I agree. Spiking neural networks are usually mentioned in this context, but there is no hardware ecosystem behind them that can compete with Nvidia and CUDA.

Investments in AI are now counting by billions of dollars. Would that be enough to create an initial ecosystem for a new architecture?

A new HW architecture for an unproven SW architecture is never going to happen. The SW needs to start working initially and demonstrate better performance. Of course, as with the original deep neural net stuff, it took computers getting sufficiently advanced to demonstrate this is possible. A different SW architecture would have to be so much more efficient to work. Moreover, HW and SW evolve in tandem - HW takes existing SW and tries to optimize it (e.g. by adding an abstraction layer) or SW tries to leverage existing HW to run a new architecture faster. Coming up with a new HW/SW combo seems unlikely given the cost of bringing HW to market. If AI speedup of HW ever delivers like Jeff Dean expects, then the cost of prototyping might come down enough to try to make these kinds of bets.

Nvidia has a big lead, and hardware is capital intensive. I guess an alternative would make sense in the battery-powered regime, like robotics, where Nvidia's power hungry machines are at a disadvantage. This is how ARM took on Intel.

It does not, you're right. But it's an interesting way to approach the problem nevertheless. And given that we definitely aren't as efficient as a human brain right now, it makes sense to look at the brain for inspiration.

How are you separating the efficiency of the architecture from the efficiency of the substrate? Unless you have a brain made of transistors or an LLM made of neurons how can you identify the source of the inefficiency?

You can't, but the transistor-based approach is the inefficient one, and transistors are pretty good at efficiently doing logic; so either there's no possible efficient solution based on deterministic computation, or there's tremendous headroom.

I believe human and machine learning unify into a pretty straightforward model and this shows that what we're doing that ML doesn't can be copied across, and I don't think the substrate is that significant.


I am an MLE not an expert. However, it is a fundamental problem that our current paradigm of training larger and larger LLMs cannot ever scale to the precision people require for many tasks. Even in the highly constrained realm of chess, an enormous neural net will be outclassed by a small program that can run on your phone.

https://arxiv.org/pdf/2402.04494


The best-in-class chess program actually is an NN, just not an LLM.

> Even in the highly constrained realm of chess, an enormous neural net will be outclassed by a small program that can run on your phone.

This is true also for the much bigger neural net that works in your brain, and even if you're the world champion of chess. Clearly your argument doesn't hold water.


For the sake of argument let’s say an artificial neural net is approximately the same as the brain. It sounds like you agree with me that smaller programs are both more efficient and more effective than a larger neural net. So you should also agree with me that those who say the only path to AGI is LLM maximalism are misguided.

Smaller programs are better than artificial or organic neural nets for constrained problems like chess. But chess programs don't generalize to other intelligence applications the way organic neural nets do today.

> It sounds like you agree with me that smaller programs are both more efficient and more effective than a larger neural net.

At playing chess. (But also at doing sums and multiplications, yay!)

> So you should also agree with me that those who say the only path to AGI is LLM maximalism are misguided.

No. First of all, it's a claim you just made up. What we're talking about is people saying that LLMs are not the path to AGI- an entirely different claim.

Second, assuming there's any coherence to your argument, the fact that a small program can outclass an enormous NN is irrelevant to the question of whether the enormous NN is the right way to achieve AGI: we are "general intelligences" and we are defeated by the same chess program. Unless you mean that achieving the intelligence of the greatest geniuses that ever lived is still not enough.


Any chance that “reasoning” can fix this?

It kind of depends. You can broadly call any kind of search “reasoning”. But search requires 1) enumerating your possible options and 2) assigning some value to those options. Real world problem solving makes both of those extremely difficult.

Unlike in chess, there’s a functionally infinite number of actions you can take in real life. So just argmax over possible actions is going to be hard.

Two, you have to have some value function of how good an action is in order to argmax. But many actions are impossible to know the value of in practice because of hidden information and the chaotic nature of the world (butterfly effect).
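
In code, the "search as reasoning" recipe is just these two ingredients, and the point above is that neither function is available in closed form for the real world. Both callables below are placeholders.

    def best_action(state, enumerate_actions, value):
        candidates = enumerate_actions(state)                    # (1) enumerate the options
        return max(candidates, key=lambda a: value(state, a))    # (2) score them and argmax

    # Toy usage with a known, tiny action space and value function.
    moves = lambda s: ["a", "b", "c"]
    score = lambda s, a: {"a": 0.1, "b": 0.7, "c": 0.4}[a]
    print(best_action(None, moves, score))                       # -> "b"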


Doesn't something about AlphaGo also involve "infinitely" many possible outcomes? Yet they cracked it, right?

Go is played on a 19x19 board. At the beginning of the game the first player has 361 possible moves. The second player then has 360 possible moves. There is always a finite and relatively “small” number of options.

I think you are thinking of the fact that it had to be approached in a different way than Minimax in chess because a brute force decision tree grows way too fast to perform well. So they had to learn models for actions and values.

In any case, Go is a perfect information game, which as I mentioned before, is not the same as problems in the real world.


Not an insider, but:

I don't know about you, but I certainly don't generate text autoregressively, token by token. Also, pretty sure I don't learn by global updates based on taking the derivative of some objective function of my behavior with respect to every parameter defining my brain. So there's good biological reason to think we can go beyond the capabilities of current architectures.

I think probably an example of the kind of new architectures he supports is FB's Large Concept Models [1]. It's still a self-attention, autoregressive architecture, but the unit of regression is a sentence rather than a token. It maps sentences into a latent space via an autoencoder architecture, then has a transformer architecture in which the tokens are elements in that latent space.

[1] https://arxiv.org/abs/2412.08821
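
A rough sketch of that shape (every component below is an assumed stand-in, not the paper's actual code): the autoregressive unit is a sentence embedding rather than a token.

    def generate_sentences(document_sentences, encoder, latent_transformer, decoder, n_new=3):
        latents = [encoder(s) for s in document_sentences]   # sentences -> latent space
        out = []
        for _ in range(n_new):
            nxt = latent_transformer(latents)                # predict the next sentence *embedding*
            latents.append(nxt)
            out.append(decoder(nxt))                         # map the embedding back to text
        return out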


I argue that JEPA and its Energy-Based Model (EBM) framework fail to capture the deeply intertwined nature of learning and prediction in the human brain—the “yin and yang” of intelligence. Contemporary machine learning approaches remain heavily reliant on resource-intensive, front-loaded training phases. I advocate for a paradigm shift toward seamlessly integrating training and prediction, aligning with the principles of online learning.

Disclosure: I am the author of this paper.

Reference: (PDF) Hydra: Enhancing Machine Learning with a Multi-head Predictions Architecture. Available from: https://www.researchgate.net/publication/381009719_Hydra_Enh... [accessed Mar 14, 2025].


Update: Interesting paper, thanks. Comment on selection for Hydra — you mention v1 uses an arithmetic mean across timescales for prediction. Taking this analogy of the longer windows encapsulating different timescales, I’d propose it would be interesting to train a layer to predict weighting of the timescale predictions. Essentially — is this a moment where I need to focus on what just happened, or is this a moment in which my long range predictions are more important?

Ty for reading the paper! I completely agree! Assigning soft weights to the window based on context is a fascinating research area. This concept is similar to Ebbinghaus' forgetting curve, which emphasizes recency bias while requiring repeated exposure for long-term retention.
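
A toy version of that weighting idea (the names and shapes here are illustrative, not anything from the Hydra paper): instead of an arithmetic mean over the timescale predictions, learn context-dependent weights and take a weighted combination.

    import numpy as np

    def softmax(x):
        z = np.exp(x - np.max(x))
        return z / z.sum()

    def combine(predictions, context_features, W):
        # predictions: (n_timescales, dim); W: (n_timescales, n_context_features)
        weights = softmax(W @ context_features)   # one weight per timescale, from context
        return weights @ predictions              # weighted instead of arithmetic mean

    preds = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # 3 timescales, 2-dim outputs
    ctx = np.array([0.2, -0.1, 0.4])
    W = np.random.default_rng(0).normal(size=(3, 3))
    print(combine(preds, ctx, W))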

So you believe humans spend more energy on prediction, relative to computers? Isn't that because personal computers are not powerful enough to train big models, and most people have no desire to? It is more economically efficient to socialize the cost of training, as is done. Are you thinking of a distributed training, where we split the work and cost? That could happen when robots become more widespread.

The human brain operates at just 25W of power—less than the monitor you're likely using right now—whereas AI models like ChatGPT consume nearly 1GWh every 24 hours!

As I discuss in the paper, predictive coding suggests that the brain actively generates predictions and compares them to incoming sensory data (vision, hearing, etc.), prioritizing anomalies. Its efficiency stems from a hierarchical memory system that continuously updates only the "deltas"—the differences that matter. Embracing this approach could lead to a paradigm shift, enabling the development of significantly more energy-efficient AI in the future.
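
A tiny toy loop in that predictive-coding spirit (the values are made up): the "model" is just a running prediction of the sensory input, and only the prediction error, the delta, drives the update.

    import numpy as np

    prediction = np.zeros(3)
    learning_rate = 0.3

    for observation in [np.array([1.0, 0.5, 0.0]),
                        np.array([1.1, 0.4, 0.1]),
                        np.array([0.9, 0.6, 0.0])]:
        error = observation - prediction        # the delta that "matters"
        prediction += learning_rate * error     # update driven only by the error
        print(np.round(prediction, 3))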


Thank you. So, quick q - it would make sense to me that JEPA is an outcome of the YLC work; would you say that’s the case?

Sincere question - why doesn't RL-based fine-tuning on top of LLMs solve this, or at least push accuracy above a minimum acceptable threshold in many use cases? OAI has a team doing this for enterprise clients. Several startups rolling out of the current YC batch are doing versions of this.

If you mean the so called agentic AI, I don't think it's several. Iirc someone in the most recent demo day mentioned ~80%+ were AI

I have no idea about EBM, but I have researched a bit on the language modelling side. And let's be honest, GPT is not the best learner we can create right now (ourselves). GPT needs far more data and energy than a human, so clearly there is a better architecture somewhere waiting to be discovered.

Attention works, yes. But it is not biologically plausible at all. We don't do quadratic comparisons across a whole book or need to see thousands of samples to understand.

Personally I think that in the future recursive architectures and test time training will have a better chance long term than current full attention.

Also, I think that OpenAI's biggest contribution is demonstrating that reasoning-like behaviors can emerge from really good language modelling.


I feel like some hallucinations aren't bad. Isn't that basically what a new idea is - a hallucination of what could be? The ability to come up with new things, even if they're sometimes wrong, can be useful and happen all the time with humans.

"Hallucination" is a terrible term for this. We want models that can create new ideas and make up stories. The problem is that they will give false answers to questions with factual answers, and that they don't realize this.

In humans, this is known as confabulation, and it happens due to various forms of brain damage, especially with damage to orbitofrontal cortex (part of prefrontal cortex). David Rumelhart, who was the main person who came up with backpropagation in a paper co-authored with Geoff Hinton, actually got Pick's disease which specifically results in damage to prefrontal cortex and people with that disease exhibit a lot of the same problems we have with today's LLMs:


That’s a really interesting thought. I think the key part (as a consumer of AI tools) would be identifying the things that are guesses vs. deductions vs. completely accurate based on the training data. I would happily look up or think about the output parts that are possibly hallucinated myself, but we don’t currently get that kind of feedback. Whereas a human could list things out that they know, and then highlight the things they are making educated guesses about, which makes it easier to build upon.

To be fair most people don't give you that level of detail. But I agree

Any transformer based LLM will never achieve AGI because it's only trying to pick the next word. You need a larger amount of planning to achieve AGI. Also, the characteristics of LLMs do not resemble any existing intelligence that we know of. Does a baby require 2 years of statistical analysis to become useful? No. Transformer architectures are parlor tricks. They are glorified Google but they're not doing anything or planning. If you want that, then you have to base your architecture on the known examples of intelligence that we are aware of in the universe. And that's not a transformer. In fact, whatever AGI emerges will absolutely not contain a transformer.

It's not about just picking the next word; that doesn't at all settle whether Transformers can achieve AGI. Words are just one representation of information. And whether it resembles any intelligence we know is also not an argument, because there is no reason to believe that all intelligence must be based on anything we've seen (e.g. us, or other animals). The underlying architecture of Attention and MLPs can surely still produce something we could call an AGI, and in certain tasks it surely can be considered an AGI already. I also don't know for certain whether we will hit any roadblocks or architectural asymptotes, but I haven't come across any well-founded argument that Transformers definitely could not reach AGI.

The transformer is a simple and general architecture. Being such a flexible model, it needs to learn "priors" from data, it makes few assumptions on its distribution from the start. The same architecture can predict protein folding and fluid dynamics. It's not specific to language.

We on the other hand are shaped by billions of years of genetic evolution, and 200k years of cultural evolution. If you count the total number of words spoken by 110 billion people who ever lived, assuming 1B estimated words per human during their lifetime, it comes out to 10 million times the size of GPT-4's training set.

So we spent 10 million times more words on discovery than it takes the transformer to catch up. GPT-4 used about 10 thousand people's worth of lifetime language to catch up on all that evolutionary fine-tuning.
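
Back-of-envelope check of those ratios (GPT-4's training-set size is not public; ~1e13 tokens is only a commonly repeated rough estimate, assumed here):

    humans_ever = 110e9
    words_per_human = 1e9
    gpt4_tokens = 1e13                                # assumption, not an official figure

    humanity_words = humans_ever * words_per_human    # 1.1e20 words
    print(humanity_words / gpt4_tokens)               # ~1.1e7, i.e. roughly 10 million x
    print(gpt4_tokens / words_per_human)              # ~1e4, i.e. ~10 thousand lifetimes of language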


> words spoken by 110 billion people who ever lived, assuming 1B estimated words per human during their lifetime..comes out to 10 million times the size of GPT-4's training set

This assumption points in slightly the wrong direction, because no human could consume much more than about 1B words during their lifetime. So humanity could not gain an enhancement just by multiplying one human's words by 100 billion. I think a more correct estimate would be 1B words multiplied by 100.

I think current AI has already reached the size needed to become AGI, but to finish the job it probably needs a change of structure (though I'm not sure about this), and also some additional multidimensional dataset, not just texts.

I might bet on 3D cinema, and/or on automotive autopilot datasets, or something from real-life humanoid robots solving typical human tasks, like folding a shirt.


> Does a baby require 2 years of statistical analysis to become useful?

Well yes, actually.


And that's 2 years of statistical analysis of the entire human race's knowledge, drawn from all of written history, not just the last 2 years.

One small point: Token selection at each step is fine (and required if you want to be able to additively/linearly/independently handle losses). The problem here is the high inaccuracies in each token (or, rather, their distributions). If you use more time and space to generate the token then those errors go down. If using more time and space cannot suffice then, by construction, energy minimization models and any other solution you can think of also can't reduce the errors far enough.

The next-gen LLMs are going to use something like mipmaps in graphics: a stack of progressively smaller versions of the image, with a 1x1 image at the top. The same concept applies to text. When you're writing something, you have a high-level idea in mind that serves as a guide. That idea is such a mipmap. Perhaps the next-gen LLMs will be generating a few parallel sequences: the top level will be a slow-paced anchor and the bottom level the actual text, which depends on the slower upper levels.
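
A very loose sketch of that text-mipmap idea (generate is a stand-in for any text generator, and the level names are made up): produce a short high-level plan first, then expand it level by level, so the coarse level anchors the fine one.

    def mipmap_generate(prompt, generate,
                        levels=("one-sentence summary", "paragraph outline", "full text")):
        anchor = prompt
        for level in levels:
            anchor = generate(f"Write the {level} for: {anchor}")   # each level conditions the next
        return anchor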

Not an insider, but imo the work on diffusion language models like LLaDA is really exciting. It's pretty obvious that LLMs are good, but they are pretty slow. And in a world where people want agents, a lot of the time you want something that might not be that smart but is capable of going really fast and searching fast. You only need to solve search in a specific domain for most agents. You don't need to solve the entire knowledge of human history in a single set of weights.

I wonder if the error propagation problem could be solved with a “branching” generator? Basically at every token you fork off N new streams, with some tree pruning policy to avoid exponential blowup. With a bit of bookkeeping you could make an attention mask to support the parallel streams in the same context sharing prefixes. Perhaps that would allow more of an e2e error minimization than the greedy generation algorithm in use today?
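
For what it's worth, that branching generator is essentially beam search; a minimal sketch (next_token_logprobs is a stand-in for a real model returning a dict of token -> log-probability):

    import heapq

    def branch_decode(next_token_logprobs, prompt_ids, n_streams=4, steps=20):
        beams = [(0.0, list(prompt_ids))]                      # (total log-prob, ids)
        for _ in range(steps):
            candidates = []
            for score, ids in beams:
                for tok, lp in next_token_logprobs(ids).items():
                    candidates.append((score + lp, ids + [tok]))
            beams = heapq.nlargest(n_streams, candidates, key=lambda c: c[0])   # prune
        return max(beams, key=lambda c: c[0])[1]               # best full stream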

This seems really intuitive to me. If I can express something concisely and succinctly because I understand it, I will literally spend less energy to explain it.

Not an ML researcher, but neither of those ideas are going to work.

The token approach is inherently flawed because the tokens pre-suppose unique meaning when in fact they may not be unique.

Said another way, it lacks properties that would be able to differentiate true from false because the differentiating input isn't included and cannot be derived from the inputs given. This goes to decidability.


We are actually working on scaling energy-based models http://traceoid.ai

I hope you succeed. I've downloaded and at least skimmed hundreds of ML papers, many on alternative architectures. A subset of them built prototypes that claimed good results on benchmarks. Of those, many didn't pass further scrutiny due to various failures. Those that did pass often failed on real-world tasks despite doing well on benchmarks. That we're so jaded by failures of published models makes us even more skeptical of unpublished methods.

So, each architectural advance needs published prototypes solving real-world problems. The smallest I've seen do useful stuff are in the 100+M-3B range. There are also papers about testing advances with low pretraining cost: BabyLM; GPT2 replications; MosaicBERT. Some do straight pre-training while others distill field-proven models. Alternative architectures would do well to crank out examples like this to prove themselves.

Please, do build at least one of the above using your method. Post it to your site. Link to demos of the actual prototype in use. This might get an ecosystem going that builds on your ideas.


How many of these were actually addressing scaling EBMs though? I'm guessing none.

Including yours. Your landing page has no architecture, model, or performance comparisons. It's non-existent. You need something more tangible for us to believe in.

Remember that scientific method requires us to reject everything by default. Only after rigorous review of a working theory or prototype do we treat it as truth. Build what you want us to believe in. Let us see it smoke the competing models of similar size in key metrics. That will do more for you than anything else.

Again, I hope you're right and I get to see energy-based models being highly competitive. I haven't.


> Remember that scientific method requires us to reject everything by default

You are nowhere near as smart as you think you are. You are a STEMlord who has never produced any new knowledge who just repeats some platitudes. People doing actual research do not talk like this.

You might benefit from watching this video https://x.com/styx_boatman/status/1811820327552315805

Our work is very much work in progress. I mentioned it because we have a very promising path to scaling EBMs and I wanted to have a convo about it.

If you were actually curious and you actually cared about my claims, you would have asked some concrete followup questions. You responded with the dumbest cliches, so I will ignore your comments.


I suggested that you put a description of your ideas on the link you use to promote your company. You responded with several insults. Your profile of me was even opposite of the truth. I caution you that, if you're the founder, speaking this way might block great talent who might worry you are similarly abusive to employees.

I can see the inner problem, though, since I was very arrogant. After seeing a miracle, I put my faith in Jesus Christ who died for our sins (even us) and rose again. He turned a cold heart of stone into a warm one of flesh. I no longer feel a need to beat or dominate people online. Even better, I won't burn alive in Hell for it. Even better, He's taught me to serve more humbly.

I believe Christ can help you, too. You can be like the first Adam who led us to sin by selfish choices or like the last Adam who saved us by His self sacrifice. The renewal of the Holy Spirit will cause inner change that permeates your social life, business, everything. You'll be amazed. I pray He also frees you from the slavery of sin, esp arrogance, that once drove my life.


So much of our fundamental scientific progress has been made by people who were considered crazy and their ideas delusional. Even mundane software engineering is done with layers of code review and automated tests because even the best engineers are still pretty bad at it. At a larger level, humanity itself seems to largely operate more like an ensemble method where many people in parallel solve problems and we empirically find who was "hallucinating".

Which is just to say, it feels to me like there's a danger that the stochastic nature of outputs is fundamental to true creative intelligence and all attempts to stamp it out will result in lower accuracy overall. Rather we should be treating it more like we do actual humans and expect errors and put layers of process around things where it matters to make them safe.


It's weird that we don't have human-level AGI yet considering that we have AIs that are in some ways much smarter than humans.

Top-end LLMs write better and faster than most humans.

Top-end stable diffusion models can draw and render video much faster and with much more precision than the best human artists.


Whether or not he's right big picture, the specific thing about runaway tokens is dumb as hell.

There was an article about in the March 2025 issue of Communications of the ACM: https://dl.acm.org/doi/pdf/10.1145/3707200

I'm not a deep researcher, more of an amateur, but I can explain some things.

The main problem with the current approach is that to grow abilities you need to add more neurons, but this is not just energy-consuming, it is also knowledge-consuming: at GPT-4 level all the text sources of humanity are already exhausted and the model becomes essentially overfitted. So it looks like multi-modal models appeared not because they are so good, but because they could learn from additional sources (audio/video).

I have seen a few approaches to overcoming the overfitting problem, but as I understand it, no universal solution exists.

For example, one approach is to create synthetic training data from current texts, but this idea is limited by definition.

So current LLMs appear to have hit a dead end, and researchers are now trying to find a way out of it. I believe that in the coming years somebody will invent a universal solution (probably a combination of approaches) or suggest another architecture, and the progress of AI will continue.


The alternative architectures must learn from streaming data, must be error tolerant, and must have the characteristic that similar objects or concepts naturally come near to each other. They must naturally overlap.

A transformer will attend to previous tokens, and so it is free to ignore prior errors. I don't get LeCun's comment on error propagation, and look forward to a more thorough exposition of the problem.

Why do they have to use the word “hallucination” when the model makes a mistake? If you told your teacher or boss that you didn't get the answer wrong, you hallucinated it, they would send you to the hospital.


I can recommend his introduction to energy models. It is a bit older but explains the idea very well.

I don't think it's a coincidence that he is interested in non-LLM solutions, since he mentioned last year on Twitter that he doesn't have an internal monologue (I hope this is not taken as disparaging of him in any way). His criticisms of LLMs never made sense, and the success of reasoning models has shown him to be definitely wrong.

Yes, it is fascinating that humans can have such seemingly fundamental differences in how they function 'under the hood.' I also have a friend who is highly intelligent—they earned a STEM PhD from one of the best universities in the world—yet they struggle to follow complex movie plots, despite having a photographic memory. It would be interesting to develop mirror LLMs (or Large Anything Models) for all these different types of brains so we can study how exactly these traits manifest and interact.

Not formally an ML researcher but I’ve heard this (and similar from Melanie Mitchell) and it seems like ridiculous gatekeeping.

There’s no real rule worthy of any respect imho that LLMs can’t be configured to get additional input data from images, audio, proprioception sensors, and any other modes. I can easily write a script to convert such data into tokens in any number of ways that would allow them to be fed in as tokens of a “language.” Convolutions for example. A real expert could do it even more easily or do a better job. And then don’t LeCun’s objections just evaporate? I don’t see why he thinks he has some profound point. For gods sake our own senses are heavily attenuated and mediated and it’s not like we actually experience raw reality ourselves, ever; we just feel like we do. LLMs can be extended to be situated. So much can be done. It’s like he’s seeing http in 1993 and saying it won’t be enough for the full web… well duh, but it’s a great start. Now go build on it.

If anything the flaw in LLMs is how they maintain only one primary thread of prediction. But this is changing; having a bunch of threads working on the same problem and checking each other from different angles of the problem will be an obvious fix for a lot of issues.


Can someone elaborate on the energy bit? I vaguely recall something similar in ML 101 way back in university. Was that not widely used?

I don't think you need to be an ML researcher to understand his point of view. He wants to do fundamental research. Optimizing LLMs is not fundamental research. There are numerous other potential approaches, and it's obvious that LLMs have weaknesses that other approaches could tackle.

If he was Hinton's age then maybe he would also want to retire and be happy with transformers and LLMs. He is still an ambitious researcher that wants to do foundational research to get to the next paradigm.

Having said all of that, it is a misjudgement for him to be disparaging the incredible capabilities of LLMs to the degree he has.


> it is a misjudgement for him to be disparaging the incredible capabilities of LLMs to the degree he has.

Jeez, you'd think he kicked your dog.


Ever hear of Dissociated Press? If not, try the following demonstration.

Fire up Emacs and open a text file containing a lot of human-readable text. Something off Project Gutenberg, say. Then say M-x dissociated-press and watch it spew hilarious, quasi-linguistic garbage into a buffer for as long as you like.

Dissociated Press is a language model. A primitive, stone-knives-and-bearskins language model, but a language model nevertheless. When you feed it input text, it builds up a statistical model based on a Markov chain, assigning probabilities to each character that might occur next, given a few characters of input. If it sees 't' and 'h' as input, the most likely next character is probably going to be 'e', followed by maybe 'a', 'i', and 'o'. 'r' might find its way in there, but 'z' is right out. And so forth. It then uses that model to generate output text by picking characters at random given the past n input characters, resulting in a firehose of things that might be words or fragments of words, but don't make much sense overall.
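
If you want to play with the idea outside Emacs, a tiny character-level model in the same spirit fits in a few lines (the filename below is just a placeholder for any long plain-text file):

    import random
    from collections import defaultdict

    def build_model(text, n=3):
        model = defaultdict(list)
        for i in range(len(text) - n):
            model[text[i:i + n]].append(text[i + n])   # which char follows this n-char context
        return model

    def babble(model, seed, length=200, n=3):
        out = seed
        for _ in range(length):
            choices = model.get(out[-n:])
            if not choices:
                break
            out += random.choice(choices)              # sample the next char from the counts
        return out

    corpus = open("some_gutenberg_text.txt").read()    # placeholder: any Project Gutenberg text
    model = build_model(corpus)
    print(babble(model, seed=corpus[:3]))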

LLMs are doing the same thing. They're picking the next token (word or word fragment) given a certain number of previous tokens. And that's ALL they're doing. The only differences are really matters of scale: the tokens are larger than single characters, the model considers many, many more tokens of input, and the model is a huge deep-learning model with oodles more parameters than a simple Markov chain. So while Dissociated Press churns out obvious nonsensical slop, ChatGPT churns out much, much more plausible sounding nonsensical slop. But it's still just rolling the dice over and over and choosing from among the top candidates of "most plausible sounding next token" according to its actuarial tables. It doesn't think. Any thinking it appears to do has been pre-done by humans, whose thoughts are then harvested off the internet and used to perform macrodata refinement on the statistical model. Accordingly, if you ask ChatGPT a question, it may well be right a lot of the time. But when it's wrong, it doesn't know it's wrong, and it doesn't know what to do to make things right. Because it's just reaching into a bag of refrigerator magnet poetry tiles, weighted by probability of sounding good given the current context, and slapping whatever it finds onto the refrigerator. Over and over.

What I think Yann LeCun means by "energy" above is "implausibility". That is, the LLM would instead grab a fistful of tiles -- enough to form many different responses -- and from those start with a single response and then through gradient descent or something optimize it to minimize some statistical "bullshit function" for the entire response, rather than just choosing one of the most plausible single tiles each go. Even that may not fix the hallucination issue, but it may produce results with fewer obvious howlers.


+1

But there's a fundamental difference between Markov chains and transformers that should be noted. Markov chains only learn how likely it is for one token to follow another. Transformers learn how likely it is for a set of tokens to be seen together. Transformers add a wider context to the Markov chain. That quantitative change leads to a qualitative improvement: transformers generate text that is semantically plausible.


Yes, but k-token lookup was already a thing with markov chains. Transformers are indeed better, but just because they model language distributions better than mostly-empty arrays of (token-count)^(context).

I've never understood this critique. Models have the capability to say: "oh, I made a mistake here, let me change this" and that solves the issue, right?

A little bit of engineering and fine tuning - you could imagine a model producing a sequence of statements, and reflecting on the sequence - updating things like "statement 7, modify: xzy to xyz"


I get "oh, I made a mistake" quite frequently. Often enough, it's just another hallucination, just because I contested the result, or even just prompted "double check this". Statistically speaking, when someone in a conversation says this, the other party is likely to change their position, so that's what an LLM does, too, replicating a statistically plausible conversation. That often goes in circles, not getting anywhere near a better answer.

Not an ML researcher, so I can't explain it. But I get a pretty clear sense that it's an inherent problem and don't see how it could be trained away.


"Oh, I emptied your bank account here, let me change this."

For AI to really replace most workers like some people would like to see, there are plenty of situations where hallucinations are a complete no-go and need fixing.


Isn’t that the answer if you tell them they are wrong?


