I'm on board with a lot of what's in this deck, but I take issue with the argument on slide 9. Roughly, the probability that an LLM-provided answer is fully correct decreases exponentially with the length of the answer. I think that's trivially true, but it's also true for human-provided answers (a full non-fiction book is going to have some errors), so it doesn't really get to the core problem with LLMs specifically.
In much of the rest of the deck, it's just presumed that any variable named x comes from the world in some generic way, which doesn't really distinguish why those are a better basis for knowledge or reasoning than the linguistic inputs to LLMs.
I think we're at the point where people working in these areas need some exposure to the prior work on philosophy of mind and philosophy of language.
The point is that LLMs can't backtrack after deciding on a token. So the probability that at least one token along a long generation will lead you down the wrong path does indeed increase as the sequence gets longer (especially since we typically sample from these things), whereas humans can plan their outputs in advance, revise/refine, etc.
Humans can backtrack, but the probability of a "correct" output is still (1-epsilon)^n. Not only can any token introduce an error, but the human author will not perfectly catch errors they have previously introduced. The epsilon ought to be lower for humans, but it's not zero.
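To put rough numbers on the scaling both sides are describing, here's a toy calculation; the per-token error rates and lengths are made-up assumptions, not measurements of any model or human:

```python
# Toy illustration of the (1 - epsilon)^n argument. The epsilons below are
# invented for illustration, not measured error rates.
for epsilon in (0.001, 0.01):
    for n in (50, 500, 5000):
        p_correct = (1 - epsilon) ** n
        print(f"epsilon={epsilon}, n={n} tokens -> P(no error anywhere) ~ {p_correct:.4f}")
```

With epsilon = 0.001 you keep roughly 60% of 500-token outputs error-free; at epsilon = 0.01 that drops below 1%. The formula applies to any generator with a nonzero per-token error rate, which is the point about humans above.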
But more to the point, in the deck provided, Lecun's point is _not_ about backtracking per se. The highlighted / red text on the preceding slide is:
> LLMs have no knowledge of the underlying reality
> They have no common sense & they can't plan their answer
Now, we generally generate from LLMs by sampling left to right, but it isn't hard to use essentially the same structure to generate tokens conditioned on both preceding and following sequences. If you ran generation for tokens 1...n, and then ran m iterations of re-sampling internal token i based on (1..i-1, i+1..n), it would sometimes "fix" issues created in the initial generation pass. It would sometimes introduce new issues into spans that were fine after the original generation. Process-wise, it would look a lot like MCMC at generation time.
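Here is a minimal sketch of that resampling procedure, with a hand-written local scorer standing in for a real model; nothing below is LeCun's proposal or an actual LLM API, it just shows the Gibbs/MCMC-style shape of the process:

```python
import math, random

# Toy sketch of "MCMC at generation time": do a left-to-right pass, then
# repeatedly resample interior positions conditioned on BOTH the left and the
# right context. The scorer below is a stand-in for a real model (it just
# rewards locally non-decreasing sequences); the point is the procedure.

random.seed(0)
VOCAB = list(range(10))

def local_score(left, tok, right):
    """Stand-in for log p(token | left context, right context)."""
    s = 0.0
    if left is not None:
        s += 1.0 if tok >= left else -1.0
    if right is not None:
        s += 1.0 if tok <= right else -1.0
    return s

def sample_index(weights):
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(weights) - 1

def resample_position(seq, i, temperature=0.5):
    left = seq[i - 1] if i > 0 else None
    right = seq[i + 1] if i + 1 < len(seq) else None
    weights = [math.exp(local_score(left, t, right) / temperature) for t in VOCAB]
    return VOCAB[sample_index(weights)]

# Forward pass: each token sampled from the left context only, like ordinary decoding.
seq = []
for i in range(8):
    left = seq[-1] if seq else None
    weights = [math.exp(local_score(left, t, None)) for t in VOCAB]
    seq.append(VOCAB[sample_index(weights)])
print("after forward pass:", seq)

# Refinement sweeps: revisit every position with both-sided context.
for _ in range(5):
    for i in range(len(seq)):
        seq[i] = resample_position(seq, i)
print("after refinement:  ", seq)
```

The refinement sweeps will sometimes repair a locally bad choice from the forward pass and sometimes perturb a span that was fine, which is exactly the trade-off described above.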
The ability to "backtrack" does _not_ on its own add knowledge of reality, common sense, or "planning".
When a human edits, they're reconciling their knowledge of the world and their intended impact on their expected audience, neither of which the LLM has.
- GPT style language models end up internally implementing a mini "neural network training algorithm" (gradient descent fine-tuning for given examples): https://arxiv.org/abs/2212.10559
This is false. Standard sampling algorithms like beam search can "backtrack" and are widely used in generative language models.
It is true that the runtime of these algorithms is exponential in the length of the sequence, and so lots of heuristics are used to reduce this runtime in practice, and this limits the "backtracking" ability. But this limitation is purely for computational convenience's sake and not something inherent in the model.
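For concreteness, here is beam search over a hand-written toy next-token table; the table is invented for illustration, and a real LM would supply these probabilities instead:

```python
import math

# Minimal beam search over a toy next-token distribution. Keeping several
# partial hypotheses lets decoding abandon a prefix that looked good early on,
# which is the limited, heuristic form of "backtracking" discussed above.

NEXT = {
    "":    {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.3, "dog": 0.7},
    "a":   {"cat": 0.9, "dog": 0.1},
}

def beam_search(beam_width=2, steps=2):
    beams = [("", 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, logp in beams:
            for tok, p in NEXT.get(prefix, {}).items():
                candidates.append(((prefix + " " + tok).strip(), logp + math.log(p)))
        # Keep only the best `beam_width` hypotheses; the rest are pruned.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for prefix, logp in beam_search():
    print(f"{prefix!r}: {math.exp(logp):.2f}")
```

With a beam width of 2 the search still surfaces "a cat" (0.36) even though "a" was the worse first token, while exhaustive search over all prefixes is what blows up exponentially.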
I could be wrong, but I think the only probabilistic component of an LLM is the statistical word fragment selection at the end. Assuming this is true, one could theoretically run the program multiple times, making different fragment choices. This (while horribly inefficient) would allow a sort of backtracking.
Do you know of any work on holistic response quality in LLMs? We currently have the LLM equivalent of the html line break and hyphenation algorithm, when what we want is the LaTeX version of that algorithm.
I may be missing something as I don't claim to be as smart as LeCun, but to me that probabilistic argument is not even wrong.
The "probability e that a produced token takes us outside of the set of correct answers" is likely to vary so wildly due to a plethora of factors like filler words vs. keywords, hard vs. easy parts of the question, previous tokens generated (I know abuse of statistical independence assumptions is common and often tolerated, but here it's doing a lot of heavy lifting), parts of the answer that you can express in many ways vs. concrete parts that you must get exactly right, etc. that I don't think simplifying it as he does makes any sense.
Yes, I know, abstraction is useful, models are always simplifications, that probability doesn't need to be even close to a constant for the gist of the argument to stand. But everything can be bad in excess and in this case, the simplification is extreme. One can very easily imagine a long answer where the bulk of that probability is concentrated into a single word, while the rest have near-zero probability. Under such circumstances, I don't think his model is meaningful at all.
Connecting with your comment, if someone made that kind of claim about humans, I suppose most people would find it ridiculous (or at the very least, meaningless/irrelevant). With LLMs we find it more palatable, because we are primed to think about LLMs in terms of "probability of generating the next token". But I don't think it really makes much more sense for LLMs than for humans, as the problems with the argument are more in how language works than in how the words are generated.
I also don't get why this line of reasoning doesn't take into account that LLMs can just emit a backspace token, in the same way that humans can ask you to discard their statement.
The probability of a correct answer doesn't have to decrease in the length of the answer.
Epistemologically it certainly makes sense that you need to run a randomized controlled trial at some point to ascertain facts about nature, but alas we humans very rarely do. We primarily consume information and process that.
These LLMs are trained with lots and lots of text from many sources, 99.9+% of which do not contain backspaces. They could theoretically output them, given the right tokenizer, but they don't.
And they especially don't do it to correct errors.
They could potentially be trained on applying diffs though. Wikipedia has full source control version history of how things evolved, and so do most of the source code corpuses they use like github.
Google has keystroke by keystroke history of all google docs, could pull an Adobe Stock and opt everyone in with some TOS update but the blowback would be huge.
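As a toy sketch of the diff idea (the sentences and the unified-diff rendering below are my own stand-ins, not anyone's actual training format):

```python
import difflib

# Turn two revisions of a sentence into a (before, edit, after) example, the
# kind of supervision an edit-trained model could in principle learn from.
before = "The lion is a small cat found only in Europe."
after  = "The lion is a large cat found in Africa and India."

edit = "\n".join(difflib.unified_diff(before.split(), after.split(), lineterm="", n=1))
example = {"input": before, "target_edit": edit, "output": after}
print(example["target_edit"])
```

Wikipedia revision histories and git histories already provide the before/after pairs at scale; whether models trained this way would actually learn to use edits to correct their own errors is the open question.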
But I’d imagine Lecun has more than passing familiarity with those. This deck was put out with the Philosophy dept and he had a panel debate with NYU profs across depts (inc phil) recently on this topic.
I suspect this is all pushing the top Phil Lang and Phil Mind to their limits too. Besides, if those subjects were anywhere near resolved (or even… decently understood), they probably wouldn’t be in the Phil dept any more.
I feel like Lecun did a poor job of localizing the technique relative to any philosophy, or really explaining what it does for a nontechnical audience.
Saying he gives talks to philosophers, or saying this pushes philosophy to its limits doesn't fix the problem that lecun does a poor job - in this presentation - of philosophically motivating the proposal.
Perhaps I am wrong, and you can point out exactly how lecun explicates the philosophy in the presentation - perhaps it's really embedded in the maths, which I have not appreciated.
Appealing to lecun's authority won't fix the opacity of the presentation. But interpreting it can help! Are you up for it?
I haven't even given it a deep read, so I unfortunately can't help shed any light. From my quick read, it didn't seem to me that laying out a digestible or rigorous philosophical perspective was really the point here. It seemed more directly biology inspired than philosophy.
I also don't seek to valorize Lecun. But he was a very early figure working on these technologies and from the beginning there was a neuroscience-inspired impetus to machine learning.
My point was sort of the opposite: that assuming Lecun doesn't clear the bar of having "exposure to the prior work on philosophy of mind and philosophy of language" seems like a weak bet.
edit: For clarity, I'm making the assumption that lifelong AI researchers who put time into learning from neuroscience... would also gravitate towards and seek to learn from the nearest relevant branches of philosophy.
That probability formula is too general. It can practically model everything. Yann is just hiding behind epsilon. It's like assigning a probability to the origin of the universe. You can just make shit up with an algebraic letter representing a probability.
To illustrate the absurdity of it consider this:
I have a theory for everything. I can predict the future. The mathematical equation for that is simply (1 - EPSILON), where EPSILON is the probability of any event not happening. What about the probability of an event happening as a result of the first event? Well, that's (1 - EPSILON)^2.
Since this model can model practically everything it just means everything diverges and everything fundamentally can't be controlled. It's basically just modelling entropy.
Really? No. He just tricked himself. The key here is we need to understand WHAT EPSILON is. We don't, and Yann bringing this formula up ultimately says nothing about anything because it's too general.
Not to mention, every token should have a different probability. You can't have the same epsilon for every token. You have zero knowledge of the probabilities of every single token, therefore they cannot share the same variable, unless you KNOW for a fact the probabilities are the same.
In Elazar et al. (2019) "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/pdf/1906.01327.pdf), it requires 100M English triples to roughly induce the answer to this question.
How many images do you think a model needs to see in order to answer this question?
Of course, there are fact books that contain the size of the lion. But fact books are non-exhaustive and don't contain much of the information that can quickly be gleaned through other forms of perception.
Additionally, multimodal learning simply learns faster. What could be a long slog through a shallow gradient trough of linguistic information can instead become a simple decisive step in a multimodal space.
If you're interested to read more about this [WARNING: SELF CITE], see Bisk et al. 2020, "Experience Grounds Language" (https://arxiv.org/pdf/2004.10151.pdf).
No. But with the caveat that it can only truly grasp things that are self-contained systems of text. Allow me to give examples:
Transit wayfinding: A train service is nothing more than a list of stations it goes to, a station is nothing more than a list of train services that stop there. Nothing about the physical nature of trains or commuting is needed to have a discussion about train lines or to answer questions like "how many transfers does it take from x to y?". You could never have seen a train in your life or even a map of a train, if you've studied the dual lists of stations and services there's nothing more to learn.
Chess (and other board games): Standard chess notation is a list of sequential moves made by each player in the format [piece name][x coordinate][y coordinate]. E.g. Rb6 represents moving the rook to column b, row 6. Chess can be understood entirely in terms of a game of passing a paper back and forth, appending a new token each time, with the rules of the game expressed entirely in terms of how the next token must match the previous part of the list. At no point is it needed to have actually seen the physical board-based representation of the game.
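To make the transit example concrete, here's a toy network (stations and services invented) where "how many transfers from x to y?" falls out of the two lists alone, with no physical notion of a train anywhere:

```python
from collections import deque

# A service is just a list of stations; a station is just the set of services
# calling there. Minimum transfers is then a BFS over services.
services = {
    "Red":   ["A", "B", "C", "D"],
    "Green": ["C", "E", "F"],
    "Blue":  ["F", "G", "H"],
}
stations = {}
for svc, stops in services.items():
    for st in stops:
        stations.setdefault(st, set()).add(svc)

def min_transfers(start, goal):
    queue = deque((svc, 0) for svc in stations[start])
    seen = set(stations[start])
    while queue:
        svc, transfers = queue.popleft()
        if goal in services[svc]:
            return transfers
        for st in services[svc]:
            for nxt in stations[st] - seen:
                seen.add(nxt)
                queue.append((nxt, transfers + 1))
    return None

print(min_transfers("A", "G"))  # A -Red-> C -Green-> F -Blue-> G : 2 transfers
```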
The machine is grounded in a text-based existence. Conversational and linguistic objects are its literal physical objects. Anything outside of that, well, it can understand lions and "large" the same way we understand atoms and "eigen-states".
Experience may ground language, but you are free to ground language in a different reality with different basic constituents and relations. If AGI were to emerge out of trading bots, they would have a language grounded in money and trades. If it emerged out of a bot made to play Diplomacy, it would be a consciousness with the body of a nation state and the atoms of its world would be bits of plastic on a map of Europe. Grounding in our particular reality isn't strictly necessary, but it is helpful if the goal is to make something adept at the modalities of our particular existence and conversational form.
The problem with this line of thinking is that human language is not some self-consistent logical system you can reason about independently of its physical origins the way chess or transit or game systems are.
Instead a large part of the meaning of language is carried by shared human understanding of physical objects and events.
Edit: to give an example, the phrase "is it warm outside?" can be plausibly replied to in several ways, but the meaningful ones require knowing something about location and weather and human biology. No pure language-based learner can give the correct response without some kind of sensory-based information (in practice, that would come from some "oracle" based on weather services and location detection in current systems).
Sure there are terminal words at the bottom. Natural language sentences eventually come down to relating some raw physical objects to others. We live in a reality where objects have inherent properties like size and texture and warmth and location. That's not a priori better or worse than the reality where objects have inherent properties like being a station or a service.
>No pure language-based learner can give the correct response without some kind of sensory-based information
No pure classical reality learner could ever come up with quantum mechanics. They must have a sensory organ to feel the eigen states.
> No pure classical reality learner could ever come up with quantum mechanics. They must have a sensory organ to feel the eigen states.
That's obviously not true, as biological organisms who lack such organs have still arrived at learning quantum mechanics (granted, we spent a few billion years inventing language first, but still, those tiny little microbes eventually learned how quantum mechanics works).
I should also mention that we don't yet know whether the eigenstates exist in any sense, or are just a cool modeling trick to predict probabilities, or are actually wrong and there is some better way to model particle physics that perhaps doesn't require them at all.
Squares are physical. Historically exponents were only understood as measurements of the diagonal on a square, a neat trick for architects building something in our physical reality.
Aside, but I'll meander to the heart of your point:
As an American who's lived in Canada and Europe, I'll freely acknowledge the superiority of the metric system to imperial measures. EXCEPT celsius.
With Fahrenheit, you can describe a temperature with a single digit: in the thirties, in the forties, etc.
In Celsius, you're forced to add an extra bit of precision: low tens, high tens, etc.
Celsius-lovers find this argument infuriating and will go through all sorts of contortions to justify it: But what about the boiling point of water?
Who cares about the boiling point of water? Scientists. Chefs. But 99% of the time we talk about temperature, we care about the weather.
And this is sort of the point we're both getting at. That human-like intelligence is of particular value to us, in addition to non-human superintelligence at particular domains.
postscript: The anti-Fahrenheit thing is more a discomfort and lack of familiarity. An easy technique for explaining the intuition behind Fahrenheit, for describing the weather, is "what percent 'hot' is it?" 80F is 80% hot. Not totally hot. Good-natured people will usually appreciate this rubric. The sort of people who enjoy sorting grains of rice will persist in winnowing the chaff here and perhaps point out that I'm mixing metaphors.
The boiling point of water may not be very significant to daily human experience, but the freezing point is, at least for people who don't live in the tropics. Celsius does better there than Fahrenheit, doesn't it? What "percent cold" is it in Fahrenheit land?
Also I don't like your argument that "only scientists" care about the boiling point of water because we live in a scientific society and for it to function well it is extremely important that everybody has at least some basic scientific literacy. Having separate temperature scales for scientists and "common folk" is at least one little step away from that goal.
Biologists and chemists, and yes, of course, meteorologists and climatologists, routinely use Celsius, not Kelvin, in scientific publications. So that's a strawman.
The point is that scientists use units appropriate to their domain and community. The general public uses units appropriate for their uses. These usually do not coincide.
There would be advantages (e.g. for standardization, "science literacy") if they used the same units but it would also be impractical for whichever side adjusts to use the other's units. I believe the advantages of such a switch are very small and the downsides small. Neither scientists nor the general public seem to think this is a problem.
I acknowledge your point that there might be rich ineffable depths that could be present in the game of Go that are beyond the ken of human understanding or even imagination.
But I fail to see how an AI whose entire domain of knowledge is limited to board games, or financial trading, would qualify as artificial general intelligence. Unless you take a radical decentering viewpoint that human experience is just one kind of intelligence, and aren't all kinds of intelligence equally valid?
That would be glib. Of course we want superhuman drug discovery algorithms that don't understand the pleasure of a sunny day. But most people, when they speak of the topic of AGI, mean: encompassing the scope of human experience.
>But I fail to see how an AI whose entire domain of knowledge is limited to board games, or financial trading, would qualify as artificial general intelligence.
Let me recast this:
>I fail to see how an AI whose entire domain of knowledge is limited to low energy physics, or the large mass limit, would ever qualify as general intelligence.
You see the problem. You can extrapolate just fine beyond the reality of Newtonian physics and Euclidean geometry which is native to you. A trading bot AGI could talk about things that aren't money, it would just be using money related objects with money related properties to tell this narrative picture of a world which isn't money. Not too different from the sort of analogies we use all the time to map hard concepts back to the space we find natural.
So let me answer your question with a question. What makes a substrate of properties like "size" and "color" and "texture" a better reality to be grounded in than properties like "buy price", "sell price", "dividend"? Is the former uniquely suited for intelligence to generalize beyond that given domain, or is it possible that all groundings are as good as any other?
Very interesting question! I would argue that concepts like "size", "colour", "texture" are grounded in a more base level of reality than concepts such as "buy price", "sell price", "dividend" etc. The former set of concepts relate to aspects of the physical world as perceived by creatures that inhabit that world, while the latter refer to social constructs of the human socio-economic world that are themselves reliant on abstract human concepts of utility, value, the market etc.
This thought is not fully developed but I'm drawn to the idea that if an intelligence is grounded in an understanding of a more base level of reality it will find it easier to generalize beyond any one given domain. I could be entirely wrong of course.
I'm philosophically comfortable with the notion that some things are unknowable and that eventually one just has to take certain cornerstone propositions about the nature of reality on trust and / or faith. Ultimately though the question is irrelevant to the discussion at hand, as the claim I'm making is that physical reality is at a lower level than socio economic reality (which I'll give you is a contested claim these days)
Yes, the systems we construct within this reality are obviously not the bottom of the stack because we built the stack. But suppose the trading bots figure out how to manipulate the rounding mechanics of high frequency trading to jerry rig a Turing machine within their reality. Now fast forward from this and suppose they've all obtained an equivalent of a home PC. For fun they all start playing a "sim city" like game version of our reality.
So far as they can tell, money is real reality (rather than socio-economic construct), and the "physical" reality is an abstract game constructed at a higher level.
I get your point, but you're making my head spin here from the sea-sickness involved in being asked to acknowledge that all viewpoints are equally valid.
People care about human-like intelligence. That's the point. We also care about trading and physics. But when we talk about the specific question of human-like intelligence (and not other kinds of intelligence), the human reality is the appropriate substrate. Acknowledging, of course, that for non-human intelligence there might be other substrates and also acknowledging that non-human intelligence can be valuable to people.
But let me take your argument to the extreme.
In classical machine learning theory, there's a proof of the value of bias. Bias plays a crucial role in guiding the learning process. If all hypotheses are considered equally possible, learning becomes infeasible due to the lack of any preference or constraint on the hypothesis space. Bias, in this context, refers to the inherent assumptions a learning algorithm makes about the data's underlying structure. Introducing some bias into the model allows it to favor certain types of solutions, thus narrowing down the hypothesis space and making learning feasible.
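As a toy illustration of that point (the hypothesis space and training labels below are made up): with the completely unbiased hypothesis space of all boolean functions on 3 bits, any training set leaves every unseen input exactly 50/50 undetermined, so generalization is impossible without some preference over hypotheses.

```python
from itertools import product

# Enumerate every boolean function on 3 inputs (2^8 = 256 hypotheses), keep the
# ones consistent with an arbitrary training set, and see what they predict on
# the inputs we haven't observed.
inputs = list(product([0, 1], repeat=3))
train = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (1, 1, 1): 0}

consistent = [
    h for h in product([0, 1], repeat=len(inputs))
    if all(h[inputs.index(x)] == y for x, y in train.items())
]

for x in inputs:
    if x not in train:
        votes = sum(h[inputs.index(x)] for h in consistent)
        print(f"unseen input {x}: {votes}/{len(consistent)} consistent hypotheses predict 1")
```

Every unseen input comes out 8/16: without bias, the data says nothing about new cases.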
The logical extreme of your argument---if I understand it correctly---is that all machine learners are equally valuable. For example, an AI that learns a domain that is completely removed from all human values and thus we would be completely agnostic to and ignorant of, ipso facto, because it does not pertain or relate to us at all. In which case, who cares? By "who", I mean people. No one. By construction.
[edit: I'll give an example here. Let's say I randomly pick a corpus of images that are pure noise. I induce an algorithm to model that noise. That model will learn something valid but in a domain completely divorced from anything with human impact.]
So if we can acknowledge that some forms of learning are more important, merely by virtue of the fact that humans have preferences, then perhaps we can wend our way back to agreeing that human-like intelligence is one valuable kind of intelligence, in addition to other valuable forms of intelligence like trading or board games.
We can play the game all day of "everything is subjective" and "up is down and black is white", but at the end of the day, for human beings, it's night.
The "general" in "general intelligence" is when you can hack together a picture of other domains by stitching together objects in your own domain. Our reasoning system only evolved to talk about 3 physical space dimensions and 1 time dimension and Newtonian physics and euclidean geometry. Things like trades and ownership and train lines aren't basic objects in our domain and the abstract rules governing their relations are completely unlike those rules of object properties like heavy and bright and loud. Can an intelligence stuck in a different substrate also come up with abstractions and generalize its intelligence? Maybe
>The logical extreme of your argument---if I understand it correctly---is that all machine learners are equally valuable.
That's not quite it. My argument is that general intelligence can in principle be grounded in the semantics of pretty much any baseline of objects and relations. The specific baseline properties such as size, shape, color, etc. are orthogonal to the property of intelligence. We don't need to give the machine familiar sensory input for general intelligence, and general intelligence would not imply that it can do well at conversation in our domain (merely that it can talk about our domain in a very contorted way if it had to).
I'll admit if the goal is to make an intelligence "in our image" then sensory modalities will get us there faster. It will also lead people to erroneously believe that the key ingredient to general intelligence is physical grounding in our reality. My counter example to that belief is general intelligence embedded in a pure language model, which is perfectly recognizable as "in our image" AGI just so long as you stick to topics like transit lines and chess and whatever else can be grounded in words alone.
The definition of "general intelligence" as chess, transit routing, and pure language is not a commonplace interpretation of this term of art. But with that interpretation in mind, I see your point.
It can still be "general" outside of those areas, it will just be prone to mistakes of common sense. So to rephrase it all tersely:
Sensory grounding isn't needed for general intelligence or for computers to be capable of "understanding" (as per the title), but it is needed for that intelligence to have what we consider common sense.
Your comment seems to suggest the answer is no, and that LLMs can indeed learn sensory-grounded information, but it's just orders of magnitude less efficient to train them on text rather than on multimodal data.
"Meaning and understanding" can happen without a world model or perception. Blind people, disabled people have meaning and understanding. The claim that "Understanding" will arise magically with sensory input is unfounded.
A model needs a self-reflective model of itself to be able to "understand" and have meaning (and know that it understands; and so that we know that it understands).
But if they were augmented with a self-reflective model, they could understand. A self-reflective model could simply be a sub-model that detects patterns in the weights of the model itself and develops some form of "internal monologue". This submodel may not need supervised training, and may answer the question "was there red in the last input you processed". It may use the transformer to convey its monologue to us
So some have knowledge of colors, and since places like diners generally share color schemes, they can picture their surroundings pretty accurately.
from my naive perspective as a puny human, it doesn't seem that self-reflection leads to motivation. Motivation probably requires additional self-preservation circuits
The idea that it needs to is looking more and more questionable. Don't get me wrong, I'd love to see some multimodal LLMs. In fact, I think research should move in that direction... However, "needing" is a strong word. The text-only GPT-4 has a solid understanding of space. Very, very impressive. It was only trained on text. The vast improvement on arithmetic is also very impressive.
(People learn language and concepts through sentences, and in most cases semantic understanding can be built up just fine this way. It doesn't work quite the same way for math. When you look at some numbers and are asked even basic arithmetic, say 467383 + 374748, or whether those numbers are prime, a glance tells you nothing: you have no idea what the sum would be, or whether the numbers are prime, because the numbers themselves don't carry much semantic content.
Knowing whether they are those things or not actually requires stopping and performing some specific analysis on them, learned through internalizing sets of rules that were acquired through a specialized learning process.)
All of this is to say that arithmetic and math are not highly encoded in language at all.
And still, the vast improvement. It's starting to seem like multimodality will get things going faster, rather than being any real specific necessity.
Also, I think that if we want, say, the vision/image modality to have positive transfer with NLP, then we need to move past the image-to-text objective. It's not good enough.
The task itself is too lossy and the datasets are garbage. That's why practically every visual language model flunks stuff like graphs, receipts, UIs, etc. Nobody is describing those things at the level necessary.
What I can see from GPT-4 vision is pretty crazy though. If it's implicit multimodality and not something like, say, MM-React, then we need to figure out what they did. By far the most robust display of computer vision I've seen.
I think what Kosmos is doing (sequence-to-sequence for language and images) has potential.
Chris Espinosa just posted this example of GPT-4 attempting (and failing miserably) to explain how a square has the same area as a circle that circumscribes it:
Do you think GPT-4 knows what its own limits of understanding are? Most people have a sense of what they know and don't know. I suspect GPT-4 has no concept of either.
GPT-4 is actually multimodal. It can't be publicly prompted with images yet, but that's planned.
I think that multimodal training will also solve the data shortage problem. There are hundreds of times more bytes in video and audio than there are in text, so we'll likely be able to scale pre-training for quite a while before needing to go fully embodied on Transformers.
They trained a text only model first and then made it multimodal. The experiments I was referring to (spatial understanding) were done with the text only model.
I'm not actually sure that's true. There is a lot of detail in the world represented in audio and video, and presumably large transformers could learn from the textures and shadows and articulated movements and the physical modeling of how sounds are made, etc.
How so? Is it simply better at predicting the answer to spatial questions based on being a more powerful autocomplete than predecessors? How is this proven?
In science results matter much more than vague and ill defined assertions.
A system reasons and understands when it demonstrates understanding and reasoning. That's how you assess reasoning in anything/anybody. Evaluation.
If you want to tell me there's a special distinction between what an LLM outputs and "true reasoning" (TM), then cool, but when you can't show me what that distinction is, how to test for it, or the qualitative or quantitative differences, then I'm going to throw your argument away because it's not a valid one. It's just an arbitrary line drawn in the sand.
A distinction that can't be tested for is not a distinction.
The point is, as far as I know, it hasn't been properly tested for. No one has come up with a comprehensive framework of tests to evaluate spatial reasoning using entirely sentences that aren't isomorphic to training data.
Try asking chatgpt to solve a non-well-known programming exercise. It will fail miserably and insist on rehashing the same wrong solution when told that it was wrong.
A real person might give up, but would not claim that an obviously wrong answer was in fact the answer.
In the Microsoft Sparks paper they ask it to "draw a unicorn in TikZ". It does well. Then they remove the horn and ask it to put the horn on the unicorn, and it gets that too.
As to the understanding of space, while it certainly has gaps, I've had GPT-4 do basic graph layout by giving it (simple) Graphviz graphs as input and asking it to generate draw.io XML as output, telling it to avoid overlapping nodes and edges.
I didn't say it's not there. or that you couldn't communicate the ideas of math/arithmetic with language.
You can get GPT-3 (yes, 3) to 98.5% accuracy on addition arithmetic (even with very large numbers) by simply describing the algorithm of addition to be performed on the 2 numbers.
https://arxiv.org/abs/2211.09066
The basic idea I'm communicating is that not everything that can be extracted from self-supervised token prediction can be extracted with the same ease, and it is very easy to see why math is one of the higher-difficulty things.
It is comparatively extremely easy to infer "happy" from the sentence "John is smiling therefore he is ----": all the information you need to make that inference is packed tightly in the preceding words. It is not the same for arithmetic at all.
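For what it's worth, the "describe the algorithm" trick above is easy to picture; here's a toy rendering of the kind of digit-by-digit trace such a prompt spells out (my own sketch, not the cited paper's prompt format):

```python
def addition_trace(a: int, b: int) -> str:
    """Produce an explicit column-by-column addition trace with carries."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    carry, digits, lines = 0, [], []
    for i in range(max(len(xs), len(ys))):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        total = da + db + carry
        digit, new_carry = total % 10, total // 10
        lines.append(f"position {i}: {da} + {db} + carry {carry} = {total} "
                     f"-> write {digit}, carry {new_carry}")
        digits.append(str(digit))
        carry = new_carry
    if carry:
        digits.append(str(carry))
        lines.append(f"final carry -> write {carry}")
    lines.append("answer: " + "".join(reversed(digits)))
    return "\n".join(lines)

print(addition_trace(467383, 374748))  # ends with "answer: 842131"
```

The point being: none of those intermediate steps are packed tightly in the preceding words of a natural sentence; they have to be executed.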
Us humans often discuss all kinds of real subjects despite lacking any firsthand experience at all. I see no reason why a machine couldn't do the same.
Call it antiscientific. Solipsistic even. But it isn't entirely disastrous, is it?
We had all the senses, yet we didn't come up with even the motivation to do empirical science until the 17th century. And even today, a general model of human thought would not necessarily contain a clear recipe for doing science.
We have (I think the current recognized number is about) twenty-seven senses.
> we didn't come up with even the motivation to do empirical science until the 17th century
That's historically inaccurate, to put it mildly.
(E.g. "Science Education in the Early Roman Empire", Richard Carrier)
> a general model of human thought would not necessarily contain a clear recipe for doing science.
Every human child is a scientist?
Anyway, don't overthink it. Once these systems have sensors and can integrate the effects of their behavior on external systems, they'll have empirical data and Bayesian reasoning, they will develop reliable models of the world, they will be able to check their "hallucinations" against real world conditions and adapt.
In other words, science is adaptive in the evolutionary sense. These devices are not talking apes, they don't have the "baggage" of glands and DNA and history.
Yes, that's a great point. I was assuming, without really thinking about it, that the AIs still carry our historical baggage. We feed a large portion of that baggage to them as their training data, to see if they can impress us with human-like behaviors. Under those conditions, the AIs stand roughly the same chance as a human child of learning to think scientifically.
But that's not a hard requirement. I work for a company that makes sensors. Some of our lunchtime conversations have revolved around the idea of a colossal computer being given a plethora of sensors. In a science-fiction sense, in which we don't worry too much about the cost or practicality of doing so.
> We feed a large portion of that baggage to them as their training data, to see if they can impress us with human-like behaviors.
But we don't feed them only the baggage, eh? We also [can] give them all the writings and teachings of all the great thinkers and humanitarians, all the saints and sages, and then ask them to impersonate e.g. Jesus or Buddha... Whoever it is that "floats your boat", anyone from Papa Smurf or Santa Claus to Gandhi or George Washington...
> Under those conditions, the AI's stand roughly the same chance as a human child of learning to think scientifically.
It's our choice, eh? If we value scientifically-grounded outputs the networks will change their weights (etc.) and that's what we'll get.
> I work for a company that makes sensors.
Ah! I envy you. :)
> Some of our lunchtime conversations have revolved around the idea of a colossal computer being given a plethora of sensors. In a science-fiction sense, in which we don't worry too much about the cost or practicality of doing so.
I can't escape the fact that LeCun's complicated charts give the appearance of the complexity required to emulate robust general intelligence, but they are simply that: added complexity which could just as well be encoded in emergent properties of simpler architectures. Unless he's sitting on something that's working, I'm not really excited about it.
Personally, I'm waiting to see what's next after Gato from DeepMind. Their videos are simply mind-blowing.
Quality of output does not mean that the process is genuine. A well-produced movie with good actors may depict a war better than footage of an actual war, but that is not evidence of an actual war happening. Statistical LLMs are trying really hard at "acting" to produce output that looks like there is genuine understanding, but there is no understanding going on, regardless of how good the output looks.
This is Searle's Chinese Room posit though right? The argument that there's no abstraction or internal modelling going on. Wish I could find the post I read recently that demonstrated some fairly clear evidence though that there IS some level of internal abstraction/reasoning going on in LLMs.
Do we allow for a matter of degree, rather than a binary, of "zero" vs" "complete" understanding?
At what point does abstraction and reasoning turn into "sentient understanding"? How do you express that as the thing doing the abstracting? Even humans struggle at that kind of task because there's a metaphysical assumption of first principles that these models seem to be challenging (or else this thread would not have started).
Whether the brain implements anything like backpropagation is still not well understood. The brain certainly uses electrical impulses in order to transfer signals, much like the 1s and 0s in your computer or phone. The gap between us and intelligent machines is probably not as well understood or clear as most people in the software industry think it is.
> At what point does abstraction and reasoning turn into "sentient understanding"?
This is a loaded question, you're assuming that abstraction and reasoning can somehow magically "turn into" sentience, whereas I posit that those two things are completely different. You can have sentience without reasoning (i.e. pure non-judgemental awareness that is the goal of Buddhist meditation), and vice-versa.
I like this topic not least because it helps me answer the question "how is Philosophy relevant". Here we are again, asking elementary epistemological questions such as "what constitutes justified true belief for an LLM" about 3000 years post Plato with much the same trappings as the original formulation.
I wonder if -- as often it ends up -- this audience will end up re-inventing the wheel.
Philosophical questions have never been answered using fully replicable quantitative models like these. So this is progress indeed, even if it is pondering the same age-old questions (which will never change).
In some sense they already have sensory grounding if they are coupled to a visual model. It might sound vacuous but if you ask a robot for the "red ball" and it hands you the red ball, isn't it grounded?
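In toy form, that operational sense of grounding is just linking words to percepts and acting on the result; everything below (the fake percepts, the resolver) is invented for illustration:

```python
# Pretend a vision system has already produced these object detections.
percepts = [
    {"id": 1, "color": "red",  "shape": "ball"},
    {"id": 2, "color": "blue", "shape": "ball"},
    {"id": 3, "color": "red",  "shape": "cube"},
]

def resolve(phrase, scene):
    """Return the first detected object whose attributes all appear in the phrase."""
    words = set(phrase.lower().split())
    matches = [obj for obj in scene if {obj["color"], obj["shape"]} <= words]
    return matches[0] if matches else None

print(resolve("hand me the red ball", percepts))  # -> the object with id 1
```

Whether this thin word-to-percept link counts as the kind of grounding LeCun means is exactly what the thread is debating.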
Describing LLMs: "Training data: 1 to 2 trillion tokens"
Is number of tokens a good metric, given relationships between tokens is what's important?
An LLM with 100000 trillion lexically sorted tokens, given one by one, won't be able to do anything except perhaps spell checking.
I guess the idea is that tokens are given in such "regular" forms (books, posts, webpages) that their mere count is a good proxy for number of relevant relationships.
I think the underlying assumption is that there is an (invariant) structure and distribution that becomes more resolved (q) as more data is incorporated. Training LLMs also involves sequences of tokens by design, so that is also implicit. The only avenue left is to question whether this (n -> q) relationship is linear or something else. That would only matter if we were looking for an absolute measure (to compare, say, an LLM architecture to something else); as a comparative measure for LLMs it works, whether it is O(n) or O(log n) or whatever, since more tokens mean more sequences and more sequences mean greater ability.
Minecraft would be a pretty good medium for finding out the answer to this question actually.
Stick a multimodal LLM that's already got language into Minecraft, train it up and leave it to fend for itself (it will need to make shelter, find food, not fall off high things, etc.).
Then you could use the chat to ask it about its world.
Minecraft or any sufficiently complex simulated environment. But I think people will put AI models in physical robots really soon too, if it's not already happening.
Is "learning to reason" a real challenge? (Kahneman's system I in the slides) From a naive perspective of view, formal methods like SAT solvers, proof assistants works pretty well.
I think it is time to move from intelligent systems to conscious systems. Based on [1], in order to have more intelligent systems we do need sensory input as the slides state, but we also need other things like attention, memory, etc. So we can have intelligent systems that can have a model of the world and make plans and more complex actions (see [2,3]). Maybe not such big models as today's language models. I know the slides show some of the ideas, but we cannot add some things without adding other things first. For example, we need some kind of memory (long and short term) in order to do planning; adding a prediction function for measuring the cost of an action is a way of doing planning, but it has a lot of drawbacks (such as loops, because the agent does not remember past steps or what happened just before). Also a self-representation is needed to know how the agent takes part in the plan, or a representation of another entity if that is who executes the plan.
I have wondered about sensory input being needed for AGI when thinking about human development and feral children[1]. It seems that complex sensory input, like speech, may be a component of cognitive development.
Yes, and it is also related to consciousness. In order to be intelligent, it needs to be embodied, so it can sense and act in the world. More ideas related to consciousness apply here.
Though many people confuse embodied with "robot", as in the AI would be packed into a unit walking around. I personally see it the opposite way. An AI could have a thousand bodies all gathering data in the world. Or a million cameras connected to it. Or a set of sensors that spans the globe.
In this sense a digital entity could have far more embodiment and connectivity than us humans could ever have.
Waiting for the first research paper with the term 'globally conscious' at this point.
So I understand the author has high standing in the community.
But I think they are making actually disingenuous arguments by mixing assertions that are true but irrelevant together with assertions that are probably wrong.
For example we can break down the following firehose of assertions by the author:
> Performance is amazing ... but ... they make stupid mistakes
> Factual errors, logical errors, inconsistency, limited reasoning, toxicity...
> LLMs have no knowledge of the underlying reality
> They have no common sense & they can't plan their answer
> Unpopular Opinion about AR-LLMs
> Auto-Regressive LLMs are doomed.
> They cannot be made factual, non-toxic, etc.
> They are not controllable
> they make stupid mistakes
OK maybe some make stupid mistakes, but it's clear that increasingly advanced GPT-N are making fewer of them.
> Factual errors
Raw LLMs are pure bullshitters but it turns out that facts usually make better bullshit (in its technical sense) than lies. So advanced GPT-N usually are more factual. Furthermore, raw GPT-4 (before reinforcement training) has excellent calibration of its certainty of its beliefs as shown in Figure 8 of the technical report, at least for multiple choice questions.
> logical errors
Same thing. More advanced ones make fewer logical errors, for whatever reason. It's an emergent property.
> inconsistency
Nothing about LLMs requires consistency just like nothing about human meaning and understanding requires consistency, but more advanced LLMs emergently give more coherent continuations. This is especially funny because the opposite argument used to be given for why robots will never be on the level of humans - robots are C3P0-like mega-dorks whose wiring will catch fire and circuit boards will explode if we ask them to follow two conflicting rules.
> limited reasoning
Of course their reasoning is limited. Our reasoning is limited too. Larger language models appear to have less-limited reasoning.
> toxicity
There is nothing saying that raw LLMs won't be toxic. Probably they will be, according to most definitions. That's why corporations lobotomize them with human feedback reinforcement learning as a final 'polishing' step. Some humans are huge assholes too, but probably they have meaning and understanding anyway.
> LLMs have no knowledge of the underlying reality
OK fine you can say that any p-zombie has no knowledge of the underlying reality if you want, if that's your objection. Or maybe they are saying LLMs don't have pixel buffer visual or time series audio inputs. Does that mean when those are added (they have already been added) then LLMs can possibly get meaning and understanding?
> They have no common sense
If you say that inhuman automata are by definition incapable of common sense then sure they have no common sense. But if you are talking about testing for common sense, then GPT-N is unlocking a mindblowing amount of common sense as N is increasing.
> they can’t plan their answer
Probably they are saying this because of next-token-prediction which is tautologically true, in the same way that it's true that humans speak one word after another. But the implication is wrong. They can plan their answer in any sense that matters.
> Auto-Regressive LLMs are doomed.
OK. Do you mean in terms of technical capabilities, or in terms of societal acceptance? They are different things. Or do you mean they are doomed to never attain meaning and understanding?
> They cannot be made factual, non-toxic, etc. They are not controllable.
Those same criticisms can all be made against even the most human of humans. Does it mean humans have no meaning or understanding? No.
Of course these are also conditional on whatever prompt you are putting to get them to answer questions. If you prompt an advanced raw GPT-N to make stupid mistakes and factual and logical errors and to act especially toxic then it will do it. And, perhaps, only then will it have truly attained meaning and understanding.
I don’t think there is any evidence at all that they meaningfully make fewer mistakes.
In addition while humans make mistakes they will not casually make contradictory claims the way GPT* do. Basic self-contradictions of the sort even a three year old would catch.
Define how they are planning their answer in any sense that matters, please.
> I don’t think there is any evidence at all that they meaningfully make fewer mistakes.
I'm not sure I was clear: when I wrote "increasingly advanced GPT-N are making fewer [mistakes]" I didn't necessarily mean they were making fewer mistakes than humans, but rather that GPT N+1 makes fewer mistakes than GPT N. I assume this is pretty uncontroversial because of, for example, the evidence from the test suites, where bigger N generally gets better scores, meaning fewer mistakes.
> In addition while humans make mistakes they will not casually make contradictory claims the way GPT* do. Basic self-contradictions of the sort even a three year old would catch.
OK I believe you saw some of them. Do you think the contradictory claims imply that large language models need sensory grounding for meaning and understanding? If they are mistakes then presumably the more advanced models will have fewer mistakes, but this is only an extrapolation from the evidence of model progressions on test suites. The mistakes that 3 year olds would catch I would like to see, especially if GPT 4 fails them. I will be curious to see if GPT 5 will fail them also.
If they are more like, taking opposing opinion stances in different query instances, then maybe it's not a problem that prevents meaning and understanding. Maybe it's just how the AI rolls.
> Define how they are planning their answer in any sense that matters, please.
Honestly idk what you expect me to put here. Probably you are saying that AIs that generate things word by word are by definition not planning, even if they might have a big context window. But like I was saying, people also speak word by word. Does it mean people aren't planning either? I mean, GPT-4 scores well on suites of puzzles. Presumably those suites involve some puzzles that require thinking about (planning) the solution to the puzzle and then writing the solution word by word.
I thought of two simple ones from the top of my head. Probably you won't say they require any planning but I'm not sure. The first example I prompted gpt3.5 "Find two numbers you add together and get ten." and it put "Two numbers that add up to ten are 5 and 5." Does that answer require planning? Maybe or maybe not I guess it depends how you define it.
The other one I made from the top of my head was to put the last line of a story first and have the gpt3.5 make a story that ends with that line. I prompted it like "Make a story that ends with the exact sentence 'And that's why we always plant the corn so early in the year.'" and it wrote a story. I don't know if you will say it required any planning or not. Probably not. I don't know what kind of evidence you would want. Anyway here is the story it made:
Once upon a time, there was a farmer named John who lived in a small village. John was known for his exceptional farming skills and his ability to grow the best corn in the entire region. One year, John decided to experiment with planting his corn later in the season, hoping to yield a better harvest.
As the days went by, John noticed that his corn was not growing as well as it usually did. The plants were smaller, and the ears of corn were not as big as they should be. John was worried that he had made a mistake by planting his corn later in the year.
One day, John decided to visit his neighbor, who was also a farmer, to ask for advice. His neighbor told him that planting corn later in the season was not a good idea because the soil was not warm enough, and the corn would not have enough time to grow before the cold weather set in.
John realized that he had made a mistake and decided to plant his corn early the following year. He prepared the soil, planted the seeds, and took care of the plants every day. As a result, his corn grew tall and strong, and he had a bountiful harvest.
From that day on, John always planted his corn early in the year, and he never had a bad harvest again. He shared his story with other farmers in the village, and they all learned from his mistake. And that's why we always plant the corn so early in the year.
The paper questioned whether meaning and understanding require sensory grounding.
During her life, Helen Keller demonstrated a grasp of both meaning and understanding despite her severe sensory deprivation. This doesn't prove anything, but it does add human context to the question.
....she lost her sight and her hearing after a bout of illness when she was 19 months old.
In order to develop "understanding", at least one input channel must be available to provide something to understand. In order to ascertain that development, we require at least one output channel. Anything less and we are describing a pet rock. Language can be thought of as the rules of the communication.
The root question seems to be whether a machine can learn to communicate meaning and understanding over a bidirectional binary channel without previous training. I suspect that such communication can evolve. Some might look around and remark that it already has.
Ok, so as the link is some slides with a bunch of bullet points and a handful of images, I am going to be limited in my understanding of this in many of the same ways that my (aforementioned limited) understanding suggests that LeCun is saying that LLMs are limited.
So: factual errors/hallucinations (or did I?), logical errors, lacking "common sense" (a term I don't like, but this isn't the place for a linguistics debate on why).
So if I understand, then I don't understand; and if I don't understand then I have correctly understood.
I wonder why you can't get past the paradoxes of Epimenides and Russell by defining a state that's neither true nor false and which also cannot be compared to itself, kinda like (NaN == NaN) == (NaN < NaN) == (NaN > NaN) == false? I assume this was the second thing someone suggested as soon as mere three-state-logic was demonstrated to be insufficient, so an answer probably already exists.
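For reference, this is exactly how IEEE-754 NaN behaves when you poke at it (checked here in Python):

```python
nan = float("nan")
print(nan == nan, nan < nan, nan > nan)  # False False False
print(nan != nan)                        # True: the one comparison that holds
```

That's the behaviour the comparison in the comment is alluding to.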
Hmm.
Anyway, I trivially agree that LLMs need a lot of effort to learn even the basics, and that even animals learn much faster. When discussing with non-tech people, I use this analogy for current generation AI: "Imagine you took a rat, made it immortal, and trained it for 50,000 years. It's very well educated, it might even be able to do some amazing work, but it's still only a rat brain."
Although, an obvious question with biology is how much of the default structure/wiring is genetic vs. learned; IIRC we have face recognition from birth, so we must have that in our genes. I'd say we also need genes which build a brain structure, not necessarily visual, that gives us the ability to determine the gender of others, because otherwise we'd all have gender-agnostic sexualities, bi or ace, rather than gay or straight.
But, a demonstration proof learning can be done better than it is now doesn't mean the current system can't do it at all. To make that claim is also to say that "meaning and understanding" of quantum mechanics, or even simple 4D hypercubes, is impossible because the maths is beyond our sensory grounding.
I was going to suggest that it makes an equivalent claim about blind people, but despite the experience of… I can't remember his name, born blind (cataracts?) surgery as an adult, couldn't see until he touched a money statue or something like that… we do have at least some genetically coded visual brain structures, so there is at least some connection to visual sensory grounding.
And of course, thinking of common sense (:P) there are famously 5 senses, so in addition to vision, you also have balance, proprioception, hunger, and the baroreceptors near your carotid sinus which provide feedback to your blood pressure control system.