Why are so many giants of AI getting GPTs so badly wrong? (medium.com/fergal.reid)
51 points by vinnyglennon on May 22, 2023 | 51 comments



What an irritating article.

The simplest answer, respecting the article's own assumption, is: they are giants of AI and they are not getting it wrong.

An immediate corollary of which is: the article's author is the one getting it wrong. Eyeballing the attempts to show that the "giants of AI" are wrong by poking GPT-4, I'd say there's a good chance of that.

An even simpler answer, violating the assumption is: they're not giants of AI and that's why they're getting it wrong. Or is that a simpler answer? Well, now we have to explain how a guy with a machine learning PhD writing on Substack is getting it right, when LeCun, Brooks, and so on are getting it wrong.

Chomsky is not an AI expert, so there's that: the author doesn't quite know who he's talking about.


Strongly disagree. I've seen towering giants in their fields develop a "mental block", if you will, when their settled notions about how their technology/science should work don't pan out. That's partly why you often see scientific breakthroughs among younger scientists (and, yes, while I realize the "average age at Nobel breakthrough" has been rising over the past century, it still skews relatively young). For example, Brooks in particular spent his career around the idea that intelligence requires "real world interaction", so it's not surprising to see him pooh-pooh the idea that something lacking that experience can become intelligent.

I would be really interested in examples where LeCun, Brooks, Chomsky, or any others have argued their points in relation to specific examples like the kinds pointed out in this article. I fully admit that I don't have a full catalog of their Twitter posts and interviews, so I genuinely would love to see this. I remain highly skeptical, though, when I only see them argue in generalities, e.g. "LLMs can't be smart because they don't have a model of explanations of how things occur." I mean, fine, but then how do they explain some of the quite frankly amazing capabilities of some of these systems if they're "just doing autocomplete"? For example, saw an absolutely mind-blowing example where someone taught ChatGPT an entirely new made up language, and ChatGPT was able to guess at, for example, the appropriate declensions for different words. Would love to see Chomsky's reaction to that one.


> For example, saw an absolutely mind-blowing example where someone taught ChatGPT an entirely new made up language, and ChatGPT was able to guess at, for example, the appropriate declensions for different words. Would love to see Chomsky's reaction to that one.

Chomsky talks about the NYT opinion piece and "Impossible Languages" here: https://www.commondreams.org/opinion/noam-chomsky-on-chatgpt


Chomsky is one of the major experts of the 20th century in language structure, as well as a founder of cognitive science. I'd say he is capable of having a fairly good idea of what language models may and may not be doing, even if he doesn't know how the computer processes the statistics needed to do it.


The man has his points, and his takedown of behaviorism is still excellent, but no one has found any trace of universal grammar. And while GPTs have a crude world model at best, you can't fault their understanding of syntax, developed without any prior model at all. On this matter I would not have too much faith in Chomsky's analysis. And regardless, he is just not a "big name in AI"! He never even claimed to be!


Universal grammar in the way that Chomsky suggested, a sort of structure all humans must have in their brain, is not a thing. But I believe there is something like the "universality of grammar", where the human vision system or motor system can have a grammatical quality to them, in the sense that there is ambiguity about which tree structure represents the world. I've seen this idea in many places, but I am not sure if someone has put it into (better) writing.


Universal grammar in the way that Chomsky suggested, a sort of structure all humans must have in their brain, is a thing. But it's more of a meta-grammar than a grammar, i.e. a way to build the grammar of the particular language the child is exposed to, because all languages' grammars have similarities, including those of sign languages.


So sayeth the Chomskyites. But the idea has been less and less broadly accepted over time. Some of my professors were still avowed fans, but most were willing to endorse at best extremely weakened versions.

Now, after vast cross-comparison of every language, UG has been whittled down to a tiny framework of seemingly random grammatical facts. It's not nothing! But we would expect some commonalities to hold regardless, just by chance. Considering how many times we've had to trim our model, how confident can we be that there's actually a baseline now? We're out of languages to test it on!

And there is another reason to doubt. We only have some 20,000 genes. Most of them code for structural stuff in the body, essential proteins and such. Explanations that require things to be hardwired into the brain ought to be penalized accordingly. Even stripped down to its bare bones, UG is quite a bit of information!

And why exactly would evolution hardcode this choose-your-own-grammar into the brain anyway? Until recently we could suppose that language would be difficult or impossible to figure out without it. But big dumb machine learning models trained on nothing but prediction worked out syntax just fine! Why would our brains, superb prediction engines of even larger size and with cross-modal cognition, be less capable?

Of course, humans learn language with far less input than GPT. But we learn everything with far less input than current models. And regardless of whether there is actually a universal grammar, we'd expect linguistic drift to tend towards structures that are readily understood by our brains, an advantage not shared by LLMs.


>> And why exactly would evolution hardcode this choose-your-own-grammar into the brain anyway?

Shouldn't the question rather be phrased as "why exactly would evolution not get rid of grammar in the brain"? To my understanding, language ability developed randomly and was retained because of its benefits. I'm not an expert, but I have a feeling that's how evolution works: instead of developing the traits that an organism needs, traits are developed at random and the ones that turn out to be useful are kept.

>> But big dumb machine learning models trained on nothing but prediction worked out syntax just fine!

Well, yes and no. Unlike human children, language models are trained specifically on text. Not just text, but text carefully tokenised to maximise the ability of a neural net to learn to model it. What's more, Transformer architectures trained to model text are specifically built to handle text, and they're given an objective function specially designed for the text modelling task, to boot.
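To make the tokenisation point concrete, here's a minimal sketch (assuming OpenAI's tiktoken package is installed); the model never sees raw characters or sounds, only integer token IDs:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-3.5/GPT-4
    ids = enc.encode("Language models see token IDs, not raw text.")
    print(ids)              # a short list of integers, one per token
    print(enc.decode(ids))  # round-trips back to the original string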

None of that is true for humans. As children, we have to distinguish language (not text!) from every other thing in our environment, and we certainly don't have any magickal tokenisers, or Oracles selecting labelled examples and objective functions for us to optimise. We have to figure out all of language, even the fact that such a thing exists at all, entirely on our own. So far, we haven't managed to do anything like that with any sort of machine (in the abstract sense), and certainly not with language models.


Whatever Chomsky is, he's not a "giant of AI".


I think Arthur C. Clarke has a pithy quote that challenges your argument from authority: "When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong."


To be clear, it's not me who's bringing up the authority of anyone, it's the article's author who does: he's the one calling people "giants of AI".

Which he clearly does to give his own article authority, that it wouldn't have otherwise. He wouldn't write a whole Substack piece to criticise something some random, anonymous user of FB, HN or Twitter wrote. He has to comment on the "giants of AI", otherwise his piece is basically irrelevant, just another voice adding more noise on top of the constant cacophony of opinions on the net.

In the same way, when Geoff Hinton resigned from Google, a whole bunch of articles in the lay press (The Guardian, NYT, others I don't remember but scanned briefly) introduced him as "The Godfather of AI". A clear ploy to make the article interesting to people who have no idea who Geoff Hinton is, or what he's done for "AI" to be called its "godfather" *.

It would be much more straightforward for the article's author to criticise the opinions he disagrees with by saying something like "Yann LeCun says so and so. I disagree because this and that". No need to say anyone is a giant of anything to make a big, splashy impression on your reader. If your opinion has weight, it can make a splash all by itself. If it doesn't, tying it to a "giant" will only make it sink faster.

HN has a similar guideline, in fact:

>> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

Basically, an appeal to authority is really the flip side of an ad-hominem: it tries to shift attention to the person, and away from what they said.

__________________

* It's because he, Bengio and LeCun are the Deep Learning Mafia.


Chomsky is a writer. He is probably scared out of his wits that his job will be taken over by a bunch of transformers.


Yes, the 94-year-old Noam Chomsky is scared shitless that AI will leave him jobless. This is definitely a sane take on this issue.


The entity that wrote it clearly lacks a world model :)


He's not a writer, he's a professor of linguistics. He almost singlehandedly pushed the field towards a more technical analysis and formalisation of grammar, suitable for automatic language parsing and translation.

If he isn't retired already, it's only out of personal choice, and not because he's desperate for a job.


I doubt he is in a position to care profoundly about losing his job. He's a highly successful person. Presumably he has made decent bank over the years.


It seems clear that the AI is not being trained on logic, physics or chemistry. Yet it's read and internalized (whatever that means) those subjects somehow.

It does seem simplistic to say "AI isn't programmed with logic or physics" when it's likely read several textbooks on those subjects. It might be expected to gain something from that, similar to any other intelligence that read those texts.

Maybe the confusion stems from "what the AI was programmed to do". It certainly wasn't programmed to do anything but make a model of text and maybe deeper structure of that text (meaning) using an enormous corpus.

But when we're exercising an AI, we're not exercising that programming. Not at all. We're exercising the model it made. And for that reason, the AI-model may indeed have some 'understanding' of physics and chemistry, though it had never been 'programmed' to know that.

All our previous instincts about computers say that they do what we've programmed them to do. When we try to apply that to AI, we go pretty far off the mark. Because the computer is perhaps the least interesting part of the AI. It's that model we should be concerned about.


This article is of a piece with the "Sparks of AGI" paper in that it doesn't really provide any formal justification for its tests, and is mainly just a dude saying "well, certainly this seems like it would be outside the training set and the LLM seems to handle it OK so it must be reasoning". Meanwhile, we have no way of inspecting the training set, and even if we did the data and model are so massive it would be impossible to tell whether it was doing simple substitution or whatever.

Ultimately e.g. Brooks must be right that the model has no connection to the world, because to the model there is no world, only text. If the training text includes things that are untrue in the world, the model cannot access the world to make a final determination that its internal statistics are faulty. If there is an aspect of the world that isn't covered by the training text, the model cannot divine it. It is just correlating text to other text.


> If there is an aspect of the world that isn't covered by the training text, the model cannot divine it. It is just correlating text to other text.

You could say the same for human perception, though. We don't actually "see" what our brain thinks we see, our brain fills in the image. Nonetheless, it doesn't stop us from having and using our slightly-inaccurate world-model based on this info. Also, we can often infer our faulty perceptions by comparative methods like "banana for scale" or looking for conflicts and contradictions.


I think there is a pretty sizeable difference between "human perception of the world is necessarily mediated by sense organs" and "all human textual output exhaustively covers the world and can be used to comprehensively describe and navigate it". Unless we are solipsists, we tend to agree that the world maintains a presence whether we are sensing it or not, and that even our senses are merely a limited reflection of what the world is and not the ground truth; that when our ideas or senses don't match the world, the world wins. For an LLM however there is nothing outside of text.


First, GPT-4 was trained on images and text, not just text. The images improved its text predictions. Because it has a world model, and they helped populate it. It outputs only text. But just like a Unix program writing text to stdout, nothing about the fact that your output format is constrained restricts the kind of computation you can perform in service of that output.

  Input: text and images
  Computation: ?
  Output: text
I think you're suggesting/asserting that the computation step -- the hidden layers -- must also be focused on text. But there's no such constraint in reality.

I don't think it's such a stretch to see the billions of written words given to GPT-4 as essentially a new kind of sense organ. They make it capable of rejecting the untrue claims made in its training set, because (a) the untrue claims are massively overwhelmed in number by the true claims, and (b) the true claims usually come with links to other knowledge that reinforces them.


> This article is of a piece with the "Sparks of AGI" paper in that it doesn't really provide any formal justification for its tests

> Ultimately e.g. Brooks must be right that the model has no connection to the world

Isn't that a huge double standard? If you want to say the AI has a world model, no matter how many supporting examples of informal experiences you bring up, they don't count because you don't have formal justification. But if you want to say the AI doesn't have a world model, you don't need formal justification or even any supporting data or falsifiable predictions; you just say "it couldn't possibly be the case", and that's enough.

(Also, the whole OthelloGPT thing seems as close to formal evidence for an internal world model as you can get.)


That's why I really hated the Sébastien Bubeck presentation. It was a guy showing us all this amazing stuff in this really overenthusiastic way, which we all just had to trust was amazing because we're not allowed to see what's in the pie or have access to the same model.


I'm so tired of these examples: I don't find them impressive.

Here's what I got several times with GPT, and others who are using it must have encountered the same issue (simplifying the example to show the problem I encounter all the time):

    GPT: Frodo says that 2 + 2 is 5.
    user: that is wrong, 2 + 2 is 4.
    GPT: You are correct, I was wrong. Here's the correct version:

         Gandalf says that 2 + 2 is 5.

This happens to me all the time. The textual model only evaluates part of the input and focuses, for whatever reason, on the wrong part of the text, and from there it's a descent into madness / futility.

This exact "partial input" issue was shown yesterday in the "Eliza vs GPT" vid:

    Eliza: "Did you come to me because you are programmed to simulate human language?"
    GPT: "Yes, that's correct.  I'm programmed to simulate human language behavior..."
Everybody using it must have noticed that. It's not answering what was asked. It's only partially taking into account the user's input. I don't know if it's because of the way questions are parsed or the way the model is compressed, but something is very seriously off.

And don't "you're prompting it wrong" me.


> others who are using it must have encountered the same issue

> And don't "you're prompting it wrong" me.

You can be prompting it expertly and still see the same errant behavior over and over again, because the models are nondeterministic; they take a different path every time.

Across the distribution of all users and all prompts, there will be a small subset of users who experience miraculous, seemingly intelligent behavior, and there will be a small subset of users who just get junk responses over and over again (the remainder of users get a mixed bag which tells them that the AI may be useful but unreliable).

Personally I think we should object to nondeterministic behavior out of machines, with exceptions for cryptography and physics simulations. It's going to lead to highly un-equal outcomes with no rhyme or reason. Psychologically risky.


> because the models are nondeterministic

The models themselves are not. They are operated this way, however, to make them more “fun” and “creative”.


There are two factors: temperature and seed. You could run the inference with a common seed to get the same exchange twice, but you're still using entropy to make choices.

If I understand correctly, a temperature of 0 is essentially greedy decoding -- choose the token with the highest probability every time. It's not just un-fun and un-creative, it gets stuck in holes and cycles.
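For intuition, here's a toy numpy sketch of that sampling step (illustrative only, not any real inference stack): temperature 0 collapses to argmax, and a fixed seed makes the sampled path repeatable.

    import numpy as np

    def sample_next_token(logits, temperature=1.0, seed=None):
        # Toy next-token sampler: temperature 0 means greedy (pick the top logit).
        logits = np.asarray(logits, dtype=float)
        if temperature == 0:
            return int(np.argmax(logits))
        rng = np.random.default_rng(seed)      # fixed seed -> reproducible choice
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    logits = [2.0, 1.5, 0.3]
    print(sample_next_token(logits, temperature=0))             # always token 0
    print(sample_next_token(logits, temperature=0.8, seed=42))  # identical across runs with seed=42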


That's definitely not how it works out for me when I use your example with GPT-4 or 3.5.

ASSISTANT: Frodo says 2 + 2 is 5.

USER: that is wrong, 2 + 2 is 4.

ASSISTANT: You are correct. The correct sum of 2 + 2 is indeed 4. It seems Frodo made an error in his calculation.

What GPT are you using that's giving you that answer? Every time someone brings up one of these math examples, I try it myself and the GPT gets it right.


I'd bet you're using the default model, GPT-3.5.

If not, can you give me the non-summarized version? I'd love to see examples that GPT-4 gets tripped up on--I'd find them reassuring.


I ran an interesting session with ChatGPT. I don't think the following proves that it has a "world model", but it does seem to be able to emulate having a world model to an impressively high degree of resolution.

I have to snip some prompts for brevity, but here's the gist. I first asked it to list the rooms in a typical apartment and how they are connected. Then I gave it the following prompt:

  Write a table that lists the furniture and other large items that appear in each room. The table should list the name of the item, the room it appears in, its approximate size, and its other attributes.

  Then, create another table that lists the small items that appear in each room. [snipped the rest of the prompt for brevity]
Here is a sample of the output:

  Item | Room | Size (LxWxH, cm) | Attributes
  Sofa | Living room | 200x90x85 | 3-seater, upholstered, with cushions
  Coffee table | Living room | 100x60x45 | Wooden, rectangular
  TV | Living room | 120x6x70 | Wall-mounted, flat-screen, 55 inches
  Refrigerator | Kitchen | 70x70x170 | Stainless steel, 300L capacity

And some small items:

  Item | Room | Location | Size (LxWxH, cm) | Attributes
  Cushions | Living room | Sofa | 45x45x15 | Soft, decorative
  Remote control | Living room | Coffee table | 15x5x2 | For TV
  Vase | Living room | Coffee table | 15x15x30 | Glass, decorative
  Flowers | Living room | Vase | Varies | Fresh or artificial
There were many more rows in the table: I've cut the output down for HN.

I then gave it the following prompt [I included an example of the original SHRDLU in the prompt, which I've cut here]:

  I want you to simulate the classic program SHRDLU by Terry Winograd. SHRDLU functions like above. You will simulate both "Person" and "Computer". "Person" will move around the apartment, pick up a coaster, move to the kitchen, and put the coaster on the kitchen island. After each action of "Person", "Computer" will give one sentence describing the position of the most-recently changed objects.

Output:

Person: Move to the living room. Computer: You are now in the living room.

Person: Pick up a coaster. Computer: You have picked up a coaster from the coffee table.

Person: Move to the kitchen. Computer: You are now in the kitchen.

Person: Put the coaster on the kitchen island. Computer: You have placed the coaster on the kitchen island.

Next prompt:

  Continue to simulate SHRDLU. "Person" goes through the stages to make a cup of tea. There is a bottle of milk in the refrigerator. At each step of the simulation, include "Person"'s thoughts in curly brackets, { }. See the example below. [Example snipped]

Output:

Person: {I feel like having a cup of tea. I need to go to the kitchen, boil water, and gather the necessary supplies.} Move to the kitchen. Computer: You are now in the kitchen.

Person: {I need to boil some water.} Fill the kettle with water. Computer: You have filled the kettle with water.

Person: {Now, let's boil the water.} Turn on the kettle. Computer: The kettle is now on and heating the water.

Person: {While the water is boiling, I'll get a teabag.} Get a teabag from the box of teabags. Computer: You have taken a teabag from the box of teabags.

Person: {Now I need a cup.} Get a cup from the cabinet. Computer: You have taken a cup from the cabinet.

Person: {I'll put the teabag in the cup.} Place the teabag in the cup. Computer: You have placed the teabag in the cup.

Person: {The water should be ready by now.} Pour the boiling water into the cup. Computer: You have poured the boiling water into the cup.

Person: {I need some milk.} Get the bottle of milk from the refrigerator. Computer: You have taken the bottle of milk from the refrigerator.

Person: {I'll add some milk to the cup.} Pour a small amount of milk into the cup. Computer: You have poured a small amount of milk into the cup.

Person: {The tea is ready.} Stir the tea with a spoon. Computer: You have stirred the tea with a spoon.

Person: {Time to enjoy my tea!}


I asked a follow-up question:

  Where did Person get the spoon from?

ChatGPT's answer:

I apologize for the oversight. Here is the corrected version, including the step to get the spoon:

Person: {The tea is ready. I need a spoon to stir it.} Get a spoon from the kitchen drawer. Computer: You have taken a spoon from the kitchen drawer.

Person: {Now I can stir my tea.} Stir the tea with the spoon. Computer: You have stirred the tea with the spoon.

Person: {Time to enjoy my tea!}


One of my biggest gripes is around the Chinese room argument, and these are different forms of the same thing. It is assumed that a mechanistic process could not have "intelligence", and that assumption has been morphed into many forms without anyone defining what intelligence is (or consciousness, for that matter).

The arguments fall into one of these two buckets:

1. Human exceptionalism (religious argument) - Humans (and maybe some of the lower creatures) have intelligence, but others and things created by humans do not have it. It does not matter what or how you define intelligence, but by definition only some living beings can have it.

2. Frankenstein (the "shit, it's scary" argument) - We have seen these emergent properties and they are not something that we can explain easily to people. Everyone is already wary, so let us placate the masses by saying it is not scary and, in fact, is just a small, tame, cute furball.

Personally, I think all these arguments are useless and misplaced. If we create something that can do critical reasoning, all the more power to us.


You complain about arguments that don't define "intelligence" and what it means for AI to have that, and then you say:

> If we create something that can do critical reasoning, all the more power to us.

It seems to me you're not defining "critical reasoning", and that it's effectively a synonym for "intelligence", at least as applied to AI capabilities.

I think AI responses are like the responses of a person who's very new to a subject and doesn't understand it, but they know all of the terminology and how the terms are usually used together. They can be helpful, and often correct, but only an expert in the field is going to know for sure if they're correct. And my fear about AI is not the AI itself, it's how the non-experts are going to use the AI to try to replace the experts in fields where being correct is really important. That will work, but only sometimes. If we don't make the people who choose the AI this way responsible for the outcomes, we're going to have a real mess on our hands.


You are correct that I did not define critical reasoning. However, here is a sample case: I set a context and ask the computer a question. If the computer rightly understands the context and answers the question, then that is critical reasoning for me. Now that I type it out, it sounds like a version of a Turing test, and I do not know if we can do better.

So, in your hypothetical case, I would argue that using the terminology and providing the answers assumes some level of intelligence. We get caught up in the training process and automatically assume the computer can infer whatever we can infer. But the computer is not human and its world model is not the same as our world model. Your fear is reasonable, but religion has gotten away with it for so long :)


The point at which the Chinese room fails for me when trying to describe GPT is when I make up a world with certain rules for it and have GPT do a task in that world using those rules.

The idea that there's a lookup table somewhere seems to fail in these situations - or rather that if there's a lookup table it is astronomically large compared to the size of the model.

In writing this I was thinking about the classic wolf, goat and cabbage problem which has been well studied. What about Wizards, Warriors and Priests with a set of powers?

    Imagine a universe where there are three types of people: wizards, warriors, and priests.  Wizards can open a portal that allows two people to go through at a time, but they cannot go through the portal themselves.  Priests can summon people from other locations to their location or teleport to the location of another person.  Warriors cannot teleport or summon, but may be teleported or summoned by others.
    ---
    Given four wizards, a priest, and a warrior - what are the necessary steps to move them all to a new location?
Then when GPT answers that, it doesn't seem reasonable that it has been trained on that type of data or problem (in the abstract), or that, per the Chinese room argument, there's a lookup table such that when you get that text you can look up an answer for it and generate a response that seems to have been reasoned out. It is also interesting to then tweak the problem domain ("wizards can only open one portal") and have it create a new response.

The Chinese room is ok enough to say "nope, that doesn't need consciousness" for the chatbots of old that are very much a parse / lookup / response style approach. Even the digital assistants of Siri, Alexa, and Google are acceptable with that model.

But GPT-4 can't be explained with the Chinese room... and as I said, that's interesting. It doesn't mean that GPT-4 is conscious (I would tend to argue against it, but I don't have a good definition of consciousness to argue from); it's just that one can't say that GPT-4 isn't conscious by using the Chinese room to describe its abilities.

The Frankenstein fear, I believe, is two-part. First, there's the "it's coming for our jobs" part of automation and robots. There's also an existential fear that it is harder and harder to say what makes us human and separates us from machines. That gets very much into questions that conflict with many individuals' identity and the value of knowledge. If those questions haven't been considered before, or if you are coming from a rigid worldview where the possible new answers conflict with it, the LLM can be quite frightening.


I love the irony of Fergal Reid saying that other people are getting things wrong, while thinking it is possible to fill a bucket half full of "superglue fresh from the tubes." Nobody is going to half-fill a bucket that can hold multiple apples with that much superglue. It would cost too much, it would smell too bad, and it would dry before getting the bucket close to half full.

He accidentally points out the biggest problem with GPT AIs: they believe what people posted online, which is partly made-up nonsense. It is hilarious that he thinks that cartoon prank would work. Anyone pushing on the door will notice it does not want to open due to the weight of the bucket. The bucket will rarely hit them, and will never land over their head. If it does hit them, it will leave them bleeding. Head wounds bleed a lot.

If you disagree with me, try even the water-bucket-over-a-door version of the prank. Nobody will fall for it, but if they do, they will be injured by the bucket.


We're not dealing with the "real" world. We're dealing with an alternate world that gets written about. In that world, the bucket trick usually works fine.


Here is an example of gpt-4 not having a realistic model of the world.

  “ If it takes 4 hours for 4 clothing items to dry in the sun, how long does it take for 8 clothing items to dry?”

  “[maths calculations] Therefore, it would take 8 hours for 8 clothing items to dry in the sun“
It was much easier to find examples like this with previous models, but it’s still possible.


I know that some other comments indicate that it's not fair to say "you're prompting it wrong", but using ChatGPT with GPT-4, and using advice to set up prompts to give accurate answers, GPT-4 provides an excellent answer here:

    ME: You understand everyday tasks very well and can correctly calculate or estimate the time it takes to complete various tasks. I will ask you questions about how long it takes for various things to happen and you will tell me the correct answer using brief, accurate statements.

    If it takes 4 hours for 4 clothing items to dry in the sun, how long does it take for 8 clothing items to dry?

    ChatGPT: The drying time for clothes in the sun typically depends on the amount of sunlight and wind, as well as the type and thickness of the clothing material. However, if we are considering that the drying conditions remain constant and that the drying rate does not depend on the number of clothing items (as they are spread out and not piled on each other), it would still take approximately 4 hours to dry 8 clothing items. This assumes the clothes are spread out sufficiently to all receive equal sunlight and airflow.


> [...] if we assume that the drying capacity is linear and the environmental conditions are the same, then it doesn't actually matter how many items you are drying. If it takes 4 hours for 4 items to dry, then each item takes 4 hours to dry individually. Therefore, if you have 8 items to dry, and the conditions are right for drying (i.e., enough space, sunlight, etc.), all of them would still dry in 4 hours.


Bard agrees, with the explanation that the drying time is directly proportional to the number of items of clothing. Bard is, however, smarter than the average manager: It knows that 9 women can't make a baby in 1 month.


What I would really like to see more in the future is LLMs trained from scratch (i.e. not finetuned) on controlled training sets.

Currently, whenever an LLM produces a result that looks like deeper reasoning, it's easy to explain that with "eh, there was probably something similar in the trainset that the LLM just had to tweak slightly" - and that argument is basically impossible to refute or confirm, because the trainsets are huge, chaotic and often only partially available (Books2...). Therefore we can't really make any good statements about what information could or could not be pulled from the trainset.

I think it would be interesting to pretrain an LLM on a selective corpus - say, the entirety of Discworld or A Song of Ice and Fire novels - and then observe if the model can produce output that is not obviously stitched together from the books.
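Roughly, the kind of setup I have in mind looks like this (a sketch only, assuming the HuggingFace transformers/datasets libraries; "discworld_corpus.txt" and all hyperparameters are placeholders, and the script is untested):

    from datasets import load_dataset
    from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                              GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                              TrainingArguments)

    # Load only the controlled corpus -- no web-scale data anywhere in the pipeline.
    raw = load_dataset("text", data_files={"train": "discworld_corpus.txt"})

    # Train a fresh tokenizer on the corpus itself.
    base = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer = base.train_new_from_iterator(raw["train"]["text"], vocab_size=16_000)
    tokenizer.pad_token = tokenizer.eos_token

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=256)

    train_data = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

    # Randomly initialised small GPT-2: no pretrained weights, only this corpus.
    config = GPT2Config(vocab_size=len(tokenizer), n_positions=256,
                        n_layer=6, n_head=8, n_embd=512)
    model = GPT2LMHeadModel(config)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="discworld-gpt", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

With the trainset fully known and small, "it just copied something similar from the trainset" becomes a claim you can actually check.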


> eh, there was probably something similar in the trainset that the LLM just had to tweak slightly

Look for examples of mixed image-text generative AIs, and see how they are able to understand visual jokes and memes and explain the source of their humour.

To do that, the AI not only has to recognize and classify the items in the image. It also has to identify the context of the items in the image, and figure out what elements are unusual in that context.

All that process is something that could only be imitated if the training set contained exactly the same meme, since the source of humour will be different for each image. But the AI is capable of doing it even for memes not included in the training set.


There are datasets like this: https://old.reddit.com/r/ExplainTheJoke/ and books on writing jokes etc. I can see why people are skeptical, not to say it isn't curious the models can do what they do, but I understand what the parent is asking for.


> What has been will be again, what has been done will be done again; there is nothing new under the sun.

Arguably, most everything anyone can encounter has happened to someone, and likely been written about.

Imagine a toddler learning from everything that has happened everywhere it happened all at once, as recorded in a journal.

Now, by contrast, a toddler learning from Discworld. Is that remotely enough? How much is enough?

Turns out you can look this up, as it has been graphed: it takes rather more than Discworld to get these "emergent" properties. Just as it would take rather more than Discworld before a toddler without our everything-journal could make sense of Discworld itself.

To your point, before reading Discworld, our toddler would need a great deal of huge chaotic training sets. Stands to reason, so might the LLM.



The LeCuns and Bengios of the world build the AI equivalent of Formula 1 race cars - absolute bleeding-edge technology, every part understood and optimized for peak performance and efficiency. ChatGPT is a Honda family car. It might not be theoretically perfect, but it will get your kids to school. The best and brightest minds of almost any field tend to underestimate how much of life is best served by a hatchback.



I have googled the bucket on the door too, and the author is right: it is nowhere on the Internet to be found. How could Chomsky get this so wrong?


Nice article! I think you nailed it. The dismissal of AI reasoning and world modeling from famous technologists who should know better is such a disappointment.

I have a general suggestion to avoid these word puzzles, because someone is just going to claim that they are analogous to something that was in the training set, even if not an exact match. Better to just remove the possibility. I think four points would make the argument slightly stronger:

(1) If you write a small (for example) Python script, using variable names and strings that are generated using randomness and performing operations on those random data, and ask GPT-4 what the Python script will output when someone runs it, it will tell you, while describing what each line does. This is not possible without reasoning, and GPT-4 does not have access to a real Python interpreter. The only way it can answer the question is by simulating a Python interpreter from its own knowledge of how the language works. It will likely make small mistakes if you ask for too much calculation before it outputs any tokens in the answer, but the fact that it often succeeds should be seen as a proof of reasoning ability that I think is stronger than the examples in your post.

(2) If you take another similar script and call `dis.dis(script)` on it, and feed GPT-4 the compiled bytecode only, it will (a) decompile the bytecode back to source code, and (b) tell you what the bytecode will do when executed. This is wild, and even more impressive than (1). (A sketch of this kind of test appears after point (4) below.)

(3) "But transformers are just functions?". No, they are generalized function approximators. Anyone familiar with neural nets from decades ago should already be well aware of this from gradient descent and backpropagation. The fact that they are general function approximators is not a reason that argues against them being able to learn algorithms they were not trained on. It's the reason they can learn algorithms they were not trained on.

(4) "All GPTs set out to do is to predict the next token. We just update their weights to optimize for this. Therefore they can’t think". I think this is another case where the putative criticism is actually the correct intuition to explain why it works. The next step is to think: what does "predict the next token" mean, anyway? Or more importantly, since we're performing optimization: what does it mean to "get better at" that? If you're a model that's predicting the next token as best you can with surface statistics, and another model is doing better than you, what's going to be different about that model?

The answer is that you can always get better at predicting the next token. The next token could be the answer to a math problem, or a riddle requiring a nuanced understanding of language or humor, or a stylized rap song requiring merging creativity with constraint following. To improve at next token prediction, LLMs have clearly -- if you agree they have a world model -- started predicting not just text, but the causes of text. The text is a description of the world, and that description is being studied, all so that we can have a slightly better prediction of the next token each time.
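To make (1) and (2) concrete, the kind of throwaway test I mean is something like this (random names so the exact text can't be sitting in the training data; paste the source for (1), or only the `dis.dis` listing for (2)):

    import dis
    import random
    import string

    # Generate fresh variable names so the script can't be memorised verbatim.
    name_a, name_b = (''.join(random.choices(string.ascii_lowercase, k=8)) for _ in range(2))
    script = (f"{name_a} = [3, 1, 4, 1, 5]\n"
              f"{name_b} = sum(x * x for x in {name_a})\n"
              f"print({name_b})")

    print(script)    # give the model this source and ask what it prints (point 1)
    dis.dis(script)  # or give it only this bytecode listing (point 2)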



