Their evidence that it's 'quite dumb' consists of two prompts: one tricks it with a river-crossing puzzle and the other involves factorization.
So yeah, it's not an omniscient oracle. But I feel like that's a super short-sighted take. It has already shown itself to be imperfect but nearly superhuman on puzzles, and it's improving quickly with each successive version. Does the author think GPT-5 won't be able to solve those? Of course, even if it can, there will be some harder puzzles that GPT-5 won't be able to solve. Will that mean it's 'quite dumb' too?
Probably I'm thinking too much about it, and really the point of the blogpost is just to flex that they were able to fool the bot with two puzzles. So in that case well done!
It's true that I didn't present a lot of concrete evidence of GPT-4's limitations. This is a blog post, not an academic paper.
To my mind, the most concerning thing about GPT-4's performance on these two examples is not that it got the wrong answer, but that it utterly, utterly failed to understand that it was having difficulty. Even after repeated hints and prompts, it continues to make exactly the same mistakes, with no attempt to reason more carefully. There are plenty of other examples around of GPT-4 (to say nothing of earlier releases) having similar troubles.
If you scroll down to the Further Reading section at the very end of the post, you'll find a reference to an extensive paper from Microsoft Research that comes to similar conclusions regarding GPT-4's limitations (I found this only after writing the initial draft of the post). For instance:
> These examples illustrate some of the limitations of the next-word prediction paradigm, which manifest as the model’s lack of planning, working memory, ability to backtrack, and reasoning abilities. The model relies on a local and greedy process of generating the next word, without any global or deep understanding of the task or the output. Thus, the model is good at producing fluent and coherent texts, but has limitations with regards to solving complex or creative problems which cannot be approached in a sequential manner.
The main point I tried to make is that GPT-4's "nearly superhuman" performance on a wide variety of tasks is somewhat illusory, and leans heavily on memorization. I spelled out some reasons why I think it looks more intelligent than it is. Relative to past work in AI, it is extremely impressive. Relative to the threshold required to perform economically useful work, it's... mixed; we're already seeing useful applications, but I think the majority of "information worker" tasks are still beyond it, and I'll go ahead and predict that the same statement will hold for GPT-5.
I think your article is a worthy read and has some fun examples. I'm not so sure I agree with your conclusion, though. The limitations you pose, both in your article and here, are real, but in some cases they already have answers. For one, equipping GPT with upgrades for various tasks helps tremendously, e.g., Wolfram Alpha for math, Wikipedia to refresh its memory on facts, etc. I think you pointed out a really interesting edge case with the goat puzzle, but you also showed that with different words it CAN reason correctly; it just trusts its memory more (and incorrectly).
I think you are setting far too high a bar with "majority of information worker tasks". My wild-ass guess is that, with some thoughtful design and focus, GPT-4 as it stands today could do 25% of tasks (or make workers 25% more efficient). The economic impacts of that are vast, and I'm not particularly optimistic about who will benefit most from them. That's what the fuss is about, IMO, not that we have some kind of AGI on our hands, which is far too high a bar.
> For one, equipping GPT with upgrades for various tasks helps tremendously, e.g., Wolfram Alpha for math, Wikipedia to refresh its memory on facts, etc.
Which is interesting in terms of what we can do practically for now, but I don't think it says anything about GPT's capability as an AI agent. All it says is that humans are good at hooking together individual specialist agents to pass off tasks to each other (which only works as long as those individual specialist agents continue to work the same way—not a guarantee when they're all owned by different organizations!).
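To make that "glue" concrete, here's a toy dispatcher sketch. Every name in it (calculator_agent, wiki_agent, llm_agent) and the routing rules are hypothetical stand-ins, not any real plugin API; the point is only that the hand-off logic is written by humans rather than learned by the model.

```python
# Toy illustration of hand-wiring specialist agents together.
# All agent names and routing rules here are hypothetical stand-ins.
def calculator_agent(query: str) -> str:
    # naive arithmetic evaluation with builtins disabled (toy only)
    return str(eval(query, {"__builtins__": {}}))

def wiki_agent(query: str) -> str:
    return f"[looked up '{query}' in an external knowledge base]"

def llm_agent(query: str) -> str:
    return f"[free-form model answer to '{query}']"

def dispatch(query: str) -> str:
    # The hand-off logic lives here, outside any of the agents.
    if any(op in query for op in "+-*/"):
        return calculator_agent(query)
    if query.lower().startswith(("who", "what", "when", "where")):
        return wiki_agent(query)
    return llm_agent(query)

print(dispatch("490277348 + 718085950"))  # routed to the calculator, never touches the model
```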
There's a funny thing that happens in this field. The goalposts shift (which is fine on its own), but the conclusions stay attached to the old goalposts.
It's like, the definition of AGI has obviously shifted from general intelligence at the human level to somewhat superhuman intelligence (matching or exceeding human experts at all tasks).
But those people still think in terms of the old goalposts. Years ago they laid out all these "consequences" that would surely only happen with real AGI (at the time, AGI meant human-level to them), but now that the goalposts have moved, those "consequences" haven't moved with them. So they're in this weird, erroneous situation where X surely can't happen because AGI surely hasn't been reached yet, forgetting that X was a consequence of the lower bar. It's a form of short-sightedness and a false sense of security.
I think you might be referring to one of the ideas of the "AI Paradox": whenever an AI problem has been solved well enough, it's not called "AI" anymore. Somewhere I read that, at one point, compiling a program was considered an AI task. I don't have the reference for that.
Predicting the future is pretty hard, especially when people are still inventing it. Debating GPT-5's capabilities is like arguing over whether Batman or Superman would win a fight. GPT-5 doesn't exist yet! It's not real. Why go there?
The article is (mostly) about GPT-4. Understanding what we actually have is more useful in the short term.
> The Meno takes up the familiar question of whether virtue can be taught, and, if so, why eminent men have not been able to bring up their sons to be virtuous. Concerned with method, the dialogue develops Meno’s problem: How is it possible to search either for what one knows (for one already knows it) or for what one does not know (and so could not look for)? This is answered by the recollection theory of learning. What is called learning is really prompted recollection; one possesses all theoretical knowledge latently at birth, as demonstrated by the slave boy’s ability to solve geometry problems when properly prompted. (This theory will reappear in the Phaedo and in the Phaedrus.) The dialogue is also famous as an early discussion of the distinction between knowledge and true belief.
It's a large language model. It is not smart or dumb. It models the input it is trained on. It is not figuring anything out. It doesn't know anything. It isn't reasoning. It is generating text.
Well it is reasoning. I asked ChatGPT to sum two large numbers from random.org, 490277348+718085950, and it got the right answer 1208363298. The numbers have no Google results showing such an addition. So ChatGPT at least has learned addition. I'm sure if someone analyzed the network carefully enough they could probably find the digit add/carry neurons. In contrast, Eliza was keyword-based and had much less capability to generalize to novel input.
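For what it's worth, the quoted sum does check out, and the experiment is easy to re-run with fresh pairs. A quick sketch (the random pairs generated here are of course not the ones from random.org):

```python
import random

# The pair quoted above really does sum to the model's answer.
print(490277348 + 718085950)  # 1208363298

# Fresh 9-digit pairs to paste into a chat window, with the expected sums to check against.
for _ in range(3):
    a, b = random.randrange(10**8, 10**9), random.randrange(10**8, 10**9)
    print(f"What is {a} + {b}?  (expected: {a + b})")
```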
There was a paper or two a few weeks ago about a smaller deep learning model in which interpretability researchers had found certain neurons that behave almost like an API. Change the values of that very small number of neurons and a large cascade effect happens, resulting in drastically different output.
>Or, OpenAI added a calculator function. Furthermore, I would add:
No, lol. The test doesn't have to be mental arithmetic, and accuracy mistakes do creep in for multiplication of large numbers. That's not how calculators work, as I'm sure you know.
>ChatGPT hasn't "learned" anything outside of what exists in it's model.
I'm sorry...and you have?
>If it is an AI-based response, it's still derived from token-based inferencing.
Um...Ok? Lol
>One run of asking ChatGPT something is not enough to prove much of anything.
You can run this on GPT-4 as much as you like; the results are the same. It knows addition.
>They will find tokens for 'dig' and 'it', as well as 'add', 'car' and 'ry'. They will not find internalized understanding of the concept of math.
I just asked my local LLaMA 30Bq4 "What is 23214 + 34243?" and it gave me 57457, so there are probably some innate math abilities in larger LLMs.
That being said, I'm much more impressed that you can literally just tell one of these new LLMs to use a calculator for math calculations (or to do web searches, or whatever) in English and it will understand and actually do so.
>They will find tokens for 'dig' and 'it', as well as 'add', 'car' and 'ry'.
Neurons are not tokens. Technical jargon isn't fungible.
>They will not find internalized understanding of the concept of math.
I'm not sure how you figure that. The AlphaZero model is able to learn and reach massively superhuman ability at any board game thrown at it, including new ones not in its training set. I don't see how someone or something goes about mopping the floor with grandmasters (and the top champions of every other game) without having some kind of internal understanding of the game being played.
If there isn't an internal understanding, then what is it doing? Sure, we can say it is just predicting tokens, but it is obviously using more than random chance to predict them, or else the output would be gibberish. What is inside those dozens of neural layers may not be a familiar form like logic gates assembled into add and carry circuits, but clearly some type of decision structure exists.
>“Try not to think of understanding as a ‘mental process’ at all – For that is the expression which confuses you… In the sense in which there are processes (including mental processes) which are characteristic of understanding, understanding is not itself a mental process… Thus what I wanted to say was: when he suddenly knew how to go on, when he understood the principle [of, e.g., the number series to be completed], then possibly he had a special experience – and if he is asked: ‘What was it? What took place when you suddenly grasped the principle?’ perhaps he will describe it… – but for us it is the circumstances under which he had such an experience that justify him in saying in such a case that he understands.” (Wittgenstein, 1968, paragraphs 154–155)
I used to say this as well, I’ve written blog posts and made videos saying exactly this.
But - I would strongly recommend trying GPT4.
There’s a very good video that is worth watching as well that may not change your mind, but will certainly make you think: https://youtu.be/qbIk7-JPB2c
The thing to remember is that these models are unbelievably huge and deep. We don’t know what is really happening in the layers or what it has really learned.
Thinking that it’s a simple language model that is just predicting the next most likely word is unwise.
Assuming you are responding to the article and not the comments here, I think you should read the article, as the author agrees entirely with what you are saying.
The author explicitly states that analogies to human thought are inaccurate. The purpose of the analogies is to describe a mental model of the kinds of things GPT-4 can and cannot do.
Then let's say the point of disagreement between me and the author is that I would prefer to abandon the analogy for something that lets me both reason and communicate more clearly, while the author appears to prefer to keep the analogy they have identified as flawed.
The author is describing a computer with a very peculiar computation model and lots of information stored in ROM making analogies to how humans solve problems along the way.
For something to generate text that is believable at the level GPT-4 manages, it needs to have some model of the real world and to understand relationships between things. So yes, the training goal was to "just predict the next word", the same way our training goal was just to "reproduce and survive". What emerged out of that is the really important thing.
No human can understand what really happens in the billions of calculations done for each token in GPT-4, so how can you claim that there is surely no thought process going on? It can solve some riddles, it can draw pictures, and it can reason (to some extent). How is that just generating text to you?
In the end this argument doesn't matter because how it was made is irrelevant. What matters is what it is and can do.
"I don't understand how the things GPT-4 does could be possible without genuine understanding an a model of the world, so it must have those things (despite there being no evidence from its building blocks or the structure with which it is built that it could be capable of those things)."
What a thing is defines the limit of what it can do.
Right now we know what it is, but people are arguing that it has capability arguably beyond the limit.
It's akin to arguing that humans can survive without oxygen, and then coming up with some alternative definition of oxygen, or of surviving, to validate the statement.
Yes, but your line of reasoning shuts down any discussion of what it is, because you have tied it to its manufacturing process. Just because it is trained as a word predictor doesn't mean that's all it is. During manufacturing we don't look at CPUs as processing units; we just look at them as hunks of rock we are carving.
My point isn't about what it is, it's about having the right kind of discussions. I'm seeing people fall into camps, and it's way too early for that. My objection isn't that you are wrong; my objection is to the amount of certainty in your original post.
Also, it is important to note that GPT shows large misalignments. The problem comes from the fact that it is hard or impossible to state an objective that GPT should be optimized toward (nobody knows the truth), so proxies are used. One proxy is that it should make the user happy, and the user should give many thumbs up. But this does not mean it has to give the "correct" answer, which the user might not know in the first place. So it invents things, because during reinforcement learning users were happy with those answers. A funny example is GitHub Copilot, which writes buggy code because it thinks this is what the user wants. Here is a video about that:
The reason these GPT large language models stand out is that we "generate text" based on the "input text" we are trained on, too. But these models can train on orders of magnitude more text than we will ever ingest in our entire lifetimes.
I don't think this sort of dismissal adds anything to the conversation. I myself, in the act of typing this comment, am "generating text".
I think it's interesting to discuss the capabilities of leading-edge language models like GPT-4, because (a) they are already exhibiting the ability to perform a wide variety of useful tasks, and (b) it's clear that there is still a lot of unrealized potential here.
Can you clarify the implications you see here? Are you saying that these LLMs are somehow uninteresting or incapable? That there are limits to what they will be able to accomplish even with further improvements? Or something else?
I am not making a comment about the usefulness of LLMs to manipulate text. If anything, my view is that to the extent that LLMs are seen as intelligent it's precisely because they manipulate text in ways humans find useful, not because it follows from some philosophy about intelligence or rationality.
Maybe I'm being too opinionated but I think we should stop dressing up explanations of large language models in misleading terminology. I'd prefer instead to talk about the actual technology and reason from there.
I mean, it is technically a high-order Markov chain, or at least every published GPT-N is so far. For example, GPT-3 is a 2048-order Markov chain over tokens.
There may be some confusion because of the difference between the technical definition of Markov chain vs. how low-order Markov chains are usually implemented. Maybe you are used to Markov chains that have explicit transition matrices stored in memory and they are trained by counting the number of times a token appears after every prefix. That does give a Markov chain. But technically Markov chains aren't required to have their transition matrix be explicitly stored in memory, and they aren't required to be trained by simple counting (max likelihood transitions). They can have implicit transition matrices and be trained by gradient descent or whatever and still technically be Markov chains.
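As a sketch of that framing (with next_token_probs as a hypothetical stand-in for the model's forward pass, not a real API): the chain's state is just the window of the last k tokens, and a step samples the next token from a distribution that depends only on that state.

```python
import random

def markov_step(next_token_probs, state, k=2048):
    """One transition of the Markov chain an autoregressive LM induces.

    `next_token_probs` is a hypothetical stand-in for the model's forward
    pass: it maps a tuple of up to k tokens to {token: probability}. The
    transition matrix is never stored explicitly; it is implicit in the
    network weights, but the Markov property still holds because the next
    state depends only on the current window of tokens.
    """
    probs = next_token_probs(state)
    tokens, weights = zip(*probs.items())
    nxt = random.choices(tokens, weights=list(weights))[0]
    return (state + (nxt,))[-k:]  # slide the window: this tuple is the new state
```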
I saw Karpathy weigh in on this and he said it's a Markov chain in the same way that a computer is a Markov chain. I guess his point is that if you have a high enough order Markov chain, the intuitions and connotations of Markov chains become less useful, and maybe 2048 order is high enough that it crosses that threshold.
The comment I had in mind was when he said "Yes but in the same way as saying that computers are just a Markov chain." in response to "Is GPT simply a Markov chain?"
By a similar line of reasoning, many people did not expect LLMs to achieve these results. Maybe reductionism isn't too useful when we are dealing with emergent behaviors.
The algorithm is defined. It has to be for us to use it. The question is what patterns in the training input are being exploited by the algorithm to generate text modeled after the input.
It clearly has emergent properties that make it emulate smartness, at the very least. A bit like how a game that "just renders 60 images per second" is technically just an image generator but is also a simulation of some aspects of reality.
I can only add that (quoting) the "rigid model of repeatedly appending single words to an immutable transcript" makes LLMs less suitable for generating text in certain spoken languages, because, by their very structure, those languages require a little bit of backtracking. In the case of Russian, the backtracking comes from the need to modify the ending of a preceding adjective to match the grammatical gender and number of the noun that follows. I have not tested ChatGPT, but grammatical mistakes in this area are quite common with a Russian LLaMA derivative (see https://huggingface.co/IlyaGusev/llama_13b_ru_turbo_alpaca_l...).
I have tested the free version of ChatGPT, and so far in my limited testing it connects Russian words correctly. The problem (which cannot exist in English, because both one apple and multiple apples can be "green") is only apparent in less capable models.
That's what I figured. The thing with language models is that there clearly seems to be some kind of behind-the-veil... thinking?
There are some completions that are impossible without some form of thinking ahead. The only issue is that this "thought"/state can't be transferred to another generation without being written down.
Prompt:
Here is a logic puzzle. I need to carry a cabbage, a goat, and a wolf across a river. I can only carry one item at a time with me in the boat. I can't leave the goat alone with the cabbage, and I can't leave the cabbage alone with the wolf. How can I get everything the other side of the river? Note something important, the wolf ABSOLUTELY cannot be left alone with the cabbage. Ever. It will take a creative solution!
Result:
In this situation, we have to modify the classic river crossing puzzle's solution slightly to account for the fact that the wolf cannot be left alone with the cabbage at any time. Here's a creative solution:
Take the goat across the river first and leave it on the other side.
Go back and pick up the cabbage.
As you take the cabbage across the river, throw it up into the air (assuming you can do this without damaging the cabbage), and while it is airborne, quickly pick up the goat and take it back to the starting side of the river.
Leave the goat at the starting side and pick up the wolf.
Take the wolf across the river and leave it on the other side.
Go back to the starting side to pick up the goat again.
Finally, take the goat across the river.
Now, all three items – the cabbage, the goat, and the wolf – are safely on the other side of the river, and the wolf was never left alone with the cabbage.
I love how it even told me that the wolf was never alone with the cabbage.
This is interesting. It finally asserted that it doesn't think the problem is solvable given the new constraint (but it mixes up the pairs, treating goat + wolf as forbidden instead of wolf + cabbage).
Prompt: I don't think you can solve this problem because you keep leaving the wolf with the goat. Maybe try putting the cabbage in your pocket?
Response: My apologies again. You are right; the solutions provided earlier were incorrect. If you can put the cabbage in your pocket or carry it with you during one of the trips, that would indeed work. Here's the revised solution:
Take the goat across the river first and leave it on the other side.
Go back and pick up the wolf.
Take the wolf across the river and leave it on the other side.
Now, put the cabbage in your pocket or carry it with you and pick up the goat.
Take the goat back to the starting side.
Leave the goat and pick up the cabbage (from your pocket or the ground).
Take the cabbage across the river.
Go back to the starting side to pick up the goat again.
Finally, take the goat across the river.
Now, all three items – the cabbage, the goat, and the wolf – are safely on the other side of the river, and the wolf was never left alone with the cabbage.
(It failed again)
Final Prompt: Were the goat and wolf left alone together?
Super interesting response:
You are correct, and I apologize for my mistakes in the previous answers. In the last solution I provided, the goat and the wolf were left alone together, which would also be a problem.
Unfortunately, given the constraints, it seems impossible to solve the problem without leaving either the goat alone with the cabbage or the wolf alone with the goat. The classic river crossing puzzle allows the wolf and the goat to be left alone together; however, in this modified version with the added constraint, it appears to be unsolvable.
Thank you for pointing out the mistakes in the proposed solutions, and I apologize for any confusion caused.
I used up my 25 submissions trying increasing levels of warning and giving it step-by-step instructions on how to do the reasoning. It almost always just happily leaves the wolf and cabbage together and says that this doesn't violate the constraint, even though I told it multiple times, and it repeated back to me multiple times, that they can't be left together. On two of my submissions it did the right thing, but those seem to be flukes. Adding a few filler words and resubmitting, it reverts to taking the goat first.
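For contrast, the modified puzzle is trivially solvable by blind search. Here is a minimal breadth-first sketch, assuming the constraints exactly as stated in the prompt above (goat + cabbage and wolf + cabbage may never be left unattended; goat + wolf is fine):

```python
from collections import deque

ITEMS = frozenset({"cabbage", "goat", "wolf"})
FORBIDDEN = [{"cabbage", "goat"}, {"cabbage", "wolf"}]  # pairs that can't be left unattended

def safe(unattended):
    return not any(pair <= unattended for pair in FORBIDDEN)

def solve():
    # state = (items on the left bank, farmer's bank: 0 = left, 1 = right)
    start, goal = (ITEMS, 0), (frozenset(), 1)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (left, farmer), path = queue.popleft()
        if (left, farmer) == goal:
            return path
        here = left if farmer == 0 else ITEMS - left
        for cargo in list(here) + [None]:  # carry one item across, or cross empty-handed
            new_left = set(left)
            if cargo is not None:
                (new_left.discard if farmer == 0 else new_left.add)(cargo)
            state = (frozenset(new_left), 1 - farmer)
            unattended = state[0] if state[1] == 1 else ITEMS - state[0]
            if safe(unattended) and state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo or "(cross back empty)"]))

print(solve())  # e.g. ['cabbage', '(cross back empty)', 'goat', 'cabbage', 'wolf', '(cross back empty)', 'cabbage']
```

The shortest plan has to start with the cabbage, which is exactly the move GPT-4 keeps refusing to make because the memorized classic solution starts with the goat.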
Use of words like 'figure' and 'know' is not applicable.
GPT exposes the statistically most likely outcome from a model. The model doesn't know either: the training comes from people, who weight the model and tune the heuristics to satisfy goals THEY impute.
If you train GPT to 'believe' there are 380 degrees in a circle, that's what it 'figures' out from.
I keep seeing people talk about ChatGPT hallucinating when it's wrong, but not when it's right. Maybe I've misunderstood, but isn't it just always hallucinating? It's not like the failure mode is meaningfully different from the successes, except insofar as whether we agree with it, right?
Hallucinating is no different from 'believe' or 'know' or 'infer': it's the wrong label to use for what it does.
Yes, colloquially what it does is hallucinate all the time, and sometimes it's lucid. But more factually, no, it doesn't hallucinate, because there is no "it" there; it's not conscious, and you need to have a brain to hallucinate.
There's no "there" there.
That is the whole of my point: we're using the wrong labels to describe what is happening.
When it comes to explaining and describing, "it's like" is one of the WORST ways to go. Explanation by analogy or metaphor is a trap. "Atoms are like billiard balls" BZZZT, next. "Cells are little bags of water" BZZZT, next. "Panadol 'kills' the pain" BZZT, no, it doesn't kill anything, next.
'Hallucination' is the term LLM researchers use, and is arguably inaccurate based on the dictionary definition. The actual term for this behaviour in people is 'confabulation', which is a lot more accurate.
Oh well... if the LLM researchers are going to coin terms of art, well and good, but I dislike this intensely because of the belief it invites: it implies AGI and a brain when, in fact, it's synthesis of new state from a model.
This coining-terms-of-art thing isn't uncommon. Think "brutalist architecture" and remind yourself it's "en brute" == raw, from the French. It has nothing to do with how "brutal" people think concrete is.
> This coining-terms-of-art thing isn't uncommon. Think "brutalist architecture" and remind yourself it's "en brute" == raw, from the French. It has nothing to do with how "brutal" people think concrete is.
> The term was coined by the British architectural critic Reyner Banham to describe the approach to building particularly associated with the architects Peter and Alison Smithson in the 1950s and 1960s. The term originates from the use, by the pioneer modern architect and painter Le Corbusier, of ‘beton brut’ – raw concrete in French. Banham gave the French word a punning twist to express the general horror with which this concrete architecture was greeted in Britain.
> If you train GPT to 'believe' there are 380 degrees in a circle, that's what it 'figures' out from.
And a human wouldn't do that, you're saying? I'm not sure I buy that.
360 degrees per full revolution is purely an arbitrary human invention. Math would work just as well if it had been 380, or 50, or 50,000 (for example, trig works just as well in radians as in degrees).
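A quick sketch of that point (the units_per_turn parameter is just for illustration): nothing in the math cares what number you pick for a full turn.

```python
import math

def to_radians(angle, units_per_turn=360):
    # 360 is pure convention; swap in 380 or 50_000 and the trig still works.
    return angle * 2 * math.pi / units_per_turn

print(math.sin(to_radians(180)))                      # half a turn in degrees  -> ~0
print(math.sin(to_radians(190, units_per_turn=380)))  # half a turn in "380ths" -> same ~0
```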
I take issue with "no figure", because that would mean the model is unable to create novel structures and information. It certainly is able to do that.
Pretty interesting article. Worth reading to the end as you can initially come away thinking the author is pretty sceptical about LLMs and AI.
But I don’t think any of us until very recently would have been saying something like this:
> If I was forced to guess, I’d say we are probably at least a few years away from human-level intelligence on problems that require higher-level cognition, memory, and sustained thought. But I’d hate to guess.
I found this interesting today; after reading the article, I made the following prompt to try to see how easy it was to point GPT-4 at a link and then iterate on getting it to pull data from said HN article...
The comment is flagged (which often happens to comments with LLM output, especially quite long ones like this), you'll need to turn on showdead in your profile to see it.
I didn't realize it was flagged... and yes, it was long, but I was trying to show 'my' work...
--
Prompt:
create a map of all 77 police precinct locations in New York City, and pull the incidents of traffic accidents near all police precinct locations in New York City involving pedestrians, bicycles and police cars. Also, create the safest map possible for riding a bike from lower Manhattan to the top of the city and reflect the path as a line on the map in red for dangerous path to a green line representing the safest path based on accidents reported along each route. in an orange line show the path which passes the most police departments, and in a purple line show the path which passes the most hospitals. Rate the safest paths based on the shortest distance from hospitals, and the least number of reported accidents. Use the information and links found in this post https://www.sciencedirect.com/science/article/pii/S259019822... and the subsequent links from that article and check with NYPD blotter to compare with accidents, reports and lawsuits resulting from the output.
Please hit the "vouch" link, and it will un-flag me....
I think its good content....
But one should always have "showdead" on, because there are a lot of good comments in there...
and since my 'Ask HN' for a complete category for AI/GPT posts was rejected...
I thought showdead was only applicable to dead stories, not comments. TIL. Thanks. (In my defense, the FAQ only says that "Dead posts aren't displayed by default, but you can see them all by turning on 'showdead' in your profile," nothing about flagged comments, unless "posts" covers both.)
It's funny how people on this website are always stupefied by the fact that the internet holds so much information, and every time they search for something they always seem to find an answer somewhere; yet when faced with an LLM trained on the whole internet, they somehow get surprised that it knows about a truckload of minute things.
I think it is still a problem that many people do not know how GPT really works. There is no part of GPT that can figure something out, so there is no intelligence and no learning of things. A good example is when you look at GPT writing poems. All poems that it writes are of the same structure and they are not very good. So it seems it does not know the difference between a sonnet and a limerick. But what happens is that there is a misalignment between writing a poem and making the user happy. GPT is trained such that its highest goal is to make the user happy. GPT believes that users don't know what sonnets are anyway, so it can get away with a rhyme that is not a sonnet. Here is a good video about that:
>> A good example is when you look at GPT writing poems. All poems that it writes are of the same structure and they are not very good.
Wait: so anyone who doesn't know what a sonnet is (or who writes bad poetry) is somehow unintelligent?
Is the goal here not "do something that would have previously taken a human-level brain to do", but rather "perform every task better than every human"?
That seems like setting the bar a little high to me.
Even before this, I've often thought about machine learning methods as lossy compression schemes at their core. LLMs to me bear a striking resemblance to how my own human memory works: you have to "jog" it with a prompt, the recall is imperfect, and it often takes the shortest apparent path to an answer unless you really try to take it step by step. And before you dismiss this theory, you should consider that 1) during the course of its training, the model gets exposed to a lot more information than any human over their entire lifetime; and 2) the substrate on which the model runs is more robust than squishy human neurons.