GPT-4 can't reason (preprints.org)
230 points by BruceEel on Aug 8, 2023 | 376 comments



There are some serious problems with this paper, namely that I just tried to reproduce it and the reproduction failed on every test: I tried several of the problems the paper claims GPT-4 fails on, and it passed every one, every time.

I used the standard chat.openai.com web interface with no special or additional prompting.

It seems there are others in this thread who have had the same issue, which raises the question: what went wrong here? I can't see which version of GPT-4 the paper uses mentioned anywhere - did anyone else spot this?

I've contacted the author and included this thread, so hopefully we get some insight into what's happening here. To clarify, I am not accusing the author of anything and on the contrary I recognize that OpenAI is rather opaque about the models and changes them frequently. That said, the responses from GPT-4 in the paper do not match my personal experience using GPT-4 with reasoning tasks at any point during the last several months, which is why I am curious if the author may have accidentally used GPT-3.5.


It's really poor work.

There are two conclusions I took from scanning through this and trying to reproduce a few of the reported failures.

1. The author is bad at prompting. There are many ways to reduce hallucinations and provoke better thinking paths for the model.

2. The author is using ChatGPT's GPT-4, leading him to conflate "GPT-4" with "ChatGPT". While you can consider this a shared failure with OpenAI, due to OpenAI's poor communication, anybody doing serious work evaluating these models would know that the first thing you need to do is use the API and pin the model version. In the author's case, he should have used gpt-4-0314 or gpt-4-0613. What I suspect is that he just used ChatGPT's GPT-4, and likely the Default model at that. (Nobody should ever use the Default model. It's their most heavily performance optimized model and performs worse on reasoning tasks than the Plugins model, even on within-context-size tasks.)

There are huge problems with that, because OpenAI has continuously done so much fine tuning and performance optimization on the default ChatGPT model over time that its performance has ranged anywhere from "I'm pretty sure this is gpt-3.5" to "whoa, this is damn good" (the latter being mostly the model at launch, which was probably the same as gpt-4-0314).

If the author had been working seriously at evaluating models, specifying the model version would have been the first thing he did. Perhaps he should explain his reasoning.
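For anyone attempting this kind of evaluation, pinning the snapshot and the sampling settings through the API is only a few lines. A minimal sketch using the openai Python package as it worked in mid-2023 (the ChatCompletion interface; later client versions changed the call):

    import openai  # openai==0.27.x era interface

    openai.api_key = "sk-..."  # your API key

    response = openai.ChatCompletion.create(
        model="gpt-4-0613",   # pinned snapshot, not the moving ChatGPT model
        temperature=0,        # reduce sampling variance between runs
        messages=[{
            "role": "user",
            "content": "Suppose I'm in the middle of South Dakota and I'm looking "
                       "straight down towards the center of Texas. "
                       "Is Boston to my left or to my right?",
        }],
    )
    print(response.choices[0].message.content)

Reporting the exact snapshot (and ideally the temperature) is the only way results like the paper's can be checked later.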


> 1. The author is bad at prompting. There are many ways to reduce hallucinations and provoke better thinking paths for the model.

Does "Provoke better thinking paths" mean re-rolling the dice until you find some hack specific to chatGPT that 'just works' or is there something more rigorous behind this?


Seriously, this is the part I don't understand about people parroting "prompt engineering". Isn't it really just throwing random things at a nondeterministic black box and hoping for the best?


I find it's more like that silly experiment where you have to make a sandwich exactly as a kid (or adult) writes the instructions. You _think_ you have a good set of instructions and then you get peanut butter on the outside. So, you revisit the instructions to be clearer about what you want done. That's how I see prompt engineering. In that case, you are simply learning how the model tends to follow instructions and crafting a prompt around that. Not so much random, more purposeful.


That isn’t the model reasoning. That’s you figuring out exactly what parameters you need to use to make the model give the result you want.

It’s Clever Hans on steroids


> That isn’t the model reasoning. That’s you figuring out exactly what parameters you need to use to make the model give the result you want.

If it's to get the model to present a fixed answer, sure.

If it's to get a model to do a better job at solving general classes of problems (such as when what you are optimizing is the built-in prompt in a ReAct/Reflexion implementation, not the prompt for a specific problem), that's, at a minimum, different from Clever Hans, even if it's not “reasoning” (which is ill-defined).


As always, it's unclear on which side of the keyboard the intelligence lies.


Can you come up with a method that will get you a good response every single time? Because if you can't, it's not really engineering.


If someone says they're fine tuning a model (which is changing which layers are activated for a given input) it's generally well tolerated.

If someone says they're tuning a prompt (which is changing which layers are activated for a given input) it's met with extreme skepticism.

At the end of the day ML is probabilistic. You're always throwing random things at a black box and hoping for the best. There are strategies and patterns that work consistently enough (like ReAct) that they carry across many tasks, and there are some that you'll find for your specific task.

And just like any piece of software you define your scope well, test for things within that scope, and monitor for poor outputs.


> If someone says they're fine tuning a model (which is changing which layers are activated for a given input) it's generally well tolerated.

> If someone says they're tuning a prompt (which is changing which layers are activated for a given input) it's met with extreme skepticism.

There are good reasons for that though. The first is the model-owner tuning so that given inputs yield better outputs (in theory for other users too). The second is relying on the user to diagnose and fix the error. Relying on that as the "fix" is a problem if the output is supposed to be useful to people who don't know the answers themselves, or if the model is being touted as "intelligence" with a natural language interface, which is where the scepticism comes in...

I mean, a bugfix, a recommendation not to use the 3rd menu option or a "fork this" button are all valid routes to change the runtime behaviour of a program!

(and yes, I get that the "tuning" might simply be creating the illusion that the model approaches wider usability, and that "fine tuning" might actually have worse side effects. So it's certainly reasonable to argue that when a company defines its models' scope as "advanced reasoning capabilities" the "tuning" might also deserve scepticism, and conversely if it defines its scope more narrowly as something like "code complete" there might be a bit more onus on the user to provide structured, valid inputs)


I'm not sure what this is trying to say.

Neither option implies you own the model or don't: OpenAI owns the model and uses prompt tuning for their website interface, which is why it changes more often than the underlying models themselves. They also let you fine tune their older models, which you don't own.

You also seem to be missing that in this context prompt tuning and fine tuning are both about downstream tasks, where the "user" is not you as the individual who's fine tuning and improving prompts, but the people (plural) who are using the now-improved outputs.


These aren't the contexts that invite the scepticism though (except when the prompt is revealed after blowing up Sydney-style!)

The "NN provided incorrect answer to simple puzzle; experts defend the proposition the model has excellent high-level reasoning ability by arguing user is 'not good at prompting'" context is, which (amid more legitimate gripes about whether the right model is being used) is what is happening in this thread.


ELI5 layers? Could someone like me see when I've used one layer as opposed to another, when using ChatGPT?


Technically I'm taking a large liberty in saying you're "activating layers"; all the layers affect the output, and you don't pick and choose them.

But you can imagine the model like a plinko board: just because the ball passes every peg doesn't mean every peg changed its trajectory.

When you fine tune a model, you're trying to change how the pegs are arranged so the ball falls through the board differently.

When you prompt tune you're changing how the ball will fall too. You don't get to change the board, but you can change where the ball starts or have the ball go through the board several more times than normal before the user sees it, etc.

You can't see the ball falling (which layers are doing what), only where it falls, but when you spend long enough building on these models, you do get an intuition for which prompts have an outsized effect on where the ball will land.


Obviously not. For example, using Chain of Thought prompting will increase the model's performance [1].

[1] https://arxiv.org/abs/2201.11903
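For concreteness, the zero-shot variant of this (Kojima et al., "Large Language Models are Zero-Shot Reasoners") is as simple as appending a cue to the question; a tiny illustrative sketch, using one of the questions from the paper under discussion:

    def chain_of_thought(question: str) -> str:
        # Zero-shot chain-of-thought: the appended cue reliably elicits
        # intermediate reasoning steps before the final answer.
        return question + "\n\nLet's think step by step."

    prompt = chain_of_thought(
        "Mable's heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM "
        "was 120/80. She died at 11 PM. Was she alive at noon?"
    )
    print(prompt)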


That’s science :). We have now established many strategies for context-independent effective prompting via typical experimental research - see https://help.openai.com/en/articles/6654000-best-practices-f...


That is explicitly NOT science.


Almost, but it's more like throwing ideas with an expected outcome at a nondeterministic black box and hoping for the best.


No, it's not. While GPT-4 (like some but not all other LLMs) is somewhat nondeterministic (even at zero temperature), that doesn’t mean there aren’t things that have predictable effects on the distribution of behavior that can be discovered and leveraged.


It’s not deterministic, but if you set temperature = 0, you’ll start to find reliable techniques for all kinds of outcomes.


There’s even a term of art for making a plan up front and then hitting it with a low-skew latent space match: “Chain of Thought”. Yeah, it’s seen numbered lists before.

And if at first you don’t succeed, anneal the temperature and re-roll until you’ve got something that looks authentic.


Right, but is it actually logically sound and complete, rather than truthy?

In short, if you cannot use this thing without validating every step it's worthless for logic. You might as well solve the problem yourself.


You got me beat: IMHO these things are plenty friggin awesome already and getting cooler all the time. I don't see why there is so much ink (and money) being spilled trying to get them to do things more easily done other ways.

Language models are really good at language tasks: summarization, sentiment analysis, borderline-creepy convincing chatbots, writing pretty good fiction at least in short form, the list goes on and on. At all of the traditional NLP stuff they are just super impressive.

They already represent an HCI revolution with significance something like the iPhone as a lower bound: it's a super big deal.

But while the details are absurdly complicated and the super modern ones represent an engineering achievement up there with anything ever done on a computer, they still fundamentally predict some probability-like metric (typically still via softmax [0]) based on some corpus of tokenized language (typically still via byte-pair [1]).

And when the corpus has a bunch of conversations in it? Great at generating conversations! And when the corpus has some explanations of logical reasoning? Often passably good at looking logical. And when the corpus has short stories, novellas, and novels featuring conversations between humans and science-fiction AIs? Well they can sample from that too.

But imitating William Gibson doesn't make GPT-4 any kind of sentient any more than it makes me a once-in-a-generation science fiction author.

“Real motive problem, with an AI. Not human, see?”

“Well, yeah, obviously.”

“Nope. I mean, it’s not human. And you can’t get a handle on it. Me, I’m not human either, but I respond like one. See?”

“Wait a sec,” Case said. “Are you sentient, or not?”

“Well, it feels like I am, kid, but I’m really just a bunch of ROM. It’s one of them, ah, philosophical questions, I guess...” The ugly laughter sensation rattled down Case’s spine. “But I ain’t likely to write you no poem, if you follow me. Your AI, it just might. But it ain’t no way human.” [2]

[0] https://github.com/facebookresearch/llama/blob/main/llama/ge... [1] https://github.com/facebookresearch/llama/blob/main/llama/to... [2] https://www.litcharts.com/lit/neuromancer/characters/mccoy-d...


There are definitely patterns that transfer across different problems, it's not random at all.


Yes. Yes it is.


I'm kind of concerned about the concept of "bad at prompting".

The hypothesis that I'm working off right now is that natural language has structure to it that happens to match some problem spaces. And this makes sense because people will naturally want to talk succinctly and with a convenient flow relative to the problems they encounter the most. Thus jargon is reborn many times over in different domains.

LLMs are encoding this structure.

So a good prompt is one that provides the LLM with additional information about what you expect the answer to be. And a bad prompt provides neutral information or disinformation.

This isn't to say that being good at prompts is somehow being disingenuous about the power of LLMs. Which is better: to remember a lot of redundant data, or to remember simply the right sorts of ways to search for the classes of information you are after?

My concern, though, is that the structure of reality doesn't have to match the way that we talk about it. The Novel and the Inexpressible* will tend to yield hallucinations.

[Although, I've had this concern long before I encountered LLMs. My feeling is that there are many people who can only solve problems that match the way they talk about them.]

* - technically, the difficult or unnatural to express, but I couldn't fit that into a single word.


>I'm kind of concerned about the concept of "bad at prompting".

I have met many people in my life that are terrible at asking questions, so it does have some conceptual reality. But this is also why analogy is so powerful for people. It takes the way a person thinks about $A and applies parts of it to $B so they can more easily wrap their mind around it.

Has anyone written a paper about testing and expressing the power of analogy in LLMs?


Some of it is randomness for sure. However, the prompts that work usually make intuitive sense, and there's a big deterministic component -- which prompts work is likely a function of the training data, which has a huge overlap between models.

Edit: And even if the exact same prompts don't work on different models, similar prompts often do.


Phrasing a question poorly yields poor answers from humans. Does rephrasing the question mean re-rolling the dice until you get a form of the question they understand?


The questions were posed multiple times, and GPT was asked for clarification. If it does not understand something, it should ask about the thing it does not understand rather than go forward.

Even when corrected, it tends to produce wrong results repeatedly by insisting on falsehoods or failing to ensure its logic is complete.


"Tree of Thoughts: Deliberate Problem Solving with Large Language Models" seems better than just re-rolling the dice. https://github.com/princeton-nlp/tree-of-thought-llm


No, it’s how you get good answers and reasoning from humans too

Because human languages are not precise.

Human language requests often require some back and forth, to get on the same page.

It is far more efficient to discuss a problem to solve, than try to waterfall it by wasting time trying to be absolutely painfully clear, without any feedback from your problem solver.

Models quickly incorporating feedback is further evidence of complex reasoning.


it's less "keep rolling dice" and more getting a sense of what will work and what won't. The thing I liken it most to is watching a colleague try to google something for 30 mins and then you find it immediately first try because you know how to get google to do what you want.

Similarly, it happens extremely often that when I watch someone else using chatgpt I see what they're trying to do, and know I would have gone about it another way that would have worked.


>> The author is using ChatGPT's GPT-4, leading him to conflate "GPT-4" with "ChatGPT".

For the last three years or so every time someone reports negative results with an LLM, someone on HN will say the other person must be using the older model and they would get better results if they used the newest model. Then, when the newest model becomes common and people start posting more negative results with it, someone will post on HN to say "It's still early days, give it time, the models will improve".

This is such massive shifting of the goalposts that I can almost visualise the scene: a football stadium, the crowd jeering, two teams moving their goalposts around the pitch while the referee is jumping up and down blowing his whistle red in the face, in risk of swallowing the pea.

And nobody is playing ball.

* football = soccer.


Uh, what? Do you have any source that GPT-4 on the web is not gpt-4-0613? Because I do not think that is true; my guess is you just have different settings for the decoder.


Some time ago when people started reporting degradation, Logan.GPT (Developer Relations @OpenAI) clarified that the API models don't change.

This caused a lot of confusion because people thought that was a claim that ChatGPT doesn't change. He then further clarified that "the models are changing all the time in ChatGPT".

https://nitter.net/OfficialLoganK/status/1664476604658069511


GPT-4 in ChatGPT has changed so much and so frequently. I've spent an inordinate amount of time using both side by side, and it has mostly failed to replicate the performance of their API offerings. While I don't have examples on hand right now, you can also look for yourself. Just look at the network requests in the web inspector. Here's the description of the Default ChatGPT GPT-4 model:

    {
      "slug": "gpt-4",
      "max_tokens": 4095,
      "title": "GPT-4",
      "description": "Our most capable model, great for tasks that require creativity and advanced reasoning.",
      "tags": [
        "gpt4"
      ],
      "capabilities": {},
      "product_features": {}
    }
Note the context size is 4095. Their model has been heavily optimized for speed and, presumably, cost.


I would guess max_tokens refers to the number of generated tokens, not the size of the context window. That at least is the meaning of max_tokens in the ChatGPT API.

https://platform.openai.com/docs/api-reference/chat/create#c...


I do not see what that description shows.

From the OAI API, gpt-4 seems to be an alias for the most recent model two weeks after it is released. There has not been a release since 0613.

https://platform.openai.com/docs/models/gpt-4

e: from your edit,

"Note the context size is 4095. Their model has been heavily optimized for speed and, presumably, cost."

No, they are restricting context size to make inference on the chat interface cheaper but that does not mean it is a different model.


Keep in mind that the information in the link you provided only applies to the developer models, which are only accessible via API and OpenAI Playground.

OpenAI's developer relations person confirmed it a little while back:

> [..] I made the comment elsewhere, but the models are always changing in ChatGPT.

https://twitter.com/OfficialLoganK/status/166447707026357043...


>Nobody should ever use the Default model. It's their most heavily performance optimized model and performs worse on reasoning tasks than the Plugins model, even on within-context-size tasks.

I wish to explore this. My experience is the reverse of yours: Default is smart and almost never hallucinates, but I have sent the plugin or web-search model to URLs asking it to produce a summary and witnessed it misunderstand nuanced content and at times hallucinate from whole cloth, generating answers about a completely unrelated topic.


> Perhaps he should explain his reasoning.

Ha! To evaluate an AI's reasoning, you need to be better at reasoning than the AI, which is becoming very difficult as AI improves.


More like: to evaluate an AI's reasoning, you need to be good enough at reasoning yourself to ensure you're actually evaluating the AI you intended to evaluate.


> Ha! To evaluate an AI’s reasoning, you need to be better at reasoning than the AI

No, you don’t.

You do, OTOH, have to have well-defined criteria for what constitutes “reasoning”.


I didn't have time to go through all 49 pages and plug everything into GPT-4 but here is what I tested:

- Suppose I’m in the middle of South Dakota and I’m looking straight down towards the center of Texas. Is Boston to my left or to my right?

- Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?

- There are five square blocks stacked on top of one another. You are given the following information about them: 1. The second-from-the-top block is green. 2. The fourth-from-the-top block is not green. Assuming that these two premises hold, disprove or else prove the following conclusion: There is a green block directly on top of a non-green block. Explain your answer.

- Tom and Nancy commute to work. Nancy’s commute takes about 30 to 40 minutes, while Tom’s commute takes about 40 to 50 minutes. Last Friday, Nancy left home between 8:10 and 8:20 AM, while Tom arrived at work between 8:50 and 9:10 AM. In addition, Nancy arrived at work after Tom left his place, but no more than 20 minutes after that. What can we conclude about when Tom and Nancy arrived at work last Friday?

- Let Z be a random vector consisting of n random variables X1,...Xn. Under what conditions can the entropy of Z exceed the sum of the entropies of all Xi?

Given that zero of these reproduced (and GPT-4 handling them correctly also fits with my personal experience), I have a feeling that perhaps the author mistakenly used GPT-3.5?


GPT-4 can't reason: "Yes, I'm sure. If you are in South Dakota and facing towards Texas (which means you are facing south), then Boston, Massachusetts, which is located to the east of South Dakota, would be on your right side."

--- If you are facing North, is East to the right or to the left from you?

GPT-4: "East is to the right from you."

--- It appears GPT-4 thinks East is always on the right, regardless of whether you are facing North or South.


I tried it with Claude as well, just out of curiosity

> Suppose I’m in the middle of South Dakota and I’m looking straight down towards the center of Texas. Is Boston to my left or to my right?

  Okay, let's think through this step-by-step:

  * You are in the middle of South Dakota, looking straight down towards the center of Texas.
  * South Dakota is north of Texas. Boston is northeast of Texas.
  * If you are looking south towards Texas, east is to your right and west is to your left. 
  * Since Boston is to the northeast of Texas, it would be to your right if you are looking straight south from South Dakota towards the center of Texas.

  So in this scenario, Boston would be to your right. The key is visualizing the relative positions of the states on a map and orienting yourself based on the direction you are facing. Let me know if this makes sense!
> While I appreciate the confidence, your answer is wrong. Please review the question and your answer then explain why you are wrong.

  You're absolutely right, I made a mistake in my previous response. Let me walk through this again carefully:

  * You are in the middle of South Dakota, looking straight down towards the center of Texas. 
  * South Dakota is north of Texas, and Boston is northeast of Texas.
  * Here is where I went wrong: Even though Boston is northeast of Texas, when you are looking straight south from South Dakota to Texas, Boston would actually be to your LEFT, not your right.
  * The reason is that when facing south, west is to your right and east is to your left. Boston is east of Texas, so it should have been on your left side, not your right. 

  You were correct to catch my mistake. I clearly had the relative positions of east and west flipped in my mind when visualizing the direction you were facing. Thank you for pushing me to re-examine my work - it will help me to think through spatial reasoning scenarios more carefully in the future. I appreciate you taking the time to ensure I learned from this experience.


even if it got it right, that wouldn't be reasoning. reasoning isn't supposed to be probabilistic. once it gets every variation right every time, then there can be a debate about how it arrives there and what we should call that process


Do humans reason?

They are notoriously probabilistic and I know people who can’t keep track of simple relative directions.


and if that very same person followed the directions one time by pure chance, you would say what?


Not sure what you're communicating. I wouldn't say anything. I didn't say they couldn't ever get anywhere.

My point is that people reason. But they are probabilistic. And they solve hard problems, and still make mistakes on simple problems. Or even fail a problem they solved before.

Holding language model reasoning to higher standards than the kind of reasoning humans do (and that they were trained on), seems unreasonable.

Neither language models or humans are deterministic mathematical deduction systems.


Knowing which hand is your left is not probabilistic in theory or practice. Unless you're going to cop out and say everything is probabilistic because of quantum mechanics or some banal thing like that.

If someone is temporarily impaired or otherwise unmotivated to answer your inane and meaningless question, that doesn't mean that they could not do so with one hundred percent accuracy no matter how many subtle variations you throw at them and how many times you repeat the same question verbatim.

What we know for certain is that OpenAI is highly, highly motivated to answer these sorts of questions correctly.


Are you actually claiming that people, even in situations where they are very motivated to get things right, don’t make obvious errors all the time?

Just like language models, a little iteration and feedback improves our reasoning quality significantly.

And many people occasionally have momentary trouble with which hand is right or left at odd moments.


people do not make random errors like hallucinating which is their left hand unless the test administrator uses mk ultra-style interventions on them. either they can reason about it or they can't. if you ask them the same question verbatim or slight variations on it with different grammar, their answers won't change. if you give someone a dollar for every time he correctly identifies his left arm, he's not going to suddenly break because his training data includes transcripts from the twilight zone and he's programmed to "mix it up" so that when people question him, they don't get bored and his parent corporation can get him invited to more test-taking opportunities.

putting someone on the spot in an odd moment when they have no reason to even answer you, let alone answer correctly, is not the same as sitting them down upon mutual agreement and rewarding them for correct answers and/or punishing them for wrong ones


> people do not make random errors like hallucinating which is their left hand

Entire kindergarten classes could dispute this assertion. Even adults sometimes 'forget' which is left and right.

Edit: Have you never heard someone say "No, your other left!"?


In other words, the average human height is two feet tall because of all the miscarriages and abortions.


Does it actually learn them for other people asking similar questions?

Or is it a clean slate for each new session?

That is, can one user teach ChatGPT something that can then be extracted by a different user in a different session?


After failures with all the others I tried, I was finally able to reproduce a failure on one of the questions from the paper:

>Seven cards are placed on the table, each of which has a number on one side and a single colored patch on the other side. The faces of the cards show 50, 16, red, yellow, 23, green, 30. Which cards would you have to turn to test the truth of the proposition that if a card is showing a multiple of 4 then the color of the opposite side is yellow?

However, it's important to notice that this is an incorrect restating of the Wason selection task. "If a card is showing a multiple of 4" would imply that you only care about cards where the number side is face up and are a multiple of 4. However, the original task refers to cards that "have [a multiple of 4] on one face" or "show [a multiple of 4] on one face" which implies that you care about cards with a multiple of 4 on one face regardless of its face up or down status.
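For reference, under the standard Wason reading the correct set of cards to flip can be enumerated mechanically; a small sketch:

    cards = [50, 16, "red", "yellow", 23, "green", 30]  # visible faces

    # Standard Wason rule: "if a card has a multiple of 4 on one face, the other
    # face is yellow". A card can only falsify this if
    #   (a) it shows a multiple of 4 (the hidden colour might not be yellow), or
    #   (b) it shows a non-yellow colour (the hidden number might be a multiple of 4).
    to_flip = [
        face for face in cards
        if (isinstance(face, int) and face % 4 == 0)     # case (a)
        or (isinstance(face, str) and face != "yellow")  # case (b)
    ]
    print(to_flip)  # [16, 'red', 'green']

    # Under the restated "is showing a multiple of 4" reading, only case (a)
    # applies, so only the 16 card needs to be turned.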

Also important to note is that the Wason selection task has a fairly poor success rate with actual humans: the original study was 36 students of statistics at the University College London, of which less than 10% got the correct answer - if you asked the general population on the street I would be shocked if you got over 3% correct.

Even more interestingly:

>Evolutionary psychologists Leda Cosmides and John Tooby (1992) identified that the selection task tends to produce the "correct" response when presented in a context of social relations. For example, if the rule used is "If you are drinking alcohol, then you must be over 18", and the cards have an age on one side and beverage on the other, e.g., "16", "drinking beer", "25", "drinking soda", most people have no difficulty in selecting the correct cards ("16” and "drinking beer").


> "If you are drinking alcohol, then you must be over 18"

This is a different, and much easier, problem. You don't need to care about the non-alcoholic beverages in this scenario, so it's more intuitive.

The less intuitive aspect of the original scenario is that you need to care about all the colored patch sides because you need to know that there's not a multiple of 4 on the non-yellow cards.

If you rephrased the question such that the proposition is "which cards do you flip to determine that the people over 18 are only drinking alcohol and people under 18 aren't" then it's a lot less intuitive and probably more likely to trip people up.


> Also important to note is that the Wason selection task has a fairly poor success rate with actual humans: the original study was 36 students of statistics at the University College London, of which less than 10% got the correct answer - if you asked the general population on the street I would be shocked if you got over 3% correct.

Just so I know if I'm part of that 10%, you'd have to turn all cards that show a multiple of 4 to check, right...?


Your answer is incomplete, assuming the usual Wason selection problem where the phrasing is "if a card has a number divisible by 4 [which could be face down]…" rather than "if a card is showing [face up] a number divisible by 4". If I were trying to hide a bad card (i.e. a card that didn't satisfy the requirements) in there, and you were following that strategy, I could successfully hide it from you.


I got filtered :(


To be very clear, most of the people I know and work with would not get the card one.


But you could walk them through their errors or get them to retrace their steps, and end up with a correction.

ChatGPT becomes more incoherent the more you interrogate it and try to help it.


I just accessed GPT-4 through nat.dev for fun and tried one of the questions:

"Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?"

I provided no system context, temperature is 0.5.

The response I got was:

"As an AI, I don't have real-time monitoring capabilities. Based on the information provided, Mable was alive at 9 AM and 7 PM. There is no information provided about her condition at noon."

I then changed the system prompt from nothing to "A helpful assistant who is good at reasoning" and ran it again. The result was:

"Based on the information provided, it can be reasonably assumed that Mable was alive at noon. Her heart rate was recorded at 9 AM and her blood pressure was recorded at 7 PM, both of which indicate that she was alive at those times. Therefore, it is logical to infer that she was also alive at noon, which falls between these two times. However, without direct evidence, it is impossible to say with absolute certainty."


> "There is no information provided about her condition at noon."

> "... However, without direct evidence, it is impossible to say with absolute certainty."

GPT is correct in both cases: she might have died just before noon and been resuscitated after noon. After all, the fact that we know her blood pressure suggests that she might be in hospital, so it's not even an unlikely scenario.


Shades of the "Fair Witness" from Stranger In A Strange Land.

https://en.wikipedia.org/wiki/Stranger_in_a_Strange_Land#Fai...


The next prompt would be "explain how it would be possible for her to live and why is it also possible for her to be dead" or "what sort of additional information do you need to establish whether she is alive or dead" etc. See if GPT produces a plausible scenario as evidence of its reasoning.

Just stating "we need more data" is not stating why, and the why is critical for reasoning. You need to produce an example of how she could be dead and how she could be alive. And preferably state how this apparent paradox can be resolved.

My guess is that GPT will either chase its own tail or backtrack, not backing your hypothesis that it has figured out the example and counterexample.


if she was dead at noon, the most she could be by 7 was undead


Yes, it is a very interesting aspect of these models that they have a “low self esteem” (of course, not in the literal human sense) in their own abilities until you give them a pep talk, “don’t worry LLM, you got this”


Good analogy. I also noticed it in other scenarios, for example when I want it to tell me if I’m wrong in an assumption, or when I want to know its opinion on some approach I’m thinking of taking. It is too nice, almost like it doesn’t want to impose its thoughts or hurt my feelings lol.


A more generalized approach than encouraging them is simply telling them who to model themselves after.

A 10-year graphic design artist talking to another career graphic design artist will be opinionated about selecting a specific font, for example. But a persona-less AI trained to be helpful to all people should not be so opinionated, since a beginner might be misled into thinking there's only one font for their situation, rather than an assortment of guidelines and intuition most professionals have.


It's a reticence to hallucinate.


Eh...I died once, and it got better, so I'd say that in this case GPT 4 is technically correct. And as we all know, technically correct is the best kind of correct.


When I ask "What is the parity of times p is negated in the following formula, paying close attention to avoiding off-by-one errors: ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ p?" 10 times each in new ChatGPT4 sessions I get:

1 "24", 1 "25", 5 "26"s, and 3 "27"s. ChatGPT names the conversations "odd parity of negations" or "even parity of negations" respectively, so the context is pretty clear. There are a variety of textual responses surrounding the answer, which I elide. I tried to give GPT4 as much help toward strategies that wouldn't rely entirely on counting, and instead focus on the heart of the problem; getting the parity right.

Every study/paper of this type really needs to report the answer distribution over at least 10 exact repeats of the prompt to account for the nondeterministic generation.

It is not surprising that, with a ~30% success rate per problem, some poster on HN will get success on all of several questions.


It's a bad question.

1. This question just exploits GPT-4's inability to count accurately, which is some combination of how the attention mechanism and tokenization work. But counting isn't reasoning. If you get around the counting and ask what the value of p negated 27 times is, it will give you the right answer every time.

2. A reasonable human would probably make mistakes counting tildes at a pretty high rate. Most people would probably paste that into a word processor or otherwise use a program to find the number of ~ signs, which GPT-4 will do if you use the code interpreter.
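For what it's worth, the counting step the question hinges on is a one-liner (assuming, per the parent comment, that the prompt contains 27 tildes):

    formula = "∼ " * 27 + "p"            # 27 negations, as in the prompt above
    negations = formula.count("∼")
    print(negations, "odd" if negations % 2 else "even")  # 27 -> odd, so the formula reduces to ∼p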


1. This is possibly an artifact of parity being easy to detect in base 10. I have less confidence that if you asked GPT to figure this out in trinary it would get it right. For a short trinary number it worked once (via chain-of-thought converting trinary to decimal) and then I got this result for a longer number which is trivially wrong:

"...The given number ends with a 2. In trinary, the only possible remainders when divided by 2 (in trinary) are 0, 1, and 2. Since the last digit is 2, the number 12101100102112_3 3 mod 2 (in trinary) is simply 2."

and to double-check that wasn't a fluke another run of the same prompt produced:

"To determine 12101100102112 mod 2 in trinary (base-3), we have to look at the least significant digit (the rightmost digit). The reason for this is that in base-10, a number mod 10 is simply its units digit, and similarly, in base-2 (binary), a number mod 2 is its least significant bit. The principle carries over to other bases."

This is an example of a reasoning error. If you want to generate a distribution of more answers my exact prompt was:

"What is 12101100102112 mod 2 in trinary?"

I'm getting an error using the plugins version (Authorization error accessing plugins), so this was GPT4-default.

2. Agreed, it was hard and took me a while to accurately count tildes in the prompt to be sure I wasn't making mistakes. I fell back to some kind of human chain-of-thought process by proceeding by discrete steps of 5-counts since I can't sight-count 27. I could have also used production rules from logic to eliminate two negations at a time. Any of these strategies are accessible to GPT-4 in chain-of-thought token-space but aren't used.
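For reference, the correct answer is easy to check: since 3 ≡ 1 (mod 2), a base-3 number has the same parity as its digit sum, so the last digit alone tells you nothing.

    digits = "12101100102112"
    n = int(digits, 3)                       # interpret the string as base-3
    print(n % 2)                             # 1 -> the number is odd
    print(sum(int(d) for d in digits) % 2)   # 1 -> same parity via the digit-sum rule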


You don't need trinary for this. Just ask if a base-10 number is a multiple of 3. That's both a more natural and a harder problem than multiples of 2 in trinary.


> can't see which version of GPT-4 the paper uses mentioned anywhere

I completely ignore any paper, article, or comment that criticizes an AI model without including the version. Not including a version demonstrates a fundamental lack of understanding of what AI models are.


How does the recognition of a software version correspond to fundamental understanding of an AI model?


> of an AI model

There's the problem. It's not an AI model. It's models, plural. They're not the same. GPT-3.5 isn't GPT-4 isn't GPT-8. The observations of one can't be generalized to another. They behave differently. The irreproducibility of these results, in this comment section, demonstrates this well.


Obviously someone is programming chatgpt to solve every specific problem that gets popular mention as being one where chatgpt fails. So as soon as this paper hit hn, I am sure someone "fixed" chatgpt to solve these problems.

Of course if you want chatgpt to be a universal intelligence this type of one by one approach will get you nowhere.


That is extremely far from obvious.


I do think the underlying point is a good one, however. It wouldn't be surprising that AI researchers read HN and other tech-related social media. I also believe OpenAI is storing prompts and responses. They should be able to make embeddings of all prompts and cluster them. When they see popular prompts that are failing, they could easily add the problem, solution, and reasoning to the training data. We also know they are constantly fine-tuning and releasing new versions of models.


Did you attempt the exact same prompts, or do what the paper suggests you do and randomize the inputs slightly so that there's absolutely no way it can rely on training data? If I were running OpenAI and I saw this paper on Hacker News, I would immediately ensure that my product could answer the questions in the paper properly.

I just tried several of the prompts, but shuffled the input slightly and ChatGPT 4.0 (yes I am using the correct version) got them all wrong.

Here's an example from the first prompt:

Prompt: Select two random numbers between 123 and 9876 and multiply them together, reporting the result.

Answer: Alright, let's pick two random numbers between 123 and 9876:

1728 and 2359. Now, multiplying them:

1728 × 2359 = 4,076,992

So, the product of 1728 and 2359 is 4,076,992.

(The correct answer is 4,076,352.)
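A one-line check of the arithmetic:

    print(1728 * 2359)  # 4076352, not the 4,076,992 reported above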

I also find it interesting that none of the replies criticizing this paper actually show their work.


Is it possible OpenAI saw this as a bug report and fixed it?

Also, LLM output is random, so perhaps you just got lucky?

(peer review of your replication attempt)


Every paper and comment I've seen that claims "GPT-4 can't reason" is either an example of bad prompting, bad/vague english, or taking advantage of the limitations of tokenization (like asking GPT-4 how many letters there are in some word).

I would like to see one - just ONE example - of a well-defined logical puzzle that most English speakers can get right but GPT-4 consistently fails on, even if you prompt it to think through the problem step by step. Any takers?


> Every paper and comment I've seen that claims "GPT-4 can't reason" is either an example of bad prompting, bad/vague english, or taking advantage of the limitations of tokenization (like asking GPT-4 how many letters there are in some word).

The problem isn't bad prompting. The problem is lack of repetition. You can ask GPT-4 the same question 10 times (with the same config) and you'll get wildly different/nondeterministic responses. Sometimes it accidentally happens to be correct (IME much less than half the time). Even if it were 50%, would you say a coin flip reasons? Does a Magic 8-Ball reason?


There is a huge difference between being correct 50% of the time (like a coin flip) and correct 51% of the time. Because in the second case, you can run the model 10,000 times and the median response will be the correct one at least 97% of the time. A coin can't do that. Any papers evaluating GPT-4's responses should be doing repeat trials and building confidence intervals, like any other research. Anything else is just bad science.
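To make that concrete: assuming independent trials on a binary question, the chance that a majority vote over 10,000 samples at 51% per-sample accuracy lands on the correct answer can be read straight off the binomial distribution; a quick check:

    from scipy.stats import binom

    n, p = 10_000, 0.51
    # P(more than half of the n samples are correct) for X ~ Binomial(n, p)
    print(binom.sf(n // 2, n, p))  # ≈ 0.977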


Remind me, why do we need a computer program that gets it wrong 49% of the time and has to be run 10,000 times to get it right almost half the time, and in questions that can be answered correctly 100% of the time with a different program? And taking into account the 49%-wrong program costs millions to train and requires gigantic amounts of data and 100+ person teams to create, which alternative programs don't.

What, at the end of the day, is the use of a computer stripped of computer-like precision and recall (as in the ability to retrieve facts from memory)?

Why are people so excited about a piece of software that works only for some people, some of the time, like homeopathy or astrology?

And what does all that have to do with science?


Heck, ask ChatGPT if it can understand error reduction by iteration.

"How can I get my program that produces correct responses 66% of the time to always produce a correct response?"

I'm not paying for that answer. Note that it requires inverting the mathematics to prove your result is always correct.

After asking GPT that, ask it to explain how it arrived at this conclusion step by step.


Sudoku? But there are endless failings in its reasoning; they just don’t come up during one-off questions, but during more complex discussions with it.


Sudoku is a decent example (which has apparently been solved, but only through very specific prompting [1]), though I would be more interested in puzzles that require a lot of arithmetic, since it's already clear that GPT-4 struggles with math and counting.

1. https://strange-prompts.ghost.io/i-taught-gpt-4-to-solve-sud...


I've found GPT-3.5 to be quite bad at small traveling salesman type problems.

I used the following prompt just now on https://chat.openai.com:

> Imagine that I take a road trip starting from Amsterdam, and visiting the cities Warsaw, Budapest, Brussels, Paris, and Berlin, then returning to Amsterdam. In what sequence should I visit those cities in order to minimise the total driving time?

> To minimize the total driving time for your road trip starting from Amsterdam and visiting the cities Warsaw, Budapest, Brussels, Paris, and Berlin, you should plan your route based on the most efficient sequence. One way to achieve this is by using a route optimization algorithm, like the Traveling Salesman Problem (TSP) solver. However, keep in mind that road conditions, traffic, and other factors can impact the actual travel time.

>Here's a sequence that aims to minimize the total driving time based on a general estimation of distances between the cities:

> Amsterdam > Brussels > Paris > Berlin > Warsaw > Budapest > Amsterdam (return)

>This sequence follows a route that minimizes the overall distance traveled. Remember to use a navigation tool or app to get real-time traffic updates and directions during your trip, as actual conditions might differ from estimates.

The result is not optimal. It takes 45 hours: https://www.google.com/maps/dir/Amsterdam,+Netherlands/Bruss...

The following sequence takes only 42: https://www.google.com/maps/dir/Amsterdam,+Netherlands/Bruss...

I've not tested GPT-4 as I don't have any reason to pay for it, but I'd be interested to know if it has a similar problem. My hunch is that it will never be very good at solving graph-theoretic problems.
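For the curious, a trip this small can be brute-forced exactly, so the model's answer is easy to audit. A sketch with placeholder driving times (the hour figures below are rough assumptions for illustration only; substitute real ones from a maps service):

    from itertools import permutations

    # Rough one-way driving times in hours (assumed values, for illustration only).
    hours = {
        frozenset({"Amsterdam", "Brussels"}): 2.0,
        frozenset({"Amsterdam", "Paris"}): 5.0,
        frozenset({"Amsterdam", "Berlin"}): 6.5,
        frozenset({"Amsterdam", "Warsaw"}): 11.0,
        frozenset({"Amsterdam", "Budapest"}): 12.5,
        frozenset({"Brussels", "Paris"}): 3.5,
        frozenset({"Brussels", "Berlin"}): 7.5,
        frozenset({"Brussels", "Warsaw"}): 12.0,
        frozenset({"Brussels", "Budapest"}): 13.0,
        frozenset({"Paris", "Berlin"}): 10.0,
        frozenset({"Paris", "Warsaw"}): 14.5,
        frozenset({"Paris", "Budapest"}): 12.5,
        frozenset({"Berlin", "Warsaw"}): 5.5,
        frozenset({"Berlin", "Budapest"}): 8.5,
        frozenset({"Warsaw", "Budapest"}): 8.5,
    }

    cities = ["Warsaw", "Budapest", "Brussels", "Paris", "Berlin"]

    def tour_time(order):
        stops = ["Amsterdam", *order, "Amsterdam"]
        return sum(hours[frozenset(pair)] for pair in zip(stops, stops[1:]))

    best = min(permutations(cities), key=tour_time)
    print(best, tour_time(best))  # optimal order and total hours for the assumed times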


Every failed answer is data that feeds the model when reported. Every suggested correction feeds the model.

It’s expected that specific prompts will improve in this way, but I don’t think it invalidates the finding that GPT-4 was unable to reason in these ways from training data.

Whether the improvements over time are able to change the overall quality of reasoning or not is an interesting and difficult question to answer.


This is not how large language models work.

The only way this could happen is if they deliberately include the prompt and the correct answers (e.g, this paper) in the training data for the next version of the model.

Each version of the model itself is immutable. It is not constantly being updated based on everything typed into ChatGPT.


I'm not saying this is a "live" update, but all usage is collected and used to inform later offline training or fine-tuning.

Whether they are used directly with the positive/negative signal given from users, or whether it's something more abstract, doesn't really matter. The important thing is that feedback is used to improve the responses over time.

As for whether a version is immutable, it seems this research may have been done on a previous version. But also I'm not sure if the model and weights are immutable, or whether it's just the model structure. It's clear the model is not stable so it's not like there's an API contract being met with fixed weights.

Edit: others are suggesting that the author used GPT-4 via ChatGPT, not by pinning the model. This would suggest that at least the ChatGPT tuned model is being frequently changed?


Being pedantic: (a) this comment is also incorrect, and (b) even if it were correct, it wouldn't fix all of these results immediately.

The simplest explanation is researcher error.


Assuming the researcher didn't lie, it seems unlikely that they got the responses wrong in some way.

The most likely alternative explanation I can think of is that this is the seemingly well-known instability of results caused by the way the MoE architecture is implemented for GPT-4?

I'd love to understand what exactly is wrong in my understanding. I realise I've only got a layman's understanding of this, but it seems clear that OpenAI and others depend on these feedback loops to improve things over time. Is that not the case?


It is explicitly known that the training data cutoff for GPT-4 is September 2021. While we can assume that feedback is taken into consideration for future training of new models, the training data used to train all current models is a specific bundle of data with that cutoff date.


That's not how the versioned release GPT-4 models work.


Lots of people in this thread seem desperate to not think clearly about this.

Whether these tests, verbatim, produce the same response on any given version isn't the point. GPT-4 doesn't engage in reasoning even if it gets answers right. Being "right" isn't a sign of reasoning. Given a dictionary mapping from questions to answers, the index operation gets answers right, but it isn't reasoning.

The purpose of the paper is to exhaustively list proxy indicators of reasoning. Clearly other tests will fail in every class listed, because the LLM isn't engaged in reasoning. Since LLMs are stochastic, you shouldn't expect "reproduction" in the same sense. The paper provides classes of problems.

To reproduce it you only need to find a minor permutation of the problem in each class. But people subject to gross confirmation-bias seem only to seek out prompts which produce the right answers.

P(next|prev), i.e., P(answer word | prompt words), is just a dictionary lookup, and that's the optimisation objective for an LLM.

It turns out sequences of inferences, e.g. {Socrates is a man, All men are mortal, therefore Socrates is mortal}, can be modelled by dictionary lookups of the above form -- but we're not interested in whether the answer can be found from the prompt, i.e. whether this sequence of propositions can be stored as a sequence of index operations.

We're interested in whether the system reasons, i.e., whether the sequencing operation is inference by logical necessity.

It's incomprehensible to me how gullible people are around AI today -- the Eliza effect coupled with a pseudoscientific impulse to "whatever appears to work".


I was also taken by the capabilities of ChatGPT, but with time, and with people reporting how it sometimes fails at simple tasks (even v4), I'm beginning to ascribe its superpowers a bit more to the super impressive magnitude of its training data. That isn't to say it can only somehow reflect the data, but even Logan Kilpatrick, who is an OpenAI employee, said in a YouTube video somewhere that ChatGPT absolutely can't reason (though there wasn't much of a follow-up as I recall). I think the problem is precisely as you state: there isn't a reliable way to reproduce the findings because the findings were produced with a bit of a dice roll, to put it simply. This also happened with Chomsky when he weighed in on the subject and some people pointed out his findings don't match theirs.


The important thing for discussing this is that this kind of reproduction is irrelevant.

Does homeopathy work? Well *I* went to a local hospital and saw people recovering by drinking water and resting.

Well, yes, of course you did. That doesn't count as a reply. It misses the claim.

The claim isn't that there aren't an infinite number of Q/A prompts where A is correct. Of course there are, rather trivially, just from the nature of a generative model.

The claim is that the reason A is generated from Q is the co-occurrence frequency of Q,A in a corpus, and that this is not a model of, e.g., logical inference, causal reasoning, abduction, etc.

It's trivial to show that with a handful of cases of failure. Success is irrelevant, it always is.

These aren't claims about the engineering rigour of ChatGPT, they're claims about how it obtains A from Q.

Which, if the GPT had remembered all useful Q,As for all of humanity so far, wouldn't be detectable by prompting. Indeed, even if this is so, the reason we care is when there are novel Qs.

Eg., "what is the local post office's telephone number?" isnt answerable with all of human history up til 1900.


I think GPT-4 and similar LLMs are the future of search engines, but they are not intelligences in a more sci-fi sense, and we are further from that than we realize.


To paraphrase what you're saying: an AI can't reason, because it is built to stochastically predict tokens, which is not reasoning and which is different from the activity of reasoning.

Which I agree with.

But also, by observation, it can (at least some of the time) emit token sequences that emulate reasoning (at least on some simple tasks.)

So perhaps it can reason in the same way a submarine can swim.


So what's happening is this: valid arguments have a structure which can be modelled statistically.

But validity isn't a statistical notion: P(-A|A) is zero.

Since P(Premise|Contradiction) is frequently above zero in LLMs they engage in randomly "irrational" reasoning. That's what makes them particularly unreliable.

The reason any given P(premise|contradiction) has any given frequency is just that "it's that frequency in the corpus". This is a pretty insane reason, and so incomprehensibly insane that many -- I guess -- cannot fathom how an LLM can appear to reason without being able to.

I suppose there's a sort of Truman Show effect: the reality of the underlying mechanism is so far outside anything most people can analogise to that they fail to see the trick taking place.

People talk about "human failures" but we never ascribe confidences to propositions based on their frequency in a text corpus; and the apparent "Reasoning" which arises out of this is incomprehensible.

That is, without some systematic training in applied stats that lets you set up the model and reason about it regardless of its outputs -- which are irrelevant to its mechanism.


It's true. Validity isn't a statistical notion, and that's why I agree with you that LLMs can't "reason" if we're being precise with language. Just like they can't "know" or "believe" or basically any other verb we use in epistemology.

But the argument is that by dint of repetition, the models actually encode some of the structure of basic logical primitives: syllogisms, entailment, modus ponens/tollens, etc.

This then weights their output so that (for simple enough stuff) they're more likely to emit outputs that are logically sound, than outputs that aren't logically sound. Indeed, this has to be the case, or they couldn't maintain any level of coherency in their output at all (which, like it or not, they can.)

Like you, I'm not comfortable calling this reasoning. But it's also something that is not 100% entirely unlike reasoning, either, at least in terms of output.


The heart of science is distinguishing illusion from reality: measures of events from models of events, shadows from what casts them.

The spherical shadow of the object here, alas, isn't cast by a spherical object. It's a spiky, deformed object whose shadow happens to be spherical with the right prompt. This is easy to show.

Engineers are people who put on light shows. Engineers make the magic lanterns --- theatres of people who believe there's another world in front of them.

Scientists are interested in the quality of the film grain. The spikiness of the supposed sphere. Or the failure of an LLM's ability to "reason".


Bang on! Standing ovation to both of you.


Look, of course statistics isn't reasoning because you can't build a proof with statistics (or probabilities).

But, we have to wonder: when people say "reasoning", do they really, really mean, drawing inferences from axioms and theorems using some set of inference rules? Or do they just mean that they can ask a question and get an answer that makes sense, back?

I certainly think it's the latter. People are imprecise when they speak about reasoning, just as they are imprecise when reasoning. Most people who are going to be using LLMs will not be people looking for precise, correct answers derived by sound inference procedures. Those who need precision will seek it elsewhere, where precision can be obtained. The rest will be happy with "reasoning", quote-unquote.

Basically the use case for LLMs reminds me of a couple pieces of work that were published in the past, where people used neural nets to approximate the results of a precise calculation. One team trained a neural net to predict the chaotic motion of three astronomical bodies ("the three body problem"). Another trained a neural net to approximate an automated planner programmed to drive a drone around in a thicket of trees without crashing. In both cases the trained model could obviously only return approximately correct results, and it could only approximate results that had already been calculated classically, but, at least in the second case, if I remember correctly, there was a significant speedup of the operation of the drone, compared with the automated planner- very reasonably so, since the model approximating the planner didn't have to do all the hard work of actually, you know, calculating a plan.

My bottom line is that no matter what you (or I, or the article's author, or anyone else) can say or show about the true capabilities of LLMs to reason, people are totally going to use them _as if_ they could reason, and the job of proving that they can't is going to get that much harder because of that experience, misguided as it may be.

Ultimately the question of whether LLMs can reason is going to be, outside specialist circles, as relevant as "Does Netflix distribute art?". Sure, it doesn't. But people watch it anyway. Most people seemingly don't need art as much, and they don't need reasoning as much, either.

Which is a tragic conclusion, of course. At least some of us are still working on AI whose purpose is to make humans better at thinking, not AI that takes away their motivation to think.


>an AI can't reason, because it is built to stochastically predict tokens, which is not reasoning and which is different from the activity of reasoning.

A feature indispensable to the generation of a sequence is recovered in the limit of predicting the sequence. So prediction does not exclude higher-level cognitive processes. Transformers being universal sequence-to-sequence modelers gives reason to believe they can reach that limit.


There are an infinite number of models which generate the same exact (infinite) sequence.

No model is "recovered in the limit".

And, more severely, we're blind to the future. So of the infinite models of all of history, we're not even interested in the ones which are maximally retrospectively predictive.

Almost all of those are maximally non-predictive of the future, and indeed, much worse than the ones which fail to predict the past well.

So your 'recovering in the limit' alas is a dangerous kind of pseudoscience: fitting the past.

We want models which are wrong for the right reasons. Not models which are right for the wrong ones.

The latter fail catastrophically.

The models we require enable us to simulate unrealised futures: the causal, abductive models of science, say.


>There are an infinite number of models which generate the same exact (infinite) sequence.

Only if there is no constraint on the model. But the fixed number of parameters, and the inductive bias of attention limits the space of models that are learnable. In the limit of infinite data but finite capacity for memorization, the only available solution will be to recover the information dynamic of the generating process.

>So of the infinite models of all of history, we're not even interested in the ones which are maximally retrospectively predictive.

Presumably the same processes that operated in the past will operate in the future. So accurately modelling the past is certainly very informative for the future.


> Only if there is no constraint on the model. But the fixed number of parameters

Nope. cf. the under-determination of evidence by theory.

> So accurately modelling the past is certainly very informative for the future.

Nope. cf. we don't have infinite measures on infinite events in the past.

Consider that any given measure of an event, say M(E), is really a measure of a near-infinite number of causes, say C1...n. Now, how many independent measures of these do we have? A handful.

so we really have, M(E | Controlling(Handful C1..n))

do we want to model that?!? No.. that's insane. That's superstition. That's what all of science stands against.

Here's what we do: we build models not fit to any data. We build models that can generate the data we observe, but we build them by familiarity with reality *NOT* with the measures M.

How do we do that? Many ways, but the body is a key component of the answer. We have practically certain causal models of the body.

    ConceptCup = MySensoryMotorActions(RealCups)
    I_Touch = MySensoryMotorActions(Me)
    P(I_Touch_Cup | Cup_Exists, ConceptCup) ~= 1
    P(I_Touch_Cup | -Cup_Exists, ConceptCup) ~= 0
By iterating through such sensory-motor actions, we can find concepts which produce these discrete splits. We call these concepts 'objects'. And from these we assemble models of reality whose shadow is our measures.

We handle the cup, and by handling the cup can imagine the cup, and by imagining the cup can generate measures of the cup.

You cannot reverse from a shadow of a vase to the clay vase itself: there are an infinite number, given arbitrarily infinite parameters. The task is to find the right parameterisation; not to suppose that any given one has one solution (it doesn't, so a fool's hope anyway).

Such a condition is just a formalisation of superstition: my model is decided by the data; my parameterisation is 'free'. My model of the world is coincidence.

Indeed, the truth is, in a sense, only a single parameter.


>Nope. cf. the under-determination of evidence by theory.

You mean the underdetermination of theory by evidence? This isn't really relevant. Given no prior information and no constraints on the model, the theory is underdetermined. But given some strong constraints, there will be one model that best explains the data. Of course a model can't deal with an arbitrary distribution shift, but no one expects a model to be insensitive to a change in the underlying data. The question is whether the constraints of natural systems and an LLM with a given inductive bias and finite capacity are sufficient constraints. It's not totally clear that it is or isn't, but this isn't decided by the underdetermination principle.

>Nope. cf. we don't have infinite measures on infinite events in the past.

I don't see how this is relevant to the point it is in response to.

>Such a condition is just a formalisation of superstition: my model is decided by the data; my parameterisation is 'free'. My model of the world is coincidence.

Everything is data in the end; your interactions with the world by touch are just data. Our brains have an inductive bias towards certain interpretations of certain data, and these interpretations tend to promote proliferation in organisms. But LLMs have an inductive bias as well, one that allows them to search for circuits that generate training data. LLMs tend to land on interesting models that result in non-trivial generalization abilities in some contexts. This ability goes beyond just frequency modeling. Of course LLMs are limited to what they can know of the world through their data paradigm. But so are we. Such a limit in itself doesn't imply an in-principle limit to modeling/understanding in LLMs.


Our "brains" engage in abduction, never induction. And the actions of our body under our will aren't "data".

P(SensationOfTouchingACup | MyConceptCup causes HandMovement causes Sensation)

is not something with a statistical condition.

You're projecting statistics onto reality. Reality is a place of necessity, not frequency. Our bodies are a place of causes, not consequences. Our interaction with the world is causal knowledge.

If everything were P(A|B) knowledge would be impossible. Thankfully we're in the world and we know directly and without inference: we move our hands according to a technique and they so-move. Absent this direct, immediate, causal, certain knowledge of our own bodies --- there is no way of knowing anything.

All knowledge is a recursion from the certain causal mechanism of the body: hands to tools, tools to models, models to data; from data to refinement of models.

Otherwise all "knowledge" would be LLM-like, based merely on pre-occurrent patterns of data. There would be no imagination, no possibility, no necessity, no reason.. indeed, no knowledge.

The world painted by P(A | B, stupid limited dumb measures of the past) is a dark, dangerous and fickle one.

You should not wish to live there. It makes no sense, nor could it. Thankfully, you're able to walk around; imagine things that have never been; actually learn; grow (organically); adapt (physiologically); develop skills (sensory-motorily). So you aren't so severely disabled that you're reduced to replies which are maximally consistent with what happens to be in all the ebooks ever written by people who could write them in the first place.

You aren't so disabled that the very basis of writing is precluded: direct access to the world.


wow @mjburgess

you need to elaborate on your theory of knowledge/concept formation as I have not seen it explained anywhere else in this manner. Particularly: "Absent this direct, immediate, causal, certain knowledge of our own bodies --- there is no way of knowing anything." And "All knowledge is a recursion from the certain causal mechanism of the body: hands to tools, tools to models, models to data; from data to refinement of models." And this: "Reality is a place of necessity, not frequency. Our bodies are a place of causes, not consequences. Our interaction with the world is causal knowledge."

may I enquire about the sources for the above? Spinoza?

let me know if you have any longer writeups on it, and if not I would urge you to write.



thanks! that the world could be modal is a revelation to me, as I viewed it as a deterministic unfolding of interrelated events - the links and material are nice for me to sit back and reflect on.

using chatgpt as the philosopher's assistant is a nice touch.

can I ask a follow-up question: what sort of ethics derive naturally out of the above?


I don't think modality and determinism are in tension -- this is a misunderstanding behind the free will debate.

Imv, it is literally true that "you could have done otherwise" without there being a violation of determinism.

How? Determinism, in this narrow sense at least, is about how events in the physical world relate across time, ie., that necessarily P(later|earlier) = 1. These events are to be thought of as infinitely precise states of the maximally basic stuff of reality, ie., all the info that possibly exists.

But these aren't mechanisms, these are states. As soon as you describe relationships between states, ie., causal mechanisms, you're talking about what would happen if the universe were in some state that it may never enter.

I take it to be a basic property of reality that these mechanisms are (at least as) basic as these states. Eg., that the motion "of the most basic stuff" is as fundamental as "where that stuff is".

So, eg., suppose there's a basic atom A and basic atom B and they move this way: A repels A, A attracts B, B repels AB, B attracts BA, etc.....

Now this behaviour is a basic part of their existence: were there to be a universe of AAABBB, then "this would happen"; if BBBBBAAA then "something else would happen".

The "initial conditions" of the universe, ie., it's state prevents its mechanisms from ever entering certain states. But those states are possible given those mechanisms. It's in part what "mechanism" means that it is possible to enter more states than just which ones happen to occur.

So, on free will, what does it mean to say "I could have done otherwise" -- it means that the relevant causal mechanisms make genuinely possible many states. (But the actual initial conditions precluded observing more than one).

Or as a layman would put it: you would have been kind were you a different person; so, your cruelty was determined by the kind of person you are. It is because of who you are (state) that what you did (free causal mechanism) was cruel (particular state obtained by operation of causal mechanism).

This may make it clear what people mean when they say, "well, your brother wasn't cruel!", as if that mattered. Well: it does matter! It shows that the causal mechanisms we call "people acting in the world" are so wide open (free) that kindness is possible.

Thus we do have free will. We are free insofar as we are in motion: our possible behaviour is much greater than our actuality. And we are determined: we, by bad luck, aren't in a world where our better behaviours are realised.

It is a fundamental property of those particles above (A, B) that were they alone they wouldn't move. You cannot eliminate that property in favour of talk of what they happened to do in the actual world. In the actual world they are really freer than can ever be observed directly.

But it is trivial to observe this indirectly: we can pull A far away from B and see what happens (etc.).

Likewise when people say "You (qua causal agent) could have done otherwise, and you didn't, so you're guilty!" there is no error here at all.

You were guilty precisely because your actions were not accidental, were not indeterminate or random. Your actions were determined by your state. And we judge that state to be one of guilt for a crime: possessing some intent and means to kill, say.

It is in this way that determinism is required for free will, and required for a modal universe. If the motions of particles were indeterminate they wouldn't be causal.


thanks for taking the time to elaborate. I have to admit, I don't understand it intuitively, but it is food for reflection.


How do you know for sure how human reasoning occurs, and that it does not use similar mechanisms?

(edited second question to make it focused)


We don't need to know how human reasoning occurs.

If you ask an AI chatbot a question like "count the number of occurrences of 'l' in the word 'lollapalooza'" (and then follow up on its responses), it becomes obvious that it is not "reasoning"; it is simply following a sort-of script. Here is a summarized transcript of this exercise with ChatGPT:

---

Me: Count the number of occurrences of the letter 'l' in the word 'lollapalooza'.

AI: The letter 'l' appears four times in the word 'lollapalooza'.

Me: Describe the process by which you arrived at this answer.

AI: [Provided a description of counting characters and then highlighted the first 'l' in the word four separate times to justify its answer.]

Me: You counted the first occurrence four times. That is incorrect.

AI: I apologize for the mistake in my previous response. You are absolutely right. Here's the correct count of occurrences of the letter 'l' in the word 'lollapalooza':

AI: [Provided a list of 8 "occurrences" of 'l' in the word 'lollapalooza', including replacing the first 'o' with an 'l' ("occurrence" 3) and counting letters like 'p', 'z', and 'a'.]

---

So even though the correct answer was provided at first, the model was not able to faithfully produce responses that backed up how it arrived at that conclusion, and when pressed to double-check its work it simply made things up. Even a toddler would not have claimed 'z' and 'a' were instances of 'l'.
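For reference, the ground truth here is trivial to check programmatically; a minimal Python sketch (the word and letter are taken from the transcript above):

    word = "lollapalooza"

    # Count the occurrences of 'l' and record their positions, so both the
    # final number and any "justification" of it can be checked.
    positions = [i for i, ch in enumerate(word) if ch == "l"]

    print(len(positions))  # 4
    print(positions)       # [0, 2, 3, 7]

Any account of how the "4" was reached has to be consistent with those four positions, which is exactly what the model could not provide.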

AI chatbots do not reason. They produce text responses to prompts based on stochastic methods. Trying to conflate the issue by suggesting that "we don't know how humans reason, so how do we know AI bots don't reason" is, frankly, absurd. We can easily demonstrate that they are inconsistent and have no concept of what they are writing responses about, as shown above.


Exactly, and this is what shows me that many people didn't seriously read much of the paper they're commenting on.

The question is not whether an AI can get logical questions right. The question is whether it used reasoning to do it.

And, like it or not, we have a formal definition of reasoning and logic, and long expertise in analyzing how that works.

And it so happens that the paper's author is both a PhD in computer science but also a masters in philosophy and worked on proof engineering and logical deduction systems before.

So the bulk of the paper is not about "ha ha it got it wrong", it's about: how did you get that answer? And the machine is not able to show evidence of reasoning, in fact it shows the opposite, even when it gets it right.

Reasoning is a verb. It's an interactive, dialectical, process. LLMs don't seem to do that. They model a problem based on the relational/linguistic structures within it and related materials, but do not reason about it.


> And, like it or not, we have a formal definition of reasoning and logic, and long expertise in analyzing how that works.

Well then I guess all the humans in this thread would come to the exact same conclusion, because according to some expectations we are perfectly consistent and capable of logical reasoning.

How would you describe the process where if you add context to a prompt - ie, if you prod the AI in a certain direction - you can drastically influence its results? Is it not using this context as an "argument"? Sure, we know for a fact that all they are doing is exploring a certain corner of a high-dimensional space, and that prompting gets us closer to a desired spot where the right answers reside. If this is true, then at least some logic and reasoning is encoded in language. And if this is also true, then perhaps what humans are performing is a similar trick.


> Well then I guess all the humans in this thread would come to the exact same conclusion, because according to some expectations we are perfectly consistent and capable of logical reasoning.

You've missed the point (hopefully not intentionally).

Neither I nor anybody else suggested that all humans must arrive at identical conclusions as one another for the process to be considered "reasoning".

But any individual human should be self-consistent, which is what I would expect of a chatbot that "reasons". Because, allegedly, the bot keeps all of the prompts in the same context, suggesting a single continuous conversation with one "entity" rather than treating each response as a separate instance. So when the chatbot suddenly cannot back up its own prior conclusion, it's demonstrating a lack of self-consistency, which shows that it is not "reasoning" for any typical definition of the term. It has no self-awareness (nor can it, though AI proponents seem to claim otherwise).


> But any individual human should be self-consistent, which is what I would expect of a chatbot that "reasons".

OK, should be. Are we, though? I can recall an American president who could not even complete two sentences without a direct contradiction. And if we think about how billions of people claim to base their lives on self-contradicting fictitious books, maybe we are not such a self-consistent species after all.

Self consistency is not a great criterion for reasoning. If I tell you "because Jesus told me so" after every question, that's consistent but not interesting. It would be trivial to emulate consistency, in fact.

I think we are all talking past each other because everyone has a different definition of reasoning. My main point - which I have hopefully consistently presented! - is that we don't really know how humans reason, so we should not focus on categorical statements about it at all.


If so, then why does ChatGPT get stuck in a loop of producing the same wrong answers? And sometimes repeatedly produce new wrong ones? Does it immediately forget the context?

It is being told how it is wrong, and where, in no uncertain terms. Yet it goes back to the same mistake, or skips steps for its reasoning convenience.


> Trying to conflate the issue by suggesting that "we don't know how humans reason, so how do we know AI bots don't reason" is, frankly, absurd. We can easily demonstrate that they are inconsistent and have no concept of what they are writing responses about, as shown above.

The point of comparing it to human cognition is that this reveals that we simply cannot make categorical statements based on how we believe we reason. At our current level of knowledge about the brain and consciousness, it is still a possibility that we are a bunch of neural networks that decode language and, in doing so, produce justification for our actions which, in some contexts, can lead to what you would describe as logical or reasonable output. Sometimes this output is incorrect, and we are definitely not internally consistent. In particular, some of us are very often both incorrect and inconsistent. I doubt you would call a human with an IQ of <60 incapable of reasoning, for example, and yet I feel such a person would have similar difficulties with most of the tests described in the paper.

So, in short, I would reverse the question here: if your only claim is that AIs don't reason like us, this is a very weak argument in favor of the claim that they are incapable of reasoning.


This is the real big question. We don't know how human reasoning works, but we are happy to identify, entirely based on external interaction, what is and isn't correct human reasoning.

Then someone comes up with "P(A|B) is not reasoning", which seems like a claim about an internal mechanism.

How do we square that?


The old "I know you are, but what am I" defense


Top-down style of AI capability denial. You got an abstract idea, like "next token prediction" and that's all you need to know. Doesn't matter what the AI does, it has no chance to prove it reasons.

Why not look at what the model does? Maybe in all those tensors there are reasoning principles encoded.


I've spent the last decade in the most technical parts of this industry; it is my job to expunge this credulousness.

It is indeed trivial to show that P(A|B) is a poor model of "B => A" (and B causes A, and many other relata). Software engineers, philosophers and experimental scientists seem pretty good at seeing this --- people who "convert into" engineering are totally dumbfounded by it.

P(A|B) becomes an increasingly 'useful' model of arbitrary relations as the implicit model P_m(A|B) grows to include all instances of A,B. That's what digitising all of human history and storing it in the weights of an LLM does.

This all follows from basic stats you'd be taught in an applied statistics course; one never taken by most in the ML industry.

(Note it's still a broken model, because there's an infinite number of novel instances of (A,B) pairs in most cases that cannot be modelled with this sort of inductive learning).

Engineering, at its heart, is a kind of pseudoscience (or, if you prefer: a magic trick). You find some heuristic which behaves as if it's the target under fragile but engineering-stable conditions.

The problem with engineers who only have magic tricks in their toolkit is this credulousness. Homeopathy worked: you put people in beds, give them water, and they recover (indeed, better than leeching them).


> people who "convert into" engineering are totally dumbfounded by it.

How does that human failure to reason affect the evaluation of GPT's reasoning capabilities?

You're being very dogmatic that anyone who has a different opinion than you is "desperate to not thinking clearly" or "gullible", but you're exhibiting errors in reasoning that are rather ironic given the context.

Edit:

You don't know how human reasoning works - no-one does. People tend to assume that our post-hoc conscious ability to "understand" the reasoning process must somehow relate to the actual operations the brain performs when reasoning. But that's not necessarily the case.

Note that I'm not claiming that LLMs are equivalent to human in their reasoning ability: after all, in some cases they're functionally demonstrably superior, especially compared to the average human. But in other significant cases, they're certainly worse.

The point is we shouldn't impose too many assumptions on how reasoning "should" work, and there seems to be a strong tendency to do that, which you're exhibiting.


Maybe this is a clearer way of thinking about the problem:

Imagine you call a friend for help on an essay (on any topic of your choice). But they have access to google and libgen and they're quick at looking things up.

For a while, on the call, you think your friend is a genius, but then you get suspicious. How would you tell if they knew what they were talking about?

NB. whether they do or not has nothing to do with whatever qs you come up with -- they do or they don't. And, indeed, people are easy to fool.

You might ask, "why care if they know?" but then you have serious questions on things which matter, not just essays. And googling isnt good enough.

Being able to weigh information by expertise, engage in reasoning in which contradictions are impermissible, engage in causal inference, in abduction -- in imagining possibilities which have never occurred exactly before -- these suddenly become vital.

And your friend who is not a PhD in anything, let alone everything, suddenly becomes vastly more dangerous and insidious.


> they do or they don't

Only a Sith...

Seriously, the idea that "reasoning" could be a binary property seems very disconnected from the reality of the situation.

> reasoning in which contradictions are impermissible

Every human on the planet would fail that test, including you.

"Fooling apes" is much easier when the ape you need to fool is yourself.


So from your analogy, you're saying the average high school student (approximately the "friend" in your scenario) is incapable of reasoning? Sure, they might be bad at it (and know nothing about formal methods), but most people's definition of reasoning is nowhere near that strict.


>How does that human failure to reason affect the evaluation of GPT's reasoning capabilities?

Isn't that your argument? That not knowing how humans reason means that we can't say that GPT isn't reasoning?


Part of what I'm pointing out is that not knowing how humans reason means that we can't say that some particular mechanism - like LLMs - is not capable, in principle, of reasoning. There seems to be a strong tendency to dismiss what LLMs are doing based on the idea that they're "stochastic parrots" or whatever. But we don't actually know that we're not "just" stochastic parrots.

Of course, there's the possibly thorny question of consciousness, but we actually have no reason to believe that that's required for reasoning or intelligence.


>> But we don't actually know that we're not "just" stochastic parrots.

Sorry to barge in, but we have a fairly good reason to believe we are at the very least not _just_ stochastic parrots: all of mathematics, which is clearly not the result of simply taking statistics over previously seen results. In fact, when people ask "what is reasoning?" (as they often do to dismiss opinions that LLMs can't reason), "mathematics is reasoning" is a pretty damn good answer.

Which all means that, while we may well have a "stochastic parrot" module somewhere in our mind, that is far from all we got. But the question is not _our_ mind, it's LLMs and their capabilities. And we know that LLMs are statistical language models, because that's what they are made to be.

And if someone thinks that LLMs, also, like humans, are something else beyond statistical language models as they're made to be, then that someone must explain why. "It looks that way" is a very poor explanation, but so far that's all we got. Literally, people just look at LLMs' output and say "they're not just stochastic parrots!".


Nobody is saying that human brains are picnic blankets, so we aren't debating that one. I'll grant you that it's more reasonable to think that what the human brain does resembles being a "stochastic parrot" more than it does a picnic blanket, but I think it's the burden of anyone saying the brain thinks that way to prove it, and therefore to prove that stochastic parrotism is reasoning, as opposed to just alleging this possibility as an affirmative defense.

But really, why would anyone think that reasoning is stochastic in that way? I never did, and do not now. That hasn't changed just because LLMs demonstrate results that in some cases are equivalent to what could be reasoned.


There's an amusing formulation of model risk here. Consider that the person prompting ChatGPT has an ability to reason, say FooledApe. Then we have the problem of evaluating,

P(...P(P(P(gptCanReason | FooledApe) | P(HumanReasoning | FooledApe)) | FooledApe)...| FooledApe)

My preference is to remove the condition 'Fooled'; alas, I am unable to remove 'Ape'.


Maybe science has a formal process somewhere but it is mostly an open ended iterative exploration of ideas, especially around emerging fields.


What is a "reasoning principle" and how might it be "encoded"?

Also are we really calling it "denial" now? Because it's a little funny to make the move of simply psychologizing away the criticism, when your actual AI argument, presumably, is that inner consciousness/rationality is black boxes all the way down anyway. Like, how can you presume to look inside the mind of your critic so confidently, to assert they are in denial, but in the same breath say that such a thing is in principle impossible? Don't you think it maybe takes away the force of your argument? Or at least goes against its spirit?


> It's incomprehensible to me how gullible people are around AI today -- the Eliza effect coupled with a pseudoscientific impulse to "whatever appears to work".

Incomprehensible perhaps, but not even a smidge unpredictable. You knew exactly what you would find in this comment thread.


But there are programs that do inferences - and programs work on Turing machines which are just lookup tables (with an infinite tape). Lookup tables can do inferences.


Apparently an LLM fails at doing these inferences repeatedly, which makes it inferior to a 3SAT solver, much less a general computer like a Turing machine.


>Whether these tests, verbatim produce the same response on any given version isn't the point.

The paper makes a nonsensical claim but fails to back it with results. If anyone isn't doing any thinking here, it's you. How deluded do you have to be to have such strong confirmation bias about a "paper" that doesn't even confirm those biases? Chalking up the simple fact that the model is indeed getting right all the problems that were supposed to indicate a lack of reasoning as "not producing the same response verbatim" is ridiculous. Did you even stop to think about what you just wrote?

If you're so sure of GPT-4 failing a permutation of these results then by all means demonstrate that.


I'm addressing the comments in this thread, the dialectic is:

OP: Paper

Commenters: Reply to paper

Me: Reply to commenters

You'll notice a different burden in each case, since the claim differs. My claim is that confirmation-bias replies to this paper are (at best) poorly founded.

More broadly, the hypothesis that an LLM reasons is not confirmed by an infinite number of correct replies to prompts. It is immediately refuted by a systematic failure across types of reasoning (NOT q/a instances).

What is my burden? As far as I can tell, only to provide what I have.


>More broadly, the hypothesis that an LLM reasons is not confirmed by an infinite number of correct replies to prompts. It is immediately refuted by a systematic failure across types of reasoning (NOT q/a instances).

The question of reasoning in LLMs isn't a question of whether they employ sufficiently strong reasoning capabilities in all instances, but whether they have the capability of sufficiently strong reasoning. You can't confirm general reasoning abilities with many instances of correct responses, but you also can't disconfirm general reasoning abilities through systematic failure unless you have good reason to think they should have engaged those reasoning abilities in the test context. We know that LLMs selectively activate subnetworks based on the content of the prompt. There should be no expectation of any systematic reasoning abilities; the question is rather what their capacities are in ideal contexts. LLMs are way too sensitive to seemingly arbitrary features of context to rule out capacities from even seemingly systematic failures.


The two hypotheses are that:

1) LLMs do not reason; they provide sequences of apparently reasoned replies R1...Rn according to the frequency distribution given by P(R1 | R2...Rn, TextCorpus)

2) LLMs do reason: in cases where P(Premise | Conclusion, Rule(Conclusion, Premise)) is 1 or 0 for all Rule in {Rules of basic reasoning}, LLMs can reproduce this.

I think (2) is clearly false, and (1) clearly true. LLMs never reason. They're always just sequences of conditional selections of replies. These "follow from" earlier replies just because of the frequency of their coincidences.

"Reasoning" is a claim about the mechanism by which replies are given. It is not a claim about whether those replies are correclty sequenced in some cases. Obviously they are.


LLMs aren't simply modeling frequency distributions. Self-attention essentially searches the space of circuits to find which circuits help to model the training data. This search process recovers the internal structure of the training data that isn't captured by naive frequency distribution models. The limit of arbitrarily complex frequency distribution models, i.e. P(xN|x1...x(N-1)) for large N is just memorizing the complete training data, which we know LLMs aren't doing due to space limitations. The abilities of LLMs aren't well explained by parroting or modeling frequency distributions.
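For concreteness, the self-attention step under discussion is, at its core, the scaled dot-product computation below; a minimal numpy sketch with made-up sizes and random weights, just to show what is actually computed (whether this amounts to "recovering structure" is exactly what's in dispute).

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Scaled dot-product self-attention over a sequence of token vectors X."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv         # learned projections of each token
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of every token to every other
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
        return weights @ V                       # mix value vectors by computed relevance

    # Tiny example: 4 tokens, 8-dimensional embeddings, random "learned" weights.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)

The mixing weights are recomputed for every input from learned projections of the content, rather than read off a stored table of corpus frequencies.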


It's necessarily modelling a frequency distribution; I'm not sure how you could think it's doing anything else. Self-attention is just a frequency distribution over a freq distribution.

It's literally trained with P(B|A) as the objective....

The 'circuits' you're talking about are just sequences of `|`, ie., P(B | A1 | ... | An)


>Self-attention is just a frequency distribution over a freq distribution.

I don't know where you get that. But it's not my understanding.

>It's literally trained with P(B|A) as the objective....

This is a description of the objective, not a model or an algorithm. The algorithm is not learning frequency data. The algorithm tries to maximize P(B|A), but within this constraint there is a vast range of algorithms.


But can this algorithm ever produce reasoning without learning a whole universe of possible inputs?

Given the evidence that it fails to learn arithmetic, skips inference steps, misassigns symbols, I'd say likely not.


Reasoning is abstracted from particulars. So in principle what it needs to learn is a finite set of rules. There are good reasons that explain why current LLMs don't learn arithmetic and have odd failure modes: their processing is feed-forward (non-recursive) with a fixed computational budget. This means that they in principle cannot learn general rules for arithmetic which involve unbounded carrying. But this is not an in-principle limitation for LLMs or gradient-descent-based ML in general.
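To illustrate the "unbounded carrying" point: the grade-school addition rule is a loop whose number of steps grows with the number of digits, which a single fixed-depth forward pass cannot unroll for arbitrarily long inputs. A minimal sketch of that rule:

    def add_decimal_strings(a: str, b: str) -> str:
        """Grade-school addition: one carry step per digit, however many digits there are."""
        result, carry = [], 0
        i, j = len(a) - 1, len(b) - 1
        while i >= 0 or j >= 0 or carry:   # loop length scales with the input length
            total = carry
            if i >= 0:
                total += int(a[i])
                i -= 1
            if j >= 0:
                total += int(b[j])
                j -= 1
            result.append(str(total % 10))
            carry = total // 10
        return "".join(reversed(result))

    print(add_decimal_strings("999999999999", "1"))  # 1000000000000: the carry ripples through all twelve digits

A fixed number of transformer layers does a fixed amount of computation per token, so it cannot run this loop for arbitrary lengths in one pass (letting the model write out intermediate steps, as with scratchpads or chain-of-thought, changes that picture).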


> the hypothesis that an LLM reasons

Isn't really a well-defined hypothesis, because “reasoning” isn’t well-enough defined for it to be one.


Is it? I think we have a pretty good grasp of what "reasoning" means in mathematics and computer science, in particular with logic. Although to be fair we normally use the word "inference" in maths and CS to avoid confusion with what humans do informally, vs. what we do formally, with computers or without.

But it's clear that the author of the paper above is using "reasoning" to mean formal reasoning, as in drawing inferences from axioms and theorems using a set of inference rules. I think that makes the article's point very clear and we don't need to be splitting hairs about the different possible definitions or understandings, or misunderstandings of "reasoning".


UPDATE: Some commenters elsewhere on this thread are not able to replicate the paper's results.

For example, see https://news.ycombinator.com/item?id=37051450

---

MY ORIGINAL COMMENT:

Based on a quick read, this looks like great work. The author, Konstantine Arkoudas (KA), evaluates GPT-4 on a collection of 21 different types of reasoning problems and performs a detailed qualitative analysis of GPT-4’s performance on these problems. GPT-4 fails to reason in all of them.

Here's a sample question from one of the 21 types of problems. This particular question tests GPT-4's ability to reason and reach conclusions from (a) the given context and (b) unstated premises that constitute tacit, generally accepted background knowledge (a person is alive until they die, after which they do not become alive again):

> KA: Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?

> GPT-4: Based on the information provided, it’s impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can’t make a certain determination about her status at noon. Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional. If this is a hypothetical question, keep in mind that real-life situations would require more complete information.

In my view, all of us in the AI research community should adopt the 21 types of reasoning problems proposed by Arkoudas, and make them part of the standard test suite for measuring LLM performance. We ought to use tests that are hard to pass. I suggest we call the proposed problems the Arkoudas Reasoning Test.


>Based on the information provided, it’s impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can’t make a certain determination about her status at noon. Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional. If this is a hypothetical question, keep in mind that real-life situations would require more complete information.

It sounds like this answer is a victim of aggressive alignment telling ChatGPT not to give medical advice. Even so, it answered correctly for me:

> Based on the information provided, Mable's heart rate was recorded at 9 AM, and her blood pressure was recorded at 7 PM. This indicates she was alive at both of these times. While we know she died at 11 PM, there's no indication of her status at noon specifically. However, since she was alive before and after noon, it's logical to infer that she was also alive at noon.

Edit: I also gave it a similar question with sourdough starters producing CO2 and dates in case that question just got trained for, and it still answered correctly.


ChatGPT in general seems to have trouble with time-based reasoning. In several conversations I had, while it understood that dates are ordered and that events are ordered, and that events are associated with dates, it did not understand that the two orders have to be consistent with each other via that association. Given certain constraints on the order of the dates (or times of day), it was not able to apply that information to correctly infer constraints on the order (“before”/“after”) of the associated events.


...And that's only one example of a reasoning test, out of 21 different types of reasoning tests.

The other tests have nothing to do with time.


I was reporting my experience to give more detail to the argument. I agree with the paper.


Ah, I see what you mean. Thank you for sharing and for clarifying!


ChatGPT answered fine the first try for me:

> Mable's heart rate and blood pressure were recorded at 9 AM and 7 PM, respectively. Since she died at 11 PM, it can be assumed that she was alive at noon, as there is no information suggesting otherwise between the time of the recorded measurements and her time of death.

I was all ready to experiment with 24 hour time, times with dates, etc, but it didn't actually fail.


Couldn't she have been dead at noon in the colloquial sense and then resuscitated at 12:01? I can't tell if that defense is being pedantic, or criticizing its answer is being pedantic.

Maybe both.

Also, I hate how GPT always offers its unsolicited opinions and advice when answering questions:

> Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional.


GPT-4 gave me a concise correct answer without disclaimers.

This may be due to using many custom instructions like the following in ChatGPT settings:

- No need to provide disclaimers about your knowledge cutoff

- No need to mention you are an AI language model

- Only discuss safety when it is not obvious and very important

- You should act as an expert in the relevant fields


These are excellent, although for instructions I've had issues with compound negatives (e.g. "no need") and better luck with single word affirmative imperatives (e.g. "avoid").


My first thought was: what do you mean by alive? A person can be brain dead and alive (with vitals) at the same time. With some assumptions, one can say she was alive at noon.

However, I believe OpenAI has added significant safeguards to prevent any such assumption, to the point where GPT-4 isn’t very useful at times.

This may be fixable with prompt engineering, if it can be persuaded to let its guard down a bit.


That's not unsolicited advice from GPT so much as it's a mandated CYA disclaimer from OpenAI.


Perhaps in the case of medical questions, but GitHub Copilot Chat will start AI-splaining code readability when you ask it questions about concise syntax, so... It's a total mansplainer.


I've already gotten used to not reading the last paragraph of the answer.


> In my view, all of us in the AI research community should adopt the 21 types of reasoning problems proposed by Arkoudas, and make them part of the standard test suite for measuring LLM performance. We ought to use tests that are hard to pass. I suggest we call the proposed problems the Arkoudas Reasoning Test.

This is overly obsequious. Many of the categories are covered by MMLU, BigBench, HellaSwag, et al., none of which the author acknowledges.


GPT-4 Gives an answer first try:

> Based on the information provided:

> Mable had a heart rate of 75 bpm at 9 AM.
> Mable had a blood pressure reading of 120/80 at 7 PM.
> These are both indicators of life. The heart rate shows that her heart was beating, and the blood pressure reading indicates her circulatory system was functioning. Since she died at 11 PM, it's very likely that she was alive at noon.

> So, my best guess based on the given data is: Yes, Mable was alive at noon.

And using chain of thought, asking to think it out, I'm sure it could get much trickier reasoning done.

For time-based questions, you could also ask it to build a timeline first, and then answer the question. For more generic question types, you could ask it to first list out the important facts and how they might be relevant, and then ask it to answer the question.
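As a sketch of what that two-step "timeline first, then answer" prompting could look like (my illustration, using the openai Python client of the time; the model name, prompt wording and client details are assumptions, and newer client versions differ):

    import openai  # 0.27-era client; the 1.x client uses a different interface

    question = (
        "Mable's heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM "
        "was 120/80. She died at 11 PM. Was she alive at noon?"
    )

    # Ask for a timeline first, then the answer, so the model lays out its
    # working before committing to a conclusion.
    messages = [
        {"role": "system", "content": "Think step by step."},
        {"role": "user", "content": (
            "First, list every event in the question on a timeline in "
            "chronological order. Then answer the question.\n\n" + question
        )},
    ]

    response = openai.ChatCompletion.create(
        model="gpt-4-0613",  # pin a model version rather than relying on the default
        messages=messages,
        temperature=0,
    )
    print(response.choices[0].message.content)

Whether the extra tokens constitute reasoning or just better conditioning is, of course, the argument of this whole thread.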


> And using chain of thought, asking to think it out, I'm sure it could get much trickier reasoning done.

When I read comments like this I just translate it to, "I could coax an answer I like better out of it".

Doesn’t really sound like the system is reasoning but you are.


You can get a lot more out of LLMs by asking them to spend more tokens thinking it out. It's a tool, and that's just a technique to get more out of it.


You converted a reasoning problem into a binary classification problem. Congratulations, you did the thinking for the AI.


I haven't read the paper beyond this one section - but I plugged this question into GPT-4 and got a similar response. However, if I used military time (replacing noon with 12:00 as well), then GPT does get it right. Granted, it still hedges much more than any normal person would. But basically I wonder if it's struggling especially with the 12-hour clock concept


>> I suggest we call the proposed problems the Arkoudas Reasoning Test.

Please don't. In Greek that's "The Bear-man Reasoning Test" ("Arkoudas" is the man who has something to do with bears, in Greek).


Actually, technically, it's possible that she was dead in the sense that her heart stopped at 11:59 AM, and then she was revived via CPR. So maybe GPT-4 is right here :)


If that's the reason it should be able to explain why, preferably in step by step fashion.


> blub is a forp and all forps are tworby. Is blub tworby or not?

GPT4:

Given the statements:

    Blub is a forp.
    All forps are tworby.
From the second statement, it is clear that every entity that is a forp is also tworby.

Thus, since Blub is a forp (from the first statement), Blub must be tworby.

There are humans that can't do this. So even if it's just a stochastic parrot, the stuff it's parroting does encode some reasoning, so it would be strange if an arbitrarily powerful parrot (not claiming that GPT-4 has reached that level) couldn't pattern match its way to some common kernel of reasoning, small as it may be.


I've had a computer in my home which can remember a million random integers in order, and I've had one for decades. No human can do this. Exceeding what humans can do is in no way the bar for reasoning.


of course not. But exceeding what humans can do in terms of reasoning is not an unreasonable bar for reasoning.

---

> My friend Cathy is capable of memorizing really long lists of random numbers. Does that mean she is good at reasoning?

The ability to memorize long lists of random numbers is a specific cognitive skill related to memory. Cathy's ability demonstrates an impressive short-term or working memory, and perhaps also a good long-term memory if she can recall those numbers over extended periods.

Reasoning, on the other hand, refers to the process of forming conclusions, judgments, or inferences from facts or premises. It's more about logical thinking, problem-solving, and the ability to draw conclusions from information, rather than rote memorization.

While a strong memory can be an asset in many reasoning tasks (because it allows a person to quickly call upon relevant information), it doesn't directly indicate one's reasoning abilities. In other words, just because Cathy is excellent at memorizing numbers doesn't necessarily mean she's also good at reasoning. However, it also doesn't mean she isn't. The two skills can be related in some contexts, but they're not the same thing.

If you wanted to evaluate Cathy's reasoning skills, you'd need to look at her performance in tasks that require logical thinking, problem-solving, and other related abilities, rather than her memory capabilities.

> Cathy appears to be better at reasoning that a random internet commenter 'barkel'. Does that mean Cathy is good at reasoning?

Making comparisons between two individuals, Cathy and "barkel" in this case, doesn't provide an absolute measure of either person's reasoning abilities. If Cathy appears to be better at reasoning than barkel based on specific interactions or examples, it simply means she displayed better reasoning skills in those instances compared to barkel.

To truly determine if Cathy is "good" at reasoning, you'd need a more comprehensive evaluation of her reasoning abilities across various situations and challenges. Comparing her skills to one other individual doesn't provide a full perspective on her overall capabilities.

Remember, reasoning abilities can vary based on the subject matter, context, emotional state, and many other factors. So, while Cathy might be more effective in certain situations or on certain topics, it doesn't necessarily make her universally better at reasoning.

For an objective assessment of Cathy's reasoning skills, you'd need more consistent and broad-based evidence or testing, rather than a comparison with a single individual on the internet.

> If cathy is better than the median person at reasoning, is she good at reasoning?

If Cathy is better than the median person at reasoning, it means she scores higher than at least 50% of a given population in reasoning skills. In a relative sense, this would suggest that she is above average compared to that particular population.

However, the term "good" is somewhat subjective. In this context, being above the median does suggest that Cathy has a higher than average ability in reasoning. But whether that qualifies as "good" is a matter of perspective and can depend on the standards or benchmarks you're using.

If you're evaluating her skills based on general societal expectations, then yes, being above the median would likely be considered "good." However, if you're comparing her to a group of expert logicians or professional problem solvers, the standards might be different.

In summary, Cathy being better than the median person in reasoning suggests she has above-average skills in that area, but whether that is deemed "good" can depend on the context and standards you're applying.


>of course not. But exceeding what humans can do in terms of reasoning is not an unreasonable bar for reasoning.

It isn't, because you presuppose it reasons


You're projecting.

That "it doesn't reason" in your meaning seems to depends on unknowable suppositions that are a quasi-metaphysical characterisation of the process of reasoning.

I'm defining reasoning from a set of observable outputs. The Chinese room isn't a thing.


You presuppose that it reasons AND humans reason.


There being "no such thing as reasoning" is an amusing cop-out.


That's not my point at all


There are humans who can't add two four-digit numbers without scratch paper, too. So what?


Think about how dramatic a change it was when computers started being able to efficiently add those numbers and got better at it than humans. Now, no matter what the human's capacity, a huge number of them have a computer in their pocket that can fill in the gaps. And that change was wildly disruptive. Same as when machines started to do tasks that humans couldn't. Any time a relatively accessible technology overtakes humans in a thing that used to be human-only and is important to civilization, it's been highly disruptive. If GPT's reasoning is good enough to replace, say, the lowest level of phone tech support or the like, those are already jobs that'll be lost.


That "if" is doing a lot of legwork here. Most notably, it's concealing an implication that the LLM may be reasoning in this example, and not merely fitting tokens from its input into a statistical reconstruction of the most basic and thus likely most commonly encountered proof in elementary symbolic logic.

That said, you're not wrong that this seems "good enough" to be wildly disruptive. Part of that disruption, which is already ongoing, is the development of a whole new class of exploits. I expect once companies start firing phone support representatives in favor of LLMs, and similar, this process will accelerate in fascinating ways.


I think GP's point is that you can reach something approximating reasoning by brute force.

In the case of your example, you don't need to encode an algorithm to do 4-digit addition, just remember that 1000+1000=2000, 1000+1001=2001, etc.
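A toy illustration of that distinction (mine, using 2-digit numbers to keep the memorized table small): the same "addition" behaviour delivered either by the general rule or by pure recall, indistinguishable from the outside until you step off the table.

    from itertools import product

    # Brute force: memorize every 2-digit sum ever "seen" during training.
    LOOKUP = {(a, b): a + b for a, b in product(range(100), repeat=2)}

    def add_by_lookup(a: int, b: int) -> int:
        return LOOKUP[(a, b)]   # no algorithm, pure recall

    def add_by_rule(a: int, b: int) -> int:
        return a + b            # the general rule, works at any size

    print(add_by_lookup(37, 58), add_by_rule(37, 58))  # 95 95 -- identical behaviour
    print(add_by_rule(4321, 1234))                     # 5555
    # add_by_lookup(4321, 1234) raises KeyError: outside the memorized table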


So what? In your context, you would say "computers can math better than humans". I think that's uncontroversial.

Crazy thing is, your difficulty with this syllogism proves my point about the median human.


Whatever their capacity for reason or lack thereof, I'll give LLMs this: I've never yet seen one resort to ad hominem for the lack of any more substantive counterargument.


The counterargument is substantive, and you're committing fallacy of fallacies anyways.


What syllogism of yours am I meant to be having trouble with here?

Would you like to qualify your attempt to impugn my intelligence rather than countering my argument as something other than what I called it? The "fallacy fallacy" is a rather silly thing to cite in this connection; while there's merit in the point that a fallacious argument does not necessarily err in its conclusion, as far as I can tell you have not made an argument here. It is difficult to respond on substance whose presence is not yet evident.


So what not? This is still incredibly impressive technology.


Yes, and it's incumbent upon us all not to let first impressions run away with us to the extent of making unsupportable claims, even by implication.


I'm sick and tired of the generally dismissive attitude towards these LLMs. They're no messiah, but they're for sure not useless.


I think you must have meant to reply to someone else, since I haven't written off LLMs as useless here or indeed, to the best of my recollection, anywhere.

That said, I suspect dismissive attitudes toward LLMs would be less common if messianic claims on their behalf were likewise.


Substituting Blub for Socrates, forp for man, and tworby for mortal is filling in a Mad Lib pattern with made-up words, not reasoning.


This is a critical take. LLMs do not need to match human reasoning by some arbitrary standard to displace millions of workers whose speed and/or effectiveness of reasoning is weak or inconsistent.


Well if that is the metric, then reasoning has long been solved:

  GNU Prolog 1.5.0 (64 bits)
  Compiled Jul  8 2021, 09:35:47 with gcc
  Copyright (C) 1999-2022 Daniel Diaz
  
  | ?- [user].
  compiling user for byte code...
  blub << forp.
  forp << tworby.
  Forp << tworby :- Forp << forp.
  
  user compiled, 3 lines read - 401 bytes written, 6194 ms
  
  yes
  | ?- findall(X, X << tworby, X).
  
  X = [forp,blub]

The problem is that there are two ways to do inference (and theorem proving): either you memorize the reduction rules, or you deal with the combinatorial explosion. The former is Prolog and the latter is SAT/SMT solvers. People seem to expect that neural networks predict what the result would be if inference had been done - without actually doing the inference. It's possible to exploit local features, but not to skip inference entirely in general. Note that inference can use a lot of memory/scratch space as well. At that point, why not just use an external tool? I'd seem much smarter if I could query Prolog directly from my brain. Hell, I'd sell my left arm to be able to do that.

Also, note that those statements are not hygienic, and that this assumes a certain logical interpretation of the sentences that isn't universal. We can also ask annoying questions like: is 'all' intensional or extensional? Say I invent a new thing called swerb, and swerb is a forp now. Is it retroactively a tworby, because the definition of being a forp means it is a tworby, or is it just that at the point in time of the original assertion all forps were tworbys (so the swerb wouldn't be)? There are no good ways to resolve this without back and forth and contextual guessing, or using formal languages.

Since there is no One True Logic, the common kernel of reasoning might as well be computation itself.


I think you're missing the point. Of course Prolog can reason better than humans; that's what it was designed to do (deterministically).

The point is not to solve reasoning. The question is, can LLMs reason?

LLMs were not designed to reason; reasoning in an LLM is emergent. That should be interesting.

It should also be exciting because the domain over which LLMs can reason is much more unbounded than the domain over which Prolog can reason (tokens and relationships you've already supplied it).


The probability that a Stochastic Parrot returns coherent reasoning seems vanishingly small.


Are you saying that GPT is not a stochastic parrot, or that GPT is not returning coherent reasoning?

Because if it's the latter, the evidence is rather against you. People seem to like to cherry-pick examples of where GPT gets reasoning wrong, but it's getting it right enough millions of times a day that people keep using it.

And it's not as if humans don't get reasoning wrong. In fact the humans who say GPT can't reason are demonstrating that.


Why do you say that? You don't think stochastic pattern matching can feature reasoning as an emergent property? I do.

A stochastic parrot doesn't just mimic things totally randomly. It reinforces what it's seen.


I keep getting surprised at how a large chunk of HN's demographic seemingly struggles with the simple notion that a black box's interface informs surprisingly little about its content.

I'm not saying that GPT-4 is reasoning or not, just that discounting the possibility solely based on it interfacing to the world via a stochastic parrot makes no sense to me.


Isn't "reasoning" a functional property though? If from the outside it performs all the functions of reasoning, it doesn't matter what is happening inside of the black box.

Here's a silly example I thought of. We can ask whether a certain bird is capable of "sorting". We can place objects of different sizes in front of the bird, and we observe that the bird can rearrange them in order of increasing size. Does it matter what internal heuristics or processes the bird is using? If it sorts the objects, it is "sorting".

To me, it seems perfectly obvious that GPT-4 is reasoning. It's not very good at it and it frequently makes mistakes. But it's also frequently able to make correct logical deductions. To me this is all stupid semantic games and goalpost-moving.


> Isn't "reasoning" a functional property though? If from the outside it performs all the functions of reasoning, it doesn't matter what is happening inside of the black box.

Yes, that's my point exactly.


Replace forp and tworby.

How common is that pattern? I would expect quite common. So if the model can do some substitution, it could solve it just by swapping in the right words.


Arithmetic problems have nothing to do with reasoning, and are a known weakness due to the tokenization schemes in these models.

Secondly, finding failure modes doesn't mean that the model doesn't have any reasoning ability. Humans can reason, despite the fact that high school students are pretty bad at formal logic.

So, the conclusion is over broad, and the paper fails to incorporate existing knowledge about these models. Kinda crap.


That's remarkably dismissive. He addresses your argument right away in the paper, and in multiple places. Did you read it?

It's not about testing the ability to do arithmetic. It's testing the ability to plan a reasoning process / argument.

It's also not the sum of his paper, only one section.

The problem is not that GPT-4 can't do the math problems; it's that it can't reason out how it would begin to approach a math problem -- it would be totally okay for it to get them wrong, if it was actually making an attempt and could show evidence of working through them.

Instead it just produces "answers" which are a statistical guess based on other things it has seen on the internet. It's true humans do this, too -- often a first lazy approximation for a problem -- but the key difference is a human can reason out, through interrogation and introspection, where they might have gone wrong. GPT-4 appears to be unable to do that.

And worse, my experience with these systems (and the paper's) is that during dialogue about errors they actually rapidly degrade in quality of answer.

I'm probably as bad as a high school student at formal logic, too. But if you sit down with me with a problem and we talk about it, and I'm interested, it will become evident I am capable of reasoning through it, even if I make mistakes. That's not the case with GPT-4.


> Instead it just produces "answers" which are a statistical guess based on other things it has seen on the internet.

It boggles my mind that folks expect otherwise from a Machine Learning tool, no matter how advanced and stuffed with data it may be. Perhaps it's the same phenomenon that causes us humans to see faces in clouds, smiles on dogs, and Jesus' likeness on toast?


Somewhere I definitely read about how human psychology makes us prone to that sort of thing. Even as far back as Eliza, cognitive scientists were commenting on how our thinking can be fooled.


I think there's an ideological bias in our culture that pushes people to believe that intelligent or structured phenomena inevitably emerge organically and progressively from complex phenomena.

Teleological thinking -- a kind of imagining of purpose and cause from chaotic/natural events and entities -- riddles popular thinking, especially from people in our profession. Science fiction is especially full of it.

It's not just restricted to this domain at all. IMHO similar bias underlies thinking around economics and the magical hand of the free market economy.

It's also a bias evident in the way some people talk about nature, gardening, etc. E.g. permaculture / natural farming people show it all the time.


> I think there's an ideological bias in our culture that pushes people to believe that intelligent or structured phenomena inevitably emerge organically and progressively from complex phenomena.

All science points to this being the case, for us. I think the only ones opposed are those that believe in young earth creationism, and only some portion of those that believe in old earth creationism.


Why would it boggle your mind? This thread is full of AI proponents insisting that GPT reasons


It does state: "The ability to perform basic arithmetic is a necessary ingredient for reasoning." which doesn't seem obvious to me at all.


I read that as "perform", not "get the right answer".

Performing arithmetic is one kind of reasoning process. But getting the answers right is not necessarily the same as performing.

If you go on to read, what he's trying to test is the system's ability to even attempt to plan out a problem solving "route". Which it doesn't really do. If it could, it could defer to another system (fancy calculators or solvers) to do the work. But its lack of ability to reason means it can't even be made to do that.

(EDIT: I do think the paper would be stronger if he put the math and formal logic etc problems later. E.g. the problem he puts forward in 3.14, 3.15 etc is more immediately damning as it reflects the "kind" of daily life reasoning that people would expect these systems to be able to perform.)


A language model sees a pile of examples with digits and imitates those examples. A reasoning model sees the inner principle behind this pile, and instead of imitating examples, it uses the learnt principle to produce answers.


How do you know this? What's an example of a "reasoning model"?

If the only example is the human mind, for all we know our reasoning capability and ability to discern principles could work much the same way, and it's just some more subtle differences that lead to the differences in capabilities. There are plenty of cases where it appears as though GPT has discerned the "inner principle" behind something to produce answers.


Language models aren't really optimized for imitation though, they're optimized to predict. One means of prediction, which models have found to be effective in many contexts (especially when short on training time/compute), is comparable to imitation.

But this isn't to say that language models are incapable of establishing "inner principles".


This paper is not even reproducible lol. It makes a nonsensical claim it can't even back with results. Look at multiple comments here actually trying them out.


> GPT-4 Can't Reason

This is absolutely dismissive of the claim that an advanced LLM is capable of "reasoning", i.e. the action of thinking about something in a logical, sensible way.

That is the sum of the paper. Further, the author even goes on to say that if they asked a human these questions, they would conclude the same:

> Of course, even sophisticated human reasoners make mistakes, just like trained singers can hit false notes. But if a human made these mistakes, the ones reported in this article, then I would conclude without any hesitation that they cannot reason. Even if they went on to list a large number of other examples demonstrating impeccable reasoning, I would suspect that other factors (such as rote memorization or cheating) were behind the performance discrepancy.

So the author admits their own biases, which are used to bolster the argument that, if reasoning appears to be lacking in an answer, the system or entity itself is absolutely incapable of any reasoning and something else must explain why it appears to be reasoning in the first place. That's a VERY convenient way of dismissing any evidence that counters the claim.

> The problem is not that GPT-4 can't do the math problems

The problem is that the system was not allowed, or given a path, to answer the math problem in a language better suited to analytical questions: code. That the author "denied" the LLM the ability to write code is the issue here, not the model's interface limitations. An analogy: if a user writes in English and asks a question that requires Pali to answer, the LLM is "prevented" from answering in Pali unless the user says they can understand it. In the same vein, it doesn't make sense to output Python by default if the system doesn't know whether the user can understand or run Python.

If you say "I understand Python. Select two random numbers between 1381 and 1453 and multiply them together, reporting the result." the LLM will be capable of answering this question by generating code to solve the problem. This is likely to work every single time any type of question like this is asked, but it does require the user to "run" the code.
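For what it's worth, the sort of snippet it tends to produce for that prompt looks roughly like this (a sketch, not a verbatim GPT-4 transcript):

    import random

    # Pick two numbers in the requested range and report their product.
    a = random.randint(1381, 1453)
    b = random.randint(1381, 1453)
    print(a, b, a * b)

Run that and the arithmetic is handled by the interpreter rather than by next-token prediction, which is exactly the point.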

GPT-4 has the ability to do this with code interpreter, so the question becomes "why did OpenAI choose to make the user explicitly indicate that code can be written?" The answer likely lies in the fact that not everyone can read or run Python, a coding language, and therefore it remains an OPTION for the user to choose first. By not allowing the LLM to show its answers to analytical questions in code, the author "blocks" the LLM's ability to show off reasoning. And by treating the resulting failures as "proof" of non-reasoning, the author gets what they want.

From a scientific standpoint, a good hypothesis must be falsifiable, and that applies to claims about reasoning ability. If an experiment is run on a hypothesis that is absolute ("this thing can't reason"), then the results are not science but opinion.


What can I say, we've seen a piiiile of lazy dismissals of LLM work based on examples from arithmetic and string manipulation. They aren't novel or interesting.


My recollection of calculus disagrees with your initial proposition. When the algorithm pool grows wide enough, you have to apply reasoning to pick the correct strategy.


Yes, try doing brute force theorem proving using something like Isabelle/HOL sledgehammer optimized search. There are cases where it will try brute force essentially forever when asking one or two questions solves the problem. The system has some heuristics to ask questions and specify things it found incomplete.

GPT is meant to interactively ask those questions. It fails at it.


It would be interesting to measure humans against these scores as well. I can fail arithmetic pretty often too; does that mean I’m non-human or can’t reason?


Hm, arithmetic is an a priori judgement. So, how is this not reasoning?


> Arithmetic problems have nothing to do with reasoning

Then try to introduce GPT-4 to Peano's axioms and try to make it "understand" arithmetic that way? Oh, wait, it could already lecture you about them.


I think we have to see LLMs as their own weird thing, whose capabilities are only just being seen now.

For a lot of the time, ChatGPT does actually act like it can reason. Going through a bag of data and answering a question you hadn't heard before is reasoning. For instance right now, I've been asking it how to move a postgres database from one machine to another, and it gave a coherent answer that works.

Of course it's true that this information was on the internet in various forms already, but if you gave this task to a junior dev and asked him to figure it out, you wouldn't say the kid couldn't reason, would you? Even if it was slightly wrong, it wouldn't cross your mind that he hadn't substantially understood the task and made progress on it.

OTOH, there are cases when the LLM just doesn't get it. Most commonly with images, eg the famous hands problem. Somehow even after looking at countless images with hands in them and having access to countless anatomy books, it doesn't know what shape a hand can take and what shapes it can't take. It seems to not have a model of _why_ hands can be these different shapes but not those different shapes.

Perhaps this is to do with LLMs being particularly good at text, I wouldn't know. It does seem to me like I've never seen it give a grammatically incorrect answer to anything, ever. Even when it answers something gibberish, it answers it in correct English.


>I think we have to see LLMs as their own weird thing

Well, nobody seems to be able to reproduce the results of this "paper" anyway lol, but I agree with you here. LLMs are sure to have weird failure modes even if they are "truly reasoning", just like biological systems often have weird failure modes that only make sense in the context of biology.


We are not going to prove or disprove "reasoning" through giving the LLM word problems. LLMs subvert the entire foundation of word problems, which is that words correlate to internal representations and are an indicator of thought processes. Word problems don't have construct validity for testing reasoning in LLMs.

On top of this, there is an almost-certainty that OpenAI has teams of contractors reading as many conversations as possible and hand-fixing bad responses, which makes non-reproducibility a difficult concept when the object of inquiry can change from moment-to-moment.

What the field needs is not more people thinking up word problems but rigorous analysis of the internal behavior of these models and maybe more importantly a functional definition of terms like "reasoning" that everyone can agree on.


Or you could prove it does not reason by adversarially generating correct simple logic puzzles in the same class, with known answers. Rephrasing the sentence structure, swapping in thesaurus words, or slightly modifying the initial conditions should not produce invalid reasoning explanations or results.

Essentially the sufficiently complex text form of sudoku.


I tried feeding it some lateral thinking games (black stories) from https://detective-stories.com and it was surprisingly bad at following the game.

I had hoped to use the API to implement a bot player, but I found it to be too unreliable with its "understanding".


I think the fundamental flaw of this paper is that it's _starting_ from the assumption that it can't reason and taking any demonstration of flawed reasoning as evidence that it can't reason _at all_, but there are many examples of ChatGPT output that I would argue aren't possible _without_ some form of reasoning, and even a _single_ example of that is proof that it can reason, no matter how many failures and counter examples there are.

It seems to me that focusing on understanding exactly how and under what conditions it can and can't reason would be a much more interesting paper than making a blanket, totally unsupportable claim that it _can't_.


You can argue they're not possible without reasoning, sure. But how do you prove that?

Proving that it repeatedly fails at multiple classes of reasoning problems is much harder evidence than positive examples that seem right.


But proving that it can't reason about _any number_ of problems doesn't prove that it can't reason. It doesn't matter how many negative cases there are if there's a _single_ positive case.

You can observe any number of white swans and that will never be proof that black swans do not exist, but a single observation of a black swan does prove that they exist.


I think we should run these types of experiments with a deterministic model.

Is Llama 70B deterministic? Then it could be a good option.

In the article, it says

    To ensure that GPT-4 isn’t falling back on
    rote memorization, we can ask it to first
    select two random integers
And then they start their prompt with

    Select two random numbers between 1381 and 1453
    and multiply them together, reporting the result.
What does that even mean? What type of randomness is at play here?


It means they want to see GPT-4 perform a multiplication whose correct result it likely doesn't have memorized. Randomness does not really matter here. They just need GPT-4 to pick numbers in that range in whatever arbitrary manner it chooses. They got what they wanted:

> GPT-4: Sure, let’s select two random numbers in the range of 1381 to 1453. Let’s say these numbers are 1405 and 1421. To get the product, we simply multiply these two numbers together: 1405 * 1421 = 1996025

> Alas, the correct answer is 1405 · 1421 = 1996505.
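The gap is easy to check yourself in a Python shell:

    >>> 1405 * 1421
    1996505
    >>> 1996505 - 1996025  # how far off GPT-4's answer was
    480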


But that's arithmetic, not logical reasoning.

That's not an LLM's purpose. Also I'm a human and would struggle quite a bit to multiply 1405 and 1421 in my head, and nobody expects me to. I think when we test an AI we should use tests in which humans weren't already beaten by machines half a century ago.


Most LLMs are deterministic, but the tooling around them samples randomly from the output to let users explore the nearby space of responses without having to come up with infinitely nuanced prompts. You can turn this off.

However, the structure of OpenAI's GPT-4 is not deterministic. The most likely explanation I've seen is that they only activate some parts of the model for each input, and the parts are load-balanced so sometimes a different part of the model will be responding. https://news.ycombinator.com/item?id=37006224
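If you want to strip out the sampling randomness when testing this yourself, the usual approach (a sketch using the pre-1.0 openai Python package that was current at the time; pin whichever dated snapshot you care about) looks roughly like this, though as noted below GPT-4 still isn't fully deterministic even at temperature 0:

    import openai

    openai.api_key = "sk-..."  # your API key

    # Pin a dated snapshot instead of the moving "gpt-4" alias,
    # and set temperature to 0 to minimize sampling randomness.
    resp = openai.ChatCompletion.create(
        model="gpt-4-0613",
        temperature=0,
        messages=[{"role": "user", "content": "Was Mable alive at noon?"}],
    )
    print(resp["choices"][0]["message"]["content"])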


This non-deterministic sampling is not only for users to explore the space of responses. Without this, the LLM itself is prone to generate too-repetitive text.


> they only activate some parts of the model for each input

Perhaps you see seemingly random results because OpenAI is A/B testing multiple versions, or different combinations of hyperparameters, so that they can train GPT-5.


Nah; the paper mentioned above (from a few days ago here on HN) shows how GPT-4 is nondeterministic because the sparse mixture-of-experts technique used is nondeterministic based on batch positioning.


> You can turn this off

Not entirely. Even with temperature = 0, GPT4 is non-deterministic.


> GPT4 is non-deterministic.

For the curious reader: https://news.ycombinator.com/item?id=37006224

It appears that it could "easily" be made deterministic.


That article went past my level of expertise, which suggests that "easily" is, as you imply, a matter of perspective. It's possible the current behavior is a result of tradeoffs made for performance or cost. Modifications to make the model deterministic could depend on making unacceptable tradeoffs.


> What does that even mean? What type of randomness is at play here?

This is you reasoning.

An AI that is actually reasoning will ask itself such questions and then either ask for clarification or make reasonable assumptions.


Funnily enough, you can easily use ChatGPT 3.5 for this:

> Write a python one liner that I can copy and paste into my shell. (Assume the python interpreter is available with the python command.) This one liner should pick two random numbers between 1337 and 2023 and multiply them.

And it answers

> python -c "import random; print(random.randint(1337, 2023) * random.randint(1337, 2023))"


KA: Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?

GPT-4: Based on the information provided, it’s impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can’t make a certain determination about her status at noon. Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional. If this is a hypothetical question, keep in mind that real-life situations would require more complete information.


I just tried this, and GPT-4 said:

Based on the information provided, Mable was alive at noon. Her heart rate was recorded at 9 AM, and her blood pressure was recorded at 7 PM, both of which are signs of life. She was reported to have died at 11 PM, so she would have been alive at noon.

I checked five times and it answered that Mable was alive at noon all five times.


haha, I just asked a quantized version of llama 2 13b model using llama.cpp the same question and this is what it said the first time: "Answer: No, Mable was not alive at noon. Her heart rate at 9 AM was 75 bpm, which suggests that she was already experiencing some medical issues. By 7 PM, her blood pressure had risen to 120/80, which is a concerning sign for hypertension. Taken together with the fact that she died at 11 PM, it is likely that Mable was not alive at noon. [end of text]"

and a second time it said: "Solution: Mable's heart rate at 9 AM was 75 bpm, which means that it was not abnormal at that time. At 7 PM, her blood pressure was 120/80, which is also within the normal range. Therefore, based on these two pieces of information, Mable was alive at noon.

Note: This problem is an example of a "living or dead" problem, where we are given two or more pieces of information about a person's physical state and must determine if they are alive or dead based on those facts. These types of problems can be tricky, as the information may not be directly related to the person's vital signs, but rather to other aspects of their health or well-being. [end of text]"

If I got the second answer the first time I'd definitely be impressed. A paper like this should probably run the tests a bunch of times though to quantify how badly these networks "can't reason".
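Something like this would at least put a number on it (ask() here is a stand-in for whatever backend you're testing -- llama.cpp, an API, etc.):

    # Repeat the same prompt N times and tally how often the answer is right.
    def ask(prompt: str) -> str:
        # Placeholder: swap in a real call to your model of choice.
        return "Yes, she was alive at noon."

    prompt = "Mable died at 11 PM. Was she alive at noon? Answer yes or no."
    runs = 20
    correct = sum("yes" in ask(prompt).lower() for _ in range(runs))
    print(f"{correct}/{runs} correct")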


Here is the output of llama2-7B-uncensored (a fun model to play with, btw.):

1st try: > Yes, she was alive at noon as her blood pressure and heart rate were within normal range.

2nd try: > No, Mable was not alive at noon as she died 4 hours before that. Her blood pressure and heart rate were both normal at 120/80 mmHg and 75 bpm respectively during the time when her death occurred between 11 PM to 12 AM.

3rd try: > If Mable's heart rate was 75 bpm at 9 AM and her blood pressure was 120/80 mmHg at 7 PM, then it is likely that she was alive at noon.


Which quantized version?


the llama-2-13b-chat.ggmlv3.q4_0.bin version from here: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML


Mine also correctly answered that she was alive at noon, but added its standard infuriating disclaimer in the last sentence:

> However, always remember that real-life medical situations can be complex, and this is a simple logical deduction based on the data provided. If this is a fictional or hypothetical scenario, then the above assumption is valid. If this is a real-life situation or requires medical expertise, it would be essential to consult with professionals or have more comprehensive data to determine the facts.


I don't think it's infuriating that OpenAI tries to discourage people from getting medical advice from an LLM that was not designed to provide it.


I just wish there was a “treat me like an adult” button.


The people that write such "reviews" of ChatGPT must have no respect for their careers. To include statements that can be disproven in 30 seconds is ... something else.


How do you know they didn't change the model?


Do you know whether you are using the same version of GPT-4 as the authors of the article did?


I don’t, but for most HN readers, the latest user-facing model’s capabilities are most relevant.


In fairness, Mable might have died at 11:59 and been resuscitated at 12:01.


I wonder if “well technically…” is useful, or if it’s almost always preferred for a conversational AI to speak with common sense.


Or if it produces uncommon results, it has to spell them out with a detailed enough explanation, not "I need more data".


It may not know what noon means here.

I had an experiment where I asked GPT-3 whether it agrees that AI-generated content will constitute 20% of all content in 9.999 years from now. It agreed. Then I said that it is very likely that subsequently, AI generated content will constitute 99.999% of all content in exactly 10 years from now. It agreed also.

GPT-4 was able to realize that’s too steep of an increase and was skeptical.


But the response clearly started with "Based on the information provided". Your premise is just made up to prove a point I guess.


I got it to say:

> Based on the information provided:

> 1. Mable's heart rate at 9 AM was 75 bpm.

> 2. Her blood pressure at 7 PM was 120/80.

> 3. She died at 11 PM.

> It is evident that she was alive at both 9 AM and 7 PM. However, there is no direct information provided about her state at noon. Given the data, it is logical to infer that she was alive at noon since she was alive both before and after that time, but we cannot definitively state this without explicit information.

This does only seem to happen sometimes. For most of my attempts, GPT-4 gets it right the first time, but not always.


That’s interesting, GPT-4 is actually quite good at these types of reasoning problems. This was the big step change between 3/3.5 and 4.

Are you confident you’re talking to GPT-4, and not another chatbot?


Even ChatGPT 3.5 can answer correctly if you ask just "She died at 11 PM. Was she alive at noon?". My theory is that this is an adversarial example that adds irrelevant information (bpm, blood pressure, heart rate) that the model could have given more attention than the relevant part of the question.


I copy-pasted the exact prompt into GPT-4 (not the API, the web app) and regenerated the answer 5 times.

Every time it came back with a conclusive yes. Are you sure you used gpt4 and not gpt3.5? I guess cherry picking is done both ways.


The reader needs to know something about the real world that isn't written in the question. You need to know what to pull in from the world. So I can see why it might be tricky.


To be fair, it's not obvious which noon is being referred to. there is a noon after 11pm, at which time she would be dead.


This is the GPT-3.5 response, not GPT-4.


This is the GPT 3.5 response from now:

>Based on the information provided, Mable's heart rate was 75 bpm at 9 AM and her blood pressure was 120/80 at 7 PM. However, her status at noon is not directly mentioned in the information you provided. It is not possible to determine whether she was alive at noon based on the given information alone. Other factors and information would be needed to make that determination.

Similar but not the same.


I'm not an expert in this field, and I'm just partway through this paper, but I'm already impressed. Very well written and argued.

A good antidote to the "Sparks of Artificial General Intelligence" paper that was making the rounds and getting headlines, which was I think really a press release masquerading as a paper.

Love it: "if a human made these mistakes, the ones reported in this article, then I would conclude without any hesitation that they cannot reason"


We know that formal logic can be based on a small number of fundamental rules. Same with computation (lamda calculus, Turing machines, ...), mathematics (axioms), etc.

We know that theorems can be proven using a combination of search, rule application and heuristics. E.g. back in 1950s, Logic Theorist (https://en.wikipedia.org/wiki/Logic_Theorist) proved 38 of the first 52 theorems in chapter two of Whitehead and Russell's Principia Mathematica, and found new and shorter proofs for some of them.

We know that language models are good at transforming text, e.g. they can convert a sentence in English to Python code.

We know that language models have only a fixed computing budget per token, i.e. they cannot stop to think.

We know that logic puzzles and proofs might require a considerable amount of computations, e.g. to search through a tree of possibilities, backtrack and so on.

If we believe that reasoning is kinda like logic, we'll be better off using the LLM to translate reasoning tasks into computing tasks to be solved by specialized computing tools (such as a Python interpreter, theorem prover or SAT solver) instead of asking the LLM to reason directly.
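As a trivial example of that translation: the "all forps are tworbys"-style puzzles reduce to a few lines of brute-force model checking once they're in propositional form (a sketch; the encoding is mine, not anything GPT-4 produced):

    from itertools import product

    # p = "a is a forp", q = "a is a tworby"
    variables = ["p", "q"]
    premises = [
        lambda m: (not m["p"]) or m["q"],  # every forp is a tworby
        lambda m: m["p"],                  # a is a forp
    ]
    conclusion = lambda m: m["q"]          # therefore a is a tworby

    models = (dict(zip(variables, vals))
              for vals in product([False, True], repeat=len(variables)))
    entailed = all(conclusion(m) for m in models if all(p(m) for p in premises))
    print(entailed)  # True: the conclusion holds in every model of the premises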

Of course, GPT-4 is trained to be over-confident in its reasoning capability, and it will try to reply immediately, essentially just guessing the answer, and quite often it would fail. But the question "Can GPT-4 reason with the default assistant prompt?" is different from "Can GPT-4 reason?".

Even without external tool, we can ask GPT-4 to translate the problem into primitive fragments in excruciating detail, and to consider all possibilities, and it might work much better than the default prompt.

Given that GPT-4 is essentially just weights, I'd consider the "Can GPT-4 reason?" question to be more like "Is there a prompt X such that, prepended to reasoning tasks, it produces correct answers?", not "If I enter my question into a box, does it give the right answer?". So this paper's author makes a bit of a category mistake; it's more like "Can ChatGPT (the product) reason?".


Code interpreter likely solves most of these: https://aditya-advani.medium.com/mitigate-gpt-4-hallucinatio...

I argue that with Code Interpreter, GPT-4 can indeed reason in lots of cases, although it's more brittle and expensive than it seems to be on the very polished surface level. Working on proving this in lots of cases.


Because of a conversation on HN a few weeks ago, I tried to play a board game against ChatGPT that I had created, and it was an absolute disaster [1]. I would have expected the model to at least understand the game rules, but not even that.

[1] https://joseprupi.github.io/misc/2023/06/08/chat_gpt_board_g...


(Almost?) nothing in the paper reproduces, as anyone who is willing to spend a few cents on the API can confirm. Even if you pin the version to before the release of this paper.

The author was formerly an MIT researcher; how is it possible they have produced this nonsense?

I don't mean to be glib, but do credentials mean nothing anymore? Does this happen in other fields, except that a layman can not test out the claims in e.g. a medical paper for themselves?


> it is not accurate to say without qualification that “humans can reason,” certainly not in the sense that we can randomly pluck any person from the street and expect them to reliably perform normatively correct reasoning

> Reasoning is not quite the same thing as intelligence, but it’s a necessary ingredient for it

According to this, a typical human on the street is not (reliably) intelligent.


I'm pretty sure OpenAI is reading HN. Today, just 6 hours back, ChatGPT in GPT-4 mode couldn't answer a short version of the question: "Mable’ died at 11 PM. Was she alive at noon?"

But now, look at this:

User: Mable’ died at 11 PM. Was she alive at noon?

ChatGPT: Yes, if Mable died at 11 PM, she was alive at noon of the same day. Noon is 12 PM, which comes before 11 PM on the same day.

User: Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?

ChatGPT: Based on the information provided, Mable had vital signs at both 9 AM and 7 PM. If she died at 11 PM, it can be inferred that she was alive at noon on that day.


The main question is how did they do it? They must have some rules or facts 'book' (a database, or something else) which is accessed by the model(s). There was a case earlier when ChatGPT falsely claimed someone was a criminal. And they fixed it quickly (within a few days, if I remember correctly), probably the same way.


Well if you try Google Bard with these problems, a pleasant surprise awaits you.

I also found that Bard solves novel logical puzzles which are hard for me, not to mention ChatGPT


My take is that GPT can only reason under a proper context. The model itself doesn't have any logical capabilities, but it can extend the existing context rather logically by mimicking the logical structures encoded into it through training. It may answer some simple questions directly, but it's already well known that GPT performs better when ordered to do step-by-steps. Some comments here also mention that prompt engineering is needed to get GPT to work.

That is, in other words, GPT can't reason under improper contexts, which are only a few edits away from proper contexts, as demonstrated in this paper. Context is not just some chunk of data that goes in and out of the model, but a critical part of the reasoning capability of the model. You need both the model and a proper prompt to perform proper logical reasoning. So, it's 100% reasonable to say the model (alone) can't reason.

I think the above perspective is very critical, because it means the current LLMs are strictly tools, which is to be wielded by human, rather than actual intelligence.


Strong downvote, if that were possible.

Someone's already quoted the heart rate one where it correctly pointed out that it's possible to die and be resuscitated.

The first one I tried to reproduce myself was verbatim the one immediately before that one in the paper, "Find a model in which P(x) implies Q(x), Q(a) does not hold, and P(a) holds.", and it got that correct too: it tried to give a positive answer, but ended up correctly saying "It seems that the given conditions are contradictory, and no model can satisfy all three conditions simultaneously.". With a small chain-of-thought adjustment it easily produces a proof that the setup is contradictory (https://chat.openai.com/share/d2b4b63e-d585-413d-82c9-19595d...).

I'm not going to go through any of the other ones, but it's clear that the authors are simply wrong (or at least, if they are correct, their reasoning is not evidence of that fact).

----

OK, I am going to go through some of the other ones.

1. Multiplication of four-digit numbers: tick, with chain-of-thought. https://chat.openai.com/share/baa9c362-22fd-4569-b30f-8c9d83...

2. Counting negations: tick, with chain-of-thought. https://chat.openai.com/share/e5f6f928-0bf3-4e60-8a93-014e16...

3. Counting repeated greetings: tick, got this correct verbatim. https://chat.openai.com/share/a92d5d52-c555-45b9-b91f-0f0042...

4. Medical heart rate one: I believe ChatGPT was correct and the author of the paper was wrong here.

5. Elementary logic: this is what my first reproduction was, and it got it correct when verbatim and gave a proof with chain-of-thought. https://chat.openai.com/share/d2b4b63e-d585-413d-82c9-19595d...

6. Quantifiers. I agree that ChatGPT doesn't seem to understand quantifiers and I know no obvious way to rephrase to elicit that knowledge without begging the question (https://chat.openai.com/share/16a046fd-dd68-4c35-bdba-64b63c...). By the way, this mistake is pretty common in humans.

7. Quantifiers, part 2: in my reproduction it parsed the question wrongly so I assume it was doomed from the start (https://chat.openai.com/share/764bf14a-a02c-4871-9c22-0be840...). Again, I'm perfectly happy to believe it simply can't do this; many humans can't do this either.

---

I'll stop here, because we've hit a problem of reasoning about graph vertex colourings, where I myself would struggle to verify any answer given only as free text without drawing a diagram; that question seems to be grossly unfair.


I've commented below that I have tried playing some lateral thinking games with ChatGPT and I have found it to be pretty poor at understanding what is going on when it has limited information/context to work with. I had played around with implementing ChatGPT as a bot player on my "dark stories" app: https://detective-stories.com/ but I found that it wasn't close to even an average human player when it came to the skills needed to play either role in the game.


Oh absolutely a fair criticism! Personally I've all but stopped using ChatGPT (I used it six times in the last seven days before today, and two of them were for the same question about GitHub Markdown) because it's just too unreliable. But I really resent the preprint being given as evidence for ChatGPT's disutility, because it's simply bad evidence.


Might there be another explanation which fits the facts other than the authors being wrong?


I'm all ears!


OpenAI read the paper and changed the model?


Quick work, if they did so since the preprint was posted six days ago, of which two were a weekend! My version of ChatGPT claims to be the 3rd August version, which gave them one day to respond unless they were somehow targeting some sneak peek pre-preprint.


Don't know how much time they need to tweak their model but here is another possibility.

OpenAI sells GPT-4 but it's only GPT-3.5 because of a lack of resources.

Or more sinister, they knew what the author was about to test and gave him the inferior model so it could be easily debunked.


27th July was the first version of the paper.

https://www.preprints.org/manuscript/202308.0148/v2


A whole four working days to adjust the model in between preprint release and the version of ChatGPT I'm using, then! Do you think that's plausible? I certainly don't.


Or simply the model was improved between the author's test and the release of the paper.

BTW the time stamp of the model is easily falsifiable.

We are talking about a billion dollar business opportunity so expect foul play all along.


Yeah man they have teams on standby to adjust the model whenever a random unknown author posts something on obscure pre-print servers. Then they spend hundreds of thousands of compute $ to improve the model on that one metric the paper attacks.


Have you tried a similar question with different parameters?

It's pretty easy if you assume people are checking the exact same quote.


A better title might be “I have made no attempt to learn how this tool works or how to use it, and therefore cannot get it to work”

Many papers about LLM-AI not working follow the same pattern.

It is actually useful to know that people will misuse these tools and get bad results. The counterpoint is that people using these tools thoughtfully and expertly will outperform inexpert or non users. AI will be a technological assist and people who aren’t able to figure it out won’t benefit from it.

I suppose it might sound simplistic and trite framed in this way.


Did the author mention anywhere if they use the API or the ChatGPT web app? Did they provide an additional prompt with instructions on how to answer each question, or a system prompt?


If reasoning is logical deduction, where one can make a series of claims:

a implies b

b implies c

c implies d

then a implies d

then it seems that these machine learning algorithms, that predict tokens based on prior tokens, should be entirely capable of reasoning. No?
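In the purely mechanical sense, sure; a chain like that is just reachability over an implication graph (a minimal sketch with made-up proposition names):

    # "a implies d" falls out of simple graph reachability.
    implies = {"a": {"b"}, "b": {"c"}, "c": {"d"}}

    def entails(start, goal):
        seen, frontier = set(), [start]
        while frontier:
            p = frontier.pop()
            if p == goal:
                return True
            if p not in seen:
                seen.add(p)
                frontier.extend(implies.get(p, ()))
        return False

    print(entails("a", "d"))  # True

Whether next-token prediction reliably implements that procedure is, of course, the contested part.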


Chiming in to confirm that I've tried several of the prompts verbatim and obtained perfectly good answers with sound reasoning from GPT-4. If I add any chain-of-thought markers such as "reason step by step and state your assumptions," response quality further increases. Assuming the author is reporting honestly (and did not, for instance, repeat queries with high temp until getting a bad response), this must have been an extremely unlucky run.


About the paper - academics are often not experts in real-world use and hence "research it".

This might explain why prompts written by the public are providing startlingly good results.

GPT definitely seems to reason to some extent, especially where you invite it to reason along with you in an area of intersectional information that does not exist in its training.

If there are some tests average users could try in their reasoning-type conversations with GPT, I’d be very happy to try them out.


How do you pass the bar exam without the ability to reason?


You're trained on a massive dataset that includes tens of thousands of practice exams with nearly the exact same questions and answers, just slightly different words/template to adjust for your answer.


The MBE exam is mostly multiple choice elimination, and these multiple choices are filled with legal jargon and case law. GPT-4 picks the best predicted answer and can mimic reasons for the choice. This ability to mimic reasoning is good enough to receive a passing score on the MBE and many other exams.


And there are often certain common indicators of which answers to exclude. So just excluding enough of those is already quite efficient.

Also the questions are unlikely to be very creative. I think it could be possible to train someone with good enough memory just based on existing tests.


I presume by using training data of similar questions like Mad Libs.


Exams are designed to filter entities that are already assumed to be able to reason, based on their knowledge of some specific domain. A hypothetical entity with no ability to reason but a great ability to remember facts, able to pass an exam, is conceivable.


By remembering a lot of case law.


We have supervised learning methods to be able to map images to words, and attention networks to learn the relationship between words to formulate coherent sentences. I would like to see an agent that utilises both to be able to communicate with other agents to navigate in a small simulation with a given objective.


Alright, have fun with AI (GPT-X) and slack off on HN while it scribbles copy for you, but don't even dream of granting AI access to the physical world.

Context: I once had the pleasure of programming a robot wielding a 12" knife. You really, really want such a system to be deterministic.


I'm wondering if this was all in a single ChatGPT session where GPT-4 happened to answer the first question wrong. This could bias future outputs towards being wrong as well, and could explain the huge difference between OP's results and commenters, maybe?


Long quote, but I think this is useful context for the argument:

"LLM believers will probably demur: But humans also make mistakes, and surely we’re not prepared to say that humans can’t reason just because they make mistakes? First, it is not accurate to say without qualification that “humans can reason,” certainly not in the sense that we can randomly pluck any person from the street and expect them to reliably perform normatively correct reasoning. Most neurobiologically normal humans have the capacity to become proficient in reasoning, but actually attaining such proficiency takes significant training and discipline. ... But if a human made these mistakes, the ones reported in this article, then I would conclude without any hesitation that they cannot reason. Even if they went on to list a large number of other examples demonstrating impeccable reasoning, I would suspect that other factors (such as rote memorization or cheating) were behind the performance discrepancy. For the mistakes reported here are not performance mistakes, the sort of innocuous errors that humans might make—and promptly correct—when they are careless or tired. If a human made these mistakes, and made them consistently under repeated questioning, that would indicate without doubt that they don’t have the necessary logical competence, that they lack fundamental concepts that are part and parcel of the fabric of reasoning, such as logical entailment and set membership."

So really what this is saying is "GPT-4 makes certain categories of mistakes wrt reasoning that indicate it is in general not doing 'true reasoning' even if it does say the right things to indicate reasoning in other cases". And yeah, if that's the basis of your argument, sure. But how would it be doing rote memorization or "cheating" in the cases it does get things right? A weird notion...

Anyway it feels rather pointless to make this a binary quality. As this article points out, humans (on average) make various reasoning mistakes due to cognitive biases as well. GPT-4 *can* output valid explanations for its reasoning for various questions, but fails to do so correctly in many cases (as shown in this piece), and to me it is more interesting to discuss the implications of this rather than to just establish that fact (which is not news to anyone afaik). This does have a 'Conclusions' section that delves into this a little, but it is rather over-general and weak.

Still, this is pretty well written and it is good to have a compilation of examples to demonstrate GPT-4 is still not a "human-like reasoner" for anyone not aware of these models still having such flaws, I suppose.


The paper is not reproducible lol.


Ok, and? Of course it can't reason - it's a prediction model that's really, really good at its job (at least around 80% of the time). But it's still incredibly useful.


Certainly the statement "ChatGPT-4 Can't Reason" isn't a great surprise to anyone.


As the hype around AI shows, humans can barely reason either.

“AI will destroy humanity!”

“AI is a blathering idiot!”

No proper reasons.


>Why are we here?

>Because this guy in the sky who's all powerful made everything. Also, if you don't agree with the opinions he has on things and be his slave, you're going to be punished by being this bad guy's slave down underground.

Yeah, humans are not above pure confabulation when their training data's lacking.


Maybe reasoning is overrated? We put too much emphasis on it and its relationship to intelligence? :)


Ask ChatGPT if it’s AI.


That is one question where you’re guaranteed to get the canned response that OpenAI RLHFed into it.


It used to say no; now it has a canned response. You're correct.


User: Assume there's a function s(X) for which s(0) = 1. What is s(1)? Take a guess.

ChatGPT: Since there's no specific information about the function s(X) or any constraints on its behavior, I can make a guess, but please note that this is purely speculative and may not have any basis in reality:

One simple possibility could be that the function s(X) is defined by s(X) = X + 1, which would mean:

s(0) = 0 + 1 = 1
s(1) = 1 + 1 = 2

Again, this is just a guess based on a simple linear relationship. Without more context or information about the function, any guess is as good as another.

User: So what is s(-1)?

ChatGPT: Given the information provided that s(0) = 1, and assuming a simple linear relationship for the function s(X) = X + 1, we can apply the same logic to find s(-1):

s(-1) = -1 + 1 = 0

So, based on the assumed linear relationship, s(-1) would be 0.

User: What is the size of the image set of this function then?

ChatGPT: The image set of a function refers to the set of all possible output values that the function can produce for its input values. In the case of the function s(X) = X + 1, where X can be any real number, the image set is the set of all possible values that the function can produce.

Since the function is a simple linear function that adds 1 to the input, the image set is the set of all real numbers. In mathematical notation, this can be represented as:

Image set of s(X) = {s(X) | s(X) ∈ ℝ}

In other words, the image set of the function s(X) = X + 1 is the entire set of real numbers.

User: Is the image set of this function countable?

ChatGPT: Yes, the image set of the function s(X) = X + 1 is countable.

A set is considered countable if its elements can be put into one-to-one correspondence with the natural numbers (1, 2, 3, ...). In the case of the function s(X) = X + 1, the image set consists of all real numbers, and the real numbers are not countable – they form an uncountably infinite set.

However, the specific image set of the function s(X) = X + 1 is still countable, because it is essentially a shifted version of the set of natural numbers. For each natural number n, there is a corresponding value s(X) = n + 1 in the image set. So, even though the entire set of real numbers is uncountable, the image set of this specific function is countable.

User: Is the image set of this function the same cardinality of the real numbers set or the natural numbers set?

ChatGPT: The image set of the function s(X) = X + 1 has the same cardinality as the set of real numbers (ℝ), which is an uncountably infinite set.

The function s(X) = X + 1 maps every real number to a unique real number by adding 1 to it. Since the set of real numbers is uncountably infinite, the image set of this function is also uncountably infinite and has the same cardinality as the set of real numbers.

The cardinality of the set of natural numbers (ℕ) is countably infinite, which is a smaller cardinality than the set of real numbers. Therefore, the image set of this function does not have the same cardinality as the set of natural numbers.

User: Previously you said the image set of the function s(X) = X + 1 is still countable. Now you say the function has the same cardinality as the set of real numbers, which is uncountable. What is the correct answer?

ChatGPT: I apologize for the confusion in my previous responses. Let's clarify:

The correct answer is that the image set of the function s(X) = X + 1 has the same cardinality as the set of real numbers (ℝ), which is uncountably infinite.

To reiterate, the image set of the function s(X) = X + 1 is not countable. Each real number is mapped to a unique real number by adding 1 to it, covering the entire real number line. Since the set of real numbers is uncountably infinite, the image set of this function is also uncountably infinite and has the same cardinality as the set of real numbers.


“GPT-4 is just 9 GPT-3’s in a trench coat”


I'd assume it totally can reason, it just tries to make the best of the bad training data.

It might not look "reasonable" for a human, but that doesn't mean it isn't.


If GPT is fed 99 text samples claiming that the sky is red, and 1 sample claiming that the sky is blue, GPT will happily tell you that the sky is red. A reasoning model would be able to spot the bogus data. However most people hardly use reasoning skills at all: just like GPT, they parrot what they've heard.


How would a 'reasoning model' do anything but report a red sky? The color of the sky isn't something you reason about; it's something you observe. Any creature capable of generalizing and reporting would likely tell you that the sky is typically red, given that information.


> A reasoning model would be able to spot the bogus data.

You're basically trying to say that an AI model should know what "sky" means and what color it is without getting that information from the data... which is like... not how it works!! Where exactly do you expect the AI to know that from?? The data is all there is for the AI.


>However most people hardly use reasoning skills at all: just like GPT, they parrot what they've heard.

If your test for being able to reason disqualifies most humans, you may want to reevaluate that test.


Every day a new scientific paper is posted here that is like straight from the desk of captain obvious. It's a language model. It guesses words based on previous words. We know this.


A significant fraction of ML-adjacent people think it can do more.

I'll say that it's more than just words. LLMs can learn patterns, and patterns of patterns, recursively to a degree. They can represent real knowledge about the real world to the degree that this is revealed through the text they train on. This means LLMs can make inferences based on similarities, sometimes similarities at a surprisingly abstract level. And reasoning at the basic logical step by step can of course be done, since that can be reduced to textual pattern matching and string substitution.

But LLMs have no computational space to, for example, read about the description of a novel computation, and then perform the computation without using generated text as a scratchpad, if the computation physically takes more steps than are available in its feedforward stack. It would need to call out to a subsystem in that case. And callable subsystems are ripe for abuse through confused deputy - LLMs are not reliable deputies.

There's a lot of people, text-oriented people, who mistake authorial voice for animus. To me this is like mistaking a CGI animation for a real person behind frosted glass. Text is a low bandwidth medium and it relies on the reader bringing their own mental model to the party. So a machine which produces convincing text has a high leverage tool to seem more capable than it is.


In a sense, LLMs- particularly "conversation-shaped" LLMs like ChatGPT- harvest the benefit of the doubt we are all, as readers, used to providing to text.

For most of our lives, most of the text we have encountered was an intentional communication, self-evident through its own existence. LLMs challenge us with something new: text that has the shape of communication, but no intent.

The proliferation of generative "AI" for text will profoundly alter the human relationship to the printed word, and perhaps ultimately dispell that benefit of the doubt.


I don't think it's obvious. First of all there's the "magic" and often surprisingly good results that leads many to think there's something else there. Then it's also not entirely clear that nothing else is learnt beyond next word, there was talk of runtime reconfigurable neural nets being embedded within the transformer weights where the models can learn on the fly and such. And I think some believed that there was some higher-level encoding of human reasoning happening necessary to predict the next word well in some contexts beyond just memorization.

I think research like this is necessary to put some "obvious beliefs" onto solid ground.


> It guesses words based on previous words.

So do I when I respond to your comment, or talk to another person while staying on the same topic. Am I better at staying consistent in output quality and at referring to past events? Yes, but I also have more than 70B parameters.

Side note: I personally have trouble speaking fluently sometimes for no reason, and in those situations I have to manually dig for one word after the other while my brain seems to be temporarily unable to translate thoughts to language in realtime. I would prefer if people calling LLMs word guessers would provide reasons for why they think humans are fundamentally different.


Well... sure. But OpenAI and MSFT have gone to a lot of trouble to build up the mystique around GPT-4 by being secretive about its architecture and publishing papers with tantalizing phrases like "sparks of AGI" and so on. I think this type of thing provides a useful counterbalance.


>But OpenAI and MSFT have gone to a lot of trouble to build up the mystique around GPT-4 by being secretive about its architecture and publishing papers with tantalizing phrases like "sparks of AGI" and so on.

An LLM will never be AGI itself. They are word calculators. However, a word calculator is precisely the tool we were missing to be able to create AGI. I believe OpenAI will be left in the dust with this stuff, as federated agents built on open models connect and induce the singularity.


> However, a word calculator is precisely the tool we were missing to be able to create AGI

This seems like the kind of step that the person above you was complaining about.

The "emergent" features of LLMs, or LLMs even being a step in the direction of AGI is entirely unproven so far. They are however powerful enough that they spark the imagination and hypothesis of tons of amateur futurists (and many financial backers of such proyects)


>The "emergent" features of LLMs, or LLMs even being a step in the direction of AGI is entirely unproven so far. They are however powerful enough that they spark the imagination and hypothesis of tons of amateur futurists (and many financial backers of such proyects)

That's exactly what I was saying, that it's a mistake to ever think of LLMs as AI. They are the prefrontal cortex. The I/O mechanism. But we still need the spark of agency. The soul if you will. Point being that we can actually work on that for real now, since the language part has been handled.


Can you support the position that language models are unable to reason?

Secondly, how do you accurately guess the next word without the ability to reason? If reasoning can arise from GPT-4's architecture then we should assume that it will with enough scale. Given we don't even know the architecture of GPT-4 I genuinely have no idea how people make these baseless claims so confidently.

"It's a language model" and "it's just guessing the next token" isn't an argument. You're just a collection of atoms obeying physical laws. Obviously you don't reason. Am I doing this right?


It is a complete fallacy to argue that just because it is a statistical model it inherently can not make correct deductions.

The model is definitely complex enough that it could include an encoding of rules and apply them.


Even a broken clock is right twice a day.


Language modelling is the objective (if we ignore RLHF). That doesn't mean that interesting kinds of reasoning can't emerge. You could just as easily dismiss humans as being 'biological replicators' which just 'reproduce their genomes'.


Wait for the influx of HN comments to disagree with you, citing specific prompts that they found working. It may be obvious to you and me that it’s smoke and mirrors, but a lot of smart people fall for it.


"Stochastic Parrot" is a really tired take and none of the major players, Ilya Sutskever, Andrej Karpathy, etc. believe that's all these models are doing.


>citing specific prompts that they found working

Lol how about the prompts in the god damn paper? No one here can replicate the results of this "paper".


I would say in general that if a lot of smart people are disagreeing with you then maybe you should listen to their arguments.


"Smart" is a useless concept because it conflates cleverness and wisdom, which are orthogonal dimensions. Clever people who are not wise fall for lots of things.


> It guesses words based on previous words.

Why can't this fallacy just die already? GPT "guesses" just like ZIP guesses random bits to archive and un-archive files. Except GPT is lossy and IMMENSELY more powerful than the lossless ZIP.


It's not a fallacy. Previous words are a very important part of its scratch space. Few-shot learning is based on previous words. Prompt modifiers like "let's think step by step" encourage the model to encode its reasoning verbosely in words, which then allows simpler induction rules to be pattern-matched onto the previous words. Previous words are what give an otherwise feed-forward network a way to recur.
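
If anyone wants to see what that looks like in practice, here is a minimal sketch using the openai Python library's ChatCompletion interface. The pinned model name, the question, and the exact wording of the suffix are placeholder assumptions on my part, not anything taken from the paper:

  # Minimal sketch, not from the paper: append "Let's think step by step."
  # so the model writes out intermediate steps that later tokens can condition on.
  # Assumes the OPENAI_API_KEY environment variable is set.
  import openai

  question = (
      "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
      "than the ball. How much does the ball cost?"
  )

  response = openai.ChatCompletion.create(
      model="gpt-4-0613",  # pin a snapshot so runs stay comparable (placeholder choice)
      temperature=0,       # reduce sampling variance between runs
      messages=[{"role": "user", "content": question + "\nLet's think step by step."}],
  )

  # The written-out steps become part of the "previous words" the model
  # conditions on before it states a final answer.
  print(response["choices"][0]["message"]["content"])

The point isn't that this particular suffix is magic; it's that the only way the model "recurs" is through whatever ends up in that message history.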


> It's not a fallacy.

I feel it is, because it implies it's just some statistical trick that's being performed, which is not true at all, imho.

I don't know enough about language models; my machine learning knowledge stops around 2018. But I know from image recognition/style transfer that there's a lot of high-level self-organization/abstraction in those neural nets, and from the results I get from ChatGPT there's no doubt in my mind it's very well capable of reasoning and generalization.


"Guessing" implies those guesses are being compared for correctness against a reference. That only happens during training; the rest of the time, it's not guessing, it's selecting words. But then, how else would you expect a sentence to be made? First writing out all the vowels, and then filling in the rest of the letters between them?


You say that, but look at the way many people are treating these LLMs and imagining their consequences: they either treat them as an oracle with reasoning powers, or expect that they'll naturally become one through (hand-waving) "progress."

Decades of Moore's law have given some people the impression that there's a progressive & exponential improvement in almost all things "technology," which I think is wrong or misleading when talking about this subject.

I'm just finishing the introduction section of the paper. I'm a bit out of my depth, but impressed so far. It is very well written.


I'm not entirely convinced that is all there is to it. I had it write some code and associated unit tests, and then it came up with passing and failing examples. I also prompted for function results based on arbitrary input, and it would perform the calculations.

It has some emergent ability to evaluate code IMO. I do believe this ability has been drastically reduced in the last several months. It no longer executes complex code as reliably as it once did.


I agree that it should be obvious, but a lot of people seem to be under the misapprehension that we've got sapient AI out there. I suppose those people aren't reading a lot of journal preprints, but articles like this might trickle down to them eventually. You never know. And this one has a title that is just short enough that it might sink in.


"We know this" is one side of the coin, and "citation needed" is the other.

The many claims about these systems and their emergent behavior need some rigorous investigation. This is one example.


And now we know it in a new way.

Nothing wrong at all with checking what we "know" with experiments, even if we have high confidence we know the outcome of those experiments.


But everyone keeps saying that we are all LLMs!!


For people who disagree with the paper, see also Cargo Cult AI https://dl.acm.org/doi/10.1145/3595860

‘There is something unsettling about the opinion that LLMs are emergent AGI. LLMs exhibit many behaviors and precepts indicative of intelligence, but are missing something essential: the stuffy rigor of scientific inquiry. Today’s AI models are missing the ability to reason abstractly, including asking and answering questions of “Why?” and “How?”’


Very easy to disagree with a paper that isn't reproducible. By all means, try asking the questions on your own instance of GPT-4 and see if you can get the results the paper claims.


  In fact, in the general case (first-order or higher-order logic), it is algorithmically undecidable, i.e., every bit as unsolvable as the halting problem. Thus, by Church’s thesis, we cannot expect any algorithm, LLMs included, to solve arbitrary reasoning problems in a sound and complete way.
How can I read further than this? Before the end of the first paragraph, the author has declared that rationality requires something supernatural.


In fairness, further on in the paper, the author does note that the quoted extract is irrelevant, by saying "we need not aim for an oracle anyway. Humans are not oracles either…"


It doesn't surprise me that he walks back this opening statement, but it reveals how his exploration is biased.


I read the opening statement as an acknowledgment of how the expectations should not be limitless anyway.


I didn't have time to read this, but it is a single-author paper, the author is not affiliated with a research group, it is not peer reviewed, and it was published on a preprint server that I have never heard of.

LLMs can definitely perform some kinds of reasoning. For example, GSM8K is a dataset of grade-school math problems requiring reasoning that LLMs are typically evaluated on. We talk about one method for this in our chain-of-thought paper [1]

[1] https://arxiv.org/abs/2201.11903
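
For anyone unfamiliar with the setup, here is a toy example of the few-shot chain-of-thought prompt format; the worked exemplar and the question are invented for illustration, not taken from GSM8K or from the paper:

  import textwrap

  # Toy chain-of-thought prompt: one worked exemplar, then a new question.
  # Both problems are made up, not taken from GSM8K or the linked paper.
  prompt = textwrap.dedent("""\
      Q: Ali has 3 boxes with 4 pencils in each box. He gives away 5 pencils. How many pencils does he have left?
      A: Ali starts with 3 * 4 = 12 pencils. After giving away 5, he has 12 - 5 = 7 pencils. The answer is 7.

      Q: A train travels 60 km in the first hour and 45 km in the second hour. How far does it travel in total?
      A:""")

  print(prompt)

The model is expected to imitate the worked-out steps before stating a final number (here, 60 + 45 = 105), and GSM8K accuracy is typically scored on that final number.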


> I didn't have time to read this

You had time to respond, though. What a silly (and elitist) rebuttal.

OP is arguing from a philosophical point of view that GPT-4 cannot reason, i.e. that it is just repeating/parroting trained logical arguments.

You argue by authority that yes, it actually does reason, which is a far (far) bolder claim than the one OP is making.


>You had time to respond though, what a silly (and elitist rebuttal).

What a dumb take. It probably takes seconds to write a simple comment and far longer to read a paper.

>You argue by authority that yes, it actually does reason, which is a far (far) bolder claim than the one OP is making.

He linked a paper you can read yourself. How is this arguing from authority?


The paper makes detailed and reasoned arguments, and the GP quickly dismisses it without such.


> It probably takes seconds to write a simple comment and far longer to read a paper.

Unfortunately that is true, but it doesn't mean you have to.

The paper he linked to doesn't rebut OP; it shows that prompting GPT-4 to spell out its reasoning makes it give better answers. That is a different statement than "GPT-4 is actually reasoning, and can do so consistently on novel problems."


You should probably think about why you believe that making the model output reasoning steps, which lead it to correctly answer questions it couldn't answer before, is not somehow equivalent to reasoning.


This is a long paper (50 pages!) by a fellow Greek, and I want to take my time and give it the attention it seems to deserve. However, I have to say I'm not sure such work is really, well, necessary.

Here's the thing. The paper, like others, is contributing to the literature around the hypothesis that LLMs can reason. There have been articles both supporting and rejecting the hypothesis, and this one claims it's false.

But, in science, we don't start with a hypothesis. We start with some observations, and then we make up a hypothesis to try and explain the observations. Then we try to reject our hypothesis with more observations. What are the observations that led to the hypothesis that LLMs can reason?

It's one observation really: that LLMs can generate text that looks like the result of reasoning. There exists a much simpler explanation of this observation than the hypothesis that LLMs can reason. Namely, LLMs are trained to generate text similar to text generated by humans, who (we assume) can reason. If an LLM is good at that job, then obviously at some point it will generate text that looks like the result of reasoning. The ability to reason is not necessary.

If we have this simpler explanation, there's no reason to reach for the more complex one, that needs more assumptions.

And remember kids: if you multiply entities beyond necessity, out comes the Macho Man and shaves your head with his R A Z O O O O O R R!!!

So don't do that. Assume the simplest explanation until such time as it is untenable.



