Can LLMs Reason and Plan? (acm.org)
40 points by wanderingmind 8 months ago | 46 comments



A test that doesn't properly inform you of someone's (or something's) abilities is a rubbish test. From their paper, you would think being able to pilot robots to handle chores or stack objects or whatever would be well beyond LLMs. But it's not.

https://tidybot.cs.princeton.edu/ https://innermonologue.github.io/

Anyone who wants LLMs to plan and is actually interested in teasing out the extent of those abilities knows how to structure planning requests.

It's extra funny because humans can't actually generate plans the way he tests LLMs to, either.

Also, seeing output from GPT that demonstrates intelligence, reasoning, or whatever, and saying it is not real reasoning/intelligence etc., is like looking at a plane soar and saying that the plane is fake flying. And this isn't a natural versus artificial thing either. The origin point is entirely arbitrary.

You could just as easily move the origin to Bees and say, "oh, birds aren't really flying". You could move it to planes and say, "oh, helicopters aren't really flying." It's a very meaningless statement.

Internal processes are entirely irrelevant.


Their example prompt is so bad that I'm torn between them being wildly incompetent at understanding how transformers and autoregression work (and general LLM dynamics) and them deliberately obfuscating the task and prompt representation in bad faith.

If one really wants to know whether LLMs "are capable of planning", one should keep an open mind about how planning behavior can manifest in textual form, then actually try to find any manifestations. Imposing one's view of what planning should look like is bad science. All they've proved is that their task format sucks for current-generation LLMs.


LLMs cannot plan. There is no LLM that can solve sudoku puzzles by executing the obvious constraint propagation algorithm with backtracking. Therefore, LLMs can neither reason nor plan. Software is not magic, and the fact that a lot of people are starting to think it is should be concerning for the folks training the next generation of software engineers.
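
For concreteness, the "obvious constraint propagation algorithm with backtracking" fits in a couple dozen lines of Python. A rough sketch (the flat 81-cell grid encoding and helper names are purely illustrative):

    # Minimal sudoku solver: most-constrained-cell heuristic plus backtracking search.
    # grid is a list of 81 ints, 0 meaning an empty cell.
    def candidates(grid, i):
        r, c = divmod(i, 9)
        box = (r // 3) * 27 + (c // 3) * 3
        used = set(grid[r * 9:r * 9 + 9])                              # row
        used |= set(grid[c::9])                                        # column
        used |= {grid[box + k % 3 + (k // 3) * 9] for k in range(9)}   # 3x3 box
        return [d for d in range(1, 10) if d not in used]

    def solve(grid):
        empties = [i for i, v in enumerate(grid) if v == 0]
        if not empties:
            return grid                                                # solved
        i = min(empties, key=lambda j: len(candidates(grid, j)))       # most constrained cell
        for d in candidates(grid, i):                                  # try each candidate
            grid[i] = d
            if solve(grid):
                return grid
            grid[i] = 0                                                # backtrack
        return None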


I played ASCII Tic-Tac-Toe with ChatGPT using new rules that I came up with, and it was able to play using the new rules and even explained its reasoning for the moves when I asked. I have a hard time understanding that as not involving thinking and planning.


This is the thing to call out though - the difference between a proof of concept and then seeing what happens in production.

Production and scale definitively show that LLMs don't reason. The output is highly unpredictable. Small changes can result in absolutely unrelated outcomes.

If you were to classify text using ChatGPT, for example, it switches from classification to text generation once your text is longer than a certain length.

LLM-based agents are the place where LLM reasoning died in practice.

I really hope that there is a change; maybe larger context windows will allow the generated text to "reason". Although that would not necessarily be reasoning.

Even the linked article says:

"Indeed, LLMs make it easy to get problem-specific knowledge as long as we are willing to relax correctness requirements of that knowledge."

LLMs generate text that approximates what a plan would "sound" like. They don't plan.


When you poll humans for opinions, your wording can often have a very high impact on the opinions presented by folks.


I suspect this appears to be the case because there are a ton of example pages out there explaining tic-tac-toe games. My conjecture is that if you tried it with a game that isn't as well studied as tic-tac-toe, the LLM would fall flat on its face.


Sounds interesting! Can you please share the transcript?


I'm looking for it, but unfortunately I can't find it, which is weird because I've never deliberately deleted any of the transcripts.

The homemade rules I used were as follows (each for a separate round):

1. The first player to get three in a row loses.

2. The first player to take the center square loses.

3. Players can place their marker in squares already taken.

Admittedly it didn't nail it perfectly (it messed up a few times), but most of the time it was able to follow the rules as explained without issue, and could explain its reasoning when I asked it why it picked a particular square.


Yes, LLMs can solve sudoku: https://arxiv.org/abs/2305.08291


No, the LLM did not solve sudoku. They simply linked a sudoku solver to an LLM. All the LLM did was constantly invoke the solver. The solver can be downloaded from github, and works independently of the LLM.

All that experiment demonstrated is that you can use an LLM to run through a state tree...which is something that simple machine agents have been able to do for a few decades.


>They simply linked a sudoku solver to an LLM. All the LLM did was constantly invoke the solver.

No, they didn't just link a solver.

The LLM does the filling. It makes an attempt, the checker checks whether it's a valid sudoku move, and if not it returns to the previous state (node), and so on until solved. The history of attempts is stored and retrieved every time the LLM backtracks.
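
Roughly, the loop being described looks like this (a sketch with placeholder callables, not the paper's actual code):

    # Propose/check/backtrack loop: the LLM proposes fillings, a simple rule checker
    # validates them, and invalid branches are rolled back. All callables are placeholders.
    def solve_with_checker(initial_state, propose_move, is_valid, apply_move, is_solved,
                           max_steps=500):
        history = [initial_state]                 # stack of states; enables backtracking
        rejected = []                             # (state, move) pairs already ruled out
        for _ in range(max_steps):
            state = history[-1]
            if is_solved(state):
                return state
            move = propose_move(state, rejected)  # the LLM proposes the next filling
            if move is not None and is_valid(state, move):
                history.append(apply_move(state, move))   # accept and go deeper
            else:
                rejected.append((state, move))            # remember the dead end
                if len(history) > 1:
                    history.pop()                         # backtrack to the parent node
        return None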


What about an LLM writing an algorithm that can play Sudoku?

We build things that we ourselves can't do all the time.

Edit: I updated my sentence to "an LLM writing an algorithm" to make myself clearer. After reading my own sentence I realized it wasn't, sorry!


Did you read the article? The main issue with your idea is that an LLM won't know if the algorithm it created is any good, or even if it works at all. If it can't check that, it will never know and never get better. You could ask it to generate a number of algorithms and then choose the best one yourself, but then you have worked as a team; the LLM did not plan anything.


LLMs can integrate with a sandbox to deploy and test their code and iterate on a solution until it appears to be valid. They can also integrate with web search to go out and find presumably-valid sudoku puzzles to use as test cases if they're (likely) unable to generate valid sudoku puzzles themselves. I know it's expanding the definition of "LLM" to include a sandbox or web search, but I think it's fair because it's a reasonably practical and obvious application environment for an LLM which you plan to ask to do things like this, and I think LLMs with both these integrations will be commonplace in the next 1-2 years.
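
A rough sketch of that generate-test-iterate loop (generate_code and run_in_sandbox are hypothetical stand-ins for an LLM call and a sandboxed runner, not a real API):

    # Ask the LLM for code, run it against test cases in a sandbox, and feed
    # failures back until everything passes or we give up.
    def iterate_until_valid(task, generate_code, run_in_sandbox, test_cases, max_attempts=5):
        feedback = ""
        for _ in range(max_attempts):
            code = generate_code(task, feedback)          # candidate solution from the LLM
            results = [run_in_sandbox(code, case) for case in test_cases]
            # each result is assumed to expose .passed and .error
            failures = [r for r in results if not r.passed]
            if not failures:
                return code                               # all test cases pass
            feedback = "\n".join(r.error for r in failures)   # let the next attempt correct them
        return None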

No, I don't think LLMs can "reason and plan". But I do think they can effectively mimic (fake) "reasoning and planning" and still arrive at the same result that actual reasoning and planning would yield, for reasonably common problems of greater-than-trivial but less-than-moderate complexity.

I think pretty much all of our production AI models today are limited by their lack of ability to self-assess, "goal-seek", and mutate themselves to "excel". I'm not 100% sure what this would look like, but I can be sure they don't have any real "drive to excel". Perhaps improvements in Reinforcement Learning will uncover something like this, but I think there may need to be a paradigm shift before we invent something like that.


Of course we can write an algorithm to solve Sudoku. The commenter specifically is talking about LLMs, not algorithms in general.


I'm making the point that they're not as limited as they seem at first glance.


Of course it can given the correct framework. I think that's more a limitation of the structure of its interface and programming around it. Allow it to consider and iterate on its own output and it'll get there.


Would an average human be able to do that on a strict time limit (which LLMs effectively have as they do a fixed amount of computation per token)?


The mark of reasoning in a computer program is not whether or not a human can do the same thing under the same time limit. A calculator doesn't reason but solves 658236 x 37854285 faster than any human could.

You could give an LLM days per token and it wouldn’t change its capabilities regarding reasoning.


I'm not saying that this is proof that it can reason, but rather that the fact that it can't do this isn't, on its own, proof that it can't.


I would hope the next generation of software engineers are not the kind to blindly buy into media hype.


The next generation (going by some comments I’ve read on HN) are already non-programmers gluing (I assume terrible) LLM-generated code together until it (appears like it) works.

Hell, my non-programmer brother recently sent me a message like “can you think of a way to fix this script, ChatGPT isn’t managing” (sends me a badly written AI-generated script)


> “can you think of a way to fix this script, ChatGPT isn’t managing”

haha, who's the glue coder now!

edit: I chatted with a 13-year-old glue coder one time. He couldn't really code at all, but he could find the components and was excellent at prompting people for help. He showed me many things, like map APIs interacting with many other things. How long did that take you to put together? Uhh, 2 hours. My mind was blown.


Do 4-year-old children plan? They can't solve Sudoku, so arguably not.


Children have no explicit conceptions of numbers and logic, so they obviously do not plan the same way someone with explicit knowledge and understanding of logic reasons and plans. One could argue they have implicit understanding as members of a species known for inventing mathematics, but that's more of a philosophical argument than a scientific one.


These things are just statistical language models, aren't they? To the extent that when people reason or plan and then verbalize those plans, they tend to leave certain trajectories through language space, a statistical model could presumably reproduce similar, plausible-sounding narratives. That doesn't mean any kind of agency ever actually thought about those plans, though, in the way that a person would.


"plausible sounding" is probably the best two word summary of LLM output. A good one word summary might be "bullshit".


I relied on GPT-4 to learn linear algebra. I had a textbook and GPT-4. One difference is that the textbook was sometimes wrong (and GPT-4 is the one that spotted the error in the book). If it is "bullshit", it's very helpful bullshit.

When I ask GPT-4 to write some code to perform a specific task, and it does so, and the code works and correctly performs the task, is that bullshit?

It's true that LLMs are not reliable enough to trust blindly, but the same applies to humans.


> To the extent that when people reason or plan and then verbalize those plans

This is a large and dubious assumption. Most of what humans do requires step coordination, but almost none of it is the result of explicit planning or verbalization.


The article itself is very assertive and makes a lot of generalizations, but if you look at the source of their claims [1] you see that in the first study they are using GPT-3.5 and achieve only a 5% score on a reasoning test that largely relies on spatial intuition - some boxes need to be stacked and unstacked sequentially in a convoluted task. Then they get criticised, so they come up with another paper in which they use GPT-4, which has an improved performance of 30% - a 6-fold increase, which the author describes as "modest". They then decide to change the test to a much harder and more convoluted version, where (surprise!) performance drops once again [2].

I would also have welcomed a comparison to humans. If we apply this test to 100 humans, can we conclude humans don't reason if only 30 get it right?

[1] https://arxiv.org/abs/2206.10498

[2] https://arxiv.org/abs/2305.15771


Interesting take. I feel that in this discussion many people are approaching it from the theoretical limitations of LLMs, and you seem to be the only one taking the experimental approach. Funnily enough, many who doubted LLM capabilities 4 years ago have come around to their emergent capabilities, yet with much skepticism, simply because we still don't understand how these emergent abilities work. I haven't seen a paper comparing the threshold of performance with LLMs' increased capabilities, and what parameters (and their weights) come into play to influence the performance.


> I would also have welcomed a comparison to humans.

Much of the criticism and skepticism around LLMs rests on a double-standard that itself rests on an almost embarrassing lack of understanding of how humans themselves operate.


Absolutely!


The obsession with whether LLMs can outperform classical AI, algorithms, solvers, and other optimizers fascinates me. Intuitively, of course they aren't doing something similar to a solver. They will never play chess better than the best fit-for-purpose chess-playing system. They will never reason better than a classical reasoning system. That misses the point entirely.

They are a fascinating augmentation to existing capabilities, filling in an ability to operate, and to do some approximation of reasoning and planning, in an abstract semantic space, using natural language in a semantic way, to perform an abductive style of "reasoning" that has eluded us to date. Need to play chess? Use a chess-playing system. But you can glue together special-purpose systems for planning, reasoning, optimizing, calculating, etc., with an LLM in ways that are much more flexible and adaptable in a real-world context than we've ever been able to achieve. That's the magic of them.

The fact that they can do pretty alright at some of these tasks is interesting and shows how powerful language is: embedded in its semantic structure is an awful lot of what you need to reason, plan, etc. But beyond the academic question, why would you not just use a provably optimal algorithm for a specific task? Interestingly, given a set of operations in context, LLMs are pretty good at identifying a situation where an operation is applicable and making the API call, and as we learn how to embed them, they'll be able to defer to, and be deferred to, more accurately.
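
A toy sketch of that glue pattern (the tool stubs and call_llm are hypothetical placeholders; in reality they'd be a chess engine, a SAT/CP solver, and an actual model API): the LLM only routes the request, the special-purpose tools do the solving.

    import json

    # Hypothetical special-purpose backends, stubbed out for illustration.
    def chess_engine(args):
        return "best move for position " + args["fen"]

    def constraint_solver(args):
        return "solution for " + str(args["constraints"])

    TOOLS = {"chess_engine": chess_engine, "constraint_solver": constraint_solver}

    def dispatch(user_request, call_llm):
        # The LLM decides which tool applies and with what arguments; it does not solve.
        prompt = ("Available tools: " + ", ".join(TOOLS) + ".\n"
                  'Reply with JSON {"tool": <name>, "args": {...}} for this request:\n'
                  + user_request)
        choice = json.loads(call_llm(prompt))
        return TOOLS[choice["tool"]](choice["args"])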


People are interested because of the possibility of an architecture for AGI. A system capable of generalizing to all types of problems and also being able to leverage everything you said as a tool (or even build the tool) is extremely valuable.

We're not even close to this result yet but LLMs seem the closest.


I agree, but I think there's an overly large emphasis on picking at LLMs' (in)ability to reason in a structured way independent of their semantic language expectation model. It's usually used as a "see, LLMs suck" while the other side holds to "see, it can do everything." Neither is right. LLMs are amazing, but they can't do everything, even if they can seem to do many things most of the time. While it's interesting to investigate the boundaries, I think the passion should be directed more towards "how do we wire the pieces together, and for what, and where".


LLMs can write a plan, and they can write about their reasoning. The question is, are they good enough?

We may need another generation of LLMs that are more competent.


Actually, LLMs can write a better plan than the author:

The prompt format made absolutely no sense considering they had decided to translate away from PDDL into arbitrary natural language. So, to avoid triggering their overused Clever Hans defense, I fed GPT-4 their prompt with only the instruction:

    "Think critically about how we could represent these rules in a way that's clearer to an LLM. Lean into using coding style identifiers where possible, and JSON formatting"
(Hopefully the author won't claim telling a model to generate JSON is secretly telling it how to move blocks!)

In a fresh context window, at temperature 0 with gpt-4-0613, I entered the JSON-formatted rules it generated, along with the instruction:

    Return a JSON array of [{[<valid actions>],<internal thought>, <action>}] that results in goal state
... the resulting answer solved their failed few-shot example with zero-shot

_

Also, a funny blunder: their chain-of-thought example generates thoughts... after the action. Surely the author understands a transformer model can't rely on an ungenerated token to affect the action taken?

Edit: Also their "disguised difficulty" version was wrong?

> To perform Attack action, the following facts need to be true: Province object, Planet object, Harmony.

The single word "Harmony" somehow replaces "Your hands are empty"?

Before that they claim it's meant to be a 1:1 replacement of entities, but the actual disguised versions are no longer valid instructions.


I'm not surprised, GPT-4 really is very good, you just have to prompt it well enough. Often you don't even have to be a prompt guru.


LLMs only work on the data they have been trained on, so all outputs are merely based on information that has already been written about by a human. Furthermore, LLMs do not truly "understand" even first-order causal relationships, meaning whatever they plan will have no foresight to evaluate how a plan they generate will impact downstream components of a complex system.

LLMs live in "the world that has been written about", not the real world, and thus cannot formulate new ideas or hypotheses other than by accident. This, coupled with the lack of an ontological system for evaluating the validity of the statements they make about a complex system, and their lack of causal reasoning, means they cannot effectively plan.

I've worked on research related to causality that used LLMs (admittedly, pre-ChatGPT and using much smaller models), and it was not uncommon to see extremely bogus causal relationships inferred, such as "rising cost of living in NYC caused a flood in Argentina".


Neither can a human, unless the human subconsciously re-evaluates their output and refines it before speaking, much like running the LLM output through the model again. Or unless they've learnt it through past experience reinforcing that pathway, much like an LLM learning and adjusting weights to factor that in so it impacts its output in future.


"Plan" means a lot of things.

There has been some research in applying GPT-4 to Hierarchical Task Networks (HTN), one means of doing computerized semi-automated/automated planning of a complex task as a tree of less and less complex tasks [1].

There are other types of planning. Automated planning works better when the tasks in a plan are more clearly defined, there is less ambiguity in dependencies, and there is more separation between the tasks. The OP article touches on that, noting that LLMs are good at extracting planning knowledge but, in their experience, not good at creating executable plans. This is why I think the hybrid approach is best: using an LLM to inform and tweak other planning tools in order to create an executable plan.

[1] https://github.com/DaemonIB/GPT-HTN-Planner
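
For a flavour of what HTN-style decomposition means, a toy sketch (the task names and METHODS table are made up for illustration; this is not the linked GPT-HTN-Planner code):

    # Compound tasks expand into subtasks until only primitive actions remain.
    METHODS = {
        "clean_kitchen": ["clear_counter", "wash_dishes", "wipe_surfaces"],
        "wash_dishes":   ["fill_sink", "scrub", "rinse", "dry"],
    }

    def decompose(task):
        if task not in METHODS:          # primitive action: keep as-is
            return [task]
        plan = []
        for subtask in METHODS[task]:    # compound task: expand recursively
            plan.extend(decompose(subtask))
        return plan

    print(decompose("clean_kitchen"))
    # ['clear_counter', 'fill_sink', 'scrub', 'rinse', 'dry', 'wipe_surfaces']

In the hybrid setup described above, the LLM's role would be proposing or tweaking these METHODS-style decompositions, while a classical planner checks ordering and preconditions and produces the executable plan.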


One at least has to admit that in many cases they already do an effective job at simulating planning. I'm not convinced there's a practical distinction between that and "truly planning".


I tend to like the "theory-of-mind" reasoning tasks as a test of an LLM's ability to reason about the "state" of other minds. It's challenging enough for humans.

https://www.hopsworks.ai/dictionary/theory-of-mind-tasks


No.



