The "Chain of Thought" Delusions (twitter.com/rao2z)
22 points by weinzierl 13 days ago | 36 comments





As far as I can tell, all this demonstrates is that there are tasks CoT doesn't work for (yet)? If that? Or just that the prompts are bad? Really, the Twitter replies have already made these points, so there's nothing to add. This just seems like research that was written conclusion-first.

> Ever since I came across the Chain of Thought (CoT) for LLMs paper, I wondered how it can possibly make sense given that there is little reason to believe that LLMs can follow procedures and unroll them for the current problem at hand. (After all, if they can do that, they should be able to do verification and general reasoning too--and we know they suck at those).

I don't follow. Why would procedure following imply general reasoning?


>> Far as I can tell, all this demonstrates is there's tasks that CoT doesn't work for (yet)?

What he says is that they can solve blocks world problems if you take them by the hand and show them how to do it, for every single problem you want them to solve. Which is not very useful at all. He's arguing that this is what CoT achieves: it takes the LLM by the hand and introduces domain knowledge that the user has at every step of the way, so that it's not the LLM that's solving anything but the person using the LLM with CoT.

Do read the linked paper, it goes over this in more detail. It's a bit unfortunate that Rao chooses to communicate through the very noisy medium of twitter, but that's the internet.


I think LLMs and CoT are powerful, but I agree with this description of what they do.

The important step is that they demonstrate that there are no architectural limits that keep the LLM from acting in these domains, "only" knowledge/planning ones. Once we have a big enough dataset of CoT prompts, the model "just" has to generalize CoT prompting, not CoT following. It decomposes the problem into two halves, and demonstrates that if one half (instruction generation) is provided, the other (instruction following) becomes very tractable.


Thanks for clarifying. That's not the claim that Rao is tweeting about, though. Rao, as far as I can tell, is responding to a stream of papers from the last couple of years claiming that LLMs can plan right now, whether using CoT or not. See the paper I linked in my reply to amenhotep (!) below for an example.

As to blocks world planning in particular, LLMs already have plenty of examples of block stacking problems - those are the standard motivational experiment in planning papers, like solving mazes is for Reinforcement Learning. Google returns 10 pages of results for "blocks world planning". If LLMs were capable of generalising as well as you expect they will one day, they should already be capable of solving block stacking problems without CoT and with no guidance.


Saying that LLMs should be able to do these problems without CoT is a bit like saying that humans should be able to write programs without thinking, though. CoT is fundamentally an architectural necessity owed to the finite cognitive effort possible per token. Whether it's CoT, QuietSTaR ( https://arxiv.org/abs/2403.09629 ), pause tokens ( https://arxiv.org/abs/2310.02226 ), or even just dots ( https://arxiv.org/abs/2404.15758 ), the thinking has to happen somewhere.
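For illustration, here is roughly what that difference looks like at the prompt level; a minimal sketch with a made-up blocks question, where the only change is giving the model room to emit intermediate tokens before committing to an answer:

    # Two prompts for the same (made-up) blocks question. The only difference
    # is that the second reserves space for intermediate "thinking" tokens.
    question = (
        "Blocks A, B and C are stacked A on B on C. "
        "Give a sequence of moves that ends with C on top of A."
    )

    direct_prompt = question + "\nAnswer with the move sequence only."

    cot_prompt = (
        question
        + "\nLet's think step by step about which block must move first,"
          " then give the final move sequence."
    )

    print(direct_prompt)
    print(cot_prompt)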

I think mostly, LLMs at the moment are incredibly uneven. LLM assistants can pull obscure knowledge out of nowhere one second and fail extremely basic reasoning the next. So just because some example happens a lot in the source material doesn't mean the LLM can learn it. IMO, that CoT works at all is more down to luck of the training set than any inherent capability of the LLM.


It struck me as really weird that he seemed to assume implicitly that his Blocksworld problems would be something an LLM should be able to do and presented the failures as shocking; one look at them immediately gives me the impression that they are the kind of problem LLMs tend to be pathologically bad at.

Probably he's better tuned in to the tone of the discourse than me but I feel like if you showed CoT helped with Blocksworld then that would be a huge and triumphant result, rather than something taken for granted that we must loudly disprove!


From the tweet:

>> And yet the practitioners of CoT swear that any and every problem can be solved with LLM by giving it a bit of a CoT help.

For example, see this arxiv paper:

Generalized Planning in PDDL Domains with Pretrained Large Language Models

https://arxiv.org/abs/2305.11014

Where the authors conclude:

In this work, we showed that GPT-4 with CoT summarization and automated debugging is a surprisingly strong generalized planner in PDDL domains.

The author of the tweet is an expert on planning and he's responding to that kind of thing.


I think these scenarios are compatible if we view LLMs as "fragile reasoners": they can occasionally reason, but it is an intermittent state that is easily disturbed. In such a world, we would expect to see that people who want LLMs to work can make them work with difficulty, and people who want or expect LLMs to fail can make them fail easily - or rather, maybe less adversarially phrased, one can generate examples of either outcome.

>> And yet the practitioners of CoT swear that any and every problem...

> Where the authors conclude:

> ...is a surprisingly strong generalized planner

Wtf is going on here?


LLMs are really good at system 1 thinking (fast, intuitive) but can't do system 2 (planning, calculating) at all. Just like in Go/Baduk: neural-net-based AIs were good at tactical play but fell apart against pro players with good long-term strategy, until NNs were combined with Monte Carlo Tree Search and we got AlphaGo, with great tactical and strategic play.

We need a good search strategy that fits LLMs before we can get AGI. Maybe Chain Of Thought can become such a strategy, but it always felt too clunky to me.


We already do beam search, which allows LLMs to backtrack on branches until they find the highest probability path. Discounting things like Toolformer, this is about as good as it's going to get.

My understanding is that the main limitation of LLMs is that they are single-pass, i.e. they process once and return their answer. If that answer is correct that's great, but if the answer doesn't make sense the LLM has no ability to retrace its steps. That is exactly why breaking a problem into steps is useful, because each step is easier to answer quickly. I think at some point people will combine the LLM with some kind of LLM-critic that will enable automatic splitting into steps or reruns of the LLM.
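A rough sketch of what such an LLM-plus-critic loop could look like; `generate` and `critique` here are hypothetical stand-ins for model calls, not any existing API:

    # Hypothetical sketch of an LLM-plus-critic loop, not an existing library.
    from typing import Callable, Optional

    def solve_with_critic(problem: str,
                          generate: Callable[[str], str],
                          critique: Callable[[str], Optional[str]],
                          max_rounds: int = 3) -> str:
        prompt = problem
        answer = ""
        for _ in range(max_rounds):
            answer = generate(prompt)
            objection = critique(answer)   # None means "looks fine"
            if objection is None:
                break
            # Feed the objection back in and rerun, since a single-pass
            # model cannot retrace its own steps.
            prompt = (f"{problem}\nPrevious attempt:\n{answer}"
                      f"\nProblem with it:\n{objection}\nTry again.")
        return answer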

> ...if the answer doesn't make sense the LLM has no ability to retrace its steps.

Beam search already exists; it precisely allows the model to backtrack on its past tokens and find the global maximum of probability, at least within a set depth/breadth. I think this is just a larger limitation of their shallow world model. Perhaps giving transformers registers will remedy this.
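For reference, a toy version of what beam search does over next-token probabilities. This is a generic sketch with a hard-coded stand-in for the model's next-token distribution, not any particular library's implementation:

    import math

    def next_token_probs(prefix):
        # Toy stand-in for a language model's next-token distribution.
        if not prefix:
            return {"the": 0.6, "a": 0.4}
        return {"cat": 0.5, "dog": 0.3, "<eos>": 0.2}

    def beam_search(beam_width: int = 4, max_len: int = 20) -> list[str]:
        beams = [([], 0.0)]  # (token sequence, log-probability)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == "<eos>":
                    candidates.append((seq, score))
                    continue
                for tok, p in next_token_probs(seq).items():
                    candidates.append((seq + [tok], score + math.log(p)))
            # Keep only the top `beam_width` partial sequences; low-probability
            # branches are effectively backtracked out of consideration.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return max(beams, key=lambda c: c[1])[0]

    print(beam_search())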


>> Consider STRIPS planning problems.

The thing to keep in mind in all these discussions about LLM planning and reasoning is that, when experts on planning and reasoning say that LLMs can't do planning and reasoning they have a very different, formal definition of those things in mind, than everybody else who hasn't studied them. Even some of the research papers on planning and reasoning with LLMs play very fast and loose with the use of those words and that's probably why there are so many positive results that later prove to be duds.

Now is probably a good time to re-educate computer scientists and AI researchers about planning and reasoning.

Wikipedia has a very short introduction to STRIPS planning:

https://en.wikipedia.org/wiki/Stanford_Research_Institute_Pr...

And here's a more comprehensive, but still short, introduction to automated planning and scheduling in general:

http://www.spacebook-project.eu/pubs/CogSci-13.pdf
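To make the formal sense of "planning" concrete, here is a toy STRIPS-style blocks world in Python: states are sets of facts, operators have preconditions and add/delete lists, and the planner is plain breadth-first search. It's only an illustration of the formalism, not a serious planner:

    from collections import deque
    from itertools import permutations

    BLOCKS = ["A", "B", "C"]

    def applicable_actions(state):
        """Yield (name, preconditions, add_list, delete_list) applicable in state."""
        for x, y in permutations(BLOCKS, 2):
            # stack: move x from the table onto y
            pre = {("clear", x), ("ontable", x), ("clear", y)}
            if pre <= state:
                yield (f"stack {x} on {y}", pre,
                       {("on", x, y)}, {("ontable", x), ("clear", y)})
            # unstack: move x off y onto the table
            pre = {("clear", x), ("on", x, y)}
            if pre <= state:
                yield (f"unstack {x} from {y}", pre,
                       {("ontable", x), ("clear", y)}, {("on", x, y)})

    def plan(initial, goal):
        """Breadth-first search over states; returns a list of action names."""
        frontier, seen = deque([(frozenset(initial), [])]), set()
        while frontier:
            state, actions = frontier.popleft()
            if goal <= state:
                return actions
            if state in seen:
                continue
            seen.add(state)
            for name, _pre, add, delete in applicable_actions(state):
                frontier.append((frozenset((state - delete) | add), actions + [name]))
        return None

    # A stacked on B stacked on C; goal: C ends up on A.
    initial = {("on", "A", "B"), ("on", "B", "C"), ("ontable", "C"), ("clear", "A")}
    print(plan(initial, goal={("on", "C", "A")}))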


This is exactly right.

Planning and reasoning have a long-standing meaning in AI, based on explicit knowledge representation and inference. It's a subfield of AI that predates LLMs by decades.

But LLMs don't do that kind of planning and reasoning, and CoT is very clearly not a reasoning technique in the AI sense.


Isn't the obvious approach to use the LLM to write the code or definition for the STRIPS system and then use that system to search for the plan?

Have they tried that?


And then there's the fact that LLMs are actually terrible at coding beyond advanced regurgitation, precisely because their single-pass architecture does not give them the Turing completeness to model a Turing machine, which may also be the reason they can't model the full chain of cause and effect needed to formulate a coherent plan...

Turing machines are an irrelevancy for programming, and LLMs can code ... not well, but at a higher level than regurgitation. ("Advanced" regurgitation can, of course, describe anything.)

There are people working on that! Google for LLM + PDDL.

There are big problems with hallucinations because LLMs are not smart enough to know when they're starting to make mistakes.

But there's lots of work in this area, and generally in different ways to nail neural and symbolic systems together.
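The general shape of that neuro-symbolic pipeline, as a rough sketch; the `llm` callable and the planner command line are placeholders for whatever model client and classical planner you use, not a specific published system:

    import subprocess
    from typing import Callable

    # Rough sketch of the LLM -> PDDL -> classical-planner pipeline.
    def plan_via_pddl(task_description: str, llm: Callable[[str], str]) -> str:
        # 1. Ask the LLM only to formalise the task, not to solve it.
        domain = llm("Write a PDDL domain for this task:\n" + task_description)
        problem = llm("Write the matching PDDL problem file:\n" + task_description)
        with open("domain.pddl", "w") as f:
            f.write(domain)
        with open("problem.pddl", "w") as f:
            f.write(problem)
        # 2. Hand the search to a sound and complete planner
        #    (placeholder command; substitute your planner of choice).
        result = subprocess.run(["my-planner", "domain.pddl", "problem.pddl"],
                                capture_output=True, text=True)
        # 3. The returned plan can then be checked by a validator,
        #    which is exactly what the LLM alone can't reliably do.
        return result.stdout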


[I am the one who wrote that tweet thread]. fwiw, a lot of the questions that are being raised in this thread have been answered in this talk I gave for Google/Deepmind LLM reasoning seminar a few days back: https://www.youtube.com/watch?v=hGXhFa3gzBs (Also, the preprint paper on the CoT study with all the prompts etc is available at https://arxiv.org/abs/2405.04776 )

I was under the impression that “chain of thought” was getting the LLM itself to plan out the steps before it solves the problem. Is that not true?

It is true, but when it's not zero-shot there's a possibility that you are introducing additional information with the example. As well, depending on the study, there have been issues with effectively 'halting' assistance, where as long as the model isn't getting it right it keeps going until it does, which is effectively a post-selection bias.

But frankly this post reads like someone that doesn't understand much about CoT in the first place, let alone the various methods that have improved upon it since.

It reads like one of the ad nauseam "look at me use this tool poorly, clearly it's a poor tool" examples.

In general, I've noticed mathematicians, computer scientists, and engineers tend to be very poor at evaluating LLMs because they just aren't very good at correctly identifying the scope and depth of what was modeled in the training data in the first place.

It's getting boring watching people foolishly try to evaluate things like "stack these clear blocks" (because that's something I regularly saw in social media posts) while glossing over or actively sabotaging the unbelievable modeling/simulating of much more complex and higher order critical reasoning behind various applications of things like empathy or psychological modeling.

For anyone reading this who wants to have a fun project, try creating two versions of a set of word puzzles with the same underlying logic structure. One where the problem and solution are using engineering-ish language like "clear blocks" and another using emotional/social language like "grieving friends."
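A sketch of what that experiment could look like, with the same underlying ordering constraint expressed once in "engineering" language and once in social language; both prompts are made up purely for illustration:

    # Two prompts with the same underlying logic (an ordering constraint
    # plus one exclusion), phrased in different registers.
    blocks_version = (
        "Three clear blocks A, B and C must be stacked in one stack so that "
        "A is somewhere above C, and B is not on the table. "
        "Give one valid stacking order, bottom to top."
    )

    social_version = (
        "Three grieving friends, Ana, Ben and Cleo, must speak at a memorial "
        "so that Ana speaks sometime after Cleo, and Ben does not go first. "
        "Give one valid speaking order."
    )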

Even as we wait for the next generation of models, there's a lot of people criminally underestimating and underutilizing the current models because they can't look beyond their own specialized domain languages.


CoT works because it's an instruction the LLM has encountered in its instruction-tuning (and, to a smaller extent, in its original training data). You could fine-tune a base model to respond to every question by first outlining steps (sub-prompts, if you will). Not useful for a general-purpose chat model. But useful for some types of tasks.
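For instance, a fine-tuning example in that "outline steps first" style might look like the made-up record below:

    # A made-up instruction-tuning record where the response always leads
    # with the outlined steps before the final answer.
    example = {
        "prompt": "How many minutes are there in 2.5 days?",
        "response": (
            "Steps:\n"
            "1. Hours in 2.5 days: 2.5 * 24 = 60.\n"
            "2. Minutes in 60 hours: 60 * 60 = 3600.\n"
            "Answer: 3600 minutes."
        ),
    }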

LLMs are mostly trained on Internet posts, so they are Internet simulators.

If an Internet post explains its reasoning step by step, it reaches a correct conclusion more often than other posts. Therefore LLMs that explain their reasoning are also more likely to reach the correct solution.


Find me good Internet posts that use verification and general reasoning. They are rare. The Internet posts I read suck at verification and general reasoning.

Therefore LLMs will suck at verification and general reasoning until we refine or augment our datasets.


Correct. Internet posts contain explanations, if anything, rather than reasoning, and that's the wrong model for an LLM: an explanation starts with the conclusion and then describes in hindsight the steps that were already taken, invisibly, to reach it. It's a retrospective trace of our own chain of thought, not the chain of thought itself. We can only write explanations because our actual reasoning is, invisibly, present in our mental context window. But those are exactly the tokens that would actually help an LLM.

(That's also why we can explain answers backwards as easily as forwards.)


I always objected to the grumpy old people who complained "this has been covered and answered many times, so let's close this thread."

Having said that, this has been covered and answered many times. Can we please create a responder for any HN question that can be answered with "LLMs are Internet simulators, and they do that because that is how Internet posts are."


I had assumed this works because it points the model towards the more ELI5-like explanations in the training data, rather than imbuing it with any sort of reasoning.

If it works... then the why is somewhat secondary, though still interesting.


If it works for easy things but not for hard things, then that's useless. From the tweet:

>> As you can see, the ease of giving CoT advice worsens drastically as we go from domain independent to domain specific to goal class specific to lexicographic goal-specific.

So he's arguing that CoT needs a lot of work from the user and it can only solve easy problems anyway.


Your inability to carefully prompt an LLM does not indicate an LLM's lack of ability to solve a problem.

Can we just admit already that LLMs were all hype and no substance? I still see articles written as though ChatGPT is doing something groundbreaking when all it does is regurgitate its training data. The only good use (at least as of now) of LLMs is GitHub Copilot (edit: Grammarly also is useful here). Everything else has been a severe let down imo.

Consider: you might be wrong about LLMs regurgitating their training data.

Maybe. But I have yet to see (for example) an LLM write a novel. Or build out a web framework. GitHub Copilot + Grammarly are the only two useful apps it has, at least so far. Feel free to convince me otherwise!

It seems like you're judging LLMs by human standards. If it could do those things, it would almost definitionally be AGI already. LLMs are "just" pattern-based token prediction engines with large scale. The interesting thing is that it looks like we should be able to build a general intelligence from that - like we're getting at least some facets of intelligence from that alone, that we previously had no idea how to produce - not that we're already there!

Treat it as a text-based addon to your brain, IMO. A human brain has many components necessary for general intelligence. Some of those components can now be augmented by a giant neural net; others still benefit from human execution. All the LLM programming work I do is based on a cooperative dialog, where I fill in the gaps where the LLM doesn't have an appropriate pattern. Which are many. (But getting less.)

At any rate, the idea that LLMs regurgitate their training data is completely unrelated to this and afaict one of those "true but trivial, or important but false" ones. (True in the sense that humans also represent a reflection of their inputs; false in the sense that what's going on is definitely more than a collage of samples.)


I would bet you X donuts that if you give me a novel and I’m allowed access to a corpus of all novels and other writing that running a similarity search on the embedded content of each would turn up a shocking number of matches. I’d bet dollars instead of donuts on the equivalent experiment with a web framework.
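A sketch of that experiment, assuming an off-the-shelf sentence-embedding model; the model choice and the similarity threshold are arbitrary assumptions, not a claim about what would actually turn up:

    # Embed passages of a generated novel and of a reference corpus,
    # then look for near-duplicates by cosine similarity.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def near_matches(generated: list[str], corpus: list[str], threshold: float = 0.9):
        g = model.encode(generated, normalize_embeddings=True)
        c = model.encode(corpus, normalize_embeddings=True)
        sims = g @ c.T  # cosine similarity, since embeddings are normalized
        return [(generated[i], corpus[j], float(sims[i, j]))
                for i, j in zip(*np.where(sims > threshold))]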

[flagged]


Have you tried Chrome or Firefox without any add-ons and a fresh profile?


