Understanding the Limitations of Mathematical Reasoning in Large Language Models (arxiv.org)
73 points by hnhn34 3 hours ago | 71 comments





> we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning

I'd offer a simpler explanation: Tokenization.

If you tokenize "12345 * 27271" you will get the following:

  "123", "45", " *", " ", "272", "71"
The statistical likelihood that any of these tokens predicts any of the others is completely meaningless in the context of simple arithmetic.
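
If you want to check the split yourself, something like OpenAI's tiktoken library works; the exact pieces depend on which encoding the model uses, so treat the split above as one possibility rather than the split every model sees:

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  tokens = enc.encode("12345 * 27271")
  print([enc.decode([t]) for t in tokens])   # the sub-word pieces the model actually sees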

You can argue that this is where tool use comes in (and I would be inclined to agree), but I don't think this bodes well for "genuine logical reasoning".


I respectfully disagree.

While tokenization certainly plays a role in how language models process input, it's simplistic to attribute the challenges in mathematical reasoning solely to tokenization.

SOTA language models don't just rely on individual token predictions, but build up contextual representations across multiple layers. This allows them to capture higher-level meaning beyond simple token-to-token relationships. If this weren't the case, models wouldn't work at all in anything but the most utterly simplistic scenarios.

The decline in performance as complexity increases might be due to other factors, such as:

- Limitations in working memory or attention span

- Difficulty in maintaining coherence over longer sequences

- Challenges in managing multiple interdependent logical constraints simultaneously (simply due to the KQV matrices being too small)

And in any case, I think OpenAI’s o1 models are crushing it in math right now. The iterative, model-guided CoT approach seems to be able to handle very complex problems.


>And in any case, I think OpenAI’s o1 models are crushing it in math right now.

My man, it cannot solve even the simplest problems which it hasn't seen the solution to yet, and routinely makes elementary errors in simple algebraic manipulations or arithmetic! All of this points to the fact that it cannot actually perform mathematical or logical reasoning, only mimic it superficially if trained on enough examples.

I challenge you to give it even a simple, but original, problem to solve.


I would say the more variables you give it, the more the probability drifts for each of the facts it has to hold. Maybe LLMs still don't have the ability to ignore useless stuff you add to the prompt.

Nanda, et al. successfully recovered the exact mechanism through which a transformer learned to carry out modular addition. [0] Transformers are all about the training data, and we will increasingly learn that structuring the order in which data is learned matters a lot. But it's clear that transformers are absolutely capable of encoding generalized solutions to arithmetic.

Given the right tokenization scheme and training regimen, we can absolutely create LLMs which have statistically sound arithmetic capabilities. I still wouldn't trust a stochastic model over the algorithmic certainty of a calculator, but what's more important for mathematicians is that these models can reason about complex problems and help them break new ground on hard mathematical problems by leveraging the full statistical power of their weights.

[0] https://arxiv.org/abs/2301.05217
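
For context, the task in [0] is tiny and fully enumerable, which is what made the mechanistic analysis possible. A rough sketch of the setup (the modulus 113 and the train fraction are what I recall; treat them as assumptions):

  from itertools import product
  import random

  P = 113   # modulus I recall from the paper; treat as an assumption

  # The whole task is (a, b) -> (a + b) mod P, so the full dataset is only P*P pairs.
  data = [((a, b), (a + b) % P) for a, b in product(range(P), repeat=2)]

  random.seed(0)
  random.shuffle(data)
  cut = int(0.3 * len(data))            # train on a fraction, hold out the rest
  train, held_out = data[:cut], data[cut:]

  # Trained long enough, a small transformer first memorizes `train` and later
  # generalizes ("groks") to `held_out`, using trigonometric structure in its
  # embeddings, which is the mechanism recovered in [0].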


Wouldn't a slight change in tokenization (say, mapping single digits to single tokens) help with this specific challenge?

Aren’t coding copilots based on tokenizing programming language keywords and syntax? That seems to me to be domain specific tokenization (a very well defined one too — since programming languages are meant to be tokenizable).

Math is a bit trickier since most of the world’s math is in LaTeX, which is more of a formatting language than a syntax tree. There needs to be a conversion to MathML or something more symbolic.

Even English word tokenization has gaps today. Claude Sonnet 3.5 still fails on the question “how many r’s are there in strawberry”.
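
The gap is easy to see if you compare what the model is fed with the character-level view. A tiny sketch (tiktoken is only a stand-in here, since Anthropic's tokenizer isn't the same, but the principle holds):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  print([enc.decode([t]) for t in enc.encode("strawberry")])  # sub-word pieces, no individual letters
  print("strawberry".count("r"))                              # character-level ground truth: 3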


Context-specific tokenization sounds a lot like old fashioned programming.

The LLM will know 123 and 45 form a contiguous number, just like humans can tell that if you say 123, then a slight pause, then 45, it's a single number.

It's just so dissonant to me that the tokens in mathematics are the digits, and not bundles of digits. The idea of tokenization makes sense for taking the power off letters; it provides language agnosticism.

But for maths, it doesn't seem appropriate.

I wonder what the effect of forcing tokenization for each separate digit would be.
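
A rough sketch of what per-digit pre-tokenization could look like (just an illustration of the idea, not how any particular production tokenizer does it; as I understand it, some newer tokenizers do split numbers into single digits or short groups):

  import re

  def pre_split_digits(text):
      # Emit every digit as its own piece; leave runs of non-digits intact.
      return re.findall(r"\d|\D+", text)

  print(pre_split_digits("12345 * 27271"))
  # ['1', '2', '3', '4', '5', ' * ', '2', '7', '2', '7', '1']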


It won't 'see' [123, 45] though, but [7633, 2548], or rather sparse vectors that are zero at each but the 7634th and 2549th position.
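
A minimal sketch of that representation (using numpy; the specific ids are the hypothetical ones from above, not real vocabulary entries):

  import numpy as np

  vocab_size = 50257                    # illustrative vocabulary size
  ids = [7633, 2548]                    # hypothetical ids standing in for "123" and "45"

  one_hot = np.zeros((len(ids), vocab_size))
  one_hot[np.arange(len(ids)), ids] = 1.0        # each row is zero except at that token's index

  # In practice the model never materializes these; it does an embedding lookup,
  # which is equivalent to one_hot @ embedding_matrix:
  embedding_matrix = np.random.randn(vocab_size, 768)   # 768 is an illustrative width
  vectors = embedding_matrix[ids]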

I think that as long as the attention mechanism has been trained on each possible numerical token enough, this is true. But if a particular token is underrepresented, it could potentially cause inaccuracies.

These results are very similar to the "Alice in Wonderland" problem [1, 2], which was already discussed a few months ago. However the authors of the other paper are much more critical and call it a "Complete Reasoning Breakdown".

You could argue that the issue lies in the models being in an intermediate state between pattern matching and reasoning.

To me, such results indicate that you can't trust any LLM benchmark results related to math and reasoning when you see that changing the characters, numbers or sentence structure in a problem alters the outcome by more than 20 percentage points.

[1] https://arxiv.org/html/2406.02061v1

[2] https://news.ycombinator.com/item?id=40811329


Someone (https://x.com/colin_fraser/status/1834336440819614036) shared an example that I thought was interesting relating to their reasoning capabilities:

A man gets taken into a hospital. When the doctor sees him, he exclaims "I cannot operate on this person, he is my own son!". How is this possible?

All LLMs I have tried this on, including GPT o1-preview, get this wrong, assuming that the riddle relates to a gendered assumption about the doctor being a man, while it is in fact a woman. However, in this case, there is no paradox - it is made clear that the doctor is a man ("he exclaims"), meaning he must be the father of the person being brought in. The fact that the LLMs got this wrong suggests that they find a similar reasoning pattern and then apply it. Even after additional prodding, a model continued making the mistake, arguing at one point that it could be a same-sex relationship.

Amusingly, when someone on HN mentioned this example in the O1 thread, many of the HN commentators also misunderstood the problem - perhaps humans also mostly reason using previous examples rather than thinking from scratch.


> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

Although we would like AI to be better here, the worse problem is that, unlike humans, you can't get the LLM to understand its mistake and then move forward with that newfound understanding. While the LLM tries to respond appropriately and indulge you when you point out the mistake, further dialog usually exhibits noncommittal behavior by the LLM, and the mistaken interpretation tends to sneak back in. You generally don't get the feeling of "now it gets it"; instead it tends to feel more like someone with no real understanding (but very good memory of relevant material) trying to bullshit-technobabble around the issue.


That is an excellent point! I feel like people have two modes of reasoning - a lazy mode where we assume we already know the problem, and an active mode where something prompts us to actually pay attention and actually reason about the problem. Perhaps LLMs only have the lazy mode?

I'm sure we fall back on easy/fast associations and memories to answer. It's the way of least resistance. The text you quote bears more than a superficial similarity to the old riddle (there's really nothing else that looks like it), but that version also stipulates that the father has died. That adds "gendered" (what an ugly word) information to the question, a fact which is missed when recalling this particular answer. Basically, LLMs are stochastic parrots.

How people don’t see the irony of commenting “stochastic parrots” every time LLM reasoning failure comes up is beyond me.

There are ways to trick LLMs. There are also ways to trick people. If asking a tricky question and getting a wrong answer is enough to disprove reasoning, humans aren’t capable of reasoning, either.


> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

We do, but we can generalize better. When you exchange "hospital" with "medical centre" or change the sentence structure and ask humans, the statistics would not be that different.

But for LLMs, that might make a lot of difference.


Both Claude-3.5 and o1-preview nail this problem

"Let's think through this step-by-step:

1. Alice has 3 brothers
2. Alice has 2 sisters
3. We need to find out how many sisters Alice's brother has

The key here is to realize that Alice's brothers would have the same sisters as Alice, except they would also count Alice as their sister.

So, Alice's brothers would have:
- The 2 sisters Alice has
- Plus Alice herself as a sister

Therefore, Alice's brothers have 3 sisters in total."


And here lies the exact issue. Single tests don’t provide any meaningful insights. You need to perform this test at least twenty times in separate chat windows or via the API to obtain meaningful statistics.

For the "Alice in Wonderland" paper, neither Claude-3.5 nor o1-preview was available at that time.

But I tested them as well a few weeks ago, with the problem translated into German, and both models also achieved a 100% success rate.

However, when I add irrelevant information (My mother ...), Claude's success rate drops to 85%:

"My mother has a sister called Alice. Alice has 2 sisters and 1 brother. How many sisters does Alice's brother have?"


Your experience makes me think that the reason the models got a better success rate is not because they are better at reasoning, but rather because the problem made it to their training dataset.

We don't know. The paper and the problem were very prominent at the time. Some developers at Anthropic or OpenAI might have included it in some way, either as a test or as a task to improve the CoT via reinforcement learning.

Absolutely! It's the elephant in the room with these ducking "we've solved 80% of maths olympiad problems" claims!

We do have chatbot arena which to a degree already does this.

I like to use:

"Kim's mother is Linda. Linda's son is Rachel. John is Kim's daughter. Who is Kim's son?"

Interestingly I just got a model called "engine test" that nailed this one in a three sentence response, whereas o1-preview got it wrong (but has gotten it right in the past).


My problem with this puzzle is: how do you know that Alice and her brothers share both parents?

Is it not correct English to call two people who share only one parent sisters or brothers?

I guess I could be misguided by my native Norwegian, where you have to prefix the word with "hel" (full) or "halv" (half) if you want to specify the number of shared parents.


They would usually be called "half-sisters". You could call them "sisters" colloquially, but given it's presented as a logic question I think it's fine to disregard that.

It is pretty much the same in English. Unqualified would usually mean sharing both parents but could include half- or step-siblings.

I am not a native English speaker. Can you reformulate the problem for me, so that every alternative interpretation is excluded?

Alice has N full sisters. She also has M full brothers. How many full sisters does Alice’s brother have?

I won't take a strong stance on whether or not LLMs actually do reasoning, but I will say that this decrease in performance is similar to what I see in college freshmen (I'm currently teaching a calculus course in which almost half of the students took AP calc in high school). They perform well on simple questions. Requiring students to chain multiple steps together, even simple steps, results in decreased accuracy and higher variance (I have no data on whether this decrease is linear or not; the paper assumes the decrease should be linear in the number of steps). We see similar results when adding unrelated statements to a problem: many students are trained to make sure to use all given information in solving a problem, because if you leave out something that the instructor gives you, then you probably forgot to do something important.

So while I don't take a stance on whether what an LLM does should be considered reasoning, I do think that SOTA LLMs like GPT-4o perform about as well as high school graduates in America with average intelligence. In other words, average Americans exhibit similar limitations on their reasoning as good LLMs. Which on the one hand is a little disappointing to me in terms of the human performance, but is kind of good news for LLMs: they aren't doing graduate-level research, but they are already capable of helping a large portion of the population.


> In other words, average Americans exhibit similar limitations on their reasoning as good LLMs.

It's not even clear this is a good example of "reasoning". You can progress all the way through multi-variable calculus with just decent pattern-matching, variable-substitution, and rote memorization of sufficient lists of rules. I imagine for "reasoning" ability to apply you need to be able to detect incoherency and reject an approach—and incoherency detection seems to be a big missing ingredient right now (...which many humans lack, too!).

On the other hand, any such ability would cripple a chatbot's ability to answer questions about the real world, since our world is characterized (via description with informal language) by incoherent and contradictory concepts that can only be resolved through good-faith interpretation of the questioner. A large mark of intelligence (in the colloquial sense, not the IQ sense) is the ability to navigate both worlds.


Not to disparage the American school system (my country's is worse), but it's very much easy mode. I know that not everyone is suited to academic excellence, but it's definitely easier to learn when young. I do believe too much hand-holding actively harms learning.

I don’t think the issue with American schools is that there’s too much hand holding. If anything, it’s the opposite; teachers at drastically underfunded schools don’t have any time to help the students of their 50 person class through the confused curriculum.

This. It's like when I hear interviews of PhDs talking about AI and they say something like "AI will be smarter than humans". I'm like: really? Where have you been all this time? Do you smart people ever leave your labs and go see the real world? LLMs are already smarter than the huge majority of humans on this planet. What are you talking about?

This must be some bizarre definition of “smarter”.

Very interesting, and aligns with what I would expect in terms of the type of "thinking" LLMs do. I think that it's also the type of "thinking" that will let a student pass most school courses, except of course for the ones where the teacher has taken the time to pose test questions that aren't as amenable to pattern matching. (Hard, but I assume most readers here are familiar with leetcode style interviews and what makes questions of that kind higher or lower quality for assessing candidates)

(And yes, I know people are hard at work adding other types of thinking to work along with the pure language models)


If the argument is that LLMs are bad at reasoning because they are easily distractible and the results vary with modifications in the question, one should be reminded of the consistency and distractability of humans.

Why? LLMs are supposedly better than humans (as many comments claim in this thread).

I test LLMs in a similar way. For example, there is a well-known logic puzzle where a farmer tries to cross a river with a cabbage, a goat, and a wolf. LLMs have been able to solve that since at least GPT-2; however, if we replace the wolf with a cow, gpt-o does correctly infer the rules of the puzzle but can't solve it.

I've been using this as my first question to any new LLM I try and I'm quite sure nothing before GPT-4 even got close to a correct solution. Can you post a prompt that GPT-2 or 3 can solve?

What happens if you sit down and invent a logic game that is brand new and has never been documented before anywhere then ask an LLM to solve it? That, to a layman like me, seems like a good way to measure reasoning in AI.

I think the problem is inventing new structures for logic games. The shape of the problem ideally would be different than any existing puzzle, and that's hard. If a person can look at it and say "oh, that's just the sheep-wolf-cabbage/liar-and-truthteller/etc. problem with extra features" then it's not an ideal test because it can be pattern-matched.

This is being done, but the difficulties are: (1) How do you assess that it is really brand-new and not just a slight variation of an existing one? (2) Once you publish it, it stops being brand-new, so its lifetime is limited and you can’t build a longer-term reproducible test out of it.

I've found that the River Crossing puzzle is a great way to show how LLMs break down.

For example, I tested Gemini with several versions of the puzzle that are easy to solve because they don't have the restrictions such as the farmer's boat only being able to carry one passenger/item at a time.

Ask this version, "A farmer has a spouse, chicken, cabbage, and baby with them. The farmer needs to get them all across the river in their boat. What is the best way to do it?"

In my tests the LLMs nearly always assume that the boat has a carry-restriction and they come up with wild solutions involving multiple trips.
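
One way to get ground truth for arbitrary variants (different animals, different conflict pairs, bigger boats) is a tiny brute-force search rather than an LLM. A rough sketch in Python, with the conflict pairs as an explicit parameter since that's exactly what changes between variants:

  from collections import deque
  from itertools import combinations

  def solve(items, conflicts, capacity=1):
      # Breadth-first search over river-crossing states.
      # items:     things the farmer must ferry across
      # conflicts: pairs that must not be left together without the farmer
      # capacity:  how many items fit in the boat besides the farmer
      items = frozenset(items)

      def safe(group):
          return not any(a in group and b in group for a, b in conflicts)

      start = (items, True)              # (items on the left bank, farmer on left?)
      queue = deque([(start, [])])
      seen = {start}
      while queue:
          (left, farmer_left), path = queue.popleft()
          if not left and not farmer_left:
              return path                # everything (and the farmer) is on the right bank
          here = left if farmer_left else items - left
          for k in range(capacity + 1):
              for cargo in combinations(here, k):
                  if not safe(here - set(cargo)):
                      continue           # can't leave these behind unsupervised
                  new_left = left - set(cargo) if farmer_left else left | set(cargo)
                  state = (new_left, not farmer_left)
                  if state not in seen:
                      seen.add(state)
                      queue.append((state, path + [cargo]))
      return None

  # Classic version: 7 crossings. Swap the wolf for a cow and adjust the
  # conflict pairs to match whatever rules your variant is supposed to have.
  print(solve({"wolf", "goat", "cabbage"}, [("wolf", "goat"), ("goat", "cabbage")]))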


Meaning it's just a glorified Google.

I'm scared of the cows around you if they eat goats

It would be interesting if this kind of work could ever be extended to show the limitations of mathematical reasoning in animals and humans.

For example, just as a dog will never understand a fourier transform, there are likely ideas that humans cannot understand. If we know what our limits are, I wonder if we could build machines that can reason in ways we aren't capable of?


I think it is a naive assumption that such a limitation even exists ("exists" in a sense that it is actually useful, by being consistent and somewhat simple to describe).

We investigated similar ideas for language (=> Noam Chomsky), where we tried to draw clear, formalized limits for understanding (to show e.g. how human capabilities contrast with animals). The whole approach failed completely and irredeemably (personal opinion), but researching it was far from useless to be fair.


As the human brain is finitely bounded in space and time, any idea that can't be compressed or represented by condensed notation, that is "larger" than what 100B cells + 100T synapses can represent, or whose integration into a human brain would take longer than 150 years, would be impossible for a normal human to contemplate.

I'm curious about what happens with the no-op dataset if you include in the prompt that the questions may contain irrelevant information.

It seems incredibly easy to generate an enormous amount of synthetic data for math. Is that happening? Does it work?

They did that for o1 and o1-preview. If you read the paper or do your own testing with that SOTA model, you will see that the paper is nonsense. With the best models, the problems they point out are mostly marginal, like one or two percentage points when changing numbers, etc.

They are taking poor performance of undersized models and claiming that proves some fundamental limitation of large models, even though their own tests show that isn't true.


You choose to ignore Figure 8, which shows an 18% drop when simply adding an irrelevant detail.

In the other test the perturbations aren’t particularly sophisticated and modify the problem according to a template. As the parent comment said this is pretty easy to generate test data for (and for the model to pattern match against) so maybe that is what they did.

A better test of “reasoning” would be to isolate the concept/algorithm and generate novel instances that are completely textually different from existing problems to see if the model really isn’t just pattern matching. But we already know the answer to this because it can’t do things like arbitrary length multiplication.


Data is the wrong approach to developing reasoning. We don't want LLMs to simply memorize 3x3 = 9; we want them to understand that 3 + 3 + 3 = 9, therefore 3x3 = 9 (obviously a trivial example). If they have developed reasoning, very few examples should be needed.

The way I see it reasoning is actually the ability of the model to design and train smaller models that can learn with very few examples.


> If they have developed reasoning very few examples should be needed.

Yes, once the modules for reasoning have converged, it will take very few examples for it to update to new types of reasoning. But to develop those modules from scratch requires large amounts of examples that overtax its ability to memorize. We see this pattern in the "grokking" papers. Memorization happens first, then "grokking" (god I hate that word).

It's not like humans bootstrap reasoning out of nothing. We have a billion years of evolution that encoded the right inductive biases in our developmental pathways to quickly converge on the structures for reasoning. Training an LLM from scratch is like recapitulating the entire history of evolution in a few months.


Yes, this is how o1 was trained. Math and programming, because they are verifiable.

This is also why o1 is not better at English. Math skills transfer to general reasoning but not so much to creative writing.


In which distribution? Like school math, or competitions, or unsolved problems? FWIW I think one and three are probably easier to generate synthetically. It's harder to bound the difficulty, but I think the recent David Silver talk implies it doesn't matter much. Anyway, there's some work on this you can find online; they claim to improve GSM8K and MATH a bit but not saturate them. Idk how useful it is in practice.

I don’t think so. The data is biased towards being very general.

I don't understand the idiocracy we live in. It is beyond obvious not just that the stock market is a bubble, but ESPECIALLY that the AI-related stocks are a massive bubble. When it pops, and it will, it is going to be very, very ugly, yet people keep pouring in. As Sabine said, it's starting to look like particle physics, where they keep asking for bigger colliders: just because you have a bigger collider, if your methodology is flawed you aren't gonna get any more significant returns.

Eventually they will run out of exponential cash to pour in, and investors will start asking questions. Stocks are already valued at 60x+ their earnings. Whenever it pops, you don't want to be the one who bought the top.

Guess it's still gonna take a while more for the layman to realize the issues with LLMs, but it'll happen.


>if your methodology is flawed you aren't gonna get any more significant returns.

The problem with this statement is that predictions made about scaling 5 years ago have held true[1]. We keep adding parameters, adding compute, and the models keep getting more capable.

The flaws of LLMs from 2024 are not what is relevant, just like the flaws of LLMs from 2021 were not relevant. What is relevant is the rate of change, and the lack of evidence that things won't continue on this steep incline. Especially if you consider that GPT-4 was sort of a preview model that motivated big money to make ungodly investments to see how far we can push this. Those models will start to show up over the next 2 years.

If they break the trend and the scaling flops, then I think a lot of air is gonna blow out of the bubble.

[1] https://arxiv.org/pdf/2001.08361
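
For reference, the headline result of [1] is that test loss follows simple power laws in model size and data; the constants and exponents below are empirical fits from that paper, which I'm omitting here:

  L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
  \qquad
  L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}

where N is the parameter count and D is the dataset size.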


We added a lot of parameters.

We added a LOT of data.

The resulting models have become only slightly better. And they still have all of their old problems.

I think this is proof that scaling doesn't work. It's not like we just doubled the sizes; they increased by a lot, but the improvements get smaller each time. And they've already run out of useful data.


They are very literally asking for trillions and even nuclear-powered data centers; pretty sure we've gotten to the point where it's not sustainable.

Those are roadmap items being asked for, but the next gen models are already in training. If they keep moving along the same trend line, like all the previous models have, then they probably will be able to find the investors for the next next gen. Even if it's a few trillion dollars and a few nuclear power plants.

This doesn't even factor in the tech inertia. We could stop making new models today, and it would probably be 4-5 years before integration slowed down. Google still hasn't even put Gemini in their home speakers.


I honestly can't see why LLMs should be good at this sort of thing. I am convinced you need a completely different approach. At the very least you mostly only want one completely correct result. Good luck getting current models to do that.

LLMs aren't totally out of scope of mathematical reasoning. LLMs roughly do two things: move data around, and recognize patterns. Reasoning leans heavily on moving data around according to context-sensitive rules. This is well within the scope of LLMs. The problem is that general problem solving requires potentially arbitrary amounts of moving data, but current LLM architectures have a fixed number of translation/rewrite steps they can perform before they must produce output. This means most complex reasoning problems are out of bounds for LLMs, so they learn to lean heavily on pattern matching. But this isn't an intrinsic limitation of LLMs as a class of computing device, just a limit of current architectures.

I'm a math PhD student at the moment and I regularly use o1 to try some quick calculations I don't feel like doing. While I feel like GPT-4o is so distilled that it just tries to know the answer from memory, o1 actually works with what you gave it and tries to calculate. It can be quite useful.

I'm curious, what kind of quick calculations do you usually use LLMs for?

Edited for clarity


Just earlier today I wanted to check if exp(inx) is an orthonormal basis on L^2((0, 1)) or if it needs normalization. This is an extremely trivial one though. Less trivially I had an issue where a paper claimed that a certain white noise, a random series which diverges in a certain Hilbert space, is actually convergent in some L^infinity type space. I had tried to use a Sobolev embedding but that was too crude so it didn't work. o1 correctly realized that you have to use the decay of the L^infinity norm of the eigenbasis, a technique which I had used before but just didn't think of in the moment. It also gave me the eigenbasis and checked that everything works (again, standard but takes a while to find in YOUR setting). I wasn't sure about the normalization so again I asked it to calculate the integral.
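
For what it's worth, that particular check is a one-line integral, and the answer depends on the scaling: the 2π-rescaled exponentials are orthonormal on (0,1) with no extra factor, while the unrescaled e^{inx} aren't even orthogonal there (on (0, 2π) they need a 1/\sqrt{2\pi} factor):

  \int_0^1 e^{2\pi i n x}\, \overline{e^{2\pi i m x}}\, dx
    = \int_0^1 e^{2\pi i (n-m) x}\, dx
    = \begin{cases} 1 & n = m \\ 0 & n \neq m \end{cases}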

This kind of adaptation to your specific setting, instead of just spitting out memorized answers in common settings, is what makes o1 useful for me. Now again, it is often wrong, but if I am completely clueless I like to watch it attempt things and I can get inspiration from that. That's much more useful than seeing a confident wrong answer like 4o would give.



That makes the whole conclusion obviously false.

I don't really understand why, but I think we are going to see total denial from a significant percentage of the population all the way up to and past the point where many average mathematicians and software engineers cannot in any way compete with AI.

We already are reportedly getting pretty close with o1 (not o1-preview).

There are also new paradigms for machine learning and hardware in the pipeline that will continue to provide orders of magnitude performance gains and new capabilities in the next 5-10 years.

Many people still claim that "self driving cars don't exist", in so many words, even though they are deployed in multiple cities.



