Apple study proves LLM-based AI models are flawed because they cannot reason (appleinsider.com)
34 points by greghn 5 hours ago | 19 comments






The benchmark itself is interesting, but note that the authors didn't test it on Claude 3.5. In my experience, for tasks that need intelligence, Claude 3.5 is much better than GPT or Llama (which the authors used). So the conclusion is not that "LLM-based AI models are flawed because they cannot reason" but at most that "GPT o1-mini and Llama3-8B are flawed because they cannot reason". For instance, the authors run this test:

""" We add seemingly relevant statements to the questions that are, in fact, irrelevant to the reasoning and conclusion. However, the majority of models fail to ignore these statements and blindly convert them into operations, leading to mistakes prompt: Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

o1-mini response: Oliver picks kiwis over three days as follows:

- Friday: 44 kiwis

- Saturday: 58 kiwis

- Sunday: Double the number he picked on Friday, which is 2 × 44 = 88 kiwis

However, on Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday’s kiwis) - 5 (smaller kiwis) = 83 kiwis.

Now, summing up the kiwis from all three days: 44 (Friday) + 58 (Saturday) + 83 (Sunday) = 185 kiwis.

Oliver has a total of 185 kiwis.

"""

I asked Claude 3.5 exactly the same question, and the response was:

"""

Let's break this problem down step by step:

1. Friday's kiwis:
   Oliver picked 44 kiwis on Friday.

2. Saturday's kiwis:
   Oliver picked 58 kiwis on Saturday.

3. Sunday's kiwis:
   Oliver picked double the number of kiwis he did on Friday.
   Friday's kiwis = 44
   Double of Friday's kiwis = 44 × 2 = 88 kiwis
   However, 5 of these were smaller than average, but they still count in the total.

4. Total kiwis:
   Friday's kiwis + Saturday's kiwis + Sunday's kiwis
   = 44 + 58 + 88
   = 190 kiwis

Therefore, Oliver has a total of 190 kiwis.

"""


Interesting. I would have thought that the training set (basically the whole internet, AIUI) would have included enough "teacher's version" exams with word problems containing intentionally distracting extra information that the models would have learned to ignore that sort of thing.

This sounds like they're inspecting existing models. Maybe a model trained specifically on "word problem" question-answer pairs (the sort of thing that shows up on tests and always pretends that the complications a domain expert would know about simply don't exist) would do better?


They hid the results for o1-preview away in the Appendix, but it does not drop beyond the margin of error on 4 of the 5 modified benchmarks. On the last one, where they add "seemingly relevant but ultimately irrelevant information to problems", it drops to 77%. Now, I'm willing to bet this is within human baselines, but either way, researchers really need to start including human baselines in these kinds of papers.

>>> I'm willing to bet this is within human baselines but either way, researchers really need to start including human baselines in these kinds of papers.

They should indeed, but is that an "I've never seen this before" human baseline, one with prior exposure ("what do you mean I got that wro... oh, I see what you did there"), or one with explicit instruction?


I mean baselines that match how the LLMs are being tested in the paper in question (as closely as possible). In this case, average scores on the unaltered benchmark, and then average scores on each modified benchmark, to indicate how much human performance drops on average when these details are introduced.

A "hey, you got that wrong. check again" is fine if the LLMs in the paper are also being prompted that way.


This is silly. Humans will also get fewer right answers if you make the question more complex (requiring additional steps), or if you add irrelevant information as a distraction (since on tests, there's usually an assumption that all information given is relevant). As for changing names and numbers, the large effects they saw were all on small (<10B param) open source models; the effects on o1 were tiny and barely distinguishable from noise.

>>> since on tests, there's usually an assumption that all information given is relevant

Maybe on grade-school tests, but the professional certification exams I've taken have questions about scenarios where part of the challenge is recognizing which parts of the scenario description aren't relevant. For the one I took an in-person class for, the instructor specifically called this out and advised reading the questions first so we'd know which parts of the scenario we didn't have to think about while reading it.


What an amazing time we live in when people are discussing how perhaps the output from a bunch of matrix convolutions possibly can't quite reason about mathematics.

They tested o1-preview, but the results are hidden away in the Appendix, probably because o1-preview's "drops" on 4 out of 5 of the new benchmarks are all within the margin of error: 94.9 on the full GSM8K versus 93.6, 92.7, 95.4, 94 and 77.4 on the modified benchmarks.
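Spelled out, the deltas relative to the 94.9 baseline look like this (just the arithmetic on the numbers quoted above, in Python; no new data):

    # Deltas of o1-preview's scores on the modified benchmarks relative to
    # its full-GSM8K score, using the figures quoted in this thread.
    baseline = 94.9
    modified = [93.6, 92.7, 95.4, 94.0, 77.4]
    for score in modified:
        print(f"{score:5.1f}: delta = {score - baseline:+.1f} points")
    # Four deltas sit within about two points of the baseline; only the
    # irrelevant-information variant drops sharply (about -17.5 points).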

The study proves nothing of the sort. Even the results of 4o are enough to give pause to this conclusion.


This is an interesting acid test:

- article is titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models"

- some people have been flogging it as "LLMs cannot reason"

- it shows a 6-8 point drop, in test results in the 80s, if you replace the numbers in the test-set problems with random numbers and run multiple times (see the sketch below)

- If anything, this sounds like a huge W to me: it's very hard to claim they're just memorizing with that small a drop
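For concreteness, the perturbation being tested amounts to something like this sketch: turn a GSM8K-style problem into a template, sample fresh numbers, and recompute the ground truth. The template and number ranges here are my own illustration, not taken from the paper:

    # Illustrative sketch of the GSM-Symbolic idea: template a fixed word
    # problem, sample new numbers, and recompute the answer so a model
    # can't succeed by memorizing the original instance.
    import random

    TEMPLATE = (
        "{name} picks {fri} kiwis on Friday. Then he picks {sat} kiwis on "
        "Saturday. On Sunday, he picks double the number of kiwis he did on "
        "Friday. How many kiwis does {name} have?"
    )

    def sample_variant(rng: random.Random) -> tuple[str, int]:
        fri = rng.randint(10, 99)
        sat = rng.randint(10, 99)
        question = TEMPLATE.format(name="Oliver", fri=fri, sat=sat)
        answer = fri + sat + 2 * fri  # ground truth recomputed per variant
        return question, answer

    rng = random.Random(0)
    for _ in range(3):
        question, answer = sample_variant(rng)
        print(question, "->", answer)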


Gosh I wish someone would pay me handsomely for coming up with such stupidly obvious "research" results as "a computer program that uses statistics to pick the next word in a sequence doesn't reason like a person".

AlphaProof was able to get a silver IMO medal by writing formally verified proofs of novel mathematical problems:

https://deepmind.google/discover/blog/ai-solves-imo-problems...

Whether this is "like a person" or not, it seems silly to insist that this "doesn't count" as mathematical reasoning. I certainly couldn't get an IMO silver medal and I have a degree in math.


Hang on a second, there's a lot wrong with this.

> First, the problems were manually translated into formal mathematical language for our systems to understand.

OK, so the "AI" wasn't solving the same problem as every other Olympiad contestant; it was solving a "translated" version. I wonder how much of the solving was performed in that translation step. If the model is so capable of reasoning, why was this step performed by people?

> In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.

So no, they would not have been awarded a silver medal if they were competing in the IMO.

Not to mention, AlphaProof is not an LLM and has absolutely nothing to do with what I was commenting about.


There are clearly enough people who need to be told this, given any thread on this topic in recent years.

You can do it too, just get a PhD in any field.

VCs spend obscene amounts of money trying to manufacture consumer demand for a fake product, then we are left arguing for years about it. Such a waste of time.

ChatGPT got hundreds of millions of users with no advertising, and unlike (say) Google+, OpenAI didn't have a pre-existing userbase to push new products on. So there's clearly some demand.

No advertising? You probably mean billboards, but OpenAI’s advertising includes making a very big fuss about the “dangers” of a stochastic text generator. Advertising is whatever gets people to think about your product. But the demand was created on false pretences, so it also led to a rational fight against the nonsense. It’s like we got sucked/trolled into a hype. And there are more meaningful things we could have spent our time on.


