Understanding the Limitations of Mathematical Reasoning in Large Language Models (arxiv.org)
73 points by hnhn34 3 hours ago | 71 comments





> we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning

I'd offer a simpler explanation: Tokenization.

If you tokenize "12345 * 27271" you will get the following:

  "123", "45", " *", " ", "272", "71"
The statistical likelihood that any of these tokens predicts any of the others is completely meaningless in the context of simple arithmetic.
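
If you want to check the split yourself, something like OpenAI's tiktoken library works; the exact pieces depend on which encoding the model uses, so treat the split above as one possibility rather than the split every model sees:

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  tokens = enc.encode("12345 * 27271")
  print([enc.decode([t]) for t in tokens])   # the sub-word pieces the model actually sees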

You can argue that this is where tool use comes in (and I would be inclined to agree), but I don't think this bodes well for "genuine logical reasoning".


I respectfully disagree.

While tokenization certainly plays a role in how language models process input, it's simplistic to attribute the challenges in mathematical reasoning solely to tokenization.

SOTA language models don't just rely on individual token predictions, but build up contextual representations across multiple layers. This allows them to capture higher-level meaning beyond simple token-to-token relationships. If this weren't the case, models wouldn't work at all in anything but the most utterly simplistic scenarios.

The decline in performance as complexity increases might be due to other factors, such as:

- Limitations in working memory or attention span

- Difficulty in maintaining coherence over longer sequences

- Challenges in managing multiple interdependent logical constraints simultaneously (simply due to the KQV matrices being too small)

And in any case, I think OpenAI’s o1 models are crushing it in math right now. The iterative, model-guided CoT approach seems to be able to handle very complex problems.


>And in any case, I think OpenAI’s o1 models are crushing it in math right now.

My man, it cannot solve even the simplest problems which it hasn't seen the solution to yet, and routinely makes elementary errors in simple algebraic manipulations or arithmetic! All of this points to the fact that it cannot actually perform mathematical or logical reasoning, only mimic it superficially if trained on enough examples.

I challenge you to give it even a simple, but original, problem to solve.


I would say the more variables you give it, the more the probability drifts for each of the facts it has to hold. Maybe LLMs still don't have the ability to ignore useless stuff you add to the prompt.

Nanda, et al. successfully recovered the exact mechanism through which a transformer learned to carry out modular addition. [0] Transformers are all about the training data, and we will increasingly learn that structuring the order in which data is learned matters a lot. But it's clear that transformers are absolutely capable of encoding generalized solutions to arithmetic.

Given the right tokenization scheme and training regimen, we can absolutely create LLMs which have statistically sound arithmetic capabilities. I still wouldn't trust a stochastic model over the algorithmic certainty of a calculator, but what's more important for mathematicians is that these models can reason about complex problems and help them break new ground on hard mathematical problems by leveraging the full statistical power of their weights.

[0] https://arxiv.org/abs/2301.05217
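
For context, the task in [0] is tiny and fully enumerable, which is what made the mechanistic analysis possible. A rough sketch of the setup (the modulus 113 and the train fraction are what I recall; treat them as assumptions):

  from itertools import product
  import random

  P = 113   # modulus I recall from the paper; treat as an assumption

  # The whole task is (a, b) -> (a + b) mod P, so the full dataset is only P*P pairs.
  data = [((a, b), (a + b) % P) for a, b in product(range(P), repeat=2)]

  random.seed(0)
  random.shuffle(data)
  cut = int(0.3 * len(data))            # train on a fraction, hold out the rest
  train, held_out = data[:cut], data[cut:]

  # Trained long enough, a small transformer first memorizes `train` and later
  # generalizes ("groks") to `held_out`, using trigonometric structure in its
  # embeddings, which is the mechanism recovered in [0].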


Wouldn't a slight change in tokenization (say, mapping single digits to single tokens) help with this specific challenge?

Aren’t coding copilots based on tokenizing programming language keywords and syntax? That seems to me to be domain specific tokenization (a very well defined one too — since programming languages are meant to be tokenizable).

Math is a bit trickier since most of the world’s math is in LaTeX, which is more of a formatting language than a syntax tree. There needs to be a conversion to MathML or something more symbolic.

Even English word tokenization has gaps today. Claude Sonnet 3.5 still fails on the question “how many r’s are there in strawberry”.
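
The gap is easy to see if you compare what the model is fed with the character-level view. A tiny sketch (tiktoken is only a stand-in here, since Anthropic's tokenizer isn't the same, but the principle holds):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  print([enc.decode([t]) for t in enc.encode("strawberry")])  # sub-word pieces, no individual letters
  print("strawberry".count("r"))                              # character-level ground truth: 3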


Context-specific tokenization sounds a lot like old fashioned programming.

The LLM will know 123 and 45 form a contiguous number, just like humans can tell that if you say 123, then a slight pause, then 45, it's a single number.

It's just so dissonant to me that the tokens in mathematics are the digits, and not bundles of digits. The idea of tokenization makes sense for taking the power off letters; it provides language agnosticism.

But for maths, it doesn't seem appropriate.

I wonder what the effect of forcing tokenization for each separate digit would be.
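
A rough sketch of what per-digit pre-tokenization could look like (just an illustration of the idea, not how any particular production tokenizer does it; as I understand it, some newer tokenizers do split numbers into single digits or short groups):

  import re

  def pre_split_digits(text):
      # Emit every digit as its own piece; leave runs of non-digits intact.
      return re.findall(r"\d|\D+", text)

  print(pre_split_digits("12345 * 27271"))
  # ['1', '2', '3', '4', '5', ' * ', '2', '7', '2', '7', '1']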


It won't 'see' [123, 45] though, but [7633, 2548], or rather sparse vectors that are zero at each but the 7634th and 2549th position.
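
A minimal sketch of that representation (using numpy; the specific ids are the hypothetical ones from above, not real vocabulary entries):

  import numpy as np

  vocab_size = 50257                    # illustrative vocabulary size
  ids = [7633, 2548]                    # hypothetical ids standing in for "123" and "45"

  one_hot = np.zeros((len(ids), vocab_size))
  one_hot[np.arange(len(ids)), ids] = 1.0        # each row is zero except at that token's index

  # In practice the model never materializes these; it does an embedding lookup,
  # which is equivalent to one_hot @ embedding_matrix:
  embedding_matrix = np.random.randn(vocab_size, 768)   # 768 is an illustrative width
  vectors = embedding_matrix[ids]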

I think that as long as the attention mechanism has been trained on each possible numerical token enough, this is true. But if a particular token is underrepresented, it could potentially cause inaccuracies.

These results are very similar to the "Alice in Wonderland" problem [1, 2], which was already discussed a few months ago. However the authors of the other paper are much more critical and call it a "Complete Reasoning Breakdown".

You could argue that the issue lies in the models being in an intermediate state between pattern matching and reasoning.

To me, such results indicate that you can't trust any LLM benchmark results related to math and reasoning when you see that changing the characters, numbers or sentence structure in a problem alters the outcome by more than 20 percentage points.

[1] https://arxiv.org/html/2406.02061v1

[2] https://news.ycombinator.com/item?id=40811329


Someone (https://x.com/colin_fraser/status/1834336440819614036) shared an example that I thought was interesting relating to their reasoning capabilities:

A man gets taken into a hospital. When the doctor sees him, he exclaims "I cannot operate on this person, he is my own son!". How is this possible?

All LLMs I have tried this on, including GPT o1-preview, get this wrong, assuming that the riddle relates to a gendered assumption about the doctor being a man, while it is in fact a woman. However, in this case, there is no paradox - it is made clear that the doctor is a man ("he exclaims"), meaning he must be the father of the person being brought in. The fact that the LLMs got this wrong suggests that they find a similar reasoning pattern and then apply it. Even after additional prodding, a model continued making the mistake, arguing at one point that it could be a same-sex relationship.

Amusingly, when someone on HN mentioned this example in the O1 thread, many of the HN commentators also misunderstood the problem - perhaps humans also mostly reason using previous examples rather than thinking from scratch.


> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

Although we would like AI to be better here, the worse problem is that, unlike humans, you can't get the LLM to understand its mistake and then move forward with that newfound understanding. While the LLM tries to respond appropriately and indulge you when you point out the mistake, further dialog usually exhibits noncommittal behavior by the LLM, and the mistaken interpretation tends to sneak back in. You generally don't get the feeling of "now it gets it"; instead it tends to feel more like someone with no real understanding (but very good memory of relevant material) trying to bullshit-technobabble around the issue.


That is an excellent point! I feel like people have two modes of reasoning - a lazy mode where we assume we already know the problem, and an active mode where something prompts us to actually pay attention and actually reason about the problem. Perhaps LLMs only have the lazy mode?

I'm sure we fall back on easy/fast associations and memories to answer. It's the way of least resistance. The text you quote bears more than a superficial similarity to the old riddle (there's really nothing else that looks like it), but that version also stipulates that the father has died. That adds "gendered" (what an ugly word) information to the question, a fact which is missed when recalling this particular answer. Basically, LLMs are stochastic parrots.

How people don’t see the irony of commenting “stochastic parrots” every time LLM reasoning failure comes up is beyond me.

There are ways to trick LLMs. There are also ways to trick people. If asking a tricky question and getting a wrong answer is enough to disprove reasoning, humans aren’t capable of reasoning, either.


> perhaps humans also mostly reason using previous examples rather than thinking from scratch.

We do, but we can generalize better. When you exchange "hospital" with "medical centre" or change the sentence structure and ask humans, the statistics would not be that different.

But for LLMs, that might make a lot of difference.


Both Claude-3.5 and o1-preview nail this problem

"Let's think through this step-by-step:

1. Alice has 3 brothers
2. Alice has 2 sisters
3. We need to find out how many sisters Alice's brother has

The key here is to realize that Alice's brothers would have the same sisters as Alice, except they would also count Alice as their sister.

So, Alice's brothers would have:
- The 2 sisters Alice has
- Plus Alice herself as a sister

Therefore, Alice's brothers have 3 sisters in total."


And here lies the exact issue. Single tests don’t provide any meaningful insights. You need to perform this test at least twenty times in separate chat windows or via the API to obtain meaningful statistics.

For the "Alice in Wonderland" paper, neither Claude-3.5 nor o1-preview was available at that time.

But I tested them as well a few weeks ago, with the problem translated into German, and both models also achieved a 100% success rate.

However, when I add irrelevant information (My mother ...), Claude's success rate drops to 85%:

"My mother has a sister called Alice. Alice has 2 sisters and 1 brother. How many sisters does Alice's brother have?"


Your experience makes me think that the reason the models got a better success rate is not because they are better at reasoning, but rather because the problem made it to their training dataset.

We don't know. The paper and the problem were very prominent at the time. Some developers at Anthropic or OpenAI might have included it in some way, either as a test or as a task to improve the CoT via reinforcement learning.

Absolutely! It's the elephant in the room with these ducking "we've solved 80% of maths olympiad problems" claims!

We do have chatbot arena which to a degree already does this.

I like to use:

"Kim's mother is Linda. Linda's son is Rachel. John is Kim's daughter. Who is Kim's son?"

Interestingly I just got a model called "engine test" that nailed this one in a three sentence response, whereas o1-preview got it wrong (but has gotten it right in the past).


My problem with this puzzle is: how do you know that Alice and her brothers share both parents?

Is it not correct English to call two people who share only one parent sisters or brothers?

I guess I could be misguided by my native Norwegian, where you have to prefix the word with "hel" (full) or "halv" (half) if you want to specify the number of shared parents.


They would usually be called "half-sisters". You could call them "sisters" colloquially, but given it's presented as a logic question I think it's fine to disregard that.

It is pretty much the same in English. Unqualified would usually mean sharing both parents but could include half- or step-siblings.

I am not a native English speaker. Can you reformulate the problem for me, so that every alternative interpretation is excluded?

Alice has N full sisters. She also has M full brothers. How many full sisters does Alice’s brother have?

I won't take a strong stance on whether or not LLMs actually do reasoning, but I will say that this decrease in performance is similar to what I see in college freshmen (I'm currently teaching a calculus course in which almost half of the students took AP calc in high school). They perform well on simple questions. Requiring students to chain multiple steps together, even simple steps, results in decreased accuracy and higher variance (I have no data on whether this decrease is linear or not; the paper assumes the decrease should be linear in the number of steps). We see similar results when adding unrelated statements to a problem: many students are trained to make sure to use all given information in solving a problem, because if you leave out something that the instructor gives you, then you probably forgot to do something important.

So while I don't take a stance on whether what an LLM does should be considered reasoning, I do think that SOTA LLMs like GPT-4o perform about as well as high school graduates in America with average intelligence. In other words, average Americans exhibit similar limitations on their reasoning as good LLMs. Which on the one hand is a little disappointing to me in terms of the human performance, but is kind of good news for LLMs: they aren't doing graduate-level research, but they are already capable of helping a large portion of the population.


> In other words, average Americans exhibit similar limitations on their reasoning as good LLMs.

It's not even clear this is a good example of "reasoning". You can progress all the way through multi-variable calculus with just decent pattern-matching, variable-substitution, and rote memorization of sufficient lists of rules. I imagine for "reasoning" ability to apply you need to be able to detect incoherency and reject an approach—and incoherency detection seems to be a big missing ingredient right now (...which many humans lack, too!).

On the other hand, any such ability would cripple a chatbot's ability to answer questions about the real world, since our world is characterized (via description with informal language) by incoherent and contradictory concepts that can only be resolved through good-faith interpretation of the questioner. A large mark of intelligence (in the colloquial sense, not the IQ sense) is the ability to navigate both worlds.


Not to disparage the American school system (my country's is worse), but it's very much easy mode. I know that not everyone is suited to academic excellence, but it's definitely easier to learn when young. I do believe too much hand-holding actively harms learning.

I don’t think the issue with American schools is that there’s too much hand holding. If anything, it’s the opposite; teachers at drastically underfunded schools don’t have any time to help the students of their 50 person class through the confused curriculum.

This. It's like when I hear interviews of PhDs talking about AI and they say something like "AI will be smarter than humans". I'm like: really? Where have you been all this time? Do you smart people ever leave your labs and go see the real world? LLMs are already smarter than the huge majority of humans on this planet. What are you talking about?

This must be some bizarre definition of “smarter”.

Very interesting, and aligns with what I would expect in terms of the type of "thinking" LLMs do. I think that it's also the type of "thinking" that will let a student pass most school courses, except of course for the ones where the teacher has taken the time to pose test questions that aren't as amenable to pattern matching. (Hard, but I assume most readers here are familiar with leetcode style interviews and what makes questions of that kind higher or lower quality for assessing candidates)

(And yes, I know people are hard at work adding other types of thinking to work along with the pure language models)


If the argument is that LLMs are bad at reasoning because they are easily distractible and the results vary with modifications in the question, one should be reminded of the consistency and distractability of humans.

Why? LLMs are supposedly better than humans (as many comments claim in this thread).

I test LLMs in a similar way. For example, there is a well-known logic puzzle where a farmer tries to cross a river with a cabbage, a goat, and a wolf. LLMs have been able to solve that since at least GPT-2; however, if we replace the wolf with a cow, gpt-o does correctly infer the rules of the puzzle but can't solve it.

I've been using this as my first question to any new LLM I try and I'm quite sure nothing before GPT-4 even got close to a correct solution. Can you post a prompt that GPT-2 or 3 can solve?

What happens if you sit down and invent a logic game that is brand new and has never been documented before anywhere then ask an LLM to solve it? That, to a layman like me, seems like a good way to measure reasoning in AI.

I think the problem is inventing new structures for logic games. The shape of the problem ideally would be different than any existing puzzle, and that's hard. If a person can look at it and say "oh, that's just the sheep-wolf-cabbage/liar-and-truthteller/etc. problem with extra features" then it's not an ideal test because it can be pattern-matched.

This is being done, but the difficulties are: (1) How do you assess that it is really brand-new and not just a slight variation of an existing one? (2) Once you publish it, it stops being brand-new, so its lifetime is limited and you can’t build a longer-term reproducible test out of it.

I've found that the River Crossing puzzle is a great way to show how LLMs break down.

For example, I tested Gemini with several versions of the puzzle that are easy to solve because they don't have the restrictions such as the farmer's boat only being able to carry one passenger/item at a time.

Ask this version, "A farmer has a spouse, chicken, cabbage, and baby with them. The farmer needs to get them all across the river in their boat. What is the best way to do it?"

In my tests the LLMs nearly always assume that the boat has a carry-restriction and they come up with wild solutions involving multiple trips.
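
One way to get ground truth for arbitrary variants (different animals, different conflict pairs, bigger boats) is a tiny brute-force search rather than an LLM. A rough sketch in Python, with the conflict pairs as an explicit parameter since that's exactly what changes between variants:

  from collections import deque
  from itertools import combinations

  def solve(items, conflicts, capacity=1):
      # Breadth-first search over river-crossing states.
      # items:     things the farmer must ferry across
      # conflicts: pairs that must not be left together without the farmer
      # capacity:  how many items fit in the boat besides the farmer
      items = frozenset(items)

      def safe(group):
          return not any(a in group and b in group for a, b in conflicts)

      start = (items, True)              # (items on the left bank, farmer on left?)
      queue = deque([(start, [])])
      seen = {start}
      while queue:
          (left, farmer_left), path = queue.popleft()
          if not left and not farmer_left:
              return path                # everything (and the farmer) is on the right bank
          here = left if farmer_left else items - left
          for k in range(capacity + 1):
              for cargo in combinations(here, k):
                  if not safe(here - set(cargo)):
                      continue           # can't leave these behind unsupervised
                  new_left = left - set(cargo) if farmer_left else left | set(cargo)
                  state = (new_left, not farmer_left)
                  if state not in seen:
                      seen.add(state)
                      queue.append((state, path + [cargo]))
      return None

  # Classic version: 7 crossings. Swap the wolf for a cow and adjust the
  # conflict pairs to match whatever rules your variant is supposed to have.
  print(solve({"wolf", "goat", "cabbage"}, [("wolf", "goat"), ("goat", "cabbage")]))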


Meaning it's just a glorified Google.

I'm scared of the cows around you if they eat goats

It would be interesting if this kind of work could ever be extended to show the limitations of mathematical reasoning in animals and humans.

For example, just as a dog will never understand a fourier transform, there are likely ideas that humans cannot understand. If we know what our limits are, I wonder if we could build machines that can reason in ways we aren't capable of?


I think it is a naive assumption that such a limitation even exists ("exists" in a sense that it is actually useful, by being consistent and somewhat simple to describe).

We investigated similar ideas for language (=> Noam Chomsky), where we tried to draw clear, formalized limits for understanding (to show e.g. how human capabilities contrast with animals). The whole approach failed completely and irredeemably (personal opinion), but researching it was far from useless to be fair.


As the human brain is finitely bounded in space and time, any idea that can't be compressed or represented by condensed notation, that is "larger" than what 100B cells + 100T synapses can represent, or whose integration into a human brain would take longer than 150 years, would be impossible for a normal human to contemplate.

I'm curious about what happens with the no-op dataset if you include in the prompt that the questions may contain irrelevant information.

It seems incredibly easy to generate an enormous amount of synthetic data for math. Is that happening? Does it work?

They did that for o1 and o1-preview. If you read the paper or do your own testing with that SOTA model, you will see that the paper is nonsense. With the best models, the problems they point out are mostly marginal, like one or two percentage points when changing numbers, etc.

They are taking poor performance of undersized models and claiming that proves some fundamental limitation of large models, even though their own tests show that isn't true.


You choose to ignore Figure 8, which shows an 18% drop when simply adding an irrelevant detail.

In the other test the perturbations aren’t particularly sophisticated and modify the problem according to a template. As the parent comment said this is pretty easy to generate test data for (and for the model to pattern match against) so maybe that is what they did.

A better test of “reasoning” would be to isolate the concept/algorithm and generate novel instances that are completely textually different from existing problems to see if the model really isn’t just pattern matching. But we already know the answer to this because it can’t do things like arbitrary length multiplication.


Data is the wrong approach to developing reasoning. We don't want LLMs to simply memorize 3x3 = 9; we want them to understand that 3 + 3 + 3 = 9, therefore 3x3 = 9 (obviously a trivial example). If they have developed reasoning, very few examples should be needed.

The way I see it reasoning is actually the ability of the model to design and train smaller models that can learn with very few examples.


> If they have developed reasoning very few examples should be needed.

Yes, once the modules for reasoning have converged, it will take very few examples for it to update to new types of reasoning. But to develop those modules from scratch requires large amounts of examples that overtax its ability to memorize. We see this pattern in the "grokking" papers. Memorization happens first, then "grokking" (god I hate that word).

It's not like humans bootstrap reasoning out of nothing. We have a billion years of evolution that encoded the right inductive biases in our developmental pathways to quickly converge on the structures for reasoning. Training an LLM from scratch is like recapitulating the entire history of evolution in a few months.


Yes, this is how o1 was trained. Math and programming, because they are verifiable.

This is also why o1 is not better at English. Math skills transfer to general reasoning but not so much to creative writing.


In which distribution? Like school math, or competitions, or unsolved problems? FWIW I think one and three are probably easier to generate synthetically. It's harder to bound the difficulty, but I think the recent David Silver talk implies it doesn't matter much. Anyway, there's some work on this you can find online; they claim to improve GSM8K and MATH a bit but not saturate them. Idk how useful it is in practice.

I don’t think so. The data is biased towards being very general.

I don't understand the idiocracy we live in. It is beyond obvious not just that the stock market is a bubble, but ESPECIALLY that the AI-related stocks are a massive bubble. When it pops, and it will, it is going to be very, very ugly, yet people keep pouring in. As Sabine said, it's starting to look like particle physics, where they keep asking for bigger colliders: just because you have a bigger collider, if your methodology is flawed you aren't gonna get any more significant returns.

Eventually they will run out of exponential cash to pour in, and investors will start asking questions. Stocks are already valued at 60x+ their earnings. Whenever it pops, you don't want to be the one who bought the top.

Guess it's still gonna take a while more for the layman to realize the issues with LLMs, but it'll happen.


>if your methodology is flawed you aren't gonna get any more significant returns.

The problem with this statement is that predictions made about scaling 5 years ago have held true[1]. We keep adding parameters, adding compute, and the models keep getting more capable.

The flaws of LLMs from 2024 are not what is relevant, just like the flaws of LLMs from 2021 were not relevant. What is relevant is the rate of change, and the lack of evidence that things won't continue on this steep incline. Especially if you consider that GPT-4 was sort of a preview model that motivated big money to make ungodly investments to see how far we can push this. Those models will start to show up over the next 2 years.

If they break the trend and the scaling flops, then I think a lot of air is gonna blow out of the bubble.

[1] https://arxiv.org/pdf/2001.08361
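
For reference, the headline result of [1] is that test loss follows simple power laws in model size and data; the constants and exponents below are empirical fits from that paper, which I'm omitting here:

  L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
  \qquad
  L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}

where N is the parameter count and D is the dataset size.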


We added a lot of parameters.

We added a LOT of data.

The resulting models have become only slightly better. And they still have all of their old problems.

I think this is proof that scaling doesn't work. It's not like we just doubled the sizes; they increased by a lot, but the improvements get smaller each time. And they've already run out of useful data.


They are very literally asking for trillions and even nuclear-powered data centers; pretty sure we've gotten to the point where it's not sustainable.

Those are roadmap items being asked for, but the next gen models are already in training. If they keep moving along the same trend line, like all the previous models have, then they probably will be able to find the investors for the next next gen. Even if it's a few trillion dollars and a few nuclear power plants.

This doesn't even factor in the tech inertia. We could stop making new models today, and it would probably be 4-5 years before integration slowed down. Google still hasn't even put Gemini in their home speakers.


I honestly can't see why LLMs should be good at this sort of thing. I am convinced you need a completely different approach. At the very least you mostly only want one completely correct result. Good luck getting current models to do that.

LLMs aren't totally out of scope of mathematical reasoning. LLMs roughly do two things: move data around, and recognize patterns. Reasoning leans heavily on moving data around according to context-sensitive rules. This is well within the scope of LLMs. The problem is that general problem solving requires potentially arbitrary amounts of moving data, but current LLM architectures have a fixed number of translation/rewrite steps they can perform before they must produce output. This means most complex reasoning problems are out of bounds for LLMs, so they learn to lean heavily on pattern matching. But this isn't an intrinsic limitation of LLMs as a class of computing device, just a limit of current architectures.

I'm a math PhD student at the moment and I regularly use o1 to try some quick calculations I don't feel like doing. While I feel like GPT-4o is so distilled that it just tries to know the answer from memory, o1 actually works with what you gave it and tries to calculate. It can be quite useful.

I'm curious, what kind of quick calculations do you usually use LLMs for?

Edited for clarity


Just earlier today I wanted to check if exp(inx) is an orthonormal basis on L^2((0, 1)) or if it needs normalization. This is an extremely trivial one though. Less trivially I had an issue where a paper claimed that a certain white noise, a random series which diverges in a certain Hilbert space, is actually convergent in some L^infinity type space. I had tried to use a Sobolev embedding but that was too crude so it didn't work. o1 correctly realized that you have to use the decay of the L^infinity norm of the eigenbasis, a technique which I had used before but just didn't think of in the moment. It also gave me the eigenbasis and checked that everything works (again, standard but takes a while to find in YOUR setting). I wasn't sure about the normalization so again I asked it to calculate the integral.
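
For what it's worth, that particular check is a one-line integral, and the answer depends on the scaling: the 2π-rescaled exponentials are orthonormal on (0,1) with no extra factor, while the unrescaled e^{inx} aren't even orthogonal there (on (0, 2π) they need a 1/\sqrt{2\pi} factor):

  \int_0^1 e^{2\pi i n x}\, \overline{e^{2\pi i m x}}\, dx
    = \int_0^1 e^{2\pi i (n-m) x}\, dx
    = \begin{cases} 1 & n = m \\ 0 & n \neq m \end{cases}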

This kind of adaptation to your specific setting, instead of just spitting out memorized answers in common settings, is what makes o1 useful for me. Now again, it is often wrong, but if I am completely clueless I like to watch it attempt things and I can get inspiration from that. That's much more useful than seeing a confident wrong answer like 4o would give.



That makes the whole conclusion obviously false.

I don't really understand why, but I think we are going to see total denial from a significant percentage of the population all the way up to and past the point where many average mathematicians and software engineers cannot in any way compete with AI.

We already are reportedly getting pretty close with o1 (not o1-preview).

There are also new paradigms for machine learning and hardware in the pipeline that will continue to provide orders of magnitude performance gains and new capabilities in the next 5-10 years.

Many people still claim that "self driving cars don't exist", in so many words, even though they are deployed in multiple cities.



