Understanding Reasoning LLMs (sebastianraschka.com)
135 points by sebg 4 hours ago | 55 comments





I like Raschka's writing, even if he is considerably more optimistic about this tech than I am. But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878

What they are certainly capable of is a wide variety of computations that simulate reasoning, and maybe that's good enough for your use case. But it is unpredictably brittle unless you spend a lot on o1-pro (and even then...). Raschka has a line about "whether and how an LLM actually 'thinks' is a separate discussion", but this isn't about semantics. R1 clearly sucks at deductive reasoning, and you will not understand "reasoning" LLMs if you take DeepSeek's claims at face value.

It seems especially incurious for him to copy-paste the "aha moment" passage from DeepSeek's technical report without critically investigating it. DeepSeek's claims are unscientific, made without real evidence, and seem focused on hype and investment:

  This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. 

  The "aha moment" serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
Perhaps it was able to solve that tricky Olympiad problem, but there is an infinite variety of 1st-grade math problems it is not able to solve. I doubt it's even reliably able to solve simple variations of that root problem. Maybe it is! But it's frustrating how little skepticism there is about CoT, reasoning traces, etc.

> they are incapable of even the simplest "out-of-distribution" deductive reasoning

But the link demonstrates the opposite: these models absolutely are able to reason out of distribution, just not with perfect fidelity. The fact that they can do better than random is itself really impressive. And o1-preview does impressively well, only very rarely getting the wrong answer on variants of that Alice in Wonderland problem.

If you listened to most of the people critical of LLMs who call them a "stochastic parrot", it should be impossible for them to do better than random on any out-of-distribution problem. Even just changing one number to create a novel math problem should totally stump them and result in entirely random outputs, but it does not.

Overall, poor reasoning that is better than random but frequently gives the wrong answer is fundamentally and categorically different from being incapable of reasoning.


anyone saying an LLM is a stochastic parrot doesn't understand them... they are just parroting what they heard.

There is definitely a mini cult of people that want to be very right about how everyone else is very wrong about AI.

Firstly, this is meta ad hominem: you're ignoring the argument to target the speaker(s).

Secondly, you're ignoring the fact that the community of voices with experience in data science, computer science and artificial intelligence is itself split on the qualities, or lack of them, in current AI. GPTs and LLMs are very interesting, but they say little or nothing to me about a new theory of mind, nor do they display inductive logic and reasoning, or even meet the bar for a philosopher's cave solution to problems. We've been here before so many, many times. "Just a bit more power, captain" was very strong in connectionist theories of mind, fMRI brain-activity analytics, you name it.

So yes. There are a lot of "us" who are pushing back on the hype, and no we're not a mini cult.


There are a couple Twitter personalities that definitely fit this description.

There is also a much bigger group of people who haven't really tried anything beyond GPT-3.5, which was the best you could get without paying a monthly subscription for a long time. One of the biggest reasons for the R1 hype, besides the geopolitical angle, was that people could actually try a reasoning model for free for the first time.


i.e., the people who think AI is dumb? Or are you saying I'm in a cult for being pro-AI? I'm definitely part of that cult: the "we already have AGI and you have to contort yourself into a pretzel to believe otherwise" cult. Not sure if there is a leader though.

I didn't realize my post could be interpreted either way. I'll leave it ambiguous, hah. Place your bets I guess.

You think we have AGI? What makes you think that?

By knowing what each of the letters stands for

A good literary production; I would have been proud of it had I thought of it. But there's a strong "whataboutery" element here: if we use "stochastic parrot" as shorthand and you dislike the term, now you understand why we dislike the constant use of "infer", "reason" and "hallucinate".

Parrots are self-aware, complex reasoning brains which can solve problems in geometry, tell lies, and act socially or asocially. They also have complex vocal cords and can perform mimicry. Very few aspects of a parrot's behaviour are stochastic, but that also underplays how complex stochastic systems can be in their production. If we label LLM products as Stochastic Parrots, it does not mean they like cuttlefish bones or are demonstrably modelled by Markov chains like Mark V Shaney.


> If you listened to most of the people critical of LLMs who call them a "stochastic parrot", it should be impossible for them to do better than random on any out-of-distribution problem. Even just changing one number to create a novel math problem should totally stump them and result in entirely random outputs, but it does not.

You don't seem to understand how they work: they recurse their solution, meaning that if they have remembered components they parrot back sub-solutions. It's a bit like a natural-language computer; that way you can get them to do math etc., although the instruction set isn't that of a Turing-complete language.

They can't recurse into sub-sub-parts they haven't seen, but problems that have similar sub-parts can of course be solved, as anyone understands.


> You don't seem to understand how they work

I don't think anyone understands how they work; these types of explanations aren't very complete or accurate. Such explanations/models allow one to reason out what types of things they should be capable of vs. incapable of in principle, regardless of scale or algorithm tweaks, and those predictions and arguments never match reality and require constant goalpost shifting as the models are scaled up.

We understand how we brought them about by setting up an optimization problem in a specific way; that isn't the same at all as knowing how they work.

I tend to think, in the totally abstract philosophical sense and independent of the type of model, that at the limit of an increasingly capable function approximator trained on an increasingly large and diverse set of real-world cause/effect time-series data, you eventually develop an increasingly accurate and general predictive model of reality organically within the model. Some model types do have fundamental limits in their ability to scale like this, but we haven't yet found such a limit with these models.

It is more appropriate to objectively test what they can and cannot do, and avoid trying to infer what we expect from how we think they work.


Well we do know pretty much exactly what they do, don't we?

What surprises us is the behaviors coming out of that process.

But surprise isn't magic; magic shouldn't even be on the list of explanations to consider.


Magic wasn’t mentioned here. We don’t understand the emerging behavior, in the sense that we can’t reason well about it and make good predictions about it (which would allow us to better control and develop it).

This is similar to how understanding chemistry doesn’t imply understanding biology, or understanding how a brain works.


Exactly, we don't understand, but we want to believe it's reasoning, which would be magic.

> I don't think anyone understands how they work

Yes we do, we literally built them.

> We understand how we brought them about via setting up an optimization problem in a specific way, that isn't the same at all as knowing how they work.

You're confusing "knowing how they work" with "understanding all of their emergent behaviors".

If I build a physics simulation, then I know how it works. But that's a separate question from whether I can mentally model and explain the precise way a ball will bounce given a set of initial conditions within that simulation, which is what you seem to be talking about.
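
To make the analogy concrete, here is a toy sketch of my own (not anyone's actual model): the update rules fit in a dozen lines of Python, yet the only practical way to find out where the ball ends up from a given starting point is to run it.

    def simulate_ball(x, v, dt=0.01, steps=1000, g=-9.81, restitution=0.9):
        """Bounce a ball on a floor at x = 0; returns the height trajectory."""
        trajectory = []
        for _ in range(steps):
            v += g * dt              # gravity
            x += v * dt              # position update
            if x < 0:                # bounce off the floor, losing some energy
                x = -x
                v = -v * restitution
            trajectory.append(x)
        return trajectory

    # "Knowing how it works" is the dozen lines above; predicting the precise
    # bounce pattern for arbitrary initial conditions is a different matter.
    print(simulate_ball(x=1.0, v=0.0)[-1])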


>But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning:

That's not actually what your link says. The tweet says that it solves the simple problem (the one they originally designed to foil base LLMs), so they had to invent harder problems until they found one it could not reliably solve.


Did you see how similar the more complicated problem is? It's nearly the exact same problem.

"researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation" - Rich Sutton

> But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878

Your link says that R1, not all models like R1, fails at generalization.

Of particular note:

> We expose DeepSeek R1 to the variations of AIW Friends problem and compare model behavior to o1-preview, o1-mini and Claude 3.5 Sonnet. o1-preview handles the problem robustly, DeepSeek R1 shows strong fluctuations across variations with distribution very similar to o1-mini.


I'd expect that OpenAI's stronger reasoning models also don't generalize too far outside of the areas they are trained for. At the end of the day these are still just LLMs, trying to predict continuations, and how well they do is going to depend on how well the problem at hand matches their training data.

Perhaps the type of RL used to train them also has an effect on generalization, but choice of training data has to play a large part.


The way the authors talk about LLMs really rubs me the wrong way. They spend more of the paper talking up the 'claims' about LLMs that they are going to debunk than actually doing any interesting study.

They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.


What the hype crowd doesn't get is that for most people, "a tool that randomly breaks" is not useful.

>They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.

And lo and behold, they still found a glaring failure. You can't fault them for not buying into the hype.


But it is still dishonest to declare reasoning LLMs a scam simply because you searched for a failure mode.

If given a few hundred tries, I bet I could find an example where you reason poorly too. Wikipedia has a whole list of common failure modes of human reasoning: https://en.wikipedia.org/wiki/List_of_fallacies


Well, given that the success rate is no more than 90% in the best cases, you could probably find a failure in about 10 tries. The only exception is o1-preview. And this is just a simple substitution of parameters.
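
For what it's worth, a quick back-of-envelope check of that "about 10 tries" figure, assuming a 90% per-variant success rate and independent attempts (both assumptions):

    p_success = 0.9

    # Expected number of tries until the first failure (geometric distribution).
    expected_tries = 1 / (1 - p_success)          # 10.0

    # Probability of seeing at least one failure within 10 tries.
    p_failure_within_10 = 1 - p_success ** 10     # ~0.65
    print(expected_tries, round(p_failure_within_10, 2))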

The other day I fed a complicated engineering doc for an architectural proposal at work into R1. I incorporated a few great suggestions into my work. Then my work got reviewed very positively by a large team of senior/staff+ engineers (most with experience at FAANG, i.e. credibly solid engineers). R1 was really useful! Sorry you don’t like it, but I think it’s unfair to say it sucks at reasoning.

Your argument is exactly the kind which makes me think people who claim LLMs are intelligent are trolling.

You are equating things which are not related and do not follow from each other. For example:

- A tool being useful (for particular people and particular tasks) does not mean it is reasoning. A static type checker is pretty fucking useful but is neither intelligent nor reasoning.

- The OP did not say he doesn't like R1, he said he disagrees with the opinion it can reason and with how the company advertises the model.

The fake "sorry" is a form of insult and manipulation.

There are probably more issues with your comment but I am unwilling to invest any more time into arguing with someone unwilling to use reasoning to understand text.


This is basically a misrepresentation of that tweet.

Nice article.

>Whether and how an LLM actually "thinks" is a separate discussion.

The "whether" is hardly a discussion at all. Or, at least one that was settled long ago.

"The question of whether a computer can think is no more interesting than the question of whether a submarine can swim."

--Edsger Dijkstra


The document that quote comes from is hardly a definitive discussion of the topic.

“[…] it tends to divert the research effort into directions in which science can not—and hence should not try to—contribute.” is a pretty myopic take.

--http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD898.PDF


It's interesting if you're asking the computer to think, which we are.

It's not interesting if you're asking it to count to a billion.


Is there any work being done on training LLMs on more restricted formal languages? Something like a constraint solver or automated theorem prover, but much lower level. Specifically something that isn't natural language. That's the only path I can see towards reasoning models being truly effective.

I know there is work being done with e.g. Lean integration with ChatGPT, but that's not what I mean exactly -- there's still this shaky natural-language-trained-LLM glue in the driver's seat.

Like I'm envisioning something that has the creativity to try different things, but then JIT-compiles its chain of thought and avoids bad paths.
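
A toy, self-contained sketch of that idea: sample candidate steps, run each one through a strict checker, and prune invalid paths before going deeper. Here the "formal language" is just arithmetic equalities checked in Python, standing in for a real prover backend, and propose_steps() is a made-up stand-in for an LLM sampling call; none of this is an existing system's API.

    import random

    def propose_steps(a, b):
        """Toy 'model': propose candidate claims about a + b, mostly wrong."""
        return [f"{a} + {b} = {a + b + random.choice([-1, 0, 1])}" for _ in range(4)]

    def verify(claim):
        """Strict checker: only admit claims that hold exactly."""
        lhs, rhs = claim.split("=")
        return eval(lhs) == int(rhs)

    def verified_answer(a, b, attempts=20):
        for _ in range(attempts):
            for claim in propose_steps(a, b):
                if verify(claim):            # discard bad paths immediately
                    return claim
        return None

    print(verified_answer(17, 25))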


How would that be different from something like ChatGPT executing Lean? That's exactly what humans do: we have messy reasoning that we then write down in formal logic and compile to see if it holds.

In my mind, the pure reinforcement learning approach of DeepSeek is the most practical way to do this. Essentially it needs to continually refine and find more sound(?) subspaces of the latent (embedding) space. Now this could be the subspace which is just Python code (or some other human-invented subspace), but I don't think that would be optimal for the overall architecture.

The reason why it seems the most reasonable path is that when you create restrictions like this you hamper search viability (and in a high-dimensional space that's a massive loss, because you can arrive at a result from many directions). It's like regular genetic programming vs. typed genetic programming: when you discard all your useful results, you can't go anywhere near as fast. There will be a threshold where constructivist, generative schemes (e.g. reasoning with automata and all kinds of fun we've neglected) will be the way forward, but I don't think we've hit that point yet. It seems to me that such a point does exist, because if you have fast heuristics on when types unify, you no longer hamper the search speed but gain many benefits in soundness.

One of the greatest human achievements of all time is probably this latent embedding space -- one that we can actually interface with. It's a new lingua franca.

These are just my cloudy current thoughts.


DeepSeek's approach with R1 wasn't pure RL: they used RL alone only to develop R1-Zero from their V3 base model, and then went through two iterations of using the current model to generate synthetic reasoning data, doing SFT on that, then RL fine-tuning, and repeating.
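
Schematically (my reading of that summary and the R1 paper, with every function a stub that just labels the stage; this is a diagram in code, not anyone's training script):

    def rl(model, reward):               return f"RL[{reward}]({model})"
    def generate_traces(model):          return f"traces<{model}>"
    def sft(model, data):                return f"SFT({model}; {data})"

    base = "DeepSeek-V3"
    r1_zero = rl(base, "rule-based verifiers")        # R1-Zero: RL only, no SFT

    model, trace_source = base, r1_zero
    for _ in range(2):                                # two bootstrap iterations
        data = generate_traces(trace_source)          # synthetic reasoning data
        model = sft(base, data)                       # supervised fine-tune
        model = rl(model, "verifiers + preferences")  # RL fine-tune
        trace_source = model                          # next round uses the new model

    print(model)                                      # schematic "R1"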

fwiw, most people don't really grok the power of latent space wrt language models. Like, you say it, I believe it, but most people don't really grasp it.

I think something like structured generation might work in this context

Are there any websites that show the results of popular models on different benchmarks, explained in plain language? As an end user, I'd love a quick way to compare different models' suitability for different tasks.

Nice explainer. The R1 paper is a relatively easy read. Very approachable, almost conversational.

I say this because I am constantly annoyed by poor, opaque writing in other instances. In this case, DS doesn’t need to try to sound smart. The results speak for themselves.

I recommend anyone who is interested in the topic to read the R1 paper, their V3 paper, and DeepSeekMath paper. They’re all worth it.


Great post, but every time I read something like this I feel like I am living in a prequel to the Culture.

Is that bad? The Culture is pretty cool I think. I doubt the real thing would be so similar to us but who knows.

Oh no, I’d live on an Orbital in a heartbeat. No, it’s just that all of these kinds of posts make me feel like we’re about to live through “The Bad Old Days”.

This article has a superb diagram of the DeepSeek training pipeline.

How important is it that the reasoning takes place in another thread versus just chain-of-thought in the same thread? I feel like it makes a difference, but I have no evidence.
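
For what it's worth, here is how I'd sketch the two setups being contrasted, with chat() as a hypothetical stand-in for any chat-completion call (not a specific vendor's API):

    def chat(messages):
        """Hypothetical LLM call; returns a string."""
        return "..."

    question = "How many r's are in 'strawberry'?"

    # (a) Reasoning in a separate "thread": the trace never enters the visible
    #     conversation; only the distilled answer is carried forward.
    trace = chat([{"role": "user", "content": f"Think step by step: {question}"}])
    answer_separate = chat([{"role": "user",
                             "content": f"Scratchpad:\n{trace}\n\nNow answer: {question}"}])

    # (b) Chain-of-thought in the same thread: the trace stays in the context
    #     window and every later turn conditions on it.
    history = [{"role": "user", "content": f"Think step by step, then answer: {question}"}]
    history.append({"role": "assistant", "content": chat(history)})
    answer_inline = history[-1]["content"]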

There are no LLMs that reason; it's an entirely different statistical process compared to human reasoning.

"There are no LLMS that reason" is a claim about language, namely that the word 'reason' can only ever be applied to humans.

Not at all: we are building conceptual reasoning machines, but it is an entirely different technology from GPT/LLM DL/ML etc. [1]

[1] https://graphmetrix.com/trinpod-server


doesn't it seem like these models are getting to the point where even conceiving of their training and development is less and less possible for the general public?

I mean, we already knew only a handful of companies with capital could train them, but at least the principles, algorithms, etc. were accessible to individuals who wanted to create their own - much simpler - models.

it seems that era is quickly ending, and we are entering the era of truly "magic" AI models whose workings no one understands, because companies keep their secret sauces...


Recent developments like V3, R1 and S1 are actually clarifying and pointing towards more understandable, efficient and therefore more accessible models.

We have been in the 'magic scaling' era for a while now. While the basic architecture of language models is reasonably simple and well understood, the emergent effects of making models bigger are largely magic even to the researchers, only to be studied empirically after the fact.

I don't think it's realistic to expect to have access to the same training data as the big labs that are paying people to generate it for them, but hopefully there will be open source ones that are still decent.

At the end of the day, current o1-like reasoning models are still just fine-tuned LLMs, and don't even need RL if you have access to (or can generate) a suitable training set. The DeepSeek R1 paper outlined their bootstrapping process, and HuggingFace (and no doubt others) are trying to duplicate it.
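
A minimal sketch of that "no RL needed" route: plain supervised fine-tuning on reasoning traces with the standard language-modeling loss. The model name, data file, and <think>...</think> trace format below are illustrative assumptions, not any particular lab's recipe.

    import json, torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B"                  # stand-in base model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Each record: {"question": ..., "trace": ..., "answer": ...} (hypothetical file).
    for line in open("reasoning_traces.jsonl"):
        ex = json.loads(line)
        text = f"{ex['question']}\n<think>{ex['trace']}</think>\n{ex['answer']}"
        batch = tok(text, return_tensors="pt", truncation=True, max_length=2048)
        loss = model(**batch, labels=batch["input_ids"]).loss   # standard next-token loss
        loss.backward(); opt.step(); opt.zero_grad()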


In recent weeks what's happening is exactly the contrary.

Amazing accomplishments by the brightest minds, only to be used to write history by the stupidest people.


