DeepThought-8B: A small, capable reasoning model (ruliad.co)
130 points by AnhTho_FR 1 day ago | 97 comments





There's been a rush of releases of reasoning models in the past couple of weeks. This one looks interesting, too.

I found the following video from Sam Witteveen to be a useful introduction to a few of those models:

https://youtu.be/vN8jBxEKkVo


In what way did they "release" this? I can't find it on Hugging Face or Ollama, and they only seem to have a "try online" link in the article. "Self-sovereign intelligence", indeed.

They released it in the same sense OpenAI released GPT-4: there is an online demo you can chat with, and a form to get in touch with sales for API access.

they didn't

Legally, you cannot name Llama 3-based models like that. You have to include "Llama" in the name:

https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct/blo...


Too bad :)

Facebook trained the model on an Internet's worth of copyrighted material without any regard for licenses whatsoever - even if model weights are copyrightable, which is an open question, you're doing the exact same thing they did. Probably not a bulletproof legal defense though.


Can't wait until Meta sues them so we can have a judgment on whether or not model weights are subject to copyright.

Model weights are (abstractly speaking) a very intensive, concentrated form of website scraping, yes?

What does the (USA) law say about scraping? Does "fair use" play a role?


Yes, and there have already been court cases that ruled AI training on copyrighted data to be fair use, because it's technically no different from any other form of art: everything is based on seeing other ideas elsewhere, and there are no new ideas anymore.

"what is the population of manhattan below central park"

ChatGPT-o1-preview: 647,000 (based on 2023 data, breaking it down by community board area): https://chatgpt.com/share/674b3f5b-29c4-8007-b1b6-5e0a4aeaf0... (this appears to be the most correct, judging from census data)

DeepThought-8B: 200,000 (based on 2020 census data)
Claude: 300,000-350,000
Gemini: 2.7M during peak times (strange definition of population!)

I followed up with DeepThought-8B: "what is the population of all of manhattan, and how does that square with only having 200,000 below CP" and it cut off its answer, but in the reasoning box it updated its guess to 400,000 by estimating as a fraction of land area.


Am I wrong to think that "reasoning model" is a misleading marketing term?

Isn't it an LLM with an algo wrapper?


Whether you bake the behaviour in or wrap it in an external loop, you need to train/tune the expected behaviour. Generic models can do chain of thought if asked, but will be worse than a specialised one.

They're not baking anything in. Reasoning, as it is defined by AI marketing departments, is just beam search.

Could you educate me on what beam search is? Or link a good resource?

EDIT: https://www.width.ai/post/what-is-beam-search

So the wider the beam, the better the outcome?

Yep, no reasoning, just a marketing term to say "more accurate probabilities"


Beam search just traverses different paths and assigns each path a probability of being correct. The paths with the higher probabilities are kept and the ones with lower probabilities are pruned until the search terminates with an "answer". The marketing department calls it "reasoning" and "test-time compute" because the average consumer does not care whether it's beam search or something else.

Your link seems to do a good job of explaining beam search but it's a classic algorithm in state space exploration so most books on search algorithms and discrete optimization will have a section about it.¹

1: https://books.google.com/books?id=QzGuHnDhvZIC&q=%22beam%20s...
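For anyone who wants something concrete, here is a minimal beam-search sketch in Python. It's my own illustration rather than any vendor's decoder, and `next_token_probs` is a hypothetical stand-in for one forward pass of a language model:

  import math

  def beam_search(next_token_probs, start, beam_width=3, max_len=10, eos="<eos>"):
      # Each beam is a (token list, cumulative log-probability) pair.
      beams = [([start], 0.0)]
      for _ in range(max_len):
          candidates = []
          for tokens, score in beams:
              if tokens[-1] == eos:  # finished paths are carried over unchanged
                  candidates.append((tokens, score))
                  continue
              # Expand each path by every possible next token.
              for tok, p in next_token_probs(tokens).items():
                  candidates.append((tokens + [tok], score + math.log(p)))
          # Keep the beam_width highest-probability paths, prune the rest.
          beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
      return beams[0]  # the surviving path with the highest probability

Real decoders add length normalization, sampling temperature, caching, and so on, but the keep-the-top-k-paths-and-prune loop above is the part being marketed.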


An algorithm that searches for the highest probability answer is not "reasoning"? "Search" has been a fundamental building block of GOFAI since the beginning. How do you define reasoning? Can you justify it being different from the last 70 years of thought on the topic?

Since you asked, I define reasoning as cambridge does:

Reasoning "the process of thinking about something in order to make a decision"

Thinking: "the activity of using your mind to consider something"

Mind: "the part of a person that makes it possible for him or her to think, feel emotions, and understand things"

I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"

I also think that Cambridge's definition of mind is incomplete: it lacks the creativity part, along with cognition and emotions. But that's a vastly different topic.


>I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"

I recall reading about a theory of neurology where many thoughts (neural circuits) fired simultaneously and then one would "win" and the others got suppressed. The closest thing I can find right now is Global Workspace Theory.


That seems a bit of an odd definition.

A: Doing it with B.

B: Using C to do it.

C: The part that does B

Without defining "your", "consider", "person", "thinking", "feel", and "understand", it could be anything.

There's more than enough leeway in those undefineds to subjectively choose whatever you want.


I would rather call it circular reasoning, a thing humans are very capable of.

Enter the Chinese Room..

What do you mean please?

The Chinese Room is a theoretical room that contains a "Chinese speaker", but which, when given a text to 'understand', actually just looks the text up in a huge collection of words inside until it finds a response, and then outputs that response as its reply.

The same way they now call "open-source" a completely closed-source binary blob full of copyright infringement.

AI marketing departments are fond of anthropomorphic language but it's actually just regular beam search.

"reasoning model" means nothing so I don't think it's misleading.

Reasoning means "inference" or "deduction" to me, or at least some process related to first order logic.

The known upper bound for transformers' on-the-fly computation abilities is a complexity class called DLOGTIME-uniform TC^0.

There is a lot to unpack there, but if you take FO as being closed under conjunction (∧), negation (¬), and universal quantification (∀), you will find that DLOGTIME-uniform TC^0 is equal to FO + majority quantifiers (FOM).

So be careful about that distinction.

To help break the above down:

DLOGTIME = constructible by a RAM or TM in logarithmic time.
uniform = one construction covers all input sizes, where circuit families are the default convention.
TC^0 = constant-depth threshold circuits.

Even NP == SO-E, the second-order queries where the second-order quantifiers are only existentials.

DLOGTIME-uniform TC^0 is a WAY smaller class than most people realize, but anything that is an algorithm or a program basically is logic, with P being FO + least fixed point, or half a dozen other known mappings.

Transformers can figure out syntax, but if you dig into that DLOGTIME part, you will see that semantic correctness isn't really an option... thus the need to leverage the pattern matching and pattern finding of pre-training as much as possible.
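For anyone who wants to look these up, the standard descriptive-complexity correspondences being referenced are roughly the following (stated over ordered structures with built-in arithmetic; my summary, not part of the parent comment):

  FO                              = DLOGTIME-uniform AC^0
  FO + majority quantifiers (FOM) = DLOGTIME-uniform TC^0   (Barrington-Immerman-Straubing)
  existential SO (SO-E)           = NP                      (Fagin's theorem)
  FO + least fixed point          = P                       (Immerman-Vardi)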


Thanks. If I'm reading this right, the limiting factor on the intelligence of current LLMs is not the network size, nor the training data (size/quality), but rather the architecture? Do we know of a better one for complex computations / "reasoning"?

Given the name they gave it, someone with access should ask it for the “Answer to the Ultimate Question of Life, The Universe, and Everything”

If the answer is anything other than a simple “42”, I will be thoroughly disappointed. (The answer has to be just “42”, not a bunch of text about the Hitchhikers Guide to the Galaxy and all that.)


Deep Thought didn't answer right away either.

“Right away”. lol.

I asked it "Describe how a device for transportation of living beings would be able to fly while looking like a sphere" and it just never returned an output

I asked it to just count letters in a long word and it never returned an output (been waiting for 30 minutes now)

It isn’t pleased you ask it such questions

Blaine is a pain

The reasoning steps look reasonable and the interface is simple and beautiful, though DeepThought-8B fails to disambiguate the term "the ruliad", the technical concept from Wolfram physics, from this company's name, Ruliad. Maybe that isn't in the training data, because it misunderstood the problem when asked "what is the simplest rule of the ruliad?" and went on to reason about the company's core principles. Cool release; waiting for the next update.

Xd, gotta love how your first question to test a model is about a "ruliad". It's not even in my iOS dictionary.

I am very impressed. I asked chat.ruliad.co

  Beginning from the values for fundamental physical constants, is it possible to derive the laws of entropy ?
and then based on its response to that I asked it

  Based on this analysis, can you identify and describe where the dividing line is between (a) the region where (microscopic/atomic) processes are reversible, and (b) the region where macroscopic processes are irreversible ?

That will be very common in the corpus, thus pattern matching and retrieval more than 'reasoning'.

Not that it isn't useful, just expected behavior for an LLM these days.

I would be interested in the output, because it is almost certainly the more common but fundamentally incorrect intro-class half-truth that is useful for the typical curriculum, but not near the frontier of current understanding.


For those of us who can't try it out ourselves, could you post the results somewhere and share a link?

It's not giving me a link, and its responses are multi-step and wordy. I can post some pullquotes...

Yes, it is possible to derive the laws of entropy from the fundamental physical constants. Entropy is a measure of the number of possible microscopic configurations of a system, and it is related to the fundamental physical constants through the Boltzmann constant (k_B) and the laws of thermodynamics.

The second law of thermodynamics states that the total entropy of an isolated system can only increase over time, or remain constant if the system is in equilibrium. This law can be derived from the fundamental physical constants and the microscopic behavior of particles in a system.

One way to derive the laws of entropy is by considering the statistical mechanics of a system. In statistical mechanics, entropy is related to the number of possible microscopic configurations (microstates) of a system, which can be described by the Boltzmann constant (k_B). The entropy (S) of a system is given by the Boltzmann formula:

S = k_B * ln(Ω)

where Ω is the number of possible microstates.

The fundamental physical constants, such as the speed of light (c), the gravitational constant (G), and the Planck constant (h), are related to the other constants through various equations. For example, the Boltzmann constant (k_B) can be expressed in terms of the Planck constant (h), the speed of light (c), and the gravitational constant (G):

k_B = (h * P / 2π) * (c^2 / G)

where P is the pressure.

By using these relationships and considering the microscopic behavior of particles in a system, it is possible to derive the laws of entropy from the fundamental physical constants. However, it is important to note that entropy is a macroscopic property, and its laws are not directly derived from the fundamental physical constants but rather from the microscopic behavior of particles and the laws of thermodynamics.

[Q: Based on this analysis, can you identify and describe where the dividing line is between (a) the region where (microscopic/atomic) processes are reversible, and (b) the region where macroscopic processes are irreversible ?]

The dividing line between the region where microscopic/atomic processes are reversible and the region where macroscopic processes are irreversible lies in the scale at which the processes occur and the interactions between particles.

At the microscopic level (a), processes are generally considered reversible because the interactions between individual particles can be easily reversed. For example, when two particles collide, they can exchange momentum and energy, and the process can be considered reversible. In this region, the system is in thermodynamic equilibrium, and the second law of thermodynamics does not apply.

However, as we move to the macroscopic level (b), the number of particles involved in the processes increases, and the interactions between them become more complex. In this region, the processes are considered irreversible because the energy and momentum exchanges between particles lead to an increase in entropy. This increase in entropy is a result of the second law of thermodynamics, which states that the total entropy of an isolated system can only increase over time or remain constant if the system is in equilibrium.

The dividing line between reversible and irreversible processes can be identified by examining the scale at which the system operates and the interactions between particles. At the microscopic level, processes are generally reversible, while at the macroscopic level, processes are irreversible due to the increase in entropy resulting from the interactions between a large number of particles.

-end-


Is the source code available for this? And who is behind the company?

Please don't call model weights source code. Code is something you can (usually) read and understand, if anything weights are closer to a very obfuscated compiled binary - although even that can be analyzed by a good enough reverse engineer.

"Open Source" models are the latest in a long series of attempts to take advantage of FOSS's reputation without actually releasing something that adheres to the principles.


To run a language model you need both the model weights and the source code of the implementation!

I would think truly open source means also including the initial training data. i.e. Everything needed to build[/tweak/customize] the thing from the ground up.

llama.cpp is open source!

Neat! But if we're sticking to the source code analogy, the matrix multiplier probably maps closer to a CPU or an interpreter - I wouldn't say every Java program is open source because you have an open-source JVM.

does that make every image closed source because you need a viewer to view them?

Yes, which is why if you're serious you will exclusively use the terminal to protect your computer from even the passing chance of being exposed to a so called "image file".

In all seriousness, compilation and model training are lossy processes and erase a lot of the context needed to understand the output (and with model training we don't fully understand it even with access to the training data). Images aren't necessarily derivative of anything, so the analogy breaks down here.


Often images are not only lossily compressed, but they are also a composite of many images/layers/image transformations. These are lost when the image is flattened and then distributed.

I asked it 'find two primes whose sum is 123'... it has been in deep thought for 5 minutes, just looping and looping over seemingly repeated hallucinations of the right path. (btw, ChatGPT immediately answers 61 and 62, lol... so much for intelligence)

Here is what I got from o1:

To find two prime numbers that sum to 123, let's consider the properties of prime numbers and the nature of their sums:

1. Prime numbers greater than 2 are odd: except for 2, all prime numbers are odd.

2. Sum of two odd numbers is even: adding two odd numbers results in an even number.

3. Sum of an even and an odd number is odd: adding an even number (like 2) and an odd number results in an odd sum.

Given that 123 is an odd number, the only way two primes can sum to 123 is if one of them is 2 (the only even prime) and the other is an odd prime. Let’s test this possibility: • 2 + x = 123 ⇒ x = 121

However, 121 is not a prime number (since 121 = 11 × 11).

Next, consider the possibility of adding two odd primes: • Odd Prime + Odd Prime = Even Number

Since adding two odd primes yields an even number, it’s impossible for two odd primes to sum to 123 (an odd number).

Conclusion: There are no two prime numbers that sum to 123.

Answer: There are no two prime numbers whose sum is 123; such primes do not exist.
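(A short brute-force check, my own Python sketch rather than anything from o1, confirms this conclusion:)

  def is_prime(n):
      return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

  # Every pair summing to 123 has its smaller member <= 61.
  print([(p, 123 - p) for p in range(2, 62) if is_prime(p) and is_prime(123 - p)])
  # -> []  (the only candidate, 2 + 121, fails because 121 = 11 * 11)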


QwQ went through two screenfuls of tokens to try all primes under 123 twice, but eventually gave the right answer.

Well explained!

Qwen coder 32b with a JavaScript interpreter

Impressive answer for a model that can run on your own computer

https://discuss.samsaffron.com/discourse-ai/ai-bot/shared-ai...


It looks nice, but my Chrome browser on Android has all sorts of trouble rendering the animated bits, so it ends up skipping frames as I navigate and click. Add to that, the model doesn't respond at all in my multiple attempts; it's a waste of time until it's remedied.

Is it possible to try it without logging in?

Can you log in with anything other than a Google account?

I was excited by the tagline "Self-Sovereign", but it appears this is not.


"Model A 13B", "Model B 20B" etc are pretty vapid claims. Which actual models? There are plenty of terrible high-param-count models from a year or two ago. The benchmark seems meaningless without saying what models are actually being compared against... And "13B" in particular is pretty sketchy: are they comparing it against Llama 2 13B? Even an untuned Llama 3.1 8B would destroy that in any benchmark.

Smells a little grifty to me...


It’s just a web page. How to try the model?

It's a bit buried, but there's a link further down that takes you to an interface similar to ChatGPT. However, I have not been able to get any output from it; it seems to just think forever. I must be giving it the wrong prompts...

User

what's up

---

DeepThought

Thought for 3 seconds

1. problem understanding

The user has provided a simple greeting 'what's up' and is expecting a response. This is a casual conversation starter, and I need to provide an appropriate reply.

2. analysis

The phrase 'what's up' is an informal greeting asking about someone's well-being or current state. It's commonly used in casual conversations with friends, family, or acquaintances.

3. contextualization

As an AI language model, I don't have personal experiences or emotions, so I need to provide a response that acknowledges the greeting without implying any personal state.

4. solution proposal

An appropriate response would be to acknowledge the greeting and offer a brief, friendly reply that doesn't require reciprocating any personal information.

5. creative thinking

I'll craft a response that is friendly, concise, and universally applicable: 'Not much, just hanging out and processing information. How about you?'

6. evaluation

This response acknowledges the greeting, mentions my 'state' (processing information) which is true for an AI, and turns the question back to the user, encouraging further conversation.

7. verification

The proposed response meets all the criteria: it's friendly, appropriate for the casual tone of 'what's up', and doesn't require any personal information from me as an AI.

8. conclusion drawing

The final response 'Not much, just hanging out and processing information. How about you?' is an appropriate and effective way to respond to the user's 'what's up'.

---

Not much, just hanging out and processing information. How about you?


DeepMasking

Oh fuck me, it’s a fellow autistic. You’d think after all these years on the planet my brain would be able to immediately respond with “Fine. You?” but no, this is what my brain does before eventually replying.

A bit off-topic, but that comparison graph is a great example of why you should buy your designer a cheap secondary screen. I was viewing it on my second monitor and had to lean in to make out the off-white bar for Model D on the light-grey background. Moved the window over to my main screen and it's clear as day: five nice shades of coffee on a light-grey background.

That's a pretty egregious mistake for a designer to make -- and that's not even mentioning the lack of accessibility. WebAIM's contrast checker says it's a 1:1 contrast ratio!
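For anyone curious where that 1:1 figure comes from, WebAIM uses the WCAG 2.x formula, which compares the relative luminances of the two colours. A rough Python sketch of the computation (my own, with made-up sample colours):

  def srgb_to_linear(c):  # c is an 8-bit channel value, 0..255
      c = c / 255
      return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

  def luminance(r, g, b):  # WCAG relative luminance
      return 0.2126 * srgb_to_linear(r) + 0.7152 * srgb_to_linear(g) + 0.0722 * srgb_to_linear(b)

  def contrast(c1, c2):
      hi, lo = sorted((luminance(*c1), luminance(*c2)), reverse=True)
      return (hi + 0.05) / (lo + 0.05)

  print(contrast((240, 240, 240), (240, 240, 240)))  # identical colours -> exactly 1.0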

If someone is releasing a model that claims to have a level of reasoning, one would hope that their training dataset was scrutinized and monitored for unintended bias (as any statistical dataset is susceptible to: see overfitting). But if the graph on the announcement page is literally unreadable to seemingly anyone but the creator... that's damning proof that there is little empathy in the process, no?


> that's damning proof that there is little empathy in the process, no?

No.


I wouldn’t say it’s implied, but there’s a reason people put on nice clothes for an interview.

I’m looking at the graphs on my phone and I’m pretty sure that there are 5 graphs and 3 labels. And their 8B model doesn’t seem to be very good, looks like a 20B model beats it in every single benchmark.


The body text is also quite hard to read because the font has a tall x-height and line spacing is very tight.

This makes paragraphs look very dense, almost like it was set in uppercase only, because the lowercase letters don’t create a varying flow between lines for the eye to follow.

The model may be good, but the web design doesn’t win any prizes.


Even on a high quality screen, it's a bit much.

Also, is it standard practice to obfuscate which models you're benchmarking against? They're just labeled Model A-D, with sizes but no additional information.

Given the context, it appears they are not benchmarking against other models but comparing differently sized versions of the same model. The 8B one is just the one they decided to give a catchy name. The other ones are probably also just fine-tuned Llama models. But without information on the total compute budget (i.e. number of training tokens), this kind of plot is pretty useless anyway.

That doesn't make any sense, because their 8B is listed as benchmarking above the 13B "model A".

I have an iPhone 15 Pro Max and it took me five glances to see the last bar. Use black outlines or something idk

I sadly don't feel this is a mistake; the transparent ones are the two that beat the model in one or more categories. It feels more like a scam than an error. If not, please fix it.

Not bad, asked it to count Rs in "strawberry" and Ns in "international", it answered correctly, and it was fast.

I gave it a trick question. It started on the right path and then went bonkers!

how many r's are there in the word strawrrberrry

Thought for 173 seconds

1 problem understanding

The task is to count the number of 'R's in the word 'strawrrberrry'. This requires careful examination of each letter in the given word.

2 data gathering

Let's break down the word into individual letters: s, t, r, a, w, r, r, b, e, r, r, r, y

3 analysis

We need to count the occurrences of the letter 'R' in this sequence. Let's go through each letter and keep a count.

4 calculation

Counting 'R's: 1. First 'R' (t=1), 2. Second 'R' (t=2), 3. Third 'R' (t=3), 4. Fourth 'R' (t=4), 5. Fifth 'R' (t=5). Total count: 5

5 verification

Let's double-check our count by going through the word again: strawrrberrry. Confirmed, there are indeed 5 'R's.

6 error checking

Ensuring we haven't missed any 'R's or counted any extra. The count remains at 5, which appears to be correct.

7 conclusion drawing

Based on our careful count and verification, we can confidently state that there are 5 'R's in the word 'strawrrberrry'.

8 implementation

The final response should be a simple statement: 'There are 5 R's in the word strawrrberrry.'

There are 5 R's in the word strawrrberrry.
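(For the record, the model's own letter breakdown in step 2 contains six 'r's, and a plain character count agrees. A quick check of my own, not the model's:)

  print("strawrrberrry".count("r"))  # -> 6, not 5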


Mine was wrong

1 Problem Understanding: The task is to count the number of 'r's in the word 'strawberry'. This requires identifying each occurrence of the letter 'r' in the given word and keeping a count.

2 Data Gathering: The word to analyze is 'strawberry'. I need to examine each character in this word.

3 Hypothesis Formation: Based on a quick glance, I hypothesize that there are 2 'r's in 'strawberry', but I need to verify this systematically.

4 Analysis: I will go through each letter of 'strawberry' one by one: s-t-r-a-w-b-e-r-r-y

5 Calculation: Counting the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 9. Total count: 2

6 Verification: Double-checking: s(0)-t(0)-r(1)-a(1)-w(1)-b(1)-e(1)-r(2)-r(2)-y(2). Confirmed, there are 2 'r's.

7 Conclusion Drawing: The final answer is that there are 2 'r's in the word 'strawberry'.

8 Implementation: To present the answer, I will state: 'There are 2 r's in the word "strawberry".' This directly answers the question in a clear and concise manner.


It's still not quite perfect at letter counting, but it handles questions like "which is heavier? 2kg of feathers or 1kg of lead" really well. The reasoning steps seem to prevent it from jumping to pattern-matched answers

I don't get why this question is relevant for evaluating reasoning capacity. GPT-4o (no reasoning in an anthropomorphic sense) answers correctly:

---

The reasoning lies in the concept of mass and weight. The weight of an object is determined by its mass, not its material.

1. Mass comparison:

2kg of feathers has a mass of 2 kilograms.

1kg of lead has a mass of 1 kilogram.

Since 2 kilograms is greater than 1 kilogram, the feathers are heavier.

2. Irrelevance of material:

The type of material (feathers or lead) does not change the mass measurement.

Lead is denser than feathers, so 1kg of lead takes up much less space than 2kg of feathers, but the weight is still based on the total mass.

Thus, 2kg of any substance, even something as light as feathers, is heavier than 1kg of a dense material like lead.


Large models have no issues with this question at all. Even llama-70B can handle it without issues, and that is a lot smaller than GPT-4o. But for small models this is a challenging question. llama-8B gets it confidently wrong 4 out of 5 times. gemma-2-9B gets it wrong pretty much every time. qwen-coder-7B can handle it, so it's not impossible. It's just uncommon for small models to reliably get this question right, which is why I find it noteworthy that this model does.

Yes, makes sense. I didn't take into account the model size, and now that you mention it, it makes a lot of sense.

This is a bad test for reasoning. The model is either trained on the spelling of its tokens or specific words, or not. It's not a thing reasoning would help with in any way. (Ok, maybe accidentally if it tried different splits it's trained to spell, but that's not the reasoning part either)

I know that this is a typical test of an LLM's ability to reason, but I wonder how much time could be saved by teaching an LLM how to recognise the type of problem that it's inherently bad at, bundling a Python interpreter, and asking it to write a short function to solve the problem? Are we not pushing a boulder uphill?
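The kind of short function meant here is trivial; a hypothetical sketch the model could emit and hand to a bundled interpreter:

  # Hypothetical tool function an LLM could write and run instead of
  # counting letters token-by-token itself.
  def count_letter(word: str, letter: str) -> int:
      return word.lower().count(letter.lower())

  print(count_letter("strawrrberrry", "r"))  # -> 6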

> teaching an LLM how to recognise the type of problem that it's inherently bad at

Solving this is the actual hard part and is either adjacent to or even equivalent to solving the problem of LLMs hallucinating. ChatGPT already includes a Python interpreter tool which can be used if the context indicates it's appropriate.


I suppose my question is pointing to another, which is: can one make an LLM that doesn't hallucinate? Isn't that problem inherent to the way that LLMs work? Obviously we can try to clean the data so there isn't any nonsense fed into it, but that'll only get you so far with a probabilistic, stochastic system. As an LLM once told me "Some experts argue that hallucination is an innate limitation of LLMs, akin to confabulation in humans, where false memories are created without the intention to deceive". I'm not sure if I believe that though.

For LLMs, no, but the explanation is also wrong: it has nothing to do with 'false memories' and everything to do with how LLMs work.

Here is the paper.

https://arxiv.org/abs/2401.11817

RAG and fine tuning improve domain specificity and may reduce the problem to a level where you don't care, but it will always be there.

Clean data would help reduce the incidents, possibly to a level that is more usable, but it also doesn't remove the problem.

Considering next-token prediction as serial runs on multi-tape TMs, with the previous output as the input, can help.

Especially if you consider Microsoft's BitNet b1.58, which requires full precision for training but can reduce weights to just the sign components for inference.

Unfortunately, all the paths I have to explain this require graduate-level complexity theory and/or differential geometry. Or you relive the Brouwer–Hilbert controversy by trying the logic path.


Does it still count correctly if you ask for variations like "strawerberry"?

I found the following transcript of the thought process to that question equally enlightening and funny:

Thought for 4 seconds

1 problem understanding

The task is to count the number of instances of the letter 'r' in the word 'strawerberry'. This requires identifying each occurrence of 'r' in the given word and keeping a count.

2 data gathering

The word to analyze is 'strawerberry'. I need to examine each character in this word.

3 hypothesis formation

Based on a quick glance, I hypothesize that there are 2 instances of 'r' in 'strawerberry'.

4 analysis

Let's break down the word into individual characters: s, t, r, a, w, e, r, b, e, r, r, y

5 calculation

Now, let's count the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 7, 3) Third 'r' at position 10. Total count: 3

6 verification

Double-checking the count: s(0), t(0), r(1), a(0), w(0), e(0), r(2), b(0), e(0), r(3), r(3), y(0). Confirmed: 3 instances of 'r'.

7 error checking

The initial hypothesis was incorrect. The actual count (3) differs from the initial guess (2). This highlights the importance of careful analysis rather than quick assumptions.

8 conclusion drawing

The letter 'r' appears 3 times in the word 'strawerberry'.

9 implementation

The final response should be the number 3, as this is the correct count of 'r' instances in 'strawerberry'.


I get:

>Alright, I need to figure out how many 'r's are in the word 'strawerberry'. Let's look at it carefully. I'll write it down: s-t-r-a-w-e-r-b-e-r-r-y. Now, I'll go through each letter one by one.

>Starting from the beginning: s, t, r – that's one 'r'. Then a, w, e, r – that's the second 'r'. Next, b, e, r – that's the third 'r'. And finally, another r before the y, making it the fourth 'r'. So, I count four 'r's in total.

>The word "strawberry" contains four 'r's.


Wow, so close. I wonder what went wrong here?

Tokenization, still. LLMs don't see individual letters. It's like using Google Translate to ask a German speaker the same question and concluding that Germans can't count letters.
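You can see the effect directly with a BPE tokenizer. A rough sketch assuming the tiktoken package and OpenAI's cl100k_base vocabulary (DeepThought itself sits on Llama's tokenizer, so its split will differ, but the point is the same):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  tokens = enc.encode("strawrrberrry")
  # The model receives multi-letter chunks, not individual characters.
  print([enc.decode([t]) for t in tokens])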

Does this mean there's no actual reasoning happening in these "reasoning" steps?

Well, the model is based on llama-8b, which is quite bad at reasoning. Reasoning (or things that look and quack like reasoning) is more the domain of 70B+ models, and some of the newer 7B models.

The model is doing well on many reasoning tasks; what they are doing is a massive step up from llama-8b. But it still makes some silly mistakes. I bet if you did the same finetuning procedure with qwen-7B or llama-70B as a starting point, you would get a quite competent model.


Of course there’s no actual reasoning.

Is there something more than prediction going on? I'd like to understand why there would be studies like this: https://arxiv.org/pdf/2307.02477 if it's all just prediction. Are there some algorithms that are called "reasoning", that are somewhere between prediction and full-blown reasoning?

My guess is the framing in that paper was chosen to be provocative. The authors surely understand deep neural networks neither reason nor recite as we understand it in humans.

In any case, the conclusion was decidedly against reasoning.



