
Apple website


Tried exactly the same model. And unfortunately the reasoning is just useless. But it is still not able to tell how many r's are in strawberry.


That's a tokenizer issue though?


Not 100%. Chain-of-thought models should recognize that they can spell the word out letter by letter in some separated form and then count the letters in that form. The Qwen distill seems to do exactly this, and does it well:

> Step-by-step explanation:

> 1. Break down each word: "not", "really", "a", "tokenizer", "issue".

> 2. Count 'e's in each word:

> - "not": 0

> - "really": 1

> - "a": 0

> - "tokenizer": 2

> - "issue": 1

> 3. Sum the counts: 0 + 1 + 0 + 2 + 1 = 4.

>

> Answer: There are 4 E's in the phrase.

In the thought portion it broke the words up every way you could think to check, then validated the total by listing the letters in a numbered list by index and comparing that count against the sum of the per-word counts.
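The spell-then-count strategy the model describes can be sketched in a few lines of Python (a toy illustration of the procedure, not anything the model actually runs):

```python
# Break the phrase into words, spell each word out letter by letter,
# count the target letter per word, then sum the per-word counts.
phrase = "not really a tokenizer issue"
target = "e"

per_word = {}
for word in phrase.split():
    letters = list(word)              # "spell" the word in separated form
    per_word[word] = letters.count(target)

total = sum(per_word.values())
print(per_word)  # {'not': 0, 'really': 1, 'a': 0, 'tokenizer': 2, 'issue': 1}
print(total)     # 4
```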


But the only way to do this is if it is trained on how to map the word token to character tokens, i.e.

Hello -> h e l l o

66547 -> 12 66 88 88 3
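That mapping can be made concrete with a toy sketch (every ID and both vocabularies below are made up for illustration; real tokenizers use learned BPE vocabularies):

```python
# Hypothetical word-level token mapped onto character-level tokens.
word_vocab = {"Hello": 66547}
char_vocab = {"h": 12, "e": 66, "l": 88, "o": 3}

def to_char_ids(word):
    # Look up each character's (made-up) ID after lowercasing.
    return [char_vocab[c] for c in word.lower()]

print(word_vocab["Hello"], "->", to_char_ids("Hello"))
# 66547 -> [12, 66, 88, 88, 3]
```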

Or, maybe it memorized that hello has a single e.

Either way, this seems to be an edge case that may or may not exist in the training data, but it seems orthogonal to 'reasoning'.

A better test would be how it performs if you give the spelling mappings for each word in the context.


"Be trained how to map" implies someone is feeding in a list of every token and the letters for that token as training data and then training on it. More realistically, this just happens automatically during training: the model figures out which splits work with which tokens because that answer was right whenever it came across a spelling example or question. The "reasoning" portion comes into play in its ability to judge whether what it's doing is working, rather than going with the first guess. E.g. feeding in "zygomaticomaxillary" and asking for the count of 'a's gives a CoT like:

> <comes to an initial guess>

> Wait, is that correct? Let me double-check because sometimes I might miscount or miss letters.

> Maybe I should just go through each letter one by one. Let's write the word out in order:

> <writes one letter per line with the conclusion for each>

> *Answer:* There are 3 "a"s in "zygomaticomaxillary."
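The double-check the CoT performs, an initial guess validated by a letter-by-letter recount, amounts to something like this (a sketch of the procedure, not the model's actual computation):

```python
word = "zygomaticomaxillary"
target = "a"

# Initial guess: a straight count.
first_pass = word.count(target)

# Verification: write the word out one letter at a time with its index,
# drawing a conclusion for each, then tally the matches.
second_pass = 0
for i, letter in enumerate(word):
    hit = letter == target
    second_pass += hit
    print(f"{i}: {letter}{' <- match' if hit else ''}")

assert first_pass == second_pass  # the two methods agree
print(second_pass)                # 3
```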

It's not the only way to judge a model, but there are more ways to accurately answer this problem than "hardcode the tokenizer data in the training", and heavily trained CoT models should be expected to hit on at least several of those other ways; otherwise it's suspect that they'd miss similar things elsewhere.


The important thing is, when we're all replaced by robots, deep down we will know we are superior because we can count letters in strawberry.


Wait, no AI is used? Amazon makes big claims about their code conversion from Java 8 to 17 using Q Developer (a GitHub Copilot equivalent). Why not use Llama 3 models here? Couldn't they help with such a task?


Why use something that works 90 times out of 100 but blows up the other ten, versus an automated process that can be provably correct all the time?


This is an application where LLMs should be the obvious choice: it is machine translation, the thing they are supposed to excel at. Why not use them? Likely a lack of data. You would need lots of Kotlin data (lots of Java data surely exists), and the data would need to overlap so the LLM could learn the mapping.


Why did they skip o2?



Maybe spend more compute time to let it think about optimizing the compute time.


That kind of proves the point that no matter how smart it gets, it may still have several deficits that are crucial, yet trivial for humans. Is it generalizing over any task, or only a specific set of tasks?


I gave it a trick question. It started down the right path and then went bonkers!

how many r's are there in the word strawrrberrry

Thought for 173 seconds

1 problem understanding

The task is to count the number of 'R's in the word 'strawrrberrry'. This requires careful examination of each letter in the given word.

2 data gathering

Let's break down the word into individual letters: s, t, r, a, w, r, r, b, e, r, r, r, y

3 analysis

We need to count the occurrences of the letter 'R' in this sequence. Let's go through each letter and keep a count.

4 calculation

Counting 'R's: 1. First 'R' (t=1), 2. Second 'R' (t=2), 3. Third 'R' (t=3), 4. Fourth 'R' (t=4), 5. Fifth 'R' (t=5). Total count: 5

5 verification

Let's double-check our count by going through the word again: strawrrberrry. Confirmed, there are indeed 5 'R's.

6 error checking

Ensuring we haven't missed any 'R's or counted any extra. The count remains at 5, which appears to be correct.

7 conclusion drawing

Based on our careful count and verification, we can confidently state that there are 5 'R's in the word 'strawrrberrry'.

8 implementation

The final response should be a simple statement: 'There are 5 R's in the word strawrrberrry.'

There are 5 R's in the word strawrrberrry.
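For the record, a one-liner gives the count the model missed: its own letter breakdown above contains six r's, yet it tallied only five.

```python
word = "strawrrberrry"
print(list(word))         # same breakdown the model produced: s, t, r, a, w, ...
print(word.count("r"))    # 6, not the 5 it reported
```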


What is stopping you from doing it now? I know Q is not good (it hallucinates, it's slow, it requires sign-in), but it's wiser to explain what your gripe is about than to just say you'll leave, which you can always do.


My gripe was with the Explainer modal that covers the entire article upon visiting the site.


What license is this model under? Clicking the HF link leads to a 404.

