Not 100% so for chain-of-thought models: they should recognize that they need to spell the word out letter by letter in some separated form and then count the tokens in that form. The Qwen distill seems to do exactly this really well:
> Step-by-step explanation:
> 1. Break down each word: "not", "really", "a", "tokenizer", "issue".
> 2. Count 'e's in each word:
> - "not": 0
> - "really": 1
> - "a": 0
> - "tokenizer": 2
> - "issue": 1
> 3. Sum the counts: 0 + 1 + 0 + 2 + 1 = 4.
>
> Answer: There are 4 E's in the phrase.
In the thought portion it broke the words up every way you could think of to check, then validated the total by writing the letters out as a numbered list, counting the matches by index, and comparing that against the sum of the per-word counts.
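For anyone who wants to see that cross-check spelled out, here is a rough Python sketch of the two counting routes described above (per-word sums versus a numbered list of letters). The function names `per_word_counts` and `indexed_hits` are made up for illustration; this is not what the model runs internally, just the same procedure written as code.

```python
def per_word_counts(phrase: str, letter: str) -> list[tuple[str, int]]:
    """Count `letter` in each whitespace-separated word (case-insensitive)."""
    letter = letter.lower()
    return [(word, word.lower().count(letter)) for word in phrase.split()]


def indexed_hits(phrase: str, letter: str) -> list[int]:
    """Number every alphabetic character from 1 and return the matching indices."""
    letter = letter.lower()
    letters = [ch for ch in phrase.lower() if ch.isalpha()]
    return [i for i, ch in enumerate(letters, start=1) if ch == letter]


phrase = "not really a tokenizer issue"
counts = per_word_counts(phrase, "e")
total = sum(n for _, n in counts)
hits = indexed_hits(phrase, "e")

# The two routes must agree, otherwise one of them miscounted.
assert total == len(hits)
print(counts)  # per-word breakdown
print(total)   # 4
```

The point of having two independent routes is the same as in the CoT: a disagreement between them flags a miscount without needing to know the right answer in advance.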
"Be trained how to map" implies someone is feeding in a list of every token and its letters as training data and then training on that. More realistically, this just happens automatically during training: the model figures out which splits work with which tokens because those answers turned out right whenever it came across a spelling example or question. The "reasoning" portion comes into play through its ability to judge whether what it's doing is working rather than going with the first guess. E.g. feeding it "zygomaticomaxillary" and asking for the count of 'a's gives a CoT like:
> <comes to an initial guess>
> Wait, is that correct? Let me double-check because sometimes I might miscount or miss letters.
> Maybe I should just go through each letter one by one. Let's write the word out in order:
> <writes one letter per line with the conclusion for each>
> *Answer:* There are 3 "a"s in "zygomaticomaxillary."
It's not the only way to judge a model, but there are more ways to answer this problem accurately than "hardcode the tokenizer data in the training", and heavily trained CoT models should be expected to hit on at least several of these other ways; if they don't, it is fair to suspect they miss similar kinds of things elsewhere.
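The "one letter per line with the conclusion for each" step from the quoted CoT is easy to mimic in code. A minimal sketch in Python, assuming a hypothetical helper `spell_and_count` (my own name, not anything from the model or a library):

```python
def spell_and_count(word: str, letter: str) -> tuple[list[str], int]:
    """Return one trace line per letter, marking matches, plus the final count."""
    letter = letter.lower()
    count = 0
    trace = []
    for position, ch in enumerate(word.lower(), start=1):
        if ch == letter:
            count += 1
            trace.append(f"{position}. {ch} <- match #{count}")
        else:
            trace.append(f"{position}. {ch}")
    return trace, count


trace, count = spell_and_count("zygomaticomaxillary", "a")
print("\n".join(trace))
print(count)  # 3
```

Nothing here depends on how the word was tokenized, which is the point: once the letters are written out separately, the counting itself is trivial.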
Wait, no AI is used? Amazon makes big claims about its code conversion from Java 8 to 17 using Q Developer (a GitHub Copilot equivalent). Why not use Llama 3 models here? Can't they help with this?
This is an application where LLMs should be the obvious choice: it is machine translation, the thing they are supposed to excel at. Why not use them? Likely a lack of data. You would need lots of Kotlin data (I’m sure lots of Java data exists), and the data would need to overlap so the LLM could understand the mapping.
That kind of proves the point that no matter how smart it gets, it may still have several crucial weaknesses in things that are trivial for humans. Is it generalizing to any task, or only to a specific set of tasks?
What is stopping you from doing it now? I know Q is not good (hallucinates, slow, requires sign-in), but it's wiser to explain what your gripe is than to just say so, which you can always do.