Hacker News

The problem is that the model never gets to see individual letters. The tokenizers used by these models break up the input into pieces. Even though the smallest units are bytes in most encodings (e.g. byte-level BPE), the tokenizer will cut most of the input into much larger units, because the vocabulary contains fragments of words or even whole words.

For example, take the sentence "Welcome to Hacker News, I hope you like strawberries." The Llama 405B tokenizer will tokenize this as:

    Welcome Ġto ĠHacker ĠNews , ĠI Ġhope Ġyou Ġlike Ġstrawberries .
(Ġ means that the token was preceded by a space.)

Each of these pieces is looked up and encoded as a tensor of its vocabulary index. Adding a special token for the beginning of the text gives:

    [128000, 14262, 311, 89165, 5513, 11, 358, 3987, 499, 1093, 76203, 13]
So, all the model sees for 'Ġstrawberries' is the number 76203 (which is then used in the piece embedding lookup). The model does not even have access to the individual letters of the word.
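This final lookup step can be sketched as a plain dictionary from piece to index. The pieces and ids below are the Llama 405B values quoted above; everything else is just illustration, not the real tokenizer code:

```javascript
// Toy sketch of the tokenizer's final lookup step: each piece is
// replaced by its vocabulary index, so the model only ever sees ids.
const vocab = new Map([
  ["Welcome", 14262], ["Ġto", 311], ["ĠHacker", 89165], ["ĠNews", 5513],
  [",", 11], ["ĠI", 358], ["Ġhope", 3987], ["Ġyou", 499],
  ["Ġlike", 1093], ["Ġstrawberries", 76203], [".", 13],
]);

const pieces = ["Welcome", "Ġto", "ĠHacker", "ĠNews", ",", "ĠI",
                "Ġhope", "Ġyou", "Ġlike", "Ġstrawberries", "."];

// Prepend the special beginning-of-text id (128000 for Llama 3).
const ids = [128000, ...pieces.map((p) => vocab.get(p))];

console.log(ids);
// [128000, 14262, 311, 89165, 5513, 11, 358, 3987, 499, 1093, 76203, 13]
```

The embedding table is then indexed with these integers; nothing downstream of this step has any notion of the letters inside "strawberries".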

Of course, one could argue that the model should be fed bytes or codepoints instead, but that would make it vastly less efficient, since attention is quadratic in sequence length. That said, machine learning models have worked at the byte or character level in the past and may do so again in the future.
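To get a rough sense of that cost, compare sequence lengths for the example sentence, assuming the ~12-token split shown above (back-of-the-envelope sketch, not a real tokenizer):

```javascript
// Compare sequence lengths: subword tokens vs raw UTF-8 bytes.
const sentence = "Welcome to Hacker News, I hope you like strawberries.";

const numTokens = 12; // the Llama 405B split quoted above
const numBytes = new TextEncoder().encode(sentence).length;

console.log(numBytes); // 53 bytes for the same sentence

// With quadratic attention, the cost ratio is roughly (bytes / tokens)^2.
console.log((numBytes / numTokens) ** 2);
```

For this sentence the byte sequence is over four times longer, so quadratic attention pays roughly a 20x cost for byte-level input.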

Just wanted to finish off this comment by saying that a word may be split into multiple tokens if the word itself is not in the vocabulary. For instance, the same sentence translated to my native language is tokenized as:

    Wel kom Ġop ĠHacker ĠNews , Ġik Ġhoop Ġdat Ġje Ġvan Ġa ard be ien Ġh oud t .
And the word for strawberries (aardbeien) is split, though still not into letters.



The thing is, how the tokenizing works is about as relevant to the person asking the question as the name of the cat of the delivery guy who delivered the GPU the LLM runs on.


How the tokenizer works explains why a model can't answer the question; the name of the cat doesn't explain anything.

This is Hacker News, we are usually interested in how things work.


Indeed, I appreciate the explanation; it is certainly both interesting and informative to me. But to somewhat echo the person you are replying to: if I wanted a boat, and you offer me a boat, and it doesn't float, the reasons for failure are perhaps full of interesting details, but perhaps the most important thing to focus on first is to make the boat float, or to stop offering it to people who are in need of a boat.

To paraphrase how this thread started: someone was testing different boats to see whether they can simply float, and they couldn't. And the reply questioned the validity of testing whether boats can simply float.

At least this is how it sounds to me when I am told that our AI overlords can’t figure out how many Rs are in the word “strawberry”.


At some point you need to just accept the details and limitations of things. We do this all the time. Why does your calculator give only approximate results? Why can't your car go backwards as fast as forwards? Etc. It sucks that everyone gets exposed to the relatively low-level implementation with LLMs (almost the raw model), but that's the reality today.


People do get similarly hung up on surprising floating point results: why can't you just make it work properly? And a full answer is a whole book on how floating point math works.


The test problem is emblematic of a type of synthetic query that can fail but is of limited import in actual usage.

For instance, you could ask it for a JavaScript function to count any letter in any word, pass it "r" and "strawberry", and it would be far more useful.

Having edge cases doesn't mean it's not useful. It is neither a free assistant nor a coder who doesn't expect a paycheck. At this stage it's a tool that you can build on.

To engage with the analogy: a propeller is very useful, but it doesn't replace the boat or the captain.


Does not seem to work universally. Just tested a few with this prompt:

"create a javascript function to count any letter in any word. Run this function for the letter "r" and the word "strawberry" and print the count"

ChatGPT-4o => Output is 3. Passed

Claude3.5 => Output is 2. Failed. Told it the count is wrong. It apologised and then fixed the issue in the code. Output is now 3. Useless if the human does not spot the error.

llama3.1-70b(local) => Output is 2. Failed.

llama3.1-70b(Groq) => Output is 2. Failed.

Gemma2-9b-lt(local) => Output is 2. Failed.

Curiously, all the ones that failed had this code (or some near-identical version of it):

```javascript
function countLetter(letter, word) {
  // Convert both letter and word to lowercase to make the search case-insensitive
  const lowerCaseWord = word.toLowerCase();
  const lowerCaseLetter = letter.toLowerCase();

  // Use the split() method with the letter as the separator to get an array of substrings separated by the letter
  const substrings = lowerCaseWord.split(lowerCaseLetter);

  // The count of the letter is the number of splits minus one (because there are n-1 spaces between n items)
  return substrings.length - 1;
}

// Test the function with "r" and "strawberry"
console.log(countLetter("r", "strawberry")); // Output: 2
```
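For what it's worth, pasting a condensed version of that function into a real JS engine gives 3, not 2; the split-based logic is sound, and only the model's predicted console output is wrong:

```javascript
// Condensed version of the quoted split-based counter.
// "strawberry".split("r") -> ["st", "awbe", "", "y"], length 4, so 4 - 1 = 3.
function countLetter(letter, word) {
  const substrings = word.toLowerCase().split(letter.toLowerCase());
  return substrings.length - 1;
}

console.log(countLetter("r", "strawberry")); // 3
```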


It's not the job of the LLM to run the code. If you ask it to run the code, it will just do its best approximation of giving you a result similar to what the code seems to be doing. It's not actually running it.

Just like DALL-E is not layering coats of paint to make a watercolor; it just makes something that looks like one.

Your LLM (or you) should run the code in a code interpreter. Which ChatGPT did, because it has access to tools. Your local ones don't.


Your function returns 3, and I don't see how it can return 2.


I did not run the code myself. The code block and console log I pasted are a verbatim copy from Claude 3.5.


Claude isn't actually running console.log(); it produced correct code.

This prompt "please write a javascript function that takes a string and a letter and iterates over the characters in a string and counts the occurrences of the letter"

Produced a correct function from both ChatGPT-4o and Claude 3.5 for me.
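An iteration-based function along the lines that prompt asks for might look like this (a sketch of the approach, not the models' verbatim output):

```javascript
// Count occurrences of a letter by iterating over the characters,
// as the prompt above requests.
function countOccurrences(text, letter) {
  let count = 0;
  for (const ch of text) {
    if (ch === letter) count++;
  }
  return count;
}

console.log(countOccurrences("strawberry", "r")); // 3
```

Unlike the split() trick, the counting step here is explicit, which makes the logic easier for a human to verify at a glance.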


It is, however, a highly relevant thing to be aware of when evaluating an LLM for 'intelligence', which was the context this was brought up in.

Without looking at the word 'strawberry', or spelling it out one letter at a time, can you rattle off how many Rs are in it off the top of your head? No? That is what we are asking the LLM to do.



