All 3 models you ranked fail to get "how many r's are in strawberry?" correct. They all claim 2 r's unless you press them. With all the training data out there, I'm surprised none of them have fixed this yet.
Tokenization makes it hard for the model to count letters; that's also why, if you ask it to do maths, writing the numbers out in words will yield better results.
For strawberry, it sees [496, 675, 15717], which is str aw berry.
If you insert characters to break the tokens apart, it finds the correct result:
how many r's are in "s"t"r"a"w"b"e"r"r"y" ?
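A minimal way to check the boundaries yourself, here with OpenAI's tiktoken library as an example (token IDs vary between tokenizers, so the numbers won't necessarily match the ones above, but the splits look similar):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("strawberry")
print(tokens)                             # a handful of token IDs, not ten letters
print([enc.decode([t]) for t in tokens])  # the multi-letter chunks the model actually "sees"

# Quoting each letter forces roughly one token per character:
spelled = '"s"t"r"a"w"b"e"r"r"y"'
print([enc.decode([t]) for t in enc.encode(spelled)])
```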
>If you insert characters to break the tokens apart, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?
The issue is that humans don't talk like this. I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.
Humans also constantly make mistakes that are due to proximity in their internal representation. "Could of"/"Should of" comes to mind: the letters "of" have a large edit distance from "'ve", but their pronunciation is very similar.
Native speakers are especially prone to this mistake, since they grew up learning English as illiterate children, from sounds alone, whereas most people who learn English as a second language learn it together with its written form.
Psychologists use this kind of trick as well to probe internal representations, for example with the Rorschach test.
And probably, if you asked random people in the street how many p's there are in "Philippines", you'd also get lots of wrong answers. It's tricky due to the double p and the initial p being part of an f sound. The demonym uses "F" as its first letter, and in many languages, Spanish for example, the country name starts with an F as well.
Yeah, but for most people, it would be because they don't know how to spell "Philippines" at all. Confoundingly, LLMs know exactly how to spell strawberry and still get this wrong.
> I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.
No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?
I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.
A useless benchmark is useless.
In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.
Count the number of occurrences of the letter e in the word "enterprise".
Problems can exist as instances of a class of problems. If you can't solve a problem, it's useful to know if it's a one off, or if it belongs to a larger class of problems, and which class it belongs to. In this case, the strawberry problem belongs to the much larger class of tokenization problems - if you think you've solved the tokenization problem class, you can test a model on the strawberry problem, with a few other examples from the class at large, and be confident that you've solved the class generally.
It's not about embodied human constraints or how humans do things; it's about what AI can and can't do. Right now, because of tokenization, things like understanding the number of Es in strawberry are outside the implicit model of the word in the LLM, with downstream effects on tasks it can complete. This affects moderation, parsing, generating prose, and all sorts of unexpected tasks. Having a workaround like forcing the model to insert spaces and operate on explicitly delimited text is useful when affected tasks appear.
Humans also would probably be very likely to guess 2 r's if they had never seen any written words or had the word spelled out to them as individual letters before, which is kind of close to how language models treat it, despite being a textual interface.
> Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
We are also not exactly looking letter by letter at everything we read.
Just because we normally aren't reading letter by letter, it doesn't mean we can't. We can recognize common words on sight, ignoring minor variations, because we've seen the words thousands or millions of times, but that doesn't somehow disable the much less frequently used ability to approach a brand new word.
I think that humans indeed identify words as a whole and do not read letter by letter.
However, this implies you need to know the word to begin with.
I can write "asdf" and you might be oblivious to what I mean. I can mention "adsf" to a JavaScript developer and he will immediately think of the tool-versioning tool. Because context and familiarity are important.
I believe it's a bit more nuanced than that. Short ubiquitous words like "and" or "the" we instantly recognize at a glance, but long unfamiliar or rarer words we read from the beginning, one syllable or letter at a time, until pattern recognition from memory kicks in. All unconsciously, unless the word is so odd, out of place, misspelled, or unknown that it comes to conscious awareness and interrupts our reading.
It's not a human. I imagine if you have a use case where counting characters is critical, it would be trivial to programmatically transform prompts into lists of letters.
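A minimal sketch of what such a preprocessing step might look like (explode_words is a hypothetical helper, not part of any library):

```python
import re

def explode_words(prompt: str) -> str:
    # Space out the letters of each word so the tokenizer is forced to emit
    # (mostly) one token per character.
    return re.sub(r"[A-Za-z]+", lambda m: " ".join(m.group(0)), prompt)

print(explode_words("how many r's are in strawberry?"))
# h o w m a n y r's a r e i n s t r a w b e r r y ?
```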
A token is roughly four letters [1], so, among other probable regressions, this would significantly reduce the effective context window.
>I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.
You don't ask a human being how many r's there are in strawberry at all. The only reason you or anyone else asks that question is because it's an interesting quirk of how LLMs work that they struggle to answer it in that format. It's like an alien repeatedly showing humans an optical illusion that relies on the existence of our (literal) blind spot and using it as evidence of our supposed lack of intelligence.
> This is only an issue if you send commands to a LLM as you were communicating to a human.
Yes, it's an issue. We want the convenience of sending human-legible commands to LLMs and getting back human-readable responses. That's the entire value proposition lol.
Far from the entire value proposition. Chatbots are just one use of LLMs, and not the most useful one at that. But sure, the one "the public" is most aware of. As opposed to "the hackers" that are supposed to frequent this forum. LOL
Tokens are the first form of information being encoded into the model. They're statistically guided, more or less a compression dictionary comparable to a Lempel Ziv setup.
Combinations of tokens get encoded, so if the feature isn't part of the information being carried forward into the network as it models the information in the corpus, the feature isn't modeled well, or at all. The consequence of having many character tokens is that the relevance of individual characters is lost, and you have to explicitly elicit the information. Models know that words have individual characters, but "strawberry" isn't encoded as a sequence of letters, it's encoded as an individual feature of the tokenizer embedding.
Other forms of tokenizing have other tradeoffs. The trend lately is to increase tokenizer dictionary scope, up to 128k in Llama3 from 50k in gpt-3. The more tokens, the more nuanced individual embedding features in that layer can be before downstream modeling.
Tokens inherently constrain how the notion of individual letters are modeled in the context of everything an LLM learns. In a vast majority of cases, the letters don't matter, so the features don't get mapped and carried downstream of the tokenizer.
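A rough way to see those vocabulary sizes yourself, assuming you have Hugging Face transformers installed and access to the checkpoints (the Llama 3 tokenizer is gated, so substitute any tokenizer you can download):

```python
from transformers import AutoTokenizer

gpt2 = AutoTokenizer.from_pretrained("gpt2")                          # GPT-2/GPT-3-era BPE
llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # gated checkpoint

print(gpt2.vocab_size, llama3.vocab_size)   # roughly 50k vs 128k entries

# Either way, "strawberry" comes out as one or a few opaque pieces, not ten letters:
print(gpt2.tokenize("strawberry"))
```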
What you're saying sounds plausible, but I don't see how we can conclude that definitively without at least some empirical tests, say a set of words that predictably give an error along token boundaries.
The thing is, there are many ways a model can get around to answering the same question; it doesn't just depend on the architecture but also on how the training data is structured.
For example, if it turned out tokenization was the cause of this glitch, conceivably it could be fixed by adding enough documents with data relating to letter counts, providing another path to get the right output.
There aren't a lot of places that teach the AI which letters are in each token.
It's a made-up concept, and the AI doesn't have enough information about it in the dataset, so it has difficulty generalizing it.
There are a lot of problems like that which can be reformulated.
For example, if you ask it which is bigger between 9.11 and 9.9, it will often answer 9.11. If you look at how it's tokenized, you can see it restates an easy problem as something not straightforward even for a human.
If you restate the problem by writing the numbers out in words, it will respond correctly.
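You can see what the model is actually comparing with a quick tokenizer check (cl100k_base here as an example; other tokenizers split differently, but rarely along digit-value boundaries):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["9.11", "9.9", "nine point one one", "nine point nine"]:
    # Print the chunks the model sees instead of the raw string
    print(s, "->", [enc.decode([t]) for t in enc.encode(s)])
```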
It's like showing someone a color and asking how many letters it has. 4... 3?
blau, blue, azul, blu
The color holds the meaning and the words all map back.
In the model the individual letters hold little meaning. Words are composed of letters but simply because we need some sort of organized structure for communication that helps represents meaning and intent. Just like our color blue/blau/azul/blu.
Not faulting them for asking the question but I agree that the results do not undermine the capability of the technology. In fact it just helps highlight the constraints and need for education.
How is a layman supposed to even know that it's testing on that? All they know is it's a large language model. It's not unreasonable they should expect it to be good at things having to do with language, like how many letters are in a word.
Seems to me like a legit question for a young child to answer or even ask.
I'm saying a layman or say a child wouldn't even think this is a "test". They are just asking a language model a seemingly simple language related question from their point of view.
It doesn’t test “on tokenization” though. What happens when an answer is generated is a few abstraction levels deeper than tokens. A “thinking” “slice” of an llm is completely unaware of tokens as an immediate part of its reasoning. The question just shows lack of systemic knowledge about strawberry as a word (which isn’t surprising, tbh).
If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.
Very large language models also “know” how to spell the word associated with the strawberry token, which you can test by asking them to spell the word one letter at a time. If you ask the model to spell the word and count the R’s while it goes, it can do the task. So the failure to do it when asked directly (how many r’s are in strawberry) is pointing to a real weakness in reasoning, where one forward pass of the transformer is not sufficient to retrieve the spelling and also count the R’s.
Sure, that's a different issue. If you prompt in a way to invoke chain of thought (e.g. what humans would do internally before answering) all of the models I just tested got it right.
> If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.
LOL. I would fail your test, because "fraise" only has one R, and you're expecting me to reply "3".
The thinking part of a model doesn’t know about tokens either. Like a regular human few thousand years ago didn’t think of neural impulses or air pressure distribution when talking. It might “know” about tokens and letters like you know about neurons and sound, but not access them on the technical level, which is completely isolated from it. The fact that it’s a chat of tokens of letters, which are a form of information passing between humans, is accidental.
Illiterate humans can come up with new words like that too without being able to spell, LLMs are modeling language without precisely modeling spelling.
The tokenizer system supports virtually any input text that you want, so it follows that it also allows virtually any output text. It isn’t limited to a dictionary of the 1000 most common words or something.
There are tokens for individual letters, but the model is not trained on text written with individual tokens per letter, it is trained on text that has been converted into as few tokens as possible. Just like you would get very confused if someone started spelling out entire sentences as they spoke to you, expecting you to reconstruct the words from the individual spoken letters, these LLMs also would perform terribly if you tried to send them individual tokens per letter of input (instead of the current tokenizer scheme that they were trained on).
Even though you might write a message to an LLM, it is better to think of that as speaking to the LLM. The LLM is effectively hearing words, not reading letters.
I hear this a lot, but vast sums of money are thrown at the places where a model fails cases like strawberry.
Think about math and logic. If a single symbol is off, it’s no good.
For example, at my work a prompt where we can produce a single tokenization error generates, by my very rough estimate, 2 man-hours of work. (We search for incorrect model responses, get them to correct themselves, and if they can't after trying, we tell them the right answer and edit it for perfection.) Yes, even for counting occurrences of characters. Think about how applicable that is. Finding the next term in a sequence, analyzing strings, etc.
But we don’t restrict it to math or logical syntax. Any prompt across essentially all domains. The same model is expected to handle any kind of logical reasoning that can be brought into text. We don’t mark it incorrect if it spells an unimportant word wrong, however keep in mind the spelling of a word can be important for many questions, for example—off the top of my head: please concatenate “d”, “e”, “a”, “r” into a common English word without rearranging the order. The types of examples are endless. And any type of example it gets wrong, we want to correct it. I’m not saying most models will fail this specific example, but it’s to show the breadth of expectations.
Of course. Because these models have no intelligence.
Everyone who believes they do seems to believe intelligence derives from being able to use language, however, and not being able to tell how many times the letter r appears in the word strawberry is a very low bar to fail.
If you give LLMs the letters one a time they often count them just fine, though Claude at least seems to need to keep a running count to get it right:
"How many R letters are in the following? Keep a running count.
s
t
r
a
w
b
e
r
r
y"
They are terrible at counting letters in words because they rarely see them spelled out. An LLM trained one byte at a time would always see every character of every word and would have a much easier time of it. An LLM is essentially learning a new language without a dictionary, of course it's pretty bad at spelling. The tokenization obfuscates the spelling not entirely unlike how verbal language doesn't always illuminate spelling.
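A byte-level "tokenizer" is trivially simple by comparison; every character shows up as its own unit:

```python
print(list("strawberry".encode("utf-8")))
# [115, 116, 114, 97, 119, 98, 101, 114, 114, 121], one unit per letter
```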
Could the effect you see when you spell it out be not a result of “seeing” tokens, but a result of the fact that the model learned, at a higher level, how lists in text can be summarized, summed up, filtered and counted?
In other words, what makes you think that it’s exactly letter-tokens that help it, and not the high-level concept of spelling things out itself?
It's more that it's liable to struggle to guess how to spell tokens [10295, 947] (or whatever it is) since there's no a priori reason that it will learn to associate them with the exact right tokens for the individual letters in the right order. If it's trained on bytes though, it doesn't need to infer that. It's like asking a smart, semi-literate person a spelling question- they might have a rough sense of it but they will not be very good at it.
Once it is just counting lists then it's probably drawing on a higher level capability, yeah.
Surfacing and underscoring obvious failure cases for general "helpful chatbot" use is always going to be valuable because it highlights how the "helpful chatbot" product is not really intuitively robust.
Meanwhile, it helps make sure engineers and product designers who want to build a more targeted product around LLM technology know that it's not suited to tasks that may trigger those kinds of failures. This may be obvious to you as an engaged enthusiast or cutting edge engineer or whatever you are, but it's always going to be new information to somebody as the field grows.
I don’t know anything about LLMs beyond using ChatGPT and Copilot… but unless because of this lack of knowledge I am misinterpreting your reply - it sounds as if you are excusing the model giving a completely wrong answer to a question that anyone intelligent enough to learn alphabet can answer correctly.
The problem is that the model never gets to see individual letters. The tokenizers used by these models break up the input in pieces. Even though the smallest pieces/units are bytes in most encodings (e.g. BBPE), the tokenizer will cut up most of the input in much larger units, because the vocabulary will contain fragments of words or even whole words.
For example, suppose we tokenize "Welcome to Hacker News, I hope you like strawberries." The Llama 405B tokenizer cuts this into word-level pieces, each mapped to a single vocabulary ID.
So, all the model sees for 'Ġstrawberries' is the number 76204 (which is then used in the piece embedding lookup). The model does not even have access to the individual letters of the word.
Of course, one could argue that the model should be fed with bytes or codepoints instead, but that would make them vastly less efficient with quadratic attention. Though machine learning models have done this in the past and may do this again in the future.
Just wanted to finish off this comment by saying that a word may be presented to the model split into several pieces if the whole word is not in the vocabulary. For instance, the same sentence translated to my native language is tokenized as:
Wel kom Ġop ĠHacker ĠNews , Ġik Ġhoop Ġdat Ġje Ġvan Ġa ard be ien Ġh oud t .
And the word for strawberries (aardbeien) is split, though still not into letters.
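For anyone who wants to reproduce those pieces, a sketch with Hugging Face transformers (the 405B checkpoint is gated; any Llama 3.1 tokenizer should show the same Ġ-prefixed word pieces):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B")
print(tok.tokenize("Welcome to Hacker News, I hope you like strawberries."))
print(tok.tokenize("Welkom op Hacker News, ik hoop dat je van aardbeien houdt."))
```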
The thing is, how the tokenizing works is about as relevant to the person asking the question as the name of the cat of the delivery guy who delivered the GPU that the LLM runs on.
Indeed, I appreciate the explanation, it is certainly both interesting and informative to me, but to somewhat echo the person you are replying to - if I wanted a boat, and you offer me a boat, and it doesn’t float - the reasons for failure are perhaps full of interesting details, but perhaps the most important thing to focus on first - is to make the boat float, or stop offering it to people who are in need of a boat.
To paraphrase how this thread started - it was someone testing different boats to see whether they can simply float - and they couldn’t. And the reply was questioning the validity of testing boats whether they can simply float.
At least this is how it sounds to me when I am told that our AI overlords can’t figure out how many Rs are in the word “strawberry”.
At some point you need to just accept the details and limitations of things. We do this all the time. Why is your calculator giving only approximate result? Why can't your car go backwards as fast as forwards? Etc. It sucks that everyone gets exposed to the relatively low level implementation with LLM (almost the raw model), but that's the reality today.
People do get similarly hung up on surprising floating point results: why can't you just make it work properly? And a full answer is a whole book on how floating point math works.
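The canonical illustration of that floating point surprise, in Python:

```python
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```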
The test problem is emblematic of a type of synthetic query that can fail but is of limited import in actual usage.
For instance you could ask it for a JavaScript function to count any letter in any word and pass it r and strawberry and it would be far more useful.
Having edge cases doesn't mean it's not useful. It is neither a free assistant nor a coder who doesn't expect a paycheck. At this stage it's a tool that you can build on.
To engage with the analogy. A propeller is very useful but it doesn't replace the boat or the Captain.
Does not seem to work universally. Just tested a few with this prompt:
"create a javascript function to count any letter in any word. Run this function for the letter "r" and the word "strawberry" and print the count"
ChatGPT-4o => Output is 3. Passed
Claude3.5 => Output is 2. Failed. Told it the count is wrong. It apologised and then fixed the issue in the code. Output is now 3. Useless if the human does not spot the error.
llama3.1-70b(local) => Output is 2. Failed.
llama3.1-70b(Groq) => Output is 2. Failed.
Gemma2-9b-lt(local) => Output is 2. Failed.
Curiously, all the ones that failed had this code (or some near-identical version of it):
```javascript
function countLetter(letter, word) {
  // Convert both letter and word to lowercase to make the search case-insensitive
  const lowerCaseWord = word.toLowerCase();
  const lowerCaseLetter = letter.toLowerCase();
  // Use the split() method with the letter as the separator to get an array of substrings separated by the letter
  const substrings = lowerCaseWord.split(lowerCaseLetter);
  // The count of the letter is the number of splits minus one (because there are n-1 separators between n items)
  return substrings.length - 1;
}
```
It's not the job of the LLM to run the code... if you ask it to run the code, it will just do its best approximation at giving you a result similar to what the code seems to be doing. It's not actually running it.
Just like Dall-E is not layering coats of paint to make a watercolor... it just makes something that looks like one.
Your LLM (or you) should run the code in a code interpreter. Which ChatGPT did, because it has access to tools. Your local ones don't.
Claude isn't actually running console.log(); it produced correct code.
This prompt "please write a javascript function that takes a string and a letter and iterates over the characters in a string and counts the occurrences of the letter"
Produced a correct function given both chatGPT4o and claude3.5 for me.
It is however a highly relevant thing to be aware of when evaluating a LLM for 'intelligence', which was the context this was brought up in.
Without looking at the word 'strawberry', or spelling it one letter at a time, can you rattle off how many letters are in the word off the top of your head? No? That is what we are asking the LLM to do.
How can I know whether any particular question will test a model on its tokenization? If a model makes a boneheaded error, how can I know whether it was due to lack of intelligence or due to tokenization? I think finding places where models are surprisingly dumb is often more informative than finding particular instances where they seem clever.
It's also funny, since this strawberry question is one where a model that's seriously good at predicting the next character/token/whatever quanta of information would get it right. It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.
> How can I know whether any particular question will test a model on its tokenization?
Does something deal with separate symbols rather than just meaning of words? Then yes.
This affects spelling, math (value calculation), logic puzzles based on symbols. (You'll have more success with a puzzle about "A B A" rather than "ABA")
> It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.
This thread contains contradictions. Every other announcement of an llm contains a comment with a contradicting text when people post the wrong responses.
I suppose what models should have are some instructions of things they aren’t good at and will need to break out into python code or what have you. Humans have an intuition for this - I have basic knowledge of when I need to write something down or use a calculator. LLMs don’t have intuition (yet - though I suppose one could use a smaller model for that), so explicit instructions would work for now.
It's not very interesting when they fail at it, but it will be interesting if they get good at it.
Also there are some cases where regular people will stumble into it being awful at this without any understanding why (like asking it to help them with their wordle game.)
Call me when models understand when to convert the token into actual letters and count them. Can’t claim they’re more than word calculators before that.
When you read and comprehend text, you don't read it letter by letter, unless you have a severe reading disability. Your ability to comprehend text works more like an LLM.
Essentially, you can compare the human brain to a multi-model or modular system. There are layers or modules involved in most complex tasks. When reading, you recognize multiple letters at a time, and those letters are essentially assembled into tokens that a different part of your brain can deal with.
Breaking down words into letters is essentially a separate "algorithm". Just like your brain, it's likely to never make sense for a text comprehension and generation model to operate at the level of letters - it's inefficient.
A multi-modal model with a dedicated model for handling individual letters could easily convert tokens into letters and operate on them when needed. It's just not a high priority for most use cases currently.
I agree completely, that wasn’t the point though: the point was that my 6 yo knows when to spell the word when asked and the blob of quantized floats doesn’t, or at least not reliably.
So the blob wasn’t trained to do that (yeah, low utility, I get that), but it also doesn’t know that it doesn’t know, which is another, much bigger and still unsolved problem.
I would argue that most sota models do know that they don't know this, as evidenced by the fact that when you give them a code interpreter as a tool they choose to use it to write a script that counts the number of letters rather than try to come up with an answer on their own.
The model communicates in a language, but our letters are not necessary for that and are in fact not part of the English language. You could write English using per-word pictographs and it would still be the same English and carry the same information/message. It's like asking you whether there is a '5' in 256 when you read binary.
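To make that analogy concrete: the decimal numeral 256 contains a '5', but the binary representation the hypothetical reader sees does not:

```python
print(bin(256))  # 0b100000000: no '5' anywhere in what the binary reader sees
```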
Is anyone in the know, aside from mainstream media (god forgive me for using this term unironically) and civilians on social media, claiming LLMs are anything but word calculators?
I think that's a perfect description by the way, I'm going to steal it.
I think it's a very poor intuition pump. These 'word calculators' have lots of capabilities not suggested by that term, such as a theory of mind and an understanding of social norms. If they are a "merely" a "word calculator", then a "word calculator" is a very odd and counterintuitively powerful algorithm that captures big chunks of genuine cognition.
There is no philosophical difference. It's like asking if Usain Bolt is really a fast runner, or if he just seems like it because he has long legs and powerful muscles.
I think that's a poor a comparison, but I understand your point. I just disagree about there being no philosophical difference. I'd argue the difference is philosophical, rather than factual.
You also indirectly answered my initial question -- so thanks!
I'm not sure I'm educated (or rested) enough to answer that in a coherent manner, certainly not in a comment thread typing on mobile. So I won't waste your time babbling.
I don't disagree they produce astonishing responses but the nuance of why it's producing that output matters to me.
For example, with regard to social mores, I think a good way to summarize my hang up is that my understanding is LLMs just pattern match their way to approximations.
That to me is different from actually possessing an understanding, even though the outcome may be the same.
I can't help but draw comparisons to my autistic masking.
They’re trained on the available corpus of human knowledge and writings. I would think that the word calculators have failed if they were unable to predict the next word or sentiment given the trillions of pieces of data they’ve been fed. Their training environment is literally people talking to each other and social norms. Doesn’t make them anything more than p-zombies though.
As an aside, I wish we would call all of this stuff pseudo intelligence rather than artificial intelligence
I side with Dennett (and Turing for that matter) that a "p-zombie" is a logically incoherent thing. Demonstrating understanding is the same as having understanding because there is no test that can distinguish the two.
Are LLMs human? No. Can they do everything humans do? No. But they can do a large enough subset of things that until now nothing but a human could do that we have no choice but to call it "thinking". As Hofstadter says - if a system is isomorphic to another one, then its symbols have "meaning", and this is indeed the definition of "meaning".
> If you believe this, you don't understand how LLMs work.
Nor do they understand how intelligence works.
Humans don't read text a letter at a time. We're capable of deconstructing words into individual letters, but based on the evidence that's essentially a separate "algorithm".
Multi-model systems could certainly be designed to do that, but just like the human brain, it's unlikely to ever make sense for a text comprehension and generation model to work at the level of individual letters.
I would counterargue with "that's the model's problem, not mine".
Here's a thought experiment: if I gave you 5 boxes and told you "how many balls are there in all of these boxes?" and you answered "I don't know because they are inside boxes", that's a fail. A truly intelligent individual would open them and look inside.
A truly intelligent model would (say) retokenize the word into its individual letters (which I'm optimistic they can) and then would count those. The fact that models cannot do this is proof that they lack some basic building blocks for intelligence. Model designers don't get to argue "we are human-like except in the tasks where we are not".
Of course they lack building blocks for full intelligence. They are good at certain tasks, and counting letters is emphatically not one of them. They should be tested and compared on the kind of tasks they're fit for, and so the kind of tasks they will be used in solving, not tasks for which they would be misemployed to begin with.
I agree with you, but that's not what the post claims. From the article:
"A significant effort was also devoted to enhancing the model’s reasoning capabilities. (...) the new Mistral Large 2 is trained to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer."
Words like "reasoning capabilities" and "acknowledge when it does not have enough information" have meanings. If Mistral doesn't add footnotes to those assertions then, IMO, they don't get to backtrack when simple examples show the opposite.
It's not like an LLM is released with a hit list of "these are the tasks I really suck at." Right now users have to figure it out on the fly or have a deep understanding of how tokenizers work.
That doesn't even take into account what OpenAI has typically done to intercept queries and cover the shortcomings of LLMs. It would be useful if each model did indeed come out with a chart covering what it cannot do and what it has been tailored to do above and beyond the average LLM.
Sure, if you want to go with wildly theoretical approaches, we can't even be sure if the rock on the ground doesn't have some form of intelligence.
Meanwhile, for practical purposes, there's little arrogance needed to say that some things are preconditions for any form of intelligence that's even remotely recognizable.
1) Learning needs to happen continuously. That's a no-go for now, maybe solvable.
2) Learning needs to require much less data. Very dubious without major breakthroughs, likely on the architectural level. (At which point it's not really an LLM any more, not in the current sense)
3) They need to adapt to novel situations, which requires 1&2 as preconditions.
4) There's a good chance intelligence requires embodiment. It's not proven, but it's likely. For one, without observing outcomes, they have little capability to self-improve their reasoning.
5) They lack long-term planning capacity. Again, reliant on memory, but also executive planning.
There's a whole bunch more. Yes, LLMs are absolutely amazing achievements. They are useful, they imply a lot of interesting things about the nature of language, but they aren't intelligent. And without modifying them to the extent that they aren't recognizably what we currently call LLMs, there won't be intelligence. Sure, we can have the ship of Theseus debate, but for practical purposes, nope, LLMs aren't intelligent.
4) 'Embodiment' is another term we don't really know how to define. At what point does an entity have a 'body' of the sort that supports 'intelligence'? If you want to stick with vague definitions, 'awareness' seems sufficient. Otherwise you will end up arguing about paralyzed people, Helen Keller, that rock opera by the Who about the pinball player, and so on.
5) OK, so the technology that dragged Lee Sedol up and down the goban lacks long-term planning capacity. Got it.
None of these criteria are up to the task of supporting or refuting something as vague as 'intelligence.' I almost think there has to be an element of competition involved. If you said that the development of true intelligence requires a self-directed purpose aimed at outcompeting other entities for resources, that would probably be harder to dismiss. Could also argue that an element of cooperation is needed, again serving the ultimate purpose of improving competitive fitness.
LLMs are not a tool to model intelligence. It's not a function of the dataset; they are, as is, not sufficient. One of the largest shortcomings is the lack of continuous learning, memory, and (likely) forgetting.
Those who develop AI and know anything don't actually describe current technology as human-like intelligence; rather, it is capable of many tasks which previously required human intelligence.
When using a prompt that involves thinking first, all three get it correct.
"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."
This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them and does A or B based on condition C, and then return Y by applying Z transformation"
EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.
Imo it's impressive that any of this even remotely works. Especially when you consider all the hacks like tokenization that i'd assume add layers of obfuscation.
There's definitely tons of weaknesses with LLMs for sure, but i continue to be impressed at what they do right - not upset at what they do wrong.
The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.
*) that is, except sometimes by making adjustments to the system prompt
I think this particular example, of counting letters, is obviously going to be hard when you know how tokenization works. It's totally possible to develop an intuition for other times things will work or won't work, but like all ML powered tools, you can't hope for 100% accuracy. The best you can do is have good metrics and track performance on test sets.
I actually think the craziest part of LLMs is that how, as a developer or SME, just how much you can fix with plain english prompting once you have that intuition. Of course some things aren't fixable that way, but the mere fact that many cases are fixable simply by explaining the task to the model better in plain english is a wildly different paradigm! Jury is still out but I think it's worth being excited about, I think that's very powerful since there are a lot more people with good language skills than there are python programmers or ML experts.
The problem is that the models hallucinate too confidently. In this case it is quite amusing (I had llama3.1:8b tell me confidently it is 1, then revise to 2, then apologize again and give the correct answer). However, while it is obvious here, having it confidently make up supposed software features from thin air when asking for "how do I ..." is more problematic. The answers sound plausible, so you actually waste time verifying whether they work or are nonsense.
> In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"
To me it's just a limitation based on the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know what the spelling of some words is. But they've never actually seen one in as their world is made up entirely of tokens. The word 'red' isn't r-e-d but is instead like a pictogram to them. But they know the spelling of strawberry and can identify an 'r' when it's on its own and count those despite not being able to see the r's in the word itself.
The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.
GPT4o already does that, for problems involving math it will write small Python programs to handle the calculations instead of doing it with the LLM itself.
It “works”, but the LLM having to use the calculator means the LLM doesn’t understand arithmetic well enough and doesn’t know how to follow a set of steps (an algorithm) natively to find the answer for big numbers.
I believe this could be fixed and is worth fixing.
Because it’s the only way LLMs will be able to help math and physics researchers write proofs and make real scientific progress.
It generates the code to run for the answer. Surely that means it actually knows to build the appropriate algorithm - it just struggles to perform the actual calculation.
Appending "Think step-by-step" is enough to fix it for both Sonnet and LLama 3.1 70B.
For example, the latter model answered with:
To count the number of Rs in the word "strawberry", I'll break it down step by step:
Start with the individual letters: S-T-R-A-W-B-E-R-R-Y
Identify the letters that are "R": R (first one), R (second one), and R (third one)
Count the total number of Rs: 1 + 1 + 1 = 3
Because it is a challenging task, you would need to define a prompt (or a set of prompts) that can precisely generate chain-of-thought prompts for the various generic problems the model encounters.
And sometimes CoT may not be the best approach. Depending on the problem other prompt engineering techniques will perform better.
Can’t you just instruct your llm of choice to transform your prompts like this for you? Basically feed it with a bunch of heuristics that will help it better understand the thing you tell it.
Maybe the various chat interfaces already do this behind the scenes?
Lots of replies mention tokens as the root cause and I’m not well versed in this stuff at the low level but to me the answer is simple:
When this question is asked (from what the models trained on) the question is NOT “count the number of times r appears in the word strawberry” but instead (effectively) “I’ve written ‘strawbe’, now how many r’s are in strawberry again? Is it 1 or 2?”.
I think most humans would probably answer “there are 2” if we saw someone was writing and they asked that question, even without seeing what they have written down. Especially if someone said “does strawberry have 1 or 2 r’s in it?”. You could be a jerk and say “it actually has 3” or answer the question they are actually asking.
It’s an answer that is _technically_ incorrect but the answer people want in reality.
LLMs think in tokens, not letters. It's like asking someone who is dyslexic about spelling. Not their strong suit. In practice, it doesn't matter much, does it?
Language models are best treated like consciousness. Our consciousness does a lot less than people like to attribute to it. It is mostly a function of introspection and making connections, rather than being the seat of higher-level reasoning or of the functions of the brain that tell your body how to stay alive (like beating your heart).
By allowing a language model to do function calling, you are essentially allowing it to do specialized "subconscious" thought. The language model becomes a natural language interface to the capabilities of its "subconsciousness".
A specific human analogy could be: I tell you to pick up a pen off of the table, and then you do it. Most of your mental activity would be subconscious, orienting your arm and hand properly to pick up the pen, actually grabbing the pen, and picking it up. The linguistic representation of the action would exist in your concious mind (pick up the pen), but not much else.
A language model could very easily call out to a text processing function to correctly do things like count the number of r's in the word strawberry. That is a job that your concious mind can dispatch to your subconciousness.
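A rough Python sketch of the kind of "subconscious" text-processing helper described above; count_letter is a hypothetical function name, and the vendor-specific tool-calling plumbing is omitted:

```python
def count_letter(word: str, letter: str) -> int:
    # Deterministic text processing that the language model itself handles poorly.
    return word.lower().count(letter.lower())

# The model would emit a call like count_letter("strawberry", "r") and read back the result.
print(count_letter("strawberry", "r"))  # 3
```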
This doesn't make any sense. The LLM knows the token-to-letter associations if you ask it. Once it has split the token into letters, it can answer the question anyway.
This tells us that the model is unaware of its own limitations, otherwise it would have figured out that this task needs it to split the letters first.
Well splitting into letters doesn't help with Mistral Large 2. I tried this with Mistral 2:
[Question: How many r's in strawberry? First split the word into letters, one letter on each line then answer the question.]
[Answer: Let's split the word "strawberry" into letters:
s
t
r
a
w
b
e
r
r
y
Now, let's count the number of 'r's:
There are 2 'r's in the word "strawberry".]
[Question: Did you count the number of r's?]
[answer: Yes, I did. Here's the count:
r
r
So, there are 2 'r's in the word "strawberry".]
I suspect the issue is these models have to be trained to accomplish tasks, and this is not a task it makes sense to train a model to perform. You might train it to do this with function calling one day, calling a python library to count the r's in a string, but actually manually counting r's just doesn't play to an LLM's strengths.
Does it? It knows how a word in some context, represented by tokens, splits into letters. It may or may not know the table. I wouldn’t trust what it says about word/token correspondence any more than I trust it in general.
I really wanted to believe this could be simply fixed by more prompt engineering and chain of thought, so I tried it with Claude 3.5 and … well, unfortunately, still no luck. The reply is amusing; I’ll include it verbatim below.
Of note, this happened because earlier I had asked it without CoT and it incorrectly stated two “r”s. Once I started a new conversation (so fresh context window), it gave the right answer (3) and no longer claimed the “still the second r” as shown below.
Let's think through this step-by-step:
1. First, let's spell out "strawberry" letter by letter:
s-t-r-a-w-b-e-r-r-y
2. Now, let's count the "r"s as we go through each letter:
s (0)
t (0)
r (1) - First "r" found
a (1)
w (1)
b (1)
e (1)
r (2) - Second "r" found
r (2) - This is still the second "r"
y (2)
3. After going through all the letters, we've counted 2 "r"s.
Me: How many Rs are in the word strawberry? Think step by step
Claude: Let's approach this step-by-step:
1. First, let's write out the word: strawberry
2. Now, let's go through each letter one by one:
s - not an R
t - not an R
r - this is an R (count: 1)
a - not an R
w - not an R
b - not an R
e - not an R
r - this is another R (count: 2)
r - this is a third R (count: 3)
y - not an R
3. After going through all the letters, we've counted 3 Rs.
Therefore, there are 3 Rs in the word "strawberry".
Yeah, I figured out that it didn’t work at first because I still had the previous attempt (where it got it wrong) in my conversation history. Starting with a fresh conversation gave me the correct answer. It was still funny to see it “rationalize” the wrong answer tho.
For these classes of problems that LLMs struggle with, a more reliable way to go about them seems to be to ask them to solve them using tools, e.g. writing and executing a Python script to count the "R"s.
Interestingly enough much simpler models can write an accurate function to give you the answer.
I think it will be a while before we get there. An LLM can look up knowledge but can't actually perform calculations itself without some external processor.
Why do we have to "get there?" Humans use calculators all the time, so why not have every LLM hooked up to a calculator or code interpreter as a tool to use in these exact situations?
How much do threads like this provide the training data to convince future generations that—despite all appearances to the contrary—strawberry is in fact spelled with only two R's?
I just searched for "how many r's are in strawberry?" in a search engine, and based solely on the results it found, I would have to conclude there is substantial disagreement on whether the correct answer is two or three.
Does anyone have input on the feasibility of running an LLM locally and providing an interface to some language runtime and storage space, possibly via a virtual machine or container?
No idea if there's any sense to this, but an LLM could be instructed to formulate and continually test mathematical assumptions by writing / running code and fine-tuning accordingly.
Yes, we are doing this at Riza[0] (via WASM). I'd love to have folks try our downloadable CLI which wraps isolated Python/JS runtimes (also Ruby/PHP but LLMs don't seem to write those very well). Shoot me an email[1] or say hi in Discord[1].
I'm not surprised, because it's an issue with the fundamental design of the "pick words that tend to flow after the other words" machine.
Training data will only "fix" it in the shallow sense that it will have seen a comment like yours before. (As opposed to the deeper sense of "learning to count.")
Your question sounds like you want to know how the word is spelled, and no one would put two r's in "straw", so the model could be assuming that you're asking whether it's strawbery or strawberry.
What happens if you ask the total number of occurrences of the letter r in the word? Does it still not get it right?
Plug in a code interpreter as a tool and the model will write Python or JavaScript to solve this and get it right 100% of the time. (Full disclosure: I work on a product called Riza that you can use as a code interpreter tool for LLMs)
I kind of tried to replicate your experiment (in German where "Erdbeere" has 4 E) that went the same way. The interesting thing was that after I pointed out the error I couldn't get it to doubt the result again. It stuck to the correct answer that seemed kind of "reinforced".
It was also interesting to observe how GPT (4o) even tried to prove/illustrate the result typographically by placing the same word four times and putting the respective letter in bold font (without being prompted to do that).
I haven't played with this model, but rarely do I find working w/ Claude or GPT-4 for that to be the case. If you say it's incorrect, it will give you another answer instead of insisting on correctness.
If you find a case where forceful pushback is sticky, it's either because the primary answer is overwhelmingly present in the training set compared to the next best option or because there are conversations in the training that followed similar stickiness, esp. if the structure of the pushback itself is similar to what is found in those conversations.
I'll put it another way - behavior like this is extremely rare in my experience. I'm just trying to explain if one encounters it why it's likely happening.
I wrote and published a paper at COLING 2022 on why LLMs in general won't solve this without either 1. radically increasing vocab size, 2. rethinking how tokenizers are done, or 3. forcing it with constraints.
I mean, yes? Letter-based gymnastics are difficult in a token-based environment, and they're not a particularly valuable use case (outside of online dunking), so it's not really a surprise nobody cares much about those in terms of model training.