Most human math PhDs have all kinds of shortcomings. The idea that finding some "gotchas" shows the hype is miles off the mark is absurd.
> Most human math PhDs have all kinds of shortcomings.
I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.
Humans asked to count how many vowels are in "Pneumonoultramicroscopicsilicovolcanoconiosis" (a real word), without seeing the word written down, working purely from language, would struggle. Working memory has limits. We're not that different; we fail too.
I'll bet said PhDs can't answer the equivalent question in a language they don't understand. LLMs don't speak character-level English. LLMs are, in some stretched meaning of the word, illiterate.
If LLMs used character-level tokenization it would work just fine. But we don't do that, and we accept the trade-off. It's only folks who have absolutely no idea how LLMs work who find the strawberry thing meaningful.
I’ll bet said PhDs will tell you they don’t know instead of confidently stating the wrong answer in this case. Getting LLMs to express an appropriate level of confidence in their output remains a major problem.
> It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.
I think it is meaningful in that it highlights how we need to approach things a bit differently. For example, instead of asking "How many r's in strawberry?", we say "How many r's in strawberry? Show each character in an ordered list before counting. When counting, list the position in the ordered list." If we do this, every model that I asked got it right.
There are quirks we need to understand better, and I would say the strawberry question is one of them.
Edit: I should add that getting LLMs to count things directly might not be the best way to go about it. Having them generate code to count things would probably make more sense.
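For what it's worth, the code-generation route really is trivial. Here's a minimal sketch in plain Python (the word and letter are just arbitrary examples) of the kind of program a model can write and run instead of counting across tokens "in its head":

    # The sort of trivial counting program an LLM can emit and execute via a
    # code tool, rather than trying to count characters across tokens directly.
    def count_letter(word: str, letter: str) -> int:
        """Count case-insensitive occurrences of a single letter in a word."""
        return word.lower().count(letter.lower())

    print(count_letter("strawberry", "r"))  # -> 3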
I was impressed with Claude Sonnet the other day - gave it a photo of my credit card bill (3 photos actually - long bill) and asked it to break it down by recurring categories, counting anything non-recurring as "other". It realized without being asked that a program was needed, and wrote/ran it to give me what I asked for.
In general it's "tool use" where the model's system prompt tells it to use certain tools for certain tasks, and having been trained to follow instructions, it does so!
> It's not that hard of a problem to solve at the application level.
I think it will be easy if you are focused on one or two models from the same family, but I think the complexity comes when you try to get a lot of models to act in the same way.
I don't think that (sub-word) tokenization is the main difficulty. Not sure which models still fail the "strawberry" test, but I'd bet they can at least spell strawberry if you ask, indicating that breaking the word into letters is not the problem.
The real issue is that you're asking a prediction engine (with no working memory or internal iteration) to solve an algorithmic task. Of course you can prompt it to "think step by step" to get around these limitations, and if necessary suggest an approach (or ask it to think of one?) to help it keep track of its letter-by-letter progress through the task.
No ... try claude.ai or meta.ai (both behave the same) by asking them how many r's are in the (made-up) word ferrybridge. They'll both get it wrong and say 2.
Now ask them to spell ferrybridge. They both get it right.
gemini.google.com still fails on "strawberry" (the other two seem to have trained on that, which is why I used a made-up word instead), but can correctly break it into a letter sequence if asked.
Yep, if by chance you hit a model that has seen the training data that happens to shove those tokens together in a way that it can guess, lucky you.
The point is, it would be trivial for an LLM to get it right all the time with character-level tokenization. The reason LLMs using the current best-tradeoff tokenization find this activity difficult is that the tokens that make up 'tree' don't include the token for 'e'.
No - you can give the LLM a list of letters and it STILL won't be able to count them reliably, so you are guessing wrong about where the difficulty lies.
Try asking Claude: how many 'r's are in this list (just give me a number as your response, nothing else) : s t r a w b e r r y
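If you want to reproduce that outside the chat UI, here's a rough sketch using the Anthropic Python SDK (the model name below is just a placeholder alias; substitute whichever Claude model you have access to):

    # Minimal sketch: ask Claude the already-spelled-out counting question via the API.
    # Assumes ANTHROPIC_API_KEY is set in the environment; model name is illustrative.
    import anthropic

    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": "how many 'r's are in this list (just give me a number as "
                       "your response, nothing else): s t r a w b e r r y",
        }],
    )
    print(message.content[0].text)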
How many examples like that do you think it's seen? You can't give an example of something that is in effect a trick to get character-level tokenization and then expect it to do well when it's seen practically zero such data in its training set.
Nobody who suggests methods like character- or byte-level 'tokenization' suggests that a model trained on current tokenization schemes should be able to do what you are suggesting. They are suggesting actually training it on characters or bytes.
1) You must have tested and realized that these models can spell just fine - break a word into a letter sequence, regardless of how you believe they are doing it
2) As shown above, even when presented with a word already broken into a sequence of letters, the model STILL fails to always correctly count the number of a given letter. You can argue about WHY they fail (different discussion), but regardless they do (if only allowed to output a number).
Now, "how many r's in strawberry", unless memorized, is accomplished by breaking it into a sequence of letters (which it can do fine), then counting the letters in the sequence (which it fails at).
So, you're still sticking to your belief that creating the letter sequence (which it can do fine) is the problem ?!!
You say that very confidently - but why shouldn't an LLM have learned a character-level understanding of tokens?
LLMs would perform very badly on tasks like checking documents for spelling errors, processing OCRed documents, pluralising, changing tenses and handling typos in messages from users if they didn't have a character-level understanding.
It's only folks who have absolutely no idea how LLMs work that would think this task presents any difficulty whatsoever for a PhD-level superintelligence :)
It can literally spell out words, one letter per line.
Seems pretty clear to me the training data contained sufficient information for the LLM to figure out which tokens correspond to which letters.
And it's no surprise the training data would contain such content - it'd be pretty easy to synthetically generate misspellings, and being able to deal with typos and OCR mistakes gracefully would be useful in many applications.
Two answers:
1 - ChatGPT isn't an LLM; it's an application using one or many LLMs and other tools (likely routing that to a split function).
2 - even for a single model 'call':
It can be explained with the following training samples:
"tree is spelled t r e e" and
"tree has 2 e's in it"
The problem is, the LLM has seen something like:
8062, 382, 136824, 260, 428, 319, 319
and
19816, 853, 220, 17, 319, 885, 306, 480
For a lot of words, it will have seen data that results in it saying something sensible. But it's fragile. If LLMs used character-level tokenization, you'd see the first example repeat the token for 'e' in 'tree' rather than 'tree' having its own token.
There are all manner of trade-offs made in a tokenization scheme. One example is that OpenAI made a change to whitespace tokenization so that it would produce better Python code.
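If you want to see the contrast for yourself, here's a quick sketch using OpenAI's tiktoken library (my choice of library and vocabulary, not necessarily what the models above use; the exact IDs will differ from the illustrative ones quoted above):

    # Sketch: compare BPE tokenization with a character-level view of the same text.
    # Requires `pip install tiktoken`; token IDs depend entirely on the chosen vocabulary.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's published BPE vocabularies

    for text in ["tree", "strawberry", "ferrybridge", "t r e e"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{text!r}: {len(ids)} tokens -> {pieces}")

    # Character-level "tokenization" of the same word, for contrast:
    print(list("strawberry"))  # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']

The spelled-out form "t r e e" typically ends up as roughly one token per letter, which is why the two training samples above look so different to the model even though they describe the same word.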
LLMs are taught to predict. Once they've seen enough training samples of words being spelled, they'll have learnt that in a spelling context the tokens comprising the word predict the tokens comprising the spelling.
Once they've learnt the letters predicted by each token, they'll be able to do this for any word (i.e. token sequence).
Of course, you could just try it for yourself - ask an LLM to break a non-dictionary nonsense word like "asdpotyg" into a letter sequence.
It does away with sub-word tokenization but is still more or less a transformer (no working memory or internal iteration). Mostly, the performance gains seem modest (not universally; on some benchmarks it's a bit worse)... until you hit anything to do with character-level manipulation, and then it just stomps: 1.1% to 99% on CUTE - Spelling, as a particularly egregious example.
I'm not sure what the problem is exactly, but clearly something about sub-word tokenization is giving these models a particularly hard time on these sorts of tasks.
The CUTE benchmark is interesting, but doesn't have enough examples of the actual prompts used and model outputs to be able to evaluate the results. Obviously transformers internally manipulate their input at token level granularity, so to be successful at character level manipulation they first need to generate the character level token sequence, THEN do the manipulation. Prompting them to directly output a result without allowing them to first generate the character sequence would therefore guarantee bad performance, so it'd be important to see the details.
> Once they've learnt the letters predicted by each token, they'll be able to do this for any word (i.e. token sequence).
They often fail at things like this, hence the strawberry example. Because they can't break down a token or have any concept of it. There is a sort of sweet spot where it's really hard (like strawberry). The example you give above is so far from a real word that it gets tokenized into lots of tokens, i.e. it's almost character-level tokenization. You also have the fact that none of the mainstream chat apps are blindly shoving things into a model. They are almost certainly routing that to a split function.
Why would an LLM need to "break down" tokens into letters to do spelling?! That is just not how they work - they work by PREDICTION. If you ask an LLM to break a word into a sequence of letters, it is NOT trying to break it into a sequence of letters - it is trying to do the only thing it was trained to do, which is to predict what tokens (based on the training samples) most likely follow such a request, something that it can easily learn given a few examples in the training set.
The LLM can't; that's what makes it relatively difficult. The tokenizer can.
Run it through your head with character-level tokenization. Imagine the attention calculations. See how easy it would be? See how few samples would be required? It's a trivial thing when the tokenizer breaks everything down to characters.
Consider the amount and specificity of training data required to learn spelling 'games' using current tokenization schemes: vocabularies of 100,000-plus tokens, many of which are close together in high-dimensional space but spelled very differently. Then consider the various data sets which give phonetic information as a method to spell. They'd be tokenized in ways which confuse a model.
Look, maybe go build one. Your head will spin once you start dealing with the various types of training data and how different tokenization changes things. It screws up spelling, math, code, technical biology material, financial material. I specifically build models for financial markets and it's an issue.
> I specifically build models for financial markets and it's an issue.
Well, as you can verify for yourself, LLMs can spell just fine, even if you choose to believe that they are doing so by black magic or tool use rather than learnt prediction.
So, whatever problems you are having with your financial models aren't because they can't spell.
You seem to think that predicting s t -> s t is easier than predicting st (single token) -> s t.
Of all the incredible things that LLMs can do, why do you imagine that something so basic is challenging to them?
In a trillion token training set, how few examples of spelling are you thinking there are?
Given all the specialized data that is deliberately added to training sets to boost performance in specific areas, are you assuming that it might not occur to them to add coverage of token spellings if it was needed ?!
Why are you relying on what you believe to be true, rather than just firing up a bunch of models and trying it for yourself ?
> You seem to think that predicting s t -> s t is easier than predicting st (single token) -> s t.
Yes, it is significantly easier to train a model to do the first than the second across any real vocabulary. If you don't understand why, maybe go back to basics.
No, because it still has to learn what to predict when "spelling" is called for. There's no magic just because the predicted token sequence is the same as the predicting one (+/- any quotes, commas, etc).
And ...
1) If the training data isn't there, it still won't learn it
2) Having to learn that the predictive signal is a multi-token pattern (s t) vs a single token one (st) isn't making things any simpler for the model.
Clearly you've decided to go based on personal belief rather than actually testing for yourself, so the conversation is rather pointless.
You are going to find that for 1), with character-level tokenization you don't need data covering every token for it to learn. With current tokenization schemes you do, and it still goes haywire from time to time when tokens that are close in embedding space are spelled very differently.
I don't doubt that training an LLM, and curating a training set, is a black art. Conventional wisdom was that up until a few years ago there were only a few dozen people in the world who knew all the tricks.
However, that is not what we were discussing.
You keep flip-flopping on how you think these successfully trained frontier models are working and managing to predict the character-level sequences represented by multi-character tokens ... one minute you say it's due to having learnt from an onerous amount of data, and the next you say they must be using a split function (if that's the silver bullet, then why are you not using one yourself, I wonder).
Near the top of this thread you opined that failure to count r's in strawberry is "Because they can't break down a token or have any concept of it". It's a bit like saying that birds can't fly because they don't know how to apply Bernoulli's principle. Wrong conclusion, irrelevant logic. At least now you seem to have progressed to (on occasion) admitting that they may learn to predict token -> character sequences given enough data.
If I happen into a few million dollars of spare cash, maybe I will try to train a frontier model, but frankly it seems a bit of an expensive way to verify that if done correctly it'd be able to spell "strawberry", even if using a penny-pinching tokenization scheme.
Nope, the right analogy is: "it's like saying a model will find it difficult to tell you what's inside a box because it can't see inside it". Shaking it, weighing it, measuring if it produces some magnetic field or whatever is what LLMs are currently doing, and often well.
The discussion was around the difficulty of doing it with current tokenization schemes v character level. No one said it was impossible. It's possible to train an LLM to do arithmetic with decent sized numbers - it's difficult to do it well.
You don't need to spend more than a few hundred dollars to train a model to figure something like this out. In fact, you don't need to spend any money at all. If you are willing to step through a small model layer by layer, it's obvious.
At the end of the day you're just wrong. You said models fail to count r's in strawberry because they can't "break" the tokens into letters (i.e. predict letters from tokens, given some examples to learn from), and seem entirely unfazed by the fact that they in fact can do this.
Maybe you should tell Altman to put his $500B datacenter plans on hold, because you've been looking at your toy model and figured AGI can't spell.
Maybe go back and read what I said rather than make up nonsense. 'Often fail' isn't 'always fail'. And many models fail the strawberry example; that's why it's famous. I even laid out some training samples of the type that enables current models to succeed at spelling 'games' in a fragile way.
Being problematic and fragile at spelling games compared to character- or byte-level 'tokenization' isn't a giant deal. These are largely "gotchas" that don't reduce the value of the product materially. Everyone in the field is aware. Hyperbole isn't required.
Someone linked you to one of the relevant papers above... and you still contort yourself into a pretzel. If you can't intuitively get the difficulty posed by current tokenization, and how character/byte-level 'tokenization' would make those things trivial (albeit with a trade-off that doesn't make it worth it), maybe you don't have the horsepower required for the field.
"""
While current LLMs with BPE vocabularies lack
direct access to a token’s characters, they perform
well on some tasks requiring this information, but
perform poorly on others. The models seem to
understand the composition of their tokens in direct probing, but mostly fail to understand the concept of orthographic similarity. Their performance
on text manipulation tasks at the character level
lags far behind their performance at the word level.
LLM developers currently apply no methods which
specifically address these issues (to our knowledge), and so we recommend more research to
better master orthography. Character-level models
are a promising direction. With instruction tuning, they might provide a solution to many of the
shortcomings exposed by our CUTE benchmark
"""
> LLMs are, in some stretched meaning of the word, illiterate.
You raise an interesting point here. How would LLMs need to change for you to call them literate? As a thought experiment, I can take a photograph of a newspaper article, then ask an LLM to summarise it for me. (Here, I assume that LLMs can do OCR.) Does that count?
It's a bit of a stretch to call them illiterate, but if you squint, it's right.
The change is easy - get rid of tokenization and feed in characters or bytes.
The catch is that it causes all kinds of other problems with respect to required model size, required training, and so on. It's a researchy thing; I doubt we end up there any time soon.
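To make "feed in characters or bytes" concrete, here's a minimal sketch of what the input would look like under byte-level 'tokenization', where the vocabulary is just the 256 possible byte values:

    # Sketch: byte-level "tokenization" -- the vocabulary is just 0..255, so every
    # character of 'strawberry' gets its own input position the model can attend to.
    text = "strawberry"
    byte_ids = list(text.encode("utf-8"))
    print(byte_ids)       # [115, 116, 114, 97, 119, 98, 101, 114, 114, 121]
    print(len(byte_ids))  # 10 positions, versus a handful of BPE tokens

The cost is exactly the one mentioned above: sequences get several times longer, which blows up the compute and training requirements.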
> I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.
So can the current models.
It's frustrating that so many people think this line of reasoning actually pays off in the long run, when talking about what AI models can and can't do. Got any other points that were right last month but wrong this month?
There are always going to be doubters on this. It's like the self-driving doubters. Until you get absolute perfection, they'll point out shortcomings. Never mind that humans have more holes than Swiss cheese.