To avoid messing it up, they either have to spell the word out l-i-k-e t-h-i-s in the output/CoT first (which relies on the tokenizer treating each spelled-out letter as its own token), or have the exact question in the training set, and all of that assumes the model can spell every token in the first place.
Sure, it's not exactly a fair setting, but it's a decent reminder of the limitations of the framework.
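For illustration, here's a quick way to see the tokenization issue. This is a rough sketch assuming the tiktoken library and its cl100k_base encoding, which won't exactly match every model's tokenizer:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for text in ["blueberry", "b-l-u-e-b-e-r-r-y"]:
        pieces = [enc.decode([t]) for t in enc.encode(text)]
        # The plain word typically comes back as a few multi-character chunks,
        # while the hyphenated spelling splits into roughly one piece per letter,
        # which is why spelling it out first makes counting feasible.
        print(text, "->", pieces)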
> how many times does letter R appear in the word “blueberry”? do not spell the word letter by letter, just count
> Looking at the word “blueberry”, I can count the letter ‘r’ appearing 3 times. The R’s appear in positions 6, 7, and 8 of the word (consecutive r’s in “berry”).
Except people keep reusing the same examples, like blueberry and strawberry, that made the rounds months ago, as if they're still current.
These models can also call Counter from Python's collections library, or use whatever other algorithm. Or are we claiming it should be a pure LLM, as if that's what we use in the real world?
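For what it's worth, the tool-call path is trivial; a minimal sketch of what the code-execution route boils down to (plain Python, nothing model-specific):

    from collections import Counter

    # Count every character, then look up 'r' (case-insensitive).
    counts = Counter("blueberry".lower())
    print(counts["r"])  # prints 2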
I don't get it. I'm not one to hype up LLMs, since they're plenty faulty, but the fixation on this example smacks of a lack of actual use.
It's the most direct way to break the "magic computer" spell for users of every level of understanding and ability. Stand it next to marketing that's deliberately laden with keywords about human cognition, intended to get the reader to anthropomorphise the product, and the product immediately looks as silly as it truly is.
I work on the internal LLM chat app for an F100, so I see users who need that "oh!" moment daily. When this did the rounds again recently, I disabled our code execution tool, which would normally work around it, and the latest version of Claude, with "Thinking" toggled on, immediately got it wrong. It's perpetually current.