This line of reasoning that LLMs "only predict" the next token is akin to saying humans can only think or speak one word at a time. Yes, we use one token/word at a time, but it is the aggregate thought that matters, regardless of what underlies it.
I think the mistake people make is assuming that "probability" is a simple concept.
If there are 50K possible tokens and I don't have any other information, I could make the naive estimate that every token is equally likely and start generating text that is just gibberish. With the simple single-token Markov-chain example, I would instead estimate probabilities based on the previous token, and that estimate would be much better. If you use it to generate text, the output will look like something that is almost, but not quite, entirely unlike human speech. [1]
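For concreteness, here is a minimal sketch of those two estimators, assuming a toy whitespace "tokenizer" and a made-up one-sentence corpus (neither resembles a real tokenizer or training set; it's just enough to see the difference):

    import random
    from collections import defaultdict, Counter

    # Toy corpus and "tokenizer": split on whitespace.
    corpus = "the cat sat on the mat and the dog sat on the rug".split()

    # 1. Naive estimate: every token in the vocabulary is equally likely.
    vocab = sorted(set(corpus))

    def sample_uniform(n=10):
        return " ".join(random.choice(vocab) for _ in range(n))

    # 2. Single-token Markov chain: estimate P(next | previous) from bigram counts.
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def sample_markov(start="the", n=10):
        out = [start]
        for _ in range(n - 1):
            counts = bigrams.get(out[-1])
            if not counts:                      # token never seen as a prefix: fall back to uniform
                out.append(random.choice(vocab))
                continue
            tokens, weights = zip(*counts.items())
            out.append(random.choices(tokens, weights=weights)[0])
        return " ".join(out)

    print(sample_uniform())   # pure word salad
    print(sample_markov())    # locally plausible, still not human speech

Sampling from the uniform estimate gives word salad; the Markov chain already looks locally plausible but goes nowhere globally, which is exactly the "almost, but not quite" effect.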
The difference lies entirely in how accurately you model the world and what information you have available when estimating probabilities. Models like GPT4 happen to be very good at it because they encode a huge amount of knowledge about the world and take a lot of context into account when estimating the probability. That's not something to be taken lightly.
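One way to make "more information means better estimates" concrete, staying with the same toy corpus and granting that this is nothing like what GPT4 actually computes: conditioning on even a single token of context measurably shrinks the entropy of the next-token distribution.

    import math
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat and the dog sat on the rug".split()

    def entropy(counts):
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    # No context: one distribution over the whole vocabulary (unigram counts).
    print(entropy(Counter(corpus)))        # ~2.8 bits of uncertainty

    # One token of context: distribution over what follows "the".
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1
    print(entropy(bigrams["the"]))         # 2.0 bits: already sharper

Scale that idea up from one token of context to thousands, and from bigram counts to a model that has absorbed a huge amount of world knowledge, and you get the difference in quality we're talking about.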
I am skeptical that anyone saying this is making a mistake: it only ever really comes up when someone has specific priors they want to litigate, best summarized by the timeless line that you cannot make a man understand something when his paycheque depends on his not understanding it.