Are you sure? Because LLMs definitely have to respond to user queries in time to avoid being perceived as slow. Therefore, thinking internally for too long isn’t an option either.
LLMs spend a fixed amount of compute on each token they output, and in a feedforward manner. There's no recursion in the network other than conditioning the next prediction on the tokens it has already output. So it's not really time pressure in the way you might experience it, but it does mean the available compute is sometimes not enough for the next token (and sometimes excessive). Thinking modes try to address this by essentially letting the LLM 'talk to itself' before sending anything to the user.
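To make the "fixed effort per token" point concrete, here's a toy sketch of the autoregressive loop. The `forward_pass` stub is a placeholder, not a real transformer, but the structure is the point: one pass per token, no internal looping, and the only feedback is the generated token being appended to the context.

```python
import random

def forward_pass(context: list[str]) -> str:
    """Stand-in for one fixed-cost pass through a feedforward network."""
    vocabulary = ["the", "a", "cat", "sat", "on", "mat", ".", "<eos>"]
    return random.choice(vocabulary)

def generate(prompt: list[str], max_tokens: int = 20) -> list[str]:
    context = list(prompt)
    output = []
    for _ in range(max_tokens):        # same amount of compute per token
        token = forward_pass(context)  # one pass, no internal recursion
        if token == "<eos>":
            break
        context.append(token)          # the only feedback loop: the token itself
        output.append(token)
    return output

print(generate(["the", "cat"]))
```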
There’s no "thinking internally" in LLMs. They literally "think" by outputting tokens. The "thinking modes" supported by online services are just the LLM talking to itself.
That's not what I meant. "Thinking internally" referred to the user experience only, where the user is waiting for a reply from the model. And they are definitely optimised to limit that time.
There’s no waiting for a reply, there’s only the wait between output tokens, which is fixed and mostly depends on hardware and model size. Inference is slower on larger models, but so is training, which is more of a bottleneck than user experience.
The model cannot think before it starts emitting tokens; the only way for it to "think" privately is for the interface to hide some of its output from the user, which is what happens in "think longer" and "search the web" modes.
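A minimal sketch of that hiding step, purely for illustration: the model still emits the reasoning tokens, the chat UI just strips them before display. The `<think>...</think>` delimiters here are an assumption for the example, not any vendor's actual format.

```python
import re

def visible_reply(raw_stream: str) -> str:
    """Strip everything between (hypothetical) thinking delimiters before showing the user."""
    return re.sub(r"<think>.*?</think>", "", raw_stream, flags=re.DOTALL).strip()

raw = ("<think>The user asked about latency; recall that decoding is autoregressive.</think>"
       "Decoding happens one token at a time.")
print(visible_reply(raw))  # -> "Decoding happens one token at a time."
```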
If an online LLM doesn’t begin emitting a reply immediately, it’s more likely that the service is waiting for available GPU time or something like that, and/or prioritizing paying customers. Lag between tokens is also more likely caused by heavy demand or throttling.
Of course there are many ways to optimize model speed that also make it less smart, and maybe even SOTA models have such optimizations these days. Difficult to know because they’re black boxes.