I'm struggling to understand what these models are actually learning. They can be applied to all sorts of problems, but what fundamental things are being encoded in the model?
They take a sequence of tokens and do their best to continue the sequence in a plausible way. That's the basic idea: given the text so far, figure out what's likely to follow.
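A minimal sketch of that loop, using the Hugging Face transformers library with GPT-2 (the model choice, prompt, and sampling settings are all just illustrative assumptions on my part):

    # A toy sketch of "continue the sequence plausibly": the model assigns a
    # probability to every possible next token, and we sample from that
    # distribution repeatedly to extend the text.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10, do_sample=True, top_k=50)
    print(tokenizer.decode(out[0]))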
Do they need to learn rules, though, or could they memorize enough to approximate the probabilities of the most likely continuations? To my mind the blurry-jpeg metaphor[1] is very much spot-on. It's speculation, but it seems to me that LLMs are de facto doing (lossy) memorization, and that neural networks in general are good at "local consistency". That framing is consistent with most of the model behaviour we observe.
The next step of training is Reinforcement Learning from Human Feedback (RLHF). The model gets rewarded for certain outputs and penalized for others. This is how they learn to be agreeable, to attempt to answer people's questions, to not write Hitler speeches, etc.
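For the curious, here's a toy sketch of just the reward-modelling stage of RLHF (in a real system the reward head sits on top of the LLM itself; the dimensions and the random "hidden states" below are made up purely for illustration):

    import torch
    import torch.nn as nn

    # Reward-model training sketch: learn a scalar score such that the
    # human-preferred output of a pair scores higher than the rejected one.
    class RewardModel(nn.Module):
        def __init__(self, hidden=768):
            super().__init__()
            self.head = nn.Linear(hidden, 1)  # hidden state -> scalar reward

        def forward(self, h):
            return self.head(h).squeeze(-1)

    reward_model = RewardModel()
    opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

    # Stand-in embeddings for a batch of (preferred, rejected) output pairs.
    h_chosen, h_rejected = torch.randn(4, 768), torch.randn(4, 768)

    # Bradley-Terry style loss: push the preferred output's score above the
    # rejected output's score.
    loss = -torch.nn.functional.logsigmoid(
        reward_model(h_chosen) - reward_model(h_rejected)
    ).mean()
    loss.backward()
    opt.step()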
What you're describing is how to turn an LLM into a chatbot like ChatGPT. OP is asking about LLMs, which by themselves don't need any reinforcement learning.
Yeah, but to do anything useful (say, to classify, or to do sequence labelling like "color proper names red") you usually do need a second stage of training. The remarkable thing is that the unsupervised training on a large corpus transfers so well to those later stages.
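To make that concrete, here's a minimal sketch of such a second stage: fine-tuning a pretrained transformer for token labelling. The model name, the two-class label scheme, and the all-zero toy labels are my assumptions for illustration only:

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Second-stage training: a pretrained encoder plus a small classification
    # head, fine-tuned on a token-labelling task (e.g. tagging proper names).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-cased", num_labels=2  # 0 = other, 1 = proper name
    )

    inputs = tokenizer("Ada Lovelace wrote the first program.", return_tensors="pt")

    # Toy labels sized to the tokenized input; real training would align
    # word-level labels to subword tokens and mask the special tokens.
    labels = torch.zeros_like(inputs["input_ids"])

    loss = model(**inputs, labels=labels).loss  # supervised loss on pretrained weights
    loss.backward()  # transfer from pretraining means few labelled examples go far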
To be pedantic, "predict the next token" was what we were trying to do with RNNs 7-8 years ago. People are now training transformers on "mask out 15% of the words randomly and guess what they were", which is a big difference because that task is symmetrical in the forward and backward directions, whereas the single-direction nature of RNNs was a major limitation. For example, when they start out they have no state, so when a model was writing fake abstracts for clinical case reports (something I tried), it decided what disease the patient had based on whatever letters or words it happened to pick early on. It really should have started from a "latent state" that included the characteristics of the patient, including the disease, the same way the clinical encounter did and the way the author did when they wrote the abstract.
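That masked objective is easy to play with directly. A quick sketch using the fill-mask pipeline from Hugging Face (BERT and the example sentence are just illustrative choices):

    from transformers import pipeline

    # The masked-language-model objective in action: the model sees context on
    # BOTH sides of the blank, unlike a left-to-right RNN.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill("The patient was diagnosed with [MASK] after the biopsy."):
        print(pred["token_str"], round(pred["score"], 3))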