The tokenizer covers the entire dataset. It's basically just a fixed-size Huffman code, grouping together common fragments of letters: for instance, the 100 most common English words are probably all single tokens.
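As a rough illustration (not how real tokenizers like BPE are trained, but the lookup at inference time is similar in spirit), a fixed vocabulary plus greedy longest-match gives you exactly this behavior: common words come out as single tokens, everything else falls back to smaller fragments. The vocabulary here is made up:

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest substring first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # character-level fallback
            i += 1
    return tokens

# Hypothetical mini-vocabulary: frequent words/fragments get their own token.
vocab = {"newspaper", "news", "paper", "the", " ", "q", "7", ".", "b", "f", "e"}
print(tokenize("the newspaper", vocab))  # ['the', ' ', 'newspaper']
print(tokenize("newsq77.bfe", vocab))    # ['news', 'q', '7', '7', '.', 'b', 'f', 'e']
```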
During learning, the model proceeds in roughly the same way a child would: it starts by grouping tokens together, learning the deep regularities of language, such as "news[paper]" being more likely than "news[q77.bfe]". Then it incrementally assembles these fragments into larger and larger chains. Similarly, it first learns thematic groupings, such as "word" being more likely somewhere after "dictionary" than after "stop what I was reading to get the dictionary out every time I encountered a banana assault hungry". Then it starts to pick up "patterns": "as a [baby|child|kid] I had no [idea|concept|clue]". At some point in this process it naturally abstracts concepts away from any particular language: "as a child" starts being internally represented by the same neurons as the German "als ich ein Kind war" ("when I was a child").
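To make the "news[paper]" vs. "news[q77.bfe]" point concrete, here's a toy next-token model built from raw counts over a tiny made-up corpus; a real model learns a vastly richer version of the same statistics:

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus; any real training set is billions of tokens.
corpus = ("the news paper arrived . the news was good . "
          "as a child i had no idea").split()

follows = defaultdict(Counter)          # counts of what follows each token
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def p_next(prev: str, nxt: str) -> float:
    """Empirical probability that `nxt` follows `prev` in the corpus."""
    total = sum(follows[prev].values())
    return follows[prev][nxt] / total if total else 0.0

print(p_next("news", "paper"))    # 0.5 -- an observed continuation
print(p_next("news", "q77.bfe"))  # 0.0 -- never observed
```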
Then some magic happens that we don't understand, and out pops a neural network that you can talk to and that can write programs and use tools. To be clear, this is the case before RL: these patterns are probably widespread in the training data by now, so the model already understands how to "complete the pattern" on its own. RL then does some magic on top of that to bring it from 20% on benchmarks to 80%, and presto, AI assistant.
I think "concept" here means it is assigned a point or a region in the high-dimensional embedding space. The concept has no obvious form, but similar words (synonyms, or words from other languages meaning roughly the same thing) are very close together in this space.
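A toy sketch of that idea, with hand-picked 4-dimensional vectors (real embeddings are learned and have hundreds to thousands of dimensions): cosine similarity is high for synonyms and translations, low for unrelated words.

```python
import numpy as np

# Hand-picked vectors purely for illustration, not from any real model.
emb = {
    "child":  np.array([0.90, 0.10, 0.30, 0.00]),
    "kid":    np.array([0.85, 0.15, 0.35, 0.05]),  # synonym: nearby direction
    "Kind":   np.array([0.88, 0.12, 0.28, 0.02]),  # German "child": also nearby
    "banana": np.array([0.10, 0.90, 0.00, 0.40]),  # unrelated: far away
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["child"], emb["kid"]))     # ~0.99
print(cosine(emb["child"], emb["Kind"]))    # ~1.00
print(cosine(emb["child"], emb["banana"]))  # ~0.19, much lower
```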
>Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.