The tokenizer covers the entire dataset. It's basically just a fixed-size Huffman code, grouping together common fragments of letters: for instance, the 100 most common English words are probably all single tokens.
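As a rough illustration (not how real tokenizers like BPE are trained, but the lookup at inference time is similar in spirit), a fixed vocabulary plus greedy longest-match gives you exactly this behavior: common words come out as single tokens, everything else falls back to smaller fragments. The vocabulary here is made up:

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest substring first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # character-level fallback
            i += 1
    return tokens

# Hypothetical mini-vocabulary: frequent words/fragments get their own token.
vocab = {"newspaper", "news", "paper", "the", " ", "q", "7", ".", "b", "f", "e"}
print(tokenize("the newspaper", vocab))  # ['the', ' ', 'newspaper']
print(tokenize("newsq77.bfe", vocab))    # ['news', 'q', '7', '7', '.', 'b', 'f', 'e']
```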
During learning, the model proceeds in roughly the same way a child would: it starts by grouping tokens together, learning the deep regularities of language, such as "news[paper]" being more likely than "news[q77.bfe]". Then it incrementally assembles these fragments into larger and larger chains. Similarly, it first learns thematic groupings, such as "word" being more likely somewhere after "dictionary" than after "stop what I was reading to get the dictionary out every time I encountered a banana assault hungry". Then it starts to pick up "patterns": "as a [baby|child|kid] I had no [idea|concept|clue]". At some point in this process it naturally abstracts concepts away from any particular language: "as a child" starts being internally represented by the same neurons as the German "als ich ein Kind war" ("when I was a child").
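To make the "news[paper]" vs. "news[q77.bfe]" point concrete, here's a toy next-token model built from raw counts over a tiny made-up corpus; a real model learns a vastly richer version of the same statistics:

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus; any real training set is billions of tokens.
corpus = ("the news paper arrived . the news was good . "
          "as a child i had no idea").split()

follows = defaultdict(Counter)          # counts of what follows each token
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def p_next(prev: str, nxt: str) -> float:
    """Empirical probability that `nxt` follows `prev` in the corpus."""
    total = sum(follows[prev].values())
    return follows[prev][nxt] / total if total else 0.0

print(p_next("news", "paper"))    # 0.5 -- an observed continuation
print(p_next("news", "q77.bfe"))  # 0.0 -- never observed
```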
Then some magic happens that we don't understand, and out pops a neural network that you can talk to and that can write programs and use tools. To be clear, this is the case before RL: these patterns are probably widespread in the training data by now, so the model already understands how to "complete the pattern" on its own. RL then does some magic on top of that to bring it from 20% on benchmarks to 80%, and presto, AI assistant.
I think "concept" here means it is assigned a point or a region in the high-dimensional embedding space. The concept has no obvious form, but similar words (synonyms, or words from other languages meaning roughly the same thing) are very close together in this space.
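A toy sketch of that idea, with hand-picked 4-dimensional vectors (real embeddings are learned and have hundreds to thousands of dimensions): cosine similarity is high for synonyms and translations, low for unrelated words.

```python
import numpy as np

# Hand-picked vectors purely for illustration, not from any real model.
emb = {
    "child":  np.array([0.90, 0.10, 0.30, 0.00]),
    "kid":    np.array([0.85, 0.15, 0.35, 0.05]),  # synonym: nearby direction
    "Kind":   np.array([0.88, 0.12, 0.28, 0.02]),  # German "child": also nearby
    "banana": np.array([0.10, 0.90, 0.00, 0.40]),  # unrelated: far away
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["child"], emb["kid"]))     # ~0.99
print(cosine(emb["child"], emb["Kind"]))    # ~1.00
print(cosine(emb["child"], emb["banana"]))  # ~0.19, much lower
```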
>Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.