
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.

Does anyone have a good answer for why everyone went to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses: https://github.com/openai/tiktoken) was shown to be a more efficient encoding as far back as GPT-2 in 2019.
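For reference, the core of BPE training is a short loop: repeatedly fuse the most frequent adjacent token pair into a new token. A minimal pure-Python sketch (illustrative only; real tokenizers like tiktoken start from raw bytes and apply regex pre-tokenization first, and the function names here are made up for the example):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def bpe_train(text, num_merges):
    """Learn `num_merges` BPE merge rules from raw text.

    Starts from individual characters (tiktoken starts from bytes)
    and repeatedly merges the most frequent adjacent pair.
    """
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        pair = most_frequent_pair(tokens)
        merges.append(pair)
        merged = pair[0] + pair[1]
        # Apply the new merge rule left to right over the token stream.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges, tokens
```

On the classic example string "aaabdaaabac", the first two learned merges are ('a', 'a') and then ('aa', 'a'). The efficiency question in the thread is about the learned vocabulary and the pre-tokenization, not this loop, which is essentially the same everywhere.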



The SentencePiece library also implements byte-pair encoding. That's what the LLaMA models use, and the original Mistral models were essentially a copy of LLaMA 2.


SentencePiece is not a different algorithm from WordPiece or BPE, despite its naming.

One of the main draws of the SentencePiece library was that its pre-tokenization is less reliant on whitespace, and therefore more adaptable to non-Western languages.
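To make the whitespace point concrete: a GPT-2-style pre-tokenizer splits on spaces before learning merges, so text in languages written without spaces arrives as one unsplittable chunk, while SentencePiece treats the input as a raw character stream and just marks spaces with the meta symbol "▁" (U+2581). A rough sketch (function names are mine, not either library's API):

```python
def whitespace_pretokenize(text):
    """Whitespace-based pre-tokenization: merges can never cross
    the resulting word boundaries. For spaceless scripts (Chinese,
    Japanese, Thai, ...) the whole sentence comes back as one unit."""
    return text.split()

def sentencepiece_normalize(text):
    """SentencePiece-style normalization: replace each space with the
    meta symbol '▁' and operate on the raw stream, so spaceless
    scripts are handled the same way as English."""
    return text.replace(" ", "\u2581")
```

With the raw-stream approach, subword merges can form anywhere in "機械学習は楽しい", whereas the whitespace pre-tokenizer hands the learner a single nine-character "word".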


SentencePiece is a tool and library for training and using tokenizers, and supports two algorithms: Byte-Pair Encoding (BPE) and Unigram. You could almost say it is the library for tokenizers, as it has been standard in research for years now.

Tiktoken is a library which only supports BPE. It has also become synonymous with the tokenizer used by GPT-3, ChatGPT and GPT-4, even though this is actually just a specific tokenizer included in tiktoken.

What Mistral is saying here (in marketing speak) is that they trained a new BPE model on data that is more balanced multilingually than their previous BPE model. It so happens that they trained one with SentencePiece and the other with tiktoken, but that really shouldn't make any difference in tokenization quality or compression efficiency. The switch to tiktoken probably had more to do with latency, or something similar.
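The "compression efficiency" being claimed is usually measured as characters (or bytes) per token: a better-fitted vocabulary covers the same text with fewer, longer tokens. A toy way to see the metric, using greedy longest-match against a fixed vocabulary (a simplification I'm substituting for real BPE merge application):

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary.
    (A simplification: real BPE applies its learned merges in order,
    which can produce different splits than longest-match.)"""
    tokens, i = [], 0
    longest = max(map(len, vocab))
    while i < len(text):
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character falls back to a single-char token.
            tokens.append(text[i])
            i += 1
    return tokens

def compression(text, vocab):
    """Characters per token: higher means better compression,
    which in turn means fewer tokens per request and lower latency/cost."""
    return len(text) / len(greedy_tokenize(text, vocab))
```

Run the same corpus through two vocabularies and compare the ratios; a vocabulary trained on multilingually balanced data will score much better on non-English text, which is the improvement Mistral is actually claiming, independent of the SentencePiece-vs-tiktoken tooling choice.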




