Hacker News
How the BPE tokenization algorithm used by large language models works (sidsite.com)
29 points by montebicyclelo 11 months ago | 11 comments



I wonder how tokenization works for East Asian languages. There are obviously more characters than the token vocabulary size of typical current models.

So, how do models like GPT answer in Chinese? Are they able to produce any Chinese character? From what I understand, they are not.

My second question would then be: which tokenization algorithms are used for Chinese and other East Asian languages? What does that mean for the models? How do models that can learn proper Chinese (with complete tokenization) differ from models for languages with fewer characters?


> There are obviously more characters than the token vocabulary size of typical current models.

Not really. A contemporary Chinese or Japanese person might know 1-10,000 characters, while even the GPT-2 BPE was ~51,000 tokens, and there is no real barrier to pushing that to hundreds of thousands; IIRC, FB did one with around a million BPEs recently. Plenty of room... There may be tens of thousands of hanzi in total, but most of those are vanishingly rare and specialized (and you have the amusing 'ghost characters', which are in Unicode but never actually existed), and they don't matter: you probably don't have the text corpus to learn them in any meaningful fashion, nor will they ever show up in your real-world applications. So the usual approach has always been to just assign each character a unique ID, and that's how you tokenize. (If you run into one of the rare ones, you can just replace it with an unknown token, or encode it as multiple bytes.)

There is, in theory, 'subword' structure to characters, but there's much less of it than in other writing systems like alphabets, where letter-level tokenization is ideal, and tokenizers trying to exploit that structure remain niche. So, hanzi-level tokenization it is.
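For the curious, a minimal sketch of that character-as-token approach (the corpus, the frequency cutoff, and the unknown-token handling here are made up purely for illustration):

    # Character-level tokenization sketch: each character seen often enough
    # gets its own ID; anything rarer falls back to a single <unk> token.
    from collections import Counter

    def build_char_vocab(corpus, min_count=2):
        counts = Counter(ch for text in corpus for ch in text)
        vocab = {"<unk>": 0}
        for ch, n in counts.most_common():
            if n >= min_count:
                vocab[ch] = len(vocab)
        return vocab

    def encode(text, vocab):
        return [vocab.get(ch, vocab["<unk>"]) for ch in text]

    corpus = ["你好世界", "世界很大", "你好你好"]
    vocab = build_char_vocab(corpus, min_count=1)
    print(encode("你好月球", vocab))   # unseen characters map to <unk> (0)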


It will happily translate simple phrases into Chinese, and automated translation software can read them back. As other commenters are saying, the tokenizer works on bytes, so a single Chinese character can be made of multiple tokens, just as some English words are made of multiple tokens. GPT doesn't see characters; it sees bytes. It would probably do better with an encoding scheme for the characters designed around some sort of logic or consistency, but it can go far with just memorization.
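If you want to poke at this, the open-source tiktoken library exposes the GPT tokenizers; a rough sketch (the exact splits depend on the vocabulary, so the token counts printed below aren't guaranteed):

    # Rough sketch using tiktoken (pip install tiktoken). Common hanzi often
    # end up as single tokens after BPE merges; rarer ones are split into
    # several byte-level tokens. Byte-level BPE round-trips any string.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["hello world", "你好,世界", "龘"]:
        ids = enc.encode(text)
        print(f"{text!r} -> {len(ids)} tokens: {ids}")
        assert enc.decode(ids) == text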


> There are obviously more characters than the token vocabulary size of typical current models.

I'm pretty sure the set of tokens also contains all 256 byte values to cover such cases.


Not sure what you mean, but the transformer model can only predict one token at a time. The final output layer needs as many nodes (or neurons, if you will) as there are distinct tokens in the vocabulary (see the sketch below). So a large token vocabulary is expensive, and that's why GPT-3 and LLaMA have only about 50,000 different tokens and use BPE to find a set of useful tokens. They can still express every possible English text because the token vocabulary contains the whole Latin alphabet.

Unicode 15 has nearly 150,000 characters, and the CJK languages have even more distinct characters than that, because Han unification folds variants into shared code points.

A model like GPT-3 can only output a very primitive version of Chinese. My question is how real Chinese models deal with this and specifically how tokenization works in that case.
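To make the output-layer cost mentioned above concrete, here is a rough PyTorch sketch (the dimensions are illustrative, not any particular model's):

    # The final projection and softmax scale linearly with vocabulary size,
    # which is one reason vocabularies stop around 50k-100k tokens.
    import torch
    import torch.nn as nn

    d_model = 768          # hidden size (illustrative)
    vocab_size = 50_000    # roughly GPT-2/GPT-3-scale BPE vocabulary

    lm_head = nn.Linear(d_model, vocab_size, bias=False)
    hidden = torch.randn(1, d_model)        # final hidden state for one position
    logits = lm_head(hidden)                # one score per token in the vocabulary
    probs = torch.softmax(logits, dim=-1)   # distribution over all 50,000 tokens
    print(logits.shape, lm_head.weight.numel(), "output-layer parameters")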


Yes, but the tokens are translated into bytes, not characters. There are only 256 distinct bytes, so GPT models can easily be trained to produce any character. The problem will probably be how sensible or understandable the binary (UTF-8) form of Chinese characters is, but that is a problem for the model, not the tokenizer.
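For example, in Python you can see how a single hanzi becomes several bytes and back:

    # A single Chinese character is three bytes in UTF-8; a byte-level model
    # only ever has to pick among 256 byte values to spell it out.
    s = "中"
    b = s.encode("utf-8")
    print(list(b))                                  # [228, 184, 173]
    print(bytes([228, 184, 173]).decode("utf-8"))   # 中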


Ok, I understand. That helped, thanks.

And also suddenly the B in BPE makes a lot of sense.


The way I think about it: a token can be one to many bytes long, so it can be longer or shorter than a single character.


Not an expert in East Asian languages, but GPT tokenizers are generally byte-based, meaning that the basic unit for the merges is a single byte, not a character.
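A toy sketch of byte-level BPE training (vastly simplified compared with the real tokenizers, which also do pre-tokenization and a byte-to-unicode remapping; the training text is made up):

    # Toy byte-level BPE: start from raw UTF-8 bytes and repeatedly merge the
    # most frequent adjacent pair into a new token ID.
    from collections import Counter

    def bpe_train(text, num_merges):
        seq = list(text.encode("utf-8"))    # base vocabulary: the 256 bytes
        merges, next_id = [], 256
        for _ in range(num_merges):
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            merges.append(((a, b), next_id))
            out, i = [], 0
            while i < len(seq):             # replace every occurrence of the pair
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(next_id)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq, next_id = out, next_id + 1
        return seq, merges

    tokens, merges = bpe_train("你好 你好 你好 hello hello", num_merges=6)
    print(tokens)
    print(merges)   # frequent byte pairs (e.g. within 你 and 好) become new tokens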


GPT uses BPE with a 50k or 100k token vocabulary, from what I understand. Given that a lot of that space is taken up by words and subwords, this is not nearly enough for the full CJK character set.


There are only 256 bytes. CJK characters can be produced by outputting these bytes in a certain order. LLMs are perfectly capable of outputting multiple tokens in sequence, since even many English words are multiple tokens each.



