Tokenization:
- Serves as a form of compression. The main benefit is supporting longer sequences for any given context window. As a side benefit, it squeezes roughly the same amount of "information" into each token -- meaning you don't have to add any terms to your model to account for an imbalance in information content per token (or even test whether that hyperparameter matters).
- Allows you to insert stuff other than the raw data into your stream of "tokens" to the LLM. For something like a chatbot, that could be as simple as a prefix marking whoever's talking next (e.g., system, user, model). You similarly probably want control tokens to denote the end of a sequence. If you have multi-modal content (e.g., text + images), you need some way to delimit the transition between those. All of those problems could mostly be solved with an appropriate encoding scheme, but that's basically tokenization by a different name (in that it's a transformation from one set of tokens to another that you have to apply to every input).
You can solve that second problem trivially with just a vocabulary of 256 "byte" tokens plus O(1) control tokens, so it's not a huge deal in practice; it's just a point worth mentioning if we're talking about naively encoding raw bytes.
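To make that concrete, here's a minimal sketch of a byte vocabulary plus a few control tokens. The specific token IDs and role names are made up for illustration, not taken from any real model:

```python
# Minimal sketch: 256 "byte" tokens plus a handful of control tokens.
# IDs 0-255 are raw bytes; control tokens live above that range.
# The IDs and role names here are arbitrary choices for illustration.
CONTROL = {"<bos>": 256, "<eos>": 257, "<system>": 258, "<user>": 259, "<model>": 260}
VOCAB_SIZE = 256 + len(CONTROL)

def encode(text: str, role: str = "<user>") -> list[int]:
    """Wrap a UTF-8 byte stream with start/role/end control tokens."""
    return [CONTROL["<bos>"], CONTROL[role], *text.encode("utf-8"), CONTROL["<eos>"]]

def decode(token_ids: list[int]) -> str:
    """Drop control tokens and decode the remaining bytes."""
    data = bytes(t for t in token_ids if t < 256)
    return data.decode("utf-8", errors="replace")

ids = encode("hello, world", role="<user>")
print(VOCAB_SIZE, len(ids), decode(ids))
```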
The first problem is more interesting. One observation is that if tokenization doesn't offer much compression for your particular problem, the difference won't matter much, and if the tokenizer isn't tailored to your particular data the comparison can even favor raw bytes. IIRC there was something about Hebrew text floating around as an example of raw byte models performing better than tokenized models.
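You can check how much compression you're actually getting on your data with a few lines. This sketch assumes the tiktoken package is installed; the cl100k_base encoding and the sample sentences are arbitrary choices, and the general pattern (fewer bytes per token on text the vocabulary wasn't tuned for) is the point, not the exact numbers:

```python
# Rough sketch: measure bytes-per-token for texts the tokenizer was and wasn't tuned for.
# Assumes `tiktoken` is installed; any BPE tokenizer would do.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "hebrew": "השועל החום המהיר קופץ מעל הכלב העצלן.",  # sample Hebrew sentence
}

for name, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(enc.encode(text))
    print(f"{name}: {n_bytes} bytes, {n_tokens} tokens, {n_bytes / n_tokens:.2f} bytes/token")
```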
Another observation: if your particular model has some form of compression for redundant state space (not true of any vanilla transformer, mostly not true of any major competitor, but technically possible regardless), and especially if the cost of processing a token isn't substantially greater than the per-byte cost of tokenizing an input, then tokenization doesn't buy you anything either. You're absolutely able to feed the raw data in and let the model handle the details.
On the flip side, suppose you're handling vanilla English text with a vanilla transformer. You can support something like 50x longer sequences basically for free by adding tokenization. You'd be silly not to.
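Back-of-the-envelope version of that trade-off, assuming the usual roughly quadratic attention cost; the bytes-per-token figure below is a placeholder you'd swap for a measurement on your own data (e.g., from the sketch above), not a claimed constant:

```python
# Rough arithmetic: what tokenization buys a vanilla transformer with ~quadratic attention.
def attention_savings(context_tokens: int, bytes_per_token: float) -> None:
    covered_bytes = context_tokens * bytes_per_token
    # Covering the same text with raw byte tokens multiplies sequence length by
    # bytes_per_token, so attention cost grows by roughly that factor squared.
    print(f"{context_tokens} tokens cover ~{covered_bytes:.0f} bytes; "
          f"byte-level attention over the same text costs ~{bytes_per_token ** 2:.0f}x more")

attention_savings(context_tokens=8192, bytes_per_token=4.0)  # placeholder ratio, not a measurement
```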
Image transformers are slightly different, at least in typical implementations. The tokenization is lossy (not injective), so the de-tokenization can't be a true inverse: since it is still a function, it either fails to reproduce every possible input image patch or injects randomness in the hope of at least matching the right distribution. It's often called tokenization too, but I view it as something different. Certain categories of problems (much like the English text example above) are made drastically cheaper by the process. Others (unlike the English text example above) are rendered impossible by the loss of information. A byte vocabulary makes those theoretically possible again, but you suddenly need a way to handle the "entropy per byte" problem which you didn't have to care about before.
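For concreteness, here's the sense in which a typical VQ-style image "tokenizer" is lossy: every patch gets snapped to its nearest codebook entry, so many distinct patches collapse onto the same token and the decoder can only return the shared code vector. The codebook size and patch dimensions below are made-up toy numbers:

```python
# Toy sketch of VQ-style image "tokenization": many-to-one, hence lossy.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 48))   # 256 code vectors for flattened 4x4x3 patches
patches = rng.normal(size=(200, 48))    # 200 flattened image patches

# Tokenize: index of the nearest codebook entry for each patch.
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)

# "Detokenize": look the code vector back up. Distinct patches that shared a token
# come back identical, which is exactly the information loss.
reconstructed = codebook[tokens]
print("unique tokens used:", len(np.unique(tokens)))
print("mean reconstruction error:", float(((patches - reconstructed) ** 2).mean()))
```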
Maybe one last idea: fuzzy detokenization (as in image transformers) has a notable advantage in spec adherence. Outputting an image and then letting some hand-written code convert it to a PNG is much more likely to produce something usable than outputting a PNG directly, byte by byte. The whole thing is probabilistic, and the flurry of strategies along the lines of "decode while greedily adhering to a schema" (JSON being the canonical example everyone wants to use for some reason, if you want to search for it) produce the wrong output distribution, often drastically so: the biased sampling distorts a sequence whose correctness only emerges from its conditional probabilities. I'm not sure exactly how big a model you need (or how tailored a loss function) to reliably output correct, large PNG files, but the current SOTA isn't there yet for general-purpose problems.
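The distribution-skew point is easy to see with a toy example: take a tiny two-step "model", restrict outputs to a made-up schema, and compare what the model actually believes conditioned on validity (rejection sampling) against greedily masking invalid tokens at each step. All the probabilities and the schema below are invented purely for illustration:

```python
# Toy illustration of why masking-to-a-schema changes the output distribution.
# Two-token "model" over the alphabet {A, B}; the "schema" accepts any sequence
# containing at least one B. All numbers are made up.
from itertools import product

first = {"A": 0.5, "B": 0.5}
second = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.5, "B": 0.5}}
valid = lambda seq: "B" in seq

# What the model actually believes, conditioned on validity (rejection sampling).
joint = {a + b: first[a] * second[a][b] for a, b in product("AB", repeat=2)}
z = sum(p for s, p in joint.items() if valid(s))
rejection = {s: p / z for s, p in joint.items() if valid(s)}

# Greedy constrained decoding: at each step, mask continuations that can't be
# completed into a valid sequence, then renormalize.
constrained = {}
for a, b in product("AB", repeat=2):
    if not valid(a + b):
        continue
    p1 = first[a]  # both first tokens can still reach a valid sequence, so no mask here
    allowed = {t: second[a][t] for t in "AB" if valid(a + t)}
    p2 = allowed[b] / sum(allowed.values())
    constrained[a + b] = p1 * p2

print("rejection   :", {s: round(p, 3) for s, p in rejection.items()})
print("constrained :", {s: round(p, 3) for s, p in constrained.items()})
```

The two printouts disagree badly on the low-probability-but-valid sequence, which is the "wrong output distribution" problem in miniature.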
In practice, people have made some byte-token models. They vary from "meh" to SOTA depending on the problem. On most problems, they're much more expensive than tokenized solutions. Interestingly, when they're SOTA they tend to be among the cheaper solutions.
I've been chipping away at some new model architectures, and something kind of like a byte-token solution is pretty suitable for those, largely because the model itself offers the compression you would otherwise obtain from tokenization. I'll finish and release them one of these years. For transformers, though, the byte-token solution is usually only interesting insofar as it confirms people's suspicions: results are fine, not amazing, except in special cases.