This thread looks spicy, so I won't address most of it. On the "2x vocab == 1/2 the tokens" idea though:
It usually doesn't pan out that way. Tokens aren't uniformly likely in normal text. They tend to follow some kind of power law (like pdf(x) ~ x^-0.4), and those 100k extra tokens, even assuming they're all available for use in purely textual inputs/outputs, will only move you from something like 11.4 nats of entropy per token to 12.1 (a 6% improvement).
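A quick way to sanity-check that arithmetic, as a rough sketch only: it assumes token frequencies follow a decaying power law over rank with exponent 0.4 and that every token in the vocabulary is usable. The function name `power_law_entropy_nats` is made up for this example. Under those assumptions the entropies should land close to the figures above when measured in nats, and the relative gain is the same in any unit.

```python
import math


def power_law_entropy_nats(vocab_size, exponent=0.4):
    """Shannon entropy (nats) of p(rank) proportional to rank**(-exponent)."""
    # Unnormalized weights for ranks 1..vocab_size (assumed power law).
    weights = [rank ** -exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    # H = -sum(p * ln p) over the normalized distribution.
    return -sum((w / total) * math.log(w / total) for w in weights)


h_100k = power_law_entropy_nats(100_000)
h_200k = power_law_entropy_nats(200_000)
gain = h_200k / h_100k - 1  # relative improvement from doubling the vocabulary

for label, h in (("100k", h_100k), ("200k", h_200k)):
    print(f"{label} vocab: {h:.2f} nats = {h / math.log(2):.2f} bits per token")
print(f"relative gain: {gain:.1%}")
```

Real tokenizers won't follow this distribution exactly, so treat the output as an order-of-magnitude argument, not a measurement.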
With that base idea in mind, how do we square that with your observations of large errors? It's a bit hard to know for certain since you didn't tell us which thing you're encoding, but:
1. Using an estimator, even if fairly precise and unbiased across all text, will have high variance for small inputs. If you actually need to estimate costs accurately for _each_ query (not just have roughly accurate costs summed across many queries or an accurate cost for a large query), this project is, as you pointed out, definitely not going to work.
2. Assuming your query distribution matches the tokenizer's training data in some sense, you would expect those errors to balance out over many queries (comparing total predicted costs to total actual costs) or over a single large query. That's still useful for a lot of people (e.g., to estimate the cost of running models across a large internal corpus).
3. Out-of-distribution queries are another interesting use-case where this project falls flat. IIRC somebody here on HN comments frequently about using LLMs for Hebrew text (specifically noting good performance with no tokenization, which is another fun avenue of research), and if any of the models included Hebrew-specific tokenization (I think code-specific tokenization is probably more likely in the big models, not that the specific example matters), you'll likely find that the model in question is much cheaper than the rest for those kinds of queries. There's no free lunch, and you'll necessarily also find pockets of other query types where that model's tokenizer is more expensive than the other tokenizers. This project doesn't have the ability to divine that sort of discrepancy.
4. Even being within 2x on costs (especially when we're talking about 3-4 orders of magnitude of discrepancy in the costs for the different models) is useful. It lets you accomplish things like figuring out roughly the best model you can afford for a kind of task.
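Points 1 and 2 can be illustrated with a toy simulation. This is a sketch under loudly stated assumptions: the per-query error is zero-mean, independent across queries, and uniform within ±30%. None of that is a measured property of any real tokenizer, and `simulate_estimation_errors` is a hypothetical helper.

```python
import random


def simulate_estimation_errors(num_queries=10_000, max_relative_error=0.30, seed=0):
    """Return (worst per-query relative error, relative error of the summed total)."""
    rng = random.Random(seed)
    estimated_total = 0.0
    actual_total = 0.0
    worst_query_error = 0.0
    for _ in range(num_queries):
        actual = rng.randint(20, 2_000)  # made-up true token count for one query
        # Assumption: unbiased, independent multiplicative noise on the estimate.
        noise = rng.uniform(-max_relative_error, max_relative_error)
        estimated = actual * (1 + noise)
        worst_query_error = max(worst_query_error, abs(estimated - actual) / actual)
        estimated_total += estimated
        actual_total += actual
    aggregate_error = abs(estimated_total - actual_total) / actual_total
    return worst_query_error, aggregate_error


worst, aggregate = simulate_estimation_errors()
print(f"worst single-query error: {worst:.1%}")
print(f"error on the summed total: {aggregate:.2%}")
```

The averaging only helps if the errors really are unbiased. A systematic offset between two tokenizers would show up in the total at full size.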
Separately:
> This is unnecessarily harsh, I'm not "upset with the implementation", ...
I think the problem people were noting was your tone. Saying you're uncomfortable with the methodology, that you won't use it, highlighting the error bars, pointing out pathological inputs, ..., are all potentially "interesting" to someone. Telling a person they should feel ashamed and are clueless is maybe appropriate somewhere sometimes, but it's strongly frowned upon at HN, usually shouldn't be done publicly in any setting, and is also a bit extreme for a project which is useful even in its current state. Cluelessness is an especially hard claim to justify when the project actually addresses your concern (just via a warning instead of a hard failure or something, which isn't your preferred behavior).
On the other hand, this situation is bad and the idea we should ignore it is misguided.
Below, we show that, given that only 1 tokenizer is used, any output collapses to a function that is constant: the per-token cost. This is why it's shameful: I get that people can't believe it's just C100K, but it is, and the writers know that, and they know at that point there is no function, just a constant.
> (especially when we're talking about 3-4 orders of magnitude of discrepancy in the costs for the different models)
Between a large and small model from the same provider, but not inter-provider.
The OOM mention hides the ball, yet shows something very important: no one uses a library to get a rough cost estimate when there's a 3 OOMs difference. You would use it if you were comparing models which were closer in cost...except you can't...because tokens is a constant, because they only use 1 tokenizer.
> It lets you accomplish things like figuring out roughly the best model you can afford for a kind of task.
The library calls Tiktoken to get the C100K count of tokens. Cost = cost per token * tokens. If it is only using C100K, tokens is constant, and the only relevant thing is cost per token, another constant. Now we're outside the realm of even needing a function.
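To make that concrete, here is a minimal sketch, not the library's actual code. The whitespace split stands in for a single shared tokenizer so the example runs without dependencies, and the model names and prices are hypothetical. With one token count shared by every model, the cost ranking is fixed by the price table alone.

```python
# Hypothetical per-token prices; not taken from any real price list.
PRICE_PER_TOKEN = {"model-a": 0.000002, "model-b": 0.00001, "model-c": 0.00006}


def count_tokens_single_tokenizer(text):
    # Stand-in for the one shared tokenizer (assumption: whitespace split).
    return len(text.split())


def estimate_costs(text):
    # Cost = cost per token * tokens, with the SAME token count for every model.
    tokens = count_tokens_single_tokenizer(text)
    return {model: price * tokens for model, price in PRICE_PER_TOKEN.items()}


def rank_by_cost(text):
    costs = estimate_costs(text)
    return sorted(costs, key=costs.get)


# The ranking never depends on the input text, only on the price table.
price_ranking = sorted(PRICE_PER_TOKEN, key=PRICE_PER_TOKEN.get)
print(rank_by_cost("a short prompt") == price_ranking)
print(rank_by_cost("a much longer prompt " * 50) == price_ranking)
```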
> you're uncomfortable with the methodology, that you won't use it, highlighting the error bars, pointing out pathological inputs, ..., are all potentially "interesting" to someone.
Tangential critiques are preferable? Is the issue pathological inputs? Or is the issue that its stated, documented, and explicit purpose is cost calculation based on token calculation for 400 LLMs, and it only supports 1 tokenizer and isn't even trying to make accurate cost estimates? It's passing your string to Tiktoken for a C100K count; it's not doing the bare minimum of every other tokenizer library I've seen that builds on Tiktoken.
Note there are no error bars that will satisfy because A) it's input dependent and B) it's impossible to get definitive error bars because the tokenizers they claim to support don't have any public documentation, anywhere.
It's shameful to ship code that claims to calculate financial costs, is off by a minimum of 30%+, and doesn't document any of that, anywhere. This is commonly described as fraud. Shameful is a weak claim.
Not at all. Reasoned arguments and adding information (the examples I gave were what I thought your main points were) to the discussion are preferable to (what seemed to be) character attacks. Your comment here, as an example, was mostly great. It provides the same level of usefulness to anyone reading it (highlighting that the computation is just C100K and that people will be misled if they try to use it the wrong way), and you also added reasoned counter-arguments to my OOM idea and several other interesting pieces of information. To the extent that you kept the character attacks against the author, you at least softened the language.
Respectfully attacking ideas instead of people is especially important in online discourse like this. Even if you're right, attacking people tends to spiral a conversation out of control and convince no one (often persuading them of the opposite of whatever you were trying to say).
> just C100K
It's not just C100K though. It is for a few models [0], but even then the author does warn the caller (mind you, I prefer mechanisms like an `allow_approximate_token_count=False` parameter or whatever, but that's not fraud on the author's part; that's a dangerous API design).
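For what it's worth, the kind of opt-in API I have in mind looks roughly like the sketch below. Everything in it is hypothetical: the model name, the tokenizer registry, and the whitespace-split tokenizers are placeholders, not anything from the project or from Tiktoken.

```python
class ApproximateTokenCountError(ValueError):
    """Raised when no exact tokenizer is known and approximation was not allowed."""


def _exact_tokenizer(text):
    # Placeholder for a tokenizer known to match the model exactly.
    return text.split()


def _fallback_tokenizer(text):
    # Placeholder for a "close enough" tokenizer borrowed from another model.
    return text.split()


# Hypothetical registry of models whose tokenizer is known exactly.
EXACT_TOKENIZERS = {"model-with-known-tokenizer": _exact_tokenizer}


def count_tokens(model, text, allow_approximate_token_count=False):
    tokenizer = EXACT_TOKENIZERS.get(model)
    if tokenizer is not None:
        return len(tokenizer(text))
    if not allow_approximate_token_count:
        # Fail loudly by default instead of warning and returning a guess.
        raise ApproximateTokenCountError(
            f"no exact tokenizer for {model!r}; pass "
            "allow_approximate_token_count=True to accept an estimate"
        )
    return len(_fallback_tokenizer(text))
```

The caller then has to write `allow_approximate_token_count=True` to get an estimate, so the approximation is visible at the call site and not buried in a log.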
Going back to the "tone" thing, calling out those sorts of deficiencies is a great way to warn other people, let them decide if that sort of thing matters for their use case, point out potential flaws in your own reasoning (e.g., it's not totally clear to me if you think the code always uses C100K or always uses it for a subset of models, but if it's the former then you'd probably be interested in knowing that the tokenizer is actually correct for most models) and discuss better API designs. It makes everyone better off for having read your comment and invites more discussions which will hopefully also make everyone better off.
> outside the realm of even needing a function
Maybe! I'd argue that it's useful to have all those prices (especially since not all tokens are created equally) in one place somewhere, but arguing that this is left-pad for LLM pricing is also a reasonable thing to talk about.
> it's impossible to get definitive error bars
That's also true, but that doesn't matter for every application. E.g., suppose you want to run some process on your entire corporate knowledge-base and want a ballpark estimate of costs. The tokenizer error is on average much smaller than the 30%+ you saw for some specific (currently unknown to us here at HN) very small input. Just run your data through this tool, tally up the costs, and you ought to be within 10%. Nobody cares if it's a $900 project or a $1300 project (since nobody is allocating expensive, notoriously unpredictable developers to a project with only 10-30% margins). You just tell the stakeholders it'll cost $2k and a dev-week, and if it takes less then everyone is happily surprised. If they say no at that estimate, they probably wouldn't have been ecstatic with the result if it actually cost $900 and a dev-day anyway.
I really appreciate your engagement here and think it has great value on a personal level, but the length and claims tend to hide two very obvious, straightforward things that are hilariously bad to the point it's unbelievable:
1. They only support GPT3.5 and GPT4.0. Note here: [1], and that gpt-4o would get swallowed into gpt-4-0613.
2. This will lead to massive, significant, embarrassingly large error in calculations. Tokenizers are not mostly the same, within 10% error.
# Explicating #1, Responsive to ex. "It's not just C100K though. It is for a few models [0]".
The link is to Tiktoken, OpenAI's tokenization library. There are literally more than GPT3.5 and GPT4.0 there, but they're just OpenAI's models, no one else's, none of the others in the long list in their documentation, and certainly not 400.
Most damning? There are only 2 other tokenizers, long deprecated, used only for deprecated models that aren't served anymore, so you're not calculating costs with them. The only live ones are c100k and o200k. As described above, and shown in [1], their own code kneecaps the o200k and will use c100k.
# Explicating #2
Let me know what you'd want to see if you're curious about the 30%+ error thing. I don't want to guess at a test suite that would make you confident you need to revise a prior that there's only +/- 10% difference between arbitrary tokenizers.
For context, I run about 20 unit tests, for each of the big 5 providers, with the same prompts, to capture their input and output token counts to make sure I'm billing accurately.
# Conclusion
Just to save you time, I think the best way I can provide some value is token count experiments demonstrating error. You won't be able to talk me down to "eh, let's just say it's +/- 10%, that's good enough for most people!" It matters; if it didn't, they'd explicate at least some of this. Instead, it's "tokenization for 400 LLMs!"
Oh I see (tiktoken). That's my mistake. I naively assumed the only good reason to pull in a 3rd party lib like that is if it actually did a reasonable amount of work.
> curious about the 30%+ error thing
I'm mildly curious. I have no doubt that small strings will often have high relative errors. I'd be surprised, though, if sum(estimated)/sum(actual) were far from 1 when you copy in either a large piece of text or many small pieces of text, outside of specialized domains beyond the normal scope of that tokenizer (e.g., throwing LaTeX code into something trained just on Wikipedia).
That's more for entropic reasons than anything else. The only way the aggregate error could be that large is if (1) some of these tokenizers are much less naive than the normal LLM literature and actually approach entropic bounds, or (2) the baseline implementations are especially bad so that there's a lot of headroom for improvements.
What happens when you throw in something like a medium-sized plain-text Wikipedia article (say, the first half as input and the second as output)?
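If it helps, this is the comparison I'd want to see from such an experiment, as a small sketch. `compare_counts` is a hypothetical helper, and the numbers in `sample_pairs` are placeholders to show the shape of the input, not measurements from any provider.

```python
def compare_counts(pairs):
    """pairs: list of (estimated_tokens, actual_tokens), one entry per text.

    Returns (worst per-text relative error, sum(estimated) / sum(actual)).
    """
    per_text_errors = [abs(est - act) / act for est, act in pairs]
    aggregate_ratio = sum(est for est, _ in pairs) / sum(act for _, act in pairs)
    return max(per_text_errors), aggregate_ratio


# Placeholder values only: two tiny strings and two longer documents.
sample_pairs = [(13, 10), (7, 10), (980, 1000), (2050, 2000)]

worst_error, ratio = compare_counts(sample_pairs)
print(f"worst per-text error: {worst_error:.0%}")
print(f"sum(estimated)/sum(actual): {ratio:.3f}")
```

Reporting both numbers separates "small strings have high relative error" from "the totals are off", which are the two claims being argued about here.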
> messaging -- tokenization for 400 LLMs
Alright, I'm sold. I'm still partial to Hanlon's razor for this sort of thing, but that ought to be patched.