The New OpenAI tokenizer for the Turbo model is much better than the GPT-2/3 one
19 points by Tiberium 11 months ago | 5 comments
The new gpt-3.5-turbo model by OpenAI uses the new cl100k_base tokenizer. I used the tiktoken Python library to dump its token list (code: https://rentry.org/qghpf), and some of the results are very interesting. There are a lot of tokens for indentation (runs of multiple spaces), which is much better for code than the GPT-2 tokenizer, which always had spaces as separate tokens. In longer code examples the token savings can be 2x compared to the GPT-2 tokenizer.
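If you want to check the indentation claim on your own code, here is a minimal sketch. It assumes the tiktoken package is installed and able to fetch the "gpt2" and "cl100k_base" encodings; the sample snippet and the helper names (count_tokens, savings_ratio, compare) are made up for illustration and are not from the linked script. The ratio you get depends heavily on how much indentation the text contains.

```python
from typing import Dict


def savings_ratio(old_count: int, new_count: int) -> float:
    """How many times fewer tokens the new tokenizer needs (old / new)."""
    if new_count <= 0:
        raise ValueError("new_count must be positive")
    return old_count / new_count


def count_tokens(text: str, encoding_name: str) -> int:
    """Count the tokens `text` encodes to under a named tiktoken encoding."""
    import tiktoken  # third-party; imported lazily so savings_ratio works without it

    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))


def compare(text: str) -> Dict[str, float]:
    """Token counts for the GPT-2 and cl100k_base encodings, plus their ratio."""
    old = count_tokens(text, "gpt2")
    new = count_tokens(text, "cl100k_base")
    return {"gpt2": old, "cl100k_base": new, "ratio": savings_ratio(old, new)}


# Hypothetical sample: nested code, where runs of spaces make up much of the text.
SAMPLE = (
    "def f(items):\n"
    "    for item in items:\n"
    "        if item:\n"
    "            print(item)\n"
)
```

Calling compare(SAMPLE) returns both counts and the ratio; swap in a longer file of your own to see how the savings change with deeper nesting.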

The longest token out of them all is 58040, which is a string of 128 spaces. There are tokens for every run of 1 to 81 spaces, plus 83, 87, 91, 95 and, as mentioned, 128 spaces.

The longest non-space token 87644 is "//----------------------------------------------------------------------------------------------------------------" with 114 characters.

The longest non-symbol token is 63570 - ".translatesAutoresizingMaskIntoConstraints" (from Apple's UIView). There are really a lot of identifier names from all the different frameworks out there - ".onChange", "DetailsService". Also some interesting ones - " "../../../../" (the quote inside is from the token itself)

If you want to see all tokens sorted by their byte length - see https://gist.github.com/Yardanico/623b3092d0b707119f8c7d90a3596afe (warning: 1.6MB of text, 100k+ lines)
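For anyone who would rather rebuild a list like that locally, a rough sketch of the sorting step follows. It is not the script linked above, just one way to do it: the helper names sort_by_byte_length and longest_tokens are hypothetical, and the second function assumes tiktoken is installed and that the encoding can be downloaded.

```python
from typing import Iterable, List, Tuple


def sort_by_byte_length(tokens: Iterable[bytes]) -> List[bytes]:
    """Order tokens longest first; ties are broken by byte value so output is stable."""
    return sorted(tokens, key=lambda t: (-len(t), t))


def longest_tokens(
    encoding_name: str = "cl100k_base", n: int = 5
) -> List[Tuple[int, int, bytes]]:
    """Return (token_id, byte_length, token_bytes) for the n longest tokens."""
    import tiktoken  # third-party; imported lazily so the pure helper works without it

    enc = tiktoken.get_encoding(encoding_name)
    # token_byte_values() lists the raw bytes of every regular (non-special) token.
    top = sort_by_byte_length(enc.token_byte_values())[:n]
    # encode_single_token() maps a token's bytes back to its integer id.
    return [(enc.encode_single_token(t), len(t), t) for t in top]
```

Tokens are kept as bytes rather than decoded to str because some of them are not valid UTF-8 on their own.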

Thank you for the summary! Trying to figure out how to squeeze as much content as possible into the context, and this bit of wisdom will be very useful :)

Wonder how much ASCII art you could get in there if they trained on IRC chat logs XD

So, basically what you're saying is "All you need is a damn good tokenizer"

I see you paid attention. :)

It's wild how every day I learn something new about LLMs just reading here on HN haha
