
Should the cost really be 15x? Or even 5x? In this case, it's not even a question of whether the network is better at English, it's that the cost to communicate with it at all in other languages is higher. Once you pay that cost you now have to deal with the network potentially generating lower quality results for prompts in non-English languages too, which raises the actual cost of doing something with GPT beyond 15x since you probably will need more attempts.


Because there's so much more English language for them to train on relative to most other languages, they're able to do some optimizations for English that they can't elsewhere. Should they not be able to implement optimizations for cases where they have the data volume to do so?


Both of you are kind of misunderstanding a few things. The data used to train the tokenizer is entirely separate from the data used to train the LLM.

The tokenizer used to train GPT-3 was old, inefficient, and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.

GPT-4's tokenizer is already far more efficient though still weighted to English.
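
The English weighting can be illustrated with a toy byte-level BPE sketch (the merge rules below are invented for illustration; real tokenizers like GPT-4's learn on the order of 100k merges from their training corpus):

```python
# Toy byte-level BPE: merge rules learned from an English-heavy corpus
# compress English text well, but leave non-Latin text near one token
# per UTF-8 byte, because none of the merges cover those byte sequences.

def tokenize(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Start from raw UTF-8 bytes; apply each merge rule left-to-right in priority order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # combine the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merges, the kind an English-dominated corpus would produce.
merges = [(b"t", b"h"), (b"th", b"e"), (b"i", b"n"), (b"in", b"g"),
          (b"e", b"r"), (b"o", b"n")]

english = tokenize("the thing", merges)   # 9 chars -> 4 tokens
greek = tokenize("το πράγμα", merges)     # "the thing" in Greek: 9 chars -> 17 tokens
print(len(english), len(greek))
```

The Greek string costs more tokens for the same meaning purely because the merge table never saw Greek bytes, which is the mechanism behind the per-language cost multipliers discussed above.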


> GPT-4's tokenizer is already far more efficient though still weighted to English.

Right. It's a general question. Should they be allowed to make the kinds of tokenization optimizations their data volume enables, even if that means some languages get more optimization than others? Or should users of the languages that could be optimized effectively pay a tax out of some sense of fairness?


"there's so much more English language for them to train on relative to most other languages" is an interesting assertion. There are billions of people on earth speaking languages other than English and they have access to the internet. Are you sure it's not just the case that we didn't scrape that data?

Everyone has to choose what data to train on; you can't train against The Entire Internet, which is a limitless amount of data. But it becomes an intentional choice with consequences, like the 15.77x seen here.


> Everyone has to choose what data to train on; you can't train against The Entire Internet, which is a limitless amount of data.

Isn't that exactly how OpenAI managed to 10x GPT-3.5 with GPT-4?


But training against the entire Internet would still be biased towards English, because English is the dominant language used on the Internet.


What makes you think there's a "should"?


There's always a should. Society gets a say in what people and corporations can and can't do in (at the very least) the form of laws. There's your should right there.


Which society?

I mean, OpenAI is a US company, so it's unsurprisingly going to mostly communicate in English.

Are we counting all societies? If so, should our software cater to all of their demands, linguistic and/or cultural?





