
Should the cost really be 15x? Or even 5x? In this case, it's not even a question of whether the network is better at English, it's that the cost to communicate with it at all in other languages is higher. Once you pay that cost you now have to deal with the network potentially generating lower quality results for prompts in non-English languages too, which raises the actual cost of doing something with GPT beyond 15x since you probably will need more attempts.


Because there's so much more English language for them to train on relative to most other languages, they're able to do some optimizations for English that they can't elsewhere. Should they not be able to implement optimizations for cases where they have the data volume to do so?


Both of you are kind of misunderstanding a few things. The data used to train the tokenizer is entirely separate from the data used to train the LLM.

The tokenizer used to train GPT-3 was old, inefficient, and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.

GPT-4's tokenizer is already far more efficient though still weighted to English.
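
The English weighting can be illustrated with a toy byte-level BPE sketch (the merge rules below are invented for illustration; real tokenizers like GPT-4's learn on the order of 100k merges from their training corpus):

```python
# Toy byte-level BPE: merge rules learned from an English-heavy corpus
# compress English text well, but leave non-Latin text near one token
# per UTF-8 byte, because none of the merges cover those byte sequences.

def tokenize(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Start from raw UTF-8 bytes; apply each merge rule left-to-right in priority order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # combine the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merges, the kind an English-dominated corpus would produce.
merges = [(b"t", b"h"), (b"th", b"e"), (b"i", b"n"), (b"in", b"g"),
          (b"e", b"r"), (b"o", b"n")]

english = tokenize("the thing", merges)   # 9 chars -> 4 tokens
greek = tokenize("το πράγμα", merges)     # "the thing" in Greek: 9 chars -> 17 tokens
print(len(english), len(greek))
```

The Greek string costs more tokens for the same meaning purely because the merge table never saw Greek bytes, which is the mechanism behind the per-language cost multipliers discussed above.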


> GPT-4's tokenizer is already far more efficient though still weighted to English.

Right. It's a general question. Should they be allowed to make the kinds of tokenization optimizations their data volume enables, even if that means some languages get more optimization than others? Or should users of the languages that could be optimized effectively pay a tax out of some sense of fairness?


"there's so much more English language for them to train on relative to most other languages" is an interesting assertion. There are billions of people on earth speaking languages other than English and they have access to the internet. Are you sure it's not just the case that we didn't scrape that data?

Everyone has to choose what data to train on; you can't train against The Entire Internet, which is a limitless amount of data. But it becomes an intentional choice with consequences, like the 15.77x seen here.


> Everyone has to choose what data to train on; you can't train against The Entire Internet, which is a limitless amount of data.

Isn't that exactly how OpenAI managed to 10x GPT-3.5 with GPT-4?


But training against the entire Internet would still be biased towards English, because English is the dominant language used on the Internet.


What makes you think there's a "should"?


There's always a should. Society gets a say in what people and corporations can and can't do in (at the very least) the form of laws. There's your should right there.


Which society?

I mean, OpenAI is a US company, so it's unsurprisingly going to mostly communicate in English.

Are we counting all societies? If so, should our software cater to all of their demands, linguistic and/or cultural?





