
The earlier Yi paper indicates it was trained on a dataset that is less than 25% Chinese, in contrast to GPT-3, which was 93% English [1][2]. Is that a bug, or is there something inherent to current LLM architectures, such as the dataset needing to be 90%+ English to avoid falling apart?

1: https://arxiv.org/html/2403.04652v1

2: https://github.com/openai/gpt-3/blob/master/dataset_statisti...



The pretraining mix might matter less here than the instruct fine-tuning.

The small GLM models were roughly 50-50 English-Chinese in pretraining but much more Chinese-heavy in instruct training. They had the same issue until they rebalanced that.
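As a rough sketch of how such a language ratio could be measured (not something from the thread, just an illustrative heuristic that counts CJK ideographs versus ASCII letters rather than using a real language-identification model):

```python
def lang_mix(docs):
    """Estimate the Chinese/English character mix of a corpus.

    Crude heuristic: characters in the CJK Unified Ideographs block
    count as Chinese, ASCII letters count as English; everything else
    (punctuation, digits, whitespace) is ignored.
    """
    zh = en = 0
    for doc in docs:
        for ch in doc:
            if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs
                zh += 1
            elif ch.isascii() and ch.isalpha():
                en += 1
    total = zh + en
    if total == 0:
        return {"zh": 0.0, "en": 0.0}
    return {"zh": zh / total, "en": en / total}

# Example: two short "documents", one English, one Chinese.
print(lang_mix(["hello world", "你好世界"]))
```

A production pipeline would use a proper language identifier (and handle code-switching within documents), but per-character counting is often close enough for gauging a corpus-level ratio like the 50-50 split mentioned above.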



