The earlier Yi paper indicates it was trained on a dataset that is less than 25% Chinese, in contrast to GPT-3, which was 93% English [1][2]. Is that a bug, or could there be something inherent to current LLM architectures - like the dataset must be 90%+ English or the model falls apart?
The pretraining might not matter here so much as the instruct fine-tuning.
The small GLM models were roughly 50-50 English-Chinese in pretraining but much more Chinese-heavy in instruct tuning. They had the same issue until that was balanced.
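As a rough sketch of how one might check that balance in an instruct-tuning corpus (this is my own illustrative heuristic, not anything from the GLM or Yi papers): classify each example by its share of CJK characters and report the mix. The 0.3 threshold and the tiny corpus are assumptions for the example.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of characters in the CJK Unified Ideographs block."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    return cjk / len(text)

def language_mix(corpus, threshold=0.3):
    """Return (english_share, chinese_share) over a list of examples.

    An example counts as Chinese if at least `threshold` of its
    characters are CJK ideographs (a crude but dependency-free proxy).
    """
    zh = sum(1 for doc in corpus if cjk_ratio(doc) >= threshold)
    n = len(corpus)
    return (n - zh) / n, zh / n

# Illustrative corpus: two English and two Chinese instruct examples.
corpus = [
    "Explain the difference between a list and a tuple in Python.",
    "请解释Python中列表和元组的区别。",
    "Write a haiku about autumn.",
    "写一首关于秋天的俳句。",
]
en, zh = language_mix(corpus)
print(f"English: {en:.0%}, Chinese: {zh:.0%}")  # → English: 50%, Chinese: 50%
```

A real pipeline would use a proper language-ID model, but even this kind of crude count is enough to catch a fine-tuning set that has drifted far from the pretraining mix.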
1: https://arxiv.org/html/2403.04652v1
2: https://github.com/openai/gpt-3/blob/master/dataset_statisti...