The earlier Yi paper indicates it was trained on a dataset that is less than 25% Chinese, in contrast to GPT-3, which was 93% English [1][2]. Is that a bug, or could there be something inherent to current LLM architectures - like the dataset must be 90%+ English or the model falls apart?
The pretraining might not matter here so much as the instruct fine-tuning.
The small GLM models were roughly 50-50 English-Chinese in pretraining but much more Chinese-heavy in instruct tuning. They had the same issue until that was balanced.
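As a rough sketch of how one might check that balance in an instruct-tuning corpus (this is my own illustrative heuristic, not anything from the GLM or Yi papers): classify each example by its share of CJK characters and report the mix. The 0.3 threshold and the tiny corpus are assumptions for the example.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of characters in the CJK Unified Ideographs block."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    return cjk / len(text)

def language_mix(corpus, threshold=0.3):
    """Return (english_share, chinese_share) over a list of examples.

    An example counts as Chinese if at least `threshold` of its
    characters are CJK ideographs (a crude but dependency-free proxy).
    """
    zh = sum(1 for doc in corpus if cjk_ratio(doc) >= threshold)
    n = len(corpus)
    return (n - zh) / n, zh / n

# Illustrative corpus: two English and two Chinese instruct examples.
corpus = [
    "Explain the difference between a list and a tuple in Python.",
    "请解释Python中列表和元组的区别。",
    "Write a haiku about autumn.",
    "写一首关于秋天的俳句。",
]
en, zh = language_mix(corpus)
print(f"English: {en:.0%}, Chinese: {zh:.0%}")  # → English: 50%, Chinese: 50%
```

A real pipeline would use a proper language-ID model, but even this kind of crude count is enough to catch a fine-tuning set that has drifted far from the pretraining mix.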
1: https://arxiv.org/html/2403.04652v1
2: https://github.com/openai/gpt-3/blob/master/dataset_statisti...