"Poro’s advanced capabilities with European languages like Finnish descend from how it addresses the core challenge for low-resource languages: training LLMs requires enormous amounts of data, but for low-resource languages like Finnish, sufficient data is simply not available. In general, Poro addresses this by cross-training low-resource languages with high-resource languages. This takes advantage of a cross-lingual signal that allows the model to achieve higher performance for the low-resource language than training a monolingual model, and has the further advantage of teaching the model basic translation capability."
I wonder how this cross-training affects the "overall quality" of the LLM for the low-resource languages. Are there any scientific papers that pin this down?