I have done some training with the Mistral family of models, and that’s probably what I’d think to try first on a French corpus.
Feel free to open an issue and I’ll work on it as I find time.
FYI, Hugging Face hosts datasets too. And Wikipedia has a nice portal for datasets: https://en.m.wikipedia.org/wiki/List_of_datasets_for_machine...