You could train it from scratch on The Pile dataset[1] with a few hundred thousand bucks' worth of GPU quota. It's not rocket science - the architecture is, and that's open source by your definition.
The graph of layers and ops isn't open source by my definition. It can be extracted from the model, but so can control flow graphs from any binary. That's how higher-end disassemblers like IDA and Ghidra work.
Once again, this pickle file is not what's sitting in Mistral engineers' editors as they go about their day.
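To make the disassembler point concrete, here's a rough sketch using Python's own `dis` module as a stand-in for IDA or Ghidra (that substitution is my assumption, and `classify` is a made-up example function). It recovers basic block boundaries from compiled bytecode alone, without ever looking at the source text:

```python
import dis


def classify(n):
    # Hypothetical example function with one branch, so its bytecode has a jump.
    if n < 0:
        return "negative"
    return "non-negative"


# Work only from the compiled bytecode, the way a disassembler works from a binary.
bytecode = classify.__code__.co_code

# dis.findlabels returns the byte offsets that some jump instruction targets.
# Those offsets, plus offset 0, mark where basic blocks of the control flow
# graph begin.
block_starts = sorted(set([0] + dis.findlabels(bytecode)))

print("basic blocks start at offsets:", block_starts)
```

The exact offsets depend on the Python version. The point is only that the structure is recoverable from the artifact, which doesn't make the artifact the source.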
Well, the checkpoint __is__ the computational graph. The graph is also all the code. But if you want it in Python... that's here[0].
Please be clear, we keep asking. What are you asking for? Datasets? Training algo? What?
Comparing it to software artifacts doesn't work, because what's being given is equivalent to any program with open source code (visible or free to use). You have everything you need to use it, edit it, and fuck around with it. You don't have the exact scheme, but I'll put it this way: if you gave me the hardware, I could produce a high-quality LLM from scratch using their architecture.
That doesn't conflict with anything I've said. Yes, the checkpoint is code. It's not source code.
It's not what Mistral's engineers edit to create this release. Just like an ELF file necessarily contains the control flow graph, in a form experts can extract, but isn't open source because... it's not source.
There have been all sorts of closed-source libraries that you can freely integrate for whatever reason. They're not open source either.