I had been messing around with LLMs all day and had a few test cases open, so I asked each model to change a few things in a ~6 KB C# snippet, phrased in a somewhat ambiguous but reasonable way.
GPT-4 did the job perfectly. Qwen:72b did half of the job, completely missed the other half, and renamed a variable that had nothing to do with the request. Llama3.1:70b behaved very similarly to Qwen, which is interesting.
OpenCoder:8b started reasonably well, then randomly replaced "Split('\n')" with "Split(n)" in unrelated code, and then went completely berserk, hallucinating non-existent Stack Overflow pages and answers.
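For reference, a minimal C# sketch of why that substitution is wrong (the strings and variable names here are mine, not from the original snippet): Split('\n') splits on the newline character, while Split(n) refers to some variable n that may not even exist.

    string text = "first line\nsecond line";

    // Correct: splits the string on the newline character.
    string[] lines = text.Split('\n');   // ["first line", "second line"]

    // What OpenCoder produced: this only compiles if a variable `n`
    // happens to be in scope, and then it splits on whatever `n` holds,
    // silently changing behavior.
    // string[] broken = text.Split(n);  // otherwise: error CS0103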
My best guess is that you shouldn't train a model mostly on code. The natural-language conversations used to train other models let them "figure out" human-like reasoning. If your training set is mostly code, the model can produce output that looks like code, but that output will have little value to humans.
Edit: to be fair, llama3.2:3b also botched the code, but at least it did not hallucinate complete nonsense.
Here is a quite comprehensive LLM coding leaderboard: https://aider.chat/docs/leaderboards/
And they update it quite quickly with new model releases.
For posterity, I saved the OpenCoder output here: https://pastebin.com/VRXYFpzr