
I was just messing around with LLMs all day, so I had a few test cases open. I asked each model to change a few things in a ~6KB C# snippet in a somewhat ambiguous but reasonable way.

GPT-4 did the job perfectly. Qwen:72b did half of the job, completely missed the other half, and renamed one variable that had nothing to do with the request. Llama3.1:70b behaved very similarly to Qwen, which is interesting.

OpenCoder:8b started reasonably well, then randomly replaced "Split('\n')" with "Split(n)" in unrelated code, and then went completely berserk, hallucinating non-existent Stack Overflow pages and answers.
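To illustrate why that one substitution is fatal (a minimal sketch; the original snippet isn't shown here, so "text" and "Demo" are made up): Split('\n') splits on the newline character, whereas Split(n) refers to an identifier "n" that was never declared, so the edited file wouldn't even compile.

    using System;

    class Demo
    {
        static void Main()
        {
            string text = "foo\nbar\nbaz";

            // Original call: splits on the newline character.
            string[] lines = text.Split('\n');
            Console.WriteLine(lines.Length); // prints 3

            // The model's rewrite, Split(n), references an identifier
            // "n" that doesn't exist in scope, so it fails to compile
            // (error CS0103). Left commented out for that reason:
            // string[] broken = text.Split(n);
        }
    }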

For posterity, I saved it here: https://pastebin.com/VRXYFpzr

My best guess is that you shouldn't train a model mostly on code. The natural-language conversations used to train other models let them "figure out" human-like reasoning. If the training set is mostly code, the model can produce output that looks like code, but it will have little value to humans.

Edit: to be fair, llama3.2:3b also botched the code. But it did not hallucinate complete nonsense at least.

> and renamed 1 variable that had nothing to do with

irrefutable proof we have AGI. it's here. they are as sentient as any human in my code reviews


Have you also tried Qwen-2.5-coder and deepseek-coder-v2 on the problem? I'd be very curious whether they do any better.

Are you able to test on https://chat.mistral.ai/chat as well? With Large 2 and Codestral?

I'm interested!


How did Claude Sonnet 3.5 fare?

It would be interesting to also compare with Claude 3.5 and DeepSeek 2.5.

Here is a fairly comprehensive LLM coding leaderboard: https://aider.chat/docs/leaderboards/ They update it quickly with new model releases.


