Qwen2.5 Coder 32B is great for an OSS model, but in my testing (ollama) Sonnet 3.5 yields noticeably better results, a lot more than what the provided benchmarks suggest.
Best thing about it is that it's an OSS model that can be hosted by anyone, resulting in an open competitive market bringing hosting costs down, currently sitting at $0.18/$0.18 M tok/s [1] making it 50x cheaper than Sonnet 3.5 and ~17x cheaper than Haiku 3.5.
Claude Sonnet 3.5s are bars too high to clear. No other model comes close, with the occasional exception of o1-preview. But o1-preview is always a gamble, your rolls are limited and it will either be the best answer possible from an LLM or it returns after a wild goose chase, having talked itself into a tangled mess of confusion.
I'd personally rank the Qwen2.5 32B model only a little behind GPT4o at worst, and preferable to gemini 1.5 pro 002 (at code only, Gemini is a model that's surprisingly bad at code considering its top class STEM reasoning).
This makes Qwen2.5-coder-32B astounding all considered. It's really quite capable and is finally an accessible model that's useful for real work. I tested it on some linear algebra, discussed pros and cons of a belief propagation based approach to SAT solving, had it implement a fast simple approximate nearest neighbor based on the near orthogonality of random vectors in high dimensions (in OCaml, not perfect with but close enough to useful/easily correctable), simulate execution of a very simple recursive program (also Ocaml) and write a basic post processing shader for Unity. It did really well on each of those tasks.
Not really tried the Claude 3.5, later tried o1-preview on github models and recently Qwen2.5 32B for a prompt to generate a litestar[0] app to manage a wysiwyg content using grapesjs[1] and use pelican[2] to generate static site. It generated very bad code and invented many libraries in import which didn't exist. Cluade was one of the worst code generator, later tried sieve of atkin to generate primes to N and then use miller-rabin test to test each generated prime both using all the cpu core available. Claude completely failed and could never get a correct code without some or the other errors especially using multiprocess, o1-preview got it right in first attempt, Qwen 2.5 32B got it right in 3'rd error fix. In general for some very simple code Claude is correct but when using something new it completely fails, o1-preview performs much better. Give a try to generate some manim community edition visualization using Claude, it generates something not working correct or with errors, o1-preview does much better job.
In most of my test o1-preview performed way better than Claude and Qwen was not that bad either.
Isn't it a bit unreasonable to compare something free that you can run today on totally standard consumer hardware with *the* state of the art LLM that is probably >= 1T parameters?
This question keeps popping up but I don't get it. Everyone and their dog has an OpenAI-compatible API. Why not just serve a local LLM and put api.openai.com 127.0.0.1 in your hosts file?
There is a difference between chat and code completion. While with chat, you can use localhost with llama.cpp, but code completion you cannot do that: https://github.com/zed-industries/zed/issues/12519.
Yeah, benchmarks are one thing but when you actually interact with the model it becomes clear very fast how "intelligent" the model actually is, by doing or noting small things that other models won't. 3.5 Sonnet v1 was great, v2 is already incredible.
...but you can't run it locally. Not unless you're sitting on some monster metal. It's tiresome when people compare enormous cloud models to tiny little things. They're completely different.
I believe that parent comment was referring to the point that Sonnet 3.5 cannot be run locally which it obviously cannot but not because of locally available compute but because it's not OSS.
You really don't need models to fit into GPU's anymore for inference, e.g. llama.cpp is pretty good at partial GPU offload and I've gotten pretty fast results with only being able to fit ~30% of a model in VRAM and the rest in DDR5.
Ollama's default quantization is q4_0 which is quite bad, but you can go to ollama's model page to see all the quantizations they have available, you can do e.g. "ollama run qwen2.5-coder:32b-instruct-q8_0" which will need ~35G + space for context
The issue with some recent models is that they're basically overfitting on public evals, and it's not clear who's the biggest offender—OpenAI, or the Chinese? And regardless, "Mandelbrot in plaintext" is a bad choice for evaluation's sake. The public datasets are full of stuff like that. You really want to be testing stuff that isn't overfit to death, beginning with tasks that notoriously don't generalise all too well, all the while being most indicative of capability: like, translating a program that is unlikely to have been included in the training set verbatim, from a lesser-known language—to a well-known language, and back.
I'd be shocked if this model held up in the comprehensive private evals.
That's why I threw in "same size as your terminal window" for the Mandelbrot demo - I thought that was just enough of a tweak to avoid exact regurgitation of some previously trained program.
I have not performed comprehensive evals of my own here - clearly - but I did just enough to confirm that the buzz I was seeing around this model appeared to hold up. That's enough for me to want to write about it.
Hey, Simon! Have you ever considered to host private evals? I think, with the weight of the community behind you, you could easily accumulate a bunch of really high-quality, "curated" data, if you will. That is to say, people would happily send it to you. More people should self-host stuff like https://github.com/lm-sys/FastChat without revealing their dataset, I think, and we would probably trust it more than the public stuff, considering they already trust _you_ to some extent! So far the private eval scene is just a handful of guys on twitter reporting their findings in unsystematic manner, but a real grassroots approach backed up by a respectable influencer would go a long way to change that.
Honestly I don't think I have the right temperament for being a reliable source for evals. I've played around with a few ideas - like "Pelicans on a bicycle" https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/ - but running evals well on an ongoing basis requires a focus and attention to detail that isn't a great fit for how I work.
> The issue with some recent models is that they're basically overfitting on public evals… You really want to be testing stuff that isn't overfit to death… I'd be shocked if this model held up in the comprehensive private evals.
From the announcement:
> we selected the latest 4 months of LiveCodeBench (2024.07 - 2024.11) questions as the evaluation, which are the latest published questions that could not have leaked into the training set, reflecting the model’s OOD capabilities.
They say a lot of things, like that their base models weren't instruction-tuned, however people have confirmed that it's impossible to find instruction that it wouldn't follow, and the output would indicate that exactly. The labs absolutely love incorporating public evals in their training; of course, they're not going to admit that.
All the big guys are hiring domain experts - serious brains, phd level, in some cases - to build bespoke train and test data for their models.
As long as Jensen Huang keeps shitting out nvidia cards, progress is just a function of cash to burn on paying humans to dump their knowledge into train data... and hoping this silly transformer architecture keeps holding up
> All the big guys are hiring domain experts - serious brains, phd level, in some cases
I don't know where this myth had originated, and perhaps it was true at least at some point, but you just have to consider that all the recent major advances in datasets had to do with _unsupervised_ reward models, synthetic, generational datasets, and new advanced alignment methods. The big labs _are_ hiring serious PhD level researchers, and most of these are physicists, Bayesians of many kind and breed, not "domain experts." However, perception matters a lot these days; some labs, I won't point, but OpenAI is probably the biggest offender, simply cannot control themselves. The fact of the matter is they LOVE including the public evals in their finetuning, as it makes them appear stronger in the "benchmarks."
> like, translating a program that is unlikely to have been included in the training set verbatim, from a lesser-known language—to a well-known language, and back.
I would exactly want to see that, or "make a little interpreter for a basic subset of C, or Scheme or <X>".
So far, non-English inputs have been most telling: I deal with Ukrainian datasets mostly, and what we see is OpenAI models, the Chinese models, of course, and Llama's, to admittedly, lesser extent—all degrading disproportionately compared to the other models. You know what model degrades the least comparatively? Gemma 27b. The arena numbers would suggest it's not so strong, but they'd actually managed to make something useful for the whole world (I can only judge re: Ukrainian, of course, but I suspect it's probably equally good in the other languages, too.) However, nothing can compete currently with Sonnet 3.5 in reasoning. I predict a surge in the private eval scene when people inevitably grow wary of leaderboard-propaganda.
I heard conflicting things about it. Some claim it was trained so it can do well on benchmarks and in real world scenarios it's lacking. Can somebody deny/confirm ?
If your benchmark covers all possible programming tasks then you dont need an llm, you need search over your benchmark.
Hypothetically let's say the benchmark contains "test divisibility of this integer by n" for all n of the form 3x+1. An extremely overfit llm won't be able to code divisibility for all n not of the form 3x+1, and your benchmark will never tell.
No, because solving a well defined problem with well defined right or wrong is generally not what people use llm for. Most of the times my query to llm is underspecified, and lot of time I figure out the problem when chatting with LLM.
And benchmark by definition only measures just right/wrong answer.
This is called Goodhart's law, who said: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."
But in modern usage it is often rephrased to: "When a measure becomes a target, it ceases to be a good measure"
It's small... And for that size, it does very well. Been using it a few days and it's quite good for it's size and the fact you can run it locally. So not sure if it's true what you say; for us it works really well.
I tried the Qwen2.5 32B a couple weeks ago. It was amazing for a model that can run on my laptop but far from Claude/GPT-4o. I am downloading the coder tuned version now.
I like the idea of offline LLMs but in practice there's no way I'm wasting battery life on running a Language Model.
On a desktop too, I wonder if it's worth the additional stress and heat on my GPU as opposed to one somewhere in a datacenter which will cost me a few dollars per month, or a few cents per hour if I spin up the infra myself on demand.
Super useful for confidential / secret work though
In my experience, a lot of companies still have rules against using tools like Copilot due to security and copyright concerns, even though many software engineers just to ignore them.
This could be a way to satisfy both sides, although it only solves the issue of sending internal data to companies like OpenAI, it doesn't solve the "we might accidentally end up with somebody else's copyrighted code in our code base" issue.
Best thing about it is that it's an OSS model that can be hosted by anyone, resulting in an open competitive market bringing hosting costs down, currently sitting at $0.18/$0.18 M tok/s [1] making it 50x cheaper than Sonnet 3.5 and ~17x cheaper than Haiku 3.5.
[1] https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct