Phi 4. It's fast and reasonable enough, but with local models you have to know what you want to do. If you want a chatbot, you use something with Hermes tunes; if you want code, you want a coder model - a lot of people like the DeepSeek distill of Qwen for coding.
There's no local equivalent to a "does everything kinda well" model like ChatGPT or Gemini, except maybe the 70B and larger models, but those are slow without datacenter cards with enough VRAM to hold them.
I just asked your very question a day or two ago, because I put back together a machine with a 3060 12GB and wondered what SOTA was at that amount of VRAM.
If you use LM Studio it will auto-pick which of the quantized models to get, but you can pick a larger quant if you want: you pick a model and a parameter size and it chooses the "best" quantization for your hardware. Generally.
I have a machine with 3x 1080 Ti's that I do batch work with: send the question first to an LLM and then to an LRM, come back to review the faster result, and kill the slower job if the fast answer is acceptable. Ollama, or just llama.cpp on podman, makes this trivial.
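A minimal sketch of that race pattern (the names, the `accept` check, and the backend callables are all my assumptions - in practice each backend would be an HTTP call to a llama.cpp server or Ollama instance):

```python
import concurrent.futures

def race(prompt, accept, backends):
    """Send `prompt` to every backend at once. `backends` maps a name
    ("llm" = fast non-reasoning model, "lrm" = slow reasoning model)
    to a callable that takes the prompt and returns the answer.
    If the fast answer passes `accept`, drop the slow job."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, prompt)
                   for name, fn in backends.items()}
        fast = futures["llm"].result()      # review the faster result first
        if accept(fast):
            # cancel() only works if the LRM hasn't started running yet;
            # with an HTTP backend you'd abort the request instead
            futures["lrm"].cancel()
            return "llm", fast
        return "lrm", futures["lrm"].result()
```

The backends are plain callables here so the control flow is easy to see; swapping in real HTTP requests to two local endpoints doesn't change the shape of the code.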
But knowing ahead of time which model will be better is impossible; all you have are broad heuristics that may or may not be correct for any individual prompt.
While there are better options if you were buying today, an old, out-of-date system with out-of-date GPUs works well in this batch model.
gemma-3-27b-it-Q6_K_L works fine with these, and that, mixed with an additional submit to DeepSeek-R1-Distill-Qwen-32B, is absolutely fine on a system that would otherwise just be shut down.
I have a very bright line about preventing inter-customer leakage risk - it may be irrational, but with that mixture I find I am better off looking at scholarly papers than trying the commercial models.
My primary task is FP64-throughput limited, and thus I am stuck on the Titan V: it is ~6x faster than the 4090 and ~5x faster than the 5090 at FP64, which is the only reason I don't have newer GPUs.
You can add 4x 1080 Ti at a 200 W power limit with common PSUs and get the memory, but past 3x 1080 Ti performance is limited by the PCIe bus.
As they seem to sell for the same price, I would probably buy the Titan V today. But the point is: if you are fine with even smaller models, you can run queries in parallel or even cross-verify, which dramatically helps with planning tasks, even compared to the foundation models.
But series/parallel runs do a lot, and if you are using them for code, running a linter etc. on the structured output saves a lot of time evaluating the multiple responses.
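A sketch of that filtering step (the fence regex and using `ast.parse` as a cheap stand-in for a real linter are my assumptions; it presumes the models were asked to answer with a single fenced Python block):

```python
import ast
import re

# crude extractor for a fenced python block in a model response
FENCE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code(response: str) -> str:
    m = FENCE.search(response)
    return m.group(1) if m else response

def syntactically_ok(code: str) -> bool:
    # ast.parse only checks syntax; swap in ruff/pylint for deeper checks
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def keep_lintable(responses):
    """Drop candidate responses whose code doesn't even parse,
    so you only hand-review the plausible ones."""
    return [r for r in responses if syntactically_ok(extract_code(r))]
```

Cheap mechanical checks like this cut the multi-response pile down before any human (or second model) looks at it.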
No connection to them at all, but bartowski on Hugging Face puts a massive amount of time and effort into re-quantizing models.
If you don't have a restriction like my FP64 need, you can get 70B models running on two 24GB GPUs without much 'cost' to accuracy.
> My primary task is FP64-throughput limited, and thus I am stuck on the Titan V: it is ~6x faster than the 4090 and ~5x faster than the 5090 at FP64, which is the only reason I don't have newer GPUs.
Interesting. Very interesting. Why FP64 as opposed to BF16? A different sort of model? I don't even know where to find FP64 models (not that I've looked).
Also, Bartowski may be on Hugging Face, but they're also part of the LM Studio group and frequently chat on that Discord. Actually, at least 3 of the main model-converter/quant people are on that Discord.
I haven't got two 24GB cards yet, but maybe soon, with the way people are hogging the 5000 series.
Edit: I realize they're increasing the marketing FLOPS by halving the precision - the current-gen stuff is all "fast" at FP16 (or BF16, brain-float 16-bit). So when Nvidia finishes and releases a card with double the FLOPS at 8-bit, will that card be 8 times slower at FP64?
While researching this I discovered another fast FP64 card: the R9 280X by AMD/ATI. The memory is weak, though - only 3GB! But I suppose if you need the numerical accuracy, there's always that, and those cards are like $40 (in the US, on eBay, sold listings), compared to $400 for the Titan. If you need 4x the RAM, though, I guess you're stuck paying 10x the price!
I mostly like to evaluate them whenever I ask a remote model (Claude 3.7, ChatGPT 4.5), to see how far they have progressed. From my tests, Qwen 2.5 Coder 32B is still the best local model for coding tasks. I've also tried Phi 4, Nemotron, Mistral Small, and QwQ 32B. I'm using a MacBook Pro M4 with 48GB RAM.
Also, what tasks are you using them for?