
Any specific model recommendations for running locally?

Also, what tasks are you using them for?



Phi 4. It's fast and reasonable enough, but with local models you have to know what you want to do. If you want a chatbot, you use something with Hermes tunes; if you want code, you want a coder model. A lot of people like the DeepSeek distill of Qwen instruct for coding.

There's no local equivalent to "does everything kinda well" like ChatGPT or Gemini, except maybe the 70B-and-larger models, but those are slow without datacenter cards with enough VRAM to hold them.

I just asked your very question a day or two ago, because I put a machine back together with a 3060 12GB and wondered what the state of the art was at that amount of VRAM.

If you use LM Studio, it will auto-pick which quantization of a model to download, though you can pick a larger quant if you want: you choose a model and a parameter size, and it chooses the "best" quantization for your hardware. Generally.
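The rule of thumb behind that auto-pick, if you want to eyeball it yourself: weight memory is roughly (parameter count x bits per weight / 8), plus some headroom for context/KV cache. A rough sketch, where the overhead number is an assumption on my part:

    def fits(params_b: float, bits_per_weight: float, vram_gb: float,
             overhead_gb: float = 1.5) -> bool:
        # 1e9 params at 8 bits/weight is ~1 GB of weights.
        weights_gb = params_b * bits_per_weight / 8
        return weights_gb + overhead_gb <= vram_gb

    # 14B at Q6 (~6.6 bits/weight) on a 12 GB 3060: doesn't fit.
    print(fits(14, 6.6, 12))   # False
    # Same model at Q4_K_M (~4.8 bits/weight): fits.
    print(fits(14, 4.8, 12))   # True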


Thank you for the insightful reply

> There's no equivalent to "does everything kinda well" like chatgpt or Gemini on local, except maybe the 70B and larger, but those are slow

Is there something like a “prompt router”, that can automatically decide what model to use based on the type of prompt/task?


there's RouteLLM: https://github.com/lm-sys/RouteLLM

nvidia has LLMRouter https://build.nvidia.com/nvidia/llm-router

llama-index also supports routing https://docs.llamaindex.ai/en/stable/examples/low_level/rout...

semantic router seems interesting https://github.com/aurelio-labs/semantic-router/

you could also just use langchain to route https://jimmy-wang-gen-ai.medium.com/llm-router-in-langchain...

there's also an interesting paper, PickLLM: Context-Aware RL-Assisted Large Language Model Routing:

https://arxiv.org/abs/2412.12170
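if you just want the core idea without a framework: a router is only a classifier in front of a table of models. a minimal keyword-based sketch (the categories, patterns, and model names are placeholders i made up, not any of the above libraries' APIs; the libraries replace the heuristic with embeddings or a learned classifier):

    import re

    # Placeholder routing table: task type -> local model name.
    ROUTES = {
        "code": "qwen2.5-coder-32b",
        "chat": "hermes-3-llama-3.1-8b",
    }

    # Crude keyword heuristic standing in for a real classifier.
    CODE_HINTS = re.compile(
        r"\b(code|function|bug|compile|regex|script|stack trace)\b", re.I)

    def route(prompt: str) -> str:
        return ROUTES["code"] if CODE_HINTS.search(prompt) else ROUTES["chat"]

    print(route("Write a function to parse ISO dates"))  # -> qwen2.5-coder-32b
    print(route("What should I make for dinner?"))       # -> hermes-3-llama-3.1-8b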


I have a machine with 3x 1080 Ti's that I do a batch with, sending the question first to an LLM and then to an LRM, returning to review the faster results and killing the slower job if they are acceptable. Ollama, or just llama.cpp on podman, makes this trivial.
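A sketch of the pattern against Ollama's REST API, if that helps (the model names and the acceptance check are placeholders; an already-running request can't really be killed from this side, only abandoned):

    import concurrent.futures
    import requests

    OLLAMA = "http://localhost:11434/api/generate"

    def ask(model: str, prompt: str) -> str:
        # Non-streaming generate call against a local Ollama instance.
        r = requests.post(OLLAMA, json={"model": model, "prompt": prompt,
                                        "stream": False}, timeout=600)
        r.raise_for_status()
        return r.json()["response"]

    def acceptable(answer: str) -> bool:
        # Stand-in for the manual review step.
        return len(answer.strip()) > 0

    def race(prompt: str, fast_model: str, slow_model: str) -> str:
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        fast = pool.submit(ask, fast_model, prompt)
        slow = pool.submit(ask, slow_model, prompt)
        answer = fast.result()
        if acceptable(answer):
            # The slow request keeps running in the background; we just
            # stop waiting for it.
            pool.shutdown(wait=False, cancel_futures=True)
            return answer
        result = slow.result()
        pool.shutdown(wait=False)
        return result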

But knowing in advance which model will do better is impossible; only broad heuristics, which may or may not be correct for any individual prompt, can be used.

While there are better options if you were buying today, an old system with out-of-date GPUs works well in this batch model.

gemma-3-27b-it-Q6_K_L works fine on these, and that, mixed with an additional submission to DeepSeek-R1-Distill-Qwen-32B, is absolutely fine on a system that would otherwise just be shut down.

I have a very bright line about preventing inter-customer data-leakage risk, which may be irrational, but with that mixture I find I am better off looking at scholarly papers than trying the commercial models.

My primary task is FP64-throughput limited, so I am stuck on the Titan V; the fact that it is ~6 times faster than the 4090 and ~5 times faster than the 5090 at FP64 is the only reason I don't have newer GPUs.

You can add 4x 1080 Ti at a 200 W power limit with common PSUs and get the memory, but performance is limited by the PCIe bus past 3x 1080 Ti.

As they seem to sell for about the same price, I would probably buy the Titan V today, but the point is that if you are fine with even smaller models, you can run queries in parallel or even cross-verify, which dramatically helps with planning tasks, even compared with the foundation models.

But series/parallel runs do a lot, and if you are using them for code, running a linter etc. on the structured output saves a lot of time evaluating the multiple responses.
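For Python output, even a bare syntax check filters a surprising amount before you read anything. A sketch of that filter (it assumes responses carry markdown-fenced code blocks; swap ast.parse for a real linter like pyflakes or ruff as needed):

    import ast
    import re

    def code_blocks(response: str) -> list[str]:
        # Pull fenced code blocks out of a model response.
        return re.findall(r"```(?:python)?\n(.*?)```", response, re.DOTALL)

    def parses(source: str) -> bool:
        try:
            ast.parse(source)
            return True
        except SyntaxError:
            return False

    def keep_valid(responses: list[str]) -> list[str]:
        # Keep only responses whose every code block at least parses.
        return [r for r in responses
                if all(parses(b) for b in code_blocks(r))]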

No connection to them at all, but bartowski on hugging face puts a massive amount of time and effort into re-quantizing models.

If you don't have a restriction like my FP64 need, you can get 70B models running on two 24 GB GPUs without much 'cost' to accuracy.

That would be preferable to a router IMHO.


> My primary task is FP64-throughput limited, so I am stuck on the Titan V; the fact that it is ~6 times faster than the 4090 and ~5 times faster than the 5090 at FP64 is the only reason I don't have newer GPUs.

Interesting. Very interesting. Why FP64 as opposed to BF16? A different sort of model? I don't even know where to find FP64 models (not that I've looked).

Also, Bartowski may be on Hugging Face, but they're also part of the LM Studio group and frequently chat on that Discord. Actually, at least 3 of the main model-converter/quant people are on that Discord.

I haven't got two 24GB cards yet, but maybe soon, with the way people are hogging the 5000 series.

edit: I realize they're increasing the marketing FLOPS by halving the precision; the current-gen stuff is all "fast" at FP16 (or BF16, brain float 16-bit). So when Nvidia finishes and releases a card with double the FLOPS at 8-bit, will that card be 8 times slower at FP64?


My primary task isn't ML, and 64-bit is needed for numerical stability.

For the Titan V, FP64 was 1/2 the FP32 rate; it was the last consumer generation to have that.

For the Titan RTX and newer NVIDIA cards, FP64 is typically only 1/32 to 1/64 of the FP32 rate.

So the Titan RTX, with 16 FP32 TFLOPS, drops to 0.5 FP64 TFLOPS,

while the Titan V, starting at 15 FP32 TFLOPS, still has 7.5 FP64 TFLOPS.

The 5090 has 104.9 FP16/32 TFLOPS, but only 1.64 FP64 TFLOPS.

Basically Nvidia decided most people didn't need FP64, and chose to improve quantized performance instead.

If you can run on a GPU, that Titan V has more 64-bit FLOPS than even an AMD Ryzen Threadripper PRO 7995WX.
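If you want to eyeball those ratios on your own card, a big float64 matmul gets reasonably close to peak. A sketch assuming CuPy is installed against your CUDA toolkit:

    import time
    import cupy as cp

    n = 8192
    a = cp.random.rand(n, n)   # CuPy defaults to float64
    b = cp.random.rand(n, n)
    _ = a @ b                  # warm-up
    cp.cuda.Device().synchronize()

    t0 = time.perf_counter()
    c = a @ b
    cp.cuda.Device().synchronize()
    dt = time.perf_counter() - t0
    print(f"~{2 * n**3 / dt / 1e12:.2f} FP64 TFLOPS")  # 2n^3 FLOPs per matmul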


While researching this, I discovered another fast-FP64 card: the R9 280X by AMD/ATI, although the memory is weak, only 3 GB! But I suppose if you need the numerical accuracy, there's always that, and those cards go for about $40 (in the US, on eBay, sold listings), compared to $400 for the Titan. If you need 4x the RAM, though, I guess you're stuck paying 10x the price!


I mostly like to evaluate them whenever I ask a remote model (Claude 3.7, ChatGPT 4.5), to see how far they have progressed. From my tests, Qwen 2.5 Coder 32B is still the best local model for coding tasks. I've also tried Phi 4, Nemotron, Mistral Small, and QwQ 32B. I'm using a MacBook Pro M4 with 48 GB RAM.



