If you're looking for the best widely deployed quant format atm, it's probably ExLlamaV2's EXL2 - it supports arbitrary bpw w/ a calibration file, and also has 8-bit KV cache support. I haven't tested EXL2 much at lower bpws though.
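For reference, here's a minimal sketch of what loading an EXL2 model w/ the 8-bit KV cache looks like through the exllamav2 Python API - class names are from recent versions of the library, and the model path is made up, so treat it as illustrative rather than a recipe:

    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    config = ExLlamaV2Config()
    config.model_dir = "/models/llama2-13b-exl2-4.65bpw"  # hypothetical EXL2 model dir
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit KV cache instead of FP16
    model.load_autosplit(cache)                    # split layers across available GPUs

    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.8

    print(generator.generate_simple("Quantization is", settings, num_tokens=64))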
Note, both llama.cpp and AirLLM allow layer offloading to system memory (or in AirLLM's case, even to disk?!).
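In llama.cpp that's the -ngl/--n-gpu-layers flag; through the llama-cpp-python bindings it looks roughly like this (model path is made up - whatever layers you don't offload to VRAM stay in system RAM and run on the CPU):

    from llama_cpp import Llama

    llm = Llama(
        model_path="/models/llama-2-13b.Q4_K_M.gguf",  # hypothetical GGUF file
        n_gpu_layers=20,  # offload 20 layers to the GPU; the rest stay in system memory
    )

    out = llm("Quantization is", max_tokens=64)
    print(out["choices"][0]["text"])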
Sure, I think their quant format is pretty basic - something similar to bnb q4. My plan is to script a framework for testing, so I should test that as well, since the OmniQuant implementation is in mlc-llm anyway.
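The core of that kind of harness is just a perplexity loop you point at each quantized variant in turn - something like this bare-bones sketch (model id and eval file are placeholders; lower perplexity vs. the fp16 baseline = less quality lost to quantization):

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # swap in each quantized variant to compare
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

    ids = tok(open("wikitext_test.txt").read(), return_tensors="pt").input_ids.to(model.device)

    block, nll, n_tokens = 2048, 0.0, 0
    for i in range(0, ids.size(1) - 1, block):
        chunk = ids[:, i : i + block]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            out = model(chunk, labels=chunk)  # HF shifts labels internally
        n = chunk.size(1) - 1                 # number of positions actually scored
        nll += out.loss.item() * n            # out.loss is the mean NLL over the chunk
        n_tokens += n

    print("perplexity:", math.exp(nll / n_tokens))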
r/LocalLlama is probably the best place to search if you're looking for people's experiences w/ quants. I know some people have been testing, e.g.: https://www.reddit.com/r/LocalLLaMA/comments/17klaa5/tested_...