
It's been on my list to do a proper shootout of all the various new quant formats floating around (my list here: https://llm-tracker.info/books/llms/page/quantization-overvi...), but a lot of them don't have very good production code yet (e.g., a few months ago, when I tried OmniQuant, some of the important bits of code weren't even included and had to be obtained directly from the authors: https://llm-tracker.info/books/llms/page/omniquant).

If you're looking for the best widely deployed quant format atm, it's probably ExLlamaV2's EXL2 - it supports arbitrary bpw w/ a calibration file, and also has 8-bit KV cache support. I haven't tested EXL2 much at lower bpws though.
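
For reference, loading an EXL2 model w/ the 8-bit cache via the exllamav2 Python package looks roughly like this (a sketch from memory - class/method names may have shifted between versions, and the model path is a placeholder):

    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    config = ExLlamaV2Config()
    config.model_dir = "/path/to/model-exl2-4.0bpw"  # placeholder path
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit KV cache instead of fp16
    model.load_autosplit(cache)

    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
    settings = ExLlamaV2Sampler.Settings()
    print(generator.generate_simple("Hello, my name is", settings, 64))

The quantization itself goes through the repo's convert.py against a calibration dataset (parquet), where IIRC -b sets the target bits per weight.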

Note, both llama.cpp and AirLLM allow layer offloading to system memory (or in AirLLM's case, even to disk?!).
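
E.g. w/ the llama-cpp-python bindings (a sketch - the model path is a placeholder), n_gpu_layers controls how many layers get offloaded to the GPU while the rest stay in system RAM:

    from llama_cpp import Llama

    # Offload only 20 layers to the GPU; remaining layers stay in system memory.
    # -1 would offload everything, 0 keeps the whole model on the CPU.
    llm = Llama(model_path="/path/to/model-q4_k_m.gguf", n_gpu_layers=20)
    print(llm("Q: What is a quant format? A:", max_tokens=64)["choices"][0]["text"])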

r/LocalLLaMA is probably the best place to search if you're looking for people's experiences w/ quants. I know some people have been testing, e.g.: https://www.reddit.com/r/LocalLLaMA/comments/17klaa5/tested_...




> https://llm-tracker.info/books/llms/page/quantization-overvi...

This is a very cool resource, thanks!

Gems like this, even in areas I follow pretty closely, are why I keep coming back to HN.


i humbly request you to add mlc-llm to your quant test when/if you get around to doing it


Sure - I think their quant format is pretty basic, something similar to bnb q4. My plan is to script a framework for testing, so I should do that as well, since the omniquant implementation is in mlc-llm anyway.
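
(By "basic" I mean roughly group-wise round-to-nearest 4-bit with a per-group scale - a toy numpy sketch of the idea, not mlc-llm's or bitsandbytes' actual code:)

    import numpy as np

    def quantize_groupwise_q4(w, group_size=32):
        # Symmetric round-to-nearest 4-bit, one fp scale per group of weights.
        w = w.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # symmetric int4 range: -7..7
        q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096 * 4096).astype(np.float32)
    q, s = quantize_groupwise_q4(w)
    err = np.abs(dequantize(q, s).reshape(-1) - w).mean()
    print(f"mean abs quantization error: {err:.5f}")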


i was trying to get this to work with mlc-llm. i'd appreciate any pointers


more specifically, on a non-CUDA GPU - Mali on an Orange Pi via OpenCL



