The dynamic quantization[1] looks really interesting. Now, I've just been dabbling, but did I understand correctly that this dynamic quantization is compatible with GGUF? If so, how do you convert it? Just the standard way, or is there something extra?
I was really curious to try the dynamic 4-bit version of the Llama-3.2 11B Vision model, as I found the Q8 variant much better than the standard Q4_K_M variant in certain cases, but Q8 doesn't fully fit my GPU, so it's significantly slower.
Oh, the dynamic 4-bit quants sadly are not GGUF compatible yet - they currently work through Hugging Face transformers, Unsloth, and other trainers.
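Loading one through transformers is just the normal flow, since the quantization config ships inside the checkpoint. A minimal sketch (the repo id below is illustrative, and you need bitsandbytes + accelerate installed):

    # Pre-quantized bnb-4bit checkpoints load like any other model;
    # the quantization config is read from the repo's config.json.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit"  # illustrative repo id
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)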
My goal was to make a dynamic quant for GGUF as well - it's just a tad complicated to select which layers to quantize and which to skip with GGUF - I might have to manually edit the llama.cpp quantize C file. The selection idea itself is roughly like the sketch below.
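A toy sketch of the idea (not llama.cpp code - just illustrating how you'd flag which tensors to keep in higher precision):

    import torch

    def pick_layers_to_skip(state_dict, threshold=0.05):
        """Flag tensors whose 4-bit round-trip error is too high."""
        skip = []
        for name, w in state_dict.items():
            if w.ndim < 2:
                continue                    # leave norms / biases alone
            scale = w.abs().amax() / 7      # crude symmetric int4 scale
            if scale == 0:
                continue
            q = (w / scale).round().clamp(-8, 7) * scale
            err = (w - q).norm() / w.norm()
            if err > threshold:
                skip.append(name)           # keep this one in higher precision
        return skip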
Also, I'm unsure whether llama.cpp as of 11th Jan 2025 supports Llama Vision (or maybe it's new?). I do remember Qwen / LLaVA type models are working.
> Oh the dynamic 4bit quants sadly are not GGUF compatible yet
Ah bummer. Is this a GGUF file-format issue or mostly "just" a code-doesn't-exist issue?
> Also I'm unsure yet if llama.cpp as of 11th Jan 2025 supports Llama Vision (or maybe it's new?)
Ah, I totally forgot Ollama did that on their own and didn't merge upstream.
I'm using Ollama because it was so easy to get running on my main Windows rig, so I can take advantage of my GPU there (I still do a bit of gaming), while everything that uses Ollama for inference runs on my server.
> Ah bummer. Is this a GGUF file-format issue or mostly "just" a code-doesn't-exist issue?
Just code! Technically I was working on dynamic quants for DeepSeek V3 (200GB in size), which should increase accuracy by a lot for a 2-bit model (if you leave attention in 4-bit) while using just 20GB more. But I'm still working on it!
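Back-of-the-envelope on that trade-off (the parameter count is roughly DeepSeek V3's; the attention share is an assumption, not measured):

    # size in GB for a given parameter count and bit width
    TOTAL_PARAMS = 671e9     # DeepSeek V3, roughly
    ATTN_FRACTION = 0.10     # assumed share of weights in attention

    def size_gb(params, bits):
        return params * bits / 8 / 1e9

    all_2bit = size_gb(TOTAL_PARAMS, 2)
    mixed = (size_gb(TOTAL_PARAMS * (1 - ATTN_FRACTION), 2)
             + size_gb(TOTAL_PARAMS * ATTN_FRACTION, 4))
    print(f"all 2-bit: {all_2bit:.0f} GB, attention in 4-bit: {mixed:.0f} GB")
    # -> roughly 168 GB vs 185 GB, i.e. on the order of 20GB extra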
> Ah, I totally forgot Ollama did that on their own and didn't merge upstream.
Yep, they have Llama Vision support! llama.cpp has Qwen and LLaVA support - I think Llama Vision support is coming, but it'll take much more time: the arch is vastly different from normal transformers due to cross attention.
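A much-simplified sketch of what makes it different: the vision variant interleaves cross-attention layers where the text stream queries the image-encoder outputs, rather than a plain self-attention decoder stack:

    import torch
    import torch.nn as nn

    class CrossAttentionBlock(nn.Module):
        def __init__(self, dim, n_heads):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

        def forward(self, text_hidden, image_features):
            # queries come from text, keys/values from the vision encoder
            out, _ = self.attn(text_hidden, image_features, image_features)
            return text_hidden + out  # residual connection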
[1]: https://unsloth.ai/blog/dynamic-4bit