The dynamic quantization[1] looks really interesting. Now, I've just been dabbling, but did I understand correctly that this dynamic quantization is compatible with GGUF? If so, how do you convert it? Just the standard way, or is there something extra?
I was really curious to try the dynamic 4-bit version of the Llama-3.2 11B Vision model, as I found the Q8 variant much better than the standard Q4_K_M variant in certain cases, but Q8 doesn't fully fit my GPU, so it's significantly slower.
Oh, the dynamic 4-bit quants sadly are not GGUF compatible yet - they currently work through Hugging Face transformers, Unsloth, and other trainers.
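Loading one through transformers is just the normal flow, since the quantization config ships inside the checkpoint. A minimal sketch (the repo id below is illustrative, and you need bitsandbytes + accelerate installed):

    # Pre-quantized bnb-4bit checkpoints load like any other model;
    # the quantization config is read from the repo's config.json.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit"  # illustrative repo id
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)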
My goal was to make a dynamic quant for GGUF as well - it's just a tad complicated to select which layers to quantize and which to skip with GGUF - I might have to manually edit the llama.cpp quantize C file. The selection idea itself is roughly like the sketch below.
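A toy sketch of the idea (not llama.cpp code - just illustrating how you'd flag which tensors to keep in higher precision):

    import torch

    def pick_layers_to_skip(state_dict, threshold=0.05):
        """Flag tensors whose 4-bit round-trip error is too high."""
        skip = []
        for name, w in state_dict.items():
            if w.ndim < 2:
                continue                    # leave norms / biases alone
            scale = w.abs().amax() / 7      # crude symmetric int4 scale
            if scale == 0:
                continue
            q = (w / scale).round().clamp(-8, 7) * scale
            err = (w - q).norm() / w.norm()
            if err > threshold:
                skip.append(name)           # keep this one in higher precision
        return skip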
Also, I'm unsure whether llama.cpp as of 11th Jan 2025 supports Llama Vision (or maybe it's new?). I do remember Qwen / LLaVA type models are working.
> Oh the dynamic 4bit quants sadly are not GGUF compatible yet
Ah bummer. Is this a GGUF file-format issue or mostly "just" a code-doesn't-exist issue?
> Also I'm unsure yet if llama.cpp as of 11th Jan 2025 supports Llama Vision (or maybe it's new?)
Ah, I totally forgot Ollama did that on their own and didn't merge upstream.
I'm using Ollama because it was so easy to get running on my main Windows rig, so I can take advantage of my GPU there (I still do a bit of gaming), while everything that uses Ollama for inference runs on my server.
> Ah bummer. Is this a GGUF file-format issue or mostly "just" a code-doesn't-exist issue?
Just code! Technically I was working on dynamic quants for DeepSeek V3 (200GB in size), which should increase accuracy by a lot for a 2-bit model (if you leave attention in 4-bit) while using just 20GB more. But I'm still working on it!
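Back-of-the-envelope on that trade-off (the parameter count is roughly DeepSeek V3's; the attention share is an assumption, not measured):

    # size in GB for a given parameter count and bit width
    TOTAL_PARAMS = 671e9     # DeepSeek V3, roughly
    ATTN_FRACTION = 0.10     # assumed share of weights in attention

    def size_gb(params, bits):
        return params * bits / 8 / 1e9

    all_2bit = size_gb(TOTAL_PARAMS, 2)
    mixed = (size_gb(TOTAL_PARAMS * (1 - ATTN_FRACTION), 2)
             + size_gb(TOTAL_PARAMS * ATTN_FRACTION, 4))
    print(f"all 2-bit: {all_2bit:.0f} GB, attention in 4-bit: {mixed:.0f} GB")
    # -> roughly 168 GB vs 185 GB, i.e. on the order of 20GB extra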
> Ah, I totally forgot Ollama did that on their own and didn't merge upstream.
Yep, they have Llama Vision support! llama.cpp has Qwen and LLaVA support - I think Llama Vision support is coming, but it'll take much more time: the arch is vastly different from normal transformers due to cross attention.
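A much-simplified sketch of what makes it different: the vision variant interleaves cross-attention layers where the text stream queries the image-encoder outputs, rather than a plain self-attention decoder stack:

    import torch
    import torch.nn as nn

    class CrossAttentionBlock(nn.Module):
        def __init__(self, dim, n_heads):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

        def forward(self, text_hidden, image_features):
            # queries come from text, keys/values from the vision encoder
            out, _ = self.attn(text_hidden, image_features, image_features)
            return text_hidden + out  # residual connection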
[1]: https://unsloth.ai/blog/dynamic-4bit