I tried that Unsloth R1 quantization on my dual Xeon Gold 5218 with 384 GB DDR4-2666 (only about half of the memory channels are used, so the setup is not optimal).
Type IQ2_XXS / 183GB, 16k context:
CPU only: 3 t/s (tokens per second) for PP (prompt processing) and 1.44 t/s for response.
CPU + NVIDIA RTX 70GB VRAM: 4.74 t/s for PP and 1.87 t/s for response.
I wish Unsloth would produce a similar quantization for DeepSeek V3. It would be more useful, as it doesn't need reasoning tokens, so even at the same t/s it would be faster overall.
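To put rough numbers on that last point, here is a minimal sketch of the arithmetic. The 1.87 t/s rate is the CPU + GPU figure measured above; the token counts are made-up illustrative values, not measurements, and the function name is just something I picked for this example.

```python
def generation_seconds(answer_tokens, reasoning_tokens, tokens_per_second):
    """Wall-clock time to generate a reply, counting reasoning tokens too."""
    return (answer_tokens + reasoning_tokens) / tokens_per_second


RESPONSE_TPS = 1.87      # measured above (CPU + GPU, response generation)
ANSWER_TOKENS = 300      # hypothetical length of the visible answer
REASONING_TOKENS = 1000  # hypothetical length of R1's reasoning before the answer

# Same generation speed for both; only the number of generated tokens differs.
r1_seconds = generation_seconds(ANSWER_TOKENS, REASONING_TOKENS, RESPONSE_TPS)
v3_seconds = generation_seconds(ANSWER_TOKENS, 0, RESPONSE_TPS)

print(f"With reasoning tokens:    {r1_seconds / 60:.1f} min")
print(f"Without reasoning tokens: {v3_seconds / 60:.1f} min")
```

At a fixed t/s the total time scales directly with the number of generated tokens, so the real gap depends entirely on how long the reasoning turns out to be for a given prompt.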