Hacker News

> Please tell me your config! I have an i9-10900 with 32 GB of RAM that only gets 0.7 tokens/s on a 30B model

Have you quantized it?




The model I have is q4_0; I think that's 4-bit quantized.

I'm running on Windows using koboldcpp; maybe it's faster on Linux?


I am running Linux with cublast offload, and I am using the new 3-bit quant that was pulled in a day or two ago.


Thanks! I'll have to try the 3-bit quant to see if that helps.


cuBLAS or CLBlast? There is no such thing as "cublast".
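For anyone confused by the distinction: koboldcpp exposes the two backends as separate launch flags. A minimal sketch, assuming a build from around this time; the paths and layer count are placeholders, so verify the exact flags against `python koboldcpp.py --help` on your version:

```shell
# cuBLAS: NVIDIA-only GPU offload via CUDA
python koboldcpp.py model-q4_0.bin --usecublas --gpulayers 24

# CLBlast: OpenCL-based, works on AMD/Intel/NVIDIA GPUs;
# the two numbers pick the OpenCL platform and device
python koboldcpp.py model-q4_0.bin --useclblast 0 0 --gpulayers 24
```

On a 10900 (no discrete GPU mentioned), `--gpulayers` only helps if there is actually a GPU to offload to; otherwise the backend choice matters much less than the quant size.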


> The model I have is q4_0; I think that's 4-bit quantized.

That's correct, yeah. Q4_0 should be the smallest and fastest quantized model.

> I'm running on Windows using koboldcpp; maybe it's faster on Linux?

Possibly. You could try using WSL to test; I think both WSL1 and WSL2 are faster than native Windows (but WSL1 should be faster than WSL2).
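If you want to produce a q4_0 file yourself rather than download one, llama.cpp ships a quantize tool for this. A rough sketch, with placeholder paths, assuming a build from around this time (older builds took a numeric type code instead of the name `q4_0`):

```shell
# Convert an f16 GGML model down to 4-bit q4_0
# (the smallest/fastest of the classic quant formats).
./quantize models/30B/ggml-model-f16.bin models/30B/ggml-model-q4_0.bin q4_0
```

The newer 3-bit quants mentioned upthread are selected the same way, just with a different type name in the last argument.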


I didn't know what WSL was, but now I do, thanks for the tip!
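For others who haven't used it: on recent Windows 10/11 builds, WSL can be installed with a single command from an elevated PowerShell or Command Prompt (the distribution name here is just an example):

```shell
# Installs WSL plus the named Linux distribution; reboot when prompted.
wsl --install -d Ubuntu
```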



