>Does this mean it may be possible to self-host a ChatGPT clone assuming you have a 70B model?

Not only possible but quite easy. Inference for a 70B model can be done with llama.cpp on CPU only, on commodity hardware with more than 64 GB of RAM (a 4-bit quantized 70B fits in roughly 40 GB).
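
A rough sketch of what that looks like (model filename, prompt, and thread count are placeholders; check the llama.cpp README for your version's exact binary names):

  # build llama.cpp (CPU-only is the default backend)
  cmake -B build && cmake --build build --config Release

  # run a 4-bit quantized 70B GGUF on the CPU
  ./build/bin/llama-cli \
    -m models/llama-2-70b.Q4_K_M.gguf \
    -p "Explain how a transformer works." \
    -n 256 -t 8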




I have 64 GB on my 5-year-old ThinkPad. What kind of performance (tokens per second) could I expect on that nowadays for a 70B model?


llama.cpp's speed is dramatically improved by AVX instructions. If your CPU has them, it will run much faster than without.

And if it doesn't, you'll need some workarounds at compile time, and it gets a bit harder to run. You can check support and disable those code paths as sketched below.
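
For anyone wondering: on Linux you can check AVX support from /proc/cpuinfo, and turn off the AVX paths when building. A sketch, assuming the current GGML_* CMake options (older llama.cpp versions used LLAMA_* instead):

  # check which AVX variants your CPU reports
  grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u

  # build with the AVX code paths disabled
  cmake -B build -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF
  cmake --build build --config Release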



