
> Just ran Llama-2 (without a GPU) and it gave me coherent responses in 3 minutes (which is extremely fast for no GPU). How does this work?

It should be much faster with llama.cpp. My old-ish laptop CPU (AMD 4900HS) can ingest a big prompt reasonably quickly and then stream text fast enough to (slowly) read.

If you have any kind of discrete GPU, even a small laptop one, prompt ingestion is dramatically faster.

Try the latest Kobold release: https://github.com/LostRuins/koboldcpp

But to answer your question: the GGML CPU implementation is very good, and generating the response is inherently serial (each token depends on the previous ones), so it ends up bound more by RAM bandwidth than by compute.
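
A rough way to see the bandwidth limit (a back-of-envelope sketch; the model size and bandwidth figures below are illustrative assumptions, not measurements):

    # Every generated token has to stream the full set of weights from RAM,
    # so tokens/sec is roughly (memory bandwidth) / (model size).
    # Numbers are illustrative assumptions, not measurements.
    model_size_gb = 3.9        # e.g. a 7B model quantized to ~4 bits/weight
    ram_bandwidth_gbs = 45.0   # ballpark dual-channel laptop DDR4

    tokens_per_sec = ram_bandwidth_gbs / model_size_gb
    print(f"upper bound: ~{tokens_per_sec:.1f} tokens/sec")
    # -> on the order of 10 tokens/sec, i.e. "fast enough to (slowly) read"

Prompt ingestion, by contrast, can process many tokens at once, which is why a GPU (or more cores) helps there far more than it helps generation.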


