> Just ran Llama-2 (without a GPU) and it gave me coherent responses in 3 minutes (which is extremely fast for no GPU). How does this work?
It should be much faster with llama.cpp. My old-ish laptop CPU (AMD 4900HS) can ingest a big prompt reasonably quickly and then stream text fast enough to (slowly) read.
If you have any kind of dGPU, even a small laptop one, prompt ingestion is dramatically faster.
Try the latest Kobold release: https://github.com/LostRuins/koboldcpp
But to answer your question: the GGML CPU implementation is very good, and actually generating the response is a largely serial process that is bound more by RAM bandwidth than by compute.
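To make the "RAM speed bound" point concrete, here's a rough back-of-the-envelope sketch (the model size and bandwidth figures below are illustrative assumptions, not measurements): each generated token has to stream essentially all of the weights from RAM, so memory bandwidth puts a ceiling on tokens per second no matter how fast the cores are.

```python
# Back-of-the-envelope estimate of CPU token-generation speed.
# Assumed numbers for illustration only.

model_size_gb = 3.8        # e.g. a 7B model quantized to ~4 bits (assumed)
ram_bandwidth_gbs = 50.0   # typical dual-channel laptop DDR4 bandwidth (assumed)

# Each generated token requires reading roughly the whole model from RAM once,
# so memory bandwidth sets an upper bound on tokens per second.
max_tokens_per_sec = ram_bandwidth_gbs / model_size_gb
print(f"rough upper bound: {max_tokens_per_sec:.1f} tokens/s")  # ~13 tokens/s
```

That ceiling is already comfortably above reading speed, which is why a plain laptop CPU can feel responsive once the prompt has been ingested.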