Hacker News new | past | comments | ask | show | jobs | submit login

I would use textsynth (https://bellard.org/ts_server/) or llama.cpp (https://github.com/ggerganov/llama.cpp) if you're running on CPU.

  - I wouldn't use anything higher than a 7B model if you want decent speed.
  - Quantize to 4-bit to save RAM and run inference faster.
Speed will be around 15 tokens per second on CPU (tolerable), and 5-10x faster with a GPU.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: