- I wouldn't use anything larger than a 7B model if you want decent speed.
- Quantize to 4-bit to save RAM and speed up inference.
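As a rough illustration of why 4-bit helps, here is the back-of-the-envelope memory math for the weights alone (a sketch that ignores activation and KV-cache overhead; the ~4.5 bits/weight figure is an assumption reflecting that 4-bit quantization formats store extra per-block scale metadata):

```python
def model_ram_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate RAM needed to hold just the weights, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

params = 7e9  # a 7B-parameter model

fp16 = model_ram_gib(params, 16)    # full 16-bit weights
q4 = model_ram_gib(params, 4.5)     # ~4.5 bits/weight incl. quantization scales

print(f"fp16: {fp16:.1f} GiB, 4-bit: {q4:.1f} GiB")
# roughly 13 GiB vs under 4 GiB
```

So a quantized 7B model fits comfortably in 8 GB of RAM, where the fp16 version would not.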