Running Llama.cpp on AWS Instances

mikeravkine · 2023-11-28T14:11:41

If anyone is looking for a more cost-effective option, Hetzner has 16 vCPU/32GB RAM ARM VMs for €24/mo that will run a 34B Q4 GGUF model at around 4 tok/sec. It's not very fast, but it is very cheap.
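For a rough sanity check on why a 34B Q4 model fits in 32GB of RAM, here is a back-of-the-envelope sketch. The figure of about 4.5 bits per weight is an assumption (a ballpark for 4-bit GGUF quantizations, not a number from this thread), and the helper name `estimate_weights_gb` is made up for illustration. It only counts the weights, so the KV cache and runtime overhead come on top.

```python
def estimate_weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumption: ~4.5 bits per weight as a ballpark for a 4-bit quantization.
weights_gb = estimate_weights_gb(34, 4.5)
print(f"34B at ~4.5 bits/weight: about {weights_gb:.1f} GB of weights")
print(f"Headroom on a 32 GB VM: about {32 - weights_gb:.1f} GB")
```

Under that assumption the weights come to roughly 19 GB, which leaves room for context on a 32GB machine.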

joelthelion · 2023-11-28T11:56:10

Something that would be extremely helpful is a good benchmark of various hardware for LLM inference. It's really hard to tell how well a GPU will perform or whether it will be supported at all.

ionwake · 2023-11-28T10:24:02

So roughly how much does this instance cost a day? Like $30? I'm kind of confused why it wasn't mentioned, but hey, maybe people aren't as cheap as me. Cool project though.

michaelt · 2023-11-28T10:29:58

The pricing is shown in the first image on the page: $0.526 per hour.
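To answer the per-day question directly, the hourly price converts like this. This is a minimal sketch that assumes the instance runs around the clock and ignores storage, data transfer, and any discounts; `running_cost` is just an illustrative helper.

```python
HOURLY_USD = 0.526  # on-demand price quoted above


def running_cost(hourly_usd: float, hours: float) -> float:
    """Cost of keeping the instance up for the given number of hours."""
    return hourly_usd * hours


per_day = running_cost(HOURLY_USD, 24)
per_30_days = running_cost(HOURLY_USD, 24 * 30)
print(f"Per day:     ${per_day:.2f}")      # $12.62
print(f"Per 30 days: ${per_30_days:.2f}")  # $378.72
```

So it is closer to $13 a day than $30, provided you stop the instance when you are not using it or accept the full 24-hour bill.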

patrakov · 2023-11-28T10:58:10

That price applies if we follow the recommendation to use a GPU. But if you don't need 200+ tokens per second and are happy with something like 10-15, a CPU-only solution is acceptable and would cost significantly less.
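One way to compare the two options is cost per token rather than cost per hour. The sketch below pairs the $0.526/hour price from the comment above with the 200 tokens per second mentioned here; that pairing is an assumption, since the thread does not state what speed that instance actually reaches. It also assumes the machine generates tokens nonstop, which rarely holds for interactive use.

```python
def usd_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


def break_even_hourly_usd(ref_hourly_usd: float, ref_tps: float, tps: float) -> float:
    """Hourly price at which a slower machine matches the reference cost per token."""
    return ref_hourly_usd * tps / ref_tps


# Assumed pairing: $0.526/hour GPU instance at 200 tokens per second.
gpu_cost = usd_per_million_tokens(0.526, 200)
print(f"GPU: about ${gpu_cost:.2f} per million tokens")

for cpu_tps in (10, 15):
    limit = break_even_hourly_usd(0.526, 200, cpu_tps)
    print(f"CPU at {cpu_tps} tok/s breaks even at about ${limit:.4f}/hour")
```

The hourly bill for a CPU box is lower, but per token it only wins under these assumptions if it costs less than a few cents an hour. For light or bursty use, where the machine mostly sits idle, the lower hourly price matters more than the per-token figure.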

alekseiprokopev · 2023-11-28T09:25:51

One of the tasks that can be accomplished by running LLMs on a CPU is to execute long background tasks that do not require real-time response. llama.cpp seems like a suitable platform for this. It would be interesting to explore how to leverage the various acceleration techniques available on AWS.

ilaksh · 2023-11-28T09:06:17

I am more interested in running llama.cpp on CPU-only VPSs/EC2, although it is probably too slow.

rini17 · 2023-11-28T10:10:02

13B models run on an 8-core CPU fast enough to have a fluid conversation.

borissk · 2023-11-28T16:35:06

What can I run on an NVIDIA RTX 4060 Ti with 16GB of VRAM?

rini17 · 2023-11-28T22:03:35

Best to try it yourself; llama.cpp is refreshingly easy to build.