Running Llama.cpp on AWS Instances (github.com/ggerganov)
96 points by schappim 6 months ago | 10 comments



If anyone is looking for a more cost-effective solution, Hetzner has 16 vCPU / 32 GB RAM ARM VMs for around €24/mo that will run a 34B Q4 GGUF at around 4 tok/sec. It's not very fast, but it is very cheap.
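For anyone who wants to reproduce that kind of setup, a rough sketch of a CPU-only build and run (the model path and thread count are placeholders, and the binary is called main in older llama.cpp releases and llama-cli in newer ones):

  # CPU-only build, then run a Q4 GGUF across all 16 cores
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp && make
  ./main -m ./models/34b-model.Q4_K_M.gguf -t 16 -n 256 -p "Hello"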


Something that would be extremely helpful is a good benchmark of various hardware for LLM inference. It's really hard to tell how well a GPU will perform or whether it will be supported at all.
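llama.cpp does ship a small benchmarking tool that helps with this sort of comparison; a minimal sketch, assuming a locally downloaded GGUF (the model path is a placeholder):

  # reports prompt-processing and token-generation speed
  ./llama-bench -m ./models/13b-model.Q4_K_M.gguf -t 8 -ngl 0   # CPU only
  ./llama-bench -m ./models/13b-model.Q4_K_M.gguf -ngl 99       # full GPU offload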


So roughly how much does this instance cost per day? Like $30? I'm kind of confused why it wasn't mentioned, but hey, maybe people aren't as cheap as me. Cool project though.


The pricing is shown in the first image on the page: $0.526 per hour, which works out to roughly $12.60 per day ($0.526 × 24 ≈ $12.62).


That's if we follow the recommendation to use a GPU. But if you don't need 200+ tokens per second and are happy with something like 10-15, a CPU-only solution is acceptable and would cost significantly less.


One use case for running LLMs on a CPU is long background tasks that don't need a real-time response. llama.cpp seems like a suitable tool for this. It would be interesting to explore how to leverage the various acceleration options available on AWS.
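As a starting point for the CPU-acceleration angle, a sketch of an OpenBLAS-enabled build plus a fire-and-forget batch run (flag names have changed across llama.cpp releases, and the file names here are placeholders):

  # older Makefile builds exposed an OpenBLAS-accelerated CPU path
  make LLAMA_OPENBLAS=1

  # long background job: prompt read from a file, output appended to a log
  nohup ./main -m ./models/13b-model.Q4_K_M.gguf -f prompt.txt -n 1024 -t 8 > out.log 2>&1 &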


I am more interested in running llama.cpp on CPU-only VPSs/EC2 instances, although it is probably too slow.


13B models run on an 8-core CPU fast enough to have a fluid conversation.
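For reference, an interactive CPU-only chat session looks something like this (the model path is a placeholder; -t should roughly match your physical core count):

  # interactive chat with a 13B Q4 quant on 8 threads
  ./main -m ./models/13b-model.Q4_K_M.gguf -t 8 -c 2048 -i --color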


What can I run on an Nvidia RTX 4060 Ti with 16 GB of VRAM?


Best to try it yourself; llama.cpp is refreshingly easy to build.
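For what it's worth, a rough sketch of a CUDA-enabled build and a fully offloaded run (build flags have changed across releases: LLAMA_CUBLAS=1 was the Makefile switch around this time, newer versions use CMake with a GGML_CUDA option). A 13B model at Q4/Q5 should fit comfortably in 16 GB of VRAM:

  # CUDA build (older Makefile flag; newer releases use CMake)
  make clean && make LLAMA_CUBLAS=1

  # -ngl = number of layers to offload to the GPU; 99 offloads everything
  ./main -m ./models/13b-model.Q5_K_M.gguf -ngl 99 -n 256 -p "Hello"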



