If anyone is looking for a more cost-effective solution, Hetzner has 16 vCPU/32GB RAM ARM VMs for €24/mo that will run a 34B Q4 GGUF at around 4 tok/sec. It's not very fast, but it is very cheap.
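For a rough sense of what that works out to per token, here is a back-of-the-envelope sketch. It assumes the price is 24 euros per month, a 30-day month, and the VM generating nonstop at the quoted 4 tok/sec; the function names are just for illustration, and real utilization will be lower.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def cost_per_million_tokens(monthly_price, tokens_per_second,
                            days_per_month=30, utilization=1.0):
    """Price per one million generated tokens.

    Assumes a flat monthly price and a steady generation rate;
    utilization is the fraction of the month spent generating.
    """
    tokens_per_month = (tokens_per_second * SECONDS_PER_DAY
                        * days_per_month * utilization)
    return monthly_price / tokens_per_month * 1_000_000


def daily_cost(monthly_price, days_per_month=30):
    """Flat monthly price spread evenly over the days of the month."""
    return monthly_price / days_per_month


if __name__ == "__main__":
    # Figures from the comment above: 24 per month, about 4 tok/sec.
    print(f"per day: {daily_cost(24):.2f}")
    print(f"per million tokens: {cost_per_million_tokens(24, 4):.2f}")
```

Under those assumptions it comes to about 0.80 per day and roughly 2.31 per million tokens. At 50% utilization the per-token figure doubles.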
Something that would be extremely helpful is a good benchmark of various hardware for LLM inference. It's really hard to tell how well a GPU will perform or whether it will be supported at all.
So roughly how much does this instance cost a day? Like $30?
I'm kind of confused why it wasn't mentioned, but hey, maybe people aren't as cheap as me. Cool project though.
This is if we follow the recommendation to use a GPU. But if you don't need 200+ tokens per second and are happy with something like 10-15, a CPU-only solution is acceptable and would cost significantly less.
One good use for running LLMs on a CPU is long background jobs that do not need a real-time response. llama.cpp seems like a suitable platform for this. It would be interesting to explore how to leverage the various acceleration techniques available on AWS.
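To make the background-job idea concrete, here is a minimal sketch of a single worker thread that drains a queue of prompts. The `generate` callable is a hypothetical stand-in for whatever actually runs the model (for example a wrapper around llama.cpp); nothing here is llama.cpp's own API.

```python
import queue
import threading
from typing import Callable, Dict, Optional, Tuple

Job = Optional[Tuple[str, str]]  # (job_id, prompt), or None to stop


def start_worker(jobs: "queue.Queue[Job]",
                 results: Dict[str, str],
                 generate: Callable[[str], str]) -> threading.Thread:
    """Start one background thread that processes jobs in order.

    `generate` is a hypothetical function mapping a prompt to model
    output; slow CPU inference is fine because nobody waits on it live.
    """
    def loop() -> None:
        while True:
            item = jobs.get()
            if item is None:  # sentinel: no more work
                jobs.task_done()
                break
            job_id, prompt = item
            try:
                results[job_id] = generate(prompt)
            except Exception as exc:  # record the failure, keep going
                results[job_id] = f"error: {exc}"
            finally:
                jobs.task_done()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    jobs: "queue.Queue[Job]" = queue.Queue()
    results: Dict[str, str] = {}
    # Stub generator so the sketch runs without a model.
    worker = start_worker(jobs, results, lambda p: f"summary of: {p}")
    jobs.put(("doc-1", "first long document"))
    jobs.put(("doc-2", "second long document"))
    jobs.put(None)
    jobs.join()
    print(results)
```

A single worker keeps the CPU from being oversubscribed, since one inference run typically uses all cores already.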