A while back there was a post about running Stable Diffusion on a Raspberry Pi Zero 2 [1], which was slow but incredibly impressive! That sparked my curiosity about this question: what is considered affordable hardware for running large language models locally today? I'm aware there's a lot of work underway to make inference cheap to run at the edge, but I'm curious to understand the current landscape of hardware anyone could buy. I've seen people running models on flagship smartphones, but those are more expensive than a Mac mini and perform worse.
By affordable I mean no more than the cost of a current-gen base model Mac mini ($599), but ideally around the price of a Raspberry Pi 5 ($79), which comes up when searching for budget PCs [2]. Both devices have the same amount of RAM in my case (8 GB) but show different performance, given the importance of memory bandwidth. I mention these two because I've had success running Llama 3 via Ollama on both (roughly the way sketched after the references below), albeit at slower speeds than a full workstation with a commodity GPU such as an RTX 4090, which starts at $1,599. I'm interested in learning about what other devices are out there that people consider cheap and use for running LLMs locally.
[1]: https://news.ycombinator.com/item?id=38646969
[2]: https://www.pcmag.com/picks/the-best-budget-desktop-computers
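For concreteness, here is a minimal Python sketch of how I poke at both boards: it queries Ollama's local HTTP API and prints a rough tokens/sec figure. It assumes Ollama is serving on its default port (11434), that `ollama pull llama3` has already been run on the device, and that the eval counters Ollama returns are present; treat it as a sketch rather than a benchmark.

    # Minimal sketch: query a local Ollama server and report rough tokens/sec.
    # Assumes Ollama is running on its default port and `ollama pull llama3`
    # has already been done on the device being tested.
    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"

    def run_prompt(prompt: str, model: str = "llama3") -> None:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=600,  # small boards can take minutes per response
        )
        resp.raise_for_status()
        data = resp.json()
        print(data["response"])
        # Ollama reports eval_count and eval_duration (nanoseconds) in its
        # response; use them for a rough generation-speed estimate.
        if "eval_count" in data and "eval_duration" in data:
            tps = data["eval_count"] / (data["eval_duration"] / 1e9)
            print(f"~{tps:.1f} tokens/sec")

    if __name__ == "__main__":
        run_prompt("Explain memory bandwidth in one sentence.")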
You can experiment with a lot of models; it's just going to be slow.
With DDR5 you can even go higher, with 48 GB modules.
Otherwise, I got an RTX 3060 12GB, which can be had for around €200 used.
It's a very affordable setup.