
What kind of machine do I need to run 405B locally?



You can't. Sorry.

Unless...

You have a couple hundred $k sitting around collecting dust... then all you need is a DGX or HGX level of VRAM, the power to run it, the power to keep it cool, and a place for it to sit.


You can build a machine that will run the 405b model for much, much less, if you're willing to accept the following caveats:

* You'll be running a Q5(ish) quantized model, not the full model

* You're OK with buying used hardware

* You have two separate 120v circuits available to plug it into (I assume you're in the US), or alternatively a single 240v dryer/oven/RV-style plug.

The build would look something like (approximate secondary market prices in parentheses):

* Asrock ROMED8-2T motherboard ($700)

* A used Epyc Rome CPU ($300-$1000 depending on how many cores you want)

* 256GB of DDR4, 8x 32GB modules ($550)

* NVMe boot drive ($100)

* Ten RTX 3090 cards, 24GB of VRAM each ($700 each, $7000 total; rough sizing math after this list)

* Two 1500 watt power supplies. One will power the mobo and four GPUs, and the other will power the remaining six GPUs ($500 total)

* An open frame case, the kind made for crypto miners ($100?)

* PCIe splitters, cables, screws, fans, other misc parts ($500)
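
Rough sizing math, in case anyone wants to sanity-check the ten-GPU count (the bits-per-weight figures are my assumptions for typical Q4/Q5 GGUF-style quants, not exact):

    # Back-of-the-envelope sizing for a quantized 405B model (Python).
    PARAMS = 405e9
    BITS_Q4 = 4.8   # assumption: Q4_K_M-style average bits per weight
    BITS_Q5 = 5.7   # assumption: Q5_K_M-style average bits per weight

    def weights_gb(params, bits):
        return params * bits / 8 / 1e9

    print(f"Q4-ish weights: ~{weights_gb(PARAMS, BITS_Q4):.0f} GB")   # ~243 GB
    print(f"Q5-ish weights: ~{weights_gb(PARAMS, BITS_Q5):.0f} GB")   # ~289 GB
    print(f"VRAM on ten 3090s: {10 * 24} GB")                         # 240 GB
    # A Q5-ish quant won't sit entirely in 240GB of VRAM once you add the
    # KV cache, so expect to offload some layers to the 256GB of system RAM.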

Total is about $10k, give or take. You'll be limiting the GPUs (using `nvidia-smi` or similar) to 200-225W each, which drastically reduces their top-end power draw for a minimal drop in performance. Plug each power supply into a different AC circuit, or use an adapter that splits a single 240V outlet into two 120V legs to accomplish the same thing.
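
Concretely, the power cap is just nvidia-smi's -pl flag (board power limit in watts; needs root and doesn't persist across reboots, so run it at startup). A minimal sketch, with 225W as the target:

    # Cap every detected GPU at 225W via nvidia-smi. The Python wrapper is
    # only for convenience; the same commands work straight from a shell.
    import subprocess

    def gpu_indices():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
        return [line.strip() for line in out.stdout.splitlines() if line.strip()]

    subprocess.run(["nvidia-smi", "-pm", "1"], check=True)        # persistence mode
    for idx in gpu_indices():
        subprocess.run(["nvidia-smi", "-i", idx, "-pl", "225"], check=True)
    # Budget check: 10 cards x 225W = 2250W for the GPUs, which is why the
    # load is split across two 1500W supplies on separate circuits.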

When actively running inference you'll likely be pulling ~2500-2800W from the wall, but at idle, the whole system should use about a tenth of that.

It will heat up the room it's in, especially if you use it frequently, but since it's in an open frame case there are lots of options for cooling.

I realize that this setup is still out of the reach of the "average Joe" but for a dedicated (high-end) hobbyist or someone who wants to build a business, this is a surprisingly reasonable cost.

Edit: the other cool thing is that if you use fast DDR4 and populate all 8 RAM slots as I recommend above, the memory bandwidth of this system (204.8GB/sec with DDR4-3200) is competitive with Apple silicon. Combined with a 32+ core Epyc, you could experiment with running many models completely on the CPU, though Llama 405B will probably still be excruciatingly slow.
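
The 204.8 figure is just channels x transfer rate x bus width. Quick check, plus a rough guess at bandwidth-bound CPU-only decoding (the ~243GB model size is my assumption for a Q4-ish quant that fits in 256GB of RAM):

    # DDR4-3200, 8 channels, 64-bit (8-byte) bus per channel.
    channels, transfers_per_sec, bus_bytes = 8, 3200e6, 8
    bandwidth = channels * transfers_per_sec * bus_bytes
    print(f"{bandwidth / 1e9:.1f} GB/s")                    # 204.8 GB/s

    # Token-by-token decoding streams all the weights once per token, so
    # bandwidth / model size is a rough upper bound on generation speed.
    model_bytes = 243e9                                      # assumption: Q4-ish 405B
    print(f"~{bandwidth / model_bytes:.2f} tokens/sec")      # ~0.84 tok/s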


Would be interesting to see the performance on a dual-socket EPYC system with DDR5 running at maximum speed.

Assuming NUMA doesn't give you headaches (which it will), you'd be looking at nearly 1 TB/s.
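
Back-of-the-envelope, assuming 12 channels of DDR5-4800 per socket (faster DIMMs push it higher):

    # Aggregate bandwidth across two sockets, if NUMA placement cooperates.
    sockets, channels, transfers_per_sec, bus_bytes = 2, 12, 4800e6, 8
    print(f"{sockets * channels * transfers_per_sec * bus_bytes / 1e9:.1f} GB/s")  # 921.6 GB/s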


But you need CPUs with the highest number of chiplets, because the memory-controller-to-chiplet interconnect is the (memory bandwidth) limiting factor there. And those are, of course, the most expensive ones. Even then it's still much slower than GPUs for LLM inference, but at least you have enough memory.


You could run a 4-bit quant for about $10k, I'm guessing. 10x 3090s would do.


You can get 3 Mac Studios for less than "a couple hundred $k". Chain them with Exo, done. And they fit under your desk and keep themselves cool just on their own...


according to another comment, ~10x 4090 video cards.


That was the punchline of a joke.


lol thanks, i know nothing about the hardware side of things for this stuff


thanks. hoping the Nvidia 50 series offers some more VRAM.



