
What kind of machine do I need to run 405B locally?



You can't. Sorry.

Unless...

You have a couple hundred $k sitting around collecting dust... then all you need is a DGX or HGX level of VRAM, the power to run it, the power to keep it cool, and a place for it to sit.


You can build a machine that will run the 405b model for much, much less, if you're willing to accept the following caveats:

* You'll be running a Q5(ish) quantized model, not the full model

* You're OK with buying used hardware

* You have two separate 120v circuits available to plug it into (I assume you're in the US), or alternatively a single 240v dryer/oven/RV-style plug.

The build would look something like (approximate secondary market prices in parentheses):

* Asrock ROMED8-2T motherboard ($700)

* A used Epyc Rome CPU ($300-$1000 depending on how many cores you want)

* 256GB of DDR4, 8x 32GB modules ($550)

* NVMe boot drive ($100)

* Ten RTX 3090 cards, 24GB of VRAM each ($700 each, $7000 total; rough sizing math after this list)

* Two 1500 watt power supplies. One will power the mobo and four GPUs, and the other will power the remaining six GPUs ($500 total)

* An open frame case, the kind made for crypto miners ($100?)

* PCIe splitters, cables, screws, fans, other misc parts ($500)
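
Rough sizing math, in case anyone wants to sanity-check the ten-GPU count (the bits-per-weight figures are my assumptions for typical Q4/Q5 GGUF-style quants, not exact):

    # Back-of-the-envelope sizing for a quantized 405B model (Python).
    PARAMS = 405e9
    BITS_Q4 = 4.8   # assumption: Q4_K_M-style average bits per weight
    BITS_Q5 = 5.7   # assumption: Q5_K_M-style average bits per weight

    def weights_gb(params, bits):
        return params * bits / 8 / 1e9

    print(f"Q4-ish weights: ~{weights_gb(PARAMS, BITS_Q4):.0f} GB")   # ~243 GB
    print(f"Q5-ish weights: ~{weights_gb(PARAMS, BITS_Q5):.0f} GB")   # ~289 GB
    print(f"VRAM on ten 3090s: {10 * 24} GB")                         # 240 GB
    # A Q5-ish quant won't sit entirely in 240GB of VRAM once you add the
    # KV cache, so expect to offload some layers to the 256GB of system RAM.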

Total is about $10k, give or take. You'll be limiting the GPUs (using `nvidia-smi` or similar) to 200-225W each, which drastically reduces their top-end power draw for a minimal drop in performance. Plug each power supply into a different AC circuit, or use an adapter that splits a single 240V outlet into two 120V legs to accomplish the same thing.
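
Concretely, the power cap is just nvidia-smi's -pl flag (board power limit in watts; needs root and doesn't persist across reboots, so run it at startup). A minimal sketch, with 225W as the target:

    # Cap every detected GPU at 225W via nvidia-smi. The Python wrapper is
    # only for convenience; the same commands work straight from a shell.
    import subprocess

    def gpu_indices():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
        return [line.strip() for line in out.stdout.splitlines() if line.strip()]

    subprocess.run(["nvidia-smi", "-pm", "1"], check=True)        # persistence mode
    for idx in gpu_indices():
        subprocess.run(["nvidia-smi", "-i", idx, "-pl", "225"], check=True)
    # Budget check: 10 cards x 225W = 2250W for the GPUs, which is why the
    # load is split across two 1500W supplies on separate circuits.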

When actively running inference you'll likely be pulling ~2500-2800W from the wall, but at idle, the whole system should use about a tenth of that.

It will heat up the room it's in, especially if you use it frequently, but since it's in an open frame case there are lots of options for cooling.

I realize that this setup is still out of the reach of the "average Joe" but for a dedicated (high-end) hobbyist or someone who wants to build a business, this is a surprisingly reasonable cost.

Edit: the other cool thing is that if you use fast DDR4 and populate all 8 RAM slots as I recommend above, the memory bandwidth of this system (204.8GB/sec with DDR4-3200) is competitive with Apple silicon. Combined with a 32+ core Epyc, you could experiment with running many models completely on the CPU, though Llama 405B will probably still be excruciatingly slow.
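
The 204.8 figure is just channels x transfer rate x bus width. Quick check, plus a rough guess at bandwidth-bound CPU-only decoding (the ~243GB model size is my assumption for a Q4-ish quant that fits in 256GB of RAM):

    # DDR4-3200, 8 channels, 64-bit (8-byte) bus per channel.
    channels, transfers_per_sec, bus_bytes = 8, 3200e6, 8
    bandwidth = channels * transfers_per_sec * bus_bytes
    print(f"{bandwidth / 1e9:.1f} GB/s")                    # 204.8 GB/s

    # Token-by-token decoding streams all the weights once per token, so
    # bandwidth / model size is a rough upper bound on generation speed.
    model_bytes = 243e9                                      # assumption: Q4-ish 405B
    print(f"~{bandwidth / model_bytes:.2f} tokens/sec")      # ~0.84 tok/s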


Would be interesting to see the performance on a dual-socket EPYC system with DDR5 running at maximum speed.

Assuming NUMA doesn't give you headaches (which it will), you'd be looking at nearly 1 TB/s.
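
Back-of-the-envelope, assuming 12 channels of DDR5-4800 per socket (faster DIMMs push it higher):

    # Aggregate bandwidth across two sockets, if NUMA placement cooperates.
    sockets, channels, transfers_per_sec, bus_bytes = 2, 12, 4800e6, 8
    print(f"{sockets * channels * transfers_per_sec * bus_bytes / 1e9:.1f} GB/s")  # 921.6 GB/s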


But you need CPUs with the highest number of chiplets, because the memory-controller-to-chiplet interconnect is the (memory bandwidth) limiting factor there. And those are, of course, the most expensive ones. Even then it's still much slower than GPUs for LLM inference, but at least you have enough memory.


You could run a 4-bit quant for about $10k, I'm guessing. 10x 3090s would do.


You can get 3 Mac Studios for less than "a couple hundred $k". Chain them with Exo, done. And they fit under your desk and keep themselves cool just on their own...


according to another comment, ~10x 4090 video cards.


That was the punchline of a joke.


lol thanks, i know nothing about the hardware side of things for this stuff


thanks. hoping the Nvidia 50 series offers some more VRAM.



