Not nearly as much as you might think. 1.2 kW where I live translates to about $0.12/hr, and that's when running at full clip. If you have a decent solar hookup, it's a small fraction on a sunny day.
The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
I'm paying about $0.19/hr and using half that power just for a large spinning RAID, running some VMs and security cameras. And I'm reconsidering my digital extravagance because of the electric bill. You probably make way more money than I do.
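For comparison, the two hourly figures above can be turned back into per-kWh rates. This is only a rough sketch: it assumes "half that power" means 0.6 kW, and the function name is made up for illustration.

```python
def implied_rate_per_kwh(cost_per_hour: float, power_kw: float) -> float:
    """Electricity rate (USD/kWh) implied by an hourly cost at a given draw."""
    return cost_per_hour / power_kw


# First comment: 1.2 kW costs about $0.12/hr.
rate_a = implied_rate_per_kwh(0.12, 1.2)
# Second comment: $0.19/hr at half that power (assumed to be 0.6 kW).
rate_b = implied_rate_per_kwh(0.19, 0.6)

print(f"rate A: ${rate_a:.3f}/kWh")
print(f"rate B: ${rate_b:.3f}/kWh")
print(f"B pays about {rate_b / rate_a:.1f}x as much per kWh")
```

So most of the gap between the two bills comes from the local electricity rate, not from the hardware.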
Here's a DeepSeek-V4-Flash benchmark on 2X RTX Pro 6000:
- Prefill: ~10K tok/s
- Decode: 190 | 375 | 980 tok/s (for 1 | 4 | 16 concurrent requests)
- GPU power draw during benchmark: Average: 585W | Max: 849W | Limit: 1200W with undervolt. Idle PC is 125W.
I asked it to calculate the following, assuming a realistic blend of cached prompts and decode for an agentic dev scenario.
Electricity-only (@ USD $0.08/kWh)
Usage | IN price | OUT price | Monthly cost
Concurrency=1 | $0.040/M | $0.080/M | $8.65 to $38.88 (5% to 100% active)
Concurrency=4 | $0.024/M | $0.044/M | up to $48.67 (cheaper per token but higher power draw)
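As a sanity check, the monthly costs in the table can be converted back into the average power draw they imply. A minimal sketch, assuming a 30-day month (720 hours) and that the costs cover the whole machine; neither assumption is stated in the model's output, and the function name is hypothetical.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month


def implied_avg_watts(monthly_cost_usd: float, rate_per_kwh: float,
                      hours: float = HOURS_PER_MONTH) -> float:
    """Average draw in watts that would produce the given monthly bill."""
    kwh = monthly_cost_usd / rate_per_kwh
    return kwh / hours * 1000.0


RATE = 0.08  # USD/kWh, from the table header

for label, cost in [("concurrency=1,   5% active", 8.65),
                    ("concurrency=1, 100% active", 38.88),
                    ("concurrency=4, upper bound", 48.67)]:
    print(f"{label}: ~{implied_avg_watts(cost, RATE):.0f} W average")
```

Under those assumptions the three rows work out to roughly 150 W, 675 W and 845 W, which can be compared against the 125 W idle and the 585 W average / 849 W max GPU draw in the benchmark.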
Total cost of ownership over 3 years is electricity + USD $20K (pre-hike pricing). In a production scenario, how much would I have to charge my users to break even, aiming for 4 concurrent requests 24/7?
A) Breakeven API pricing (est. 2B IN + 1B OUT throughput/month):
Provider | IN price | OUT price
Self-hosted | $0.121/M | $0.363/M
OpenRouter (budget) | $0.098/M | $0.196/M
OpenRouter (DeepSeek) | $0.140/M | $0.280/M
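The self-hosted row can be reproduced by hand. A sketch, assuming the $20K is spread evenly over 36 months, electricity is the $48.67/month concurrency=4 figure, and OUT tokens are priced at 3x IN tokens; the 3x ratio is inferred from the table, not stated anywhere, and the function name is hypothetical.

```python
def breakeven_prices(hardware_usd: float, months: int,
                     electricity_per_month: float,
                     in_mtok: float, out_mtok: float,
                     out_to_in_ratio: float) -> tuple[float, float]:
    """Return (IN, OUT) prices in USD per million tokens that cover monthly cost."""
    monthly_cost = hardware_usd / months + electricity_per_month
    # Revenue = in_price * in_mtok + (ratio * in_price) * out_mtok
    in_price = monthly_cost / (in_mtok + out_to_in_ratio * out_mtok)
    return in_price, in_price * out_to_in_ratio


in_price, out_price = breakeven_prices(
    hardware_usd=20_000, months=36,
    electricity_per_month=48.67,
    in_mtok=2_000,    # 2B input tokens per month
    out_mtok=1_000,   # 1B output tokens per month
    out_to_in_ratio=3.0,  # assumption inferred from the table
)
print(f"IN ${in_price:.3f}/M, OUT ${out_price:.3f}/M")
```

Hardware amortization is about $556 of the roughly $604 monthly total, so the breakeven price is driven almost entirely by the purchase price and by how full you can keep the machine, not by electricity.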
B) Breakeven subscription (users active ~1.5h/day):
Interestingly, if we assume 16 concurrent users, per-user prefill drops to ~600 t/s and generation to ~61 t/s. That starts to get dangerously close to the 35 t/s generation and 400 t/s prefill you get from an M5 Max running DwarfStar on your own laptop (which you also use for many other things) and which costs ~6,500 USD/EUR.
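The per-user numbers come from splitting the aggregate benchmark throughput evenly across concurrent requests. A quick sketch, assuming the decode and prefill figures quoted earlier are aggregates and that the split is even (real schedulers will not be perfectly fair); the function name is made up.

```python
def per_request_rate(aggregate_tok_s: float, concurrency: int) -> float:
    """Throughput each request sees if the aggregate is shared evenly."""
    return aggregate_tok_s / concurrency


# Aggregate decode tok/s by concurrency, from the benchmark above.
decode = {1: 190, 4: 375, 16: 980}
prefill_aggregate = 10_000  # ~10K tok/s prefill

for n, agg in decode.items():
    print(f"{n:>2} concurrent: {per_request_rate(agg, n):.2f} tok/s decode each")

print(f"16 concurrent: {per_request_rate(prefill_aggregate, 16):.0f} tok/s prefill each")
```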
DwarfStar and other end-user inference engines should also support batched/concurrent inference IMHO. Not so much for the overly naïve "serving multiple users" case (local hardware cannot really compete with ordinary datacenter gear, much less with the big proprietary suppliers; the compute headroom is too small to begin with once the model is in RAM), but rather to improve SSD-streamed decode in the unattended inference scenario. There the goal is to meaningfully raise aggregate tok/s under an overly tight disk-bandwidth constraint, while CPU/GPU compute has a lot of slack.
Of course this requires wide enough batches to have at least some reuse of fetched experts across a batch, but that seems feasible in the "unattended" case, where firing off multiple inferences to be processed together seems quite natural. (We may also have some benefit from better use of the resident experts cache and/or of SSD transfer bandwidth.)
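To get a feel for how wide "wide enough" might be, here is a back-of-envelope sketch. It assumes each token picks its top-k experts uniformly and independently of the other tokens in the batch, which is a simplification (real routing is skewed and correlated, which should increase reuse). The expert counts are hypothetical placeholders, not the actual DeepSeek-V4-Flash configuration.

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch: int) -> float:
    """Expected number of distinct experts touched per layer by a batch,
    under uniform, independent top-k routing (a simplifying assumption)."""
    p_untouched = (1.0 - top_k / num_experts) ** batch
    return num_experts * (1.0 - p_untouched)


def fetches_per_token(num_experts: int, top_k: int, batch: int) -> float:
    """Expert fetches amortized over the tokens in the batch."""
    return expected_distinct_experts(num_experts, top_k, batch) / batch


# Hypothetical MoE shape, for illustration only.
E, K = 256, 8

for b in (1, 4, 16, 64):
    print(f"batch={b:>2}: {expected_distinct_experts(E, K, b):6.1f} distinct experts, "
          f"{fetches_per_token(E, K, b):.2f} fetches per token")
```

With uniform routing the amortized fetch count per token falls only slowly at small batch sizes, so any real gain would depend on routing skew and on how many experts are already resident in the cache.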
https://github.com/antirez/ds4/issues/275 seems to provide intriguing rough results, while https://github.com/antirez/ds4/issues/314 is a valuable contrast, where one commonly suggested solution ("just run multiple instances of the engine in parallel") ran into real issues. Neither of these discusses the combined use of batching and SSD streaming yet, so there's room for experimentation.
I am using the `voipmonitor/vllm:lucifer` Docker image from the RTX6K Discord community, discussed at the same link the other commenter posted. It is based around this PR: https://github.com/vllm-project/vllm/pull/43477