Have you measured your electricity consumption for this rig? I have to wonder ho...

ux266478 · 2026-06-15T18:08:01 1781546881

Not nearly as much as you might think. 1.2 kW where I live translates to about $0.12/hr, and that's when running at full clip. If you have a decent solar hookup, it's a small fraction on a sunny day.

The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.

leptons · 2026-06-16T00:00:47 1781568047

I'm paying about $0.19/hr and using half that power just for a large spinning RAID, running some VMs and security cameras. And I'm reconsidering my digital extravagance because of the electric bill. You probably make way more money than I do.
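
Both hourly figures in this subthread can be turned into an implied per-kWh rate by dividing the hourly cost by the draw in kW. A minimal sketch of that arithmetic; reading "half that power" as 0.6 kW is an assumption, and the function name is only illustrative:

```python
def implied_rate_per_kwh(cost_per_hour: float, draw_kw: float) -> float:
    """Electricity rate (USD/kWh) implied by an hourly cost at a given draw."""
    if draw_kw <= 0:
        raise ValueError("draw_kw must be positive")
    return cost_per_hour / draw_kw

# Figures quoted in the comments above.
# 0.6 kW assumes "half that power" means half of the 1.2 kW rig.
rate_a = implied_rate_per_kwh(0.12, 1.2)
rate_b = implied_rate_per_kwh(0.19, 0.6)
print(f"{rate_a:.3f} vs {rate_b:.3f} USD/kWh")
```

On that reading, the second rate is roughly three times the first, so the same rig would feel very different on the two bills.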

mtone · 2026-06-15T21:38:31 1781559511

Here's a DeepSeek-V4-Flash benchmark on 2X RTX Pro 6000:

  - Prefill: ~10K tok/s
  - Decode: 190 | 375 | 980 tok/s (for 1 | 4 | 16 concurrent requests)
  - GPU power draw during benchmark: Average: 585W | Max: 849W | Limit: 1200W with undervolt. Idle PC is 125W.

I've asked it to calculate the following, considering a realistic blend of cached prompts and decode for an agentic dev scenario.

Electricity-only (@ USD $0.08/kWh)

  Usage          | IN price  | OUT price | Monthly cost
  Concurrency=1  | $0.040/M  | $0.080/M  | $8.65 to $38.88 (5% to 100% active)
  Concurrency=4  | $0.024/M  | $0.044/M  | up to $48.67 (cheaper per token but higher power draw)
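
One way to sanity-check the monthly figures is to back out the average draw each one implies at $0.08/kWh. This is a sketch only: the 30-day month is an assumption, and the table does not state how its numbers were actually derived.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

def implied_avg_watts(monthly_cost_usd: float, usd_per_kwh: float = 0.08) -> float:
    """Average power draw (W) that would produce the given monthly bill."""
    kwh = monthly_cost_usd / usd_per_kwh
    return kwh / HOURS_PER_MONTH * 1000.0

# Monthly costs taken from the table above.
for label, cost in [("c=1, 5% active", 8.65),
                    ("c=1, 100% active", 38.88),
                    ("c=4, 100% active", 48.67)]:
    print(f"{label}: {implied_avg_watts(cost):.0f} W average")
```

Under that assumption the three rows correspond to roughly 150 W, 675 W and 845 W of average draw, the last of which sits close to the 849 W maximum quoted in the benchmark.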

Total cost of ownership over 3 years is electricity + USD $20K (pre-hike pricing). In a production scenario, how much would I have to charge my users to break even, aiming for 4 concurrent requests 24/7?

A) Breakeven API pricing (est. 2B IN + 1B OUT throughput/month):

                        IN price    OUT price
  Self-hosted           $0.121/M    $0.363/M
  OpenRouter (budget)   $0.098/M    $0.196/M
  OpenRouter (DeepSeek) $0.140/M    $0.280/M
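
The self-hosted row can be reproduced with a simple amortization sketch. Assumptions, none of which the table states explicitly: hardware is spread evenly over 36 months, electricity is the $48.67/month concurrency=4 figure, and OUT tokens are priced at 3x IN tokens (the ratio visible in the self-hosted row).

```python
def breakeven_prices(hardware_usd: float, months: int,
                     electricity_usd_per_month: float,
                     in_mtok: float, out_mtok: float,
                     out_in_ratio: float) -> tuple[float, float]:
    """Return (IN, OUT) breakeven prices in USD per million tokens."""
    monthly_cost = hardware_usd / months + electricity_usd_per_month
    # Solve: in_price * in_mtok + (ratio * in_price) * out_mtok = monthly_cost
    in_price = monthly_cost / (in_mtok + out_in_ratio * out_mtok)
    return in_price, in_price * out_in_ratio

# 2B IN + 1B OUT per month, expressed in millions of tokens.
in_price, out_price = breakeven_prices(20_000, 36, 48.67, 2_000, 1_000, 3.0)
print(f"IN ${in_price:.3f}/M, OUT ${out_price:.3f}/M")
```

Nearly all of the monthly cost here is hardware amortization (about $556 of about $604), so the result is far more sensitive to the purchase price and sustained utilization than to the electricity rate.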

B) Breakeven subscription (users active ~1.5h/day):

    1 user: $563/mo (oh, hai)
    25 users: $23/mo
    100 users: $6/mo
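
The subscription numbers follow the same logic: fixed monthly cost divided by the number of subscribers. The electricity term behind each row is not itemized above, so this sketch leaves it as a parameter and shows only the hardware-only floor (36-month amortization is an assumption).

```python
def breakeven_subscription(users: int, electricity_usd_per_month: float,
                           hardware_usd: float = 20_000, months: int = 36) -> float:
    """Monthly price per user that covers amortized hardware plus electricity."""
    if users < 1:
        raise ValueError("users must be at least 1")
    return (hardware_usd / months + electricity_usd_per_month) / users

# Hardware-only floor (electricity set to 0) for the three user counts above.
for users in (1, 25, 100):
    print(f"{users:>3} users: ${breakeven_subscription(users, 0.0):.2f}/mo floor")
```

The floors come out slightly below the $563, $23 and $6 quoted above; the gap is the electricity share, which grows with how busy the machine is.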

antirez · 2026-06-16T12:04:35 1781611475

Interestingly, if we assume 16 concurrent users, per-user prefill drops to about 600 t/s and generation to 61 t/s. That starts to get dangerously close to the 35 t/s generation and 400 t/s prefill you get from DwarfStar on an M5 Max, on your own laptop (which you use for many other things) that costs ~6500 USD/EUR.
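
The per-user numbers come from dividing aggregate throughput by the number of concurrent requests. A small sketch using the benchmark figures quoted earlier in the thread; treating the ~10K tok/s prefill as evenly shared across 16 requests is an assumption:

```python
def per_request(aggregate_tok_s: float, concurrency: int) -> float:
    """Throughput each request sees if the aggregate is shared evenly."""
    return aggregate_tok_s / concurrency

# Aggregate decode tok/s by concurrency, from the 2x RTX Pro 6000 benchmark.
decode = {1: 190, 4: 375, 16: 980}
for c, total in decode.items():
    print(f"c={c:>2}: {per_request(total, c):.2f} tok/s per request")

# Assumption: ~10K tok/s prefill split evenly across 16 requests.
print(f"prefill at c=16: {per_request(10_000, 16):.0f} tok/s per request")
```

Even sharing gives 61.25 tok/s decode and 625 tok/s prefill per request at c=16, in line with the rounded figures in the comment.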

zozbot234 · 2026-06-16T13:21:08 1781616068

DwarfStar and other end-user inference engines should also support batched/concurrent inference, IMHO. Not so much for the overly naïve "serving multiple users" case: local hardware cannot really compete with ordinary datacenter gear, much less with the big proprietary suppliers, and the compute headroom is too small to begin with once the model is in RAM. The point is rather to improve SSD-streamed decode in the unattended inference scenario, where the goal is to meaningfully raise aggregate tok/s under a tight constraint on disk bandwidth while CPU/GPU compute has a lot of slack.

Of course this requires wide enough batches to have at least some reuse of fetched experts across a batch, but that seems feasible in the "unattended" case, where firing off multiple inferences to be processed together seems quite natural. (We may also have some benefit from better use of the resident experts cache and/or of SSD transfer bandwidth.)

https://github.com/antirez/ds4/issues/275 seems to provide intriguing rough results while https://github.com/antirez/ds4/issues/314 is a valuable contrast where one commonly suggested solution ("just run multiple instances of the engine in parallel") ran into real issues. Neither of these discusses the combined use of batching and SSD streaming yet, so there's room for experimentation.
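
The expert-reuse argument can be illustrated with a toy model. Assume each token in a batch independently picks `top_k` of `num_experts` experts uniformly at random in a layer; the expected number of distinct experts that must be fetched is then E * (1 - (1 - k/E)^B). The values 256 and 8 below are hypothetical, not taken from any particular model, and real routing is neither uniform nor independent.

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch: int) -> float:
    """Expected distinct experts touched by `batch` tokens in one MoE layer,
    assuming uniform, independent top-k routing (a simplifying assumption)."""
    p_miss = 1.0 - top_k / num_experts  # one token does not pick a given expert
    return num_experts * (1.0 - p_miss ** batch)

def fetches_per_token(num_experts: int, top_k: int, batch: int) -> float:
    """Expert fetches amortized over the tokens in the batch."""
    return expected_distinct_experts(num_experts, top_k, batch) / batch

# Hypothetical layer shape: 256 experts, top-8 routing.
for batch in (1, 4, 16, 64, 256):
    print(f"batch={batch:>3}: {fetches_per_token(256, 8, batch):.2f} fetches/token")
```

Under this model a batch of 16 only trims fetches from 8 to roughly 6.4 per token, which is consistent with the point that batches need to be fairly wide before reuse pays off. Skewed routing or a resident expert cache would change the numbers.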

arjie · 2026-06-16T01:21:44 1781572904

Vouched your comment. Very cool. What are you running on to get 190 tok/s? I get 400 tok/s at c=4 but c=1 is slower than you.

mtone · 2026-06-16T04:23:25 1781583805

I am using the `voipmonitor/vllm:lucifer` Docker image from the RTX6K Discord community, discussed at the same link the other commenter posted. It is based on this PR: https://github.com/vllm-project/vllm/pull/43477

arjie · 2026-06-16T09:01:56 1781600516

Ah I’m on the same PR just behind. Thank you.

CamperBob2 · 2026-06-16T04:04:22 1781582662

Not OP, but I am seeing up to 260 tokens/second output at c=1 with the recipe at https://github.com/local-inference-lab/rtx6kpro/blob/master/... using 4x 6k cards. Average is more like 200.

There may be a way to get the 2-bit quantized version running even faster on a pair of them.

arjie · 2026-06-16T09:03:39 1781600619

Thank you. Useful to know. Capped at the top end by the all-reduce, I assume.

CamperBob2 · 2026-06-16T16:27:04 1781627224

I think so. The machine I'm using runs at Gen4 x8, while the cards can take advantage of Gen5 x16.
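
For a sense of scale, here is a rough sketch of the theoretical per-direction link bandwidth for the two configurations. It uses the standard per-lane signaling rates (16 GT/s for Gen4, 32 GT/s for Gen5) with 128b/130b encoding and ignores protocol overhead, so real transfers will be lower.

```python
GT_PER_S = {4: 16.0, 5: 32.0}   # per-lane signaling rate by PCIe generation
ENCODING = 128 / 130            # 128b/130b line encoding used by Gen3 and later

def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * lanes * ENCODING / 8  # 8 bits per byte

gen4_x8 = pcie_gb_per_s(4, 8)
gen5_x16 = pcie_gb_per_s(5, 16)
print(f"Gen4 x8: {gen4_x8:.2f} GB/s, Gen5 x16: {gen5_x16:.2f} GB/s")
```

That is a 4x difference in link bandwidth, which would matter most for the inter-GPU traffic that tensor-parallel decode generates.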