6 tokens per second? Can you put up with that? That seems very slow. I aim for 40t...

segmondy · 2026-06-23T04:39:19 1782189559

I have been putting up with it forever. We are spoiled by MixtureOfExperts. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20tk/sec with 8b models, and if you could run llama3-405B at 1tk/sec you were a god. To each their own. I can live with 6 high-quality tokens a second. If I could get a Fable-equivalent model, I'd gladly take 2tk/sec if that's what it took to run it locally.

manmal · 2026-06-23T04:44:16 1782189856

But what is it doing for you that you couldn’t do yourself at that speed? I’m really curious and on the fence about partly going local.

all2 · 2026-06-23T04:54:09 1782190449

I think you would use it more like email and less like text messages, so the domain of communication shifts drastically. The other part is, you don't have to run just that model; you can offload a lot of chores to smaller models.

AussieWog93 · 2026-06-23T10:17:04 1782209824

Not a Local LLM user, but I regularly kick off meaty jobs in Claude Code then check on them 1-2hrs later.

wqaatwt · 2026-06-23T10:38:37 1782211117

In this case it would take 20-40 hours to accomplish the same amount of work when running locally.
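A rough check on that estimate: the 20-40 hour figure follows if the hosted model generates about 20x faster than the 6 tok/s local one, i.e. roughly 120 tok/s. That hosted speed is an assumption picked to match the estimate, not a number anyone in the thread measured, and the sketch treats the job as purely generation-bound (no prompt processing, tool calls, or waiting on tests).

```python
def local_hours(cloud_hours: float, cloud_tps: float, local_tps: float) -> float:
    """Wall-clock hours needed locally to emit the same number of tokens."""
    return cloud_hours * cloud_tps / local_tps


LOCAL_TPS = 6.0    # local speed quoted in the thread
CLOUD_TPS = 120.0  # ASSUMED hosted speed; implies the 20x ratio behind "20-40 hours"

for hours in (1, 2):
    print(f"{hours}h hosted -> {local_hours(hours, CLOUD_TPS, LOCAL_TPS):.0f}h local")
```

With a slower hosted model (say 50 tok/s) the same 1-2 hour job would come out closer to 8-17 hours locally, so the conclusion depends heavily on the assumed ratio.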

Mashimo · 2026-06-23T06:13:40 1782195220

Run one task, while you do another? Or while you sleep / eat / rave?

manmal · 2026-06-23T12:38:03 1782218283

While my colleagues are running 6 parallel agents at 50-100t/s each, with an actual SOTA model? Don’t you think I’d get fired after a few weeks of that?

nijave · 2026-06-23T13:07:49 1782220069

I agree single-digit tk/sec is painfully slow, but I also doubt anyone with these local/homelab setups is using them for work. Likely fire off and check back later. That said, I've had terrible results one-shotting, so you'd need to design with a faster model or have extreme patience during the discovery/design phase.

nozzlegear · 2026-06-23T16:57:12 1782233832

Do you work at Facebook and happen to find yourself in a token burning competition with your colleagues?

Mashimo · 2026-06-23T14:12:25 1782223945

Why would you use this when your company has access to actual SOTA? I don't get it.

manmal · 2026-06-24T12:41:08 1782304868

Why would I ever use a local model by that logic? A usable model means my computer was very expensive so I’d have the funds for a Pro plan as well.

Mashimo · 2026-06-25T08:20:23 1782375623

Well we are on HACKER news ;-) To mess around and learn something would be one reason. Maybe you already have the hardware. Why selfhost anything if cloud does the thing?

Another reason would be that your company does not allow any source code leaks, and thus every developer either has local models or none.

A 6 T/second local model while your colleagues have 200 EUR/month Claude does not make much sense. At least I can't see a use case.

segmondy · 2026-06-23T14:59:54 1782226794

Here's a thought experiment for you. Let's say you can run 1000 agents at 10,000 tokens a second. Do you think you are going to be more productive than someone running at 6tk/sec with the same model?

In case it's not clear, you will be generating 10,000,000 tokens a second. Good luck verifying it. Token generation is not the bottleneck for creative work. If you are doing predictable work and have a good workflow and a massive dataset to process, then token speed matters. If you are performing creative work like coding, it doesn't.
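For anyone who wants the arithmetic spelled out, here is a minimal sketch. The agent count and per-agent speed come from the thought experiment above; the human review rate of 5 tokens per second is purely an assumed illustrative figure, not a measured one.

```python
def aggregate_tps(agents: int, tps_per_agent: float) -> float:
    """Total tokens generated per second across all agents."""
    return agents * tps_per_agent


def review_seconds_per_second(generated_tps: float, review_tps: float) -> float:
    """Seconds of human review needed for each second of generation."""
    return generated_tps / review_tps


REVIEW_TPS = 5.0  # ASSUMED human reading/verification speed, illustrative only

fast = aggregate_tps(1000, 10_000)  # the thought experiment: 1000 agents at 10,000 tok/s
slow = aggregate_tps(1, 6)          # one local model at 6 tok/s

print(f"fast setup: {fast:,.0f} tok/s, "
      f"{review_seconds_per_second(fast, REVIEW_TPS):,.0f}s of review per second generated")
print(f"slow setup: {slow:,.0f} tok/s, "
      f"{review_seconds_per_second(slow, REVIEW_TPS):.1f}s of review per second generated")
```

Under that assumption the slow setup roughly keeps pace with a single reviewer, while the fast one outruns review by about six orders of magnitude, which is the point being made about verification as the bottleneck.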

manmal · 2026-06-24T12:39:58 1782304798

I would just not wait for things to finish, because it’d be instant? No need to create slop just because something is faster.

froh · 2026-06-23T04:43:05 1782189785

do you use caveman or similar?

walrus01 · 2026-06-23T08:13:32 1782202412

I get a lot done with something that's also approximately 6 tokens/second, if you're willing to give it a well-defined set of prompts and projects to work on, leave it for an hour or two, then come back and check what it's done. It also helps to remember to give it something of more consequence to do, with at least 3-4 hours of wall clock runtime, before heading to bed.