I'm constantly tempted by the idealism of this experience, but when you factor i...

Aurornis · 2025-08-08T20:06:33 1754683593

I think the local LLM scene is very fun and I enjoy following what people do.

However, every time I run local models on my MacBook Pro with a ton of RAM, I’m reminded of the gap between locally hosted models and the frontier models that I can get for $20/month or a nominal price per token from different providers. The difference in speed and quality is massive.

The current local models are very impressive, but they’re still a big step behind the SaaS frontier models. I feel like the benchmark charts don’t capture this gap well, presumably because the models are trained to perform well on those benchmarks.

I already find the frontier models from OpenAI and Anthropic to be slow and frequently error prone, so dropping speed and quality even further isn’t attractive.

I agree that it’s fun as a hobby or for people who can’t or won’t take any privacy risks. For me, I’d rather wait and see what an M5 or M6 MacBook Pro with 128GB of RAM can do before I start trying to put together another dedicated purchase for LLMs.

jauntywundrkind · 2025-08-08T21:14:53 1754687693

I agree and disagree. Many of the best models are open source, just too big to run for most people.

And there are plenty of ways to fit these models! A Mac Studio M3 Ultra with 512GB of unified memory has huge capacity and a decent chunk of bandwidth (800GB/s, compared with a 5090's ~1800GB/s). $10k is a lot of money, but that ability to fit these very large models & get quality results is very impressive. Performance is even lower, but a single AMD Turin chip with its 12 channels of DDR5-6000 can get you to almost 600GB/s: a 12x 64GB (768GB) build is gonna be $4000+ in RAM costs, plus $4800 for, say, a 48-core Turin to go with it. (But if you go to older generations, affordability goes way up! It's a special part, but the 48-core 7R13 is <$1000.)
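A rough sketch of the arithmetic behind the "almost 600GB/s" figure, plus the usual back-of-envelope ceiling on decode speed. It assumes a 64-bit (8-byte) data path per DDR5 channel and that every generated token has to stream all active weights from memory once; the 20GB active-weight figure is a made-up example, not a number from this thread, and real systems land below these theoretical peaks.

```python
def peak_bandwidth_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s: channels x transfers/s x bytes per transfer."""
    return channels * mt_per_s * 1e6 * bus_bytes / 1e9


def decode_tokens_per_s_ceiling(bandwidth_gbs: float, active_weight_gb: float) -> float:
    """Upper bound on tokens/s if each token reads all active weights once (memory-bound)."""
    return bandwidth_gbs / active_weight_gb


# 12 channels of DDR5-6000, as in the Turin build described above.
turin_gbs = peak_bandwidth_gbs(channels=12, mt_per_s=6000)

# Hypothetical model with 20 GB of active weights per token (assumption for illustration).
ceiling = decode_tokens_per_s_ceiling(turin_gbs, active_weight_gb=20.0)

print(f"Peak bandwidth: {turin_gbs:.0f} GB/s")
print(f"Decode ceiling: {ceiling:.1f} tokens/s")
```

The same two functions can be pointed at the 800GB/s and ~1800GB/s figures quoted above to compare ceilings across platforms.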

Still, those costs come to $5000 at the low end, and come with far fewer tokens/s. The "grid compute" / "utility compute" / "cloud compute" model of getting work done on a hot GPU, with a model already loaded on it by someone else, is very very direct & clear. And these home builds are very big investments. It's just not likely any of us will have anything but burst demands for GPUs, so structurally it makes sense. But it really feels like there's only small things getting in the way of running big models at home!

Strix Halo is kind of close. 96GB usable memory isn't quite enough to really do the thing though (and only 256GB/s). Even if/when they put the new 64GB DDR5 onto the platform (for 256GB, let's say 224 usable), one still has to sacrifice some quality to fit 400B+ models. Next-gen Medusa Halo is not coming for a while, but goes from 4->6 channels, so 384GB total: not bad.

(It sucks that PCIe is so slow. PCIe 5.0 x16 is only 64GB/s in one direction. Compared to the need here, it's nowhere near enough to pair a big-memory host with a smaller-memory GPU.)
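For anyone who wants to check the PCIe number, here is a minimal sketch of the per-direction bandwidth calculation. It covers only generations 3 through 5, which share 128b/130b encoding, and it ignores protocol overhead beyond the line code. The 100GB weight size is a hypothetical example, and the 1800GB/s comparison figure is the 5090 number quoted earlier in this thread.

```python
# Per-lane signalling rate in GT/s for PCIe generations that use 128b/130b encoding.
RATES_GT_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130


def pcie_gbs_per_direction(gen: int, lanes: int = 16) -> float:
    """Approximate usable bandwidth in GB/s, one direction, before packet overhead."""
    return RATES_GT_S[gen] * ENCODING_EFFICIENCY / 8 * lanes


def seconds_to_stream(weights_gb: float, bandwidth_gbs: float) -> float:
    """Time to move a given amount of weights across a link once."""
    return weights_gb / bandwidth_gbs


gen5_x16 = pcie_gbs_per_direction(5)

# Hypothetical 100 GB of weights held in host RAM and streamed to the GPU per pass.
over_pcie = seconds_to_stream(100.0, gen5_x16)
over_vram = seconds_to_stream(100.0, 1800.0)

print(f"PCIe 5.0 x16: {gen5_x16:.1f} GB/s per direction")
print(f"One pass over PCIe: {over_pcie:.2f} s, from local VRAM: {over_vram:.3f} s")
```

Under these assumptions one pass over the link takes well over a second, which is why keeping weights on the host and streaming them to a small-memory GPU does not work for token generation.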

Aurornis · 2025-08-08T22:23:42 1754691822

> Many of the best models are open source, just too big to run for most people.

You can find all of the open models hosted across different providers. You can pay per token to try them out.

I just don't see the open models as being at the same quality level as the best from Anthropic and OpenAI. They're good but in my experience they're not as good as the benchmarks would suggest.

> $10k is a lot of money, but that ability to fit these very large models & get quality results is very impressive.

This is why I only appreciate the local LLM scene from a distance.

It’s really cool that this can be done, but $10K to run lower quality models at slower speeds is a hard sell. I can rent a lot of hours on an on-demand cloud server for a lot less than that price or I can pay $20-$200/month and get great performance and good quality from Anthropic.

I think the local LLM scene is fun where it intersects with hardware I would buy anyway (MacBook Pro with a lot of RAM) but spending $10K to run open models locally is a very expensive hobby.

jstummbillig · 2025-08-08T21:35:53 1754688953

> Many of the best models are open source, just too big to run for most people

I don't think that's a likely future, when you consider all the big players doing enormous infrastructure projects and the money that this increasingly demands. Powerful LLMs are simply not a great open source candidate. The models are not a by-product of the bigger thing you do. They are the bigger thing. Open sourcing an LLM means you are essentially investing money to just give it away. That simply does not make a lot of sense from a business perspective. You can do that in a limited fashion for a limited time, for example when you are scaling or it's not really your core business and you just write it off as expenses, while you try to figure yet another thing out (looking at you Meta).

But with the current paradigm, one thing seems to be very clear: Building and running ever bigger LLMs is a money burning machine the likes of which we have rarely if ever seen, and operating that machine at a loss will make you run out of any amount of money really, really fast.

Rohansi · 2025-08-09T07:09:11 1754723351

You'll want to look at benchmarks rather than the theoretical maximum bandwidth available to the system. Apple has been using bandwidth as a marketing point but you're not always able to use that bandwidth amount depending on your workload. For example, the M1 Max has 400GB/s advertised bandwidth but the CPU and GPU combined cannot utilize all of it [1]. This means Strix Halo could actually be better for LLM inference than Apple Silicon if it achieves better bandwidth utilization.

[1] https://web.archive.org/web/20250516041637/https://www.anand...

esseph · 2025-08-08T21:56:08 1754690168

https://pcisig.com/pci-sig-announces-pcie-80-specification-t...

From 2003-2016, 13 years, we had PCIE 1,2,3.

2017 - PCIE 4.0

2019 - PCIE 5.0

2022 - PCIE 6.0

2025 - PCIE 7.0

2028 - PCIE 8.0

Manufacturing and vendors are having a hard time keeping up. And the PCIE 5.0 memory is.. not always the most stable.

dcrazy · 2025-08-08T22:26:06 1754691966

Are you conflating GDDR5x with PCIe 5.0?

esseph · 2025-08-08T22:45:05 1754693105

No.

I'm saying we're due for faster memory, but we seem to be having trouble scaling bus speeds (in production) and shipping reliable memory as well. And the network is changing a lot, too.

It's a neverending cycle I guess.

dcrazy · 2025-08-09T00:17:40 1754698660

One advantage of Apple Silicon is the unified memory architecture. You put memory on the fabric instead of on PCIe.

jauntywundrkind · 2025-08-09T00:14:39 1754698479

Thanks for the numbers. Valuable contribution for sure!!

There's been a huge lag for PCIe adoption, and imo so, so much has boiled down to "do people need it?"

In the past 10 years I feel like my eyes have been opened that every high tech company's greatest highest most compelling desire is to slow walk the release out. To move as slow as the market will bear, to do as little as possible, to roll on and on with minor incremental changes.

There are cannonball moments where the market is disrupted. Thank the fucking stars Intel got sick of all this shit and worked hard (with many others) to standardize NVMe, to make a post-SATA world with higher speeds & a better protocol. AMD64 architecture changed the game. Ryzen again. But so much of the industry is about retaining your cost advantage, is about retaining strong market segmentations, by never shipping too many PCIe lane platforms, by limiting consumer vs workstation vs server video card RAM and vgpu (and mxgpu) and display out capabilities, often entirely artificially.

But there is a fucking fire right now and everyone knows it. Nvlink is massively more bandwidth and massively more efficient and is essential to system performance. The need to get better fast is so on. Seems like for now SSD will keep slow walking their 2x's. But PCIe is facing a real crisis of being replaced, and everyone wants better. And hates hates hates the insane cost. PCIe 8.0 is going to be insane data to push over a differential, insane speed. But we have to.

Alas PCIe is also hampered by relatively generous broader system design. The trace distances are going to shrink, signal requirements increase a lot. But needing an intercompatible compliance program for any peripheral to work is a significant disadvantage versus just making this point-to-point link work between these two cards.

There's so many energies happening right now in interconnect. I hope we see some actual uptake, some day. We've had so long for Gen-Z (Ethernet phy, gone now), CXL (3.x being switched, still un-arrived), now UltraEthernet and UltraLink. Man I hope we can see some step improvements. Everyone knows we are in deep shit if NV alone can connect systems. Ironically AMD's HyperTransport was open, was a path towards this, but now Infinity Fabric is an internal-only thing, and as branding & an idea it's kind of vanishing from the world, which feels insufficient.

esseph · 2025-08-09T01:45:45 1754703945

All of these extremely high end technologies are so far away from hitting the consumer market.

Is there any desire for most people? What's the TAM?

jauntywundrkind · 2025-08-09T04:29:34 1754713774

Classic economics thinking: totally fucked "faster horses" thinking.

The addressable market depends on the advantage. Which right now: we don't know. It's all a guess that someone is going to find it valuable, and no one knows.

But if we find that we didn't actually need $700 NICs to get shitty bandwidth, if we could have just been putting cables from PCIe-shaped slot to PCIe slot (or oculink port!) and getting >>10x performance with >>10x less latency? Yeah bro uhh I think there might be a desire for using the same fucking chip we already use but getting 10x + 10x better out of it.

Faster lower latency cheaper storage? RAM expandability? Lower latency GPU access? There's so much that could make a huge difference for computing, broadly.

justincormack · 2025-08-09T07:45:10 1754725510

Thunderbolt tunnels PCIe, and you can use it as a NIC in effect with one cable between devices. It's slower than oculink but more convenient.

esseph · 2025-08-10T06:06:27 1754805987

I am very ready for optical bus lfg

nemomarx · 2025-08-09T02:48:00 1754707680

Probably small consumer market of enthusiasts (notice Nvidia barely caters to gaming hardware lately) but if you can get better memory throughput on servers isn't that a large industry market?

vFunct · 2025-08-09T01:31:16 1754703076

The game changer technology that'll enable full 1TB+ LLM models for cheap is Sandisk's High Bandwidth Flash. Expect devices with that in about 3-4 years, maybe even on cellphones.

jauntywundrkind · 2025-08-09T04:26:40 1754713600

I'm crazy excited for High Bandwidth Flash, really hope they pull it off. There is a huge caveat: only having a couple hundred or thousand r/w cycles before your multi $k accelerator stops working!! A pretty big constraint!

But as long as you are happy to keep running the same model, the wins here for large capacity & high bandwidth are sick ! And the affordability could be exceptional! (If you can afford to make flash with a hundred or so channels at a decent price!)

Uehreka · 2025-08-08T20:49:29 1754686169

I was talking about this in another comment, and I think the big issue at the moment is that a lot of the local models seem to really struggle with tool calling. Like, just straight up can’t do it even though they’re advertised as being able to. Most of the models I’ve tried with Goose (models which say they can do tool calls) will respond to my questions about a codebase with “I don’t have any ability to read files, sorry!”

So that’s a real brick wall for a lot of people. It doesn’t matter how smart a local model is if it can’t put that smartness to work because it can’t touch anything. The difference between manually copy/pasting code from LM Studio and having an assistant that can read and respond to errors in log files is light years. So until this situation changes, this asterisk needs to be mentioned every time someone says “You can run coding models on a MacBook!”

com2kid · 2025-08-08T22:52:04 1754693524

> Like, just straight up can’t do it even though they’re advertised as being able to. Most of the models I’ve tried with Goose (models which say they can do tool calls) will respond to my questions about a codebase with “I don’t have any ability to read files, sorry!”

I'm working on solving this problem in two steps. The first is a library, prefilled-json, that lets small models properly fill out JSON objects. The second is an unpublished library called Ultra Small Tool Call that presents tools in a way that small models can understand, and basically walks the model through filling out the tool call with the help of prefilled-json. It'll combine a number of techniques, including tool call RAG (pulls in tool definitions using RAG) and, honestly, just not throwing entire JSON schemas at the model but instead using context engineering to keep the model focused.

IMHO the better solution for local on device workflows would be if someone trained a custom small parameter model that just determined if a tool call was needed and if so which tool.
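To make the "walk the model through filling out the tool call" idea concrete, here is a minimal sketch of one way it could work. This is not the API of prefilled-json or Ultra Small Tool Call; every name below (`fill_tool_call`, `stub_model`, the `read_file` tool) is hypothetical. The assumption is that the harness writes the JSON structure itself and asks the model for one field value at a time, so the model never has to emit valid JSON.

```python
import json
from typing import Callable


def fill_tool_call(tool_name: str, fields: dict[str, type],
                   ask_model: Callable[[str], str]) -> str:
    """Build the tool-call JSON in code; ask the model only for individual field values."""
    args = {}
    for name, typ in fields.items():
        prompt = f"Value for '{name}' ({typ.__name__}) of tool '{tool_name}':"
        raw = ask_model(prompt).strip()
        # Coerce in code so the final JSON is well-formed and correctly typed.
        args[name] = typ(raw)
    return json.dumps({"tool": tool_name, "arguments": args})


def stub_model(prompt: str) -> str:
    """Stand-in for a small local model; returns canned answers for this demo."""
    return "src/main.py" if "'path'" in prompt else "40"


# Hypothetical tool with two typed arguments.
call = fill_tool_call("read_file", {"path": str, "max_lines": int}, stub_model)
print(call)
```

In a real setup `stub_model` would be replaced by a call into whatever local inference engine is in use, ideally with a short completion limit so the model returns just the value.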

jauntywundrkind · 2025-08-08T21:16:57 1754687817

Agreed that this is a huge limit. There's a lot of examples actually of "tool calling" but it's all bespoke code-it-yourself: very few of these systems have MCP integration.

I have a ton of respect for SGLang as a runtime. I'm hoping something can be done there. https://github.com/sgl-project/sglang/discussions/4461 . As noted in that thread, it is really great that Qwen3-Coder has a tool-parser built-in: hopefully it can be some kind of useful reference/start. https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/b...

wizee · 2025-08-09T01:13:59 1754702039

Qwen 3 Coder 30B-A3B has been pretty good for me with tool calling.

mxmlnkn · 2025-08-08T22:49:53 1754693393

This resonates. I have finally started looking into local inference a bit more recently.

I have tried Cursor a bit, and whatever it used worked somewhat alright to generate a starting point for a feature and for a large refactor and break through writer's blocks. It was fun to see it behave similarly to my workflow by creating step-by-step plans before doing work, then searching for functions to look for locations and change stuff. I feel like one could learn structured thinking approaches from looking at these agentic AI logs. There were lots of issues with both of these tasks, though, e.g., many missed locations for the refactor and spuriously deleted or indented code, but it was a starting point and somewhat workable with git. The refactoring usage caused me to reach free token limits in only two days. Based on the usage, it used millions of tokens in minutes, only rarely less than 100K tokens per request, and therefore probably needs a similarly large context length for best performance.

I wanted to replicate this with VSCodium and Cline or Continue because I want to use it without exfiltrating all my data to megacorps as payment and use it to work on non-open-source projects, and maybe even use it offline. Having Cursor start indexing everything, including possibly private data, in the project folder as soon as it starts, left a bad taste, as useful as it is. But I quickly ran into context length problems with Cline, and Continue does not seem to work very well. Some models did not work at all, and DeepSeek was thinking for hours in loops (default temperature too high, should supposedly be <0.5). And even after getting tool use to work somewhat with Qwen QwQ 32B Q4, it feels like it does not have a full view of the codebase, even though it has been indexed. For one refactor request mentioning names from the project, it started by doing useless web searches. It might also be a context length issue. But larger contexts really eat up memory.

I am also contemplating a new system for local AI, but it is really hard to decide. You have the choice between fast GPU inference, e.g., RTX 5090 if you have money, or 1-2 used RTX 3090, or slow but qualitatively better CPU / unified memory integrated GPU inference with systems such as the DGX Spark, the Framework Desktop AMD Ryzen AI Max, or the Mac Pro systems. Neither is ideal (or cheap). My problems with context length and low-performing agentic models seem to indicate that going for the slower but more helpful models on a large unified memory is better for my use case. My use case would mostly be agentic coding. Code completion does not seem to fit me because I find it distracting, and I don't require much boilerplating.

It also feels like the GPU is wasted, and local inference might be a red herring altogether. Looking at how a batch size of 1 is one of the worst cases for GPU computation and how it would only be used in bursts, any cloud solution will easily be an order of magnitude or two more efficient because of this, if I understand this correctly. Maybe local inference will therefore never fully take off, barring even more specialized hardware or hard requirements on privacy, e.g., for companies. To solve that, it would take something like computing on encrypted data, which seems impossible.

Then again, if the batch size of 1 is indeed so bad as I think it to be, then maybe simply generate a batch of results in parallel and choose the best of the answers? Maybe this is not a thing because it would increase memory usage even more.
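The batch-size-1 intuition above can be put into a toy roofline model. It assumes each decode step reads all weights once regardless of batch size and costs roughly 2 FLOPs per parameter per sequence, and it ignores KV-cache traffic, which grows with batch size and context and is exactly the extra memory cost mentioned above. All hardware numbers are hypothetical round figures, not measurements of any real GPU.

```python
def decode_throughput(batch: int, weight_bytes: float, params: float,
                      bandwidth_bytes_s: float, flops: float) -> float:
    """Tokens/s across the whole batch under a simple memory-vs-compute roofline."""
    mem_s = weight_bytes / bandwidth_bytes_s   # one pass over the weights per step
    compute_s = 2 * params * batch / flops     # ~2 FLOPs per parameter per sequence
    step_s = max(mem_s, compute_s)             # the slower of the two dominates
    return batch / step_s


# Hypothetical setup: 30B parameters stored at 8 bits, 1 TB/s memory, 100 TFLOP/s compute.
cfg = dict(weight_bytes=30e9, params=30e9, bandwidth_bytes_s=1e12, flops=1e14)

single = decode_throughput(1, **cfg)
batched = decode_throughput(32, **cfg)

print(f"batch=1:  {single:.1f} tokens/s")
print(f"batch=32: {batched:.1f} tokens/s ({batched / single:.0f}x)")
```

In this toy model, throughput scales almost linearly with batch size until compute time catches up with the weight-read time, which is the gap a cloud provider serving many users gets to exploit and a single local user mostly does not.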

justincormack · 2025-08-09T08:50:02 1754729402

You might end up using batching to run multiple queries or branches for yourself in parallel. But yes as you say it is very unclear right now.

wizee · 2025-08-09T01:03:17 1754701397

While cloud models are of course faster and smarter, I've been pretty happy running Qwen 3 Coder 30B-A3B on my M4 Max MacBook Pro. It has been a pretty good coding assistant for me with Aider, and it's also great for throwing code at and asking questions. For coding specifically, it feels roughly on par with SOTA models from mid-late 2024.

At small contexts with llama.cpp on my M4 Max, I get 90+ tokens/sec generation and 800+ tokens/sec prompt processing. Even at large contexts like 50k tokens, I still get fairly usable speeds (22 tok/s generation).

1oooqooq · 2025-08-08T21:03:34 1754687014

more interesting is the extent to which apple convinced people a laptop can replace a desktop or server. mind blowing reality distortion field (as will be proven by some twenty comments telling me I'm wrong 3... 2... 1).

davidmurdoch · 2025-08-09T01:12:25 1754701945

I dropped $4k on an (Intel) laptop a few years ago. I thought it would blow my old 2012 Core i7 out of the water. Editing photos in Lightroom and Photoshop often requires heavy sustained CPU work. Thermals in laptops are just not a solved problem. People who say laptops are fine replacements for desktops probably don't realize how much and how quickly thermals limit heavy multi-core CPU workloads.

jki275 · 2025-08-09T01:39:35 1754703575

That was true until Apple released the M series laptops.

1oooqooq · 2025-08-12T12:19:57 1755001197

as predicted. lol.

bionsystem · 2025-08-08T21:23:54 1754688234

I'm a desktop guy considering the switch to a laptop-only setup. What would I miss?

kelipso · 2025-08-08T21:53:52 1754690032

For $10k, you too can get the power of a $2k desktop, and enjoy burning your lap everyday, or something like that. If I were to do local compute and wanted to use my laptop, I would only consider a setup where I ssh in to my desktop. So I guess only difference from saas llm would be privacy and the cool factor. And rate limits, and paying more if you go over, etc.

com2kid · 2025-08-08T22:55:09 1754693709

$2k laptops nowadays come with 16 cores. They are thermally limited, but they are going to get you 60-80% of the perf of their desktop counterparts.

The real limit is on the Nvidia cards. They are cut down a fair bit, often with less VRAM until you really go up in price point.

They also come with NPUs but the docs are bad and none of the local LLM inference engines seem to use the NPU, even though they could in theory be happy running smaller models.

EagnaIonat · 2025-08-09T06:15:12 1754720112

> For $10k, you too can get the power of a $2k desktop,

Even M1 MBP 32GB performance is pretty impressive for its age and you can get them for well <$1K second hand.

I have one.

I use these models: gpt-oss, llama3.2, deepseek, granite3.3

They all work fine and speed is not an issue. The recent Ollama app means I can have document/image processing with the LLM as well.

moron4hire · 2025-08-09T00:35:29 1754699729

You'll end up with a portable desktop with bad thermals, impacting performance, battery life, and actually-on-the-lap comfort. Bleeding-edge performance laptops can really only manage an hour, max, on battery, making the form factor much more about moving between different pre-planned, desk-oriented work locations.

I take my laptop back and forth from home to work. At work, I ban them from in-person meetings because I want people to actually pay attention to the meeting. In both locations where I use the computer, I have a monitor, keyboard, and mouse I'm plugging in via a dock. That makes the built-in battery and I/O redundant. I think I would rather have a lower-powered, high-battery, ultra portable laptop remoting into the desktop for the few times I bring my computer to in-person meetings for demos.

I wish the memory bandwidth for eGPUs was better.

aldanor · 2025-08-09T00:47:28 1754700448

Huh? Bleeding edge laptops can last a lot more on battery. M3 16'' mbp lasts definitely enough for a full office day of coding. Twice that if just browsing and not doing cpu intensive stuff.

moron4hire · 2025-08-09T01:12:58 1754701978

Even the M4 Max is not "bleeding edge". Apple is doing impressive stuff with energy efficient compute, but you can't get top of the line raw compute for any amount of financial or energy budget from them.

aldanor · 2025-08-09T09:56:57 1754733417

I'm genuinely interested in what kind of work you are doing if an M4 Max is not enough. And what kind of bleeding edge laptops are we even talking about (link?) and for what purpose?

baobun · 2025-08-08T22:02:52 1754690572

Upgradability, repairability, thermals (translating into widely different performance for the same specs), I/O, connectivity.

jazzypants · 2025-08-08T23:18:34 1754695114

I think this would be more interesting if you were to try to prove yourself correct first.

There are extremely few things that I cannot do on my laptop, and I have very little interest in those things. Why should I get a computer that doesn't have a screen? You do realize that, at this point of technological progress, the computer being attached to a keyboard and a screen is the only true distinguishing factor of a laptop, right?

1oooqooq · 2025-08-12T12:18:58 1755001138

cool. you can browse the web. that's cool. just stay out of conversations where you're not an authority.

motorest · 2025-08-08T19:51:40 1754682700

> As the hardware continues to iterate at a rapid pace, anything you pick up second-hand will still depreciate at that pace, making any real investment in hardware unjustifiable.

Can you explain your rationale? It seems that the worst case scenario is that your setup might not be the most performant ever, but it will still work and run models just as it always did.

This sounds like a classic and very basic opex vs capex tradeoff analysis, and these are renowned for showing that in financial terms cloud providers are the preferable option only in a very specific corner case: short-term investment to jump-start infrastructure when you do not know your scaling needs. This is not the case for LLMs.

OP seems to have invested around $600. This is around 3 months worth of an equivalent EC2 instance. Knowing this, can you support your rationale with numbers?

tcdent · 2025-08-08T20:21:11 1754684471

When considering used hardware you have to take quantization into account; gpt-oss-120b, for example, ships in the very new MXFP4 format, which will use far more than 80GB once it is converted to fit the fp types available on older hardware or Apple silicon.

Open models are trained on modern hardware and will continue to take advantage of cutting edge numeric types, and older hardware will continue to suffer worse performance and larger memory requirements.
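As a rough illustration of the memory point, here is a sketch of weight footprint versus numeric format. It assumes MXFP4 costs about 4.25 bits per weight (4-bit elements plus one shared 8-bit scale per block of 32), and it simplifies by treating all 120B parameters as stored in a single format, which is not exactly how a real checkpoint is laid out. KV cache and activations are ignored.

```python
# Effective bits per weight for each storage format (MXFP4 includes its block scale).
BITS_PER_WEIGHT = {
    "mxfp4": 4 + 8 / 32,
    "fp8": 8.0,
    "bf16": 16.0,
}


def weight_footprint_gb(params: float, fmt: str) -> float:
    """Approximate weight memory in GB for a model stored entirely in one format."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


# Nominal 120B parameter count (assumption; the exact count differs slightly).
PARAMS = 120e9

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>6}: {weight_footprint_gb(PARAMS, fmt):6.1f} GB")
```

Under these assumptions the same weights take roughly 64GB natively but 240GB if hardware without a 4-bit float path has to hold them as bf16, which is the gap being described.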

motorest · 2025-08-08T20:36:11 1754685371

You're using a lot of words to say "I believe yesterday's hardware might not run models as fast as today's hardware."

That's fine. The point is that yesterday's hardware is quite capable of running yesterday's models, and obviously it will also run tomorrow's models.

So the question is cost. Capex vs opex. The fact is that buying your own hardware is proven to be far more cost-effective than paying cloud providers to rent some cycles.

I brought data to the discussion: for the price tag of OP's home lab, you can only afford around 3 months' worth of an equivalent EC2 instance. What's your counter argument?
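The capex vs opex comparison can be written down as a small break-even calculation. The $600 hardware cost and the "about 3 months of EC2" equivalence are the figures claimed in this thread, which together imply roughly $200/month for the cloud instance; the $15/month electricity cost is a made-up placeholder, not a measured value.

```python
def break_even_months(hardware_cost: float, cloud_monthly: float,
                      local_monthly: float = 0.0) -> float:
    """Months until buying hardware is cheaper than renting the equivalent."""
    saving = cloud_monthly - local_monthly
    if saving <= 0:
        return float("inf")  # local running costs eat the whole saving
    return hardware_cost / saving


HARDWARE_COST = 600.0          # OP's home lab, per the thread
CLOUD_MONTHLY = 600.0 / 3      # implied by "around 3 months" of an equivalent instance

no_power = break_even_months(HARDWARE_COST, CLOUD_MONTHLY)
with_power = break_even_months(HARDWARE_COST, CLOUD_MONTHLY, local_monthly=15.0)

print(f"Break-even ignoring electricity: {no_power:.1f} months")
print(f"Break-even with $15/month power: {with_power:.1f} months")
```

The result only holds if the rented instance really is equivalent and would be running around the clock; comparing against a flat subscription instead changes `cloud_monthly` and the answer with it.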

kelnos · 2025-08-08T21:19:18 1754687958

Not the GP, but my take on this:

You're right about the cost question, but I think the added dimension that people are worried about is the current pace of change.

To abuse the idiom a bit, yesterday's hardware should be able to run tomorrow's models, as you say, but it might not be able to run next month's models (acceptably or at all).

Fast-forward some number of years, as the pace slows. Then-yesterday's hardware might still be able to run next-next year's models acceptably, and someone might find that hardware to be a better, safer, longer-term investment.

I think of this similarly to how the pace of mobile phone development has changed over time. In 2010 it was somewhat reasonable to want to upgrade your smartphone every two years or so: every year the newer flagship models were actually significantly faster than the previous year, and you could tell that the new OS versions would run slower on your not-quite-new-anymore phone, and even some apps might not perform as well. But today in 2025? I expect to have my current phone for 6-7 years (as long as Google keeps releasing updates for it) before upgrading. LLM development over time may follow at least a superficially similar curve.

Regarding the equivalent EC2 instance, I'm not comparing it to the cost of a homelab, I'm comparing it to the cost of an Anthropic Pro or Max subscription. I can't justify the cost of a homelab (the capex, plus the opex of electricity, which is expensive where I live), when in a year that hardware might be showing its age, and in two years might not meet my (future) needs. And if I can't justify spending the homelab cost every two years, I certainly can't justify spending that same amount in 3 months for EC2.

motorest · 2025-08-08T21:54:11 1754690051

> Fast-forward some number of years (...)

I repeat: OP's home server costs as much as a few months of a cloud provider's infrastructure.

To put it another way, OP can buy brand new hardware a few times per year and still save money compared with paying a cloud provider for equivalent hardware.

> Regarding the equivalent EC2 instance, I'm not comparing it to the cost of a homelab, I'm comparing it to the cost of an Anthropic Pro or Max subscription.

OP stated quite clearly their goal was to run models locally.

ac29 · 2025-08-08T22:57:55 1754693875

> OP stated quite clearly their goal was to run models locally.

Fair, but at the point you trust Amazon hosting your "local" LLM, it's not a huge reach to just use Amazon Bedrock or something

motorest · 2025-08-09T05:29:14 1754717354

> Fair, but at the point you trust Amazon hosting your "local" LLM, it's not a huge reach to just use Amazon Bedrock or something

I don't think you even bothered to look at Amazon Bedrock's pricing before making that suggestion. They charge users per input token plus per output token. In Amazon Bedrock, a single chat session involving 100k tokens can cost you $200. That alone is a third of OP's total infrastructure costs.

If you want to discuss options in terms of cost, the very least you should do is look at pricing.

tcdent · 2025-08-08T22:10:19 1754691019

I incorporated the quantization aspect because it's not that simple.

Yes, old hardware will be slower, but you will also need significantly more of it to even operate.

RAM is the expensive part. You need lots of it. You need even more of it for older hardware which has less efficient float implementations.

https://developer.nvidia.com/blog/floating-point-8-an-introd...

fredmcawesome · 2025-08-09T01:02:42 1754701362

But surely this is short term? Once you get older hardware with FP4 support this shouldn't be a concern.

kelnos · 2025-08-08T20:59:33 1754686773

> I expect this will change in the future

I'm really hoping for that too. As I've started to adopt Claude Code more and more into my workflow, I don't want to depend on a company for day-to-day coding tasks. I don't want to have to worry about rate limits or API spend, or having to put up $100-$200/mo for this. I don't want everything I do to be potentially monitored or mined by the AI company I use.

To me, this is very similar to why all of the smart-home stuff I've purchased must have local control, and why I run my own smart-home software, and self-host the bits that let me access it from outside my home. I don't want any of this or that tied to some company that could disappear tomorrow, jack up their pricing, or sell my data to third parties. Or even use my data for their own purposes.

But yeah, I can't see myself trying to set any LLMs up for my own use right now, either on hardware I own, or in a VPS I manage myself. The cost is very high (I'm only paying Anthropic $20/mo right now, and I'm very happy with what I get for that price), and it's just too fiddly and requires too much knowledge to set up and maintain, knowledge that I'm not all that interested in acquiring right now. Some people enjoy doing that, but that's not me. And the current open models and tooling around them just don't seem to be in the same class as what you can get from Anthropic et al.

But yes, I hope and expect this will change!

jeremyjh · 2025-08-08T19:20:13 1754680813

I expect it will never change. In two years if there is a local option as good as GPT-5 there will be a much better cloud option and you'll have the same tradeoffs to make.

c-hendricks · 2025-08-08T19:25:56 1754681156

Why would AI be one of the few areas where locally-hosted options can't reach "good enough"?

ac29 · 2025-08-08T23:06:00 1754694360

Maybe a better question is when will SOTA models be "good enough"?

At the moment there appears to be ~no demand for older models, even models that people praised just a few months ago. I suspect until AGI/ASI is reached or progress plateaus, that will continue to be the case.

lexh · 2025-08-09T00:28:52 1754699332

The current SOTA closed model providers are also all rolling out access to their latest models with better pricing (e.g. GPT-5 this week), which seems like a confounding factor unique to this moment in the cycle. An API consumer would need to have a very specific reason to choose GPT-4o over GPT-5, given the latter costs less, benchmarks better and is roughly the same speed.

jeremyjh · 2025-08-08T23:54:18 1754697258

Yes, this is exactly my point. Thank you for stating it better.

hombre_fatal · 2025-08-08T19:30:36 1754681436

For some use-cases, like making big complex changes to big complex important code or doing important research, you're pretty much always going to prefer the best model rather than leave intelligence on the table.

For other use-cases, like translations or basic queries, there's a "good enough".

kelnos · 2025-08-08T21:21:56 1754688116

That depends on what you value, though. If local control is that important to you for whatever reason (owning your own destiny, privacy, whatever), you might find that trade off acceptable.

And I expect that over time the gap will narrow. Sure, it's likely that commercially-built LLMs will be a step ahead of the open models, but -- just to make up numbers -- say today the commercially-built ones are 50% better. I could see that narrowing to 5% or something like that, after some number of years have passed. Maybe 5% is a reasonable trade-off for some people to make, depending on what they care about.

Also consider that OpenAI, Anthropic, et al. are all burning through VC money like nobody's business. That money isn't going to last forever. Maybe at some point Anthropic's Pro plan becomes $100/mo, and Max becomes $500-$1000/mo. Building and maintaining your own hardware, and settling for the not-quite-the-best models might be very much worth it.

m11a · 2025-08-08T22:02:34 1754690554

Agree, for now.

But the foundation models will eventually hit a limit, and the open-source ecosystem, which trails by around a year or two, will catch up.

bbarnett · 2025-08-08T19:28:35 1754681315

I grew up in a time when listening to an mp3 was too computationally expensive and nigh impossible for the average desktop. Now tiny phones can decode high def video realtime due to CPU extensions.

And my phone uses a tiny, tiny amount of power, comparatively, to do so.

CPU extensions and other improvements will make AI a simple, tiny task. Many of the improvements will come from robotics.

oblio · 2025-08-08T21:22:01 1754688121

At a certain point Moore's Law died and that point was about 20 years ago but fortunately for MP3s, it happened after MP3 became easily usable. There's no point in comparing anything before 2005 or so from that perspective.

We have long entered an era where computing is becoming more expensive and power hungry, we're just lucky regular computer usage has largely plateaued at a level where the already obtained performance is good enough.

But major leaps are a lot more costly these days.

victorbjorklund · 2025-08-08T19:36:28 1754681788

Next two years probably. But at some point we will either hit scales where you really don't need anything better (let's say cloud is 10000 token/s and local is 5000 token/s; that makes no difference for most individual users) or we will hit some wall where AI doesn't get smarter but the cost of hardware continues to fall.
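
To put rough numbers on the 10000 token/s versus 5000 token/s comparison above, here is a minimal sketch. The 500-token answer length is an assumption picked for illustration, and the function name is hypothetical.

```python
def seconds_to_generate(tokens, tokens_per_second):
    """Time to stream a response of the given length at a fixed decode rate."""
    return tokens / tokens_per_second

# Assumption: a typical answer is about 500 tokens long.
ANSWER_TOKENS = 500

cloud_seconds = seconds_to_generate(ANSWER_TOKENS, 10_000)
local_seconds = seconds_to_generate(ANSWER_TOKENS, 5_000)

print(f"cloud: {cloud_seconds:.2f}s, local: {local_seconds:.2f}s")
print(f"extra wait on local: {local_seconds - cloud_seconds:.2f}s")
```

Under those assumptions local is twice as slow, but the absolute gap is a small fraction of a second per answer, which is the sense in which it would make no difference for a single user.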

Aurornis · 2025-08-08T20:08:10 1754683690

There will always be something better on big data center hardware.

However, small models are continuing to improve at the same time that large RAM capacity computing hardware is becoming cheaper. These two will eventually intersect at a point where local performance is good enough and fast enough.

kingo55 · 2025-08-08T20:37:30 1754685450

If you've tried gpt-oss:120b and Moonshot AI's Kimi Dev, it feels like this is getting closer to reality. Mac Studios, while expensive, are now offering 512GB of usable RAM as well. The tooling available for running local models is also becoming more accessible than even just a year ago.

kasey_junk · 2025-08-08T19:30:56 1754681456

I’d be surprised by that outcome. At one point databases were cutting edge tech with each engine leapfrogging the others in capability. Still, the proprietary DBs often have features that aren’t matched elsewhere.

But the open DBs got good enough that you need to justify not using them with specific reasons why.

That seems at least as likely an outcome for models as they continue to improve infinitely into the stars.

duxup · 2025-08-08T19:30:06 1754681406

Maybe, but my phone has become a "good enough" computer for most tasks compared to a desktop or my laptop.

Seems plausible the same goes for AI.

zwnow · 2025-08-08T19:49:54 1754682594

You know there's a ceiling to all this with the current LLM approaches, right? They won't become that much better; it's even more likely they will degrade. There are cases of bad actors attacking LLMs by feeding them false information and propaganda. I don't see this changing in the future.

withinboredom · 2025-08-08T23:23:59 1754695439

I seeded all over the internet that a friend of mine was an elephant with the intention of poisoning the well, so to speak. (with his permission, of course)

That was in 2021. Today if you ask who my friend is, it tells you that he is an elephant, without even doing a web search.

I wouldn’t be surprised if people are doing this with more serious things.

jokethrowaway · 2025-08-09T02:12:40 1754705560

Looks like they patched it (tested on Claude, ChatGPT; I assume it's Rob) but your point is very valid.

kvakerok · 2025-08-08T19:38:26 1754681906

What is even the point of having a self-hosted GPT-5 equivalent that's not into petabytes of knowledge?

pfannkuchen · 2025-08-08T19:29:57 1754681397

It might change once the companies switch away from lighting VC money on fire mode and switch to profit maximizing mode.

I remember Uber and AirBnB used to seem like unbelievably good deals, for example. That stopped eventually.

oblio · 2025-08-08T21:18:56 1754687936

AirBNB is so good that it's half the size of Booking.com these days.

And Uber is still big but about 30% of the time in places I go to, in Europe, it's just another website/app to call local taxis from (medallion and all). And I'm fairly sure locals generally just use the website/app of the local company, directly, and Uber is just a frontend for foreigners unfamiliar with that.

pfannkuchen · 2025-08-08T21:45:18 1754689518

Right, but if you wanted to start a competitor it would be a lot easier today than back then. Running one for yourself doesn’t really apply to these, but in terms of the order-of-magnitude difference in spend it’s the same idea.

jeremyjh · 2025-08-08T20:35:32 1754685332

This I could see.

bee_rider · 2025-08-09T01:43:51 1754703831

Hardware is slower to design and manufacture than we expect as software people.

What I think we’ll see is: people will realize some things that suck in the current first-generation of laptop NPUs. The next generation of that hardware will get better as a result. The software should generally get better and lighter. We’re currently at step -.5 here, because ~nobody has bought these laptops yet! This will happen in a couple years.

Meanwhile, eventually the cloud LLM hosts will run out of investors’ money to subsidize our use of their computers. They’ll have to actually start charging enough to make a profit. On top of what local LLM folks have to pay, the cloud folks will have to pay:

* Their investors

* Their security folks

* The disposal costs for all those obsolete NVIDIA cards

Plus the remote LLM companies will have the fundamental disadvantage that your helpful buddy that you use as a psychologist in a pinch is also reporting all your darkest fears to Microsoft or whoever. Or your dev tools might be recycling all the work you thought you were doing for your job, back into their training set. And might be turned off. It just seems wildly unappealing.

ActorNightly · 2025-08-08T22:01:56 1754690516

>but when you factor in the performance of the models you have access to, and the cost of running them on-demand in a cloud, it's really just a fun hobby instead of a viable strategy to benefit your life.

It's because people are thinking too linearly about this, equating model size with usability.

Without going into too much detail, because this may be a viable business plan for me, I have had very good success with a Gemma QAT model that runs quite well on a 3090, wrapped up in a very custom agent format that goes beyond simple prompt->response use. It can do things that even the full-size large language models fail to do.

bigyabai · 2025-08-08T19:43:01 1754682181

> anything you pick up second-hand will still deprecate at that pace

Not really? The people who do local inference most (from what I've seen) are owners of Apple Silicon and Nvidia hardware. Apple Silicon has ~7 years of decent enough LLM support under its belt, and Nvidia is only now starting to deprecate 11-year-old GPU hardware in drivers.

If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s. Maybe even faster inference because of MoE architectures or improvements in the backend.
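
As a back-of-the-envelope illustration of why MoE architectures can speed up inference on the same machine: single-stream decoding is usually limited by memory bandwidth, so a rough upper bound on tokens per second is bandwidth divided by the bytes of weights read per token. Everything in this sketch is an assumption for illustration: the 800 GB/s figure is the Mac Studio number quoted earlier in the thread, the 30B dense and 3B active parameter counts are hypothetical, and 4-bit weights are taken as 0.5 bytes per parameter.

```python
def decode_tokens_per_second(bandwidth_gb_s, active_params_billions, bytes_per_param):
    """Rough upper bound on decode speed when reading weights is the bottleneck.

    Each generated token has to stream every *active* parameter from memory once,
    so the bound is bandwidth / (active parameters * bytes per parameter).
    Real throughput is lower (KV cache reads, compute, overhead).
    """
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 800      # assumption: figure quoted earlier in the thread
BYTES_PER_PARAM = 0.5     # assumption: 4-bit quantized weights

# Hypothetical dense model: all 30B parameters are read for every token.
dense_tps = decode_tokens_per_second(BANDWIDTH_GB_S, 30, BYTES_PER_PARAM)
# Hypothetical MoE model of similar total size with only 3B parameters active per token.
moe_tps = decode_tokens_per_second(BANDWIDTH_GB_S, 3, BYTES_PER_PARAM)

print(f"dense bound: {dense_tps:.0f} tok/s, MoE bound: {moe_tps:.0f} tok/s")
```

The hardware is unchanged in both cases; only the amount of weight data touched per token shrinks, which is how an older machine can end up generating faster than it did when it was new.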

Uehreka · 2025-08-08T20:30:54 1754685054

People on HN do a lot of wishful thinking when it comes to the macOS LLM situation. I feel like most of the people touting the Mac’s ability to run LLMs are either impressed that they run at all, are doing fairly simple tasks, or just have a toy model they like to mess around with and it doesn’t matter if it messes up.

And that’s fine! But then people come into the conversation from Claude Code and think there’s a way to run a coding assistant on Mac, saying “sure it won’t be as good as Claude Sonnet, but if it’s even half as good that’ll be fine!”

And then they realize that the heavvvvily quantized models that you can run on a mac (that isn’t a $6000 beast) can’t invoke tools properly, and try to “bridge the gap” by hallucinating tool outputs, and it becomes clear that the models that are small enough to run locally aren’t “20-50% as good as Claude Sonnet”, they’re like toddlers by comparison.

People need to be more clear about what they mean when they say they’re running models locally. If you want to build an image-captioner, fine, go ahead, grab Gemma 7b or something. If you want an assistant you can talk to that will give you advice or help you with arbitrary tasks for work, that’s not something that’s on the menu.

EagnaIonat · 2025-08-09T06:26:11 1754720771

> I feel like most of the people touting the Mac’s ability to run LLMs are either impressed that they run at all, are doing fairly simple tasks, or just have a toy model they like to mess around with and it doesn’t matter if it messes up.

I feel like you haven't actually used it. Your comment may have been true 5 years ago.

> If you want an assistant you can talk to that will give you advice or help you with arbitrary tasks for work, that’s not something that’s on the menu.

You can use a RAG approach (e.g. Milvus) and also LoRA templates to dramatically improve the accuracy of the answer if needed.
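
For readers unfamiliar with the idea, here is a toy sketch of the retrieval step in a RAG setup. It is not Milvus and not any particular library's API: it swaps in a plain bag-of-words cosine similarity from the standard library where a real setup would use an embedding model and a vector database, and the notes, function names, and prompt wording are all made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Prepend the retrieved context to the question before sending it to a model."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical private notes the local model was never trained on.
NOTES = [
    "The office VPN requires the corporate certificate to be renewed every 90 days.",
    "Expense reports are due on the fifth business day of each month.",
    "The staging database is refreshed from production every Sunday night.",
]

print(build_prompt("When are expense reports due?", NOTES))
```

The resulting prompt would then go to whatever local model is running; the model only has to read the answer out of the supplied context, which is a much easier task for a small model than recalling it.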

Locally you can run multiple models, multiple times without having to worry about costs.

You also have the likes of Open WebUI which builds numerous features on top of an interface if you don't want to do coding.

I have a very old M1 MBP 32GB and I have numerous applications built to do custom work. It does the job fine and speed is not an issue. Not good enough to do a LoRA build but I have a more recent laptop for that.

I doubt I am the only one.

bigyabai · 2025-08-08T20:43:33 1754685813

I agree completely. My larger point is that Apple and Nvidia's hardware has depreciated more slowly, because they've been shipping highly dense chips for a while now. Apple's software situation is utterly derelict and it cannot be seriously compared to CUDA in the same sentence.

For inference purposes, though, compute shaders have worked fine for all 3 manufacturers. It's really only Nvidia users that benefit from the wealth of finetuning/training programs that are typically CUDA-native.

Aurornis · 2025-08-08T20:13:40 1754684020

> If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s.

I think this is the difference between people who embrace hobby LLMs and people who don’t:

The token/s output speed on affordable local hardware for large models is not great for me. I already wish the cloud hosted solutions were several times faster. Any time I go to a local model it feels like I’m writing e-mails back and forth to an LLM, not working with it.

And also, the first Apple M1 chip was released less than 5 years ago, not 7.

bigyabai · 2025-08-08T21:01:17 1754686877

> Any time I go to a local model it feels like I’m writing e-mails back and forth

Do you have a good accelerator? If you're offloading to a powerful GPU it shouldn't feel like that at all. I've gotten ChatGPT speeds from a 4060 running the OSS 20B and Qwen3 30B models, both of which are competitive with OpenAI's last-gen models.

> the first Apple M1 chip was released less than 5 years ago

Core ML has been running on Apple-designed silicon for 8 years now, if we really want to get pedantic. But sure, actual LLM/transformer use is a more recent phenomenon.

SteveJS · 2025-08-08T23:16:42 1754695002

AFAICT, the RTX 4090 I bought in 2023 has actually appreciated rather than depreciated.

alliao · 2025-08-08T22:07:31 1754690851

really depends on whether local model satisfies your own usage right? if it works locally well enough, just package it up and be content? as long as it's providing value now at least it's local...

ekianjo · 2025-08-09T00:31:26 1754699486

Once the models behind APIs start monetizing their results, their outputs will get much worse. It's just a matter of time.

isaacremuant · 2025-08-08T23:19:48 1754695188

Everything you're saying is FUD. There's immense value in being able to do local or remote as you please and part of it is knowledge.

Also, at the end of the day it's about the value created. AI may allow some people to generate more stuff, but overall value still tends to align with who is better at the craft pre-AI, not who pays more.

cyanydeez · 2025-08-08T21:51:04 1754689864

Anything you build in the LLM cloud will be. Must be. Rug pulled either via locking success or utter bankruptcy or just a model context prompt change.

Unless you're a billionaire with pull, you're building tools you can't control, can't own, and that are ephemeral wisps.

That's if you can even trust these large models to be consistent.

washadjeffmad · 2025-08-08T23:48:01 1754696881

It's not that bad. If you're an adult making a living wage, and you're literate in some IT principles and AGI operations know-how, it's not a major one-time investment. And you can always learn. I'm sure your argument deterred a lot of your parents' generation from buying computers, too. Where would most of us be if not for that? This is a second transistor moment, right in our lifetime.

Life is about balance. If you Boglehead everything and then die before retirement, did you really live?

braooo · 2025-08-08T19:10:10 1754680210

Running LLMs at home is a repeat of the mess we make with "run a K8s cluster at home" thinking

You're not OpenAI or Google. Just use pytorch, opencv, etc to build the small models you need.

You don't need Docker even! You can share over a simple code based HTTP router app and pre-shared certs with friends.

You're recreating the patterns required to manage a massive data center in 2-3 computers in your closet. That's insane.

frank_nitti · 2025-08-08T19:27:37 1754681257

For me, this is essential. On principle, I won't pay money to be a software engineer.

I never paid for cloud infrastructure out of pocket, but still became the go-to person and achieved lead architecture roles for cloud systems, because learning the FOSS/local tooling "the hard way" put me in a better position to understand what exactly my corporate employers can leverage with the big cash they pay the CSPs.

The same is shaping up in this space. Learning the nuts and bolts of wiring systems together locally with whatever Gen AI workloads it can support, and tinkering with parts of the process, is the only thing that can actually keep me interested and able to excel on this front relative to my peers who just fork out their own money to the fat cats that own billions worth of compute.

I'll continue to support efforts to keep us on the track of engineers still understanding and able to 'own' their technology from the ground up, if only at local tinkering scale

jtbaker · 2025-08-08T20:11:34 1754683894

Self hosting my own LLM setup in the homelab was what really helped me learn the fundamentals of K8s. If nothing else I'm grateful for that!

Imustaskforhelp · 2025-08-08T19:31:35 1754681495

So I love Linux and wish to learn devops in its entirety one day, enough to be an expert who can actually comment on the whole post, but

I feel like they actually used Docker just for the isolation part, as a sandbox (technically they didn't use Docker but something similar to it for Mac (Apple containers)). I don't think that it has anything to do with K8s or scalability or pre-shared certs or an HTTP router :/

meta_ai_x · 2025-08-08T19:27:40 1754681260

This is especially true since AI is a large multiplicative factor to your productivity.

If Cloud LLMs have 10 IQ points > local LLM, within a month, you'll notice you'll be struggling behind the dude who just used Cloud LLM.

LocalLlama is for hobbies, or for when your job depends on running LocalLlama.

This is not one-time upfront setup cost vs payoff later tradeoff. It is a tradeoff you are making every query which compounds pretty quickly.

Edit : I expect nothing better than downvotes from this crowd. How HN has fallen on AI will be a case study for the ages