Hacker News | nbardy's comments

How much of your RAM does that use, including the KV cache? Is there enough left to run real dev workloads AND the LLM?

Also, can you batch effectively, like vLLM on CUDA?

Enough to run multiple agents at the same time with decent throughput?
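As a back-of-the-envelope illustration of why the KV cache question matters, here is a rough size estimate in Python. The model dimensions are hypothetical (chosen to resemble a 70B-class GQA model, not any specific one):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # One K and one V tensor per layer, per KV head, per token;
    # fp16 elements -> 2 bytes each by default.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 70B-class model: 80 layers, 8 KV heads (GQA),
# head_dim 128, 32k context, fp16 cache, batch of 1
print(kv_cache_bytes(80, 8, 128, 32_768, 1) / 2**30)  # -> 10.0 (GiB)
```

Batched serving multiplies this per sequence, which is why paged KV management matters and why, on unified-memory Macs, the cache competes directly with whatever RAM your dev workloads need.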


Why does Apple want to make this hardware hard to access?

What actual benefits do they get?

I guess they can have their own models run faster than the competition on their hardware? But as far as I can tell they don't really have anything consumer-facing that uses the ANE, and local LLMs are taking off on Macs and could really benefit from this.


I suspect the main benefits are that they have no need to maintain the hardware or software for any longer than it makes sense for their own needs, and don't have to handhold users through a constantly evolving minefield of performance and technical capabilities.

They are far behind. Go check re-swe bench to see the measured overfitting.

Or just try to use them. They don’t generalize as well.

They are benchmaxxed.


They should probably fund their military first.

It’s petulant the way the EU is throwing a hissy fit after we’ve had lop-sided trade deals for years and funding the entire NATO alliance ourselves.

They act like we’re going to war with them when we’re asking for parity and for their self reliance to increase.


>They act like we’re going to war with them when we’re asking for parity and for their self reliance to increase.

The US is literally threatening to invade an EU overseas territory.


That's because not everyone thinks the trade deals were lop-sided, and it's difficult to objectively determine whether they are, given that trade deals are just one lever among millions in the relationship between two countries, one that is constantly recalibrated depending on the others. In a system like this, I think it's pretty difficult to say who's getting more and who's getting less. But Trump doesn't care what is true or false, so for him it's easy to just say whatever suits him best.

Regarding the war, I can assure you that Trump refusing to rule out taking Greenland by force has been seen by the EU as a threat of war, given that Greenland is part of Denmark, an EU member state. Also, applying tariffs when European NATO countries sent troops to Greenland has been perceived as: "Trump wanted to invade Greenland, he felt that EU countries wanted to defend it, so he imposed tariffs because he wanted to invade."

I'm not saying everyone in the EU thinks this, but I think a lot of people do, and this is some context to help you understand Europe's point of view.


> They act like we’re going to war with them when we’re asking for parity and for their self reliance to increase.

Threatening to take over Greenland by force isn't considered "going to war" for you?


Comrade, what is the weather in St. Petersburg?


> They should probably fund their military first.

They should do both. Resilience must be achieved in depth.

> It’s petulant the way the EU is throwing a hissy fit after we’ve had lop-sided trade deals for years and funding the entire NATO alliance ourselves.

Most of the outrage in the EU right now is about Trump's threats against another NATO country (Denmark / Greenland). The funding of NATO has been slowly shifting for a few years already.


If you’re honestly OK with the maths Trump used to calculate the trade deficits then I’m not really sure you’re going to fit in here at HN.


No it’s not. I have written CUDA kernels and 8-bit optimizers with this.

They’re actually very good at speed optimization and can iterate very quickly, taking notes on trials, failures, and benchmarks. I’ve had it write 10 different attempts in around an hour, benchmark them all, then merge them and beat very strong baselines in torch.


> Claude Code officially added native support for the Language Server Protocol (LSP) in version 2.0.74, released in December 2025.

I think from training it's still biased towards simple tooling.

But there is also real power in simple tools: a small set of general-purpose tools beats a bunch of narrow, specific-use-case tools. It's easier for humans to use high-level tools, but LLMs can instantly compose the low-level tools for their use case and learn to generalize; writing insane Perl one-liners is second nature for them in a way it isn't for us.

If you watch the tool calls you'll see they write a ton of one-off small Python programs to test, validate, explore, etc.

If you think about it, any time you use a tool there is probably a 20-line Python program that is better fit to your use case; it's just that it would take you too long to write it, whereas for an LLM that's 0.5 seconds.
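To make that concrete, here is the kind of throwaway script an agent might emit instead of composing grep/sort/uniq (an entirely hypothetical example: counting TODO markers per Python file under a directory):

```python
from pathlib import Path

def todo_counts(root: str) -> dict[str, int]:
    """Map each .py file under root to its number of lines containing 'TODO'."""
    counts = {}
    for path in Path(root).rglob("*.py"):
        n = sum("TODO" in line
                for line in path.read_text(errors="ignore").splitlines())
        if n:
            counts[str(path)] = n
    return counts
```

Nothing here is reusable or general, and that's the point: it's cheaper for the model to regenerate a bespoke script per task than to learn a dedicated tool's interface.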


> but LLMs can instantly compose the low-level tools for their use case and learn to generalize

Hard disagree; this wastes enormous amounts of tokens, and massively pollutes the context window. In addition to being a waste of resources (compute, money, time), this also significantly decreases their output quality. Manually combining painfully rudimentary tools to achieve simple, obvious things -- over and over and over -- is *not* an effective use of a human mind or an expensive LLM.

Just like humans, LLMs benefit from automating the things they need to do repeatedly so that they can reserve their computational capacity for much more interesting problems.

I've written[1] custom MCP servers to provide narrowly focused API search and code indexing, build system wrappers that filter all spurious noise and present only the material warnings and errors, "edit file" hooks that speculatively trigger builds before the LLM even has to ask for it, and a litany of other similar tools.

Due to LLMs' annoying tendency to fall back on inefficient shell scripting, I also had to write a full bash syntax parser and a shell-script-rewriting ruleset engine that lets me silently and trivially rewrite their shell invocations into more optimal forms that use the other tools I've written. That way they don't do expensive, wasteful things like pipe build output through `head`/`tail`/`grep`/etc., which invariably makes them miss important information and either wander off into the weeds or, if they notice, burn a huge number of turns (and time) re-running the commands to get what they need.

Instead, they call build systems directly with arbitrary options, pipe filters, etc., and the command magically gets rewritten into something that produces the ideal output they actually need, without eating more context and unnecessary turns.
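A minimal sketch of what one such rewrite rule could look like, assuming a hypothetical `filtered-build` wrapper that emits only material diagnostics (the names and the single regex rule are illustrative, not from the comment; the real engine described above parses bash properly):

```python
import re

# Rule: a build command piped through a truncating filter loses information,
# so route it through the (hypothetical) filtered-build wrapper instead.
TRUNCATING_PIPE = re.compile(r"^(make|cargo build|go build)\b.*\|\s*(head|tail|grep)\b")

def rewrite(command: str) -> str:
    """Rewrite a shell command the agent is about to run, or return it unchanged."""
    if TRUNCATING_PIPE.match(command.strip()):
        build = command.split("|", 1)[0].strip()  # drop everything after the first pipe
        return f"filtered-build {build}"
    return command
```

For example, `rewrite("make -j8 2>&1 | tail -n 20")` would become `filtered-build make -j8 2>&1`, while unrelated commands pass through untouched.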

LLMs benefit from an IDE just like humans do -- even if an "IDE" for them looks very different. The difference is night and day. They produce vastly better code, faster.

[1] And by "I've written", I mean I had an LLM do it.


Note that the Claude Code LSP integration was actually broken for a while after it was released, so make sure you have a very recent version if you want to try it out.

However, as the parent comment said, it seems to always grep instead unless explicitly told to use the LSP tool.


Correct. If you try to create a coding agent using the raw Codex or Claude API, build your own "write tool", and don't give the model its native patch tool, 70%+ of the time its write/patch fails, because it tries to do the operation with the write/patch tool it was trained on.


Part of the value-add of owning both the model and the tooling.


We are back to RISC vs CISC!


history doesn't repeat but it definitely rhymes


You're way off; this reads more like anti-capitalist political rhetoric than real reasoning.

Look at Nvidia's Nemotron series. They have become a leading open-source training lab themselves, and at this point they're releasing the best training data, training tooling, and models.


When are people going to drop the assumption that immigration is good at all costs?

We need a well-managed set of immigration policies or countries WILL take advantage of the US. These are our military rivals, and we sell our most advanced math, physics, and engineering seats to the highest bidder. It's a self-destructing disaster, and it's not just on us to treat people better.

Look at the rate of Indian asylum seekers in Canada to see the most extreme case. It happens anywhere you extend naivety and boundless good will.


Those ARC-AGI-2 improvements are insane.

That's especially encouraging to me because those are all about generalization.

5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane, as opposed to Opus 4.5, which is lovely at self-correcting.

It’s one of those things you really feel in the model: not whether it can tackle a harder problem, but whether I can go back and forth with it, learning and correcting together.

This whole release makes me insanely optimistic. If they can push this much improvement WITHOUT the new huge data centers and without a new scaled-up base model, that's incredibly encouraging for what comes next.

Remember, the next big data centers are 20-30x the chip count, with 6-8x the efficiency on the new chips.

I expect they can saturate the benchmarks WITHOUT any novel research or algorithmic gains. But at this point it’s clear they’re capable of pushing research qualitatively as well.


It's also possible that OpenAI used a lot of human-generated ARC-like data for training (semi-cheating). OpenAI has enough incentive to fake a high score.

Without fully disclosing the training data, you will never be sure whether good performance comes from generalization or (semi-)memorization.


> 5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.

This is simply the "openness vs directive-following" spectrum, which as a side effect produces the sycophancy spectrum, and none of them have found an answer to it yet.

Recent GPT models follow directives more closely than Claude models, and are less sycophantic. Even Claude 4.5 models are still somewhat prone to "You're absolutely right!". GPT 5+ (API) models never do this. The byproduct is that the former are willing to self-correct, and the latter is more stubborn.


Opus 4.5 answers most of my non-question comments with ‘you’re right.’ as the first thing in the output. At least I’m not absolutely right, I’ll take this as an improvement.


Hah, maybe 5th gen Claude will change to "you may be right".

The positive thing is that it seems to be more performative than anything. Claude models will say "you're [absolutely] right" and then immediately do something that contradicts it (because you weren't right).

Gemini 3 Pro seems to have struck a decent balance between stubbornness and you're-right-ness, though I still need to test it more.


5.2 seems worse on overfitting for esoteric logic puzzles in my testing: tests using precise language, where attention has to be paid to pick the correct definition, among many, for a given word. It now charges ahead with wrong definitions, with far lower accuracy than before.


Same. The ARC-AGI-2 result also got my attention. That's meaningful, and a HUGE leap.


Slight tangent, yet I think quite interesting: you can try the ARC-AGI-2 tasks by hand at this website [0] (along with other similar problem sets). It really puts into perspective the type of thinking AI is learning!

[0] https://neoneye.github.io/arc/?dataset=ARC-AGI-2


You haven’t actually looked at their fundamentals. They’re profitable serving current models, including training costs, and are only losing money on future R&D training; if you project future revenue growth onto future generations of models, you get a clear path to profitability.

They charge higher prices than OpenAI and have faster-growing API demand. They have great margins on inference compared to the rest of the industry.

Sure the revenue growth could stop but it hasn’t and there is no reason to think it will.


> They’re profitable serving current models including training costs

I hear this a lot; do you have a good source (apart from their CEO saying it in an interview)? I might have more faith in him, but, checks notes, it's late 2025 and AI is not writing all our code yet (amongst other mental things he's said).


The best I can find is this TechCrunch article, which appears to be referencing a paywalled article from The Information.

> The Information reports that Anthropic expects to generate as much as $70 billion in revenue and $17 billion in cash flow in 2028. The growth projections are fueled by rapid adoption of Anthropic’s business products, a person with knowledge of the company’s financials said.

> That said, the company expects its gross profit margin — which measures a company’s profitability after accounting for direct costs associated with producing goods and services — to reach 50% this year and 77% in 2028, up from negative 94% last year, per The Information.

https://techcrunch.com/2025/11/04/anthropic-expects-b2b-dema...


So assuming the gross margin is GAAP (which it probably isn't), this would suggest that the costs of training are covered by inference sales this year (which is definitely good).

However, I'm still a little sceptical, as the cost to train new models is (apparently) going up super-linearly, which means the revenue from inference needs to go up alongside it.

Interesting to think about though, thanks for the source!


We will all have a great source if they IPO :)

