
they just made it temporarily free to anyone, FYI

It's bizarre how many people believe that was literally meant as a Nazi salute.

Rule of Goats.

And they mentioned at the end of the presentation that they're already planning their next datacenter, which will require 5x the power. Not sure if that means equivalent to ~1,000,000 of the current GPUs, or more because next-gen Nvidia chips are more efficient.

The B300 8-way SXMs will use around 1.4kW for each GPU. I think the TDP on an H100 is like 700W.
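
Rough back-of-the-envelope math on what "5x the power" could buy (a sketch only; the GPU count and TDP figures below are rough assumptions, not numbers from the presentation):

    // Assumptions: current cluster ~200k H100-class GPUs at ~700 W each,
    // next-gen B300-class GPUs at ~1.4 kW each.
    val currentGpus   = 200000
    val h100Watts     = 700.0
    val b300Watts     = 1400.0

    val currentPowerW = currentGpus * h100Watts // ~140 MW
    val nextPowerW    = 5 * currentPowerW       // ~700 MW

    val h100Equivalents = nextPowerW / h100Watts // ~1,000,000 H100-equivalents
    val nextGenGpus     = nextPowerW / b300Watts // ~500,000 B300-class GPUs

So at 5x the power you'd land around a million H100-equivalents, but only roughly half a million B300-class GPUs, each of which delivers more compute per chip.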

Grok 3 is at the top of Chatbot Arena with 1400, and the model will continue to improve as it trains more.

And DeepSeek is just 3% behind. It seems that in that benchmark all LLMs perform well, and the top is separated by little more than statistical error.

It could also be that they got "inspired" by DeepSeek, hence the very similar results.

So it could be that their success is mostly about taking an open and free thing and turning it proprietary.


These percentage points don't mean anything. Look up how the Elo system works. They just add 1000 to the result to make it a nicer number.

There are LLMs below 1000 on the leaderboard.

So? Percentages are only meaningful on a scale with a true zero point, which is not the case here.
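
For reference, here's the standard Elo expected-score formula; it's the rating gap, not a percentage of the rating, that maps to a win probability (the ratings below are just illustrative):

    // P(A beats B) = 1 / (1 + 10^((rB - rA) / 400))
    def winProb(rA: Double, rB: Double): Double =
      1.0 / (1.0 + math.pow(10, (rB - rA) / 400.0))

    winProb(1400, 1360) // ~0.557: a 40-point gap is only a slight edge
    winProb(1400, 1000) // ~0.909: the baseline offset is arbitrary; only gaps matter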

And Anthropic isn't even in the top 10...

I keep hearing about Claude's impressive coding skills (relative to its benchmarks), yet it's not evident to me (I use the web version, not Cline). Compared to 4o it's not that great.

My pet theory is that Sonnet was trained really cleverly on a lot of code that resembles real-world cases.

In our small and humble internal evals it regularly beats every other frontier model on some tasks. The shape of capability is really not intuitive or one-dimensional.


I spend four to five hours coding per day and subscribe to every major LLM, and Claude is still by far the best for me personally and for my coworkers.

What are you using it for in general? IME the reason Claude pulls out ahead is that when you use it in a larger existing codebase, it keeps everything "in the style" of that codebase and doesn't veer off into weird territory like all the others.

My experience as well. Working in Scala primarily, it tends to be very good at following the constructs of the project.

Using a specific monad transformer regularly? It'll use that pattern, and often very well, handling all the wrapping and unwrapping needed to move data types about (at least well enough that the odd case where it misses some wrapping or unwrapping is easy to spot and manage).
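
For anyone not working in Scala, a rough sketch of the kind of wrapping and unwrapping I mean (hypothetical types, and it assumes the cats library's EitherT):

    import scala.concurrent.{ExecutionContext, Future}
    import cats.data.EitherT
    import cats.implicits._

    final case class User(id: Long, name: String)
    final case class Order(id: Long, userId: Long)
    sealed trait AppError
    case object UserNotFound extends AppError

    class OrderService(implicit ec: ExecutionContext) {
      // Plain Future-based lookups that return Either
      def findUser(id: Long): Future[Either[AppError, User]] =
        Future.successful(Right(User(id, "demo")))
      def findOrder(user: User): Future[Either[AppError, Order]] =
        Future.successful(Right(Order(42, user.id)))

      // EitherT wraps Future[Either[...]] so the happy path composes in a
      // for-comprehension; .value unwraps back to Future[Either[...]] at the edge.
      def latestOrder(userId: Long): Future[Either[AppError, Order]] =
        (for {
          user  <- EitherT(findUser(userId))
          order <- EitherT(findOrder(user))
        } yield order).value
    }

Claude tends to get both directions of that (lifting into EitherT and calling .value at the boundary) right without being reminded.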

Give a custom GPT or GEM the same source files, and those models regularly fail to maintain style and context, often suggesting solutions that might be fine in isolation but make little sense in the context of the larger codebase. It's almost like they never reliably refer to the code included in the project/GPT/GEM.

Claude, on the other hand, is so consistent about referring to existing artifacts that, as you approach the limit of project size (which is admittedly small), you can use up your entire 5-hour block of credits with just a few back-and-forths.


Lol, no company is making money using 4o; however, thanks to Claude Sonnet, programs like Cursor are usable. 4o agents suck, just try it instead of talking.

I did try it for more than a week, yet 4o is still pretty much better in terms of Python coding and architecture/documentation design.

That doesn't match my experience at all.

I can honestly tell you from my experience that Sonnet 3.5's coding skills did things no other model got right last summer, even though the benchmarks showed it wasn't the best performer at coding tasks.

I prototyped on the weekend and started out with 4o because I had a subscription running.

After an hour and a half-assed working result, I put everything into Claude and it made things significantly better on the first try, and I didn't even have an active Claude subscription.


Really interesting. I used it today and still hit lots of issues. Maybe my Python notebook approach is too complicated for Sonnet? It couldn't fix a complex custom seaborn plot. 4o failed too. o3-mini-high, on the other hand, managed to do it really well.

There is honestly no rhyme or reason to all these opinions; someone was telling me the other day that Claude is for sure the best. Multiple people, actually.

I find it concerning that there are no really accurate benchmarks for this stuff that we can all agree on.


Yet Claude is still the most useful; lmsys is broken for coding.

Any model that censors itself does poorly, despite being able to provide high quality answers.

Anthropic's best model is Sonnet 3.5, in my opinion. The reason it's good is that it's very effective for the price and fast (I do think Google has caught up a lot in this regard). However, not having CoT makes its results worse than similarly cheap CoT-based models.

Leaderboards don't care about cost. Leaderboards largely rank a combination of accuracy + speed. Anthropic has fallen behind Google in accuracy + speed (again, missing CoT), and frankly behind Google in raw speed.


No idea why this was downvoted, but you are correct.

It seems like the team at xAI caught up to OpenAI very quickly, reaching the top of the leaderboard in one of the benchmarks, and also caught up on features with Grok 3.

Giving credit where credit is due, even though this is a race to zero.


We've got more emotional and opinionated people on HN now, and they often react emotionally instead of using logic and staying curious.

Yeah, so many people aren't capable of talking about anything Musk-adjacent with clear thoughts. It's insane how quickly xAI went from not existing to the top of the benchmarks.

I think people here are thinking very clearly about Musk and his various projects.

Not sure about people elsewhere though.


Depends what you mean by "people here". I mean, obviously the majority of HN commenters, and even the majority of commenters on this thread, seem to be thinking clearly. But there will always be a couple of slightly unhinged folk in a big enough group of readers.

Can't you just take DeepSeek and put it behind an API and get to the top of the benchmarks immediately?

I'm not sure what you mean here? Musk has a history of doing incredibly useful and cool things, and also incredibly dumb, cruel, and for some people even terrible things. That context should be part of any clear thinking around him. He does not get a clean slate in every new discussion of him.

There are widespread, legitimate concerns about what kind of person Elon Musk is turning out to be. There is a lot of chatter about fears of China's AI rise, but what happens if we get Elon's brand of cruelty and lack of empathy in an authoritarian superintelligent AI? Is that the AI future we want? Can you imagine an SAI with real power that interacts with people like Elon does on Twitter? I am not sure that is a future I want to live in.


We’re trying to talk about the capabilities of Grok and you can only focus on Musk. That’s what I’m talking about.

Don’t defend a persona over substance. His concerns are valid and relevant to the discussion.

It’s relevant to the subject since he owns it.

There would be no Grok without Musk; any discussion of Grok is going to involve discussion of Musk as well.

You can't see this as separate from Musk. Musk isn't a business-as-usual type.

You know what they say: Fascists are good at keeping the training runs on time.


