
they just made it temporarily free to anyone, FYI

It's bizarre how many people believe that was literally meant as a Nazi salute.

Rule of Goats.

And they mentioned at the end of the presentation that they're already planning their next datacenter, which will require 5x the power. Not sure if that means equivalent to ~1,000,000 of the current GPUs, or more because next-gen Nvidia chips are more efficient.

The B300 8-way SXMs will use around 1.4kW for each GPU. I think the TDP on an H100 is like 700W.
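
Rough back-of-the-envelope math on what "5x the power" could buy (a sketch only; the GPU count and TDP figures below are rough assumptions, not numbers from the presentation):

    // Assumptions: current cluster ~200k H100-class GPUs at ~700 W each,
    // next-gen B300-class GPUs at ~1.4 kW each.
    val currentGpus   = 200000
    val h100Watts     = 700.0
    val b300Watts     = 1400.0

    val currentPowerW = currentGpus * h100Watts // ~140 MW
    val nextPowerW    = 5 * currentPowerW       // ~700 MW

    val h100Equivalents = nextPowerW / h100Watts // ~1,000,000 H100-equivalents
    val nextGenGpus     = nextPowerW / b300Watts // ~500,000 B300-class GPUs

So at 5x the power you'd land around a million H100-equivalents, but only roughly half a million B300-class GPUs, each of which delivers more compute per chip.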

Grok 3 is at the top of Chatbot Arena with 1400, and the model will continue to improve as it trains more.

And DeepSeek is just 3% behind. It seems that in that benchmark all LLMs perform well, and the top is separated by little more than statistical error.

It could also be that they got "inspired" by DeepSeek, hence the very similar results.

So it could be that their success is mostly about taking an open and free thing and turning it proprietary.


These percentage points don't mean anything. Look up how the Elo system works. They just add 1000 to the result to make it a nicer number.

There are LLMs below 1000 on the leaderboard.

So? Percentages are only meaningful on a scale with a true zero point, which is not the case here.
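
For reference, here's the standard Elo expected-score formula; it's the rating gap, not a percentage of the rating, that maps to a win probability (the ratings below are just illustrative):

    // P(A beats B) = 1 / (1 + 10^((rB - rA) / 400))
    def winProb(rA: Double, rB: Double): Double =
      1.0 / (1.0 + math.pow(10, (rB - rA) / 400.0))

    winProb(1400, 1360) // ~0.557: a 40-point gap is only a slight edge
    winProb(1400, 1000) // ~0.909: the baseline offset is arbitrary; only gaps matter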

And Anthropic isn't even in the top 10...

I keep hearing about Claude's impressive coding skills (relative to its benchmarks), yet it's not evident to me (I use the web version, not Cline). Compared to 4o it's not that great.

My pet theory is that Sonnet was trained really cleverly on a lot of code that resembles real-world cases.

In our small and humble internal evals it regularly beats every other frontier model on some tasks. The shape of capability is really not intuitive or one-dimensional.


I spend four to five hours coding per day and subscribe to every major LLM, and Claude is still by far the best for me personally and for my coworkers.

What are you using it for in general? IME the reason Claude pulls out ahead is that when you use it in a larger existing codebase, it keeps everything "in the style" of that codebase and doesn't veer off into weird territory like all the others.

My experience as well. Working in Scala primarily, it tends to be very good at following the constructs of the project.

Using a specific monad transformer regularly? It'll use that pattern, and often very well, handling all the wrapping and unwrapping needed to move data types about (at least well enough that the odd case where it misses some wrapping or unwrapping is easy to spot and manage).
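
For anyone not working in Scala, a rough sketch of the kind of wrapping and unwrapping I mean (hypothetical types, and it assumes the cats library's EitherT):

    import scala.concurrent.{ExecutionContext, Future}
    import cats.data.EitherT
    import cats.implicits._

    final case class User(id: Long, name: String)
    final case class Order(id: Long, userId: Long)
    sealed trait AppError
    case object UserNotFound extends AppError

    class OrderService(implicit ec: ExecutionContext) {
      // Plain Future-based lookups that return Either
      def findUser(id: Long): Future[Either[AppError, User]] =
        Future.successful(Right(User(id, "demo")))
      def findOrder(user: User): Future[Either[AppError, Order]] =
        Future.successful(Right(Order(42, user.id)))

      // EitherT wraps Future[Either[...]] so the happy path composes in a
      // for-comprehension; .value unwraps back to Future[Either[...]] at the edge.
      def latestOrder(userId: Long): Future[Either[AppError, Order]] =
        (for {
          user  <- EitherT(findUser(userId))
          order <- EitherT(findOrder(user))
        } yield order).value
    }

Claude tends to get both directions of that (lifting into EitherT and calling .value at the boundary) right without being reminded.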

Give a custom GPT or GEM the same source files, and those models regularly fail to maintain style and context, often suggesting solutions that might be fine in isolation but make little sense in the context of the larger codebase. It's almost like they never reliably refer to the code included in the project/GPT/GEM.

Claude, on the other hand, is so consistent about referring to existing artifacts that, as you approach the limit of project size (which is admittedly small), you can use up your entire 5-hour block of credits with just a few back-and-forths.


Lol, no company is making money using 4o; however, thanks to Claude Sonnet, programs like Cursor are usable. 4o agents suck, just try it instead of talking.

I did try it for more than a week, yet 4o is still pretty much better in terms of Python coding and architecture/documentation design.

That doesn't match my experience at all.

I can honestly tell you from my experience that Sonnet 3.5's coding skills did things no other model got right last summer, even though the benchmarks showed it wasn't the best performer at coding tasks.

I prototyped on the weekend and started out with 4o because I had a subscription running.

After an hour and a half-assed working result, I put everything into Claude and it made things significantly better on the first try, and I didn't even have an active Claude subscription.


Really interesting. I used it today and still hit lots of issues. Maybe my Python notebook approach is too complicated for Sonnet? It couldn't fix a complex custom seaborn plot. 4o failed too. o3-mini-high, on the other hand, managed to do it really well.

There is honestly no rhyme or reason to all these opinions; someone was telling me the other day that Claude is for sure the best. Multiple people, actually.

I find it concerning that there are no really accurate benchmarks for this stuff that we can all agree on.


Yet Claude is still the most useful; lmsys is broken for coding.

Any model that censors itself does poorly, despite being able to provide high quality answers.

Anthropic's best model is Sonnet 3.5, in my opinion. The reason it's good is that it's very effective for the price and fast (I do think Google has caught up a lot in this regard). However, not having CoT makes its results worse than similarly cheap CoT-based models.

Leaderboards don't care about cost. Leaderboards largely rank a combination of accuracy + speed. Anthropic has fallen behind Google in accuracy + speed (again, missing CoT), and frankly behind Google in raw speed.


No idea why this was downvoted, but you are correct.

It seems like the team at xAI caught up to OpenAI very quickly, reaching the top of the leaderboard in one of the benchmarks, and also caught up on features with Grok 3.

Giving credit where credit is due, even though this is a race to zero.


We've got more emotional and opinionated people on HN now, and they often react emotionally instead of using logic and staying curious.

Yeah, so many people aren't capable of talking about anything Musk-adjacent with clear thoughts. It's insane how quickly xAI went from not existing to the top of the benchmarks.

I think people here are thinking very clearly about Musk and his various projects.

Not sure about people elsewhere though.


Depends what you mean by "people here". I mean, obviously the majority of HN commenters, and even the majority of commenters on this thread, seem to be thinking clearly. But there will always be a couple of slightly unhinged folk in a big enough group of readers.

Can't you just take DeepSeek and put it behind an API and get to the top of the benchmarks immediately?

I'm not sure what you mean here? Musk has a history of doing incredibly useful and cool things, and also incredibly dumb, cruel, and for some people even terrible things. That context should be part of any clear thinking around him. He does not get a clean slate in every new discussion of him.

There are widespread, legitimate concerns about what kind of person Elon Musk is turning out to be. There is a lot of chatter about fears of China's AI rise, but what happens if we get Elon's brand of cruelty and lack of empathy in an authoritarian superintelligent AI? Is that the AI future we want? Can you imagine an SAI with real power that interacts with people like Elon does on Twitter? I am not sure that is a future I want to live in.


We’re trying to talk about the capabilities of Grok and you can only focus on Musk. That’s what I’m talking about.

Don’t defend a persona over substance. His concerns are valid and relevant to the discussion.

It’s relevant to the subject since he owns it.

There would be no Grok without Musk; any discussion of Grok is going to involve discussion of Musk as well.

You can't see this as separate from Musk. Musk isn't a business-as-usual type.

You know what they say: Fascists are good at keeping the training runs on time.


