Four new models that are benchmarking near or above GPT-4 (simonwillison.net)
66 points by simonw on March 8, 2024 | 34 comments



I just finished benchmarking both Claude 3 models for coding. Claude 3 Opus outperforms all of OpenAI’s models, making it the best available model for pair programming with AI.

The new claude-3-opus-20240229 model got the highest score ever on this benchmark, completing 68.4% of the tasks. While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.

https://aider.chat/2024/03/08/claude-3.html


It's nothing short of amazing how quickly we went from "nobody can compete with OpenAI" to "they're still cheapest".


The benchmarks are so easy to game that they're not meaningful unless you see a major difference, and even then it can be questionable. For example, did anyone read the Gemini paper[0]?

The team details the slight modification they made to the metric, which is basically a tweak on maj@32 that worked in their favor. They chose to take the majority vote of 32 samples unless the probability of the highest sample was too low, in which case they chose greedy decoding; they also had to learn what "too low" was. Sound convoluted and a bit contrived? They even show plots that clearly demonstrate they settled on the metric because it's the one where the GPT-4 results remain the same, but Gemini wins.
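
For what it's worth, here's a rough Python sketch of the kind of decision rule described above: majority vote over 32 samples with a fallback to the greedy answer when confidence in the top vote is too low. The threshold value and the use of vote share as the confidence signal are my own simplifications for illustration, not what the paper actually does:

    # Rough sketch of the metric described above: majority vote over 32
    # samples, falling back to the greedy (temperature-0) answer when
    # confidence in the top vote is too low. The threshold and the use of
    # vote share as the confidence signal are simplifications, not the
    # paper's actual implementation.
    from collections import Counter

    def maj32_with_greedy_fallback(sampled_answers, greedy_answer, threshold=0.4):
        votes = Counter(sampled_answers)
        top_answer, top_count = votes.most_common(1)[0]
        confidence = top_count / len(sampled_answers)
        # If no answer dominates the vote, fall back to greedy decoding.
        return top_answer if confidence >= threshold else greedy_answer

    # Weak consensus (12/32 = 0.375) falls below the threshold, so the
    # greedy answer is reported instead.
    samples = ["A"] * 12 + ["B"] * 11 + ["C"] * 9
    print(maj32_with_greedy_fallback(samples, greedy_answer="B"))  # -> B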

Don't get me wrong, LLM evaluation is essential. But in order to do this you need to be very consistent between comparisons and even then the results can be misleading.

0. https://arxiv.org/pdf/2312.11805.pdf


Yes, since GPT-4 is the one to beat, everyone is basically doing the equivalent of p-hacking their benchmarks. If you roll the dice sufficiently many times and then only report your best roll, you can make it look like your model is better than it is.
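
A toy simulation of that effect, assuming nothing more than noisy benchmark runs (all numbers made up):

    # Two "models" with identical true ability; one reports a single
    # benchmark run, the other reports the best of 32 runs (different
    # seeds, prompts, metric tweaks, ...). Numbers are arbitrary.
    import random

    random.seed(0)
    TRUE_SCORE, NOISE = 0.70, 0.03  # both models are "really" at 70%

    def noisy_run():
        return random.gauss(TRUE_SCORE, NOISE)

    honest = noisy_run()                                  # one run, reported as-is
    cherry_picked = max(noisy_run() for _ in range(32))   # best of 32 runs

    print(f"honest run:  {honest:.3f}")
    print(f"best of 32:  {cherry_picked:.3f}")  # typically several points higher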


Is anyone aware of a GitHub Copilot-equivalent JetBrains plugin that can switch between models? Everything else I've looked at seems to just embed a chat UI or force you to do some kind of:

    /chat Here is my message

in your code to invoke the LLM. Copilot is seamless and I've grown accustomed to how it works. Ideally I could also put in my own local LLM to test with. I've seen Continue.dev, but it seems to require some kind of manual invocation (I'd love to be wrong).

I'd like to try swapping out GPT-4 for Claude 3 Opus but I'm not super interested in changing my workflow. I manually use ChatGPT or Copilot Chat (less so) for some cases but I really like the in-line suggestions in my editor.


I’m pretty sure Copilot just came out for some JetBrains products. Like literally this week. At least two people on my team have already tried it in IDEA. IIRC from the thread they were messaging in, it was still behind some nightly build and not on stable, or something. I’m sure you can google it. =)


GitHub Copilot or a different one? I’ve been using GitHub Copilot since it came out in IDEA.


There is Continue, still very much a work in progress.

https://plugins.jetbrains.com/plugin/22707-continue


Do you know if it does the auto-suggestion like Copilot?


All those benchmarks are sooo bad that I'm not really convinced one way or another. Having tried out a few of them, I think I still roughly prefer GPT-4 (though I haven't worked with any of them long enough to make a clear judgement).


> none of these models are being transparent about their training data.

The rumor, with no supporting evidence, of course, is that their training corpora include all of LibGen.


> Inflection-2.5, March 7th. This one came out of left field for me: Inflection make Pi, a conversation-focused chat interface that felt a little gimmicky to me when I first tried it. Then just the other day they announced that their brand new 2.5 model benchmarks favorably against GPT-4, and Ethan Mollick—one of my favourite LLM sommeliers—noted that it deserves more attention.

Inflection's offering looks like a scam, given that the responses it gives are very close to or the same as Anthropic's Claude 3, to the point of possibly being an API wrapper around the Claude API [0] packaged as a product.

Pi is more like a toy with performative use cases. How the company was able to raise millions of dollars on that makes no sense whatsoever other than pure VC hype.

[0] https://twitter.com/seshubon/status/1765870717844050221


Inflection have denied this - it could be that Pi has a very long memory and hence the results this user got were influenced by a much earlier conversation: https://twitter.com/inflectionAI/status/1766173427441049684

I'm withholding judgement until more information comes out. It's a really weird situation.


> ...consistently at the top of every key benchmark, but more importantly the clear winner in terms of “vibes”.

Can anybody explain what is meant by "vibes" here? Is it an industry term?


It's a term to represent that, in the real world, the quality of a model can't be measured objectively, so the most common way to evaluate models personally is to see whether you like the output, with all the selection, anchoring, and recency biases that entails.


It's a bit of an industry in-joke. The problem with LLM benchmarks is that they don't really help you get a "feel" for what it's like using them.

A model might score off the charts but not be particularly useful for your personal style of prompting.


First time I have heard the term (and I have done quite a bit of LLM benchmarking) but I love it! Similar to the xkcd "WTFs per minute" measure of code quality, I guess.

Everyone I know who plays with LLMs says things like

- it didn't answer my special secret benchmark question better than previous models

- it surprised me by being able to give a satisfying answer for x that I wasn't expecting

- it always seems to give answers that are [Some negative quality]

Obviously these things are super subjective but still important and useful in a way that "answers 5th grade maths questions more accurately than previous models" sometimes isn't


When I have a new use for AI, I'll try Claude, Bing, and Gemini and go with whichever is best for that use case. Sometimes I'll have another model check and offer suggestions for improvement, like for Etsy listing titles or descriptions, etc.


Qualitative evaluations from simply trying the thing. Many things aren't captured by the quantitative benchmarks, which are easy to game.


Translation: "mind share"


It's worth noting that most of these models have a similar price-per-1M-tokens, so differentiation is mostly whether the model output has the right "vibe" for your use case, which can't easily be represented in most benchmarks.

Competition will encourage a price race to the bottom (marginal cost = marginal revenue), but I question whether some of these companies will be able to maintain it long-term without lots and lots of venture capital.


I have a use case where GPT-3.5 is a bit inconsistent and GPT-4 is a little too expensive. I’ve been happy with Claude 3 Sonnet, though. They’re definitely worth a try and provide more granular choice.


More datapoints I've been collecting for the latent.space recap:

- lmsys ranks Claude 3 Opus ahead of GPT-4 but behind GPT-4 Turbo (https://x.com/lmsysorg/status/1765774296000172289). Mistral Large is also well behind Claude 3 Sonnet, and both are behind GPT-4, which is... good but disappointing?

- AI2 WildBench largely agrees with this ranking: https://x.com/billyuchenlin/status/1766079601154064688?s=20

- Claude's summarization is better, at the cost of higher hallucination: https://twitter.com/swyx/status/1764805626037993853

It's also noteworthy that Gemini and Claude seem to have completely solved the needle-in-a-haystack problem vs OpenAI/Mistral - I'm not sure what breakthrough is responsible, but the open-source community seems to be pointing at RingAttention as the main promise.


Great post as usual! Thank you! I was wondering about this though:

> I use these models a lot

I have very few uses for AI myself; they mostly center around writing scripts in languages I'm unfamiliar with, so I don't have to spend time looking up the specific syntax. But that wouldn't count as "a lot".

Beyond curiosity and model testing, what do people really use AI for?


Writing code. Quick data analysis (with Code Interpreter and the like). Brainstorming ideas for things. World's best thesaurus. Entertainment.

Lots of quick answers to questions: my rule of thumb is that if someone who had just read the relevant Wikipedia page could answer my question, and the stakes aren't particularly high, I'll use an LLM.

Jargon decoding - I can read academic papers now!


Replacing Stack Overflow for all the stupid questions you might have. Or the intuitions you have and want to check. Or getting examples on a topic. All this with no one to berate you over points or dupes or "you haven't even googled it", or for being stupid.


https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

The LLM arena shows the performance of GPT-4 compared to Claude 3 and Mistral Large. (GPT-4 is still on top.)


A little past the headline:

> Not every one of these models is a clear GPT-4 beater, but every one of them is a contender.


I've seen these benchmarks over and over and I wonder how they work. Seems to me you would need something close to a test or exam like in schools to properly rate a model. How useful are these metrics?


The Open LLM Leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...) is a good indicator of progress in the space that hedges on a variety of potential use cases, but for real world use it cannot replace manually experimenting with models to ensure output is what you expect for your use case, especially creative use cases which these benchmarks do not capture.


GPT has been having some issues the past couple of days. Returning the "network retry" error dialog for hours on end.

This gave me a window to try out Gemini. 1.0 was better than I anticipated, but it still got some of the (important) details wrong. I wish I had access to 1.5.

The competition is closing in, and the frequent downtime makes it an easy choice to try the competitors.

That said, I'm still pulling for OpenAI. We need someone to challenge the old tech companies.

I hope GPT-5 is similar to Sora - ten steps forward in terms of innovation.


Ultra has a 2-month free trial.


Could use some examples of the improved output. It shouldn't be that hard.

What are each model's strengths? I know GPT-4 has a strength with code.


I linked to an example of Claude 3 beating GPT-4 on code from the article.

GPT-4: https://chat.openai.com/share/117fb1ad-6361-41e2-be59-110f32...

Claude 3 Opus: https://gist.github.com/simonw/2002e2b56a97053bd9302a34e0b83...

The GPT-4 code didn't work - it was missing some async keywords.

The Claude 3 Opus code worked perfectly.
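
For anyone who doesn't want to click through: the actual code is in the linked transcripts, but the failure class is roughly this kind of thing (a hypothetical example, not the real output):

    # Hypothetical illustration of the failure class (not the code from the
    # linked transcripts): using `await` inside a function that was never
    # declared `async` is a SyntaxError before anything even runs.
    import asyncio

    async def fetch_data():
        await asyncio.sleep(0.1)
        return {"ok": True}

    # Broken shape - what "missing async keywords" tends to look like:
    # def main():
    #     data = await fetch_data()  # SyntaxError: 'await' outside async function

    # Working shape:
    async def main():
        data = await fetch_data()
        print(data)

    asyncio.run(main())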



