I just finished benchmarking both Claude 3 models for coding. Claude 3 Opus outperforms all of OpenAI’s models, making it the best available model for pair programming with AI.
The new claude-3-opus-20240229 model got the highest score ever on this benchmark, completing 68.4% of the tasks. While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.
These benchmarks are so easy to game that they're not meaningful unless you see a major difference, and even then the results can be questionable. For example, did anyone read the Gemini paper[0]?
The team details the slight modification they made to the metric, which is basically a tweak on maj@32 that worked in their favor. They chose to do a majority vote over 32 samples unless the probability of the highest sample was too low, in which case they fell back to greedy decoding; they also had to learn what "too low" was. Sound convoluted and a bit contrived? They even show plots that clearly demonstrate they settled on the metric because it's the one where the GPT-4 results stay the same but Gemini wins.
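For the curious, the rule they describe boils down to something like the sketch below. This is my reconstruction from the description in the paper, not their code; the threshold value and where the "probability" comes from are assumptions:

    from collections import Counter

    def tweaked_maj_at_32(sampled_answers, sample_probs, greedy_answer, prob_threshold=0.25):
        """Rough reconstruction of the described rule: majority-vote over 32
        samples, but if the top candidate's probability is below some tuned
        threshold, fall back to the greedy decode. The threshold here is a
        placeholder; the paper had to fit what "too low" means to the data."""
        counts = Counter(sampled_answers)
        top_answer, _ = counts.most_common(1)[0]
        # highest probability the model assigned to any sample of the top-voted answer
        top_prob = max(p for a, p in zip(sampled_answers, sample_probs) if a == top_answer)
        if top_prob < prob_threshold:
            return greedy_answer
        return top_answer

The extra moving parts (the fallback and the learned threshold) are exactly the kind of knobs that make it easy to pick whichever variant happens to win.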
Don't get me wrong, LLM evaluation is essential. But in order to do this you need to be very consistent between comparisons and even then the results can be misleading.
Yes, since GPT-4 is the one to beat, everyone is basically doing the equivalent of p-hacking their benchmarks. If you roll the dice enough times and only report your best roll, you can make your model look better than it is.
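You can see the effect with a toy simulation; every number here is made up purely for illustration:

    import random

    def reported_scores(true_accuracy=0.60, n_questions=200, n_reruns=32, seed=0):
        """Toy illustration of benchmark 'p-hacking': rerun a noisy eval many
        times (different prompts, seeds, metric tweaks) and report only the
        best run. The true accuracy never changes, but the headline creeps up."""
        rng = random.Random(seed)

        def one_run():
            return sum(rng.random() < true_accuracy for _ in range(n_questions)) / n_questions

        runs = [one_run() for _ in range(n_reruns)]
        return sum(runs) / len(runs), max(runs)

    mean, best = reported_scores()
    print(f"honest average: {mean:.1%}   best-of-32 headline: {best:.1%}")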
Is anyone aware of a GitHub Copilot equivalent JetBrains plugin that can switch between models? Everything else I've looked at seems to just embed a chat UI or force you to do some kind of:
/chat Here is my message
in your code to invoke the LLM. Copilot is seamless and I've grown accustomed to how it works. Ideally I could also plug in my own local LLM to test with. I've seen Continue.dev, but it seems to require some kind of manual invocation (I'd love to be wrong).
I'd like to try swapping out GPT-4 for Claude 3 Opus but I'm not super interested in changing my workflow. I manually use ChatGPT or Copilot Chat (less so) for some cases but I really like the in-line suggestions in my editor.
I'm pretty sure Copilot just came out for some JetBrains products. Like literally this week. At least two people on my team have already tried it in IDEA. IIRC from the thread they were messaging in, it was still behind some nightly build and not on stable, or something. I'm sure you can google it. =)
All those benchmarks are sooo bad that I'm not really convinced one way or another. Having tried out a few of the models, I think I still roughly prefer GPT-4 (though I haven't worked with any of them long enough to make a clear judgement).
> Inflection-2.5, March 7th. This one came out of left field for me: Inflection make Pi, a conversation-focused chat interface that felt a little gimmicky to me when I first tried it. Then just the other day they announced that their brand new 2.5 model benchmarks favorably against GPT-4, and Ethan Mollick—one of my favourite LLM sommeliers—noted that it deserves more attention.
Inflection's offering looks like a scam, given that its responses are very close to or the same as Anthropic's Claude 3; it may even be an API wrapper around the Claude API [0] packaged as a product.
Pi is more like a toy with performative use-cases. How the company was able to raise millions of dollars off of that makes absolutely no sense other than pure VC hype.
It's a term to represent that, in the real world, the quality of a model can't be measured objectively, so the most common way to evaluate models personally is to see whether you like the output, with all the selection, anchoring, and recency biases that entails.
First time I have heard the term (and I have done quite a bit of LLM benchmarking) but I love it! Similar to the xkcd "WTFs per minute" measure of code quality, I guess.
Everyone I know who plays with LLMs says things like
- it didn't answer my special secret benchmark question better than previous models
- it surprised me by being able to give a satisfying answer for x that I wasn't expecting
- it always seems to give answers that are [Some negative quality]
Obviously these things are super subjective but still important and useful in a way that "answers 5th grade maths questions more accurately than previous models" sometimes isn't
When I have a new use for AI, I'll try Claude, Bing, and Gemini and go with whichever is best for that use case. Sometimes I'll have another model check the output and offer suggestions for improvement, like for Etsy listing titles or descriptions, etc.
It's worth noting that most of these models have a similar price-per-1M-tokens, so differentiation is mostly whether the model output has the right "vibe" for your use case, which can't easily be represented in most benchmarks.
Competition will encourage a price race to the bottom (marginal cost = marginal revenue), but I question whether some of these companies will be able to maintain it long-term without lots and lots of venture capital.
I have a use case where GPT-3.5 is a bit inconsistent and GPT-4 is a little too expensive. I’ve been happy with Claude 3 Sonnet, though. They’re definitely worth a try and provide more granular choice.
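For anyone weighing the same trade-off, the per-request arithmetic is easy to script. The per-million-token prices below are rough list prices from around March 2024 and may already be stale, so treat them purely as illustrative:

    # (input, output) price in USD per million tokens -- assumed, check current pricing
    PRICES_PER_MTOK = {
        "gpt-3.5-turbo":   (0.50, 1.50),
        "claude-3-sonnet": (3.00, 15.00),
        "gpt-4-turbo":     (10.00, 30.00),
        "claude-3-opus":   (15.00, 75.00),
    }

    def request_cost(model, input_tokens, output_tokens):
        """Dollar cost of one call under the assumed prices above."""
        p_in, p_out = PRICES_PER_MTOK[model]
        return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

    for model in PRICES_PER_MTOK:
        print(f"{model}: ${request_cost(model, 2_000, 800):.4f} per call")  # ~2k in, 800 out

Even if the exact numbers have moved, the order-of-magnitude spread between the cheap tier and Opus is what adds up across thousands of calls.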
more datapoints i've been collecting for the latent.space recap:
- lmsys ranks claude 3 opus ahead of GPT4 but behind GPT4T (https://x.com/lmsysorg/status/1765774296000172289). mistral large is also well behind claude 3 sonnet, and both are behind gpt4, which is... good but disappointing?
- it's also noteworthy that gemini and claude seem to have completely solved the needle-in-a-haystack problem vs openai/mistral - i'm not sure what breakthrough is responsible, but the open-source community seems to be pointing at RingAttention as the main promise
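For anyone unfamiliar with the test: needle-in-a-haystack evals just bury a known fact at some depth in a very long context and ask the model to retrieve it. A minimal sketch of how one gets constructed (the filler text, needle, and call_your_model client are placeholders, not any particular published harness):

    FILLER = "The quick brown fox jumps over the lazy dog. " * 2000   # stand-in long document
    NEEDLE = "The magic number mentioned in the meeting was 42."      # planted fact to retrieve

    def build_haystack_prompt(depth_fraction):
        """Insert the needle at a relative depth inside the filler document and
        ask the model to retrieve it. Real harnesses sweep needle depth and
        context length and plot retrieval accuracy over that grid."""
        sentences = [s for s in FILLER.split(". ") if s]
        insert_at = int(len(sentences) * depth_fraction)
        sentences.insert(insert_at, NEEDLE.rstrip("."))
        document = ". ".join(sentences) + "."
        question = "What was the magic number mentioned in the meeting?"
        return f"{document}\n\nQuestion: {question}\nAnswer:"

    def is_correct(model_answer):
        return "42" in model_answer   # simple containment scoring

    # prompt = build_haystack_prompt(depth_fraction=0.35)
    # answer = call_your_model(prompt)   # hypothetical client call - plug in your own API
    # print(is_correct(answer))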
Great post as usual! Thank you! I was wondering about this though:
> I use these models a lot
I have very few uses of AI myself, that mostly center around writing scripts in languages I'm unfamiliar with so that I don't have to spend time looking up the specific syntax. But that would not count as "a lot".
Beyond curiosity and model testing, what do people really use AI for?
Writing code. Quick data analysis (with Code Interpreter and the like). Brainstorming ideas for things. World's best thesaurus. Entertainment.
Lots of quick answers to questions: my rule of thumb is that if someone who had just read the relevant Wikipedia page could answer my question, and the stakes aren't particularly high, I'll use an LLM.
It replaces Stack Overflow for all the stupid questions you might have. Or the intuitions you have and want to check. Or to get examples on a topic. All this with no one to berate you over points, or dupes, or 'you haven't even googled it', or for being stupid.
I've seen these benchmarks over and over and I wonder how they work. Seems to me you would need something close to a test or exam like in schools to properly rate a model. How useful are these metrics?
The Open LLM Leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...) is a good indicator of progress in the space and hedges across a variety of potential use cases. But for real-world use it can't replace manually experimenting with models to ensure the output is what you expect, especially for creative use cases, which these benchmarks don't capture.
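Under the hood, most of these leaderboard benchmarks really are exam-like: a fixed question set, a prompt template, and an automatic scoring rule (exact match, multiple-choice letter, or unit-test pass rate for code). A minimal sketch, with a hypothetical ask_model function standing in for whatever API client you actually use:

    # Minimal exam-style eval loop: fixed questions, automatic scoring.
    QUESTIONS = [
        {"prompt": "What is 17 * 24? Answer with just the number.", "answer": "408"},
        {"prompt": "Which planet is known as the Red Planet? One word.", "answer": "Mars"},
    ]

    def ask_model(prompt):
        raise NotImplementedError("plug in your API client here")  # placeholder

    def run_eval(questions=QUESTIONS):
        correct = 0
        for q in questions:
            reply = ask_model(q["prompt"])
            # exact-match scoring; real benchmarks add answer normalization,
            # few-shot prompting, log-prob comparison of options, or test execution
            if reply.strip().lower() == q["answer"].lower():
                correct += 1
        return correct / len(questions)

The hard part isn't the loop, it's whether the question set and the scoring rule actually reflect what you want the model to do, which is exactly where the "vibes" gap comes in.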
GPT has been having some issues the past couple of days, returning the "network retry" error dialog for hours on end.
This gave me a window to try out Gemini. 1.0 was better than I anticipated, but it still got some of the (important) details wrong. I wish I had access to 1.5.
The competition is closing in, and the frequent downtime makes it an easy choice to try the competitors.
That said, I'm still pulling for OpenAI. We need someone to challenge the old tech companies.
I hope GPT-5 is similar to Sora - ten steps forward in terms of innovation.
https://aider.chat/2024/03/08/claude-3.html