
Incredible, rivals Llama 3 8B with 3.8B parameters, less than a week after release.

And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.

(I'm sure there's a lot of nuance to it, for one these benchmarks are not so hard to game, we'll see how the dust settles, but still...)

Phi-3-mini 3.8b: 71.2

Phi-3-small 7b: 74.9

Phi-3-medium 14b: 78.2

Phi-2 2.7b: 58.8

Mistral 7b: 61.0

Gemma 7b: 62.0

Llama-3-Instruct 8b: 68.0

Mixtral 8x7b: 69.9

GPT-3.5 1106: 75.3

(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)




This inductive logic is way overblown.

> Incredible, beat Llama 3 8B with 3.8B parameters after less than a week of release.

Judging by a single benchmark? Without even trying it out with real world usage?

> And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Any potential caveats with such a leaderboard notwithstanding, on that leaderboard alone there is a huge gap between Llama 3 8B and Mistral-Large, let alone any of the GPT-4 variants.

By the way, as for beating benchmarks: "Pretraining on the Test Set Is All You Need"


It's easy to miss: select English in the dropdown. The scores are quite different between Overall and English on LMSYS.

As I've stated in other comments, yeah... Agreed, I'm stretching it a bit. It's just that any indication of a 3.8B model being in the vicinity of GPT-4 is huge.

I'm sure that when things are properly measured by third parties it will show a more sober picture. But still, with good fine-tunes, we'll probably get close.

It's a very significant demonstration of what could be possible soon.


Firstly, English is a highly subjective category.

Secondly, Llama 3 usually opens with sentences like ‘What a unique question!’ or ‘What an insightful thought’, which might make people like it more than the competition because of the pandering.

While Llama 3 is singular in terms of size-to-quality ratio, calling the 8B model close to GPT-4 would be a stretch.


Yes, I don't know how people don't realize how well cheap tricks work in Chatbot Arena. A single base model can produce hundreds of Elo points of difference depending on how it is tuned. And in most cases, instruction tuning even slightly decreases reasoning ability on standard benchmarks. You can see the base model scoring better on MMLU/ARC most of the time on the Hugging Face leaderboard.

Even GPT-4-1106 seems to only sound better than GPT-4-0613 and work for a wider range of prompts. But with a well-defined prompt and follow-up questions, I don't think there is an improvement in reasoning.


When I tried Phi-2 it was just bad. I don't know where you got this fantasy that people accept obviously wrong answers because of "pandering".


Obviously a correct answer matters more, but ~100-200 Elo points can be gained just from better writing. Answer correctness spans a range of ~500 Elo in comparison.


> just for better writing

in my use cases, better writing makes a better answer


> Phi-3-mini 3.8b: 71.2

Per the paper, phi-3-mini (which is English-only) quantised to 4-bit uses 1.8 GB of RAM and outputs 1212 tokens/sec (correction: 12 tokens/sec) on iOS.

A model on par with GPT-3.5 running on phones!

(weights haven't been released, though)
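
(If/when the weights do land, running the 4-bit quant locally should be pretty mundane. A rough sketch with llama-cpp-python; the GGUF filename below is hypothetical, since nothing has shipped yet:)

    # Sketch: 4-bit quantised Phi-3-mini via llama-cpp-python (pip install llama-cpp-python).
    # The GGUF filename is a placeholder; no official file existed at the time of writing.
    from llama_cpp import Llama

    llm = Llama(
        model_path="phi-3-mini-4k-instruct-q4.gguf",  # hypothetical ~2 GB 4-bit quant
        n_ctx=4096,      # context window
        n_threads=4,     # tune for your CPU / phone SoC
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain RAG in two sentences."}],
        max_tokens=128,
        temperature=0.2,
    )
    print(out["choices"][0]["message"]["content"])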


> (weights haven't been released, though)

Phi-1, Phi-1.5, and Phi-2 have all had their weights released, and those weights are available under the MIT License.

Hopefully Microsoft will continue that trend with Phi-3.

> outputs 1212 tokens/sec on iOS

I think you meant "12 tokens/sec", which is still nice, just a little less exciting than a kilotoken/sec.


Weights will be released tomorrow, according to one of the tech report authors on Twitter.


> you meant 12 tokens/sec

Thanks! The HTML version on archive.is has messed up markup and shows 1212 instead: https://archive.is/Ndox6


Weights are coming tomorrow.



Where did you get this from?

> So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones

No, not even close... Even Gemini has a huge UX gap compared to GPT-4/Opus; for an 8B model I won't even attempt this argument.


At a glance, it looks like Phi-3 was trained on an English-only, STEM-heavy dataset. See how it is not as strong on HumanEval, Trivia, etc. But of course it's very good.


Can’t wait to see some Phi-3 fine-tunes! Will be testing this out locally; it's such a small model that I can run it without quantization.

Feels incredible to be living in a time with such breakneck innovation. What are the chances we’ll have a <100B parameter GPT-4/Claude Opus model in the next 5 years?


> What are the chances we’ll have a <100B parameter GPT-4/Claude Opus model in the next 5 years?

In 5 years' time we'll have adaptive compute, and the idea of talking about the parameter count of a model will seem as quaint as talking about the cylinder capacity of a jet engine.


It feels like it's going to be closer than that. People always forget that GPT4 and Opus have the advantage of behind-the-curtain tool use that you just can't see, so you don't know how much of a knowledge or reasoning leg-up they're getting from their internal tooling ecosystem. They're not really directly comparable to a raw LLM downloaded from HF.

What we need is a standardised open harness for open source LLMs to sit in that gives them both access to tools and the ability to write their own, and that's (comparatively speaking) a much easier job than training up another raw frontier LLM: it's just code, and they can write a lot of it.
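
At its simplest, something like the loop below: give the model a tool registry and let it either call a tool or answer. Everything here is a hypothetical sketch (llm_generate stands in for whatever local model call you use), not an existing harness:

    # Hypothetical sketch of a tool-use harness around a local LLM.
    # llm_generate() is a placeholder for any local model call (llama.cpp, vLLM, ...).
    import json
    import subprocess

    TOOLS = {
        "python": lambda code: subprocess.run(
            ["python", "-c", code], capture_output=True, text=True, timeout=10
        ).stdout,
    }

    def run_agent(llm_generate, question, max_steps=5):
        history = question
        for _ in range(max_steps):
            reply = llm_generate(
                history
                + '\nRespond with JSON: {"tool": ..., "input": ...} or {"answer": ...}'
            )
            try:
                action = json.loads(reply)
            except json.JSONDecodeError:
                history += "\nThat was not valid JSON, try again."
                continue
            if "answer" in action:
                return action["answer"]
            result = TOOLS[action["tool"]](action["input"])  # run the requested tool
            history += f"\nTool output: {result}"
        return None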


5 years? 5 years is a millennium these days.

We’ll have small local models beating GPT-4/Claude Opus in 2024. We already have sub-100B models trading blows with earlier GPT-4 models, and the future is racing toward us. All these little breakthroughs are piling up.


Absolutely not on the first one. Not even close.


Why not? There's still 7 months left for breakthroughs.


"Small" leaves wiggle room, but it's extremely unlikely that traditionally small models, <= 7B, will get there this year even on these evals.

Matching the UX is a whole different matter and needs a lot of work: I've worked heavily with Llama 8B over the last few days, and Phi-3 today, and the Q&A benchmarks don't tell the full story. E.g., it's nigh impossible to get Llama _70_B to answer in JSON; when Phi sees RAG context from search, it goes off inventing new RAG material and a new question.
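
For the JSON problem, one common workaround is a validate-and-retry loop around the generate call. A sketch; generate() here is a placeholder, not any particular library's API:

    # Sketch: validate-and-retry to coax valid JSON out of a small model.
    # generate() is a placeholder for your model call, not a specific library API.
    import json

    def ask_for_json(generate, prompt, schema_hint, retries=3):
        instruction = (
            f"{prompt}\n\nAnswer ONLY with JSON matching: {schema_hint}. "
            "No prose, no markdown fences."
        )
        for _ in range(retries):
            raw = generate(instruction).strip()
            # Strip accidental code fences the model may add anyway.
            if raw.startswith("```"):
                raw = raw.strip("`").removeprefix("json").strip()
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                instruction += "\n\nThat was not valid JSON. Try again."
        raise ValueError("model never produced valid JSON")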


We already do. It’s called Llama 3 70B Instruct.


Llama 3 is awful in languages other than English. 95% of its training data is in English...

GPT is still the king when talking about multiple languages/knowledge.


Is it released?


On par in some categories. Phi was intended for reasoning, not storing data, due to its small size. I mean, it's still great, but the smaller it gets, the more facts from outside the prompt's context it simply won't know.


I wonder if that's a positive or negative. How does it affect hallucinations?


It depends what you want to do. If you want a chatbot that can replace most Google queries, you want as much learned data as possible and the whole of Wikipedia consumed. If you want a RAG-style system, you want good reasoning about the context and minimal or no reliance on outside information. It's neither positive nor negative without a specific use case.
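
Concretely, the RAG case mostly comes down to pinning the model to the retrieved context in the prompt. A rough sketch; retrieve() and generate() are placeholders, not any specific library:

    # Sketch: RAG-style prompting, where the model should reason only over the
    # retrieved context rather than facts memorised during training.
    # retrieve() and generate() are placeholders, not a specific library API.

    def rag_answer(generate, retrieve, question, k=4):
        chunks = retrieve(question, k=k)       # top-k passages from your index
        context = "\n\n".join(chunks)
        prompt = (
            "Answer the question using ONLY the context below. "
            'If the context is not enough, say "I don\'t know".\n\n'
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        return generate(prompt)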


It’s not open source, but it is open weight - like distributing a precompiled executable. In particular, what makes it open weights rather than just weights-available is that it is licensed under an OSI-approved license (MIT) rather than a restrictive proprietary license.

I really wish these companies would release the training source, evaluation suites, and code used to curate/filter training data (since safety efforts can lead to biases). Ideally they would also share the training data but that may not be fully possible due to licensing.


> And on LMSYS English, Llama 3 8B is well above GPT-4

Source?


Right, thanks for the reminder, I added it.


Thanks. I don't see them being "well above GPT-4" though, it's merely 1 point? Also, no idea why one would want to exclude GPT-4-Turbo, the flagship "GPT-4" model, but w/e.

I also don't think they "beat Llama 3 8B"; their own abstract says "rivals that of models such as Mixtral 8x7B and GPT-3.5": "rivals", not even "beats".

Great model, but let's not overplay it.


In the English category: GPT-4-0314 (ELO 1166), Llama 3 8B Instruct (ELO 1161), Mistral-Large-2402 (ELO 1151), GPT-4-0613 (ELO 1148).

You are right, I toned down the language, I got a bit overexcited, and I missed the difference in the versions of GPT-4. And LMSYS is a subjective benchmark for what users prefer, which I'm sure has weird inherent biases.

It's just that any signal of a 3.8B model being anywhere in the vicinity of GPT-4 is huge.


Yeah, GPT3.5, in a phone, at ~1,000 tokens/sec ... nice!


> at ~1,000 tokens/sec

12 tokens per second.


Whoops, made the same mistake as @ignoramous :P


> So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones?

No, we don't. LMsys is just one, very flawed benchmark.


Why is LMsys flawed?

Many people treat LMsys as gospel because it's the only large-scale, up-to-date qualitative benchmark. All the numeric benchmarks seem to miss real-world applicability.


Agreed, but it's wild that even one benchmark shows this. Based on what we knew just a few months ago, these models should be far apart in every benchmark.


"But still"? Let's be realistic: all of these benchmark scores are absolute garbage. Yes, the open-source community is making great strides, and they are getting closer, but the gap is still wide compared to commercially available models.



