GPT-OSS vs. Qwen3 and a detailed look at how things evolved since GPT-2

starchild3001 · 2025-08-11T04:11:46 1754885506

What stood out to me is how much of gpt-oss’s “newness” isn’t about radical architectural departures, but about a careful layering of well-understood optimizations—RoPE, SwiGLU, GQA, MoE—with some slightly unusual choices (tiny sliding-window sizes, few large experts instead of many small ones, per-head attention sinks).

The MXFP4 quantization detail might be the sleeper feature here. Getting 20B running on a 16 GB consumer card, or 120B on a single H100/MI300X without multi-GPU orchestration headaches, could be a bigger enabler for indie devs and researchers than raw benchmark deltas. A lot of experimentation never happens simply because the friction of getting the model loaded is too high.
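As a back-of-the-envelope check on those sizes: MXFP4 stores each weight as a 4-bit element plus one shared 8-bit scale per block of 32 elements, so roughly 4.25 bits per weight. The sketch below assumes, for simplicity, that every parameter is stored that way and ignores KV cache and activations; real checkpoints keep some tensors at higher precision, so treat the result as a lower bound. The function names are made up for illustration.

```python
# Rough weight-memory estimate for MXFP4 (a simplification, not a loader).
ELEMENT_BITS = 4   # one 4-bit element per weight
SCALE_BITS = 8     # one shared scale per block
BLOCK_SIZE = 32    # weights per block


def bits_per_weight():
    """Average storage cost per weight, including the shared scale."""
    return ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE


def weight_gb(n_params):
    """Weight storage in decimal gigabytes, assuming all params are MXFP4."""
    return n_params * bits_per_weight() / 8 / 1e9


if __name__ == "__main__":
    for name, n_params in [("20B", 20e9), ("120B", 120e9)]:
        print(f"{name}: about {weight_gb(n_params):.1f} GB of weights")
```

Under these assumptions the 20B model's weights stay well under 16 GB and the 120B model's stay under the 80 GB of an H100, which matches the sizes quoted above.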

One open question I’m curious about: given gpt-oss’s design bias toward reasoning (and away from encyclopedic recall), will we start seeing a formal split in open-weight model development—specialized “reasoners” that rely on tool use for facts, and “knowledge bases” tuned for retrieval-heavy work? That separation could change how we architect systems that wrap these models.

regularfry · 2025-08-11T08:50:21 1754902221

> will we start seeing a formal split in open-weight model development—specialized “reasoners” that rely on tool use for facts, and “knowledge bases” tuned for retrieval-heavy work?

My bet's on the former winning outright. It's very hard to outrun a good search engine, LLMs are inherently lossy so internal recall will never be perfect, and if you don't have to spend your parameter budget encoding information then you get to either spend that budget on being a much better reasoner, or you shrink the model and make it cheaper to run for the same capability. The trade-off is a more complex architecture, but that's happening anyway.

asabla · 2025-08-11T06:17:28 1754893048

> that rely on tool use for facts, and “knowledge bases” tuned for retrieval-heavy work

I would say this isn't exclusive to the smaller OSS models, but rather a trait of OpenAI's models altogether now.

This becomes especially apparent with the introduction of GPT-5 in ChatGPT. Their focus on routing your request to different modes and searching the web automatically (relying on agentic workflows in the background) is probably key to the overall quality of the output.

So far, it's quite easy to get their OSS models to follow instructions reliably. Qwen models have been pretty decent at this too for some time now.

I think if we give it another generation or two, we'll be at the point of having competent enough models to start running more advanced agentic workflows on modest hardware. We're almost there now, but not quite yet.

codelion · 2025-08-11T06:43:07 1754894587

It is by design. OpenAI is not going to reveal any architectural innovation they have made in their own commercial models.

diggan · 2025-08-11T07:43:50 1754898230

Maybe not an architectural innovation, but the Harmony format and splitting things into system/developer/user messages instead of just system/user messages are both novel (in the released weights world) and different enough that I'm still in the process of updating my libraries so I can run fair benchmarks...

ethan_smith · 2025-08-11T09:00:18 1754902818

MXFP4's mixed precision approach (4-bit for weights, higher precision for KV cache) actually offers better accuracy/size tradeoffs than competing quantization methods like GPTQ or AWQ, which is why it enables these impressive resource profiles without the typical 4-bit degradation.

littlestymaar · 2025-08-11T07:07:46 1754896066

> careful layering of well-understood optimizations—RoPE, SwiGLU, GQA, MoE

They basically cloned Qwen3 on that, before adding the few tweaks you mention afterwards.

Voloskaya · 2025-08-11T08:19:55 1754900395

You seem to be conflating when you first heard about those techniques and when they first appeared. None of those techniques were first seen in Qwen, nor this specific combination of techniques.

NitpickLawyer · 2025-08-11T07:38:30 1754897910

> They basically cloned Qwen3 on that

Oh, come on! GPT4 was rumoured to be an MoE well before Qwen even started releasing models. oAI didn't have to "clone" anything.

littlestymaar · 2025-08-11T11:45:57 1754912757

First, it would be great if people stopped acting as if those billion-dollar corporations were sports teams.

Second, I don't claim OpenAI has to clone anything, and I have no reason to believe that their proprietary models are copying other people's. But for these particular open-weight models, they clearly have an incentive to use exactly the same architectural base as another actor's, in order to avoid leaking too much information about their own secret sauce.

And finally, though GPT-4 was a MoE it was most likely what TFA calls “early MoE” with a few very big experts, not many small ones.

7moritz7 · 2025-08-10T16:09:01 1754842141

Qwen3 is substantially better in my local testing. As in, adheres to the prompt better (pretty much exactly for the 32B parameter variant, very impressive) and is more organic sounding.

In simplebench gpt-oss (120 bn) flopped hard so it doesn't appear particularly good at logical puzzles either.

So presumably, this comes down to...

- training technique or data

- dimension

- lower number of large experts vs higher number of small experts

jszymborski · 2025-08-10T16:11:25 1754842285

If I had to make a guess, I'd say this has much, much less to do with the architecture and far more to do with the data and training pipeline. Many have speculated that gpt-oss has adopted a Phi-like synthetic-only dataset and focused mostly on gaming metrics, and I've found the evidence so far to be sufficiently compelling.

7moritz7 · 2025-08-10T16:15:18 1754842518

That would be interesting. I've been a bit sceptical of the entire strategy from the beginning. If oss was actually as good as o3 mini and in some cases o4 mini outside benchmarks, that would undermine openai's api offer for gpt 5 nano and maybe mini too.

Edit: found this analysis, it's on the HN frontpage right now

> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.

https://x.com/jxmnop/status/1953899426075816164

CuriouslyC · 2025-08-10T16:26:38 1754843198

The strategy of Phi isn't bad, it's just not general. It's really a model that's meant to be fine tuned, but unfortunately fine tuning tends to shit on RL'd behavior, so it ended up not being that useful. If someone made a Phi style model with an architecture that was designed to take knowledge adapters/experts (i.e. small MoE model designed to get separately trained networks plugged into them with routing updates via special LoRA) it'd actually be super useful.

adastra22 · 2025-08-10T20:03:52 1754856232

The Phi strategy is bad. It results in very bad models that are useless in production, while gaming the benchmark to appear like it is actually able to do something. This is objectively bad.

CuriouslyC · 2025-08-10T21:31:35 1754861495

I like the idea of having a _HIGHLY_ unopinionated base model that's just good at basic logic and instruction following that I can fine tune to my use case. Sadly, full fine tuning tends to make models derpy, and LoRAs are limited in terms of what they can achieve.

adastra22 · 2025-08-10T22:06:05 1754863565

That seems unrelated? I think we are talking past each other. Phi was trained on purely synthetic data derived from emulating the benchmark suite. Not surprisingly, this resulted in state of the art scores. And a model that was 100% useless at anything other than making the benchmark number go up.

johnisgood · 2025-08-10T23:21:55 1754868115

Is there a URL to the post itself somewhere else?

unstatusthequo · 2025-08-10T16:57:51 1754845071

Yes. I tried to ask oss-gpt to ask me a riddle. The response was absurd. Came up with a nonsensical question, then told me the answer. The answer was a four letter “word” that wasn’t actually a real word.

“What is the word that starts with S, ends with E, and contains A? → SAEA”

Then when I said that’s not a word and you gave me the answer already, no fun, it said

“I do not have access to confirm that word.”

verisimi · 2025-08-10T17:14:11 1754846051

lol. The answer it gave doesn't even end in an 'E'.

threeducks · 2025-08-10T19:02:34 1754852554

FWIW, I asked gpt-oss-120b this question 10 times and the answer was always "sauce", "sane" or "sale". I also tried different temperatures (from 0 to 1), which did not seem to have an effect on the correctness of the answer.

EDIT: I now have also questioned the smaller gpt-oss-20b (free) 10 times via OpenRouter (default settings, provider was AtlasCloud) and the answers were: sage, sane, sane, space, sane, sane, sane, sane, space, sane.

You are either very unlucky, your configuration is suboptimal (weird system prompt perhaps?) or there is some bug in whichever system you are using for inference.
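For anyone repeating this kind of check, the riddle's three constraints are easy to verify mechanically. A minimal sketch (the helper name is made up; the word lists are the answers reported in this thread):

```python
def fits_riddle(word):
    """True if the word starts with S, ends with E, and contains an A."""
    w = word.upper()
    return w.startswith("S") and w.endswith("E") and "A" in w


# Answers reported in the thread, plus the original nonsense answer.
reported = ["sauce", "sane", "sale", "sage", "space", "same", "square"]
original = "SAEA"

if __name__ == "__main__":
    for word in reported + [original]:
        print(f"{word}: {'ok' if fits_riddle(word) else 'does not fit'}")
```

All of the sampled answers satisfy the constraints; the original "SAEA" fails because it does not end in E.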

yunusabd · 2025-08-10T19:36:30 1754854590

GP asked the model to _create_ a riddle, not solve a given one.

threeducks · 2025-08-10T20:38:41 1754858321

Yes, but the odds of getting GPT-OSS to respond with that riddle are pretty low and it is not necessary to demonstrate whether the LLM can answer the riddle correctly.

vidarh · 2025-08-11T00:54:50 1754873690

They said it provided the answer when it created the riddle. They didn't question its ability to solve it.

roscas · 2025-08-10T17:22:25 1754846545

"The word is SAME." for qwen3-coder and it took like 10 seconds to write all the thinking because it writes it down, but processing was 0 seconds.

"I need to find a word that starts with S, ends with E, and contains A.

Let me think through this step by step:

    Starts with S
    Ends with E
    Contains A somewhere in the middle

One word that fits this pattern is "SAGE" - but that only has 4 letters and would be S-A-G-E, which doesn't work because it doesn't contain A in the middle.

Wait, let me reconsider. The word "SAGE" actually works if I read it correctly:

    Starts with S
    Ends with E  
    Contains A

Actually, "SAGE" is S-A-G-E, so it starts with S and ends with E, but it contains A in the second position.

Let me think of other possibilities:

    "SQUARE" - starts with S, ends with E, contains A (in the middle)
    "SALE" - starts with S, ends with E, contains A
    "SAME" - starts with S, ends with E, contains A

Wait, "SAME" is S-A-M-E - that works! It starts with S, ends with E, and contains A.

The word is SAME. "

bee_rider · 2025-08-11T05:13:59 1754889239

This is tangential because the task was to come up with the riddle, not solve it.

But, do reasoning models usually do this poorly?

It comes up with a valid solution, SAGE, then disqualifies it for incomprehensible reasons.

Then it discovers that SAGE works if it “reads it carefully.” But then seems to disqualify it(?), or at least goes to list other words for some reason.

Then it comes up with SAME, a word… with exactly the same shape as SAGE, just swapped out the irrelevant letter.

What is going on here? Is it programmed to constantly second-guess itself to make it better at finding weaknesses to its answers to harder riddles? But since it doesn’t know how to accept a good answer, it seems like it is just rolling the dice and then stopping at a random point.

I guess it is technically right, but the logic is a total mess.

yorwba · 2025-08-11T08:23:40 1754900620

The model isn't explicitly programmed to constantly second-guess itself, but when you do reinforcement learning with verifiable rewards (RLVR) where only the final answer is verified, even completely nonsensical reasoning can accidentally be rewarded if it gives correct results often enough.

E.g. if the model can generate multiple candidate solutions that are all equally likely (or unlikely) to be correct, it doesn't matter whether you stop at the first one or keep going until a random later one. But if the model can pick the correct solution from multiple candidates better than choosing uniformly at random, generating more candidates becomes an advantage, even if it sometimes results in discarding a correct solution in favor of another one.
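A toy calculation of that argument, under assumptions that are mine rather than anything measured: each candidate is independently correct with probability `p`, the model generates `k` of them, and its final pick weights a correct candidate `w` times as heavily as a wrong one (`w = 1` is a uniform random pick).

```python
from math import comb


def pick_accuracy(p, k, w):
    """Probability that the final pick is correct.

    p: chance each candidate is correct (independent, by assumption)
    k: number of candidates generated (k >= 1)
    w: how much more likely a correct candidate is to be picked (w > 0)
    """
    total = 0.0
    for c in range(k + 1):  # c = number of correct candidates
        prob_c = comb(k, c) * p**c * (1 - p) ** (k - c)
        total += prob_c * (w * c) / (w * c + (k - c))
    return total


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, round(pick_accuracy(0.3, k, 1.0), 3), round(pick_accuracy(0.3, k, 3.0), 3))
```

With a uniform pick the accuracy stays at `p` no matter how many candidates are generated, so rambling costs nothing but tokens. With any better-than-random pick, accuracy rises with `k`, which is the kind of behavior outcome-only rewards would reinforce even when a correct candidate occasionally gets discarded along the way.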

adastra22 · 2025-08-10T20:05:30 1754856330

He was asking the llm to come up with the riddle.

faangguyindia · 2025-08-11T01:28:07 1754875687

this is exactly why the strongest model is gonna lose out to weaker models if the latter ones have more data

for example, i was using deep seek webui and getting decent on point answers but it simply does not have latest data.

So, while Deep Seek R1 might be better model than Grok3 or even Grok4, it not having access to "twitter data" basically puts it behind.

Same is the case with OpenAI, if OpenAI has access to fast data from github, it can help with bugfixes which claude/gemini2.5 pro can't.

model can be smarter but if it does not have the data to base its inference upon it's useless.

fspeech · 2025-08-11T06:08:32 1754892512

On the open source library part, you can ask DeepWiki the questions yourself and feed the answers to the LLMs by hand. DeepWiki gives you high quality answers because they are grounded in code and you can check the veracity yourself.

omneity · 2025-08-11T00:26:00 1754871960

Qwen3 32B is a dense model, it uses all its parameters all the time. GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time. It’s a tradeoff that makes it faster to run than a dense 20B model and much smarter than a 3.6B one.

In practice the fairest comparison would be to a dense ~8B model. Qwen Coder 30B A3B is a good sparse comparison point as well.
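To make "only uses a fraction at a time" concrete, here is a toy sketch of top-k expert routing in plain Python. The expert count, the router scores and the scalar "experts" are all invented for illustration; real MoE layers route per token with a learned router and full feed-forward experts.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected experts only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Hypothetical toy setup: 8 "experts", each just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]
routes = top_k_route(logits, k=2)

if __name__ == "__main__":
    print(routes)
    print(moe_forward(3.0, experts, logits, k=2))
```

Only 2 of the 8 experts are evaluated for this input; the other six sit in memory unused, which is why total parameter count drives memory needs while active parameter count drives per-token compute.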

bee_rider · 2025-08-11T12:56:52 1754917012

Tangential question from an outsider:

When people talk about sparse or dense models, are they sparse or dense matrices in the conventional numerical linear algebra sense? (Something like a CSR matrix?)

selcuka · 2025-08-11T00:30:35 1754872235

> GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time.

They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model it's more than fair to compare it to Qwen3 32B.

Mars008 · 2025-08-11T02:15:00 1754878500

You call it fair? 32 / 5.1 > 6, so it takes 6 times more compute per token. Put another way, Qwen3 32B is 6 times slower than GPT OSS 120B.

kgeist · 2025-08-11T02:22:59 1754878979

>Qwen3 32B is 6 times slower than GPT OSS 120B.

Only if 120B fits entirely in the GPU. Otherwise, for me, with a consumer GPU that only has 32 GB VRAM, gpt-oss 120B is actually 2 times slower than Qwen3 32B (37 tok/sec vs. 65 tok/sec)

selcuka · 2025-08-11T06:48:09 1754894889

We are talking about accuracy, though. I don't see the point of MoE if a 120B MoE model is not as accurate as even a 32B model.

littlestymaar · 2025-08-11T07:13:22 1754896402

I've read many times that MoE models should be comparable to dense models with a number of parameters equal to the geometric mean of the MoE's total number of parameters and active ones.

In the case of gpt-oss 120B that would mean sqrt(5*120)=24B.
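Spelled out as a quick script, using the total and active parameter counts quoted in this thread (the function name is made up, and the rule itself is only a folk heuristic, not an established law):

```python
import math


def dense_equivalent_b(total_b, active_b):
    """Geometric mean of total and active parameters, in billions."""
    return math.sqrt(total_b * active_b)


# (total, active) parameter counts in billions, as quoted in the thread.
models = {
    "gpt-oss-120b": (120, 5.1),
    "gpt-oss-20b": (20, 3.6),
    "Qwen3 30B-A3B": (30, 3),
}

if __name__ == "__main__":
    for name, (total_b, active_b) in models.items():
        print(f"{name}: ~{dense_equivalent_b(total_b, active_b):.1f}B dense-equivalent")
```

By this heuristic gpt-oss-120b lands in the mid-20B range and gpt-oss-20b near 8B, which is consistent with the "dense ~8B" comparison suggested upthread.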

Mars008 · 2025-08-11T15:52:15 1754927535

Not sure there is one formula. Because there are two different cases:

1) performance constrained. like NVidia Spark with 128GB or AGX with 64GB.

2) memory constrained. like consumers' GPUs.

In first case MoE is clear win. They fit and run faster. In second case dense models will produce better results. And if performance in token/sec is acceptable then they are better choice.

selcuka · 2025-08-11T12:30:41 1754915441

> In the case of gpt-oss 120B that would mean sqrt(5*120)=24B.

That's actually in line with what I had (unscientifically) expected. Claude Sonnet 4 seems to agree:

> The most accurate approach for your specific 120B MoE (5.1B active) would be to test it empirically against dense models in the 10-30B range.

kgeist · 2025-08-11T11:20:01 1754911201

I've read that the formula is based on the early Mistral models and does not necessarily reflect what's going on nowadays.

BoorishBears · 2025-08-10T20:02:07 1754856127

MoE expected performance = sqrt(active parameter count * total parameter count)

sqrt(120*5) ~= 24

GPT-OSS 120B is effectively a 24B parameter model with the speed of a much smaller model

faangguyindia · 2025-08-11T01:25:15 1754875515

yesterday, i signed up for qwen3-coder-plus. It fails 4/10 "diff" edit format in various code editing tools i use.

Gemini Pro 2.5 with diff fenced edit format, rarely fails. So i don't see this Qwen3 hype unless i am using wrong edit format, can anyone tell me which edit format will work better with Qwen3?

https://aider.chat/docs/more/edit-formats.html

eurekin · 2025-08-11T09:04:18 1754903058

I'm running a 30b-a3b-instruct q6 quant on exllamav2 and checked a few simple tasks in roo and cline. Prompt adherence, tool calling and file changing worked flawlessly

faangguyindia · 2025-08-11T10:40:37 1754908837

okay turns out i was using it in aider with wrong edit format in editor mode, i switched to editor edit format it has not failed so far.

wickedsight · 2025-08-11T12:48:14 1754916494

Maybe I'm doing something wrong, but in my testing with Roo and Qwen3-Coder-30B via MLX, it constantly ends up in loops and often doesn't manage to finish editing a file, leaving it half finished.

If I give it really simple, straight forward tasks it works quite nice though.

cranberryturkey · 2025-08-10T20:09:03 1754856543

qwen3 is slow though. i used it. it worked, but it was slow and lacking features.

kgeist · 2025-08-11T02:34:19 1754879659

On my RTX 5090 with llama.cpp:

gpt-oss 120B - 37 tok/sec (with CPU offloading, doesn't fit in the GPU entirely)

Qwen3 32B - 65 tok/sec

Qwen3 30B-A3B - 150 tok/sec

(all at 4-bit)

xfalcox · 2025-08-11T00:17:41 1754871461

Qwen 3 is not slow by any metrics.

Which model, inference software and hardware are you running it on?

The 30BA3B variant flies on any GPU.

SchemaLoad · 2025-08-11T02:06:33 1754877993

GPT-OSS is slow too. Gemma3 gives me better results and runs faster.

mark_l_watson · 2025-08-10T18:28:59 1754850539

Wow, Sebastian Raschka's blog articles are jewels - much appreciated.

I use the gpt-oss and qwen3 models a lot (smaller models locally using Ollama and LM Studio) and commercial APIs for the full size models.

For local model use, I get very good results with gpt-oss when I "over prompt," that is, I specify a larger amount of context information than I usually do. Qwen3 is simply awesome.

Until about three years ago, I had always understood neural network models (starting in the 1980s), GAN, Recurrent, LSTM, etc. well enough to write implementations. I really miss the feeling that I could develop at least simpler LLMs on my own. I am slowly working through Sebastian Raschka's excellent book https://www.manning.com/books/build-a-large-language-model-f... but I will probably never finish it (to be honest).

imtringued · 2025-08-11T11:27:56 1754911676

For me it is the opposite. I'm shocked by how simple transformer-based models are and how small the architectural differences are between the latest models. Almost nothing has changed since late 2023.

lvl155 · 2025-08-10T19:40:25 1754854825

He does an amazing job of keeping me up to date on this insanely fast-paced space.

roscas · 2025-08-10T17:18:46 1754846326

From my experience, qwen3-coder is way better. I only have gpt-oss:20b installed to run a few more tests, but when I give each a program and ask for a summary of what it does, qwen3 just works in a few seconds, while gpt-oss was cancelled after 5 minutes... doing nothing.

So I just use qwen3. Fast and great output. If for some reason I don't get what I need, I might use search engines or Perplexity.

I have a 10GB 3080 and Ryzen 3600x with 32gb of RAM.

Qwen3-coder is amazing. Best I used so far.

lvl155 · 2025-08-10T20:01:57 1754856117

Qwen3 coder 480B is quite good and on par with Sonnet 4. It’s the first time I realized the Chinese models are probably going to eclipse US-based models pretty soon, at least for coding.

indigodaddy · 2025-08-10T20:18:33 1754857113

Where do you use qwen3 480b from, I'm not even seeing it on Openrouter. EDIT nm, openrouter is just calling it qwen3-coder-- when I click for more info it shows it's Qwen3-Coder-480B-A35B-Instruct. And it's one of their free models. Nice

tough · 2025-08-11T01:03:42 1754874222

cerebras code (both sub and api) have it

faangguyindia · 2025-08-11T01:29:45 1754875785

what edit format u use with Qwen? https://aider.chat/docs/more/edit-formats.html

diff is failing me or do you guys use whole?

cpursley · 2025-08-10T21:20:48 1754860848

That might be a stretch, maybe Sonnet 3.5. But it is pretty impressive as is Kimi on opencode.

mhitza · 2025-08-10T17:47:00 1754848020

I've been lightly using gpt-oss-20b, but what I've found is that for smaller (single sentence) prompts it was easy enough to have it loop infinitely. Since I'm running it with llama.cpp I've set a small repetition penalty and haven't encountered those issues since (I'm using it a couple of times a day to analyze diffs, so I might have just gotten lucky since)

nicolaslem · 2025-08-10T19:16:13 1754853373

I had the same issue with other models where they would loop repeating the same character, sentence or paragraph indefinitely. Turns out the context size some tools set by default is 2k and this is way too small.

ModelForge · 2025-08-10T19:00:30 1754852430

I’ve been using the ollama version (uses about 13 Gb RAM on macOS) and haven’t had that issue yet. I wonder if that’s maybe an issue of the llama.cpp port?

mhitza · 2025-08-10T19:05:13 1754852713

Never used ollama, only ready to go models via llamafile and llama.cpp.

Maybe ollama has some defaults it applies to models? I start testing models at 0 temp and tweak from there depending how they behave.

smokel · 2025-08-10T17:23:08 1754846588

The 20B version doesn't fit in 10GB. That might explain some issues?

SV_BubbleTime · 2025-08-10T20:08:07 1754856487

Are you using this in an agentic way or in a copy and paste and “code this” single input single output way?

I’d like to know how far the frontier models are from the local for agentic coding.

panki27 · 2025-08-11T08:29:32 1754900972

What Qwen3-Coder model are you using? Quantized or not?

Asking because I'm looking for a good model that fits in 12GB VRAM.

eurekin · 2025-08-10T22:25:24 1754864724

I'm still in awe that a local 3090 gpu was able to run the qwen3 coder instruct 30b-a3b exl3 q6 and...

It was able to create a sample page, try starting a server, recognise that a leftover server was running, kill it (which forced a prompt for my permission), retry, and find out its IP for me to open in the browser.

This isn't a demo anymore. That's actually very useful help for interns/juniors already.

mrheosuper · 2025-08-11T02:59:11 1754881151

How did you do your setup? Right now the only way I know how to run an LLM is through LM Studio.

eurekin · 2025-08-11T04:25:42 1754886342

Using tabbyApi and exLlamav2.

Scene_Cast2 · 2025-08-10T18:05:00 1754849100

I find it interesting that the architectures of modern open weight LLMs are so similar, and that most innovation seems to be happening on the training (data, RL) front.

This is contrary to what I've seen in a large ML shop, where architectural tuning was king.

bobbylarrybobby · 2025-08-10T19:05:11 1754852711

My guess is that at LLM scale, you really can't try to hyperparameter tune — it's just too expensive. You probably have to do some basic testing of different architectures, settle on one, and then figure out how to make best use of it (data and RL).

ModelForge · 2025-08-10T19:04:09 1754852649

Good point. LLMs lower the barrier to entry if someone has enough resources because those architectures are more robust to tweaks given one throws enough compute and data at them. You can even violate scaling laws and still get a good model (like Llama 3 showed back then)

gglon · 2025-08-10T20:34:16 1754858056

> At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96.)

Tencent's hunyuan-turbos, another hybrid, is currently ranked at 22. https://arxiv.org/abs/2505.15431

storus · 2025-08-10T18:07:18 1754849238

In my tests, GPT-OSS-120B Q8 was close to DeepSeek R1 671B Q16 in solving graduate-level math but much faster with way fewer thinking tokens.

overfeed · 2025-08-10T20:09:25 1754856565

Supporting TFA's thesis that it's trained to be good at benchmarks.

Mars008 · 2025-08-11T02:26:02 1754879162

Is it bad? It was trained on synthetic data with emphasis on coding and scientific thinking. Good in my opinion, that's what it can be used for. Not as a universal do-it-all model.

dzogchen · 2025-08-11T08:41:35 1754901695

Say it with me: freely downloadable model weights does not mean a model is open source. https://opensource.org/ai/open-source-ai-definition

victorbjorklund · 2025-08-11T08:56:44 1754902604

Yeah but isn't it even doubtful if you can call a model a program or is it rather like a data set that can be used by a program.

pryelluw · 2025-08-10T18:56:08 1754852168

The Qwen3 4B has been very good to use locally. I barely use the online models. Web searches are now more targeted thanks to it. Don’t quite fully trust the output but it’s generally good. Models like these will revolutionize local knowledge and automation

indigodaddy · 2025-08-10T19:05:21 1754852721

Qwen is telling you better search parameters to then search the web with, or qwen is actually doing web searches for you?

oezi · 2025-08-10T20:45:30 1754858730

One question I was wondering about regarding the open models released by big labs is how much more they could improve with additional training. GPT-OSS has 2.1m hours of training, how much score improvement could we see at double that?

ModelForge · 2025-08-10T21:52:15 1754862735

I think GPT-4.5 was potentially the original GPT-5 model that was larger and pre-trained on more data. Too bad it was too expensive to deploy at scale so that we never saw the RL-ed version

poorman · 2025-08-10T20:52:08 1754859128

As we saw with GPT-5 the RL technique of training doesn't scale forever

energy123 · 2025-08-11T12:12:42 1754914362

Unless GPT-5 is 30% cheaper to run than o3. Then it's scaling brilliantly given the small gap between release dates. People are really drawing too many conclusions from too little information.

oezi · 2025-08-10T21:19:50 1754860790

I meant scaling the base training before RL.

poorman · 2025-08-10T20:59:49 1754859589

This article really goes into a lot of detail which is nice. gpt-oss is just not good for agentic use in my observation.

tldr; I'll save you a lot of time trying things out for yourself. If you are on a >=32 GB Mac download LMStudio and then the `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses ~20 GB of RAM so a 32GB machine is plenty. Set it up with opencode [1] and you're off to the races! It has great tool calling ability. The tool calling ability of gpt-oss doesn't even come close in my observations.

[1] https://opencode.ai/

LarMachinarum · 2025-08-11T10:00:39 1754906439

Much as I understand how a 5 bit quantization might be a sweet spot in the tradeoff between precision and making it possible to cram more weight parameters into limited ram, and thus in that respect better than e.g. 4 bit or 8 bit,…

…I struggle to comprehend how an odd quantization like 5 bit, that doesn't align well with 8 bit boundaries, would not slow things down for inference: given that on one hand the hardware doing the multiplications doesn't support vectors of 5 bit values but needs repacking to 8 bit before multiplication, and on the other hand the weights can't be bulk-repacked to 8 bit once and for all in advance (otherwise it wouldn't fit inside the RAM, besides in that case one would use a 8 bit quantization anyways)

it would require quite a lot of instructions per multiplication (way more than for 4 bit quantization where the alignment match simplifies things) to ad-hoc repack the 5 bit values to vectors of 8 bit. So i kinda wonder how much (percentage-wise) that would impact inference performance
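A small illustration of the alignment problem described above, using a naive contiguous bit-packing that I am assuming purely for the sake of the example; actual inference kernels may lay out 5-bit weights differently, so this shows the concern rather than how any particular runtime works.

```python
def pack_bits(values, width):
    """Pack unsigned ints of `width` bits each into bytes, lowest bits first."""
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        if not 0 <= v < (1 << width):
            raise ValueError(f"{v} does not fit in {width} bits")
        acc |= v << nbits
        nbits += width
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_4bit(data, count):
    """Two values per byte: one mask and one shift, never crossing a byte."""
    out = []
    for byte in data:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]


def unpack_5bit(data, count):
    """Each value starts at bit 5*i and may straddle two bytes."""
    out = []
    for i in range(count):
        byte_idx, shift = divmod(i * 5, 8)
        lo = data[byte_idx]
        hi = data[byte_idx + 1] if byte_idx + 1 < len(data) else 0
        out.append(((lo | (hi << 8)) >> shift) & 0x1F)
    return out


# How many of 8 consecutive 5-bit values cross a byte boundary?
straddling = sum(1 for i in range(8) if (i * 5) % 8 + 5 > 8)

if __name__ == "__main__":
    vals = [3, 31, 0, 17, 9, 26, 1, 30]
    print(unpack_5bit(pack_bits(vals, 5), len(vals)) == vals, straddling)
```

In this layout half of the 5-bit values span two bytes, so each needs a two-byte load plus a variable shift, whereas 4-bit values unpack with a fixed mask and shift per byte.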

throw-qqqqq · 2025-08-11T10:28:30 1754908110

> I struggle to comprehend how an odd quantization like 5 bit, that doesn't align well with 8 bit boundaries, would not slow things down for inference

Who says it doesn’t :)?

At least in my tests there is a big penalty to using an “odd” bit stride.

Testing 4bit quantization vs 5bit in Llama.cpp, I see quite a bit more than the “naively expected” 25% slowdown from 4 to 5 bits.

ModelForge · 2025-08-10T21:50:47 1754862647

The ollama one uses even less (around 13 GB), which is nice. Apparently the gpt-oss team also shared the mxfp4 optimizations for metal

ahmedfromtunis · 2025-08-10T22:42:15 1754865735

When I visit the site I get the error "Your connection is not private". Also: "You cannot visit magazine.sebastianraschka.com right now because the website uses HSTS."

Chrome latest on Ubuntu.

vintermann · 2025-08-11T05:04:24 1754888664

First suspicion is that HSTS is doing what it's supposed to, and that you're connecting from somewhere they try to insert themselves in the middle of all https traffic. Https snooping is sadly not uncommon, some businesses think they're entitled to do it for you using their network.

mike_hearn · 2025-08-11T12:48:46 1754916526

I'm really not a PyTorch expert so this is most likely a newbie error, but could someone explain to me the code in Figure 7?

The code circled as "4 x emb_dim" doesn't seem to apply a 4x multiplier anywhere. Actually, the layer definitions of fc1 and fc2 in the SwiGLU variant appear to be identical to the code in the regular feed forward block. What is making the two layers in the second code snippet different sizes to fc1 in the first?

fdalvi · 2025-08-11T13:00:20 1754917220

It is indeed not something clarified by the code snippets. In normal feedforward layers, it is common to choose "hidden_dim = 4 x emb_dim", while in GLU feedforward layers, the convention is to use "hidden_dim = 2/3 * regular_ffn_hidden_dim" (to keep the overall number of parameters roughly the same, since the GLU variant has three weight matrices instead of two). In the case of gpt-oss, they chose to go a bit more extreme and set "hidden_dim = emb_dim", thus reducing the overall number of parameters!
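The arithmetic behind those conventions, as a quick sketch. The `emb_dim` value is an arbitrary illustrative choice (picked so the 2/3 factor divides evenly), biases are ignored, and the function names are made up:

```python
def ffn_params(emb_dim, hidden_dim):
    """Regular feed-forward: emb -> hidden and hidden -> emb (two matrices)."""
    return 2 * emb_dim * hidden_dim


def swiglu_params(emb_dim, hidden_dim):
    """GLU-style feed-forward: two emb -> hidden matrices plus hidden -> emb."""
    return 3 * emb_dim * hidden_dim


emb_dim = 3072                         # hypothetical embedding size
regular_hidden = 4 * emb_dim           # the usual 4x expansion
glu_hidden = 2 * regular_hidden // 3   # 2/3 of the regular hidden size

regular = ffn_params(emb_dim, regular_hidden)
glu_conventional = swiglu_params(emb_dim, glu_hidden)
glu_small = swiglu_params(emb_dim, emb_dim)  # hidden_dim = emb_dim

if __name__ == "__main__":
    print(regular, glu_conventional, glu_small)
    print(glu_small / regular)
```

With the 2/3 convention the two variants come out at the same parameter count, while setting `hidden_dim = emb_dim` leaves the GLU block at 3/8 of the regular block's size.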

mike_hearn · 2025-08-11T14:10:13 1754921413

Ah, thank you!

chaos_emergent · 2025-08-10T20:56:01 1754859361

> This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced.

Wait, is this true? That seems like a wild statement to make, relatively unsubstantiated?

typon · 2025-08-10T20:57:47 1754859467

No this is well known. Look for Table 2.2 in GPT3 paper.

chaos_emergent · 2025-08-12T17:54:50 1755021290

Thank you, that was a wild thing to learn!

homarp · 2025-08-10T16:07:44 1754842064

"From GPT-2 to gpt-oss: Analyzing the Architectural Advances And How They Stack Up Against Qwen3"