I realize my SpongeBob post came off flippant, and that wasn't the intent. The SpongeBob ASCII test (picked up from Qwen's own Twitter) is explicitly a rote-memorization probe; bigger dense models usually ace it because sheer parameter count can store the sequence.
With Qwen3's sparse-MoE, though, the path to that memory is noisier: two extra stochastic draws, (a) which expert(s) fire and (b) which token gets sampled from their combined output. Add the new gated-attention and multi-token heads and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture.
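To make those two choice points concrete, here is a toy top-k gating plus temperature-sampling sketch. It is only an illustration of the idea, not Qwen3's actual router or decoding setup; every name, shape, and number below is made up for the example.

    import numpy as np

    # Toy sketch: top-k expert gating followed by temperature sampling.
    # Not Qwen3's real router -- just the two choice points described above.
    rng = np.random.default_rng(0)

    def route_top_k(router_logits, k=2):
        """Pick the k highest-scoring experts and renormalize their gates."""
        top = np.argsort(router_logits)[-k:]            # which experts fire
        gates = np.exp(router_logits[top])
        return top, gates / gates.sum()

    def sample_token(token_logits, temperature=0.7):
        """Temperature-sample one token id from the combined expert output."""
        probs = np.exp(token_logits / temperature)
        probs /= probs.sum()
        return rng.choice(len(token_logits), p=probs)   # which token is drawn

    # Fake numbers: 8 experts, a 5-"token" vocabulary.
    experts, gates = route_top_k(rng.normal(size=8), k=2)
    token_logits = sum(g * rng.normal(size=5) for g in gates)  # stand-in expert outputs
    print(experts, sample_token(token_logits))

One flaky draw at either step is enough to shift a character, and in ASCII art a single shifted character is visible for the rest of the picture.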
Anyway, I think qwen3-coder was specifically trained on this one, so it's not a fair comparison. Here are some other Qwen3 models:
Model: chutes/Qwen/Qwen3-235B-A22B
/~\
( * * )
( o o o )
\ - /
\ /\ /
\ /
\/
/|||\
/|||||\
/||||||||\
( o o o )
\ W /
\___/
Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507
Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking
Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507