We're seeing a split in models between deep and wide.
Wide models sound like they know more than deep models, and they're cheap to train and serve, but they fail at reasoning that takes more than a few steps. Deep models know a lot less but can reason much better.
An example I saw all MoE models fail at a few months back: "A and not B" was implicit in the grounding text, and all of them would turn it into "A and B" a substantial proportion of the time. Monolithic models, on the other hand, had no trouble giving the right answer.
The Chinese AI companies can only do wide AI because of restrictions on hardware exports. In the short term this will make more people think LLMs are stochastic parrots, because they can't get simple things right.
They may have a better approach for MoE selection during training:
> The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
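As I read the DeepSeek-V3 report, the auxiliary-loss-free trick is a per-expert bias added to the routing scores only when picking the top-k experts; after each step the bias is nudged down for overloaded experts and up for underloaded ones, while the gating weights still come from the raw scores. Here's a minimal NumPy sketch of that idea; the function names, the mean-load threshold, and the step size `gamma` are my guesses at the details, not their code:

```python
import numpy as np

def route_tokens(scores, bias, k):
    """Pick top-k experts per token using biased scores for selection only;
    gating weights still come from the raw, unbiased scores."""
    biased = scores + bias                            # bias steers selection
    topk = np.argsort(-biased, axis=1)[:, :k]         # chosen expert indices
    gates = np.take_along_axis(scores, topk, axis=1)  # unbiased gate values
    return topk, gates

def update_bias(bias, topk, n_experts, gamma=0.001):
    """After each step, nudge overloaded experts down and underloaded
    experts up, so load balances batch-wise with no auxiliary loss term."""
    load = np.bincount(topk.ravel(), minlength=n_experts)
    bias = bias.copy()
    bias[load > load.mean()] -= gamma   # overloaded: less attractive
    bias[load < load.mean()] += gamma   # underloaded: more attractive
    return bias

# toy run: 8 tokens routed over 4 experts, top-2
rng = np.random.default_rng(0)
scores = rng.random((8, 4))
bias = np.zeros(4)
topk, gates = route_tokens(scores, bias, k=2)
bias = update_bias(bias, topk, n_experts=4)
```

The nice property, if I have it right, is that the bias only affects which experts get picked, so the balancing pressure never distorts the mixture weights and there's no auxiliary loss gradient fighting the language-modeling objective.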
And they have shared experts always present:
> Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones.
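A minimal sketch of what that forward pass might look like: shared experts run on every token unconditionally, and the routed experts contribute on top via gated top-k selection. The tiny ReLU experts and the renormalization over the chosen gates are illustrative assumptions, not DeepSeek's actual implementation:

```python
import numpy as np

def expert(x, w_in, w_out):
    """A tiny two-layer ReLU FFN standing in for one expert."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def moe_layer(x, shared, routed, scores, k=2):
    """Shared experts process every token; routed experts are gated top-k."""
    out = sum(expert(x, w_in, w_out) for w_in, w_out in shared)
    topk = np.argsort(-scores, axis=1)[:, :k]
    gates = np.take_along_axis(scores, topk, axis=1)
    gates = gates / gates.sum(axis=1, keepdims=True)  # renormalize chosen gates
    for t in range(x.shape[0]):                       # per-token routed mix
        for j in range(k):
            w_in, w_out = routed[topk[t, j]]
            out[t] += gates[t, j] * expert(x[t], w_in, w_out)
    return out

# toy run: 5 tokens, model dim 8, 1 shared + 4 routed experts
rng = np.random.default_rng(0)
make = lambda: (0.1 * rng.normal(size=(8, 16)), 0.1 * rng.normal(size=(16, 8)))
shared = [make() for _ in range(1)]
routed = [make() for _ in range(4)]
x = rng.normal(size=(5, 8))
scores = rng.random((5, 4))   # stand-in for learned router affinities
y = moe_layer(x, shared, routed, scores)
```

The shared experts presumably soak up the common knowledge every token needs, which frees the routed experts to specialize; that would fit the specialization result they report in the first quote.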
I'm old enough to remember when everyone outside of a few weirdos thought that a single hidden layer was enough because you could show that type of neural network was a universal approximator.
The same thing is happening with the wide MoE models. They are easier to train and sound a lot smarter than the deep models, but fall on their faces when they need to figure out deep chains of reasoning.