We're seeing a split in models between deep and wide.
Wide models sound like they know more than deep models, and they're cheap to train and serve, but they fail at reasoning that takes more than a few steps. Deep models know a lot less but can reason much better.
An example I saw all MoE models fail at a few months back: "A and not B" was implicit in the grounding text, and all of them would turn it into "A and B" a substantial proportion of the time. Monolithic models, on the other hand, had no trouble giving the right answer.
The Chinese AI companies can only do wide AI because of restrictions on hardware exports. In the short term this will make more people think LLMs are stochastic parrots, because they can't get simple things right.
They may have a better approach for MoE selection during training:
> The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
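As I read the DeepSeek-V3 report, the auxiliary-loss-free trick is a per-expert bias added to the routing scores only when picking the top-k experts; after each step the bias is nudged down for overloaded experts and up for underloaded ones, while the gating weights still come from the raw scores. Here's a minimal NumPy sketch of that idea; the function names, the mean-load threshold, and the step size `gamma` are my guesses at the details, not their code:

```python
import numpy as np

def route_tokens(scores, bias, k):
    """Pick top-k experts per token using biased scores for selection only;
    gating weights still come from the raw, unbiased scores."""
    biased = scores + bias                            # bias steers selection
    topk = np.argsort(-biased, axis=1)[:, :k]         # chosen expert indices
    gates = np.take_along_axis(scores, topk, axis=1)  # unbiased gate values
    return topk, gates

def update_bias(bias, topk, n_experts, gamma=0.001):
    """After each step, nudge overloaded experts down and underloaded
    experts up, so load balances batch-wise with no auxiliary loss term."""
    load = np.bincount(topk.ravel(), minlength=n_experts)
    bias = bias.copy()
    bias[load > load.mean()] -= gamma   # overloaded: less attractive
    bias[load < load.mean()] += gamma   # underloaded: more attractive
    return bias

# toy run: 8 tokens routed over 4 experts, top-2
rng = np.random.default_rng(0)
scores = rng.random((8, 4))
bias = np.zeros(4)
topk, gates = route_tokens(scores, bias, k=2)
bias = update_bias(bias, topk, n_experts=4)
```

The nice property, if I have it right, is that the bias only affects which experts get picked, so the balancing pressure never distorts the mixture weights and there's no auxiliary loss gradient fighting the language-modeling objective.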
And they have shared experts always present:
> Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones.
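A minimal sketch of what that forward pass might look like: shared experts run on every token unconditionally, and the routed experts contribute on top via gated top-k selection. The tiny ReLU experts and the renormalization over the chosen gates are illustrative assumptions, not DeepSeek's actual implementation:

```python
import numpy as np

def expert(x, w_in, w_out):
    """A tiny two-layer ReLU FFN standing in for one expert."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def moe_layer(x, shared, routed, scores, k=2):
    """Shared experts process every token; routed experts are gated top-k."""
    out = sum(expert(x, w_in, w_out) for w_in, w_out in shared)
    topk = np.argsort(-scores, axis=1)[:, :k]
    gates = np.take_along_axis(scores, topk, axis=1)
    gates = gates / gates.sum(axis=1, keepdims=True)  # renormalize chosen gates
    for t in range(x.shape[0]):                       # per-token routed mix
        for j in range(k):
            w_in, w_out = routed[topk[t, j]]
            out[t] += gates[t, j] * expert(x[t], w_in, w_out)
    return out

# toy run: 5 tokens, model dim 8, 1 shared + 4 routed experts
rng = np.random.default_rng(0)
make = lambda: (0.1 * rng.normal(size=(8, 16)), 0.1 * rng.normal(size=(16, 8)))
shared = [make() for _ in range(1)]
routed = [make() for _ in range(4)]
x = rng.normal(size=(5, 8))
scores = rng.random((5, 4))   # stand-in for learned router affinities
y = moe_layer(x, shared, routed, scores)
```

The shared experts presumably soak up the common knowledge every token needs, which frees the routed experts to specialize; that would fit the specialization result they report in the first quote.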
I'm old enough to remember when everyone outside of a few weirdos thought that a single hidden layer was enough because you could show that type of neural network was a universal approximator.
The same thing is happening with the wide MoE models. They are easier to train and sound a lot smarter than the deep models, but fall on their faces when they need to figure out deep chains of reasoning.