It's impressive to think about what DeepSeek achieved: rough parity with o1 and Claude with more than 10x fewer resources.
Better algorithms and approaches are what's needed for the next step of ML.
While impressive, the DeepSeek models aren't really "on par" with either OpenAI or Anthropic offerings right now. The models seem to be a bit overfitted in the post-training step. They are very "stubborn" models: they usually handle tasks well if they can handle them, but steering them is quite difficult. As a result, they score very well on various benchmarks but often perform slightly worse in real-life scenarios.
The blind test at lmarena.ai does give it a higher Elo than GPT-4o (API), Claude, and Gemini 1.5 Pro. It seems that people do enter real-life scenarios in the arena.
DeepSeek v3 feels very much like Sonnet 3.5 (v1) in particular, minus the character. Performs more or less similarly, "feels" overfitted just about the same, and repeats itself in multiturn chats even worse. I hope they address it in v3.5, v4, or whatever comes next.
I use 0.05 for math; I just ran a 5k problem set, trying to fine-tune a smaller model on the outputs. It has some very interesting training, borrowed from R1 per the tech report, where it does the o1/QwQ "thinking steps", but a bit shorter. It solves ~80% of the problems in 4k context, while QwQ would go on for 8k-16k. It's very good at what it does.
But as soon as I need it to do something other than solve a problem (say, rewrite the problem in simpler terms, provide hints given a problem + solution, or rewrite the solution with these <tags>), it kinda stops working. Often it still goes ahead and solves the problem. That's why I'm saying it's stubborn. If a task looks like one it can handle very well, it's really hard to make it perform a different, similar-but-not-quite-the-same task.
We're seeing a split in models between deep and wide.
Wide models sound like they know more than deep models and are cheap to train and serve, but they fail at reasoning with more than a few steps. Deep models know a lot less but can reason much better.
One example I saw all MoE models fail at a few months back: the grounding text implied "A and not B", and all of them would turn it into "A and B" a substantial proportion of the time. Monolithic models, on the other hand, had no trouble giving the right answer.
The Chinese AI companies can only do wide AI because of restrictions on hardware exports. In the short term this will make more people think LLMs are stochastic parrots because they can't get simple things right.
They may have a better approach for MoE selection during training:
> The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
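For anyone who wants to see what batch-wise, auxiliary-loss-free balancing could look like, here is a toy sketch. It follows my reading of the approach: a per-expert bias is added to the routing scores for top-k selection only, and after each batch the bias is nudged down for overloaded experts and up for underloaded ones. The names `route_top_k`, `update_bias`, `simulate`, the step size `gamma`, and the synthetic `skew` are all my own assumptions for illustration, not anything from the report's code.

```python
import random

def route_top_k(scores, bias, k):
    """Select k experts by score + bias. The bias affects selection only;
    in a real layer the gate weights would still come from the raw scores."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]

def update_bias(bias, load, gamma):
    """After a batch, lower the bias of overloaded experts and raise it for
    underloaded ones. Load is counted over the whole batch, not per sequence."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias

def simulate(num_experts=8, k=2, batch_tokens=512, steps=200, gamma=0.01, seed=0):
    """Toy simulation: router scores are Gaussian noise plus a fixed skew that
    favors low-index experts (a hypothetical stand-in for a lopsided router)."""
    rng = random.Random(seed)
    skew = [-0.2 * e for e in range(num_experts)]
    bias = [0.0] * num_experts
    first_load = last_load = None
    for step in range(steps):
        load = [0] * num_experts
        for _ in range(batch_tokens):
            scores = [rng.gauss(skew[e], 1.0) for e in range(num_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        if step == 0:
            first_load = load
        last_load = load
        bias = update_bias(bias, load, gamma)
    return first_load, last_load, bias

if __name__ == "__main__":
    first, last, bias = simulate()
    print("load at first step:", first)
    print("load at last step: ", last)
    print("final biases:      ", [round(b, 2) for b in bias])
```

Because nothing here penalizes imbalance inside a single sequence, a sequence from one domain can still lean heavily on a few experts as long as the batch as a whole evens out, which is the flexibility the quote is talking about.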
And they have shared experts always present:
> Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones.
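A rough sketch of what "shared experts always present" means for a single token, as I understand it: the shared experts run on every token with no routing decision, and only the top-k routed experts are added on top, weighted by a gate. The function name `moe_layer`, the residual connection, and the softmax over the selected scores are my assumptions for illustration; the actual model may normalize its gates differently.

```python
import math

def moe_layer(x, shared_experts, routed_experts, router_scores, k):
    """Toy MoE forward pass for one token vector x (a list of floats).

    shared_experts / routed_experts: lists of callables mapping a vector to a vector.
    router_scores: one score per routed expert for this token.
    Returns the output vector and the indices of the selected routed experts.
    """
    out = list(x)  # residual connection (assumed)

    # Shared experts: always applied, independent of the router.
    for expert in shared_experts:
        out = [o + y for o, y in zip(out, expert(x))]

    # Routed experts: only the top-k by router score are evaluated.
    top = sorted(range(len(routed_experts)),
                 key=lambda i: router_scores[i], reverse=True)[:k]

    # Gate weights: softmax over the selected scores only (assumed normalization).
    m = max(router_scores[i] for i in top)
    exps = {i: math.exp(router_scores[i] - m) for i in top}
    z = sum(exps.values())
    for i in top:
        gate = exps[i] / z
        out = [o + gate * y for o, y in zip(out, routed_experts[i](x))]
    return out, top

if __name__ == "__main__":
    scale = lambda c: (lambda v: [c * a for a in v])  # hypothetical stand-in experts
    out, top = moe_layer([1.0, 2.0], [scale(2)], [scale(10), scale(100), scale(1000)],
                         router_scores=[0.0, 5.0, 5.0], k=2)
    print(top, out)
```

The idea, as far as I can tell, is that common knowledge can live in the shared experts, so the routed ones are free to specialize instead of each relearning the same basics.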
I'm old enough to remember when everyone outside of a few weirdos thought that a single hidden layer was enough because you could show that type of neural network was a universal approximator.
The same thing is happening with the wide MoE models. They are easier to train and sound a lot smarter than the deep models, but fall on their faces when they need to figure out deep chains of reasoning.
What's the engineering situation at OpenAI since the whole "firing Sam Altman" spectacle? Has there been significant brain drain that affects something like o1?