Impressive to think about how DeepSeek achieved roughly parity with o1 and Claude using >10x fewer resources. Better algorithms and approaches are what's needed for the next step of ML.

While impressive, the DeepSeek models aren't really "on par" with either OpenAI's or Anthropic's offerings right now. The models seem to be a bit overfitted in the post-training step. They are very "stubborn" models: they usually handle tasks well if they can handle them at all, but steering them is quite difficult. As a result, they score very well on various benchmarks but oftentimes perform slightly worse in real-life scenarios.

The blind test at lmarena.ai does give it a higher Elo than GPT-4o (API), Claude, and Gemini 1.5 Pro. It seems that people do enter real-life scenarios in the arena.

DeepSeek v3 feels very much like Sonnet 3.5 (v1) in particular, minus the character. It performs more or less similarly, "feels" overfitted to about the same degree, and repeats itself even more in multi-turn chats. I hope they address that in v3.5, v4, or whatever comes next.

  They are very "stubborn" models
Have you found this to be the case even when using the recommended temperature settings (ranging from 0 for math to 1.5 for creative tasks)?

I use 0.05 for math; I just ran a 5k problem set, trying to fine-tune a smaller model with the outputs. It has some very interesting training, borrowed from R1 per the tech report, where it does the o1/QwQ "thinking steps", but a bit shorter. It solves ~80% of the problems within 4k context, while QwQ would go on for 8k-16k. It's very good at what it does.
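
A minimal sketch of that per-task temperature setup, assuming DeepSeek's OpenAI-compatible endpoint (the helper function and task labels below are illustrative, not anything official):

  from openai import OpenAI

  # DeepSeek exposes an OpenAI-compatible chat API, so the stock client works.
  client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

  # Per-task temperatures per the discussion above; not official defaults.
  TEMPERATURE = {"math": 0.05, "creative": 1.5}

  def ask(prompt: str, task: str = "math") -> str:
      resp = client.chat.completions.create(
          model="deepseek-chat",
          messages=[{"role": "user", "content": prompt}],
          temperature=TEMPERATURE[task],
      )
      return resp.choices[0].message.content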

But as soon as I need it to do something other than solve a problem - say, rewrite the problem in simpler terms, or provide hints given a problem + solution, or rewrite the solution with these <tags>, etc. - it kinda stops working. Oftentimes it still goes ahead and solves the problem anyway. That's why I'm saying it's stubborn: if a task looks like a task it can handle very well, it's really hard to make it perform that other, similar-but-not-quite-the-same task.

In a similar vein - https://github.com/cpldcpu/MisguidedAttention/tree/main/eval...


I found DeepSeek very useful for coding with Aider. On par with Claude.

We're seeing a split in models between deep and wide.

Wide models sound like they know more than deep models and are cheap to train and serve, but they fail at reasoning that takes more than a few steps. Deep models know a lot less but can reason much better.

An example I saw all MoE models fail at a few months back was "A and not B" being implicit in the grounding text: all of them would turn it into "A and B" a substantial proportion of the time. Monolithic models, on the other hand, had no trouble giving the right answer.

The Chinese AI companies can only do wide AI because of the restrictions on hardware exports. In the short term this will make more people think LLMs are stochastic parrots, because the wide models can't get simple things right.


They may have a better approach for MoE selection during training:

> The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
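
Roughly, the trick is to add a per-expert bias to the routing scores only when selecting the top-k experts, then nudge those biases between steps so overloaded experts become less attractive. A toy sketch of the idea (tensor shapes, the update rule, and the gate normalization are simplified, not the paper's exact formulation):

  import torch

  n_experts, top_k, gamma = 16, 2, 1e-3   # gamma = bias update speed
  bias = torch.zeros(n_experts)           # persistent, updated outside autograd

  def route(scores):
      # scores: (tokens, n_experts) token-to-expert affinities.
      # Selection uses the biased scores; the gate weights use the raw ones,
      # so the balancing bias never distorts the actual mixture weights.
      topk_idx = (scores + bias).topk(top_k, dim=-1).indices
      gates = torch.gather(scores, -1, topk_idx).softmax(dim=-1)
      return topk_idx, gates

  def update_bias(topk_idx):
      # After each step: busy experts get a smaller bias, idle ones a larger.
      load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
      bias[load > load.mean()] -= gamma
      bias[load <= load.mean()] += gamma

Since no auxiliary loss term enters the gradient, the balancing pressure doesn't fight the language-modeling objective, which is presumably why the experts are freer to specialize per domain.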

And they have shared experts always present:

> Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones.
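
In code terms: every token always passes through the shared expert(s), and additionally through its top-k routed experts. A toy forward pass to show the layout (layer sizes are made up, and the per-token loop stands in for the batched dispatch real implementations use):

  import torch
  import torch.nn as nn

  class ToyDeepSeekMoE(nn.Module):
      def __init__(self, d=256, n_routed=16, n_shared=1, top_k=2):
          super().__init__()
          self.shared = nn.ModuleList(nn.Linear(d, d) for _ in range(n_shared))
          self.routed = nn.ModuleList(nn.Linear(d, d) for _ in range(n_routed))
          self.router = nn.Linear(d, n_routed)
          self.top_k = top_k

      def forward(self, x):                     # x: (tokens, d)
          out = sum(e(x) for e in self.shared)  # shared experts: always active
          w, idx = self.router(x).softmax(-1).topk(self.top_k, -1)
          routed = torch.zeros_like(out)
          for t in range(x.size(0)):            # sparse: top_k experts per token
              for j in range(self.top_k):
                  routed[t] += w[t, j] * self.routed[int(idx[t, j])](x[t])
          return x + out + routed               # residual + shared + routed

"Finer-grained" just means many small experts instead of a few big ones, so the router can compose more specialized pieces per token.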


That's just making the architecture work better.

I'm old enough to remember when everyone outside of a few weirdos thought that a single hidden layer was enough because you could show that type of neural network was a universal approximator.

The same thing is happening with the wide MoE models. They are easier to train and sound a lot smarter than the deep models, but fall on their faces when they need to figure out deep chains of reasoning.


> Better algorithms and approaches are what's needed for the next step of ML.

I think they did great, but they relied on distillation. So it's like riding on a skateboard while being pulled by a car.


What's the engineering situation at OpenAI since the whole "firing Sam Altman" spectacle? Has there been significant brain drain that affects something like o1, etc.?

Makes you wonder if OpenAI has a moat.

How are these models benchmarked?


