
Understanding “Deep Double Descent” - alexcnwy
https://www.alignmentforum.org/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent
======
jackson1372
The reason you want to over-parameterize your model is that it protects you
from "bad bounce" learning trajectories. You effectively spread out your
overfitting risk until it's pretty close to 0.

Or at least that's the way I like to think of it.

The next step is to compress the resulting model into a simpler, less
computationally costly network.
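
To make that second (compression) step concrete, here's a minimal distillation-style sketch of what I have in mind; the layer sizes, temperature, and loss weights are just my own illustrative choices, nothing from the article:

```python
# Hypothetical sketch: compress an over-parameterized "teacher" into a smaller
# "student" by matching softened output distributions (distillation-style loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(784, 4096), nn.ReLU(), nn.Linear(4096, 10))  # big net (placeholder)
student = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))    # cheap net (placeholder)

T = 4.0  # softmax temperature; arbitrary illustrative choice
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(x, y):
    with torch.no_grad():
        t_logits = teacher(x)                    # teacher is assumed already trained
    s_logits = student(x)
    # KL between softened teacher/student distributions, plus plain CE on the labels
    soft = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(s_logits, y)
    loss = 0.5 * soft + 0.5 * hard
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```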

~~~
gyuserbti
Are you suggesting double descent is about local minima, sort of? Like, if you
extended the risk vs. parametrization curve out, would you start to see
overfitting again?

------
liaukovv
"Understanding" part seems rather optimistic.

------
gyuserbti
My initial intuition is that there are limitations in the test samples that are
used, in the sense that they only carry so much information. At some point
overfitting is likely to manifest not in test risk per se, but in random
variation across alternate test samples. E.g. overfitting would show up as
susceptibility to adversarial regimes rather than as cross-validation risk.

I've always been skeptical of cross validation based inference though and
admit it's a fascinating phenomenon in the paper.

It just seems, informationally speaking, to be proposing something akin to
free energy: that more data is worse and if you just increase your model
complexity you can magically infer truth. It seems more likely to be an error
in the inferential paradigm.
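
To make the "adversarial regimes" point concrete, this is roughly the check I have in mind: compare clean test error against error under a simple one-step FGSM perturbation. The `model` and `loader` here are placeholders and the epsilon is arbitrary:

```python
# Rough sketch: clean test error can look fine while error under a small
# FGSM perturbation tells a different story. `model` and `loader` are placeholders.
import torch
import torch.nn.functional as F

def clean_vs_fgsm_error(model, loader, eps=0.03):
    model.eval()
    clean_wrong, adv_wrong, total = 0, 0, 0
    for x, y in loader:
        x = x.clone().requires_grad_(True)
        logits = model(x)
        F.cross_entropy(logits, y).backward()
        x_adv = (x + eps * x.grad.sign()).detach()   # one-step FGSM perturbation
        with torch.no_grad():
            adv_logits = model(x_adv)
        clean_wrong += (logits.argmax(1) != y).sum().item()
        adv_wrong += (adv_logits.argmax(1) != y).sum().item()
        total += y.numel()
    return clean_wrong / total, adv_wrong / total
```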

------
ganzuul
This observation seems to come up in discussions regarding the hot topic of
the
[https://en.wikipedia.org/wiki/Neural_tangent_kernel](https://en.wikipedia.org/wiki/Neural_tangent_kernel)

Please correct me if I'm wrong, but I think the idea is that you can, in
theory, conjure specialized kernel methods out of 'infinitely' over-
parametrized neural networks. At the moment this all gives unimpressive
performance, but it is theoretically promising and could give statisticians
interpretable NN-derived models.
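
As I understand it, the empirical version of that kernel is just the Gram matrix of parameter gradients, K(x, x') = ∇θf(x)·∇θf(x'), and the NTK result says this kernel stays essentially fixed during training as the width goes to infinity. A crude sketch of computing it (the toy network and inputs are my own placeholders):

```python
# Crude sketch of the *empirical* NTK for a scalar-output network:
# K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>. Toy net and inputs are placeholders.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 512), nn.ReLU(), nn.Linear(512, 1))
xs = torch.randn(8, 2)  # a handful of inputs

def param_grad(x):
    net.zero_grad()
    net(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

grads = torch.stack([param_grad(x) for x in xs])
ntk = grads @ grads.T  # 8x8 empirical tangent kernel; ~constant during training in the infinite-width limit
```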

------
lostmsu
Can someone explain what the "interpolation threshold" is? Both articles talk
about it, but neither defines it.

~~~
preetum
Hi, we define the interpolation threshold in Section 2 of the full paper
([https://arxiv.org/abs/1912.02292](https://arxiv.org/abs/1912.02292)) as the
point when the "Effective Model Complexity" = # of train samples.

Where the "Effective Model Complexity (EMC)" of a model + training procedure
(w.r.t an input distribution) is the maximum number of samples from the
distribution that the model+training can fit to ~0 train error.

Our experiments are consistent with the hypothesis that the double-descent
peak occurs when EMC = n; that is, when the model+training is just barely able
to fit the train set.
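
Concretely, a toy way to estimate the EMC of a fixed model + training procedure (just an illustration, not the experimental protocol from the paper) is to scan n and take the largest train-set size the procedure still drives to ~0 train error:

```python
# Illustration only (not the paper's protocol): scan n and record the largest
# train-set size this fixed model + training procedure still fits to ~0 error.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fits_train_set(n, epochs=1000):
    torch.manual_seed(0)
    x = torch.randn(n, 20)
    y = torch.randint(0, 2, (n,))        # random labels: nothing to generalize, only to memorize
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return (model(x).argmax(1) != y).float().mean().item() == 0.0

emc_estimate = 0
for n in range(10, 200, 10):
    if fits_train_set(n):
        emc_estimate = n                 # largest n (so far) driven to zero train error
print("estimated EMC ~", emc_estimate)   # double-descent peak predicted near n_train ~ EMC
```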

~~~
joe_the_user
It seems like a phenomenon of this sort would have to depend on the data set
you are dealing with. It might be true for "all typical data sets in important
domains", but the basic point would still seem to hold.

~~~
preetum
Oh yes, absolutely. We only make claims for "natural distributions and models"
(^).

It's almost certainly possible to break this by pathological choice of a data
distribution or model/initialization/optimization scheme. But, I don't think
this is interesting -- what I think is interesting is that this seems to hold
true in real life, in "natural settings".

---

(^) Whatever "natural distributions" means... (which I think is a good
research direction in itself).

~~~
jackson1372
Isn't the explanation for this that the world actually _does_ work in some way
or other, and that it's not just infinite chaos? So if you keep throwing
parameters at some problem, you will eventually stumble upon the "real"
structure, though there's no guarantee of when that occurs, or with which
parameters.

~~~
joe_the_user
Well, the thing is that when one says "the world has structure", one is saying
that there is a variety of structures "out there", in the world.

But that doesn't mean there's a single structure determined by a single set of
parameters. Quite possibly there are numerous structures with mutually
incompatible parametrizations.

Moreover, common AI data sets share parameters in a fashion that isn't always
obvious - most images on the web are photos taken by human photographers who
tend to center their subject, effectively giving them different parameters
than, say, security camera footage. I.e., "normal data" may not mean what we
imagine.

------
_0ffh
Wow, that implies that given a large enough model, early stopping is actually
a mistake! Who’d a thunk it?

~~~
gwern
But maybe it's not a big one (the reduction on the other side is still fairly
small), and it may not be one at all (do you have the compute budget? Look at
how many epochs it takes to reach the other side and actually realize any
gains!). More theoretically interesting than practical advice, I'd say.
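
A rough sketch of that budget question on a toy setup (my own placeholder data and model, not the paper's): keep the whole test-error trajectory rather than stopping at the first minimum, and see whether the second descent ever pays for the extra epochs:

```python
# Rough sketch on a toy setup (not from the paper): log the whole
# test-error trajectory rather than stopping at the first minimum.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_data(n, noise=0.2):
    x = torch.randn(n, 20)
    y = (x[:, 0] + 0.5 * x[:, 1] > 0).long()
    flip = torch.rand(n) < noise                 # label noise makes the peak more visible
    y[flip] = 1 - y[flip]
    return x, y

x_tr, y_tr = make_data(200)
x_te, y_te = make_data(2000)
model = nn.Sequential(nn.Linear(20, 2048), nn.ReLU(), nn.Linear(2048, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.05)

test_errs = []
for epoch in range(3000):                        # deliberately far past a "sensible" budget
    opt.zero_grad()
    F.cross_entropy(model(x_tr), y_tr).backward()
    opt.step()
    with torch.no_grad():
        test_errs.append((model(x_te).argmax(1) != y_te).float().mean().item())

budget = 200                                     # roughly where early stopping would have quit
print("early-stop best:", min(test_errs[:budget]), "full-run best:", min(test_errs))
```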

~~~
_0ffh
Yup, from the graph it looks like at least an order of magnitude more
training time until it starts to get interesting. Obviously this is only one
example, so that factor might vary mightily. It might be an option for someone
who is really serious about squeezing out the last drop of generalisation
performance. I guess we'll see how things work out, because some folk are
bound to try.

------
dzdt
This is a really good read! I am not deep in machine learning research but the
exposition and diagrams make the point clearly. I really feel like this is
advancing deep learning as a science.

------
eutectic
What happens if you take a large model and regularize it to have the same
training error as a smaller model? Do you get the same benefits?

