

Estimating Coefficients in Linear Models: It Don't Make No Nevermind (1976) - nkurz
http://www-stat.wharton.upenn.edu/~hwainer/Readings/Wainer_Estimating%20Coefficients%20in%20Linear%20Models.pdf

======
stiff
Any commentary on why anyone would consider this paper of interest today? The
peculiarities of the least-squares fit are no secret, the assumptions of the
"generalizability of fit theorem" presented are highly unrealistic (it
compares equal weights to weights chosen uniformly at random), and we now
live in the age of things like support vector machines, structural risk
minimization, VC dimension, boosting and so forth, which all address the
issue of overfitting very well. This reads like the prehistory of the field,
and not a very significant part of it at that.

~~~
yummyfajitas
The paper is interesting because it is a highly readable and extremely
elementary introduction to the topic. Most HN readers don't even know the
prehistory of the field, so I think it's useful to them.

~~~
stiff
The rudiments of the modern theory can be understood with little more
difficulty, and from them it is clear that any significant restriction of a
model's degrees of freedom reduces the chances of overfitting, but also
decreases the fraction of predictions the model will get right. The real
issue is where exactly to draw the line, and this is now understood quite
well; the approach from the paper, for practical purposes, throws out the
baby with the bathwater. The first few lectures of Yaser Abu-Mostafa's
machine learning course are a really engaging introduction to these topics:

[https://work.caltech.edu/](https://work.caltech.edu/)
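
To make the tradeoff concrete, here is a toy numpy sketch of my own (not from
the paper or the course): training error falls steadily as we add degrees of
freedom, while test error eventually turns back up once the model starts
fitting noise.

    # Fit polynomials of increasing degree to noisy samples of sin(2*pi*x);
    # training MSE keeps falling while test MSE eventually rises again.
    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)              # true function
    x_tr, x_te = rng.uniform(size=30), rng.uniform(size=200)
    y_tr = f(x_tr) + rng.normal(0, 0.3, size=30)
    y_te = f(x_te) + rng.normal(0, 0.3, size=200)

    for degree in (1, 3, 9, 15):
        coefs = np.polyfit(x_tr, y_tr, degree)       # least-squares polynomial fit
        mse = lambda x, y: np.mean((np.polyval(coefs, x) - y) ** 2)
        print(f"degree {degree:2d}: train MSE {mse(x_tr, y_tr):.3f}, "
              f"test MSE {mse(x_te, y_te):.3f}")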

By the way, Howard Wainer is a noted author of semi-popular statistics books
(some formulas actually appear, etc.), so if you enjoyed the writing, a
better use of time might be to read his newer and more general stuff:

[http://www.amazon.com/Howard-Wainer/e/B000AP7SUU/ref=sr_ntt_...](http://www.amazon.com/Howard-Wainer/e/B000AP7SUU/ref=sr_ntt_srch_lnk_1?qid=1401098828&sr=8-1)

------
terranstyler
This is extremely interesting. In the forecasting domains I work in (I use
neural networks on standardized inputs), I typically use at most 50-100
epochs for a few thousand rows (with roughly 10 input neurons), and I don't
see networks improving much after that.

However, our industry peers often use a few thousand up to several tens of
thousands of epochs, yet our software is regularly best or second best when
compared by clients.

I think this may be the same effect: network weights should be "ok" after a
few dozen epochs, and any training beyond that maybe gains a percent of
variance explained but already risks overfitting.

We use the spare time to train more networks (on slightly different input
data) and aggregate their results afterwards.
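
What we do is essentially early stopping. A minimal numpy sketch of the idea
on a toy linear model (hypothetical, not our production code):

    # Early stopping: halt training when validation loss stops improving,
    # instead of running for thousands of epochs.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 10))              # a few thousand rows, 10 inputs
    y = X @ rng.normal(size=10) + rng.normal(0, 1.0, size=2000)
    X_tr, y_tr, X_va, y_va = X[:1500], y[:1500], X[1500:], y[1500:]

    w, best_val, patience = np.zeros(10), np.inf, 0
    for epoch in range(10_000):
        w = w - 0.01 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)  # full-batch step
        val = np.mean((X_va @ w - y_va) ** 2)
        patience = 0 if val < best_val - 1e-6 else patience + 1
        best_val = min(best_val, val)
        if patience >= 10:                       # 10 epochs with no improvement
            print(f"stopped at epoch {epoch}, validation MSE {val:.3f}")
            break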

Something else: can someone explain the "it don't make no nevermind"? I
suppose it's funny, but I don't get it.

~~~
morenoh149
[http://www.urbandictionary.com/define.php?term=it+don%27t+ma...](http://www.urbandictionary.com/define.php?term=it+don%27t+make+no+nevermind)
though I don't get how it relates to the paper

------
DiabloD3
Given the almost 40 years since this was written, how much has actually
changed?

~~~
michaelochurch
Data set sizes have changed. That can make this issue (the tendency of OLS
regression to overfit) more or less of a problem.

If you have 100 observations and 10 dimensions (N = 100, p = 10), then OLS
can overfit, especially if the predictors are highly correlated or there's a
lot of noise on the response. If you have N = 10^6, p = 10, then probably
not. If you have N = 10^6 and p = 10^5, then overfitting is again a danger.
If you have N = 10^5 and p = 10^6, then you can't even do an OLS regression,
because when p > N the normal equations require inverting a singular matrix
(and, to boot, a gigantic one). The rough guideline is:

    N/p > 100:        very low danger of overfitting.
    10 < N/p < 100:   low danger of overfitting.
    1 < N/p < 10:     high danger of overfitting; use regularization (ridge, Lasso, etc.).
    N/p < 1:          OLS not even well-defined; regularization mandatory.
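
A quick simulation of why that ratio matters (toy numbers of my own, not a
derivation): with N = 100, OLS generalizes fine at p = 5 but falls apart as p
approaches N.

    # OLS train vs. test MSE as p approaches N (noise SD = 2, so the best
    # achievable test MSE is about 4).
    import numpy as np

    rng = np.random.default_rng(2)
    N = 100
    for p in (5, 50, 80):
        beta = rng.normal(size=p)
        X_tr, X_te = rng.normal(size=(N, p)), rng.normal(size=(10_000, p))
        y_tr = X_tr @ beta + rng.normal(0, 2.0, size=N)
        y_te = X_te @ beta + rng.normal(0, 2.0, size=10_000)
        w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # OLS fit
        print(f"N/p = {N/p:5.1f}: "
              f"train MSE {np.mean((X_tr @ w - y_tr) ** 2):6.2f}, "
              f"test MSE {np.mean((X_te @ w - y_te) ** 2):6.2f}")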

Equal weights are a primitive way of regularizing. You lose accuracy in the
zero-noise case, but you get a model that is more likely to be "wrong" in a
way that is harmless for prediction (because of correlations among the
predictors). These days, people are more likely to use ridge regression or,
if they want a parsimonious model (few nonzero coefficients), the Lasso or
the elastic net.
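
For a feel of the effect, here is a toy numpy comparison (my own simulation,
in the spirit of the paper rather than Wainer's actual setup): with
correlated predictors and a noisy response, a single fitted common weight
predicts nearly as well as the full OLS fit, and ridge sits in between.

    # Equal weights vs. OLS vs. ridge on equicorrelated predictors.
    import numpy as np

    rng = np.random.default_rng(3)
    N, p, rho = 50, 8, 0.7
    cov = rho + (1 - rho) * np.eye(p)            # equicorrelated predictors
    X = rng.multivariate_normal(np.zeros(p), cov, size=N)
    X_te = rng.multivariate_normal(np.zeros(p), cov, size=10_000)
    beta = rng.uniform(0.5, 1.5, size=p)         # all-positive true weights
    y = X @ beta + rng.normal(0, 3.0, size=N)
    y_te = X_te @ beta + rng.normal(0, 3.0, size=10_000)

    w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w_ridge = np.linalg.solve(X.T @ X + 10.0 * np.eye(p), X.T @ y)
    s = X.sum(axis=1)                            # equal weights: fit one shared scale
    w_eq = np.full(p, (s @ y) / (s @ s))

    for name, w in [("OLS  ", w_ols), ("ridge", w_ridge), ("equal", w_eq)]:
        print(name, f"test MSE {np.mean((X_te @ w - y_te) ** 2):.2f}")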

Though you almost never see equal weights in a _regression_ setting, you do
see equal-weight constraints in neural network architectures (e.g.
convolutional neural nets) to prevent overfitting. In neural nets this can
usually be done with no loss of accuracy, because ANNs are already so heavily
parameterized that some form of regularization (weight decay, equal-weight
constraints, early stopping) is mandatory.
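
A back-of-the-envelope illustration of why weight sharing regularizes (plain
arithmetic, no particular framework assumed): tying the weights collapses the
free-parameter count by several orders of magnitude.

    # Free parameters: dense layer vs. one shared 3x3 convolution kernel,
    # both mapping a 32x32 input to a 32x32 output.
    H = W = 32
    dense_params = (H * W) ** 2                  # every output sees every input
    conv_params = 3 * 3                          # one kernel, reused at every position
    print(dense_params, conv_params)             # 1048576 vs. 9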

~~~
terranstyler
I was asking about that in another thread. Do you have any references for
your N/p rule of thumb? Thanks in advance!

~~~
michaelochurch
That N/p rule of thumb is ultimately subjective, except for the fact that
when N/p < 1 you can't use OLS at all (because there is an infinitude of
zero-training-error solutions). It also depends on the correlations among the
predictors and on the signal-to-noise ratio. Those are just approximate
guidelines.
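
The p > N point is easy to check numerically (toy example): there are
infinitely many zero-training-error coefficient vectors, and numpy's lstsq
just hands back the minimum-norm one.

    # With p > N, the training residual is exactly zero: a "perfect" fit,
    # with nothing left to distinguish the infinitely many solutions.
    import numpy as np

    rng = np.random.default_rng(4)
    N, p = 20, 50                                # more coefficients than rows
    X, y = rng.normal(size=(N, p)), rng.normal(size=N)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)    # minimum-norm solution
    print(np.max(np.abs(X @ w - y)))             # ~1e-14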

