> If the error vector in our regression model follows any distribution in the family of Elliptically Symmetric distributions, then any test statistic that is scale-invariant has the same null and alternative distributions as they have when the errors are normally distributed.
Most of the attention paid to distributional assumptions in regression is wasted, and would be better spent on really thinking through the assumed moment conditions underlying the estimator.
That's not wrong, but it's a strong way to word it. If linear regression were only suitable when the variables were perfectly linearly related, it would get a lot less use. Practically, linear regression can be used when the relationship is linear-ish, at least in the interval of interest. In other words, you can choose to declare linearity as an assumption (and take responsibility for what that choice entails, and for the error it might introduce into your analysis).
There's nothing stopping you from using it as a "best fit line", even when you have no reason to believe those assumptions. But then it's just a best-fit line. It tells you the direction and magnitude of linear trend, nothing more. That's never wrong in any sense, it's just that sometimes it's not very useful.
For example, see here, where arbitrary choices of how to code categorical variables change the estimates:
If you change the model the meaning of all the coefficients changes.
I agree, changing the model changes the estimates, because the parameters you are estimating change.
However, given one misspecified model, the parameters of that model are still well defined, though they may not have the interpretation they would if the model were correctly specified. As OP called it, this is the "best fit line", and is a projection of the truth onto your model. E.g. for a simple linear regression of Y on X, where the true conditional mean of Y given X is not linear, there is still some "true" best line. This line also depends on the distribution of X, though it would not if the model were correct. Estimates from linear regression will converge to the parameters of this line, though the usual standard errors will be wrong.
There's a very general theorem or corollary that covers this in Asymptotic Statistics by van der Vaart. I think in the chapter about M estimators, right around where MLEs are covered, but I don't have it in front of me.
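To make that concrete, here's a toy simulation (all numbers invented; a sketch of the idea, not the theorem): the truth E[Y|X] = sin(X) is not linear, but OLS still converges to the best linear approximation under this distribution of X, whose slope is Cov(X, sin(X)) / Var(X).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.uniform(-2, 2, size=n)
    y = np.sin(x) + rng.normal(scale=0.5, size=n)

    # OLS fit of the (misspecified) linear model
    X = np.column_stack([np.ones(n), x])
    slope_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

    # Slope of the "true" best line under this distribution of X
    slope_star = np.cov(x, np.sin(x))[0, 1] / np.var(x)
    print(slope_ols, slope_star)  # agree closely at this sample size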
First, there is the statistical level, at which we are drawing some conclusion about the model parameter. This may work even for a misspecified model.
Then there is the level at which you want to draw some conclusion about reality, call it the "scientific level". If the model is misspecified, the parameters/coefficients may or may not correspond to the thing of interest. Perhaps the model is a close enough approximation for those values to be meaningful, perhaps not...
I think it is the second ("scientific level") of inference that most people are concerned about. The rigor of the proofs/theorems that may work at the statistical level does not extend to the scientific level.
Afaict, the majority of erroneous inference occurs at the scientific level and statistical error/uncertainty is a sort of minimum error/uncertainty.
When you use linear regression to fit a model like

y ~ ax + b(x^2)

it's still linear regression, because the model is linear in the parameters a and b even though it's nonlinear in x.
I get that it's a potential cause of confusion for someone who has no training in stats. But it's also jargon that describes a useful concept, and that is literally transparent if you do have enough understanding of the math to know what "linear" and "parameters" mean in this context.
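To illustrate (a minimal numpy sketch with invented data): fitting y ~ ax + b(x^2) is still ordinary least squares, because the design matrix just gets an x^2 column.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, size=200)
    y = 1.5 * x - 0.8 * x**2 + rng.normal(scale=0.3, size=200)

    # Nonlinear in x, linear in the parameters: columns [1, x, x^2]
    X = np.column_stack([np.ones_like(x), x, x**2])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    print(coef)  # roughly [0, 1.5, -0.8]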
* It allows you to model essentially arbitrary functions. The main model assumption is your choice of kernel, which defines the local correlation between nearby points.
* You can draw samples from the distribution of all possible functions that fit your data.
* You can quantify which regions of the function you have more or less certainty about.
* Imagine this situation: you want to discover the functional relationship between the inputs and outputs of a long-running process. You can test any input you want, but it's not practical to exhaustively grid-search the input space. A Gaussian Process model can tell you which inputs to test next so as to gain the most information, which makes it perfect for optimising complex simulations. Used in this way, it's one means of implementing "Bayesian Optimisation".
In practice, I've found GPs to be great for getting actual insight into an unknown function, but much less useful as a black-box learner.
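Concretely, a minimal scikit-learn sketch (toy 1-D data; the RBF kernel is the modelling assumption here):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(15, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=15)

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
    gp.fit(X, y)

    X_test = np.linspace(0, 10, 100).reshape(-1, 1)
    mean, std = gp.predict(X_test, return_std=True)  # pointwise uncertainty
    samples = gp.sample_y(X_test, n_samples=3)       # draws from the posterior

    # The input with the largest posterior std is the kind of point a
    # Bayesian-optimisation loop would probe next.
    print(X_test[np.argmax(std)])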
I’ve wanted to apply the approach you mention a few times, but documentation seems to go from “Wiki” level to novel research articles. Are there any good introductory books / resources that aren’t beginner level? That scikit library looks handy!
https://www.sciencedirect.com/science/article/pii/S092540051... (Section 4.1, about ozone predictors)
Using lasso (also mentioned in TFA) would tend to pick the best of the three and drop the others.
Using elastic net would give a mix of the two, since it combines the ridge and lasso penalties.
Note, though, that any method other than simple regression has tuning parameters -- depending on those, you could still end up with a result equivalent to plain least squares.
As the article mentions however, there are regression methods meant for these situations (e.g. ridge regression).
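For illustration, a quick scikit-learn sketch (invented data) of how these behave on nearly collinear predictors:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    rng = np.random.default_rng(0)
    z = rng.normal(size=500)
    # Three noisy copies of the same underlying signal
    X = np.column_stack([z + rng.normal(scale=0.05, size=500) for _ in range(3)])
    y = 2.0 * z + rng.normal(size=500)

    for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
        model.fit(X, y)
        print(type(model).__name__, np.round(model.coef_, 2))
    # Typically: OLS splits the effect unstably, ridge spreads it evenly
    # across the three columns, lasso concentrates it on one.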
Wrong, wrong, wrong, wrong.
If predictors are linearly dependent you don't get to do regression at all -- your X'X is singular. But then, the extra regressors add no information at all, and classical statistical packages (SPSS, Stata, etc.) drop them automatically.
Even if predictors are highly correlated, the OLS estimator is unbiased. This is the stuff of elementary statistics. You just get higher and higher p-values/wider and wider CIs, especially if your samples are econometrics-sized.
You people need to watch some Khan Academy or whatever the cool kids are doing now to learn maths.
Yes, if your variables are perfectly linearly dependent they get dropped. Did anyone say otherwise? I did not think about this case because most correlated measures causing multicollinearity problems aren't perfectly 'linearly dependent'. Linear dependency usually only comes up in practice if you miscoded some of your dummy variables (e.g. adding both 'male[0,1]' and 'not male[0,1]' as two categorical predictors). So I am not really sure of your point.
As to your second point, it might be unbiased, but the statistical inference (i.e. p-values) would be incorrect with multicollinearity... thus again, I am not sure of your point when you are only repeating what I said.
Moreover, it may not be particularly meaningful to the researcher even if the parameter estimate is unbiased. One frequently finds with multicollinearity that the signs of effects switch (- to +, or + to -) as you add highly correlated predictors into a model, in often theoretically questionable ways. This does serve to remind one that the parameter estimates are only meaningful in the context of the other predictors in the model.
As long as the unexplained term is uncorrelated with the included variables (in the probabilistic model; the least-squares fit forces this computationally), your coefficients will remain unchanged. So adding/removing variables shouldn't change results at all -- unless the model is mis-specified and you're including variables that correlate with unobserved factors in unexpected ways.
So for example a regression of children's IQ on the income of their parents provides a plausible mechanism; but if you add the arm length of the kids you will have problems, since arm length is correlated to an omitted variable (kids with longer arms are older and perform better on IQ tests).
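A toy simulation of that arm-length story (all numbers invented):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    age = rng.uniform(4, 12, size=n)
    income = rng.normal(50, 10, size=n)
    arm = 0.4 + 0.05 * age + rng.normal(scale=0.02, size=n)
    iq = 80 + 0.1 * income + 2.0 * age + rng.normal(scale=5, size=n)

    def ols(X, y):
        X = np.column_stack([np.ones(len(y)), X])
        return np.linalg.lstsq(X, y, rcond=None)[0]

    print(ols(income, iq))                          # income coefficient ~0.1
    print(ols(np.column_stack([income, arm]), iq))  # arm length proxies the
    # omitted age variable and picks up a large spurious coefficient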
That's most of the "in context" story. Nothing to do with multicollinearity.
The 'in context' was not so much about multicollinearity but about shared and unique variance.
However, there are at least two types of regression I'd add to the list, and a suggestion:
1. Multivariate Distance Matrix Regression (MDMR; Anderson, 2001; McArdle & Anderson, 2001).
2. Regression with splines
3. On polynomial regression, add mention of orthogonal polynomials.
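On point 3, a quick numpy sketch (invented data) of why orthogonal polynomials help:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=500)

    raw = np.column_stack([x, x**2, x**3])
    ortho = np.polynomial.legendre.legvander(x, 3)[:, 1:]  # drop constant column

    print(np.round(np.corrcoef(raw, rowvar=False), 2))    # x vs x^3: ~0.9
    print(np.round(np.corrcoef(ortho, rowvar=False), 2))  # near identity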
I suppose vectors for both training and testing would be required.
Would gladly pay $1-$5 per batch for a service to do this.
1) Take a dataset and split into training and test
2) Using the training set: run a bunch of different regressors (for a training-training subset) and get predictions (for the remaining test-training subset)
3) Run a higher-level regression against test-training subset predictions. I use either plain linear regression (so my meta-regressor is a linear combination of the regressors) or K-nearest neighbors (so the best regressor for each region of feature space is chosen).
4) If there are hyperparameters, optimize against the test set (not the test-training subset).
It's not available as an API. I'm available for consulting though.
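For anyone who wants to roll their own, a rough sklearn sketch of steps 1-4 above (model choices and data are placeholders, not the poster's actual pipeline):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=1000)

    # 1) split into training and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    # 2) split training again; fit base regressors on the first part
    X_tt, X_ho, y_tt, y_ho = train_test_split(X_train, y_train, random_state=0)
    bases = [Ridge(), DecisionTreeRegressor(max_depth=5), KNeighborsRegressor()]
    preds_ho = np.column_stack([m.fit(X_tt, y_tt).predict(X_ho) for m in bases])
    # 3) meta-regression: a linear combination of the base predictions
    meta = LinearRegression().fit(preds_ho, y_ho)
    # 4) evaluate (and tune hyperparameters) against the held-out test set
    preds_test = np.column_stack([m.predict(X_test) for m in bases])
    print(meta.score(preds_test, y_test))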
Weka and various ML tools require you to select the algorithm and do the A/B testing on your own.
There's an opportunity for an Optimizely of ML.
You will find a model that looks good on your data. It will not be the model you should use.
The model I've consistently chosen (Decision Trees) may not be the best model. Need to get pushed outside my comfort zone.
I could put in the months/years like a proper Data Scientist and optimize the model. Or let a magic API tell me the best model. I'm lazy, so I prefer the latter ;-)
There are many ways to evaluate all of these methods, and for classification you may favor something else, but it's completely reasonable to use the (cross validated, or not) empirical risk for both logistic and linear regression. That would be a negative log likelihood in both cases, from the Bernoulli/binomial distribution for logistic regression or the normal distribution for linear regression.
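A sketch of that common yardstick (toy arrays; hand-rolled for clarity):

    import numpy as np

    def bernoulli_nll(y, p):
        # mean negative log likelihood for 0/1 outcomes, i.e. log loss
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def gaussian_nll(y, mu, sigma):
        # mean negative log likelihood under N(mu, sigma^2) errors
        return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                       + (y - mu) ** 2 / (2 * sigma**2))

    # compare models by computing these on held-out (cross-validated) data
    print(bernoulli_nll(np.array([0, 1, 1]), np.array([0.2, 0.7, 0.9])))
    print(gaussian_nll(np.array([1.0, 2.0]), np.array([1.1, 1.8]), sigma=0.5))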
“model” isn’t a simple word.
> It is to be kept in mind that the coefficients which we get in quantile regression for a particular quantile should differ significantly from those we obtain from linear regression. If it is not so then our usage of quantile regression isn't justifiable. This can be done by observing the confidence intervals of regression coefficients of the estimates obtained from both the regressions.
I thought the article was very good.
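To make the quoted check concrete, a small statsmodels sketch (toy heteroscedastic data, where the 0.9 quantile really does have a steeper slope than the mean):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=500)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x + 0.1, size=500)
    X = sm.add_constant(x)

    ols = sm.OLS(y, X).fit()
    q90 = sm.QuantReg(y, X).fit(q=0.9)
    # compare slope estimates and their confidence intervals
    print(ols.params[1], ols.conf_int()[1])
    print(q90.params[1], q90.conf_int()[1])
    # heavily overlapping intervals would suggest the quantile fit
    # adds little over the plain linear regression here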