Types of Regression Analysis (listendata.com)
319 points by _kcxz 10 months ago | 66 comments

Despite what the article claims, normality is not actually an assumption of linear regression. It is "required" for doing F-tests (the F-distribution being related to the normal distribution), but it is not required for proving that the regression coefficients are consistent.

It's actually not even required for that! See http://davegiles.blogspot.com/2011/08/being-normal-is-option... which cites King (1980):

> If the error vector in our regression model follows any distribution in the family of Elliptically Symmetric distributions, then any test statistic that is scale-invariant has the same null and alternative distributions as they have when the errors are normally distributed.

Note also that any distributional assumptions are really only necessary for inference (i.e., tests and confidence intervals) in finite samples (read: small samples); the central limit theorem guarantees the tests work asymptotically, so you're usually going to be fine.

Most of the attention paid to distributional assumptions in regression is wasted, and would be better spent on really thinking through the assumed moment conditions underlying the estimator.

> Assumptions of linear regression: There must be a linear relation between independent and dependent variables.

That's not wrong, but it's a strong way to word it. If linear regression were only suitable when the variables were perfectly linearly related, it would get a lot less use. Practically, linear regression can be used when the relationship is linear-ish, at least in the interval of interest. In other words, you can choose to declare linearity as an assumption (and take responsibility for what that choice entails, and for the error it might introduce into your analysis).

That the linear model is "correct" is only an assumption if you're trying to draw probabilistic inferences.

There's nothing stopping you from using it as a "best fit line", even when you have no reason to believe those assumptions. But then it's just a best-fit line. It tells you the direction and magnitude of linear trend, nothing more. That's never wrong in any sense, it's just that sometimes it's not very useful.

From the semiparametric perspective, you can still make correct inferences about the estimated parameters even if the model is not correctly specified, as long as you use the so-called robust estimator of the variance.

This is impossible. If the model is incorrectly specified (does not include all and only the relevant parameters and interactions), it doesn't matter much what games are played with the math. Changing the model will change the estimates...

Edit: For example, see here where making arbitrary choices of how to code categorical variables will change the estimates: https://news.ycombinator.com/item?id=16719754

If you change the model the meaning of all the coefficients changes.

Oops I did not see your response until now.

I agree, changing the model changes the estimates, because the parameters you are estimating change.

However, given one misspecified model, the parameters of that model are still well defined, though they may not have the interpretation they would if the model was correctly specified. As OP called it, this is the "best fit line", and is a projection of the truth onto your model. E.g. for a simple linear regression of Y on X, where the true conditional mean of Y given X is not linear, there is still some "true" best line. This line depends also on the distribution of X, though it would not if the model was correct. Estimates from linear regression will converge to the parameters of this line, though using the usual standard errors will be wrong.
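To make the "best fit line depends on the distribution of X" point concrete, here is a minimal sketch, assuming a quadratic truth and two evenly spaced grids standing in for two different X distributions. The function names and ranges are my own choices for illustration, not anything from the article.

```python
import numpy as np

def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

def truth(x):
    """Hypothetical true conditional mean: quadratic, so a line is misspecified."""
    return x ** 2

# Two designs for X (assumption: evenly spaced grids as stand-ins for uniform X).
x_narrow = np.linspace(0.0, 1.0, 1001)
x_wide = np.linspace(0.0, 2.0, 1001)

_, slope_narrow = ols_line(x_narrow, truth(x_narrow))
_, slope_wide = ols_line(x_wide, truth(x_wide))

# Same truth, same misspecified model, different "true" best line:
# the slope is about 1 on [0, 1] and about 2 on [0, 2].
print(slope_narrow, slope_wide)
```

In this setup the slope equals the average of the derivative 2x over the X range, so changing where X lives changes the target parameter even though the data-generating function is unchanged.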

There's a very general theorem or corollary that covers this in Asymptotic Statistics by van der Vaart. I think in the chapter about M estimators, right around where MLEs are covered, but I don't have it in front of me.

There are multiple inference levels here.

First, there is the statistical level, at which we are drawing some conclusion about the model parameter. This may work even for a misspecified model.

Then there is the level at which you want to draw some conclusion about reality, call it the "scientific level". If the model is misspecified, the parameters/coefficients may or may not correspond to the thing of interest. Perhaps the model is a close enough approximation for those values to be meaningful, perhaps not...

I think it is the second ("scientific level") of inference that most people are concerned about. The rigor of the proofs/theorems that may work at the statistical level does not extend to the scientific level.

Afaict, the majority of erroneous inference occurs at the scientific level and statistical error/uncertainty is a sort of minimum error/uncertainty.

Yes, well put!

It is actually wrong. The assumption is that y is a linear combination of the covariates in X. You can run regressions like y = x + x^2 (i.e. you permit a quadratic relationship) just fine.

It's not wrong, it's just a way of looking at things that speaks to the underlying math rather than the full extent of what you can do with it if you extend it with things like kernel methods.

When you use linear regression to fit a model like

  y ~ ax + b(x^2)
what you're technically doing is fitting a linear function with two parameters on two variables. One variable happens to always be equal to the square of the other variable, but, for the purpose of how the model is usually going to be fit, it is still using the same old analytical method that's based in linear algebra.
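A minimal sketch of that idea, assuming noise-free data and made-up coefficients: the design matrix just gets a second column holding x^2, and ordinary least squares does the rest.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 50)
a_true, b_true = 1.5, -0.7           # hypothetical coefficients for illustration
y = a_true * x + b_true * x ** 2     # noise-free so the recovery is exact

# Two columns, x and x^2. The model is nonlinear in x but linear in (a, b).
X = np.column_stack([x, x ** 2])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

print(a_hat, b_hat)
```

The solver never knows the second column is the square of the first; it only sees two regressors.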

Fair enough. Mechanically, all you're ever doing when estimating a parameter vector using OLS is projecting Y onto the span of X, and that requires linearity in the sense that Y = XB. But far too often I've met people who've come away thinking OLS is useless because they mistake the linearity in parameters for 'y must be a linear function of x', which they think is too simplistic, and so they go do more complicated methods when OLS would have been just fine as long as they used polynomials and/or interaction terms.

To me, that's a stellar example of why you probably shouldn't have people who don't even have a basic undergraduate "intro to stats" understanding of the subject doing your statistics work.

I get that it's a potential cause of confusion for someone who has no training in stats. But it's also jargon that describes a useful concept, and that is literally transparent if you do have enough understanding of the math to know what "linear" and "parameters" mean in this context.

This 'assumption' always bothered me when studying for DS roles because it's something that you're expected to know if asked, but isn't really true/accurate. Another is the non-collinearity assumption between variates, which is violated all the time in ML tasks but an 'assumption' of the model nonetheless.

In general, it's helpful for me to separate assumptions that are characteristics of the generative model (and possibly an inference procedure used with it), and "assumptions" meaning things that could lead to poor out of sample prediction.

You can say a lot about what linear regression gets you when you fit a line to nonlinear data. It's a weighted average of the derivative. You don't need a linearity assumption.

A tool that I've found myself reaching for more and more often is Gaussian Process Regression [1] [2]

* It allows you to model essentially arbitrary functions. The main model assumption is your choice of kernel, which defines the local correlation between nearby points.

* You can draw samples from the distribution of all possible functions that fit your data.

* You can quantify which regions of the function you have more or less certainty about.

* Imagine this situation: you want to discover the functional relationship between the inputs and outputs of a long-running process. You can test any input you want, but it's not practical to exhaustively grid-search the input space. A Gaussian Process model can tell you which inputs to test next so as to gain the most information, which makes it perfect for optimising complex simulations. Used in this way, it's one means of implementing "Bayesian Optimisation" [3]

[1] https://en.wikipedia.org/wiki/Gaussian_process

[2] http://scikit-learn.org/stable/modules/generated/sklearn.gau...

[3] https://en.wikipedia.org/wiki/Bayesian_optimization
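For anyone who wants to see the moving parts, here is a rough from-scratch sketch in NumPy rather than the scikit-learn class linked above. It assumes a zero-mean GP with a squared-exponential kernel and a tiny noise term; the "test next where the posterior is most uncertain" rule is my own simple stand-in for a proper acquisition function, not a full Bayesian optimisation.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, length_scale)
    K_ss = rbf_kernel(x_test, x_test, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Hypothetical observations of an unknown function (here sin, for illustration).
x_train = np.array([-2.0, -1.0, 0.0, 1.0])
y_train = np.sin(x_train)

# Candidates: at a data point, between data points, and far from all data.
x_test = np.array([0.0, 0.5, 4.0])
mean, std = gp_posterior(x_train, y_train, x_test)

# Simple heuristic (an assumption): evaluate next where uncertainty is largest.
next_input = x_test[np.argmax(std)]
print(mean, std, next_input)
```

The posterior standard deviation is near zero at observed inputs and close to the prior value of 1 far from the data, which is the "which regions am I certain about" property from the list above. The length scale is exactly the kind of kernel choice that has to be tuned.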

When I tried this to choose xgboost hyperparameters it didn't seem to perform much better than random search while also adding another layer of hyper-hyper-parameters.

Yeah. The hyperparameter story that comes with Gaussian processes is a big drawback. The choice of kernel has a massive impact.

In practice, I've found GPs to be great for getting actual insight into an unknown function, but much less useful as a black-box learner.

What kernels would you recommend trying initially? I’m still unclear if the Gaussian processes require normal distribution (e.g. would they work on log-log / binomial based functions).

I’ve wanted to apply the approach you mention a few times, but documentation seems to go from “Wiki” level to novel research articles. Are there any good introductory books / resources that aren’t beginner level? That scikit library looks handy!


I guess at its root the problem may just be how much compute is available to throw at the optimization. Alternatively there could be more efficient algos... I looked into but never fully tested this, it seemed promising: https://news.ycombinator.com/item?id=16241659

Friends don't let friends use MS Word to produce equation screenshots. Not in the age of MathJax.

Now this is a topic I desperately need. Can anyone here by any chance explain why would one choose predictors in multilinear regression that are NOT correlated to the target? I am having trouble understanding paper [1] where authors avoid using predictors that are correlated to target. Target is ozone concentration shown by referent instrument and predictors are low cost sensor outputs.

[1] https://www.sciencedirect.com/science/article/pii/S092540051... Section 4.1 about ozone predictors

The issue is intra-predictor correlation. In the extreme case that a predictor is duplicated, the correct beta might be {beta*a, beta*(1-a)} for a in [0, 1], which an algorithm may not estimate in a stable manner. A significant degree of correlation introduces this general problem.

... or worse; it is still true for any a. You could easily get {1,000,001, -1,000,000}, which for perfectly clean, precise, representable data is equivalent, but which magnifies any noise/error in one of the predictors by a million. or a billion.
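A small sketch of the duplicated-predictor case, with made-up numbers: two very different weight vectors give the same fit on clean data, but the extreme one blows up a tiny error in one copy of the predictor.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([x, x])          # the same predictor, twice
y = 1.0 * x                          # hypothetical true beta = 1

# X'X has rank 1 instead of 2, so the normal equations have no unique solution.
rank = np.linalg.matrix_rank(X.T @ X)

w_balanced = np.array([0.5, 0.5])
w_extreme = np.array([1_000_001.0, -1_000_000.0])
same_fit = np.allclose(X @ w_balanced, X @ w_extreme)

# Now perturb one copy slightly (assumed measurement error of 1e-3).
X_noisy = X.copy()
X_noisy[:, 1] += 1e-3
err_balanced = np.max(np.abs(X_noisy @ w_balanced - y))
err_extreme = np.max(np.abs(X_noisy @ w_extreme - y))

print(rank, same_fit, err_balanced, err_extreme)
```

With the balanced weights the 1e-3 error is halved; with the extreme weights it is multiplied by a million.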

So say you have 3 predictors that have high intra-predictor correlation. Can you still pick one of them, and discard the remaining 2? Or can't you pick any one of them?

Using ridge regression (mentioned in TFA) would prefer a (1/3,1/3,1/3) average of those predictors (or a better combination, depending on their respective noises).

Using lasso (also mentioned in TFA) would prefer to pick the best of the three and drop the others.

Using elastic net would be a combination of both.

Note, though, that any method other than simple regression has tuning parameters -- depending on those, you could still end with result equivalent to plain least squares.
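The ridge part of this is easy to check with the closed-form solution. This is a sketch only, assuming three identical predictors, no intercept, and an arbitrary small penalty; lasso and elastic net need an iterative solver and are not shown.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution without intercept: (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

x = np.linspace(-1.0, 1.0, 30)
X = np.column_stack([x, x, x])   # three perfectly correlated predictors
y = x                            # hypothetical target: total effect of 1

w = ridge(X, y, lam=1e-3)        # lam is an arbitrary illustrative value
print(w)                         # roughly (1/3, 1/3, 1/3)
```

Plain least squares has no unique answer here; the penalty picks the solution that spreads the weight evenly. As the penalty shrinks toward zero you approach one particular least-squares solution, which is the tuning-parameter caveat above.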

You can, but why trash information that is present when you can leverage it with a different approach?

Like PCA? But that way you lose the physical meaning of the predictors.

PCA is a special case of factor analysis, so you are representing them as observations of a latent variable (which is often a narrative people use when explaining why two x variables are correlated)

When predictors are correlated with each other you get multicollinearity potentially leading to incorrect statistical inferences.

Thanks for the answer. And what is the correct approach here, if you can only choose/not choose a predictor in the final set? Discard all multicollinear predictors or pick just one of them?

Keeping just to linear regression. If those variables are measuring the same construct, pick the best one or use a method to combine their scores. If they measure different constructs but are very correlated, then you'd need to drop one..depending on the variance inflation factor...which you can test for.

As the article mentions however, there are regression methods meant for these situations (e.g. ridge regression).

One thing that should be mentioned though is in the case of polynomials e.g. y ~ x + x^2, there will be a lot of multicollinearity between these terms, but that multicollinearity is OK...just be sure to center your variables.
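A quick sketch of why centering helps with polynomial terms, assuming x sits on an evenly spaced grid far from zero (the range is made up):

```python
import numpy as np

x = np.linspace(10.0, 20.0, 101)

# Raw x and x^2 are almost perfectly correlated on this range.
raw_corr = np.corrcoef(x, x ** 2)[0, 1]

# After centering, the linear and squared terms are uncorrelated here.
xc = x - x.mean()
centered_corr = np.corrcoef(xc, xc ** 2)[0, 1]

print(raw_corr, centered_corr)
```

The correlation drops to essentially zero only because this grid is symmetric around its mean; with skewed data centering reduces the correlation but does not remove it.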


Wrong, wrong, wrong, wrong.

If predictors are linearly dependent you don't get to do regression at all -- your X'X is singular. But then, the extra regressors add no information at all, and classical statistical packages (SPSS, Stata, etc.) drop them automatically.

Even if predictors are highly correlated, the OLS estimator is unbiased. This is the stuff of elementary statistics. You just get higher and higher p-values/wider and wider CIs, especially if your samples are econometrics-sized.
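The effect on interval width can be seen from the classical variance formula sigma^2 (X'X)^-1 alone. A sketch, assuming two simulated standard-normal predictors, no intercept, and a made-up correlation of 0.99:

```python
import numpy as np

def coef_std_errors(X, sigma=1.0):
    """Classical OLS standard errors: sqrt of the diagonal of sigma^2 (X'X)^-1."""
    return sigma * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))

rng = np.random.default_rng(0)
n = 500
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)

def design(rho):
    """Two predictors with population correlation rho (no intercept)."""
    x2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * z2
    return np.column_stack([z1, x2])

se_uncorrelated = coef_std_errors(design(0.0))
se_correlated = coef_std_errors(design(0.99))

# Ratio of standard errors; the square of this is the variance inflation factor.
inflation = se_correlated / se_uncorrelated
print(inflation)
```

The population variance inflation factor is 1 / (1 - rho^2), about 50 at rho = 0.99, so the standard errors (and the confidence intervals) come out roughly seven times wider while the estimator itself stays unbiased.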


You people need to watch some Khan Academy or whatever the cool kids are doing now to learn maths.

There is no need to be rude or yell.

Yes, if your variables are perfectly linearly dependent they get dropped. Did anyone say otherwise? I did not think about this case because most correlated measures causing multicollinearity problems aren't perfectly 'linearly dependent'. Linear dependency usually only comes up practically if you miscoded some of your independent dummy variables (e.g. adding both 'male[0,1]' and 'not male[0,1]' as two categorical predictors). So I am not really sure of your point.

As to your second point, it might be unbiased but the statistical inference (i.e. p-value) would be incorrect with multi-collinearity..thus again, I am not sure of your point when you are only repeating what I said.

Moreover, it may not be particularly meaningful to the researcher even if the parameter estimate is unbiased. One frequently finds with multicollinearity that the signs of effects will switch (- to +, or + to -) as you add highly correlated predictors into a model, often in theoretically questionable ways. That does serve to remind one that the parameter estimates are only meaningful in the context of the other predictors in the model.

There's this other thing called the FWL theorem.

As long as the unexplained term is uncorrelated (in the probabilistic model; linear regression will force this to be the case computationally) with the included variables, your coefficients will remain unchanged. So adding/removing variables shouldn't change results at all -- unless the model is mis-specified and you're including variables that correlate with unobserved factors in unexpected ways.

So for example a regression of children's IQ on the income of their parents provides a plausible mechanism; but if you add the arm length of the kids you will have problems, since arm length is correlated to an omitted variable (kids with longer arms are older and perform better on IQ tests).

That's most of the "in context" story. Nothing to do with multicollinearity.
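For reference, the FWL (Frisch-Waugh-Lovell) result is easy to check numerically. This is a sketch on simulated data with made-up coefficients: the coefficient on x1 in the full regression equals the one from regressing residualized y on residualized x1, while dropping a correlated regressor shifts it.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 200
x2 = rng.standard_normal(n)
x1 = 0.6 * x2 + rng.standard_normal(n)             # x1 is correlated with x2
y = 2.0 * x1 - 1.0 * x2 + rng.standard_normal(n)   # hypothetical true model

ones = np.ones(n)
full = ols(np.column_stack([ones, x1, x2]), y)

# FWL: partial the intercept and x2 out of both y and x1, then regress.
Z = np.column_stack([ones, x2])
y_resid = y - Z @ ols(Z, y)
x1_resid = x1 - Z @ ols(Z, x1)
partial = ols(x1_resid[:, None], y_resid)

# Omitting x2 entirely: the x1 coefficient absorbs part of x2's effect.
short = ols(np.column_stack([ones, x1]), y)

print(full[1], partial[0], short[1])
```

The first two numbers agree to floating-point precision. The third differs because x2 is correlated with x1 and has been pushed into the error term, which is the arm-length situation in miniature.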

Thanks for the thoughtful comment and reference.

The 'in context' was not so much about multicollinearity but about shared and unique variance.

This article is obviously a jumping off point kind of article. Most people using linear regression have never even heard of things like ridge regression. So I like the article.

However, there are at least two types of regression I'd add to the list, and a suggestion:

1. Multivariate Distance Matrix Regression (MDMR; Anderson, 2001; McArdle & Anderson, 2001).

2. Regression with splines

3. On polynomial regression, add mention of orthogonal polynomials.

There's also hierarchical regression, where you can estimate multilevel models. Also called fixed and random effect models. Assigning a variance coefficient for each parameter can account for heteroscedasticity.

Mixed models..the hell (+_+) I live in =_=

Important for many real use cases, such as estimating demand from prices: instrumental variable regression.

Why did the article cover a basic term like "outlier" under "Terminologies related to regression" but omitted information about how to evaluate a regression model? I liked that there was some information at the bottom about "How to choose a regression model" that mentioned "you can select the final model based on Adjusted r-square, RMSE, AIC and BIC" but providing a little more context would make this post even better. Perhaps a link to a future blog post on the topic?

Are there any ML APIs or web services that accept a vector and run various regression scenarios to identify optimal fit?

I suppose vectors for both training and testing would be required.

Would gladly pay $1-$5 per batch for a service to do this.

I have a magic regression aggregator that works like this:

1) Take a dataset and split into training and test

2) Using the training set: run a bunch of different regressors (for a training-training subset) and get predictions (for the remaining test-training subset)

3) Run a higher-level regression against test-training subset predictions. I use either plain linear regression (so my meta-regressor is a linear combination of the regressors) or K-nearest neighbors (so the best regressor for each region of feature space is chosen).

4) If there are hyperparameters, optimize against the test set (not the test-training subset).

It's not available as an API. I'm available for consulting though.
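Steps 1 to 3 above can be sketched in a few lines. This is my own toy version, not the commenter's actual aggregator: it assumes simulated 1-D data, polynomial fits of degree 1, 3 and 5 as the "bunch of different regressors", and plain linear regression as the meta-regressor. Step 4 (hyperparameter search) is left out.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, 300)
y = np.sin(2.0 * x) + 0.5 * x + 0.1 * rng.standard_normal(300)  # made-up data

# 1) Split into training and test.
x_train, y_train = x[:200], y[:200]
x_test, y_test = x[200:], y[200:]

# 2) Split training again; fit base regressors on the training-training subset.
x_tt, y_tt = x_train[:120], y_train[:120]
x_ht, y_ht = x_train[120:], y_train[120:]   # the "test-training" subset
degrees = [1, 3, 5]                          # hypothetical base models
base_models = [np.polyfit(x_tt, y_tt, d) for d in degrees]

def base_predictions(x_new):
    """One column of predictions per base regressor."""
    return np.column_stack([np.polyval(c, x_new) for c in base_models])

def with_intercept(P):
    return np.column_stack([np.ones(len(P)), P])

# 3) Meta-regressor: linear combination of the base predictions,
#    fitted on the held-out test-training subset.
meta_coef, *_ = np.linalg.lstsq(
    with_intercept(base_predictions(x_ht)), y_ht, rcond=None
)

def stacked_predict(x_new):
    return with_intercept(base_predictions(x_new)) @ meta_coef

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

test_rmse_stacked = rmse(stacked_predict(x_test), y_test)
test_rmse_linear = rmse(np.polyval(base_models[0], x_test), y_test)
print(test_rmse_stacked, test_rmse_linear)
```

Fitting the meta-regressor on predictions for data the base models never saw is the point of the inner split; fitting it on in-sample predictions would reward whichever base model overfits most.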

There is a Python library called TPOT that does this.


I think that's DataRobot's business model, although I think they run more sophisticated models as well. It was 5+ years ago that I spoke to them but IIRC they were able to compete pretty well in Kaggle competitions with a fairly hands-off algorithm.

Could you perhaps point me to a Kaggle competition where they perform well with a hands-off approach?

I'm afraid I can't. Take it with a grain of salt, I only mention it because it was the anecdote that stood out in my memory for 5+ years :)

I'm working on an MLaaS service now and I'd love to add that feature. That said, I'd like to learn a little more about exactly how you envision the use case working. To that end, if you can spare some time to chat sometime, would you drop me a line (prhodes@fogbeam.com)?

Why not just use regression trees, eg xgboost? The parameters aren't going to mean anything anyway.

Why wouldn't you just run weka or something locally?

I just want to provide the data and let a service decide the best algorithm.

Weka and various ML tools require you select the algorithm and do the A/B testing on your own.

There's an opportunity for an Optimizely of ML.

Model fishing is bad.

You will find a model that looks good on your data. It will not be the model you should use.

"If all you have is a hammer, everything looks like a nail"

The model I've consistently chosen (Decision Trees) may not be the best model. Need to get pushed outside my comfort zone.

I could put in the months/years like a proper Data Scientist and optimize the model. Or let a magic API tell me the best model. I'm lazy, so I prefer the latter ;-)

That's what 90% of data scientists do

Logistic regression is doing classification not regression. That is, it's assigning/predicting categories of data points instead of predicting some continuous value on some interval. Maybe this is splitting hairs but the way you evaluate a classification model is totally different than a regression one.

This is not correct. Logistic regression can be used for classification, true, but it can also be viewed as a way of estimating the conditional mean of an outcome variable that has a Bernoulli, or binomial distribution, depending on the formulation.

There are many ways to evaluate all of these methods, and for classification you may favor something else, but it's completely reasonable to use the (cross validated, or not) empirical risk for both logistic and linear regression. That would be a negative log likelihood in both cases, from the Bernoulli/binomial distribution for logistic regression or the normal distribution for linear regression.
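A sketch of the "conditional mean plus negative log likelihood" view, assuming simulated Bernoulli data, made-up true weights, and plain gradient descent instead of a library solver:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(y, p):
    """Mean negative log-likelihood of 0/1 outcomes y under probabilities p."""
    eps = 1e-12
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

rng = np.random.default_rng(3)
n = 1000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
true_w = np.array([-0.5, 2.0])                  # hypothetical true weights
y = (rng.uniform(size=n) < sigmoid(X @ true_w)).astype(float)

# Fit by gradient descent on the mean Bernoulli NLL (step size is an assumption).
w = np.zeros(2)
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y) / n
    w -= 0.5 * grad

p_hat = sigmoid(X @ w)                          # estimated E[y | x], a value in (0, 1)
nll_start = bernoulli_nll(y, np.full(n, 0.5))   # risk of always predicting 0.5
nll_fit = bernoulli_nll(y, p_hat)

# Classification is a separate, optional step: threshold the estimated mean.
labels = (p_hat >= 0.5).astype(int)
print(w, nll_start, nll_fit)
```

The fitted quantity is a probability, evaluated with the same kind of likelihood-based risk you would use for linear regression under a normal model. The hard labels only appear once you choose a threshold.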

This is a question of perspective, you can in fact just take it as a regression over the continuous 0-1 interval with loss |y-o(Xw)| where o is the sigmoid and could report that loss (in fact a package such as Sklearn will usually return the parameter values that minimise that loss, possibly with a penalty). If you want to use it as a classifier then you threshold the predictions.

Don’t forget to put RANSAC on your list: https://en.m.wikipedia.org/wiki/Random_sample_consensus

I was hoping for one interesting chart per regression analysis type. That didn't happen, and I felt lost at sea. Please improve the post on such an amazing topic.

> In simple words, regression analysis is used to model the relationship between a dependent variable and one or more independent variables.

“model” isn’t a simple word.

This is just horrible quality material. What in the heck is this?

> It is to be kept in mind that the coefficients which we get in quantile regression for a particular quantile should differ significantly from those we obtain from linear regression. If it is not so then our usage of quantile regression isn't justifiable. This can be done by observing the confidence intervals of regression coefficients of the estimates obtained from both the regressions.

I'm a typical "math is hard; let's go programming" type of person, but the only problem I have with that quoted section is the missing antecedent of "This can be done...". But I worked it out from context.

I thought the article was very good.
