
Why I’m Not a Fan of R-Squared - sndean
http://www.johnmyleswhite.com/notebook/2016/07/23/why-im-not-a-fan-of-r-squared/
======
mattb314
"As such, R^2 answers the question: 'does my model perform better than a
constant model?' But we often would like to answer a very different question:
'does my model perform worse than the true model?' "

Maybe I'm over-generalizing here, but I think this fundamental assumption is
untrue for most people who use statistical models to solve actual problems.
All of the "undesirable" behavior of the R^2 metric makes complete sense when
you view it as a comparison with the most naive model (the constant model),
and while R^2 certainly doesn't measure how close to "true" a model is, it
very well captures the utility of using a more sophisticated model over an
extremely simple one, which (I believe) is a critical question.

For example, if you had to predict a process where measurement noise overwhelms
variation in the process itself, as in his first example of log(x) from .99 to
1.0, then the constant model is pretty much the best you can do, and both
log(x) itself and the linear model offer little additional benefit. Getting
low R^2 values for those two models makes total sense--they offer no marginal
benefit. In fact, if you had to make the decision "should I use log(x) or the
constant model?", you're often better off going with the constant model for
simplicity and predictability (unless you have domain knowledge motivating a
different choice).
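
For illustration, a rough sketch of that situation (Python; my own toy
numbers, not the article's exact setup):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.99, 1.0, 1000)
    y = np.log(x) + rng.normal(0.0, 0.1, 1000)  # noise swamps the signal

    # R^2 = 1 - SSE(model) / SSE(constant mean model)
    sse = lambda y_hat: np.sum((y - y_hat) ** 2)
    sst = sse(np.full_like(y, y.mean()))
    print(1 - sse(np.log(x)) / sst)  # even the true model scores near 0

Here the true model earns an R^2 near zero not because it is wrong, but
because it barely improves on the constant model.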

I like well thought-out articles like this one on statistical concepts because
they make me think about things I often take for granted, and while treating
R^2 like a better version of RMSE or MAD is clearly wrong, it often better
captures things people actually care about by taking the difficulty of the
problem into account. If you're doing advanced statistical modeling, it's easy
to underestimate how common it is for a beginner to celebrate getting (say)
96% classification accuracy on a 2-class problem where one class makes up more
than 95% of the samples--an issue that using R^2 can quickly reveal.

tldr: Awesome article but (imo) R^2 is more useful than most other metrics,
not less

------
Noseshine

> “does my model perform worse than the true model?”

What is "true model"? I can't make head nor tail of that term. I've never
heard this before, nor does it make sense to me when I take just the word
meaning.

~~~
nabla9
The true model is the probability distribution that generates the observed
data.

~~~
DiabloD3
In other words, what was actually observed?

~~~
nabla9
No. "true model", "true distribution" and "true population" is what generates
the data.

~~~
jsprogrammer
Which is a misnomer, because the data probably isn't generated by a model.

~~~
tomrod
Semantics at this point. Data-generating process is a term that's also used. A
model seeks to mimic or match the real process to a reasonable approximation.
Hence "true model" as a scoring engine or data relationship that represents
reality completely.

~~~
jsprogrammer
Except that we can't access reality completely. Assuming that there is a "true
model" generating data is just that: an assumption with no real basis.

Semantics is the essence of communication.

~~~
pessimizer
Unless you have data, that is. The data is the basis for assuming that a
process has generated data. Either that, or the data has existed for all
eternity, and therefore could never have been collected.

~~~
jsprogrammer
"data" is your own perception, however. Can you give an example of a data
generating process?

~~~
tomrod
We can create them!

Suppose I take the function y = log(x) and add random white noise. The
function log() and the parameters on the random white noise process are the
data generating process. We could then fit a model y = \beta X + \epsilon, and
then compare the "true" (first) model to our second model. When the natural
world generates our data, the idea behind all this is the same: there is a
process which generates the data, and the data reveals information about that
process to an approximate degree.
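
A minimal sketch of exactly this setup (Python; the range and noise level are
arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(42)

    # Data generating process: y = log(x) + white noise
    x = np.linspace(1.0, 10.0, 200)
    y = np.log(x) + rng.normal(0.0, 0.1, x.size)

    # A (misspecified) model fit to the data: y = beta0 + beta1 * x
    beta1, beta0 = np.polyfit(x, y, 1)
    y_hat = beta0 + beta1 * x

Comparing y_hat against log(x) is comparing our fitted model to the "true"
one.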

Some reads:

[1]
[http://www.rimini.unibo.it/fanelli/econometric_models2_2012.pdf](http://www.rimini.unibo.it/fanelli/econometric_models2_2012.pdf)

[2]
[https://en.wikipedia.org/wiki/Data_generating_process](https://en.wikipedia.org/wiki/Data_generating_process)

~~~
jsprogrammer
Can you give a non-synthetic, ie. natural, example data generation process?

Edit: I don't know if people don't like the grammar, or what?

How about this:

Can you give a non-synthetic, ie. natural, example of a data generating
process?

~~~
tomrod
Sure! Most of the equations you find in your nearest physics or chemistry
books are validated experimentally/empirically. The validation comes from
better and better approximation of the data generating process.

------
scottfr
I'm a huge fan of R^2 and you should be too.

The simple way to think about R^2 is that it is a measure of relative
predictive accuracy (and that is exactly how it is calculated). This is both a
more accurate and a more useful definition, for most tasks the HN crowd would
work on, than saying that R^2 is a measure of the distance from the true
model.
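
To spell out that calculation, a minimal sketch (taking the constant model to
be the one that always predicts the mean of y):

    import numpy as np

    def r_squared(y, y_hat):
        # R^2 as relative predictive accuracy: one minus the model's MSE
        # relative to the MSE of always predicting mean(y).
        mse_model = np.mean((y - y_hat) ** 2)
        mse_const = np.mean((y - np.mean(y)) ** 2)
        return 1 - mse_model / mse_const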

All the figures and findings in the post are completely reasonable given that
definition (the linear model performing better in the first case is due to the
high variance of the author's data; other samplings would lead to the reverse
result).

~~~
mrandrewandrade
I think what John is saying is that people commonly use R^2 as a measure of
model fit where something like root mean squared error (RMSE) gives a better
measure of model fit (by measuring the distance from the true model) depending
on the model. Just using R^2 blindly for most tasks you would work on can lead
to choosing an incorrect model.

I think the main take-away from the post is to better understand the correct
measure of model fit for your specific data. For example, if you are
forecasting a time series with stationary demand, mean absolute deviation
(MAD) might be the best measure of model fit, but in the case where there is
seasonality with a trend, the RMSE would be a better measure [1].

[1]
[http://robjhyndman.com/papers/foresight.pdf](http://robjhyndman.com/papers/foresight.pdf)

~~~
robzyb
> I think what John is saying is that people commonly use R^2 as a measure of
> model fit where something like root mean squared error (RMSE) gives a better
> measure of model fit (by measuring the distance from the true model)

I don't mean to be rude, but that is definitely not what he is saying. There
are two important things I'd like to clarify:

- It is wrong to call the alternative measure "the RMSE". The alternative that
the article was proposing was a made-up measure called E^2 which measures the
distance from the true model.

- The author is not suggesting that people use E^2 instead of R^2 in any case.
In fact, in almost all cases it is impossible to use E^2, because calculating
it requires you to know the true model, and if you knew the true model it
would be very unlikely that you'd be wasting your time measuring other models.

The author makes it clear that E^2 isn't really to be considered an
alternative when he calls it the "generally unmeasurable E^2".

~~~
mrandrewandrade
I think we are saying the same thing in different words, and I might be
confusing "an alternative" with "a comparison". John compares R^2 with E^2,
but RMSE can be considered an alternative to using R^2 in certain cases.

If you go back to the first line:

> People sometimes use R^2 as their preferred measure of model fit.

I think the post is going over why R^2 is not recommended: R^2 is not only a
measure of the error, but also includes a comparison with a constant model.
John defines E^2 as a comparison metric which measures how much worse the
errors are than if you used the true model.

Going back to a metric for determining model fit, RMSE/MSE/MAD are all
alternative measures of model fit and are useful depending on the dataset.

------
sixbrx
An interesting tangential argument I had with a colleague recently was how to
actually calculate R^2 when testing model predictions on new data (emphasis on
new data).

He claimed the conventional way is to take the square of the correlation
coefficient of the two lists of values, whereas I suggested 1 - SSE/SST, in
accord with references I'd read (well, Wikipedia); the latter method yields
lower values for his data. I know the two are equal on training data when
using linear regression with a constant term, but in general they differ. The
correlation approach is scale- and shift-invariant, which to me disqualifies
it as a measure of performance on _new data_.
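
The difference is easy to demonstrate on simulated holdout data (my sketch;
the gap shows up whenever the predictions are biased or mis-scaled, because
correlation silently forgives any linear rescaling of the predictions):

    import numpy as np

    rng = np.random.default_rng(1)
    y_new = rng.normal(0.0, 1.0, 500)
    y_pred = 0.5 * y_new + 2.0 + rng.normal(0.0, 0.1, 500)  # biased, mis-scaled

    # 1 - SSE/SST penalizes the bias and the wrong scale
    r2_sse = 1 - np.sum((y_new - y_pred) ** 2) / np.sum((y_new - y_new.mean()) ** 2)

    # squared correlation is shift- and scale-invariant, so it looks great
    r2_corr = np.corrcoef(y_new, y_pred)[0, 1] ** 2

    print(r2_sse, r2_corr)  # strongly negative vs. roughly 0.96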

Unfortunately he was able to barrage me with many online references that just
say that "R^2 is the square of the correlation coefficient", which without
very careful reading of the context (fitting not predicting), and sometimes
even with such careful reading, makes his interpretation look correct. I find
the whole thing rather exasperating...

It also occurs to me that, purely as a question of convention, I may be wrong,
which wouldn't surprise me as I'm new to modeling.

So: do most modelers report correlation^2 as their R^2 values for holdout
tests? Have other modelers here encountered this confusion?

------
kgwgk
I find this very confusing, but I guess I'm not the intended target audience.
Not that I'm saying it's wrong, but I don't really see the point.

Do people really expect R^2 to measure the fit of the model to the true model?
R^2 measures the fit of the model to the data: i.e., how well the model
performs in predicting the outcomes. In his first example it is clear that all
the models are equally useless: the noise dominates and the predictive power
of the models is close to zero. In the second example the predictive power of
all the models has improved, because there is a clear trend. The true model
predicts much better than the others now, but each model predicts better than
in the previous example.

In the first example, he concludes: "Even though R^2 suggests our model is not
very good, E^2 tells us that our model is close to perfect over the range of
x."

Actually our model is "better than perfect". The R^2 for the linear model
(0.0073) and for the quadratic model (0.0084) are slightly better than for the
true model (0.0064). Of course this is not a problem specific to the R^2
measure (the MSE for the linear and quadratic fits is lower than for the true
generating function) and can be explained because the linear and quadratic
models overfit. E^2 is essentially the ratio of the 1-R^2 values (minus one).
We get -0.00083 and -0.00193 for the linear and quadratic models respectively
(the ratios before subtracting one are 0.9992 and 0.9981).

In the second example,"visual inspection makes it clear that the linear model
and quadratic models are both systematically inaccurate, but their values of
R^2 have gone up substantially: R^2=0.760 for the linear model and R^2=0.997
for the true model. In contrast, E^2=85.582 for the linear model, indicating
that this data set provides substantial evidence that the linear model is
worse than the true model."

The R^2 already indicates that the linear model (R^2=0.760) and the quadratic
model (R^2=0.898) are worse than the true model (R^2=0.997). The fractions of
unexplained variance are 0.240, 0.102 and 0.003 respectively, and it's clear
that the last one performs much better than the others before we take the
ratios and subtract one to calculate the E^2 values 85.6 and 35.7 for the
linear and quadratic models respectively.
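
Making the arithmetic explicit (my own check, using the rounded R^2 values
quoted above, so this only approximates the post's unrounded E^2 values of
85.6 and 35.7):

    r2_true = 0.997
    for r2_model in (0.760, 0.898):  # linear, quadratic
        e2 = (1 - r2_model) / (1 - r2_true) - 1
        print(e2)  # ~79 and ~33 from the rounded inputs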

(By the way: "we’ll work with an alternative R^2 calculation that ignores
corrections for the number of regressors in a model." That's not an
alternative R^2, that's the standard R^2. The adjusted R^2 that takes into
account the number of regressors is the alternative one.)

~~~
kgwgk
There is now a new definition for E^2 (one minus the ratio of the R^2 of the
model and the "true" model) which doesn't solve the most obvious issue:
getting negative values for a measure called "something squared". The values
of E^2 in the first example are now -0.13 for the linear model and -0.30 for
the quadratic model. In the second example, they are 0.24 and 0.10
respectively.
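
Checking the new definition against the quoted second-example values:

    r2_true = 0.997
    for r2_model in (0.760, 0.898):  # linear, quadratic
        print(1 - r2_model / r2_true)  # ~0.238 and ~0.099, i.e. 0.24 and 0.10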

The graphical representation is a bit misleading. Leaving aside the fact that
in the first example MSE_T is between MSE_M and MSE_C, this drawing makes E^2
and R^2 seem more complementary than they really are. E^2 is the length of the
blue bar as a fraction of the total length (blue+orange). R^2, however, is the
length of the orange bar as a fraction of the distance from the end of the bar
to the origin (not shown in the chart).

Edit: there is a new addition to the post, re-expressing E^2 in terms of a
mean/variance decomposition. It should be kept in mind that the derivation
presented is only asymptotically correct. In a small sample, the cross term
does not vanish and the variance of the observations around the "true" value
is not exactly sigma^2. In the second example, E^2 calculated using this new
definition is quite similar (0.2373 and 0.0991 for the linear and quadratic
models, compared to the previous values of 0.2382 and 0.0994). In the first
example, however, the values we get from the new definition are far from the
previous values: 0.0646 vs -0.129 for the linear model, 0.1528 vs -0.297 for
the quadratic model.

Edit2: changed "approximation" to "new definition", "good" to "similar" and
"exact" to "previous" in the previous paragraph. I'm not sure if he was
suggesting to use this formula to calculate E^2 instead of the previous one.
Anyway, it doesn't matter because this is not something that can be calculated
at all unless the "true model" is known.

------
graeham
Interesting article, and I find it relevant to some problems I'm working on at
the moment.

I would add a few challenges. The example is a bit of a strawman: a log(x)
function has unique properties that make the Xmax-Xmin vs. R^2 behavior work
out like that. In real data, rarely does a single-variable 'true model' fit as
well as in the example, either.

Context is needed as well: depending on the use of the model, a linear or
quadratic fit may be sufficient even for what is clearly a log dataset. The
fit really fails only for small values of x, maybe 5% of the range of total
values. For this case, a bilinear model could fit quite well for the lower 5%,
then the existing model for the upper 95%. It depends on the application. I
like this phrase:

"When deciding whether a model is useful, a high R2 can be undesirable and a
low R2 can be desirable."

Too often statistics are dominated by 'cutoff' values that people apply
blindly to all situations.

What do you think of robust regression methods, where obvious outliers are
down-weighted?

~~~
yummyfajitas
I'm not the author, but I'm a huge fan of robust regression. I make between
$500-2000/month off a trading strategy based on such a method. (The method is
basically Bayesian linear regression, but using an error model that has a
heavier tail than a Gaussian.)

But a really important thing when using such methods is the Lucas critique.
When you need to use robust regression you are definitely in a space where
all the simple and generic stuff (e.g. linear regression) doesn't work. So at
this point it's important to validate the underlying assumptions behind the
robust regression scheme.

E.g., in my trading strategy, I've gone to great lengths to make sure the tail
behavior I'm modelling is an overestimate of reality.

~~~
xoranth
Could you share more details about the robust regression you are using? All
the resources I could find online on robust regression point either to
Laplace-distributed residuals or to some capped loss function.
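
For reference, the down-weighting scheme mentioned upthread looks roughly like
this (my own sketch of iteratively reweighted least squares with Huber
weights; not the parent's actual method):

    import numpy as np

    def huber_irls(X, y, k=1.345, n_iter=50):
        # Robust linear regression via iteratively reweighted least squares.
        # Points with large standardized residuals get weight k/|u| < 1.
        X = np.column_stack([np.ones(len(y)), X])    # add intercept column
        beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from OLS
        for _ in range(n_iter):
            r = y - X @ beta
            s = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust scale (MAD)
            u = np.abs(r) / s
            w = np.minimum(1.0, k / np.maximum(u, 1e-12))  # Huber weights
            sw = np.sqrt(w)
            beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        return beta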

------
Rexxar
That seems like an advantage for R²: if R² goes down when you add more data,
you know your model doesn't fit the data any more.

------
tgb
Has the author offered an alternative? Namely, can E^2 be calculated in
practice?

~~~
a_bonobo
I don't think you can calculate E^2 without the "true model", which you
practically never have. The code uses the "true model" too:
[https://github.com/johnmyleswhite/r_squared/blob/master/utils.R#L63](https://github.com/johnmyleswhite/r_squared/blob/master/utils.R#L63)

I guess the post is similar to Anscombe's quartet [1]: a warning not to
blindly trust summary statistics.

[1]
[https://en.wikipedia.org/wiki/Anscombe%27s_quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)

~~~
static_noise
So the solution the author proposes is both absolutely correct and absolutely
useless in practice?

~~~
lottin
He's not proposing a solution as far as I can tell. He's simply using the E^2
statistic in order to illustrate the problem.

------
otl1248
This is a weird critique of R-Squared. You (should) learn very early on that
comparing R-Squared values of different models is a no-no.

~~~
achompas
And you're claiming it's weird because people learn this early on? Maybe
that's where we disagree -- it's not as common to learn about the issues with
R^2 as you might think.

~~~
otl1248
Maybe the fact that this is in the (reasonably brief) Wikipedia entry on R2 is
some evidence?

[https://en.wikipedia.org/wiki/Coefficient_of_determination#Inflation_of_R2](https://en.wikipedia.org/wiki/Coefficient_of_determination#Inflation_of_R2)

Also, I find that the OP is actually more confusing on this topic than
Wikipedia.

Not sure how else I can support that this is a basic fact about this metric.
I'm not in the mood to find quotes in intro textbooks, etc.

------
jostmey
Learn about information theory. It is better to calculate the cross-entropy
error or KL-divergence between the data and the model.
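
For a real-valued response, a minimal sketch of what that means, assuming a
Gaussian noise model (under which the cross-entropy is just the average
negative log-likelihood of the data under the model):

    import numpy as np

    def gaussian_cross_entropy(y, y_hat, sigma):
        # Average negative log-likelihood of y under y ~ Normal(y_hat, sigma^2);
        # lower is better. With fixed sigma this is a monotone function of MSE.
        return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                       + (y - y_hat) ** 2 / (2 * sigma ** 2))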

~~~
chestervonwinch
Do you have any links where KL-divergence is used in the case of real-valued
response variables? Why is it better?

------
acbart
This is not very accessible for people with weak statistical backgrounds.

~~~
yiyus
R-Squared is the most commonly used indicator of how well a model fits some
data. The author discusses why this indicator can be misleading in some cases
and shows an example. There is nothing interesting for people unfamiliar with
R-Squared values or model fitting.

But in fact, it is not complicated. If you are curious, feel free to ask any
specific question you have.

