
Understanding Variance, Co-Variance, and Correlation - CountBayesie
http://www.countbayesie.com/blog/2015/2/21/variance-co-variance-and-correlation
======
edtechdev
This is a horrible explanation of variance. And it's missing WHY we need
variance, i.e., what variance is useful for compared to other measures like
the mean and range.

Say you want to buy a car and want to choose a brand and model based on user
ratings of quality and value online.

Cars A, B, and C all have the same average rating - let's say 8 out of 10. How
to choose? You need more information, but all you have are the ratings.

You could look at the range of ratings: the difference between the maximum
rating and the minimum rating. But suppose only one or two people gave a car
a bad (low) rating of 1 or 2, whereas another car had a lot of low ratings of
3 and 4 but no one rated it a 1 or 2. The range alone might not characterize
the ratings as a whole, because a single person (data point) can skew it.

You want to look at the spread of the ratings - how consistent or variable the
ratings are. A car with a lot of 7, 8, and 9 ratings is better than a car with
ratings all over the place that happen to average the same (8). When you buy
a car with an average rating of 8 out of 10, you expect a car that is an 8.
You want to minimize the chance of getting a lemon.

This spread can be calculated by looking at the difference between each
individual rating and the average rating. If you simply added up all these
differences, though, the negative differences from the mean would cancel out
the positive ones. With variance, each difference is therefore squared to
make them all positive (or zero). And so on...
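
A quick sketch of that in Python (the ratings here are made up):

    import statistics

    # Hypothetical ratings: both cars average 8, but B is far more spread out.
    car_a = [7, 8, 8, 8, 9, 8]
    car_b = [4, 10, 10, 4, 10, 10]

    print(statistics.mean(car_a), statistics.mean(car_b))  # both 8

    # Raw differences from the mean cancel out...
    print(sum(r - 8.0 for r in car_a))  # 0.0
    print(sum(r - 8.0 for r in car_b))  # 0.0

    # ...so variance squares each difference first.
    print(statistics.pvariance(car_a))  # ~0.33
    print(statistics.pvariance(car_b))  # 8.0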

~~~
anonymousDan
But why take the square and not just the absolute value of the differences? Is
the idea to emphasize outliers and hence give higher variance to skewed
datasets?

~~~
yummyfajitas
The real idea is that you have an implicit model, specifically a normal
distribution. The variance is one of the parameters of the normal distribution
(the other being the mean).

A normal distribution is a good implicit model to choose - the central limit
theorem and similar laws suggest that sums of draws from lots of other
distributions will asymptotically approach it. But it's not always the right
choice - e.g., it's a disaster when you have power-law tails or low-frequency,
high-amplitude noise.
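
As a rough illustration (my sketch, not the article's): the sample variance of
normal data settles down as you add samples, but with Cauchy (power-law) tails
it never does:

    import numpy as np

    rng = np.random.default_rng(0)

    # Sample variance: roughly 1 for the normal at every n, but erratic and
    # typically growing for the Cauchy, whose variance is undefined.
    for n in [10**3, 10**5, 10**7]:
        print(n, np.var(rng.normal(size=n)), np.var(rng.standard_cauchy(size=n)))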

~~~
Tycho
could you give some examples or detail about power law tails?

~~~
yummyfajitas
So the CLT says the sum of random variables with rapidly decaying tails will
approach a normal distribution. There are similar results showing that the sum
of random variables with slowly decaying tails approaches a stable
distribution:

[https://en.wikipedia.org/wiki/Stable_distribution](https://en.wikipedia.org/wiki/Stable_distribution)

This makes the stable distribution the right answer under some circumstances.
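
One way to see the stability property (my sketch): the mean of n standard
Cauchy draws is itself standard Cauchy for any n, so averaging never
concentrates it the way it would for a normal:

    import numpy as np

    rng = np.random.default_rng(1)

    # Median of |sample mean| stays near 1 for the Cauchy at every n;
    # for a normal it would shrink like 1/sqrt(n).
    for n in [10, 100, 1000]:
        means = rng.standard_cauchy(size=(1000, n)).mean(axis=1)
        print(n, np.median(np.abs(means)))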

For different test statistics (e.g. max drawdown), you've got similar fat
tailed distributions, e.g. GEV:

[https://en.wikipedia.org/wiki/Generalized_extreme_value_dist...](https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution)

As an example of how you might use slowly decaying distributions, consider
this example of Cauchy PCA:

[http://arxiv.org/pdf/1412.6506v1.pdf](http://arxiv.org/pdf/1412.6506v1.pdf)

I'm working on a blog post explaining the use of fat-tailed distributions for
linear regression in a Bayesian context.

------
thetwiceler
To define variance as

    E[x^2] - E[x]^2

and not ever allude to the far more meaningful version,

    E[ (x - E[x])^2 ]

is just criminal. This is _not_ at all a good explanation of variance,
covariance, and correlation.
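
For the record, the two forms agree; expanding the square shows it:

    E[ (x - E[x])^2 ] = E[ x^2 - 2x E[x] + E[x]^2 ]
                      = E[x^2] - 2 E[x] E[x] + E[x]^2
                      = E[x^2] - E[x]^2

but only the second form says what variance _means_: the expected squared
deviation from the mean.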

~~~
mturmon
I have to agree with you here.

One problem with introducing it the way the article does is that it's hard to
see why the variance is never negative, and is zero exactly when the R.V. is
constant.

This is a very important property, to say the least.

It would be better to say you measure the "energy" with

    E x^2

but that this is not immune to level shifts, so you need to subtract some
constant off first. And it so happens that the optimal constant to subtract
off is our friend E x.
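
To see why, expand and minimize over the constant c:

    E[ (x - c)^2 ] = E[x^2] - 2c E[x] + c^2

    d/dc:  -2 E[x] + 2c = 0  =>  c = E[x]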

Edited to add: The notion of introducing the ideas of a sample space and a
random variable (in the technical sense), as is done in the article, and at
the same time being shy about calculus, is rather contradictory. That is, the
intersection of

    { people who want measure-theoretic probability concepts }

and

    { people who don't know calculus }

may be empty.

~~~
hessenwolf
It looks like an enthusiastic newbie with some clipart and an equation. Fair
play for trying.

I would suggest adding the following.

1. What the poster above said.

2. The reason for the E[(x_{bar} - x_i)^2] choice. Why not E[|x_{bar} -
x_i|]? Was it a mathematical convenience? Was it, perhaps, because Gauss had
the integral of e^{-t^2} from -Inf to +Inf lying around in a letter from
Laplace?

3. It is an equation with a square. Use a square somewhere.

4. The square root of the variance happens to be the horizontal distance
between the mean and the point of inflection in the normal distribution (a
quick check follows this list). How cool is that?
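
On (4): for the normal density f(x) proportional to e^{-(x - mu)^2 / (2 sigma^2)},

    f''(x) = 0  <=>  (x - mu)^2 = sigma^2  <=>  x = mu +/- sigma

so the inflection points sit exactly one standard deviation from the mean.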

~~~
mturmon
I like (3) in particular. You could introduce, in a very simple way, the idea
that the "error" (X - E X) is perpendicular to the "estimate" (E X). That's
the two legs of the right triangle; the hypotenuse is "X" itself.
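
Pythagoras then gives the decomposition

    E[X^2] = (E X)^2 + E[ (X - E X)^2 ]

i.e. energy = (mean)^2 + variance, with the cross term vanishing because
E[(X - E X) E X] = 0.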

------
krcz
Covariance - dot product. Variance - squared norm. Correlation - cosine of
the angle between vectors.
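
In code (my sketch, with made-up data), the analogies are literal once you
center the vectors:

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 7.0])
    y = np.array([2.0, 3.0, 5.0, 9.0])

    # Center the data; covariance is then a (scaled) dot product,
    # variance a (scaled) squared norm.
    xc, yc = x - x.mean(), y - y.mean()
    cov = xc @ yc / len(x)
    var_x = xc @ xc / len(x)

    # Correlation is the cosine of the angle between the centered vectors.
    cos = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
    print(cos, np.corrcoef(x, y)[0, 1])  # identical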

~~~
sukilot
This is why I love math. Rigorous definitions and pattern reuse, distilling
concepts to their essential features.

------
cafebeen
Never seen covariance spelled with a hyphen...

------
hyperliner
I love how some people have the ability to explain in simple terms something
that is not so simple. And I appreciate that they take the time to write it
down.

Thank you OP.

------
it_learnses
so many spelling mistakes...

