
Introduction to Statistics using NumPy - mubaris
https://mubaris.com/2017-09-09/introduction-to-statistics-using-numpy
======
bckygldstn
The formula you give for variance is the sample variance, with N - 1 as the
divisor. It gives an unbiased estimate of the variance of a population,
assuming X is a sample of that much larger population.

However, by default numpy gives you population variance, with N as the
divisor. This assumes X is the entire population. That default differs from R
and Matlab, whose `var` functions divide by N - 1.

As a result, the answer in your code is different from the formula just above
it.

    
    
      np.var(X)
    

gives 207, while

    
    
      np.sum((np.array(X) - np.mean(X))**2) / (len(X) - 1)
    

gives 236. If you want the N-1 corrected sample variance from numpy you'd use

    
    
      np.var(X, ddof=1)
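
To make the difference concrete, here is a small sketch on made-up numbers. The values in `X` are purely illustrative and are not the article's salary data. It compares NumPy's default, the `ddof=1` variant, and the manual N - 1 formula:

```python
import numpy as np

# Illustrative data only; not the article's salary dataset.
X = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# NumPy default: ddof=0, so the divisor is N (population variance).
pop_var = np.var(X)

# ddof=1 switches the divisor to N - 1 (sample variance).
sample_var = np.var(X, ddof=1)

# The manual N - 1 formula should agree with the ddof=1 result.
manual_var = np.sum((np.array(X) - np.mean(X)) ** 2) / (len(X) - 1)

print(pop_var, sample_var, manual_var)
```

The squared deviations here sum to 32, so the first value is 32 / 8 = 4 and the other two are 32 / 7, roughly 4.57.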

~~~
aisofteng
See: Bessel's correction.

[https://en.m.wikipedia.org/wiki/Bessel%27s_correction](https://en.m.wikipedia.org/wiki/Bessel%27s_correction)
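
For anyone who would rather see it than read it, a rough sketch of why the correction matters. It assumes a tiny made-up population (the four values below are hypothetical) and enumerates every possible sample of size 2 drawn with replacement, then averages both estimators over all of them:

```python
from itertools import product
from statistics import pvariance, variance

# Hypothetical tiny population, small enough to enumerate every sample.
population = [1.0, 2.0, 3.0, 4.0]
true_var = pvariance(population)  # divisor N over the whole population

# Every ordered sample of size 2, drawn with replacement (16 in total).
samples = list(product(population, repeat=2))

# Average each estimator over all possible samples.
avg_corrected = sum(variance(s) for s in samples) / len(samples)     # divisor n - 1
avg_uncorrected = sum(pvariance(s) for s in samples) / len(samples)  # divisor n

print(true_var, avg_corrected, avg_uncorrected)
```

With the n - 1 divisor the average over all samples matches the population variance (1.25). With the n divisor it comes out at half that (0.625), which is the (n - 1) / n shrinkage you would expect for samples of size 2.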

------
dragandj
The problem with this kind of extremely short text is that it jumps into
explaining technical concepts without even mentioning basic things such as
_random variable_, for example.

Then, readers learn that statistics is about applying a few summing and
averaging procedures, and trust the numbers they get as a result.

~~~
RA_Fisher
I totally agree. It's much better to start with probability and then move to
statistics. It gives both a foundation and the motivation for statistics.

~~~
melling
Anyone have any recommendations on where to start? In particular, I’ve been
looking for a good book on probability.

~~~
maroonblazer
I like Norvig's Jupyter notebook on Probability using Python.

[http://nbviewer.jupyter.org/url/norvig.com/ipython/Probabili...](http://nbviewer.jupyter.org/url/norvig.com/ipython/Probability.ipynb)

------
nafizh
Calculating mean, median, and average in NumPy. How is this on the front page
of Hacker News?

~~~
branchless
It's taking space that could be taken by yet another article about
confidence/burnout/fitness/sleeping/nutrition!

------
lettuce
For such basic functionality, is there any reason not to just use Python 3's
`statistics` module?

    
    
      from statistics import mean, median, stdev, variance
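
A quick sketch of what that looks like, on made-up numbers (the list `X` is illustrative, not the article's data). One thing to keep in mind: `variance` and `stdev` in this module use the N - 1 divisor, while `pvariance` and `pstdev` are the population versions.

```python
from statistics import mean, median, pvariance, stdev, variance

# Illustrative data only; not the article's salary dataset.
X = [2, 4, 4, 4, 5, 5, 7, 9]

avg = mean(X)             # arithmetic mean
mid = median(X)           # middle value (average of the two middle values here)
sample_var = variance(X)  # divisor N - 1
pop_var = pvariance(X)    # divisor N
sample_sd = stdev(X)      # square root of the sample variance

print(avg, mid, sample_var, pop_var, sample_sd)
```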

~~~
jzwinck
Two reasons. First is that NumPy works in Python 2 as well as 3.

Second is that everyone uses NumPy for this. It is proven, it is fast, and
using it for simple things is a great way to get started on a path toward
using it for more complex things. For example I use NumPy for tabular text
processing sometimes, as it is much faster than the built-in stuff if you have
a lot of data.
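
As a rough idea of what I mean by tabular text processing, here is a minimal sketch. The column names and values are invented for illustration, and an in-memory buffer stands in for a real file:

```python
import io

import numpy as np

# Hypothetical comma-separated table; the buffer stands in for a file on disk.
text = io.StringIO(
    "age,salary\n"
    "25,30000\n"
    "32,45000\n"
    "41,60000\n"
)

# skiprows=1 drops the header line; the result is a 2-D float array.
table = np.loadtxt(text, delimiter=",", skiprows=1)

# Column-wise statistics, one call each.
col_means = table.mean(axis=0)
col_medians = np.median(table, axis=0)

print(table.shape, col_means, col_medians)
```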

------
cristoperb
It's a little bit confusing that the article defines variance as "the averaged
squared deviation of each data from your dataset" but then the formula shown
is the unbiased estimation of the variance (with n-1 in the denominator)
without explanation.

~~~
mubaris
I fixed this issue.

------
gshubert17
Nice example. But two typos:

Missing a close paren to open(); and splitlines() is misspelled. Code should
be:

      with open('salary.txt') as f:
          X = f.read().splitlines()
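
One more thing worth noting: `splitlines()` returns strings, so they need converting before any arithmetic. A minimal sketch, assuming the file holds one number per line (that layout is a guess on my part, and the throwaway file below only stands in for `salary.txt`):

```python
import os
import tempfile

# Write a throwaway file so the sketch runs anywhere; it stands in for salary.txt.
path = os.path.join(tempfile.mkdtemp(), "salary.txt")
with open(path, "w") as f:
    f.write("30000\n45000\n60000\n")

with open(path) as f:
    lines = f.read().splitlines()  # list of strings, newlines removed

# Convert to numbers, skipping any blank lines.
X = [float(line) for line in lines if line.strip()]

print(lines, X)
```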

~~~
mubaris
Thanks. Fixed typos

------
pencilcode
Where did you get the dataset from? Adding a link to the source would be nice.

~~~
mubaris
There was an Ask HN post about earnings in Europe.

[https://news.ycombinator.com/item?id=15088840](https://news.ycombinator.com/item?id=15088840)

Original Responses -
[https://docs.google.com/spreadsheets/d/1cjwB_s4ya57auTjIw-OV...](https://docs.google.com/spreadsheets/d/1cjwB_s4ya57auTjIw-OVjoR_3IDvLcELt9sK420fzbs/edit#gid=1485289967)

Then I went on to refine the data for my purposes -
[https://docs.google.com/spreadsheets/d/1AAJmOWOE-zYydNR_HzLU...](https://docs.google.com/spreadsheets/d/1AAJmOWOE-zYydNR_HzLUBJ50ESUCN0x4RY4AkHrdBow/edit?usp=sharing)

------
paprikawuerzung
Nice, thanks! One question: where did you get the dataset? The gender is
encoded as 0, 1, 2; which is which? Is this dataset from the Excel sheet that
was posted here some time ago? Could you share the link, please?

I started to look at the dataset a little, in case anyone is interested:
[https://github.com/davidgengenbach/developer-salary-statisti...](https://github.com/davidgengenbach/developer-salary-statistics)

