Introduction to Statistics using NumPy

bckygldstn · on Sept 9, 2017

The formula you give for variance is the sample variance, with N - 1 as the divisor. It gives an unbiased estimate of the variance of a population, assuming X is a sample of that much larger population.

However, by default numpy gives you the population variance, with N as the divisor, which assumes X is the entire population. This default differs from R and Matlab, which divide by N - 1.

As a result, the answer in your code is different from the formula just above it.

  np.var(X)

gives 207, while

  np.sum((np.array(X) - np.mean(X))**2) / (len(X) - 1)

gives 236. If you want the N-1 corrected sample variance from numpy you'd use

  np.var(X, ddof=1)
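A minimal sketch of the difference, using a made-up list rather than the article's salary data (so the numbers here are not 207 and 236):

```python
import numpy as np

# Hypothetical data, not the article's salary dataset.
X = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = np.var(X)             # divides by N (numpy's default, ddof=0)
sample_var = np.var(X, ddof=1)  # divides by N - 1

# The hand-written formula with N - 1 agrees with ddof=1, not with the default.
manual = np.sum((np.array(X) - np.mean(X)) ** 2) / (len(X) - 1)

print(pop_var, sample_var, manual)
```

The two results differ by a factor of N / (N - 1), so the gap shrinks as the dataset grows.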

aisofteng · on Sept 9, 2017

See: Bessel's correction.

https://en.m.wikipedia.org/wiki/Bessel%27s_correction
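One way to see why the correction matters is a quick simulation. This is only a sketch: the normal population, the sample size of 5, the trial count, and the seed are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

true_var = 4.0   # variance of the assumed population, Normal(0, sd=2)
n = 5            # small samples make the bias easy to see
trials = 100_000

# Each row is one independent sample of size n.
samples = rng.normal(loc=0.0, scale=2.0, size=(trials, n))

biased = samples.var(axis=1, ddof=0).mean()    # average of the N-divisor estimates
unbiased = samples.var(axis=1, ddof=1).mean()  # average of the (N - 1)-divisor estimates

print(true_var, biased, unbiased)
```

On average the N-divisor estimate should land near (n - 1) / n times the true variance, while the N - 1 version should land near the true variance itself.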

dragandj · on Sept 9, 2017

The problem with this kind of extremely short text is that it jumps into explaining technical concepts without even mentioning basic things such as random variables.

Then, readers learn that statistics is about applying a few summing and averaging procedures, and trust the numbers they get as a result.

RA_Fisher · on Sept 9, 2017

I totally agree. It's much better to start with probability and then move to statistics. It gives both a foundation and the motivation for statistics.

analog31 · on Sept 9, 2017

Indeed, and I think Python could be great for learning about probability too. It comes with an endless supply of random numbers. ;-)

melling · on Sept 9, 2017

Anyone have any recommendations on where to start? In particular, I’ve been looking for a good book on probability.

maroonblazer · on Sept 9, 2017

I like Norvig's Jupyter notebook on Probability using Python.

http://nbviewer.jupyter.org/url/norvig.com/ipython/Probabili...

RA_Fisher · on Sept 9, 2017

I recommend Think Stats by Allen Downey http://greenteapress.com/thinkstats/

dragandj · on Sept 9, 2017

William Bolstad - Introduction to Bayesian Statistics

Or E.T. Jaynes - Probability Theory, The Logic of Science (but this one is much more detailed and has lots of philosophy)

Or Ian Hacking - An Introduction to Probability and Inductive Logic (this is the light approach written for philosophy students)

joshvm · on Sept 9, 2017

Jaynes is the usual 'standard' probability text, and it's well written and not too dry. It's mathsy though.

mubaris · on Sept 9, 2017

I'll be posting a separate blog post about probability soon. Anyway, thanks for the suggestion.

nafizh · on Sept 9, 2017

Calculating mean, median, and average in NumPy. How is this on the front page of Hacker News?

branchless · on Sept 9, 2017

It's taking space that could be taken by yet another article about confidence/burnout/fitness/sleeping/nutrition!

lettuce · on Sept 9, 2017

For such basic functionality, is there any reason not to just use Python 3's `statistics` module?

  from statistics import mean, median, stdev, variance
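Worth noting in light of the variance discussion above: the `statistics` module splits the two conventions into separate functions. A small sketch with made-up data, compared against numpy:

```python
import statistics
import numpy as np

X = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

# statistics.variance uses the N - 1 divisor (sample variance);
# statistics.pvariance uses the N divisor (population variance).
sample_var = statistics.variance(X)
pop_var = statistics.pvariance(X)

print(sample_var, np.var(X, ddof=1))  # these two should agree
print(pop_var, np.var(X))             # and so should these
```

So `variance` from `statistics` corresponds to `np.var(X, ddof=1)`, not to numpy's default.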

jzwinck · on Sept 9, 2017

Two reasons. First is that NumPy works in Python 2 as well as 3.

Second is that everyone uses NumPy for this. It is proven, it is fast, and using it for simple things is a great way to get started on a path toward using it for more complex things. For example I use NumPy for tabular text processing sometimes, as it is much faster than the built-in stuff if you have a lot of data.
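As a rough illustration of the tabular case (not necessarily how it is done in practice; the column layout and values below are invented), `np.loadtxt` can parse delimited text straight into an array:

```python
import io
import numpy as np

# Hypothetical comma-separated table with two columns: age, salary.
text = io.StringIO("25,30000\n32,45000\n41,52000\n")

table = np.loadtxt(text, delimiter=",")  # 2-D float array, one row per line

ages = table[:, 0]
salaries = table[:, 1]

print(table.shape, salaries.mean(), np.median(ages))
```

`np.loadtxt` also accepts a filename in place of the `StringIO` object used here.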

cristoperb · on Sept 9, 2017

It's a little confusing that the article defines variance as "the averaged squared deviation of each data from your dataset" but then shows, without explanation, the formula for the unbiased estimate of the variance (with n-1 in the denominator).

mubaris · on Sept 9, 2017

I fixed this issue.

gshubert17 · on Sept 9, 2017

Nice example, but there are two typos:

The call to open() is missing its closing paren, and splitlines() is misspelled. The code should be:

  with open('salary.txt') as f:
      X = f.read().splitlines()
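One thing to watch: splitlines() returns strings, so the values need converting before any arithmetic. A sketch, assuming the file holds one number per line (the stand-in file below is invented; the real salary.txt is not reproduced here):

```python
import os
import tempfile
import numpy as np

# Write a small stand-in file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "salary.txt")
with open(path, "w") as f:
    f.write("30000\n45000\n52000\n")

with open(path) as f:
    lines = f.read().splitlines()  # list of strings such as '30000'

X = np.array(lines, dtype=float)   # convert the text to floats
print(X.mean())
```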

mubaris · on Sept 9, 2017

Thanks. Fixed the typos.

pencilcode · on Sept 9, 2017

Where did you get the dataset from? Adding a link to the source would be nice.

mubaris · on Sept 9, 2017

There was an Ask HN post about earnings in Europe.

https://news.ycombinator.com/item?id=15088840

Original Responses - https://docs.google.com/spreadsheets/d/1cjwB_s4ya57auTjIw-OV...

Then I went on to refine the data for my purposes - https://docs.google.com/spreadsheets/d/1AAJmOWOE-zYydNR_HzLU...

paprikawuerzung · on Sept 9, 2017

Nice, thanks! One question: where did you get the dataset from? The gender is encoded as 0, 1, 2. Which is which? Is this dataset from the Excel spreadsheet that was posted here some time ago? Could you share the link please?

I started looking at the dataset a little, if anyone is interested: https://github.com/davidgengenbach/developer-salary-statisti...