Introduction to Statistics using NumPy (mubaris.com)
108 points by mubaris on Sept 9, 2017 | 22 comments



The formula you give for variance is the sample variance, with N - 1 as the divisor. It gives an unbiased estimate of the variance of a population, assuming X is a sample of that much larger population.

However, by default numpy gives you the population variance, with N as the divisor. This assumes X is the entire population. R and Matlab differ here: their variance functions default to the N - 1 divisor.

As a result, the answer in your code is different from the formula just above it.

  np.var(X)
gives 207, while

  np.sum((np.array(X) - np.mean(X))**2) / (len(X) - 1)
gives 236. If you want the N-1 corrected sample variance from numpy you'd use

  np.var(X, ddof=1)
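To make the difference concrete, here is a minimal sketch on a small made-up sample (the values are hypothetical, not the article's salary data). It only shows that `ddof=1` reproduces the N - 1 formula written out by hand.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
X = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = np.var(X)             # divisor N (numpy default, ddof=0)
sample_var = np.var(X, ddof=1)  # divisor N - 1

# The N - 1 formula written out by hand
manual = np.sum((np.array(X) - np.mean(X)) ** 2) / (len(X) - 1)

print(pop_var, sample_var, manual)
```

The gap between the two shrinks as N grows, so it matters most for small samples.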



The problem with this kind of extremely short text is that it jumps into explaining technical concepts without even mentioning basic things, such as what a random variable is.

Then, readers learn that statistics is about applying a few summing and averaging procedures, and trust the numbers they get as a result.


I totally agree. It's much better to start with probability and then move to statistics. It gives both a foundation and the motivation for statistics.


Indeed, and I think Python could be great for learning about probability too. It comes with an endless supply of random numbers. ;-)
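As a rough sketch of what that could look like, the snippet below estimates the probability that two dice sum to 7 by simulation and compares it with the exact value 6/36. The seed and trial count are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so reruns give the same draws

trials = 100_000
# Count the trials in which two fair dice sum to 7
hits = sum(
    1
    for _ in range(trials)
    if random.randint(1, 6) + random.randint(1, 6) == 7
)

estimate = hits / trials
exact = 6 / 36  # six of the 36 equally likely outcomes sum to 7
print(estimate, exact)
```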


Anyone have any recommendations on where to start? In particular, I’ve been looking for a good book on probability.


I like Norvig's Jupyter notebook on Probability using Python.

http://nbviewer.jupyter.org/url/norvig.com/ipython/Probabili...


I recommend Think Stats by Allen Downey http://greenteapress.com/thinkstats/


William Bolstad - Introduction to Bayesian Statistics

Or E.T. Jaynes - Probability Theory, The Logic of Science (but this one is much more detailed and has lots of philosophy)

Or Ian Hacking - An Introduction to Probability and Inductive Logic (this is the light approach written for philosophy students)


Jaynes is the usual 'standard' probability text, and it's well written and not too dry. It's mathsy though.


I'll be posting a separate blog post about probability soon. Anyway, thanks for the suggestion.


Calculating mean, median, and average in NumPy. How is this on the front page of Hacker News?


It's taking space that could be taken by yet another article about confidence/burnout/fitness/sleeping/nutrition!


For such basic functionality, is there any reason not to just use Python 3's `statistics` module?

  from statistics import mean, median, stdev, variance
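For reference, a minimal sketch of those functions on made-up data (the values are hypothetical). Note that `statistics.variance` and `statistics.stdev` use the N - 1 divisor, while `pvariance` uses N, so the module's defaults differ from `np.var`.

```python
from statistics import mean, median, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

print(mean(data))       # arithmetic mean
print(median(data))     # average of the two middle values here
print(variance(data))   # sample variance, N - 1 divisor
print(pvariance(data))  # population variance, N divisor
print(stdev(data))      # square root of the sample variance
```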


Two reasons. First is that NumPy works in Python 2 as well as 3.

Second is that everyone uses NumPy for this. It is proven, it is fast, and using it for simple things is a great way to get started on a path toward using it for more complex things. For example I use NumPy for tabular text processing sometimes, as it is much faster than the built-in stuff if you have a lot of data.
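A minimal sketch of the tabular case, assuming a small tab-separated table with a header row. The column names and values are invented, and an in-memory buffer stands in for a real file.

```python
import io

import numpy as np

# Hypothetical tab-separated table; a real file path would work the same way
text = io.StringIO(
    "id\tsalary\tyears\n"
    "1\t50000\t3\n"
    "2\t62000\t5\n"
    "3\t58000\t4\n"
)

# Skip the header row and parse the rest into a 2-D float array
table = np.loadtxt(text, delimiter="\t", skiprows=1)

salary_mean = table[:, 1].mean()  # mean of the second column
print(table.shape, salary_mean)
```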


It's a little bit confusing that the article defines variance as "the averaged squared deviation of each data from your dataset" but then the formula shown is the unbiased estimator of the variance (with n-1 in the denominator), with no explanation of the difference.


I fixed this issue.


Nice example, but there are two typos:

The call to open() is missing its close paren, and splitlines() is misspelled. The code should be:

  with open('salary.txt') as f:
      X = f.read().splitlines()


Thanks. Fixed the typos.


Where did you get the dataset from? Adding a link to the source would be nice


There was an Ask HN post about earnings in Europe.

https://news.ycombinator.com/item?id=15088840

Original Responses - https://docs.google.com/spreadsheets/d/1cjwB_s4ya57auTjIw-OV...

Then I went on to refine the data for my purposes - https://docs.google.com/spreadsheets/d/1AAJmOWOE-zYydNR_HzLU...


Nice, thanks! One question: where did you get the dataset? The gender is encoded as 0, 1, 2; which is which? Is this dataset from the Excel sheet that was posted here some time ago? Could you share the link, please?

I started to look at the dataset a little, if someone is interested: https://github.com/davidgengenbach/developer-salary-statisti...



