The formula you give for variance is the sample variance, with N - 1 as the divisor. It gives an unbiased estimate of the variance of a population, assuming X is a sample of that much larger population.
However, by default NumPy gives you the population variance, with N as the divisor. This assumes X is the entire population. R and MATLAB behave differently: their variance functions divide by N - 1 by default.
As a result, the answer in your code is different from the formula just above it.
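To make the mismatch concrete, here is a minimal sketch on made-up numbers (the array `x` is hypothetical, not the article's dataset). It computes the variance by hand with both divisors and compares each with what `np.var` returns when called with no extra arguments.

```python
import numpy as np

# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# Sum of squared deviations from the mean, shared by both formulas.
squared_dev_sum = ((x - x.mean()) ** 2).sum()

sample_var = squared_dev_sum / (n - 1)   # N - 1 divisor (the formula in the article)
population_var = squared_dev_sum / n     # N divisor

numpy_default = np.var(x)                # no ddof given, so NumPy divides by N

print("sample variance (N - 1):", sample_var)
print("population variance (N):", population_var)
print("np.var(x):              ", numpy_default)
```

On any dataset with more than one distinct value, the N - 1 result is the larger of the two, which is the direction of the discrepancy described above.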
The problem with this kind of extremely short text is that it jumps into explaining technical concepts without even mentioning basic things, such as what a random variable is.
Then, readers learn that statistics is about applying a few summing and averaging procedures, and trust the numbers they get as a result.
Two reasons. First is that NumPy works in Python 2 as well as 3.
Second is that everyone uses NumPy for this. It is proven, it is fast, and using it for simple things is a great way to get started on a path toward using it for more complex things. For example I use NumPy for tabular text processing sometimes, as it is much faster than the built-in stuff if you have a lot of data.
It's a little bit confusing that the article defines variance as "the averaged squared deviation of each data from your dataset" but then shows the formula for the unbiased estimator of the variance (with n-1 in the denominator) without explaining the difference.
Nice, thanks! One question: where did you get the dataset from? The gender is encoded as 0, 1, 2 - which is which? Is this dataset from the Excel file that was posted here some time ago? Could you share the link please?
The NumPy default (N divisor) gives 207, while the N - 1 formula gives 236. If you want the N - 1 corrected sample variance from NumPy, you'd use
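The snippet that belonged here did not survive, so the following is only a sketch of one way to get the corrected value, not necessarily what the comment originally showed. It relies on the `ddof` ("delta degrees of freedom") parameter of `np.var`, which makes the divisor N - ddof. The data is hypothetical.

```python
import numpy as np

# Hypothetical data, not the article's dataset.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

population_var = np.var(x)         # default ddof=0, divisor N
sample_var = np.var(x, ddof=1)     # divisor N - 1

# The array method accepts the same parameter.
sample_var_method = x.var(ddof=1)

# The two estimates differ by exactly the factor N / (N - 1).
rescaled = population_var * n / (n - 1)

# np.std takes ddof as well, for the matching standard deviation.
sample_std = np.std(x, ddof=1)

print(population_var, sample_var, sample_std)
```

Since the two results differ only by the factor N / (N - 1), the gap shrinks as the dataset grows, but for small samples it is easy to notice.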