This is a really big deal. In the first edition of Python for Data Analysis, they suggest using mean imputation. In case you don't know, this artificially shrinks your variance estimates and thus throws off any statistical tests built on them.
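To make the variance point concrete, here is a minimal sketch using only the Python standard library. The data is simulated, the 30% missingness rate is an arbitrary assumption, and `mean_impute` is a hypothetical helper name, not anything from the book.

```python
import random
import statistics

def mean_impute(values):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

rng = random.Random(0)
full = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Knock out roughly 30% of the values completely at random (assumed rate).
with_missing = [None if rng.random() < 0.3 else v for v in full]

observed = [v for v in with_missing if v is not None]
imputed = mean_impute(with_missing)

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

# The filled-in points sit exactly on the mean, so they add nothing to the
# sum of squared deviations while still inflating the sample size.
print(f"observed-only variance: {var_observed:.3f}")
print(f"mean-imputed variance:  {var_imputed:.3f}")
```

The imputed variance comes out as the observed variance scaled by (n_observed - 1) / (n_total - 1), so the more data is missing, the more the standard errors get understated.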

In the second edition, they suggest doing some interpolation. Meanwhile, in R land there are multiple ways (as always) to do useful multiple imputation, which gets you a much more accurate analysis that makes better use of all of the data (mice, Amelia and mi are all good, and somewhat complementary).
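For anyone who hasn't seen multiple imputation, the general shape is: fill in the missing values several times with random draws, run the analysis on each completed dataset, then pool the results with Rubin's rules. The sketch below is a toy version under stated assumptions: the imputation step is a simple bootstrap hot-deck chosen only for brevity, it is not how mice, Amelia or mi work internally, and `impute_once` and `pool_rubin` are hypothetical names.

```python
import random
import statistics

def impute_once(values, rng):
    """Fill each None with a random draw from a bootstrap resample of the observed values."""
    observed = [v for v in values if v is not None]
    donors = rng.choices(observed, k=len(observed))  # resampling adds between-imputation variability
    return [rng.choice(donors) if v is None else v for v in values]

def pool_rubin(estimates, variances):
    """Pool per-dataset estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)          # pooled point estimate
    within = statistics.fmean(variances)         # average within-imputation variance
    between = statistics.variance(estimates)     # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

rng = random.Random(1)
# Simulated data with roughly 30% of values missing (assumed rate).
data = [None if rng.random() < 0.3 else rng.gauss(10.0, 2.0) for _ in range(200)]

estimates, variances = [], []
for _ in range(20):  # m = 20 imputed datasets
    completed = impute_once(data, rng)
    estimates.append(statistics.fmean(completed))
    variances.append(statistics.variance(completed) / len(completed))  # variance of the mean

pooled_mean, pooled_var = pool_rubin(estimates, variances)
print(f"pooled mean: {pooled_mean:.3f}, pooled variance of the mean: {pooled_var:.5f}")
```

The key bit is the between-imputation term: it carries the uncertainty about the missing values into the final standard error, which is exactly what single mean imputation throws away.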

That being said, I just thought of using PyTorch and a GAN to do multiple imputation, so maybe it's not impossible to do in Python. There is way, way less support for it though (but of course you could probably build it in NumPy).

I guess the big difference is that R comes with a NumPy equivalent (matrix), a pandas equivalent (data.frame and base), and well-tested, numerically stable reference implementations of pretty much all widely used statistical models.

Like, I really don't understand why you wouldn't want to look at residuals, even if all you care about is prediction. Checking them tends to make your predictions more stable and accurate, and the patterns you find can often tell you how to model things more appropriately.
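Here is a minimal sketch of what looking at residuals buys you, with a deliberately misspecified model: a straight line fitted to data that is actually quadratic. The data is made up for illustration and `fit_line` is a hypothetical helper doing plain least squares by hand.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

xs = [float(x) for x in range(11)]
ys = [x ** 2 for x in xs]  # the true relationship is quadratic, not linear

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# A U-shaped pattern (positive, negative, positive) means the line is
# systematically wrong at both ends, which hints at a missing x**2 term.
for x, r in zip(xs, residuals):
    print(f"x={x:4.1f}  residual={r:7.2f}")
```

The residuals sum to roughly zero, so an aggregate error metric alone looks fine; it's the shape that gives the misspecification away.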

Finally, R's formula interface is a thing of beauty. Honestly, why the hell do I need to generate a model matrix for regression/classification when I can get R to do it for me?
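For comparison, this is roughly the bookkeeping that a formula like `y ~ x + group` saves you: an intercept column, the numeric predictors, and dummy columns for each factor with the first level dropped (treatment contrasts, R's default for unordered factors). It's a hand-rolled sketch; `model_matrix` here is a hypothetical Python helper, not R's function of the same name, and the column and level names are made up.

```python
def model_matrix(rows, numeric, categorical):
    """Build a design matrix with an intercept and treatment-coded dummies."""
    # Sorted levels per factor; the first level is the reference and gets no column.
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}

    columns = ["(Intercept)"] + list(numeric)
    for c in categorical:
        columns += [f"{c}{lvl}" for lvl in levels[c][1:]]

    matrix = []
    for r in rows:
        line = [1.0] + [float(r[n]) for n in numeric]
        for c in categorical:
            line += [1.0 if r[c] == lvl else 0.0 for lvl in levels[c][1:]]
        matrix.append(line)
    return columns, matrix

rows = [
    {"x": 1.5, "group": "a"},
    {"x": 2.0, "group": "b"},
    {"x": 0.5, "group": "c"},
]

columns, X = model_matrix(rows, numeric=["x"], categorical=["group"])
print(columns)
for line in X:
    print(line)
```

And that's before interactions, transformations or missing levels at prediction time, all of which the formula interface handles for you.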

I will also say that R is a frustrating, domain-specific, really irritating, wonderful language. But then I'm a crazy person: I wrote a Stockfighter client in R.
