> result length dependent operations: to calculate a mean in this case we have to pass the string “mean” to transform. This tells pandas that the result should be the same length as the original data.
`g_students.score.mean()`
has the same length as using `g_students.score.transform('mean')` but the result has different values!
I think that is a great point to add to your very interesting article. I wouldn't know which of the two operations is the correct one to use, and I would not notice anything wrong or odd with either method in a code review, so this is ripe for introducing wrong results in a production environment.
I also appreciate your idea of porting dplyr to Python. Keep up the good work :)
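For anyone who wants to check the two calls side by side, here is a minimal sketch. The `students` frame, its column names and its values are made up for illustration (the article's data is not reproduced here); only `groupby`, `mean` and `transform` are real pandas calls. On this toy data the plain grouped mean comes back with one row per group, indexed by the group key, while `transform('mean')` comes back with one row per original row, so it is worth comparing length and index before trusting either in a pipeline.

```python
import pandas as pd

# Hypothetical toy data, not the data from the article.
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "course": ["math", "math", "art", "art", "art"],
    "score": [80, 90, 60, 70, 95],
})
g_students = students.groupby("course")

# Aggregation: one value per group, indexed by the group key.
agg_mean = g_students.score.mean()

# Transform: the group mean broadcast back onto every original row.
trans_mean = g_students.score.transform("mean")

print(len(agg_mean), len(trans_mean))
print(agg_mean)
print(trans_mean)
```

A cheap guard in a code review is to assert that the result's index equals the original frame's index whenever the value is about to be assigned back as a column.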
This table sums up some of it:
| operation | time |
| --- | --- |
| apply score + 1 | 30s |
| apply score.values + 1 | 3s |
| transform score + 1 | 30s |
| transform score.values + 1 | 20s |
It seems to me that pandas is simply a leakier abstraction than dplyr, data.table, etc. As a user of the library, in most instances you shouldn't have to profile your code to figure out why things behave the way they do (btw, thanks for pointing out snakeviz, it seems like a useful tool).
That being said, we shouldn't complain too much about pandas. It is, in the end, a very important and useful tool.
For example, a grouped filter is very cumbersome in pandas.
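To make that concrete, here is a rough sketch of two things "grouped filter" could mean in pandas. Both interpretations are my assumption, and the `students` frame is hypothetical toy data: (1) keep the rows whose score is above their own group's mean, and (2) keep whole groups whose mean passes a threshold.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "course": ["math", "math", "art", "art", "art"],
    "score": [80, 90, 60, 70, 95],
})

# (1) Row-level grouped filter: build a per-row group mean with transform,
# then use it in an ordinary boolean mask on the original frame.
group_mean = students.groupby("course").score.transform("mean")
above_group_mean = students[students.score > group_mean]

# (2) Group-level filter: groupby(...).filter keeps or drops whole groups
# depending on the boolean returned for each group's sub-frame.
strong_courses = students.groupby("course").filter(
    lambda d: d.score.mean() > 80
)

print(above_group_mean)
print(strong_courses)
```

Note that the row-level case needs the grouping, the transform and the mask as three separate steps, which is roughly the extra ceremony being complained about.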
Interested to hear if you think it gets at the heart of the problem.
https://mchow.com/posts/2020-02-11-dplyr-in-python/