Hacker News

I just finished a lengthy analysis of why pandas groupby operations end up harder to use than R's dplyr or data.table.

For example, a grouped filter is very cumbersome in pandas.

Interested to hear if you think it gets at the heart of the problem.

https://mchow.com/posts/2020-02-11-dplyr-in-python/




> result length dependent operations: to calculate a mean in this case we have to pass the string “mean” to transform. This tells pandas that the result should be the same length as the original data.

    g_students.score.mean()
returns one value per group, while `g_students.score.transform('mean')` returns a result the same length as the original data. Two nearly identical calls, two differently shaped results!

I think that is a great point to add to your very interesting article. I wouldn't know which of the two operations is correct to use, and I wouldn't notice anything wrong or odd with either method in a code review, so this is ripe for introducing wrong results in a production environment.
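To make that concrete, here is a small made-up example (tiny DataFrame, column names assumed for illustration) showing that the two calls return differently shaped results:

```python
import pandas as pd

# Tiny illustrative dataset; "group" and "score" are assumed names.
students = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [1.0, 2.0, 3.0, 4.0, 5.0],
})
g_students = students.groupby("group")

# Aggregation: one row per group -> length 2.
per_group = g_students.score.mean()

# Transform: the group mean broadcast back to every row -> length 5.
same_length = g_students.score.transform("mean")
```

In a diff, `.mean()` and `.transform('mean')` look interchangeable, but only the transform result can be assigned back as a column of the original frame.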


I really think it does.

I also appreciate your idea of porting dplyr to python, keep up the good work :)

This table sums up some of it:

    operation                  | time
    ---------------------------|-----
    apply score + 1            | 30s
    apply score.values + 1     |  3s
    transform score + 1        | 30s
    transform score.values + 1 | 20s
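A rough sketch of how a comparison like that could be reproduced (the DataFrame, column names, and sizes here are made up; absolute timings will differ from the table):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 1000 groups of 10 rows each.
df = pd.DataFrame({
    "group": np.repeat(np.arange(1000), 10),
    "score": np.random.rand(10_000),
})
g = df.groupby("group")

# Operating on the pandas Series goes through pandas machinery
# (index alignment, Series construction) once per group...
t_series = timeit.timeit(lambda: g.score.apply(lambda s: s + 1), number=3)

# ...while operating on the raw NumPy array skips most of that overhead.
t_values = timeit.timeit(lambda: g.score.apply(lambda s: s.values + 1), number=3)
```

The gap between the two is pure per-group bookkeeping, which is exactly the kind of cost a user shouldn't need a profiler to discover.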

It seems to me that pandas is simply a leakier abstraction than dplyr, data.table etc. As a user of the library you shouldn't, in most instances, have to profile your code to figure out why things behave the way they do (btw, thanks for pointing out snakeviz - it seems like a useful tool).

This being said, we shouldn't complain too much about pandas - it is in the end a very important and useful tool.



