> Having many different ways to express the same logic makes it hard for developers to understand programs of heterogeneous styles. Besides having varying ways to express the same simple logic, the sheer number of APIs (> 200), which are not only overloaded but also have default parameters that may change from version to version, makes it hard to remember the APIs.
It's a bit tangential to the main point, but I do agree with this remark.
I have always found Pandas uncomfortable to work with. I'm never sure if I'm doing things in the most efficient/idiomatic way and I've found it hard to be consistent over time, especially since I've picked up different bits of code from different places.
I've gotten a lot more efficiency out of R, especially the data.table package.
I feel like I'm constantly looking up SO or blog posts that benchmark Pandas methods while I'm coding with Pandas. You have to, since the inefficiency you add with a slower method is nontrivial.
> result length dependent operations: to calculate a mean in this case we have to pass the string “mean” to transform. This tells pandas that the result should be the same length as the original data.
`g_students.score.mean()` returns one value per group, while `g_students.score.transform('mean')` returns a result the same length as the original data, so the two calls give results with different shapes and different values!
I think that is a great point to add to your very interesting article. I wouldn't know which of the two operations is correct to use, and I would not notice anything wrong or odd with either method in a code review, so this is ripe for introducing wrong results in a production environment.
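For anyone unsure of the difference, here is a minimal sketch (the toy `students` DataFrame and its column names are my own invention for illustration):

```python
import pandas as pd

students = pd.DataFrame({
    "school": ["A", "A", "B", "B", "B"],
    "score":  [1.0, 3.0, 2.0, 4.0, 6.0],
})
g_students = students.groupby("school")

# .mean() aggregates: one row per group (length 2 here)
print(g_students.score.mean())
# school
# A    2.0
# B    4.0

# .transform('mean') broadcasts each group's mean back onto the
# original index: same length as students (5 rows here)
print(g_students.score.transform("mean"))
# 0    2.0
# 1    2.0
# 2    4.0
# 3    4.0
# 4    4.0
```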
I also appreciate your idea of porting dplyr to python, keep up the good work :)
This table sums up some of it:
| operation | time |
| --- | --- |
| `apply score + 1` | 30s |
| `apply score.values + 1` | 3s |
| `transform score + 1` | 30s |
| `transform score.values + 1` | 20s |
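Those timings presumably come from the article's own benchmark; here is a rough sketch of how one might reproduce such a comparison (the data size, group count, and the `bench` helper are my assumptions, not the author's code):

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "g": rng.integers(0, 1_000, size=1_000_000),
    "score": rng.random(1_000_000),
})
grouped = df.groupby("g")

def bench(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.2f}s")

# Series arithmetic pays pandas overhead (index alignment, boxing) per group
bench("apply score + 1", lambda: grouped.apply(lambda d: d.score + 1))
# Dropping to the raw ndarray via .values skips most of that overhead
bench("apply score.values + 1", lambda: grouped.apply(lambda d: d.score.values + 1))
bench("transform score + 1", lambda: grouped.score.transform(lambda s: s + 1))
bench("transform score.values + 1", lambda: grouped.score.transform(lambda s: s.values + 1))
```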
It seems to me that pandas is simply a leakier abstraction than dplyr, data.table, etc. As a user of the library, in most instances you shouldn't have to profile your code to figure out why things behave the way they do (btw, thanks for pointing out snakeviz - it seems like a useful tool).
This being said, we shouldn't complain too much about pandas - it is in the end a very important and useful tool.
I mean, perhaps comparing R to pandas is too low a bar.
If I read "Having many different ways to express the same logic makes it hard for developers to understand programs of heterogeneous styles" in a vacuum, the very first thing I would think of is R.
Not sure that I'd consider those all that eloquent since it's just the product of 2 different pieces of syntactic sugar (`df.a` being shorthand for `df["a"]`, and `df[<index filter>]` shorthand for `df.loc[<index filter>]`).
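For concreteness, the two equivalences in question (a minimal sketch; the tiny `df` is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Attribute access is sugar for column lookup
assert df.a.equals(df["a"])

# Boolean indexing on the frame is sugar for .loc with the same mask
mask = df["a"] > 1
assert df[mask].equals(df.loc[mask])
```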
Well, that's kind of the point. What's the purpose of the syntactic sugar? Is it just that, or is there some hidden performance difference? This is not clear at first sight.
The point is to "Huffman encode" the API for expressing near-boilerplate. Like unix command names and flags.
The problem is that there is no simple, logically coherent API to use when you haven't memorized all the shortcuts. And the author only allows "tax form" APIs (what he calls "Pythonic/Pandonic"), where every parameter is a single atomic step, so it's laborious to express things like tree-structured queries that are more complex than parameter dictionaries.