
> Having many different ways to express the same logic makes it hard for developers to understand programs of heterogeneous styles. Besides having varying ways to express the same simple logic, the sheer number of APIs (> 200) that are not only overloaded but also have default parameters that may change version to version, making it hard to remember the APIs.

It's a bit tangential to the main point, but I do agree with this remark. I have always found Pandas uncomfortable to work with. I'm never sure if I'm doing things in the most efficient/idiomatic way and I've found it hard to be consistent over time, especially since I've picked up different bits of code from different places.

I've gotten a lot more efficiency out of R, especially the data.table package.




I feel like I'm constantly looking up SO or blog posts that benchmark Pandas methods while I'm coding with Pandas. You have to, since the inefficiency you add with a slower method is nontrivial.


I just finished a lengthy analysis of why pandas groupby operations end up harder to use than R's dplyr or data.table.

For example, a grouped filter is very cumbersome in pandas.

Interested to hear if you think it gets at the heart of the problem.

https://mchow.com/posts/2020-02-11-dplyr-in-python/
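For reference, here is what a grouped filter typically looks like in pandas (a minimal sketch with made-up data, not an example taken from the linked post):

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["a", "a", "b", "b", "c", "c"],
    "score": [60, 70, 80, 90, 50, 55],
})

# A grouped filter: keep only rows for students whose mean score is >= 65.
# The condition depends on a per-group aggregate, so the group mean has to
# be broadcast back to row length with transform() before it can be used
# as a boolean mask on the original frame.
mask = df.groupby("student")["score"].transform("mean") >= 65
result = df[mask]
```

In dplyr the same thing is roughly `df %>% group_by(student) %>% filter(mean(score) >= 65)` — no separate broadcast step needed.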


> result length dependent operations: to calculate a mean in this case we have to pass the string “mean” to transform. This tells pandas that the result should be the same length as the original data.

    g_students.score.mean()
looks almost interchangeable with `g_students.score.transform('mean')`, but the result has a different shape and different values!
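To make the contrast concrete, a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["a", "a", "b"],
    "score": [60, 70, 80],
})
g_students = df.groupby("student")

# Aggregation: one value per group, indexed by the group keys.
per_group = g_students.score.mean()           # a -> 65.0, b -> 80.0

# Transform: the group mean broadcast back to the original row count.
per_row = g_students.score.transform("mean")  # 65.0, 65.0, 80.0
```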

I think that is a great point to add to your very interesting article. I wouldn't know which of the two operations is correct to use, and I would not notice anything wrong or odd with either method in a code review, so this is ripe for introducing wrong results in a production environment.


I really think it does.

I also appreciate your idea of porting dplyr to python, keep up the good work :)

This table sums up some of it:

    operation                  | time
    ---------------------------+-----
    apply score + 1            | 30s
    apply score.values + 1     |  3s
    transform score + 1        | 30s
    transform score.values + 1 | 20s
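The gap presumably comes from pandas materialising a new Series (index and all) for every group in the `score + 1` case, while `score.values + 1` drops down to the raw NumPy array. A rough way to reproduce the shape of this comparison (timings are illustrative only and will vary by machine and pandas version):

```python
import numpy as np
import pandas as pd
import timeit

df = pd.DataFrame({
    "g": np.random.randint(0, 100, 10_000),
    "score": np.random.rand(10_000),
})
grouped = df.groupby("g")["score"]

# Series arithmetic: pandas wraps each group's slice in a new Series,
# paying construction and alignment overhead once per group.
t_series = timeit.timeit(lambda: grouped.apply(lambda s: s + 1), number=5)

# ndarray arithmetic: .values hands NumPy the raw buffer instead.
t_values = timeit.timeit(lambda: grouped.apply(lambda s: s.values + 1), number=5)

print(f"Series: {t_series:.3f}s  ndarray: {t_values:.3f}s")
```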

It seems to me that pandas is simply a leakier abstraction than dplyr, data.table etc. As a user of the library in most instances you shouldn't have to profile your code to figure out why things behave the way they do (btw, thanks for pointing out snakeviz - it seems like a useful tool).

This being said, we shouldn't complain too much about pandas - it is in the end a very important and useful tool.


I mean, perhaps comparing R to pandas is too low a bar.

If I read "Having many different ways to express the same logic makes it hard for developers to understand programs of heterogeneous styles" in a vacuum, the very first thing I would think of is R.


Indeed, but I was more specifically referring to something like data.table, which I found simpler to use than pandas.


Could you give some examples of this issue with pandas?


Some of the examples provided in the paper are eloquent:

    df[df.a > 3]
    df[df["a"] > 3]
    df.loc[df.a > 3]
    df.loc[df["a"] > 3]


Not sure that I'd consider those all that eloquent since it's just the product of 2 different pieces of syntactic sugar (df.a being shorthand for df["a"] and df[<index filter>] shorthand for df.loc[<index filter>]).
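That equivalence is easy to check directly (quick sketch; note that `df.a` only works when the column name is a valid Python identifier that doesn't clash with a DataFrame attribute):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 4, 5], "b": [10, 20, 30, 40]})

# Four spellings of the same row filter.
r1 = df[df.a > 3]
r2 = df[df["a"] > 3]
r3 = df.loc[df.a > 3]
r4 = df.loc[df["a"] > 3]

assert r1.equals(r2) and r2.equals(r3) and r3.equals(r4)
```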


Here's another example: https://stackoverflow.com/questions/49936557/pandas-datafram...

What's the difference between query() and .loc[]? Do they evaluate to the same thing under the hood? Is one better than the other? In what cases?

These are questions that don't have obvious answers at first sight.
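For what it's worth, both return the same frame for a simple filter; `.loc` takes a boolean mask you build eagerly in Python, while `query()` parses a string expression (which, per the pandas docs, can be evaluated by numexpr for large frames). A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 4, 5], "b": [10, 20, 30, 40]})

by_loc = df.loc[df["a"] > 3]   # mask computed eagerly as a boolean Series
by_query = df.query("a > 3")   # expression string parsed by pandas

assert by_loc.equals(by_query)
```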


Well, that's kind of the point. What's the purpose of the syntactic sugar? Is it just that, or is there some hidden performance difference? This is not clear at first sight.


The point is to "Huffman encode" the API for expressing near-boilerplate. Like unix command names and flags.

The problem is that there is no simple, logically coherent API to fall back on when you haven't memorized all the shortcuts. And the author only allows "tax form" APIs (what he calls "Pythonic/Pandonic"), where every parameter is a single atomic step, so it's laborious to express things like tree-structured queries that are more complex than parameter dictionaries.


Agree, data.table is the perfect blend of table and dataframe syntax for scientists. It's also performant.



