
Can you please elaborate on what's wrong with Pandas? I'm looking to use either Polars or Pandas in a project and am looking for insights.





Wes McKinney (author of Pandas, but also co-author of a host of other major data formats/tools like Ibis, Arrow, Parquet, Feather) wrote "10 Things I Hate About pandas":

https://wesmckinney.com/blog/apache-arrow-pandas-internals/

Pandas was one of his earlier projects, and it "stuck". It was one of the first popular dataframe libraries for Python and it filled that niche for years (alongside dplyr/data.table in R), especially during the big data/data science craze of the 2010s. Tons have been written on it, data scientists and other data folk were brought up on it, so it's the de facto standard in many pipelines.

Since then, we have moved on to better tools like Polars and DuckDB. You will still see a lot of Pandas code in the wild due to how prevalent it is, but if you were to start a new project, you might want to use more modern tools.

Pandas is kind of analogous to jQuery -- jQuery was hugely influential during its time, but we have learned a lot since, and there are more modern options (React/Vue/Svelte).


Pandas has a rather unnatural API. For instance you have to name the DataFrame anytime you want to pull a column out of it, so filtering is e.g., `my_df[(my_df['col1'] == value) & (my_df['col2'] + my_df['col3'] > 42)]`. The indexing is also kind of a mess — there are like seven different styles. Row-wise mapping is a huge pain. And of course there is no optimization; it's all computed eagerly so you are responsible for your own optimizations.
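To make the filtering complaint concrete, here is a minimal sketch. The data and the value of `value` are made up for illustration; only the column names come from the expression above. It also shows `DataFrame.query`, which is one existing way to avoid repeating the DataFrame name, at the cost of putting the expression in a string.

```python
import pandas as pd

# Hypothetical data; only the column names come from the comment above.
my_df = pd.DataFrame({
    "col1": ["a", "b", "a", "a"],
    "col2": [10, 40, 30, 1],
    "col3": [5, 5, 20, 2],
})
value = "a"  # assumed filter value

# Every column reference has to repeat the DataFrame's name.
mask = (my_df["col1"] == value) & (my_df["col2"] + my_df["col3"] > 42)
filtered = my_df[mask]

# DataFrame.query avoids the repetition, but the expression becomes a string;
# "@value" refers to the local Python variable.
same = my_df.query("col1 == @value and col2 + col3 > 42")

print(filtered)
```

Both forms select the same rows here; which one reads better is a matter of taste.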

Polars, on the other hand, lets you refer to columns like `col('col1')`, which saves a lot of typing if your DataFrame has a long name. It has no row indexing; it's all done by filtering, which is conceptually very simple. Row-wise mapping is trivial. And there is an optimizer that runs before execution.

But more than that, polars has a very fluent API, whereas pandas relies heavily on statements that can't be chained; it really breaks the flow.


I don't find pandas intuitive (API simplicity), then you have the hard to debug issues and perf

Yes, agreed. The API is a big inconsistent kludge, has many warts, and generally requires too much typing and memorization. The performance is subpar. There are some very annoying design choices wrt. implicit type coercion that don't jibe with my personal preferences, which caused me recurring grief.
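One commonly cited example of that coercion (not necessarily the only kind that bites): with the default NumPy-backed dtypes, an integer column silently becomes float as soon as a missing value appears. The series below is made up for illustration.

```python
import pandas as pd

# Hypothetical integer IDs.
ids = pd.Series([1, 2, 3])

# Reindexing introduces a label with no data, which becomes NaN.
# NaN is a float, so the whole column is silently upcast to float64.
with_gap = ids.reindex([0, 1, 2, 3])

# The nullable "Int64" extension dtype keeps integers and uses pd.NA instead.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(ids.dtype, with_gap.dtype, nullable.dtype)
```

The nullable dtypes are opt-in, so the float upcast is what you get by default.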

And to engage in some light gatekeeping, there sure is a lot of terrible pandas code out there written by people that have no business calling themselves programmers. I fully realize this can happen anywhere, but I'm never excited anymore to read a line of pandas.


Related: Ibis (a portable Python dataframe library) dropping the pandas backend in favor of DuckDB for better performance and compatibility. [1]

--

1: https://news.ycombinator.com/item?id=41389806


Thank you everyone (all sibling posts)! Very useful information.

Not the OP.

The tl;dr is that Polars is faster and has cleaner syntax than Pandas. That said, there's more information (books, videos, discussion boards) and example code for Pandas. If you've got a small project then Pandas does fine. I use Pandas but it's not my day job. If it was, I'd probably switch to Polars.



