
Migrating from pandas to polars took some code that reads in a bunch of data and performs a few rounds of pretty standard operations (groupby, filtering, calculating means/stdevs) from ~1 minute per dataset down to ~1 second (yes, I tried the Arrow backend for pandas too). This was after spending some time profiling the pandas code and fixing up the slowest parts as best I could. The translation was pretty straightforward. The pipeline's output was a few different dataframes (each to be inserted into a separate table), each produced by its own function, so I was able to migrate one function at a time, asserting that the outputs of the two versions were identical and that all relevant tests passed (I used `to_pandas()` where needed).

I'm not sure how much faster I could go, since ~1 second/dataset was enough to answer some questions I had that required scanning values for a few parameters. The biggest wins for me were in grouping and merging operations.

I'm a complete convert now. The API is simpler and more obvious IMO, and the ability to compose expressions (`polars.Expr`) is awesome. The performance benefits are nice and what motivated me in the first place, but I'm more swayed by the aforementioned benefits.



Running on a very high core count server? Polars is definitely faster in single-threaded applications, but not 60x faster unless the work isn't comparable. Are you reading from parquet and only operating on some columns? That could also be it.

But yeah, polars is awesome, I'm all in on it.


I'm not including parsing time; both the pandas and polars versions started from an in-memory data structure parsed from two XML files (low GB range). This is on my workstation with a single Xeon 4210 (10 cores, 20 threads @ 2.20-3.20 GHz).

Perhaps I can focus on a subset of this processing and write it up, since it seems like there's at least some interest in real examples. As pointed out in a reply to a sibling comment, I don't guarantee that my starting code is the best that pandas can do -- to be honest, the runtime of the original code did not line up with my intuition of how long these operations should take. Maybe someone will school me, but either way, switching to polars was a relatively easy win that came with other benefits, and it feels right to me in a way that pandas never did.


Is polars not parallelizing some ops on the GPU?


It has zero GPU support for now.


Important point.

Nowadays, we write a pure pandas version, and when the data needs to be 100X bigger and faster, change almost nothing and have it run on the GPU via cudf, a GPU runtime that fully follows the pandas API. Most recently, we ported GFQL (Cypher graph queries on dataframes) to GPU execution over the holiday weekend, and it already beats most Cypher implementations. Think billions of edges traversed per second on a cheap 5-year-old GPU.

We're planning the bigger-than-memory & multi-node versions next, for both CPU + GPU, and while cudf leans towards dask_cudf, plans are still TBD. Polars, Ray, and Dask all have sweet spots here.


According to GitHub, 90% of pandas' codebase is written in Python, which probably means there's a lot of language overhead during operations compared to the Rust code in polars.

That, plus parallelism, probably explains the performance difference. If anything, 60x sounds conservative to me.


I think with parallelism that difference is realistic, but definitely not in single-core performance; most of pandas is implemented in numpy, which should be pretty fast.


Bloody hell!! Thanks, that's exactly the kind of comment I was hoping to see. Sounds like a bit of an Apache --> Nginx moment for dataframes. Super cool!!


To add some balance:

- I can't rule out that a pandas wizard could have achieved the same speed-up in pandas

- polars code was slightly more verbose. For example, when calculating columns based on other columns in the same chain: in pandas, each new column can be defined as a kwarg in a single call to `assign`, whereas in polars, columns that depend on others must be defined in their own calls to `with_columns`

- handling of categoricals in polars seemed a little underbaked, though my main complaint, that categories cannot be pre-defined, seems to have been recently addressed: https://github.com/pola-rs/polars/issues/10705

- polars is not yet 1.0, breaking changes will happen


Regarding your second point, you can use the walrus operator to retain the results of a computation within a single `.with_columns()` call. See https://stackoverflow.com/a/77609494

Edited to add: also, if you're using a lazy dataframe, you can just naively write the same operation twice (once to store it in a named column and once again in the subsequent computation), and Polars will use common subexpression elimination (CSE) to avoid recomputing the result. You can verify this using the `.explain()` method of a lazy dataframe operation containing the `.with_columns()` call.


That's awesome, thanks for sharing! Though tbh I'm not likely to use it... it's a bit too magical, if still a delicious hack.


I just edited my comment above to add more info about common subexpression elimination. It’s magic that happens behind your back on lazy dataframes. Polars is great!




