A bird's eye view of Polars (pola.rs)
197 points by rbanffy 9 months ago | 97 comments



Do people actually use Rust for data science? I mean this literally and not as a snarky retort. I dabble in data science in personal projects but I have little concept of the field in general. Polars seems cool, I just haven't seen Rust used as a general-purpose data science language like R or Python, which is what a dataframe library seems to be about. As opposed to running some demanding calculations in C or C++ for performance. It seems like Rust might fit better in the latter, but again, that's not what a dataframe lib suggests to me.


Polars has a python front end, which I have used. All the work happens in Rust but the queries can be specified in python. The data is stored using Apache Arrow format so there is no copy required for the same data to be accessible in both Rust and python.
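Roughly, from the Python side, it looks like this (a hedged sketch with made-up column names):

  import polars as pl

  df = pl.DataFrame({"user": ["a", "b", "a"], "amount": [10, 20, 30]})

  out = (
      df.lazy()
      .group_by("user")
      .agg(pl.col("amount").sum().alias("total"))
      .collect()  # the optimized plan executes in the Rust engine
  )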


Oh, that's cool. I did not realize this. I realize this isn't novel but I do not really understand how a library written in one language is used by another - some sort of bindings that the library handles I guess.


The magic of well-defined APIs! If you're interested in mixing different DS backends and kernels in a single notebook, check out Quarto:

[1]: https://quarto.org/


It looks like polars is using PyO3, which are Rust bindings for Python. Python's reference implementation is in C, so I imagine it's interacting with that API [0] through FFI. Common Python extension modules (as these are called) are compiled as dynamically linked libraries and binaries (or compilation instructions) are included in Python wheels.

[0] https://docs.python.org/3/extending/extending.html


Both R and Python's pandas use code written in C and Fortran to do the actual calculations when you ask them to manipulate data.


Polars is a Python library as well as a Rust library, and most of its use comes through the Python library.


I don’t think anyone is doing any explorative analytics using rust.

But I do imagine someone out there puts on a data engineer hat and starts rewriting Python calls to polars into pure Rust code, compiling and then deploying.


I'm working on a project that leans heavily on matrix vectorization and large dirty datasets. I'm not a scientist, though.

Here's how I work.

* Hack together a PoC with python, sed, awk, grep, cut, xsv etc.

* Clean that up, run it on larger sample sets (samples made with said sed/awk/cut etc)

* Attempt to run it on the full dataset.

* Rewrite it in rust.

Steps 2 and 3 are hit-or-miss in Python. I find it near impossible to do any refactoring without static types and/or tests. And quite often I'm an hour into a run when it crashes on that one broken line. Whereas the same happens in the Rust version within seconds: crucial for my trial-and-error style of building.

So: Python because I must; Rust as soon as it's clear what I'm going to do.


Doesn't Python support optional static types these days?


It supports type hinting. Which helps, but is far from the tool (crutch) that I need when refactoring. Still far too much guessfactoring and not the confident "it compiles, tests are green, it's Friday: let's deploy!" that Rust (or Java, or even TypeScript) offers me.


It's got type annotations and mypy has a discussion about it here as well: https://github.com/python/mypy/issues/1282
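For example, something along these lines (a hedged sketch; mypy checks the annotations statically, Python won't enforce them at runtime):

  import polars as pl

  def add_total(df: pl.DataFrame, price_col: str, qty_col: str) -> pl.DataFrame:
      # mypy/pyright can verify these hints; the interpreter itself ignores them
      return df.with_columns((pl.col(price_col) * pl.col(qty_col)).alias("total"))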


Yes. What's cool is in Rust you have direct access to polars, so you can do all the low level munging and computations (and/or read/write the data to/from arrow directly from rust if need be) in Rust and return dataframes directly to Python. The front end is still Python, of course, but pyo3 makes it pretty trivial.


Rust bindings for Python: https://github.com/PyO3/pyo3


The first code snippet in the provided link is actually Python


I’ve been moving over to use polars more often for my data work. It’s much faster than pandas at things like imputing millions of lines. It’s also a little more intuitive, and you don’t waste time pissing around with indexes every time you transform a dataframe.

Polars' biggest downfall is that pandas/matplotlib are so ubiquitous in data science and Polars just plays so differently from pandas, including using hvplot as its default plotting package, etc. It really is trying to shape much of its ecosystem exactly how it wants in order to maximize productivity, speed, etc. This may slow down its adoption, but hopefully it will push others in a better direction.


Suggest the following pattern:

1. Load and process / aggregate in Polars to get the smaller dataset that goes into your plot.

2. df.to_pandas()

3. Apply your favourite vis library that works with pandas.
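A rough sketch of that pattern (file and column names invented):

  import polars as pl

  # 1. aggregate the big dataset in Polars
  small = (
      pl.scan_parquet("events.parquet")
      .group_by("day")
      .agg(pl.col("value").mean().alias("avg_value"))
      .sort("day")
      .collect()
  )

  # 2. hand the small result to pandas
  pdf = small.to_pandas()

  # 3. plot with whatever pandas-friendly library you prefer
  pdf.plot(x="day", y="avg_value")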

There's no use case I can think of where building a data-viz interface more specific to Polars than this is beneficial or necessary.


Pandas is the best interface for plotting with matplotlib (better than matplotlib itself).

What issues are you running into with the Python ecosystem?

I ask because I'm in the middle of writing Effective Polars, and my experience is that many things like Xgboost, matplotlib, etc work fine with Polars. (Sadly/oddly/ironically some libraries now have issues with Pandas 2 pyarrow types but work with Polars.)


> Polars' biggest downfall is that pandas/matplotlib are so ubiquitous in data science and Polars just plays so differently from pandas, including using hvplot as its default plotting package, etc.

huh? Polars doesn't have a default plotting package, it's always something you add. matplotlib supports polars out of the box through the dataframe exchange protocol, which is ubiquitous enough due to the proliferation of other dataframe libraries (dask, vaex, modin/ray) that you really get about every other tool in the ecosystem with polars for free.


Polars can now leverage hvplot directly.

https://docs.pola.rs/user-guide/misc/visualization/
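If I understand the docs right, it looks roughly like this (assuming hvplot is installed; made-up data):

  import polars as pl

  df = pl.DataFrame({"length": [1.0, 2.0, 3.0], "width": [0.5, 1.1, 1.8]})

  # df.plot dispatches to hvplot under the hood
  df.plot.scatter(x="length", y="width")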


I absolutely love Polars, and I use the Rust crate all the time!

I think my only gripe is that the Rust API seems (to me at least) to be less well documented than the Python API. I guess Python is the de facto data science language, so maybe that explains it.


Likewise, I would prefer to stay in Rust when developing command line apps that might do DataFrame thingies and would love more examples of how to transform DataFrames in pure Rust.


I have written several rust/python apis now and I think what you’re feeling is the result of the api being designed for Python as well as being primarily tested in Python.

No matter the target, I test rust things using python. idk. Food for thought.


Curious why you use the Rust API over the Python API? I'm just wondering if you've also explored the datafusion crate, because I think on the rust side it's about equivalently ergonomic to use either and Datafusion seems a little more modular/pluggable. The Polars python API is definitely miles above the datafusion python API.


Oh, I actually had not yet heard of the DataFusion crate! Thanks for sharing!

And to answer your question, we use the Rust API because all of our backend services are in Rust, and we like to stay in Rust whenever possible.


With GPT4 helping with the refactor, there's no reason to start migrating code away from Pandas imo. A lot of people say they think Pandas is fast enough for their needs, but you're literally getting a 95% speed improvement for free.

This is a huge difference in productivity, especially when running code and doing a lot of slicing in notebooks.


Polars is an immense project, and I hope it continues to gain traction. But there's lots more factors than just speed.

The main one in my team is ubiquity, i.e. lots of people know pandas who might not be traditional "developers": data scientists, data analysts, etc. Having a data scientist put together some code, an engineer optimize it, and the two talk back and forth about the same code is a massive benefit.

Shifting to polars (and keeping that ability to collaborate) would require not just training the engineers to use a new framework, but all the analysts, data scientists etc that they are adjacent to. That's a huge business cost, and in a lot of cases it might be worth it. But I wouldn't describe it as "getting a 95% speed increase for free".


While that's fair, it's fairly easy to fit it into only the most intensive operations and then seamlessly convert back to a pandas dataframe.

I understand why you wouldn’t do this on an organizational level for production workflows, but for personal workflows in my opinion, it’s a no-brainer to incrementally learn and adopt it.


Where I've seen resistance to polars in Python land, it's been either "Pandas is already used/standard and people understand/know it" or "If you really need speed, you can probably do it in numpy which should be faster again".


I think you meant to say there's no reason not to start migrating? Otherwise I can't parse how the rest of your post matches up with your conclusion.


What does GPT-4 have to do with this?


It's easier than ever to do a drop-in replacement for data workflows. The whole decision-making process for migrating between libraries like this comes down to how much time investment it's going to take and how much it's going to pay off.


Not a good example - polars is new and has changed so much in the past couple of years that GPT-4 often gives outdated code for it.


It's still pretty good if you post documentation as context or use Phind.


> So, what is Polars? A short description would be “a query engine with a DataFrame frontend”.

Guess that would help if I already knew what a dataframe is!


Logically, it's a SQL table or spreadsheet, where each column can be of a different type: a column of usernames (strings), their ages (integers), etc.

Every row is a distinct entity. Every column is usually stored / treated as its own distinct array.
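A tiny made-up example of that shape:

  import polars as pl

  df = pl.DataFrame({
      "username": ["alice", "bob", "carol"],  # strings
      "age": [34, 28, 41],                    # integers
  })

  # typical work operates on whole columns at once
  over_30 = df.filter(pl.col("age") > 30)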


So it's a sort of an embedded relational database table, but live, and maybe with some bells and whistles ? Sounds like a triumph of marketing. What am I missing ?


An earlier thread on Polars had this as the #1 thread (so the same complaint) - https://news.ycombinator.com/item?id=38920043


Another bit of context: it's "polars" (like polar bears) because the popular python dataframe library is "pandas".

My first assumption was that it was somehow related to polar coordinates.


If you don't know what a dataframe is, it doesn't matter what Polars is then, not for you.


A dataframe is a potentially useful concept even for those who haven't heard of it. And Polars can be a reasonable introduction to it, too.


With that attitude I'd never learn anything new


Go learn about Dataframes then. There's no reason an article like this should start with a Dataframes 101.


Would be quicker for you to just search for the word "dataframe" than to post on Hacker News that you don't know what a dataframe is.


It's useful feedback for whoever wrote the article, that there are terms they need to not take for granted.


But articles shouldn't have to teach readers all the background needed. Otherwise even a well written article would get a lot of bloat. Even linking other guides would require some context, and the total breadth of knowledge could be quite high. I could agree if there was to be a brief "areas you should know about" with a few keywords.

There are some areas where Hacker News might uniquely shine in answering questions, but "what are dataframes" isn't one of them. If someone sees "dataframe" and is confused, they have all the opportunity to search up "dataframes 101" and get up to speed.


“Fancy array-of-associative-arrays with their own set of non-Python-native field data types and a lot of helper methods for sorting and import and export and such, but weirdly always missing any helpers that would be both hard to write and very useful”

“SQLite but only for Python and worse. Kind of.”

“That annoying transitional step you have no direct need for but that you have to do anyway, because all data-wrangling tools in Python assume a dataframe”

(I use this stuff daily)

[edit] “Imagine if you could read in from a database or CSV and then work with the returned rows directly and as if they were still a table/spreadsheet. Except by ‘directly’ I mean ‘after turning it into a dataframe and discarding/fucking-up all your data type information’. And then with another fucking-everything-up step if you want to write it out so you can do real stuff with it elsewhere.”


Is DuckDB a better tool for your purposes?

Dataframes are popular among people who don't intend to do anything more with the processed data after they've reported their findings.


I've been testing polars and duckdb recently. Polars is an excellent dataframe package: extremely memory efficient and fast. I've experienced some issues with hive-partitioning of a large S3 dataset (in Polars) which DuckDB doesn't have. I've scanned multi-TB S3 parquet datasets with DuckDB on my M1 laptop, executing some really hairy SQL (stuff I didn't think it could handle): window functions, joins to other parquet datasets just as large, etc. Very impressive software. I haven't done the same types of things in Polars yet (simple selects).
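For flavour, the rough shape of that kind of query (bucket, paths and columns are all invented; credentials setup omitted):

  import duckdb

  con = duckdb.connect()
  con.execute("INSTALL httpfs")
  con.execute("LOAD httpfs")  # S3 support

  # hive-partitioned parquet on S3, window function + join, all pushed into DuckDB
  result = con.sql("""
      SELECT e.user_id,
             e.amount,
             SUM(e.amount) OVER (PARTITION BY e.user_id ORDER BY e.ts) AS running_total
      FROM read_parquet('s3://my-bucket/events/*/*.parquet', hive_partitioning = true) AS e
      JOIN read_parquet('s3://my-bucket/users/*.parquet') AS u USING (user_id)
  """)
  # materialise with result.df() or result.pl() as needed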


Hmm, how are you getting acceptable performance doing that? Is your data in a delta table or something?

I've actually personally found that DuckDB is tremendously slow against the cloud, though perhaps I'm going through the wrong API?

I'm using https://duckdb.org/docs/guides/import/s3_import.

My data is hive partitioned, when I monitor my network throughput, I only get a few MB/s with DuckDB but can achieve 1-2GB/s through polars.

Very possible it's a case of PEBKAC though.


Also, duckdb allows you to convert scan results directly to both pandas and Polars dataframes, so you can mix and match based on need.
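e.g. something like this (trivial query just to show the conversion):

  import duckdb

  rel = duckdb.sql("SELECT 1 AS a, 'x' AS b")

  pandas_df = rel.df()  # pandas DataFrame
  polars_df = rel.pl()  # Polars DataFrame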


LOL.

I have been programming for over 30 years on all sorts of systems and the Pandas DataFrame API is completely beyond me. Just trying to get a cell value seems way more difficult than it should be.

Same with xarray datasets.

I just loaded the same CSV into Pandas and Polars and Polars did a much better job of it.


To me, Polars feels like almost exactly how I would want to redesign Pandas interfaces for small- to medium-sized data processing, given my previous experience with Pandas and PySpark. Throw out all the custom multi-index nonsense, throw out numpy and handle types properly, memory-map Arrow, focus on the method-chaining interface, do standard stuff like groupby and window functions in the standard way, and implement all the query optimizations under the hood that we know make stuff way faster.
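Something like this, to sketch what I mean (made-up columns):

  import polars as pl

  df = pl.DataFrame({
      "store": ["a", "a", "b", "b"],
      "day": [1, 2, 1, 2],
      "sales": [10.0, 12.0, 7.0, 9.0],
  })

  # groupby the standard way
  per_store = df.group_by("store").agg(pl.col("sales").mean().alias("avg_sales"))

  # window function the standard way: per-store total broadcast back to each row
  with_share = df.with_columns(
      (pl.col("sales") / pl.col("sales").sum().over("store")).alias("share_of_store"),
  )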

To be fair, Polars has the benefit of hindsight and of designing its interfaces and syntax from scratch. The poor choices in Pandas were made long ago, and its adoption and evolution into the most popular dataframe library for Python feels like it was mostly about timing the market rather than having the best software product.


Polars is at least more consistent & sensible.

The whole thing though is just shockingly unhelpful and half-baked for something that has taken over its niche so completely. I guess my perspective is that of someone trying to build reliable automation, though—it’s probably really nice if you’re just noodlin’ in notebooks or the repl or whatever.


What part of df.loc[index_val, column_val] is hard here?

Oh you meant multiindex... yeah, slicing multiindexes sucks :)


Multi-indexes bake my data, via the horrible pandas object model, into weird tuples. Every gripe I’ve had with pandas starts with trying to do something “simple”, then following the pandas object model to achieve it and overcomplicating things. Polars is awesome; it fits my numpy understanding of dataframes as dicts of labeled arrays. I even like the immutability aspect.


Accessing an individual cell value is slightly clunky, but that's not really where you use a DataFrame. A DataFrame is an object for when the entirety of the dataset is under study, where you are typically interested in the broad distributions contained within the data.

After you have highlighted trends (the majority of the work), then you might go spelunking at individual examples to see why something is funny.


i mean, if you are reading and writing csv then yeah, you've already fucked up.


CSV is a terrible format. But it is extensively used. See also:

Why isn’t there a decent file format for tabular data? https://news.ycombinator.com/item?id=31220841


parquet is perfectly fine

> There is Parquet. It is very efficient with its columnar storage and compression. But it is binary, so can’t be viewed or edited with standard tools, which is a pain.

I can open parquet in excel



Polars allows you to use SQL, I think?
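Something like this, if I remember right (untested sketch):

  import polars as pl

  df = pl.DataFrame({"customer": ["a", "b", "a"], "amount": [10, 20, 30]})

  ctx = pl.SQLContext(orders=df)
  out = ctx.execute(
      "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer",
      eager=True,
  )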


Eh, plenty of SQL-first analysts could find it helpful. Not just for people coming from pandas/r.


But it is such a shame that us plebes can't learn about it to maybe find the hammer to this nail that is sticking up in our projects because the product can't tell people what it is. And it's only made worse by arrogant responses like yours.


It is a data structure used by data scientists. Polars is API compatible with the previous solution, but faster. If you don't analyze data, you can ignore it.


Python framework for tables (arrays with column names).

You can query/filter/sort and pick/transform data in multiple directions (row/column).

It’s used for data mining, ML etc.

It’s basically like working with a spreadsheet/sql table


Note that dataframes aren't Python specific. R has them too, and they were an inspiration for Pandas.


Wide-form data structure, often with an index or partition framework in place, able to apply column- and row-transformations quickly.


I do love Polars, and I constantly use it (I'm a network scientist and usually work with a lot of data). However, it is beyond me why I cannot use it on a M1 Mac. As soon as I import it in python, the latter gets killed with "illegal hardware instruction".


You have the Rosetta version of Python installed, which lacks the SIMD instructions we compile Polars with. Reinstalling Python as a native package should fix this.

I have recently added a warning to Polars for this on import, could you confirm you get this warning if (before installing native Python) you update your Polars package?

If for whatever reason you really want to keep using the Rosetta version of Python you should install the polars-lts-cpu package instead.


FWIW polars isn’t the only package that has this problem. And not all of them have so simple a solution.

IIRC pyarrow has some trouble like that, to pick one from the same ecosystem.


I've found polars quite intuitive, though for Python, I lean more towards [ibis](https://ibis-project.org/). The interface is nearly identical, but ibis has the benefit of building SQL queries before pulling any actual data (like dbplyr), whereas polars requires the data to be in memory (at least for RDBMSs, though correct me if I'm wrong).

This to me seems like a good argument for only using ibis, but I'm happy to be convinced otherwise.


I haven't used it but polars has a lazy api https://docs.pola.rs/user-guide/lazy/using/#using-the-lazy-a...


Polars doesn't require all data to be in memory. It has a lazy API, optimizations that prune data at the scan level and parts of the engine can process data in batches.

Algorithms like joins, group bys, distinct, etc, are designed for out-of-core processing and can spill to disk if available RAM is low.
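For example, roughly (paths and columns invented):

  import polars as pl

  out = (
      pl.scan_parquet("big_dataset/*.parquet")  # nothing is loaded yet
      .filter(pl.col("country") == "NL")        # pruned at the scan level
      .group_by("user_id")
      .agg(pl.col("amount").sum().alias("total"))
      .collect(streaming=True)                  # process in batches rather than all at once
  )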


If it all fits in memory, isn't it simpler to just load it first? Sure, you're kinda painting yourself into a corner, but it can be a big corner. e.g. I know that until recently Amazon still ran a bunch of key recommendation algorithms on a single big machine each night, because it's just simpler to load everything into memory and crank on it.


When reading from the dataframe, depending on the columns, filters, and aggregates you want returned, most query engines can skip data that is irrelevant, making the query run much faster.

In your amazon example the data is probably optimized for fast lookups on a small number of records, like an in-memory cache.


Simpler, but you can avoid huge amounts of work simply by not loading what you don't need. That's the beauty of the scan_* apis in polars. Most query engines have some level of support for IO skipping.


What backend do you use? The little spin I've given Ibis left me feeling it's not deeply supported across a lot of the backends.


Nowadays I'm mostly using duckdb/chdb to process/transform data. I use polars mainly to view results output by duckdb/chdb. I like polars and I always use it over pandas, but I must say that pandas slicing is certainly missed (or just the ability to use slicing, not necessarily exactly like pandas).


Are you thinking of something else, or is this what you want to do in polars?

https://docs.pola.rs/py-polars/html/reference/dataframe/api/...


That just slices rows, but something like df[cols], for example, would be nice (and more intuitive); the slicing (__getitem__) API in Python is quite powerful and I wish Polars would take more advantage of it.


I haven't heard of chdb. Thanks for sharing it.


I have spent probably over 100 hours now fiddling with data using polars and it is just so enjoyable to use. The interface is the real magic here.

This was captured well in their company announcement blogpost [0]:

> A strict, consistent and composable API. Polars gives you the hangover up front and fails fast, making it very suitable for writing correct data pipelines.

[0] https://pola.rs/posts/company-announcement/


There is something I don't get about the Polars DataFrame API.

https://docs.pola.rs/user-guide/migration/spark/

Look at the examples on this page of the Spark vs. Polars DataFrame APIs. (Disclaimer: I contributed this documentation. [1])

Having used SQL and Spark DataFrames heavily, but not Polars (or Pandas, for that matter), my impression is that Spark's DataFrame is analogous to SQL tables, whereas Polars's DataFrame is something a bit different, perhaps something closer to a matrix.

I'm not sure how else to explain these kinds of operations you can perform in Polars that just seem really weird coming from relational databases. I assume they are useful for something, but I'm not sure what. Perhaps machine learning?

[1]: https://github.com/pola-rs/polars-book/pull/113


I have not used spark, but I have written a lot of sql, polars and pandas. I think much more in terms of sql when I write polars than pandas. Do you have any examples of what you are referring to?


The examples I'm referring to are in that page I linked to in my comment above.

Here's one of them:

  # Polars
  df.select(
    pl.col("foo").sort().head(2),
    pl.col("bar").sort(descending=True).head(2),
  )
In SQL and Spark DataFrames, it doesn't make sense to sort columns of the same table independently like this and then just juxtapose them together. It's in fact very awkward to do something like this with either of those interfaces, which you can see in the equivalent Spark code on that page. SQL will be similarly awkward.

But in Polars (and maybe in Pandas too) you can do this easily, and I'm not sure why. There is something qualitatively different about the Polars DataFrame that makes this possible.


Because it's column based vs row based. Definitely can be a bit more of a footgun ("with great power comes great responsibility").

Long story short, the memory model operates on columns of data as opposed to rows, so fields in a conceptual "row" aren't necessarily an atomic unit.
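A concrete (made-up) illustration:

  import polars as pl

  df = pl.DataFrame({"foo": [3, 1, 2], "bar": ["c", "a", "b"]})

  # each expression works on its own column, so the original row pairing
  # between "foo" and "bar" is not preserved in the output
  out = df.select(
      pl.col("foo").sort(),                 # 1, 2, 3
      pl.col("bar").sort(descending=True),  # "c", "b", "a"
  )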


How difficult is it to port pandas code to use polars instead? I use pandas a bunch, but sometimes it can be quite slow, especially with file I/O. Anybody have benchmarks?


Generally requires rewriting the code.

Sadly, AI is quite poor at Polars right now. (It is ok but not great with Pandas).

However, getting pandas data into Polars is easy. If you already have the code and it is not a bottleneck, I would just wrap it.

(Disclosure: currently writing a book on Polars and have a big chapter on porting hairy pandas code to Polars.)


Mostly minor API differences. Methods with slightly different (usually better) names, sometimes slightly different behavior. Usually easy to wrap in a little function to handle the transition. Sometimes a property becomes a method. That kind of thing. A handful of nice-to-have utility things in Pandas aren’t in Polars. Some basic objects or datastructures are pretty different (print a dataframe’s dtypes property in both for an example)


Big disagree that it's minor - the polars API is completely different from pandas.



A tangent, from someone still doing most of their mangling the old-fashioned way: SQL used to extract stuff into an old-fashioned pandas df in a notebook... but I'm wondering, and HN is a wonderful place to ask :)

How does Polars' SQL context stack up against alternatives, e.g. DuckDB? If I'm in a notebook and I want to suck in and process a lot of data, which has the least boilerplate, the strongest support, and the most efficiency (both RAM usage and speed)?


Here is a head-to-head comparison about efficiency: https://www.youtube.com/watch?v=wKH0-zs2g_U

"Strongest support" is probably Pandas, in that it is very widely used and easy to get help with. DuckDB lets you write SQL and is very fast.


That comparison is heavily outdated, as in that benchmark we were IO-bound on downloading.

Since then Polars has improved downloading speeds 20x by shipping a proper async runtime in the engine.


I grumbled previously about the lack of a document like this, so well done everyone.


That lazily evaluated API is very similar to Spark. Probably data engineers will tend to move to polars for smaller workloads, while keeping spark for multi-machine computing.



