Do people actually use Rust for data science? I mean this literally and not as a snarky retort. I dabble in data science in personal projects, but I have little concept of the field in general. Polars seems cool; I just haven't seen Rust used as a general-purpose data science language like R or Python, which is what a dataframe library seems to be about, as opposed to running some demanding calculations in C or C++ for performance. It seems like Rust might fit better in the latter role, but again, that's not what a dataframe lib suggests to me.
Polars has a Python front end, which I have used. All the work happens in Rust, but the queries can be specified in Python. The data is stored in the Apache Arrow format, so no copy is required for the same data to be accessible in both Rust and Python.
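Roughly, that division of labor looks like this from the Python side (a minimal sketch; the column names are invented):

    import polars as pl

    df = pl.DataFrame({"user": ["a", "b", "a"], "value": [1, 2, 3]})

    # The expression below is just a description of work; the filtering itself
    # runs in the Rust engine.
    out = df.filter(pl.col("value") > 1)

    # Because the memory layout is Apache Arrow, handing the result to other
    # Arrow-aware tools is typically a zero-copy operation.
    arrow_table = out.to_arrow()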
Oh, that's cool, I did not realize that. I know this isn't novel, but I don't really understand how a library written in one language is used by another - some sort of bindings that the library handles, I guess.
It looks like Polars is using PyO3, which provides Rust bindings for Python. Python's reference implementation is in C, so I imagine it's interacting with that API [0] through FFI. Python extension modules (as these are called) are commonly compiled as dynamically linked libraries, and the binaries (or build instructions) are shipped in Python wheels.
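From the Python side, such an extension module is just another import; here's a quick way to see that the heavy lifting lives in a compiled shared library inside the wheel (a sketch; exact file names vary by platform and version):

    import polars

    # Prints the path of the pure-Python wrapper package (__init__.py).
    # The PyO3-compiled Rust code is shipped next to it in the installed wheel
    # as a shared library (.so on Linux/macOS, .pyd on Windows) that CPython
    # loads through its C API.
    print(polars.__file__)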
I don't think anyone is doing any exploratory analytics in Rust.
But I do imagine someone out there puts on their data engineer hat and starts rewriting Python calls to Polars into pure Rust code, then compiles and deploys.
I'm working on a project that leans heavily on matrix vectorization and large dirty datasets. I'm not a scientist, though.
Here's how I work.
* Hack together a PoC with python, sed, awk, grep, cut, xsv etc.
* Clean that up, run it on larger sample sets (samples made with said sed/awk/cut etc)
* Attempt to run it on the full dataset.
* Rewrite it in rust.
Steps 2 and 3 are hit-or-miss in Python. I find it near impossible to do any refactoring without static types and/or tests. And quite often, I'm looking at a run of over an hour only to have it crash on that one broken line. The same failure shows up in the Rust version in seconds: crucial for my trial-and-error style of building.
So: Python because I must; Rust as soon as it's clear what I'm going to do.
It supports type hinting, which helps, but that's far from the tool (crutch) that I need when refactoring. Still far too much guess-factoring and not the confident "it compiles, tests are green, it's Friday: let's deploy!" that Rust (or Java, or even TypeScript) offers me.
Yes. What's cool is that in Rust you have direct access to Polars, so you can do all the low-level munging and computation (and/or read/write the data to/from Arrow directly, if need be) in Rust and return dataframes directly to Python. The front end is still Python, of course, but PyO3 makes it pretty trivial.
I've been moving over to use Polars more often for my data work. It's much faster than pandas at things like imputing millions of rows. It's also a little more intuitive, and you don't waste time pissing around with indexes every time you transform a dataframe.
Polars' biggest downfall is that pandas/matplotlib are so ubiquitous in data science and Polars just plays so differently from pandas, including using hvplot as its default plotting package, etc. It really is trying to build much of its ecosystem exactly how it wants in order to maximize productivity, speed, etc. This may slow down adoption, but hopefully it will push others in a better direction.
1. load and process / aggregate in polars to get the smaller dataset that goes into your plot.
2. df.to_pandas()
3. apply your favourite vis library that works with pandas.
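A minimal sketch of that three-step recipe (file and column names invented):

    import polars as pl

    # 1. aggregate down to the small result that actually goes into the plot
    small = (
        pl.scan_parquet("events.parquet")
          .group_by("day")
          .agg(pl.col("revenue").sum())
          .collect()
    )

    # 2. hand the small result over to pandas
    pdf = small.to_pandas()

    # 3. use any pandas-friendly plotting library from here
    pdf.plot(x="day", y="revenue")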
There's no use case I can think of where building a data viz interface more specific to Polars than this is beneficial or necessary.
Pandas is the best interface for plotting with matplotlib (better than matplotlib itself).
What issues are you running into with the Python ecosystem?
I ask because I'm in the middle of writing Effective Polars, and my experience is that many things like Xgboost, matplotlib, etc work fine with Polars.
(Sadly/oddly/ironically some libraries now have issues with Pandas 2 pyarrow types but work with Polars.)
> Polars' biggest downfall is that pandas/matplotlib are so ubiquitous in data science and Polars just plays so differently from pandas, including using hvplot as its default plotting package, etc.
Huh? Polars doesn't have a default plotting package; it's always something you add. matplotlib supports Polars out of the box through the dataframe interchange protocol, which is ubiquitous enough, thanks to the proliferation of other dataframe libraries (Dask, Vaex, Modin/Ray), that you really get about every other tool in the ecosystem with Polars for free.
I absolutely love Polars, and I use the Rust crate all the time!
I think my only gripe is that the Rust API seems (to me at least) to be less well documented than the Python API. I guess Python is the de facto data science language, so maybe that explains it.
Likewise, I would prefer to stay in Rust when developing command line apps that might do DataFrame thingies and would love more examples of how to transform DataFrames in pure Rust.
I have written several Rust/Python APIs now, and I think what you're feeling is the result of the API being designed for Python as well as being primarily tested in Python.
No matter the target, I test Rust things using Python. idk. Food for thought.
Curious why you use the Rust API over the Python API? I'm just wondering if you've also explored the datafusion crate, because I think on the rust side it's about equivalently ergonomic to use either and Datafusion seems a little more modular/pluggable. The Polars python API is definitely miles above the datafusion python API.
With GPT-4 helping with the refactor, there's no reason not to start migrating code away from Pandas imo. A lot of people say they think Pandas is fast enough for their needs, but you're literally getting a 95% speed improvement for free.
This is a huge difference in productivity, especially when running code and doing a lot of slicing in notebooks.
Polars is an immense project, and I hope it continues to gain traction. But there's lots more factors than just speed.
The main one in my team is ubiquity, i.e. lots of people know pandas who might not be traditional "developers": data scientists, data analysts, etc. Having a data scientist put together some code, having it optimized by an engineer, and having them talk back and forth about the same code is a massive benefit.
Shifting to Polars (and keeping that ability to collaborate) would require training not just the engineers to use a new framework, but all the analysts, data scientists, etc. that they are adjacent to. That's a huge business cost, and in a lot of cases it might be worth it. But I wouldn't describe it as "getting 95% speed increase for free".
While that's fair, it's fairly easy to fit it into only the most intensive operations and then seamlessly convert back to a pandas dataframe.
I understand why you wouldn’t do this on an organizational level for production workflows, but for personal workflows in my opinion, it’s a no-brainer to incrementally learn and adopt it.
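In practice that incremental adoption can be as small as wrapping one hot spot, something like this sketch (file and column names invented):

    import pandas as pd
    import polars as pl

    pdf = pd.read_csv("big.csv")          # existing pandas workflow

    out = (
        pl.from_pandas(pdf)               # hop into Polars for the heavy part
          .group_by("key")
          .agg(pl.col("value").mean())
          .to_pandas()                    # and straight back to pandas
    )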
Where I've seen resistance to Polars in Python land, it's been either "pandas is already used/standard and people understand/know it" or "if you really need speed, you can probably do it in numpy, which should be faster again".
It's easier than ever to do a drop-in replacement for data workflows. The whole decision-making process for migrating between libraries like this comes down to how much time the migration will take and how much it will pay off.
So it's a sort of embedded relational database table, but live, and maybe with some bells and whistles? Sounds like a triumph of marketing. What am I missing?
But articles shouldn't have to teach readers all the background needed; otherwise even a well-written article would get a lot of bloat. Even linking other guides would require some context, and the total breadth of knowledge could be quite high. I could agree if there were a brief "areas you should know about" with a few keywords.
There are some areas where Hacker News might uniquely shine in answering questions, but "what are dataframes" isn't one of them. If someone sees "dataframe" and is confused, they have all the opportunity to search up "dataframes 101" and get up to speed.
“Fancy array-of-associative-arrays with their own set of non-Python-native field data types and a lot of helper methods for sorting and import and export and such, but weirdly always missing any helpers that would be both hard to write and very useful”
“SQLite but only for Python and worse. Kind of.”
“That annoying transitional step you have no direct need for but that you have to do anyway, because all data-wrangling tools in Python assume a dataframe”
(I use this stuff daily)
[edit] “Imagine if you could read in from a database or CSV and then work with the returned rows directly and as if they were still a table/spreadsheet. Except by ‘directly’ I mean ‘after turning it into a dataframe and discarding/fucking-up all your data type information’. And then with another fucking-everything-up step if you want to write it out so you can do real stuff with it elsewhere.”
Dataframes are popular among people who don't intend to do anything more with the processed data after they've reported their findings.
I've been testing Polars and DuckDB recently. Polars is an excellent dataframe package: extremely memory efficient and fast. I've experienced some issues with hive-partitioning of a large S3 dataset that DuckDB doesn't have. I've scanned multi-TB S3 parquet datasets with DuckDB on my M1 laptop, executing some really hairy SQL (stuff I didn't think it could handle): window functions, joins to other parquet datasets just as large, etc. Very impressive software.
I haven't done the same types of things in Polars yet (simple selects).
I have been programming for over 30 years on all sorts of systems and the Pandas DataFrame API is completely beyond me. Just trying to get a cell value seems way more difficult than it should be.
Same with xarray datasets.
I just loaded the same CSV into Pandas and Polars and Polars did a much better job of it.
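For what it's worth, here's roughly what single-cell access looks like in each (a sketch; the Polars item accessor with row/column arguments is a fairly recent addition):

    import pandas as pd
    import polars as pl

    pdf = pd.DataFrame({"a": [1, 2, 3]})
    pdf.loc[1, "a"]       # pandas, label-based
    pdf.iloc[1, 0]        # pandas, position-based

    pldf = pl.DataFrame({"a": [1, 2, 3]})
    pldf[1, "a"]          # Polars, row position + column name
    pldf.item(1, "a")     # Polars, explicit scalar accessor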
To me, Polars feels like almost exactly how I would want to redesign the pandas interfaces for small-to-medium-sized data processing, given my previous experience with Pandas and PySpark. Throw out all the custom multi-index nonsense, throw out numpy and handle types properly, memory-map Arrow, focus on a method-chaining interface, do standard stuff like group-bys and window functions in the standard way, and implement all the query optimizations under the hood that we know make stuff way faster.
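For the record, that chained style looks roughly like this (a sketch with invented column names; older Polars releases spell group_by as groupby):

    import polars as pl

    df = pl.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 5]})

    # standard group-by aggregation
    summary = df.group_by("store").agg(pl.col("sales").sum())

    # window function: each row's share of its store's total sales
    with_share = df.with_columns(
        (pl.col("sales") / pl.col("sales").sum().over("store")).alias("share")
    )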
To be fair, Polars has the benefit of hindsight, designing its interfaces and syntax from scratch. The poor choices in Pandas were made long ago, and its adoption and evolution into the most popular dataframe library for Python feels like it was mostly about timing the market rather than having the best software product.
The whole thing, though, is just shockingly unhelpful and half-baked for something that has taken over so completely in its niche. I guess my perspective is that of someone trying to build reliable automation, though—it's probably really nice if you're just noodlin' in notebooks or the repl or whatever.
Multi-indexes bake my data into the horrible pandas object model as weird tuples. Every gripe I've had with pandas starts with trying to do something "simple", then following the pandas object model to achieve it and overcomplicating things. Polars is awesome; it fits my numpy understanding of dataframes as dicts of labeled arrays. I even like the immutability aspect.
Accessing an individual cell value is slightly clunky, but that's not really what you use a DataFrame for. A DataFrame is an object for when the entirety of the dataset is under study, where you are typically interested in the broad distributions contained within the data.
After you have highlighted trends (the majority of the work), then you might go spelunking at individual examples to see why something is funny.
> There is Parquet. It is very efficient with its columnar storage and compression. But it is binary, so it can't be viewed or edited with standard tools, which is a pain.
But it is such a shame that we plebes can't learn about it and maybe find the hammer for this nail that is sticking up in our projects, because the product can't tell people what it is. And it's only made worse by arrogant responses like yours.
It is a data structure used by data scientists. Polars is API compatible with the previous solution, but faster. If you don't analyze data, you can ignore it.
I do love Polars, and I constantly use it (I'm a network scientist and usually work with a lot of data). However, it is beyond me why I cannot use it on an M1 Mac. As soon as I import it in Python, the latter gets killed with "illegal hardware instruction".
You have the Rosetta version of Python installed, which lacks the SIMD instructions we compile Polars with. Reinstalling Python as a native package should fix this.
I have recently added a warning to Polars for this on import, could you confirm you get this warning if (before installing native Python) you update your Polars package?
If for whatever reason you really want to keep using the Rosetta version of Python you should install the polars-lts-cpu package instead.
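A quick way to check which interpreter you are actually running (just a sketch):

    import platform

    # 'arm64' means a native Apple Silicon Python; 'x86_64' means you are
    # running under Rosetta, in which case reinstall a native Python or use
    # the polars-lts-cpu package mentioned above.
    print(platform.machine())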
I've found Polars quite intuitive, though for Python I lean more towards [ibis](https://ibis-project.org/). The interface is nearly identical, but ibis has the benefit of building SQL queries before pulling any actual data (like dbplyr) — whereas Polars requires the data to be in memory (at least for relational databases, though correct me if I'm wrong).
This to me seems like a good argument for only using ibis, but I'm happy to be convinced otherwise.
Polars doesn't require all data to be in memory. It has a lazy API, optimizations that prune data at the scan level and parts of the engine can process data in batches.
Algorithms like joins, group bys, distinct, etc, are designed for out-of-core processing and can spill to disk if available RAM is low.
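A sketch of what that looks like in practice (file and column names invented; the exact streaming flag has changed names across Polars versions):

    import polars as pl

    result = (
        pl.scan_parquet("data/*.parquet")        # lazy: nothing is loaded yet
          .filter(pl.col("country") == "NL")     # pushed down to the scan
          .group_by("user_id")
          .agg(pl.col("amount").sum())
          .collect(streaming=True)               # process in batches, spill if RAM is low
    )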
If it all fits in memory, isn't it simpler to just load it first? Sure, you're kinda painting yourself into a corner, but it can be a big corner. e.g. I know that until recently Amazon still ran a bunch of key recommendation algorithms on a single big machine each night, because it's just simpler to load everything into memory and crank on it.
When reading from the dataframe, depending on the columns, filters, and aggregates you want returned, most query engines can skip data that is irrelevant, making the query run much faster.
In your amazon example the data is probably optimized for fast lookups on a small number of records, like an in-memory cache.
Simpler, but you can avoid a huge amount of work simply by not loading what you don't need. That's the beauty of the scan_* APIs in Polars. Most query engines have some level of support for IO skipping.
Nowadays I'm mostly using DuckDB/chDB to process/transform data. I use Polars mainly to view results output by DuckDB/chDB. I like Polars and I always use it over pandas, but I must say that pandas slicing is certainly missed (or just the ability to use slicing, not necessarily exactly like pandas).
That just slices rows, but something like df[cols], for example, would be nice (and is more intuitive). Python's slicing API (__getitem__) is quite powerful, and I wish Polars would take more advantage of it.
I have spent probably over 100 hours now fiddling with data using polars and it is just so enjoyable to use. The interface is the real magic here.
This was captured well in their company announcement blogpost [0]:
> A strict, consistent and composable API. Polars gives you the hangover up front and fails fast, making it very suitable for writing correct data pipelines.
Look at the examples on this page of the Spark vs. Polars DataFrame APIs. (Disclaimer: I contributed this documentation. [1])
Having used SQL and Spark DataFrames heavily, but not Polars (or Pandas, for that matter), my impression is that Spark's DataFrame is analogous to SQL tables, whereas Polars's DataFrame is something a bit different, perhaps something closer to a matrix.
I'm not sure how else to explain these kinds of operations you can perform in Polars that just seem really weird coming from relational databases. I assume they are useful for something, but I'm not sure what. Perhaps machine learning?
I have not used Spark, but I have written a lot of SQL, Polars, and pandas. I think much more in terms of SQL when I write Polars than when I write pandas. Do you have any examples of what you are referring to?
In SQL and Spark DataFrames, it doesn't make sense to sort columns of the same table independently like this and then just juxtapose them together. It's in fact very awkward to do something like this with either of those interfaces, which you can see in the equivalent Spark code on that page. SQL will be similarly awkward.
But in Polars (and maybe in Pandas too) you can do this easily, and I'm not sure why. There is something qualitatively different about the Polars DataFrame that makes this possible.
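Concretely, the kind of operation being described is something like this sketch, where each column is sorted on its own inside a select and the results are simply placed side by side (so row alignment no longer means anything):

    import polars as pl

    df = pl.DataFrame({"a": [3, 1, 2], "b": ["x", "z", "y"]})

    out = df.select(
        pl.col("a").sort(),                  # ascending sort of column a
        pl.col("b").sort(descending=True),   # independent descending sort of column b
    )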
How difficult is it to port pandas code to use polars instead? I use pandas a bunch, but sometimes it can be quite slow, especially with file I/O. Anybody have benchmarks?
Mostly minor API differences. Methods with slightly different (usually better) names, sometimes slightly different behavior. Usually easy to wrap in a little function to handle the transition. Sometimes a property becomes a method. That kind of thing. A handful of nice-to-have utility things in Pandas aren't in Polars. Some basic objects or data structures are pretty different (print a dataframe's dtypes property in both for an example).
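A few of the typical renames you run into when porting, as a rough illustration (not exhaustive, and behavior can differ in edge cases):

    import pandas as pd
    import polars as pl

    pdf = pd.DataFrame({"a": [1.0, None, 1.0]})
    pdf.rename(columns={"a": "b"})
    pdf.drop_duplicates()
    pdf.fillna(0)

    pldf = pl.DataFrame({"a": [1.0, None, 1.0]})
    pldf.rename({"a": "b"})
    pldf.unique()
    pldf.fill_null(0)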
A tangent from someone still doing most of their mangling the old-fashioned way: SQL used to extract stuff into an old-fashioned pandas df in a notebook... but I'm wondering, and HN is a wonderful place to ask :)
How does polars sql context stack up against alternatives e.g. perhaps duckdb? If I'm in a notebook and I want to suck in and process a lot of data, which has the least boilerplate, the strongest support and the most efficiency (both RAM usage and speed)?
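Not a full benchmark, but in terms of boilerplate the two look roughly like this (a sketch; file and column names invented, and Polars' SQL interface has evolved across versions):

    import duckdb
    import polars as pl

    # DuckDB: query the file directly with SQL
    pdf = duckdb.sql(
        "SELECT day, sum(revenue) AS revenue FROM 'events.parquet' GROUP BY day"
    ).df()

    # Polars: register a lazy scan and run SQL against it
    ctx = pl.SQLContext(events=pl.scan_parquet("events.parquet"))
    out = ctx.execute(
        "SELECT day, sum(revenue) AS revenue FROM events GROUP BY day"
    ).collect()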
That lazily evaluated API is very similar to Spark's. Data engineers will probably tend to move to Polars for smaller workloads, while keeping Spark for multi-machine computing.