I wrote one of the fastest DataFrame libraries (ritchievink.com)
389 points by polyrand 10 months ago | hide | past | favorite | 140 comments

> At the time of writing this blog, Polars is the fastest DataFrame library in the benchmark second to R’s data.table, and Polars is top 3 all tools considered

This is a very strange way to write “Polars is the second fastest” but I guess that doesn’t grab headlines

I think it's mostly a nod to the fact that R's data.table blows everybody else out of the water by such a ridiculously wide margin. It's like a factor of 2 faster than the next fastest...

So if you're writing a dataframe library as a hobby project, it's far less demotivating to use "all the other implementations" as your basis for comparison, at least initially.

I think a hobby project written in a general purpose language being the second fastest dataframe library is a hell of an accomplishment.

Sure, but data.table also fits those criteria.

Any idea what makes R's data.table so fast compared to the others?

I read somewhere else (another comment, I think) that it was a ground-up implementation taking a very performance-oriented approach.

Basically it seemed like they really got in the weeds to make it super fast.

R is from ~2000, while pandas started in 2011. Is it possible that the lack of compute power had an effect on the required performance characteristics?

data.table is basically a highly optimized C library


That's somewhat like libvips which was started when a 486 was state of the art - fast forward and it's an image processing monster.

R is much older than 2000, it's from 1993.

And it’s an implementation of S, originally from Bell Labs in 1976

Thank you; my brief research led to a list of versions that had R 1.0 as 2000, but it appears that v0 lasted a good many years. Pandas was also in v0 for many years, so it is the better like-for-like comparison.

Probably worth pointing out that the sections on microarchitecture stopped being representative with the Pentium Pro in the mid '90s. The processor still has a pipeline, but it's much harder to stall it like that.

http://www.lighterra.com/papers/modernmicroprocessors/ is a good guide (Agner Fog's reference isn't really a book so I don't recommend it for the uninitiated)

[note: see more nuanced comment below]

The Julia benchmark two links deep at https://github.com/h2oai/db-benchmark doesn't follow even the most basic performance tips listed at https://docs.julialang.org/en/v1/manual/performance-tips/.

What specifically are you thinking of?

The non-const global variables stand out to me, but I'm not experienced enough to tell whether that would make a large difference.

Non-const globals could be an issue, but it's possible it doesn't matter too much for this particular benchmark. I'm a little worried about taking compilation time (apart from precompilation) into account (would that also be done for C++ code?). But I must confess I may have posted my comment a bit too soon; partially because of the time of day, partially because of the semicolons at the end of each line in the code, which made me quickly assume the benchmark writer was using Julia for the first time. While I have a good amount of experience with Julia, I don't have much experience with DataFrames.jl itself, so I don't know for sure whether the reported benchmark times are reasonable or not.

From the git history it seems like DataFrames.jl maintainers contributed at least some fixes to the scripts, so I guess that means they aren't opposed to it.

I often end every line with a semicolon, so that it doesn't flood a REPL if I run it there.

IIRC, groupby hasn't been optimized in DataFrames.jl yet.

My naive interpretation - The canonical Apache Arrow implementation is written in C++ with multiple language bindings like PyArrow. The Rust bindings for Apache Arrow re-implemented the Arrow specification so it can be used as a pure Rust implementation. Andy Grove [1] built two projects on top of Rust-Arrow: 1. DataFusion, a query engine for Arrow that can optimize SQL-like JOIN and GROUP BY queries, and 2. Ballista, clustered DataFusion-like queries (vs. Dask and Spark). DataFusion was integrated into the Apache Arrow Rust project.

Ritchie Vink has introduced Polars that also builds upon Rust-Arrow. It offers an Eager API that is an alternative to PyArrow and a Lazy API that is a query engine and optimizer like DataFusion. The linked benchmark is focused on JOIN and GROUP BY queries on large datasets executed on a server/workstation-class machine (125 GB memory). This seems like a specialized use case that pushes the limits of a single developer machine and overlaps with the use case for a dedicated column store (like Redshift) or a distributed batch processing system like Spark/MapReduce.

Why Polars over DataFusion? Why Python bindings to Rust-Arrow rather than canonical PyArrow/C++? Is there something wrong with PyArrow?

[1] https://andygrove.io/projects/

Hi Author here,

Polars is not an alternative to PyArrow. Polars merely uses arrow as its in-memory representation of data. Similar to how pandas uses numpy.

Arrow provides the efficient data structures and some compute kernels, like a SUM, a FILTER, a MAX etc. Arrow is not a query engine. Polars is a DataFrame library on top of arrow that has implemented efficient algorithms for JOINS, GROUPBY, PIVOTs, MELTs, QUERY OPTIMIZATION, etc. (the things you expect from a DF lib).

Polars could be best described as an in-memory DataFrame library with a query optimizer.

Because it uses Rust Arrow, it can easily swap pointers around to pyarrow and get zero-copy data interop.

DataFusion is another query engine on top of arrow. They both use arrow as the lower-level memory layout, but both have a different implementation of their query engine and their API. I would say that DataFusion is more focused on a Query Engine and Polars is more focused on a DataFrame lib, but this is subjective.

Maybe it's like comparing Rust Tokio vs Rust async-std. Just different implementations striving for the same goal. (Only Polars and DataFusion can easily be mixed as they use the same memory structures.)
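To make the "query engine on top of a memory layout" distinction concrete, here is a toy sketch in plain Python of what a lazy API with a query optimizer does: record operations as a plan, rewrite the plan (here, a simple predicate pushdown), and only touch the data on `.collect()`. The class and method names are invented for illustration; this is not Polars' or DataFusion's actual implementation.

```python
# Toy lazy "DataFrame": operations are recorded, not executed.
class LazyFrame:
    def __init__(self, rows, plan=None):
        self.rows = rows          # list of dicts standing in for columnar data
        self.plan = plan or []    # recorded operations

    def filter(self, pred):
        return LazyFrame(self.rows, self.plan + [("filter", pred)])

    def select(self, cols):
        return LazyFrame(self.rows, self.plan + [("select", cols)])

    def _optimize(self):
        # Predicate pushdown: run filters first so later steps see fewer rows.
        filters = [op for op in self.plan if op[0] == "filter"]
        rest = [op for op in self.plan if op[0] != "filter"]
        return filters + rest

    def collect(self):
        rows = self.rows
        for op, arg in self._optimize():
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

data = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]
out = LazyFrame(data).select(["k", "v"]).filter(lambda r: r["k"] == "a").collect()
```

A real engine builds a full expression tree and does much more (projection pushdown, common-subexpression elimination, etc.), but the shape — build plan, optimize, then execute — is the same.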

Pandas supports JOIN and GROUP BY operators so you are saying that there is a gap between Apache Arrow and other mature dataframe libraries? If there is a gap, is there no plan to fix it in the standard Arrow API?

I understand the case for a SQL-like DSL and an optimizer for distributed queries (in-memory column stores, not so much). I'm trying to understand the value add of Polars. I don't mean to come across as critical; perhaps DataFusion is a poor implementation and you are being too polite to say so.

I also think that there is a C++/Arrow vs Rust/Arrow decision that has to be made. I associate PyArrow with the C++/Arrow library. Is Polars' Eager API a superset of the PyArrow API with the addition of JOIN/GROUPBY/other operators?

There is definitely a gap, and I don't think that Arrow tries to fill it. But I don't think it's wrong to have multiple implementations doing the same thing, right? We have PostgreSQL vs MySQL; both seem valid choices to me.

A SQL-like query engine has its place. An in-memory DataFrame also has its place. I think the widespread use of pandas proves that. I only think we can do it more efficiently.

With regard to C++ vs Rust arrow. The memory underneath is the same, so having an implementation in both languages only helps more widespread adoption IMO.

Thank you for your work! I decided to kick the tires after reading your Python book. I think you underestimate the clarity of the API you have exposed, which, honestly, looks a fair bit more sane than the tangled web that pandas is.

Thanks, I feel so too. There is still a heap of work to do. I hope that I can also bridge the gap regarding utility and documentation.

Not sure if anything exists but I wish something would do in memory compression + smart disk spillover. Sometimes I want to work with 5-10GB compressed data sets (usually log files) and decompressed that ends up being 10x (plus add data structure overhead). There's stuff like Apache Drill but it's more optimized for multi node than running locally

If you're not afraid of exotic languages, I encourage you to have a look at the APL ecosystem, and especially J and its integrated columnar store Jd.

I have just embarked on an adventure to do just what you describe in... Racket. But it's nowhere to be seen yet.

I'm an epidemiologist and I've been wanting to make my own tools for a while, now. It'll be interesting to see how far I can go with Racket, which already includes many pieces of the puzzle.

Would you mind pointing me at a resource that explains how J/Jd handle it by comparison?


Jd only packs int vectors, though. So if you're hoping for string compression then I don't know of any free solution. Jd heavily leverages SIMD and mmap. Larger-than-RAM columns can be easily processed by ftable. I use Jd for data wrangling before making models in R. Of course, the J language is not for the faint of heart, but it's really well suited to the task.

I wonder if Arrow (https://arrow.apache.org/faq/) would work for this:

"The Arrow IPC mechanism is based on the Arrow in-memory format, such that there is no translation necessary between the on-disk representation and the in-memory representation. Therefore, performing analytics on an Arrow IPC file can use memory-mapping, avoiding any deserialization cost and extra copies."

Dask is popular for this purpose. No memory compression, but supports disk spillage as well as distributed dataframes.

If you're doing OLAP-style queries you should look at DuckDB; it's hella fast and supports out-of-memory compute (it's not exactly "smart" but it handles spillover).

I feel like the HDF5 libraries could be helpful here if you could figure out how to get compatible compression.

What do you mean with smart disk spillover?

When the library attempts to load something from disk that doesn't fit into memory, it transparently, and (usually) without extra intervention from the user, swaps to memory-mapping and chunking through the file(s).

Particularly useful for when you’ve got a bunch of data that doesn’t fit in memory, but setting up a whole cluster is not worth the overhead (operationally or otherwise) and/or if you’ve already got a processing pipeline written in a language/framework and you can’t/don’t-want to go through rewriting it for something distributed.

How is that different to virtual memory?

In memory representation also tends to have some data structure overhead so 5GB compressed -> 50GB uncompressed -> 100-250GB as a data structure (seems 2-5x is pretty normal). It starts out seeming pretty innocuous but quickly explodes. In addition, some things like Drill can do some automatic indexing/metadata recording which can reduce the amount of data it needs to access

Memory mapping lazily loads the file, completes immediately, and scales arbitrarily; using disk-backed virtual memory would require the entire file to be read from disk, written back out to disk, and then read in from disk again on access; it would also require swap to be set up at the OS level, and the amount of swap set up puts a hard limit on the size of the file.

You've described the difference between using mmap vs relying on the operating system's swap mechanism. But neither of those is quite the same as having an application that's aware of its memory usage and explicitly manages what it keeps in RAM. Using mmap may be useful for achieving that, but mmap on its own still leaves most of the management up to the OS.
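The "mapping completes immediately, pages are read lazily" behavior described above can be seen with nothing but the standard library. A minimal sketch (the file path and contents are invented for the demo):

```python
import mmap
import os
import tempfile

# Create a file larger than what we'll actually touch.
path = os.path.join(tempfile.mkdtemp(), "big.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (8 * 1024 * 1024))   # 8 MiB of zeros
    f.seek(4 * 1024 * 1024)
    f.write(b"needle")                      # overwrite 6 bytes mid-file

with open(path, "rb") as f:
    # mmap() returns immediately; no bytes are read yet. The OS pages in
    # only the 4 KiB-ish pages that the slice below actually touches.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[4 * 1024 * 1024 : 4 * 1024 * 1024 + 6]
    mm.close()
```

Contrast with swap: here the application never reads the preceding 4 MiB, and no swap space or second round-trip through disk is involved.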

Thanks for clearing that up!

gzip, zcat, parallel... That sort of thing?

Yeah, but it starts to fall apart for anything beyond basic search/filtering (like grouping or computed fields).

How is that different from, say, opening up a gzipped file with a reader object in python?

It's not, but implementing intelligent data access each time can become pretty tedious (sure, you can write libraries and tools, but I'm basically asking if those already exist).

For something like AWS CloudTrail logs, 5GB is ~40k gzipped JSON files of 100-130KB each, so you hit single-core CPU bounds almost immediately (just reading/decompressing/JSON parsing off an SSD). The CPU scale-out model in Python is processes, so now you're copying data between processes if you want to parallelize, and you hit IPC bottlenecks just using the standard library multiprocessing/concurrent.futures stuff.

5GB compressed probably won't fit in memory once decompressed, so now you have to deal with that too, unless you have a way to keep it compressed (which would come at the cost of additional CPU usage).

tl;dr: it's non-trivial to actually fully use the hardware.
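A sketch of the fan-out pattern under discussion, kept entirely in the standard library. It uses threads rather than processes, which sidesteps the IPC copies mentioned above and still overlaps some work because zlib releases the GIL during decompression; the pure-Python JSON parsing remains serialized, which is exactly where processes (and their costs) come in. File names and contents are made up for the demo:

```python
import gzip
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def count_records(path):
    # Decompress and parse close to the data; return only a tiny summary
    # so almost nothing crosses thread (or, in the process case, IPC) boundaries.
    with gzip.open(path, "rt") as f:
        return sum(1 for line in f if json.loads(line))

# Make a handful of toy gzipped JSON-lines "log files".
d = tempfile.mkdtemp()
paths = []
for i in range(4):
    p = os.path.join(d, f"log{i}.json.gz")
    with gzip.open(p, "wt") as f:
        for j in range(10):
            f.write(json.dumps({"file": i, "event": j}) + "\n")
    paths.append(p)

with ThreadPoolExecutor(max_workers=4) as ex:
    total = sum(ex.map(count_records, paths))
```

Swapping in `ProcessPoolExecutor` parallelizes the JSON parsing too, at the price of pickling arguments and results across processes — cheap here because only a path goes in and an int comes out.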

Stepping back a bit, if you're CPU bound in a single thread decompressing data off an SSD, does copying the compressed data into memory first actually buy you anything?

If you truly need the dataset fully loaded into memory for performance reasons, then it's presumably because of the need to do lots of random accesses, where the read latency would otherwise harm you. The tricky bit is the fact that it's generally hard to randomly access compressed streams of data. You need to compress the data in a way that makes random access possible, likely to the detriment of compression ratio. Unless you also use the same compression format to store the data on disk, then you're back to having to decompress (and recompress) the whole file sequentially anyway in order to build the data structure in RAM.

I've seen purpose-made compressed log formats that support efficient seeking. I've never seen them loaded into RAM in their raw compressed form, though. Generally they do have a corresponding library to make accessing the log data easy.
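The "compress in seekable blocks, pay a bit of ratio" trade-off described above is easy to illustrate with zlib: compress fixed-size blocks independently and keep an index of offsets, so a random read decompresses only the block(s) it lands in. This is a toy version of what formats like bgzip do; all names here are invented:

```python
import zlib

BLOCK = 1024  # uncompressed block size; smaller = finer random access, worse ratio

def compress_blocks(data):
    """Compress `data` block by block; return the blob and an offset index."""
    blob, index = bytearray(), []
    for i in range(0, len(data), BLOCK):
        c = zlib.compress(data[i:i + BLOCK])
        index.append((len(blob), len(c)))  # (offset, compressed length)
        blob += c
    return bytes(blob), index

def read_at(blob, index, pos, n):
    """Read n uncompressed bytes starting at pos, touching only needed blocks."""
    out = bytearray()
    while n > 0:
        off, clen = index[pos // BLOCK]
        block = zlib.decompress(blob[off:off + clen])
        take = block[pos % BLOCK : pos % BLOCK + n]
        out += take
        pos += len(take)
        n -= len(take)
    return bytes(out)

data = bytes(range(256)) * 64            # 16 KiB of sample data
blob, index = compress_blocks(data)
assert read_at(blob, index, 5000, 16) == data[5000:5016]
```

Compressing the whole stream at once would give a better ratio, since the dictionary resets at every block boundary here — that's the detriment mentioned above.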

Maybe give Vaex a try. It's along those lines: an out-of-core DataFrame library that also works on a single machine.


You can memory map Arrow

I think ClickHouse does that.

I actually started looking at ClickHouse a couple weeks ago but got a bit sidetracked trying to grok how distributed tables work. It looks promising, but there's a bit of a learning curve (it seems some of the performance also comes from its use of arrays, but best I can tell my use case should just use regular tables).

ClickHouse performance is principally due to column storage, compression, and ability to parallelize processing. Arrays can improve performance in some specific cases but are more commonly used to help deal with semi-structured data or perform custom processing on values within groups.

If your data maps cleanly to tables, that's in fact the best case with the easiest options for performance enhancement.

> This directly shows a clear advantage over Pandas for instance, where there is no clear distinction between a float NaN and missing data, where they really should represent different things.

Not true anymore:

> Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.

> The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).


I wonder how pandas can both be at version 1.0 and have an experimental feature for something so central. Honest question.

Because Pandas is built on top of NumPy and NumPy has never had a proper NA value. I would call that a serious design problem in NumPy, but it seems to be difficult to fix. There have been multiple NEPs (NumPy Enhancement Proposals) over the years, but they haven't gone anywhere. Probably since things are not moving along in NumPy, a lot of development that should logically happen at the NumPy level is now happening in Pandas. But, I agree, I find it baffling how Python has gotten so big in data science and been around so long without having proper NA support.


It's at version 1.0 because it has a mature and stable interface. That does not mean that it cannot have experimental features which are not part of that stable interface.

And much more recently (December 26, 2020): https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0...

> Experimental nullable data types for float data

> We’ve added Float32Dtype / Float64Dtype and FloatingArray. These are extension data types dedicated to floating point data that can hold the pd.NA missing value indicator (GH32265, GH34307).

> While the default float data type already supports missing values using np.nan, these new data types use pd.NA (and its corresponding behavior) as the missing value indicator, in line with the already existing nullable integer and boolean data types.
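The distinction the changelog describes is easy to see directly. A small sketch, assuming stock pandas >= 1.2 (the values are made up):

```python
import numpy as np
import pandas as pd

# Classic numpy-backed float column: NaN is the only "missing" marker,
# so a float NaN and missing data are indistinguishable.
classic = pd.Series([1.0, np.nan])

# Nullable extension dtype: pd.NA is a distinct missing-value indicator,
# consistent with the nullable Int64 and boolean dtypes.
nullable = pd.Series([1.0, pd.NA], dtype="Float64")

print(classic.dtype, nullable.dtype)   # float64 Float64
print(nullable.isna().tolist())        # [False, True]
```

This is the gap the article's quote about Polars refers to — with the classic dtype there is simply no way to say "this value is a genuine NaN, not missing".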

I'm surprised that data.table is so fast, and that pandas is so slow relative to it. It does explain why I've occasionally had memory issues on ~2GB data files when performing moderately complex functions. (to be fair, it's a relatively old Xeon w/ 12GB ram) I'll have to learn the nuances of data.table syntax now.

I'm convinced that data.table is wizardry.

For anyone who's turned off by dt[i, j, by=k], Andrew Brooks has a good set of examples at http://brooksandrew.github.io/simpleblog/articles/advanced-d.... Data Camp's Cheat Sheet is also a good resource https://s3.amazonaws.com/assets.datacamp.com/blog_assets/dat....

Are you using Pandas >= 1.0? I noticed a big speedup without changing my code.

It seems like DataFrames.jl still has a ways to go before Julia can close the gap on R/data.table. I don't think these benchmarks include compilation time either?

I started using Julia in December, DataFrames are in a sort of weird place because they're so much less necessary compared to e.g. Python. In Julia, you could just use a dict of arrays and get most of the benefits, thanks to libraries like Query.jl and Tables.jl. Thus the ecosystem is a lot more spread out. I actually use DataFrames much less than I used to in Python.

This is mostly good, because you can apply the same operations on DataFrames, Streams, Time Series data, Differential Equations Results, etc., but it does mean that some of the specialized optimizations haven't made it into DataFrames.jl

Those arguments apply to Python as well. There is nothing special about Julia that warrants your arguments.

I've been using Python a lot longer than I've been using Julia, and this isn't really true. Python tends towards much larger packages where everything is bundled together, and there are fairly deep language-level reasons for that. Python doesn't have major alternatives to pandas the way Julia has half a dozen alternatives to DataFrames. There is nothing like Query.jl that applies to all table-like structures in Python.

In pandas, you'll see things like exponentially weighted moving averages, while DataFrames.jl is pretty much just the data structure.

The centralization of the Python ecosystem and extra attention that pandas has gotten has made it much better in several ways – for example, pandas's indexing makes filtering significantly faster. These optimizations might make it to DataFrames.jl eventually, but I doubt you'll ever see the same level of centralization.
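The indexing point can be illustrated with a small sketch in stock pandas (data invented); a boolean mask scans the whole column on every query, while an index lets `.loc` do a lookup, which pays off on repeated lookups over large frames:

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "b", "a", "c"], "v": [1, 2, 3, 4]})

# Boolean mask: evaluates the comparison over the full column each time.
mask_result = df[df["k"] == "a"]["v"].tolist()

# Index lookup: after set_index, .loc resolves "a" via the index
# structure instead of a full scan.
by_k = df.set_index("k")
loc_result = by_k.loc["a", "v"].tolist()
```

Both return the same rows; the difference is in how they are found.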

I disagree. Python's data science community is strongly clustered around pandas, even though it's possible to use other approaches

Not that I am a heavy DataFrame user, but I have felt more at home with the comparatively light-weight TypeTables [1]. My understanding is that the rather complicated DataFrame ecosystem in Julia [2] mostly stems from whether tables should be immutable and/or typed. As far as I am aware there has not been any major push at the compiler level to speed up untyped code yet – although there should be plenty of room for improvements – which I suspect would benefit DataFrames greatly.

[1]: https://github.com/JuliaData/TypedTables.jl

[2]: https://typedtables.juliadata.org/stable/man/table/#datafram...

> As far as I am aware there has not been any major push at the compiler level to speed up untyped code yet – although there should be plenty of room for improvements – which I suspect would benefit DataFrames greatly.

That's not quite correct. The major `source => fun => dest` API as part of DataFrames.jl was designed specifically to get around the non-typed container problem. And it definitely works. That's not the cause of slow performance.

I think the reason is that, as you mentioned, DataFrames has a big API and a lot of development effort is put towards finalizing the API in preparation for 1.0. After that there will be much more focus on performance.

In particular, some changes to optimize grouping were recently merged but didn't make it into the release by the time this test suite was run; multi-threaded operations, which haven't been finished yet, should also speed things up a lot.

That said, this new Polars library looks seriously impressive. Congrats to the developer.

It does look like the benchmarks include compilation time.

Looks like a cool project.

It's better to separate benchmarking results for big data technologies and small DataFrame technologies.

Spark & Dask can perform computations on terabytes of data (thousands of Parquet files in parallel). Most of the other technologies in this article can only handle small datasets.

This is especially important for join benchmarking. There are different types of cluster computing joins (broadcast vs shuffle) and they should be benchmarked separately.

Yeah, nobody uses Spark because they want to; they use it because past a certain data size, there's nothing else.

This is very cool. I'm happy to see the decision to use Arrow, which should make it almost trivially easy to transfer data into e.g. Julia, and maybe even to create bindings to Polars.

I tried running your code via docker-compose. After some building time, none of the notebooks in examples-folder worked.

The notebook titled "10 minutes to pypolars" was missing the pip command, which I had to add to your Dockerfile (actually python-pip3). After rebuilding the whole thing and restarting the notebook, I had to change "!pip" to "!pip3" (was too lazy to add an alias) in the first code cell, which installed all dependencies after running. All the other cells resulted in errors.

I suggest focusing on stability and reproducibility first, and then on performance.

Author here. These examples and docker-compose files are heavily outdated. Please take a look at the docs for up-to-date examples.

P.S. I do what I can to keep things up to date, but only have the time I have.

Claims of "fastest" are often doubtful. You often see a micro-benchmark list ten products and then say "look, I run in the shortest time".

The problem is that people often compare apples to oranges. Do you know how to correctly use ClickHouse (there are 20-30 engines in ClickHouse to choose from; do you compare an in-memory engine to a disk-persistent database?), Spark, Arrow...? How can you guarantee a fair evaluation among ten or twelve products?

Pretty impressed with the data.table benchmarks. The syntax is a little weird and takes getting used to but once you have the basics it’s a great tool.

I use it a lot but it really breaks the tidyverse, which makes using R actually enjoyable. Why aren’t these other libraries (not in R; I’m talking the others in the benchmark) consistently as fast as data.table? Are the programmers of data.table just that much better?

While I like the tidyverse, I honestly have trouble using it most of the time, knowing how much slower it is. It becomes addictive; I have trouble accepting minutes in the tidyverse for operations that take seconds in DT.

As for the speed, Matt Dowle definitely strikes me as a person who optimizes for speed. Then of course there is the fact that everything is done in place, and parallelization is at this point baked in. It's also mature, unlike a lot of the alternatives, and has never lost sight of speed. Note, for example, how in pandas in-place operations have become very much discouraged over time, and are often not actually in place anyway.

Now, back to the tidyverse: why do you think it breaks with DT? If you enjoy the pipe, wrap DT in a function (e.g. dt) that takes a data frame, ensure that any DT-specific operations you need return a reference to your data.table object, and off you go with something like this:

  df %>%
    dt(, x := y + z) %>%
    unique() %>%
    merge(z, by = "x") %>%
    dt(x < a)
There, it looks like tidyverse, but way faster.

There are almost 200 magrittr-related issues in GitHub and I have had a bad time pairing data.table with tidyverse packages (and others because of e.g. IDate). DT code is like line noise to me, but I prefer to write things in it directly — the only reason I use it is because it’s fast, and guessing how it’s going to interact with tidy stuff and NSE (especially when using in place methods) is counterproductive to that goal.

19 of those are open and most of them not terribly relevant. Considering the ubiquity of the package, I'd say the total number of issues is shockingly low.

As for NSE, DT uses NSE as well, but differently of course. I guess it all comes down to what we "mean" by tidyverse. If we mean integration with the vast majority of packages, then yeah, it will work, but of course certain things are out of bounds. If you just want to use data.table like dplyr, then tidytable is your ticket.

I'd argue the best thing to do, though, is to just get used to the syntax. data.table looks like line noise until you're really comfortable with it; then the terse syntax comes across as really expressive and short. I've come to like writing data.table in locally scoped blocks, pretty much without the pipe, and using mostly vanilla R (aside from data.table). I think it looks pretty good, actually, with less line noise than pandas and its endless lambda lambda lambda.

I counted closed issues intentionally — this isn’t some one-off matter that’s easily resolved, as clearly hundreds of people have struggled with these issues over the years, and this should not be dismissed.

It’s far better aesthetically than Python. It’s just too different from the other libraries I use, which disrupts my cognitive flow. You might say there are too many ways to do something, too, which makes it that much harder to figure out what code written by someone else (or myself three months ago) does. I also severely dislike seeing calls to eval or unevaluated code within the main body of my program: quoted code looks awful and I trust it less.

It’d be interesting to see DT repackaged as its own tool with its own syntax. As it stands, it’s constrained by R and it has no comparable ecosystem to the tidyverse around it.

I dropped dplyr in favor of data.table and never looked back.


Vanilla R got a bad name but once you understand the fundamentals it's quite good, fewer footguns than used to be there, and I find it easier to reason about than tidyverse.

But the hexagons! Where are its hexagons?

There are dozens of us!

> It really breaks the tidyverse

You may want to look at tidyfst.

> Are the programmers of data.table just that much better?

Pixie dust, R's C API (and yes, they're just exceptionally good).

dplyr and related packages use the existing R data frame class. (A "tibble" is just a regular R data frame under the hood.) This means that it inherits all the performance characteristics of regular R data frames. data.table is a completely separate implementation of a data structure that is functionally similar to a data frame but designed from the ground up for efficiency, though with some compromises, such as eschewing R's typical copy-on-modify paradigm. There are other more subtle reasons for the differences, but that's the absolute simplest explanation.

Supposedly you can use data.tables with dplyr, but I haven't experimented with it in depth.

> data.table is a completely separate implementation of a data structure that is functionally similar to a data frame but designed from the ground up for efficiency, though with some compromises, such as eschewing R's typical copy-on-modify paradigm.

This is totally false. data.table inherits from data.frame. Sure, it has some extra attributes that a tibble doesn’t but the way classing works in R is so absurdly lightweight, that’s meaningless in comparison. Both tibble and data.table are data.frames at their core which are just lists of equal length vectors. You can pass a data.table wherever you pass a data.frame.

Thank you for the correction. I knew that tibbles were essentially just data frames with an extra class attribute, but for some reason I didn't realize this was also true of data.table. I think I assumed that data.table's reference semantics couldn't be implemented on top of the existing data frame class, but I guess I was wrong about that. Unfortunately it's too late for me to edit my original comment.

Tibbles are not just data frames with an extra class attribute. For one, they don't have row names. Second, consider this example, demonstrating how treating tibbles as data frames can be dangerous:

    df_iris <- iris
    tb_iris <- tibble(iris)

    nunique <- function(x, colname) length(unique(x[,colname]))

    nunique(df_iris, "Species")
    > 3

    nunique(tb_iris, "Species")
    > 1
R-devel mailing list had a long discussion about this too: https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...

Ok, fine, to be more precise, tibbles and data frames and data tables are all implemented as R lists whose elements are vectors which form the columns of the table. And also `is.data.frame` currently returns TRUE for all of them, whether or not that is ultimately correct.

dtplyr, the dplyr backend for data.table, is still IMHO not great, and will often break in subtle and not-so-subtle ways. tidytable is, I think, a much more interesting implementation, and gets close to the same speeds.

Hmm, this looks very interesting! I've ended up preferring dplyr for its expressiveness in spite of the speed difference, so this might be a nice compromise for when dplyr gets too slow.

Oh, I know that, I use it daily and I’ve read some of its source code. I’m just astonished that the best-performing data frame library in the world is developed in R and it outperforms engines written with million/billion dollar companies behind it.

data.table is written primarily in C. But R happens to have a very good package system and a very good interface to C code.

And Matt Dowle has bled for that C code.

I feel like some of it is to do with the way R's generics work - being lisp-based and making use of promises. It allows for nice syntax / code while interfacing the C backend.

Me too: I've tended to let the database do a lot of heavy lifting before I bring data in. Maybe I don't actually need to do that.

There’s really no harm in doing that, and it’s still a pretty good idea.

I generally try to take my data sources as far as possible with the database, then leave framework/language-specific things to the last step. That means, if nothing else, that someone else picking up your dataset in a different language/framework toolset doesn't need to pick up yours as a dependency, and you're not spending time re-implementing what a database can already do (and can do more portably).

The only downside to letting the database do some of the pre-processing is that I don't have the full raw data set to work with within either R or Python. If I decide I need an existing measure aggregated up to a different level, or a new measure, I've got to go back to the database and bring in an additional query. So I have less flexibility within the R or Python environment. But you make a good point: there are trade-offs either way, and keeping the dataset as something like a materialized view on the database makes it a little more open to others' usage.

If this will read a csv that has columns with mixed integers and nulls without converting all of the numbers to float by default, it will replace pandas in my life. 99% of my problems with pandas arise from ints being coerced into floats when a null shows up.

Pass dtype = {"colX":"Int64"} for the columns that you want to read as a Nullable integer type: https://pandas.pydata.org/pandas-docs/stable/user_guide/inte...
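A small sketch of both behaviors (the `score` column name is made up for illustration):

```python
import io
import pandas as pd

csv = "id,score\n1,10\n2,\n3,30\n"

# Default: the single null forces the whole column to float64.
df = pd.read_csv(io.StringIO(csv))
print(df["score"].dtype)  # float64

# Nullable extension dtype: ints stay ints, the null becomes <NA>.
df = pd.read_csv(io.StringIO(csv), dtype={"score": "Int64"})
print(df["score"].dtype)  # Int64
```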

The problem is not that it can't be done, it's that I'll read one dataset and write the script that behaves as expected (using `head` here and there to check things as the script progresses), then come back to it later after I get a new dataset, that now has nulls mixed with numbers. It starts behaving differently or is broken in a subtle way, and it's not always obvious why. After lots of experience, I have learned to check for int mangling each time a new Dataframe is read or two Dataframes are merged together. It is enough of a frustration that I am willing to look for a viable alternative, because I think it's a bit absurd that Int64 isn't the default for columns that are clearly meant to be integers mixed with nulls, or that I can't set a flag to tell it to stop int mangling.

It's not absurd that Int64 isn't the default, because:

1. nullable Int64 was only implemented recently, still experimental, and changing defaults can break lots of existing code

2. implementing nullable Int64 was a very non-trivial exercise, because pandas was mostly built on top of numpy which didn't (and still doesn't) have nullable integer arrays

I disagree that those things make it not absurd. The current behavior is a surprise when you discover it and continues to bite you long after. It shouldn't be changed to a default now; the current behavior should never have existed.

I understood the technical reasons since I've researched them myself. It does literally nothing to change the frustration or convince me not to look for an alternative.

I've been intrigued by this library, and specifically the possibility of a Python workflow with a fallback to Rust if needed. I mean, I haven't really looked at what the interop is, but it should work, right?

It's not going to happen for now though because the project is still immature and there's zero documentation in Python from what I can see. But it's something I'm keeping a close eye on. I often work with R and C++ as a fallback when speed is paramount, but I think I'd rather replace C++ with Rust.

Rust projects take longer. If memory safety is not a concern, I'd advise sticking to modern C++.

This feels like a gross generalization that's not applicable in many situations, and is immensely dependent on each individual person and situation.

I can write non-trivial performant code in Rust, including bindings across a C FFI, much faster than I can weave together the equivalent code and build scripts in C++. Memory safety isn't the only thing Rust brings to the table. I sometimes don't use it because C++'s ecosystem is far more developed for a certain application and it's not worth it for that particular situation. As with most things, it's about trade-offs.

Are you not using CMake for building C++ apps?

You can use C++ for everything; its ecosystem isn't only developed for certain applications.

I'm guessing Polars and Ballista (https://github.com/ballista-compute/ballista) have different goals, but I don't know enough about either to say what those might be. Does anyone know enough about either to explain the differences?

Ballista is distributed. Its author, Andy Grove, is also the author of the Rust implementation of Arrow, so there will be similarities between the two projects.

What’s up with all the Dask benchmarks saying “internal error”? I expected at least some explanation in the post.

If you click through to the detailed benchmarks page (https://h2oai.github.io/db-benchmark/), you can see that most of them are cases of running out of memory, and a few are features that haven't been implemented yet.

Inefficient use of memory is a problem I've seen with several projects that focus on scale out. All else being equal, they tend to use a lot more memory. This happens for various reasons, but a lot of it is the simple fact that all the mechanisms you need to support distributed computing, and make it reliable, add a lot of overhead.

I suppose there isn’t as much focus on squeezing everything possible out of a single machine if the major focus is on distribution.

There's that, but there's also just costs that are baked into the fact of distribution itself.

For example, take Spark. Since it's built to be resilient, every executor is its own process. Because of that, executors can't just share immutable data the way threads can in a system that's designed for maximum single-machine performance. They've got to transfer the data using IPC. In the case of transferring data between two executors, that can result in up to four copies of the data being resident in the memory at once: The original data, a copy that's been serialized for transfer over a socket, the destination's copy of the serialized data, and the final deserialized copy.
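As a toy illustration of that four-copy chain (not Spark itself, just the same pickle-style serialize/transfer/deserialize pattern that any process-based IPC goes through):

```python
import pickle

data = list(range(100_000))       # 1. original data held by the sender
wire = pickle.dumps(data)         # 2. serialized copy staged for the socket
received = bytes(wire)            # 3. receiver's copy of the wire bytes
restored = pickle.loads(received) # 4. deserialized copy at the destination

# All four can be resident in memory at once; threads sharing
# immutable data would need only copy #1.
```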

These are single machine benchmarks.

Still mind boggling to me how amazing data.table is

"Polars is based on the Rust native implementation Apache Arrow. Arrow can be seen as middleware software for DBMS, query engines and DataFrame libraries. Arrow provides very cache-coherent data structures and proper missing data handling."

This is super cool. Anyone know if Pandas is also planning to adopt Arrow?

I believe Pandas is incompatible with Arrow for a few reasons, such as their indexes and datetime types. But it's pretty easy to convert a pandas dataframe to Arrow and vice versa – I actually use this to pass data between Python & Julia.

As a side note, Wes McKinney, the creator of Pandas, is heavily involved in Arrow.

I could be misremembering but I seem to recall Wes McKinney saying in some talk that rather than rewrite Pandas to be Arrow-backed it will probably eventually be replaced by newer Arrow-backed libraries some of which might have pandas-like apis. I think the idea was that pandas API is too large and the library too widely used for it to be practical to correct some of the design problems people have mentioned. He'd sketched out a vision for Pandas 2.0 at one point and I think he said that basically that would probably just be a new library.

There's a lot of related discussion in this post on his blog.

It’ll probably never be fully comparable because Pandas can represent python objects and nulls (badly). However, for the most part Arrow and Numpy are compatible. There’s no overhead in converting an arrow data structure into a Numpy one.

I don’t think this is the case. Particularly if you move past 1d numpy numeric arrays. And even in the simplest case of say a 1d float32 array, Arrow arrays are chunked which means there is significant overhead if you try to use an arrow table as your data structure when using Python’s scientific/statistics/numerics ecosystem.

We have been rewriting our stack for multi/many-GPU scale out via Python GPU dataframes, but it's clear that smaller workloads and some others would be fine on CPU (and thus free up the GPUs for our other tenants), so having a good CPU impl is exciting, especially if they achieve API compatibility with pandas/dask as RAPIDS and others do. I've been eyeing Vaex here (I think the other Rust Arrow project isn't dataframes?), so good to have a contender!

I'd love to see a comparison to RAPIDS dataframes for the single GPU case (ex: 2 GB), single GPU bigger-than-memory (ex: 100 GB), and then the same for multi-GPU. We have started to measure things like "200 GB/s in-memory and 60 GB/s when bigger than memory", to give perspective.

I would love to learn the details of building a Python wrapper on Rust code like you did with pypolars.

I just did that this weekend with PyO3. The GitHub page has some working examples.

Does anyone have insights into why data.table is as fast as it is? PS: Great work with polars!

Pandas does seem to be on the way out if I'm being honest, and that's coming from someone who has invested heavily in it (backend for my project http://gluedata.io/). JMO

I would happily adopt Polars if the feature set is expansive enough.

Pandas is great because its so ubiquitous but I have always felt that it was slow (especially coming from R).

One thing that is weirdly terrible in pandas is data types. The coupling with numpy is awkward. It's so dependent on numpy, and if pandas isn't moving fast, numpy isn't moving at all. I'd be curious to see how Polars handles this, e.g. null values, datetimes etc.
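The awkwardness shows up in how differently missing values behave per dtype; a quick sketch:

```python
import pandas as pd

# datetime64 columns have a native missing marker (NaT), so the dtype survives.
ts = pd.Series(pd.to_datetime(["2021-01-01", None]))
print(ts.dtype)  # datetime64[ns]

# But a missing value in an integer column still degrades it to float64,
# because NumPy has no nullable integer type.
print(pd.Series([1, None]).dtype)  # float64
```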

Is this faster than a "traditional" RDBMS like SQL Server, Postgres, Oracle (::uhhg:: but I have to work with it)?

This is all in-memory afaik, so yes, this will be orders of magnitude faster than an RDBMS.

How is data.table so amazing? I didn’t expect it to be that fast...

Optimized C can be very hard to match.

I would like it if Vaex was added to the benchmarks.

Looks great, but I would like to see how this compares against proprietary solutions (e.g. kdb+).

Arrow uses SIMD; how would that work on ARM? I've had quite a lot of success with Gravitons recently.

Out of curiosity, what did you run on Graviton? How did you define success?

What is your question? ARM supports SIMD through NEON.

You can implement the intel intrinsics on the arm end, but NEON and AVX aren't exactly the same thing, so there's usually performance to be found.

I'm also not aware of any free SIMD profilers that work on ARM that hold a candle to vTune.

No profilers really hold a candle to vTune is the problem in general. I love the new AMD chips but uProf isn't in the same class as vTune and that is sad. I'm certain with better tools the AMD chips could be demolishing Intel by an even greater margin.

Writing a separate path for NEON is what would be needed. It's not like there are magical SIMD functions (intrinsics) that work across architectures.

How does it compare to DataFusion, which is also a Rust project that has dataframe support?

On a very broad basis, Polars is to Pandas as Ballista/DataFusion is to Spark.

Is there any potential for this to become the next Databricks?

That’s a bit like asking when Python/Pandas will become the next SaaS product.

This is analogous to Pandas; Databricks is a commercial offering of managed Spark, and Ballista is a new project that is a Rust ("modern"?) rewrite of Spark.

Is there a data frame project for Node.js?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact