Hacker News
A simple group-by-and-agg speed comparison inc. Clojure, Pandas, R and Julia (github.com/zero-one-group)
4 points by akhong 6 days ago | 6 comments

Repost from Reddit:

https://h2oai.github.io/db-benchmark/ is obligatory to show as well. I think you should probably note that this isn't just DataFrames.jl, but DataFrames.jl in conjunction with Queryverse tools. The reason is that the Queryverse tools are known to be very nice to use but not as performant as other parts of the Julia programming language (performance is a result of the language and how it's used). For example, the Parquet to DataFrame conversion that you're using has known performance issues: https://github.com/queryverse/ParquetFiles.jl/issues/32

In general, very nice benchmark contribution and thanks for helping showcase the performance landscape.

I'm a pandas core developer and this is very interesting to me.

That `groupby.apply` is a lot slower than `groupby.agg` does not surprise me at all: `groupby.apply` can do a lot of things that `groupby.agg` can't do, at the cost of being potentially a lot slower. In general, `groupby.apply` should only be used when `groupby.agg` can't do the job.
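To make the difference concrete, here is a minimal sketch (not taken from the benchmark; the column names `key` and `value` and the random data are made up for illustration) that computes the same grouped sum both ways. The results match, but `agg("sum")` can dispatch to a built-in grouped reduction, whereas `apply` calls the Python function once per group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "key" and "value" are illustrative names only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 100, size=10_000),
    "value": rng.random(10_000),
})

# Built-in aggregation: pandas can use its own grouped sum here.
agg_result = df.groupby("key")["value"].agg("sum")

# Same computation via apply: the lambda is invoked once per group.
apply_result = df.groupby("key")["value"].apply(lambda s: s.sum())

# Both give one value per key; only the execution path differs.
same = np.allclose(agg_result.to_numpy(), apply_result.to_numpy())
print("results match:", same)
```

Wrapping each call in a timer such as `time.perf_counter()` should show the gap on your own machine; how large it is will depend on the number of groups and the pandas version.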

However, are you saying that pandas's `groupby.agg` is faster than R's data.table, Julia and Clojure? That surprises me a lot.

For this particular data and on my machine, that was certainly the case! I've been shown other benchmark results (such as this one: https://h2oai.github.io/db-benchmark/) that demonstrate otherwise. I'm not really sure what to make of it - maybe try more cases?

One possible explanation I could think of is that Pandas' support for Parquet is pretty good compared to data.table and Julia. I've been asked to split the read/write part and the groupby-agg part for a more complete picture. I'll be sure to work on that in the coming weeks.
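A rough sketch of what that split could look like on the Pandas side, timing the load step and the group-by step separately. This is only an illustration, not the benchmark's code: the `timed` helper and the `key`/`value` columns are made up, and an in-memory CSV buffer stands in for the Parquet files so the snippet runs without a Parquet engine installed.

```python
import io
import time

import numpy as np
import pandas as pd


def timed(fn):
    """Run fn() and return (result, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Toy dataset serialized to an in-memory CSV (a stand-in for Parquet files).
rng = np.random.default_rng(0)
source = pd.DataFrame({
    "key": rng.integers(0, 50, size=5_000),
    "value": rng.random(5_000),
})
buffer = io.StringIO(source.to_csv(index=False))

# Stage 1: reading the data.
df, read_seconds = timed(lambda: pd.read_csv(buffer))

# Stage 2: the group-by and aggregation only.
result, agg_seconds = timed(lambda: df.groupby("key")["value"].agg("mean"))

print(f"read: {read_seconds:.4f}s, groupby-agg: {agg_seconds:.4f}s")
```

For real numbers, each stage would need to be repeated several times and run against the actual Parquet files, since a single run on toy data says little.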

Another hypothesis by u/joinr about why Pandas performs better in the smaller dataset:

"I wonder if there's some default column size allocation that happens up front for the 2^6 case that helps prevent growth in pandas, and maybe the heuristic falls down a little as the dataset gets larger leading to more resizing."

I can't imagine that `.to_parquet` takes any time at all, relative to `groupby.agg`. But yeah, it would be nice to get separate benchmarks for the two parts of your benchmark.

Maybe one-factor groupbys are faster in pandas, while two-factor groupbys (as in https://h2oai.github.io/db-benchmark/) are slower?

Yes, I agree with you in Pandas' case. However, for other libraries, a good chunk of the run time comes from reading the Parquet files and concatenating the partial datasets. Pandas and Spark are particularly good at reading a directory of 12 Parquet files with no noticeable performance penalty.

I'd just like to say that I made this comparison, and I hope it is a fair one. Any feedback on how to improve the current versions or suggestions on other libraries/approaches to include would be greatly appreciated!
