Funny enough, I actually just (2 weeks ago) added support for streaming from PySpark to Polars/DuckDB/etc. through the Arrow PyCapsule interface. By streaming, I mean actually streaming, not collecting all the data at once. It probably won't be released until May/June, but it's there: https://github.com/apache/spark/commit/ecf179c3485ba8bac72af...
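To make the distinction concrete: the PyCapsule stream interface hands the consumer an iterator of record batches rather than one fully materialized table. The real interface exchanges C-level Arrow stream pointers, but the control flow is the same as a plain Python batch iterator. A pure-Python sketch (all names here are illustrative, not the actual Spark or Arrow API):

```python
# Illustrative sketch only: the Arrow PyCapsule stream protocol moves
# C-level batch pointers, but the control flow mirrors a Python iterator
# of batches. Function names below are made up for illustration.

def produce_batches(n_rows, batch_size):
    """Producer side: yield one batch at a time (like a partition scan)."""
    for start in range(0, n_rows, batch_size):
        yield list(range(start, min(start + batch_size, n_rows)))

def collect_all(n_rows, batch_size):
    """'Collect' strategy: materialize every batch before handing anything over."""
    return [row for batch in produce_batches(n_rows, batch_size) for row in batch]

def stream_sum(n_rows, batch_size):
    """'Streaming' strategy: consume batch by batch, holding only one in memory."""
    total = 0
    for batch in produce_batches(n_rows, batch_size):
        total += sum(batch)
    return total

print(stream_sum(10, 4))        # 45
print(len(collect_all(10, 4)))  # 10
```

The point is the memory profile: the streaming consumer's peak footprint is one batch, while collect's is the whole dataset.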
I always appreciate blog posts like this. There are obviously cases where SQL shines, and I think the dataframe abstraction helps fill in a lot of the pieces SQL doesn't handle so well (composability, debuggability, interactivity, etc.)
Even pandas (with all its faults) is more flexible as a language than SQL[1]. I'm of the opinion that there's a better together story in the end, but I guess we will see.
There are loads of operations that pandas and many other dataframe systems support but that are impossible or very cumbersome to express in SQL: dropping null values based on some threshold, one-hot encoding, covariance, and certain data cleaning operations, for example. There are also things related to metadata manipulation that are outright impossible in a relational database.
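For concreteness, here's what those first three look like in pandas (a minimal sketch with made-up data; the SQL equivalents typically need a non-null count over every column, or one CASE expression per distinct category):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [1.0, 2.0, None, None],
    "c": ["x", "y", "x", "z"],
})

# Drop rows with fewer than 2 non-null values: one keyword in pandas,
# an explicit count across all columns in SQL.
kept = df.dropna(thresh=2)

# One-hot encoding: in SQL this is one CASE WHEN per distinct value,
# which you must know ahead of time.
dummies = pd.get_dummies(df["c"])

# Pairwise covariance matrix over the numeric columns.
cov = df[["a", "b"]].cov()

print(kept.shape)             # (3, 3) -- only the all-but-one-null row is dropped
print(sorted(dummies.columns))  # ['x', 'y', 'z']
```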
SQL is super expressive, but I think pandas gets a bad rap. At its core, its data model and language can be more expressive than those of relational databases (see [1]).
I co-authored a paper that explained these differences with a theoretical foundation[1].
Thanks for sharing this. I believe we essentially agree: chaining method calls is inexpressive compared to composing expressions in an algebraic language.
I'm not defending Pandas, but I want to point out that the inability to conveniently compose expressions is one of the biggest problems with SQL. It was designed to be written as a sort of pseudo-English natural language, in an era when people imagined it would be used by non-programmers. To be clear, that's a problem with SQL, not with the idea of a language based on relational algebra. There have been various attempts to create SQL alternatives that behave like real programming languages in terms of, e.g., composability. This blog post makes the point better than I can:
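The composability point is easy to see in a dataframe API, where each query fragment is an ordinary function you can define once, name, and combine, rather than a string to splice into nested subqueries. A small sketch (hypothetical helper names, illustrating the pattern rather than any particular library):

```python
import pandas as pd

# Each fragment is a plain function on a DataFrame, so fragments compose
# like any other functions -- something SQL text can only approximate
# with nested subqueries or CTEs.

def only_adults(df):
    return df[df["age"] >= 18]

def top_n_by(col, n):
    return lambda df: df.nlargest(n, col)

def compose(*steps):
    """Chain transformation steps left to right."""
    def run(df):
        for step in steps:
            df = step(df)
        return df
    return run

people = pd.DataFrame({"name": ["ann", "bob", "cy"], "age": [17, 34, 52]})
pipeline = compose(only_adults, top_n_by("age", 1))
result = pipeline(people)
print(result["name"].tolist())  # ['cy']
```

pandas' own `DataFrame.pipe` supports the same style without the hand-rolled `compose`.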
I absolutely agree - one of the biggest shortcomings of SQL is that its primary programming interface is based on text and intended for humans, instead of being based on data structures and intended for programs.
This blog post is a great way to learn about the theory underlying dataframes without reading a bunch of technical jargon from our academic paper[1]. AMA!
This was my PhD focus. We identified a core "dataframe algebra"[1] that encompasses all of pandas (and R/S data.frames): a total of 16 operators that cover all 600+ operators of pandas. What you describe was exactly our aim. It turns out there are a lot of operators that are really easy to support and make fast, and that gets you about 60% or so of the way to supporting all of pandas. Then there are really complex operators that may alter the schema in a way that cannot be determined before the operation is carried out (think a row-wise or column-wise `df.apply`). The flexibility that pandas offers is something we were able to express mathematically, and with that math we can start to optimize the dataframe holistically, rather than chipping away at small parts of pandas that are embarrassingly parallel.
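A minimal example of what "schema undeterminable before execution" means in practice: with a row-wise `df.apply`, the set of output columns can depend on the values in the data, so no optimizer can plan the result schema statically (the parsing function here is just an illustration):

```python
import pandas as pd

# The output columns below are the union of whatever keys appear in the
# data at runtime -- the schema only exists after the apply executes.

df = pd.DataFrame({"tags": ["a=1;b=2", "b=3;c=4"]})

def parse(row):
    # Split a "k=v;k=v" string into a Series keyed by whatever keys it contains.
    return pd.Series(dict(kv.split("=") for kv in row["tags"].split(";")))

wide = df.apply(parse, axis=1)
print(sorted(wide.columns))  # ['a', 'b', 'c'] -- union of keys found in the rows
```

Change the input strings and the result has different columns; that data-dependence is what makes such operators hard to push into a statically planned engine.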
Most dataframe libraries cannot architecturally support the entire dataframe algebra and data model because they are optimized for specific use-cases (which is not a bad thing). It can be frustrating for users who may have no idea what they can do with a given tool just because it is called "dataframe", but I don't know how to fix that.
I'm one of the maintainers of Modin, so I can chime in here. Dataframes are the focus of my PhD thesis, and Modin started as my PhD project. Most of the differences come down to functionality and support. Truthfully, the goals of the projects are quite different, so it's a bit of an apples-to-oranges comparison.
As part of developing Modin, we identified a low-level algebra and data model that both generalizes and encompasses all of the pandas and R dataframe functionality. Modin is an implementation of this data model and algebra[1]. Based on our studies, Vaex's architecture can support somewhere in the range of 35-40% of the pandas DataFrame API, and that figure excludes support for row indexes. Compare this to Dask, currently at 44% of the pandas API, and Modin, currently at 90%.
Vaex is great if you're already working with a compatible memory-mapped file format; it'll be exceptionally fast in that case. That is the use case I believe they are (successfully) targeting.