
Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and Rapids - FHMS
https://datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray
======
haltingproblem
We need a better decomposition of scalability. Do you mean scalability in
data, scalability in compute, or both?

Definitions:

Scalability in Data (SD): doing fast computation on a _very_ large number of
rows

Scalability in Compute (SC): doing slow computations on a large number of rows

For SD, I have found that a 16-32 core machine is more than enough for tens of
billions of rows as long as your disk access is relatively fast (SSD vs. HDD).
If you vectorize your compute operations you can typically get to within 10x of
hand-written assembly, which lets you tap a 32-core machine for tens of
effective gigaflops (these machines are rated at hundreds of gigaflops). For
example, I had to compute a metric on a 100-million-row table (dataframe) that
required on the order of 10-20 teraflops of total compute. Single-core Pandas
was projecting about 2 months of compute time. Using vectorization and
multiprocessing.Pool I reduced that to a few hours, and the big win was
vectorization, not the Pool.
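
Roughly the pattern, as a sketch (the metric and the column names here are
invented for illustration): replace a per-row apply with one vectorized
expression over whole columns, then fan contiguous chunks out over a
multiprocessing Pool.

    import numpy as np
    import pandas as pd
    from multiprocessing import Pool

    def metric_vectorized(chunk):
        # one call over entire columns, executed in C, instead of a
        # Python-level call per row via .apply(axis=1)
        return np.log1p(chunk["a"].to_numpy()) * chunk["b"].to_numpy() ** 2

    if __name__ == "__main__":
        df = pd.DataFrame({"a": np.random.rand(10_000_000),
                           "b": np.random.rand(10_000_000)})

        # contiguous chunks, one per worker
        bounds = np.linspace(0, len(df), 33, dtype=int)
        chunks = [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

        with Pool(32) as pool:
            parts = pool.map(metric_vectorized, chunks)
        result = np.concatenate(parts)

Even without the Pool, the vectorized version alone usually accounts for most
of the speedup.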

For compute scalability - e.g. running multiple machine learning models that
cannot effectively be confined to a single machine - nothing beats Dask. Dask
is extremely mature, has seen a large number of real-world use cases, and
people have run it for hundreds of hours of uptime.

Vectorization is an oft-overlooked source of speedup that can easily give you
10-100x in Pandas. Understanding vectorization and what it can and cannot do
is a highly productive exercise.

------
sixhobbits
Oh, I wrote this :) I submitted it last week but it didn't get much attention
then.

Happy to answer questions as far as possible. I have used Pandas extensively
but I don't have deep experience with all of these libraries so I learnt a lot
while summarising them.

If you know more than I do and I made any mistakes, let me know and I'll get
them corrected.

~~~
iwebdevfromhome
What are your thoughts on AWS Glue/Spark? We're starting to have problems with
dataframes that no longer fit into memory on 32 GB clusters, and upgrading to
the next option, a 64 GB cluster, is expensive. We plan to migrate to Glue as
a long-term solution, but I think we need to figure out a short-term fix while
the migration takes place.

Thanks for the article; before it, Dask was the only real alternative I knew
of.

P.S. I just remembered that I wanted to try Pandarallel as well. Do you have
any insight on that library? Thanks!

~~~
disgruntledphd2
Not the OP, but moving to sparse matrices will probably give you the most bang
for your buck. I strongly suspect those huge dataframes could be encoded
sparsely in a much more efficient format.

To be fair, that's one of the reasons the Spark ML stuff works quite well. Be
warned though: estimating how long a Spark job will take and how many
resources it will need is a dark, dark art.
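
For a concrete (hypothetical) illustration, pandas' built-in sparse dtypes
already go a long way before you reach for Spark; a frame that is mostly zeros
shrinks dramatically once only the non-fill values are stored:

    import numpy as np
    import pandas as pd

    # a wide frame that is mostly zeros, e.g. one-hot encoded features
    dense = pd.DataFrame(np.random.binomial(1, 0.01, size=(100_000, 500)))

    # keep only the non-zero entries
    sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

    print(dense.memory_usage(deep=True).sum())   # bytes, dense
    print(sparse.memory_usage(deep=True).sum())  # bytes, typically far smaller
    print(sparse.sparse.density)                 # fraction of values stored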

------
closed
Articles like these are interesting, but what surprises me is that they rarely
set up a holistic use case, so most debates imagine how long it would take an
expert user to use each tool. But time constraints (e.g. time spent coding)
separate expert from novice performance in many domains.

FWIW I have 2020 set aside to implement siuba, a Python port of the popular R
library dplyr (siuba runs on top of pandas, but can also generate SQL). A huge
source of inspiration has been Dave Robinson's screencasts of using R to
analyze data he's never seen before, lightning fast.

Has anyone seen similar screencasts with pandas? I suspect it's not possible
(given some constraints on its interface), but would love to be wrong here,
because I'd like to keep all my work in python :o.

Expert R screencasts:
[https://youtu.be/NY0-IFet5AM](https://youtu.be/NY0-IFet5AM)

Siuba: [https://github.com/machow/siuba](https://github.com/machow/siuba)
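
For a rough sense of the interface (this uses siuba's bundled mtcars sample
data; the aggregation itself is just an illustration):

    from siuba import _, group_by, summarize
    from siuba.data import mtcars

    # dplyr-style verbs piped over an ordinary pandas DataFrame
    result = (mtcars
        >> group_by(_.cyl)
        >> summarize(avg_mpg=_.mpg.mean()))

    print(result)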

~~~
sixhobbits
David Robinson is great and you won't easily find material of the quality he
produces elsewhere.

Take a look at Jake Vanderplas's and Joel Grus's stuff for a general 'people
doing cool things with Python' theme, though it's not quite comparable.

------
jzwinck
The subtitle is "How can you process more data quicker?"

NumPy. It scores an A in Maturity and Popularity, and either an A or a B in
Ease of Adoption depending on which Pandas features you use (e.g. GroupBy).

When you're using NumPy as the main show instead of an implementation detail
inside Pandas, it is easier to adopt Numba or Cython, and there are huge gains
to be made there. Most Pandas workloads on small clusters of say 10 machines
or fewer could be implemented on a single machine.

Even simple operations on smallish data sets are often much faster in NumPy
than Pandas.

You don't have to leave Pandas behind; just try using NumPy and Numba for the
hot parts of your code. Numba even lets you write Python code that runs with
the GIL released, which can give near-linear speedup in the number of cores
with much less work than multiprocessing and without the overhead of copying
data to multiple processes.
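
A minimal sketch of that pattern (the kernel itself is invented): a nopython,
nogil Numba function parallelized over an array with prange.

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True, nogil=True)
    def weighted_score(a, b, w):
        # compiled loop: runs without the GIL and is split across cores
        out = np.empty(a.shape[0])
        for i in prange(a.shape[0]):
            out[i] = w * a[i] + (1.0 - w) * b[i] * b[i]
        return out

    a = np.random.rand(50_000_000)
    b = np.random.rand(50_000_000)
    result = weighted_score(a, b, 0.3)   # first call pays the JIT compile cost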

~~~
haltingproblem
This does not solve the issue of compute scalability: slow computations, which
are fundamentally opaque, applied to large dataframes. Given a series of
dataframes (or one large one that can be chunked), how do I apply a
long-running function to each chunk? For that you need scalability across
cores and machines, hence Dask.
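
A rough sketch of that chunked pattern in Dask, for concreteness (the file
path, the column name, and the slow function are all placeholders):

    import dask.dataframe as dd
    import pandas as pd

    def slow_metric(pdf: pd.DataFrame) -> pd.DataFrame:
        # stand-in for the long-running, opaque per-chunk computation
        return pdf.assign(score=pdf["x"].rolling(1000, min_periods=1).median())

    # one large frame read as many partitions
    ddf = dd.read_parquet("data/*.parquet")

    # the function runs once per partition, across all local cores, or across
    # machines if a dask.distributed cluster is attached
    result = ddf.map_partitions(slow_metric).compute()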

~~~
jzwinck
Why do you consider computations to be opaque? Do you not have the source
code?

There is a ton of low-hanging speedup in many computations that people treat
as black boxes, often as a result of knowing something extra about the
specific input data rather than relying on a generic implementation.

In some cases all you need is to write NumPy code instead of Pandas code for a
2-3x speedup. Then suddenly your small-cluster program runs on one machine.
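
A trivial, made-up example of what "NumPy instead of Pandas" can mean in
practice: the same arithmetic on the underlying arrays, skipping index
alignment and intermediate Series.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"price": np.random.rand(1_000_000),
                       "qty": np.random.randint(1, 100, 1_000_000)})

    # Pandas: each operation allocates an intermediate Series and aligns indexes
    revenue_pd = (df["price"] * df["qty"]).sum()

    # NumPy: the same arithmetic on the raw arrays
    revenue_np = (df["price"].to_numpy() * df["qty"].to_numpy()).sum()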

~~~
isoprophlex
Besides the speedup from using native NumPy, there's also the potential for a
50-100x speedup if your code isn't vectorized to begin with, and anywhere from
1-1000x if there are a couple of joins in there that you can optimize.

But for the latter, see the discussion on shifting the pd compute to an RDBMS
elsewhere in these comments.

------
alexott
It would be interesting to see Koalas compared as well.

~~~
MrPowers
Koalas is the Pandas API on top of Apache Spark, for anyone who's interested:
[https://github.com/databricks/koalas](https://github.com/databricks/koalas)

It works similarly to PySpark and scales to massive datasets (hundreds of
terabytes). Koalas is probably the best bet if you're working on a massive
dataset and want the Pandas API. Or you can simply use PySpark, which has a
cleaner interface.
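
For a rough idea of what that looks like (the path and column names here are
invented), a Koalas workflow reads almost identically to Pandas but executes
on Spark:

    import databricks.koalas as ks

    # the read is distributed across the Spark cluster
    kdf = ks.read_parquet("s3://bucket/events/")

    # Pandas-style syntax, executed as distributed Spark jobs under the hood
    daily = (kdf[kdf["status"] == "ok"]
             .groupby("date")["amount"]
             .sum())

    print(daily.head())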

------
blackbear_
I'll just throw it into the discussion: pandas could simply interface with an
RDBMS and leave the heavy lifting to it.
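
A minimal sketch of that split, assuming a Postgres backend and made-up
tables: push the join and aggregation into SQL, and keep pandas for the small
result.

    import pandas as pd
    from sqlalchemy import create_engine

    # connection string, tables, and columns are placeholders
    engine = create_engine("postgresql://user:pass@host/db")

    query = """
        SELECT c.region,
               date_trunc('month', o.created_at) AS month,
               sum(o.amount) AS revenue
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        GROUP BY 1, 2
    """

    # the database plans and executes the heavy join and aggregation;
    # pandas only receives the already-small result set
    summary = pd.read_sql_query(query, engine)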

~~~
isoprophlex
In my opinion, pandas is fundamentally broken and unsuitable for any
production workload.

The heavy lifting should be left to an RDBMS, like you say: something with a
sensible, battle-hardened query planner. I've written and debugged too many
lines of manual pd joins/merges; something declarative like SQL is much nicer
because the query planner is almost always right.

Furthermore, as a user, I've always found the pandas API very confusing. I'm
always having to interrupt my workflow to figure out boring details of the API
(is it df.groupBy().rolling(center=True).median(), or some other
permutation?), whereas e.g. PySpark or SQL is so much more ergonomic.

Finally, typing inside pd dataframes is a complete and utter nightmare: Int64
missing a null, or the idiocy around datetimes expressed as epoch
nanoseconds...

Pandas is nice for noodling around in notebooks. But for me, it should never
be used beyond that.

~~~
disgruntledphd2
Pandas combines the intuitive nature of base-R with the excellent missing data
model of Python.

------
sandGorgon
The tie-breaker here really is Kubernetes. Most likely your company's
infrastructure runs on k8s, and as a data scientist you do not get control
over that.

Dask natively integrates with Kubernetes. That's why I see a lot of people
moving away even from Apache Spark (which is generally deployed on YARN) and
towards Dask.

The second reason is that the dask-ml project is building seamless
compatibility for higher-level ML libraries (scikit-learn, etc.) on top of
Dask, not just NumPy/Pandas.
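
A minimal sketch of the dask-ml side of that (the scheduler address is a
placeholder): an ordinary scikit-learn estimator whose hyperparameter search
is fanned out across Dask workers.

    from dask.distributed import Client
    from dask_ml.model_selection import GridSearchCV
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # attach to an existing Dask cluster, e.g. one started on k8s with
    # dask-kubernetes
    client = Client("tcp://dask-scheduler:8786")

    X, y = make_classification(n_samples=100_000, n_features=20)

    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          {"C": [0.01, 0.1, 1.0, 10.0]},
                          cv=5)
    search.fit(X, y)            # each candidate fit runs on a Dask worker
    print(search.best_params_)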

------
marcell
I'm working on a project called CloudPy that's a one-line drop-in for pandas
to process 100 GB+ dataframes.

It works by proxying the dataframe to a remote server, which can have a lot
more memory than your local machine. The project is in beta right now, but
please reach out if you're interested in trying it out! You can read more at
[http://www.cloudpy.io/](http://www.cloudpy.io/) or email hello@cloudpy.io.

------
untoreh
[https://github.com/weld-project/weld](https://github.com/weld-project/weld)
This is in the same domain as Vaex, I suppose, though the scope is much larger.

------
jdnier
Any sense of where Apache Arrow will fit in these libraries?

~~~
hcrisp
"Behind the scenes, RAPIDS is based on Apache Arrow..."

[https://www.forbes.com/sites/janakirammsv/2018/11/05/nvidia-...](https://www.forbes.com/sites/janakirammsv/2018/11/05/nvidia-brings-the-power-of-gpu-to-data-processing-pipelines/)
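
For a rough idea of where Arrow sits, a tiny illustrative sketch of moving a
frame into Arrow's columnar format and back; that shared in-memory layout is
what RAPIDS builds on and what lets these libraries exchange data cheaply.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": range(5), "value": [0.1, 0.2, 0.3, 0.4, 0.5]})

    # pandas -> Arrow: a language-agnostic, columnar in-memory table
    table = pa.Table.from_pandas(df)

    # the same format serializes naturally to Parquet and back
    pq.write_table(table, "example.parquet")
    round_tripped = pq.read_table("example.parquet").to_pandas()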

