
Pandas on Ray – Early Lessons from Parallelizing Pandas - xmo
https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/
======
miggyrozay
How does this compare to dask.distributed? Dask dataframes are also a wrapper
around the pandas API.

edit: They explain the differences in a section of this blog post:
[https://rise.cs.berkeley.edu/blog/pandas-on-ray/](https://rise.cs.berkeley.edu/blog/pandas-on-ray/)

------
innagadadavida
Does anyone here know if Ray is some sort of Yarn competitor? If not, what
problem space is it in?

~~~
mehrdadn
I don't know what Yarn is, but Ray is a distributed heterogeneous computing
framework. It's meant to make it easy to take fairly arbitrary computational
programs (with machine learning being an important test/use case) and
run/debug them in parallel across lots of machines with high performance in a
natural fashion and without drastic changes. [1] The advantage compared to
(say) Hadoop is that it allows for heterogeneous programming models and isn't
limited to or designed for (say) MapReduce; you can invoke functions in
parallel in a dynamic fashion pretty easily.

[1] You can get an idea of that here:
[https://ray.readthedocs.io/en/latest/](https://ray.readthedocs.io/en/latest/)
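
To make the "invoke functions in parallel in a dynamic fashion" point concrete, here is a rough stdlib-only sketch of that task model. (Ray's actual API uses `@ray.remote`, `f.remote()`, and `ray.get()`, and scales across machines; this analogy runs on one machine and the `square` function and inputs are just illustrative.)

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Any ordinary Python function can be farmed out as a task.
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # Dynamically fan out work as plain function calls --
    # no fixed MapReduce-style pipeline required.
    futures = [pool.submit(square, i) for i in range(8)]
    results = [f.result() for f in futures]

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```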

------
kgos
Code Repo: [https://github.com/modin-project/modin](https://github.com/modin-project/modin)

~~~
smittywerben
For those of you confused like me, "Pandas on Ray has moved into the Modin
project"

[http://ray.readthedocs.io/en/latest/pandas_on_ray.html](http://ray.readthedocs.io/en/latest/pandas_on_ray.html)
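
For context, Modin's documentation advertises a one-line import change from plain pandas. A small sketch of that usage, with a pandas fallback so it runs even where Modin isn't installed (the `DataFrame` contents here are made up for illustration):

```python
# Modin's advertised usage is "import modin.pandas as pd";
# fall back to plain pandas so this sketch runs without Modin.
try:
    import modin.pandas as pd  # distributed execution, backed by Ray
except ImportError:
    import pandas as pd        # single-machine fallback

df = pd.DataFrame({"city": ["NYC", "SF", "NYC"],
                   "sales": [10, 20, 30]})
totals = df.groupby("city")["sales"].sum()
print(totals)
```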

------
axiom92
This could be really helpful for implementations that were written with
relatively small datasets in mind but now need to be scaled up. However, for
someone starting from scratch, it is not clear what advantages they plan to
offer over Spark with the DataFrame API.

------
rmbeard
Unclear what this is good for.

~~~
AdamM12
It's a WIP distributed Pandas implementation. It lets you spread massive
dataframes, their data and computations, across multiple machines, unlike
Pandas, which is local to a single machine. It's not quite there yet [1]

[1] [http://modin.readthedocs.io/en/latest/pandas_on_ray.html#usi...](http://modin.readthedocs.io/en/latest/pandas_on_ray.html#using-pandas-on-ray-on-a-cluster)

~~~
nerdponx
Pandas replacing PySpark? Sign me up.

~~~
makmanalp
Dask is already this! They have a dataframe replacement, a numpy array
replacement, and some lower-level primitives like dask.delayed too. Plus, the
nice thing is that it's already being used with large amounts of data, and the
warts (which were plentiful two years ago) are rapidly disappearing.

[http://matthewrocklin.com/blog/work/2018/06/26/dask-scaling-...](http://matthewrocklin.com/blog/work/2018/06/26/dask-scaling-limits)

~~~
nerdponx
Nice, I've heard of it before but never used it. How is deployment compared to
Spark?

~~~
makmanalp
Deployment is /way/ simpler, especially if you already have a Python
environment (even more so with conda), which is one of the main attractions
for me. The other is that the API is much richer - Spark dataframes tie your
hands in so many ways and require you to write tons of custom code for
routine stuff, while pandas (and dask) have built-ins for almost everything
imaginable these days.
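
As one illustration of the "built-ins for almost everything" point, a reshaping task that is a pandas one-liner via `pivot_table` (the sample sales data here is invented for the example):

```python
import pandas as pd

# Long-format records: one row per (region, year) observation.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "year":   [2017, 2018, 2017, 2018],
    "sales":  [100, 110, 90, 95],
})

# One built-in call pivots to a wide region x year table.
wide = df.pivot_table(index="region", columns="year", values="sales")
print(wide)
```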

The tradeoff is that Spark and Hadoop in general have invested /serious/
effort into resilience, while dask only really protects against worker
failure, not scheduler failure. In practice, is this really an issue? _shrug_
It depends. "Works for me." How many tasks do you run in parallel? If you're
doing ad-hoc analysis on a very large dataset, dask might be a great fit. If
you have a data-warehouse use case with tons of people running analytics
queries, then you have uptime requirements, and Spark might be a better fit.
That's where I ended up.

------
guard0g
This looks interesting. Thanks for sharing; I'll have my DS team try it out.

