
Pandas on Ray – Make Pandas faster - dsr12
https://rise.cs.berkeley.edu/blog/pandas-on-ray/
======
chrisaycock
The one line of code is

    
    
      import ray.dataframe as pd
    

They've replaced many pandas functions with an identical API that runs actions
in parallel on top of Ray, a task-parallel library:

[https://github.com/ray-project/ray](https://github.com/ray-project/ray)
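
A rough sketch of what that drop-in usage looks like (the file name and the
particular methods shown are illustrative; only a subset of the pandas API is
covered so far):

    
    
        import ray.dataframe as pd
        
        # Same calls as stock pandas, but the work is farmed out to Ray workers.
        df = pd.read_csv("example.csv")   # "example.csv" is a placeholder
        print(df.head())
    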

Unlike Dask, Ray can communicate between processes without serializing and
copying data. It uses Plasma, the shared-memory object store that ships with
Apache Arrow:

[http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)

Worker processes (scheduled by Ray's computation graph) simply map the
required memory region into their address space.
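
A minimal sketch of that shared-memory behavior, using Ray's public
`ray.put`/`ray.get` (the array size is arbitrary; for large numpy arrays the
read comes out of the object store without a pickled copy):

    
    
        import numpy as np
        import ray
        
        ray.init()
        
        big = np.ones((10000, 1000))   # ~80 MB of float64
        obj_id = ray.put(big)          # stored once in the shared-memory object store
        
        view = ray.get(obj_id)         # workers and the driver read the same buffer
        print(view.shape)
    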

------
JPKab
As a person who LOVES pandas, numpy, scikit, and all things SciPy, I really
wish these kinds of posts would take a few seconds to include a link, or maybe
just a quick paragraph, to answer one question:

What the %$#@ is Ray?

I make a habit of doing this myself whenever I write a post like this. Sure, I
was able to look up Ray from RISELab and figure this out myself, but I wish I
didn't have to.

From the Ray homepage:

Ray is a high-performance distributed execution framework targeted at large-
scale machine learning and reinforcement learning applications. It achieves
scalability and fault tolerance by abstracting the control state of the system
in a global control store and keeping all other components stateless. It uses
a shared-memory distributed object store to efficiently handle large data
through shared memory, and it uses a bottom-up hierarchical scheduling
architecture to achieve low-latency and high-throughput scheduling. It uses a
lightweight API based on dynamic task graphs and actors to express a wide
range of applications in a flexible manner.
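
To make that last sentence concrete, here is a minimal sketch of the task and
actor API from Ray's documentation (the function and class are toy examples):

    
    
        import ray
        
        ray.init()
        
        @ray.remote
        def square(x):
            # a remote task: returns a future immediately, runs on a worker
            return x * x
        
        futures = [square.remote(i) for i in range(4)]
        print(ray.get(futures))   # [0, 1, 4, 9]
        
        @ray.remote
        class Counter:
            # an actor: a stateful worker addressed via method calls
            def __init__(self):
                self.n = 0
            def increment(self):
                self.n += 1
                return self.n
        
        c = Counter.remote()
        print(ray.get(c.increment.remote()))   # 1
    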

Check out the following links!

Codebase: [https://github.com/ray-project/ray](https://github.com/ray-project/ray)

Documentation: [http://ray.readthedocs.io/en/latest/index.html](http://ray.readthedocs.io/en/latest/index.html)

Tutorial: [https://github.com/ray-project/tutorial](https://github.com/ray-project/tutorial)

Blog: [https://ray-project.github.io](https://ray-project.github.io)

Mailing list: ray-dev@googlegroups.com

~~~
IIAOPSW
When people write obtusely, I simply read it obtusely and move on with my
life.

In my imagination, this article is about some guy named Ray teaching pandas
how to run faster.

------
sqquuiiiddd
Comparison to Dask: [https://github.com/ray-project/ray/issues/642](https://github.com/ray-project/ray/issues/642)

------
neves
OK, it would be quicker, but is it a free lunch? Is there any chance it would
introduce bugs into my code and produce wrong calculations? It's no problem if
it fails to optimize or even crashes; I can always go back to the original
version.

------
cottonseed
> ... 100's of terabytes of biological data. When working with this kind of
> data, Pandas is often the most preferred tool

"biological data" is a bit vague, but for the data I know to be that big,
sequence and array data, it does not naturally have the structure of a
dataframe nor is pandas the tool of choice.

~~~
thewizardofaus
Yeah, I tend to use modified hdf5 tools for data that big.

~~~
sfsylvester
This is exactly my go-to move as well.

pandas.read_hdf has beaten ray.dataframe.read_csv on speed for the few files
I've tested so far. But I imagine the flexibility CSVs have over HDF5 (I've
never edited an HDF5 file with a Unix command, for example) is why this new
approach could get some traction.
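
For context, a rough timing sketch of that comparison (the file names and the
HDF5 key are placeholders, and the outcome obviously depends on the data):

    
    
        import time
        import pandas as pd
        import ray.dataframe as rpd   # the module from the post
        
        start = time.time()
        df_hdf = pd.read_hdf("data.h5", key="table")   # placeholder file and key
        print("pandas.read_hdf:", time.time() - start)
        
        start = time.time()
        df_csv = rpd.read_csv("data.csv")              # placeholder file
        print("ray.dataframe.read_csv:", time.time() - start)
    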

~~~
tavert
Try Parquet if your data is tabular; pyarrow and related tools are getting
Parquet to speeds pretty comparable to HDF5, with arguably more flexibility
and a better multithreading story.
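
A minimal sketch of that round trip with pyarrow (the DataFrame contents and
file name are made up):

    
    
        import pandas as pd
        import pyarrow as pa
        import pyarrow.parquet as pq
        
        df = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})
        
        # write via Arrow's columnar representation, then read back into pandas
        pq.write_table(pa.Table.from_pandas(df), "data.parquet")
        df2 = pq.read_table("data.parquet").to_pandas()
        print(df2.dtypes)
    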

------
kmax12
How does Ray compare to Spark? Is there a reason to use Spark once libraries
like Dask or Ray become more mature?

~~~
lmeyerov
RE:spark, we're curious about Ray mostly because of the potential for
interactive-time (ms-level) compute for powering user-facing software.
RE:dask, we care that Ray interops with the rest of our stack (Arrow). I
haven't evaluated Ray-on-pandas, and the Ray was previously focused on
powering traditional ML, so again, just first blush on the announce.

I don't think anything is inherent, more about priorities and momentum. For
example, Spark devs have been working on cutting latency, and Conda Inc is/was
contributing to the Arrow world. I had assumed the pygdf project would get to
accelerating arrow dataframe compute before others, so this announce was a
pleasant surprise!

------
tmaic
For Mac and Linux:

    
    
        pip install ray
    

For Windows:

    
    
        ¯\_(ツ)_/¯

~~~
tavert
[https://github.com/ray-project/ray/issues/631](https://github.com/ray-project/ray/issues/631)

------
lmeyerov
We're super excited about the overall project at Graphistry, and hadn't
realized there was a Pandas-on-Ray component! The first line with the GPU
count is intriguing given what we do :) We can't wait to try it on our pandas
code.

For node/data hackers: our team is trying to bring the full, accelerated
PyData world to JavaScript. We started with Arrow columnar data bindings
([https://github.com/apache/arrow/tree/master/js](https://github.com/apache/arrow/tree/master/js)).
Next stop is Plasma bindings in Node for zero-copy Node<>PyData interop. That
would enable nearly-free calls from Node web apps (and the like) to
accelerated Pandas-on-Ray. If others are interested in contributing, let me/us
know!

------
dsr12
Github link: [https://github.com/ray-project/ray](https://github.com/ray-project/ray)

------
prashnts
I've found that splitting the dataframe and using the multiprocessing module
to apply a function to each chunk is quite efficient. One can use the
`groupby` method for that, or just slice the dataframe.

For example:

    
    
        import multiprocessing
        import pandas as pd
        
        concurrency = 4  # number of cores
        chunk_size = len(df) // concurrency + 1  # df is the full DataFrame
        chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
        pool = multiprocessing.Pool(processes=concurrency)
        results = pool.map(fn, chunks)  # fn is a callable that computes on one chunk
        pool.close()
        result = pd.concat(results)
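
A fuller, runnable sketch of the same pattern (the DataFrame contents and the
per-chunk function are made up for illustration):

    
    
        import multiprocessing
        import pandas as pd
        
        def fn(chunk):
            # example per-chunk work: add a derived column
            chunk = chunk.copy()
            chunk["y"] = chunk["x"] * 2
            return chunk
        
        if __name__ == "__main__":
            df = pd.DataFrame({"x": range(1000000)})
            size = len(df) // 4 + 1
            chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
            with multiprocessing.Pool(processes=4) as pool:
                results = pool.map(fn, chunks)
            out = pd.concat(results, ignore_index=True)
            print(len(out))   # 1000000
    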

------
xiaodai
What's not covered are the crucial operations of group-by and reduce!!!
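
For clarity, this is the kind of group-by/reduce I mean (a generic pandas
illustration, nothing specific to the post):

    
    
        import pandas as pd
        
        df = pd.DataFrame({"key": ["a", "b", "a", "b"], "x": [1, 2, 3, 4]})
        
        # group-by: partition rows by key; reduce: aggregate each group (sum, mean, ...)
        print(df.groupby("key")["x"].sum())
    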

~~~
kgos
One of the Pandas on Ray coauthors here: We're planning on releasing another
post in the next few weeks discussing the technical details around group-bys.
Stay tuned!

------
thomzi12
Could someone compare this to Dask? Would you use the two in different
situations?

