
Fast Python Serialization with Ray and Apache Arrow - rshin
https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html
======
fermigier
Robert, this seems like an exciting project (I'd already looked at it when I
stumbled upon previous announcements), but it's lacking a "what's in it for me"
factor, and at this point, I honestly don't know if it's for me or not.

"Ray is a flexible, high-performance distributed execution framework." is
certainly a nice tagline, but I'm pretty sure there are other projects in this
domain - what's the USP? Who's using it, and for what?

~~~
pcmoritz
For Ray, the main use case at the moment is parallelizing/distributing machine
learning algorithms. People have been using it for parallelizing MC(MC)-style
applications, doing hyperparameter search, and (pre-)processing data, and we
are using it for reinforcement learning (and have a library for that, see
[http://ray.readthedocs.io/en/latest/rllib.html](http://ray.readthedocs.io/en/latest/rllib.html)).

More broadly, it is useful for many parallel/distributed Python applications
where low task latency (~1 ms) and high task throughput are requirements.
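
To give a feel for the programming model, here is a minimal sketch of Ray's
task API (the function and inputs are made up for illustration):

```python
import ray

ray.init()  # start Ray on the local machine

@ray.remote
def square(x):
    # executed asynchronously in a separate worker process
    return x * x

# launch tasks in parallel; each call returns a future immediately
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```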

~~~
Eridrus
Python is almost synonymous with shitty performance in my mind; am I just
wrong about typical Python performance, or are you doing something special to
make this a non-issue (e.g. the way numpy essentially shoves all the work
into C), or is the flexibility Ray affords worth the performance penalty for
your users?

~~~
pcmoritz
There are two considerations here.

(1) Python single-threaded performance: here, most of the libraries we use are
implemented in C++ (like numpy and TensorFlow), or we speed up the code with
Cython. Ray is orthogonal to that.

(2) Python parallel performance: here Python is problematic mostly because of
its limited support for threading (the GIL is one problem); we handle this by
using multiple processes with shared memory throughout. Efficient
serialization is what makes this feasible.

The core of Ray is implemented in C++, so performance is not an issue there;
all of the serialization is implemented in C++ as well.
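
As a hedged illustration of the shared-memory pattern (the array and its size
are made up): objects placed in Ray's object store can be read by worker
processes without copying.

```python
import numpy as np
import ray

ray.init()

big = np.zeros(10**7)  # ~80 MB array
ref = ray.put(big)     # serialized once into the shared-memory object store

@ray.remote
def head(arr):
    # the worker receives a zero-copy, read-only view backed by shared memory
    return arr[:3]

print(ray.get(head.remote(ref)))  # [0. 0. 0.]
```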

------
subhobroto
Apologies for being a Debbie Downer, but it looks like the authors only looked
into Protocol Buffers and Flatbuffers.

Did they consider CapnProto?

In my limited experience with efficient serdes for analytics, CapnProto would
meet requirements 1 - 4.

Arrow is an in-memory format, and CapnProto would do the on-disk (memmap)
heavy lifting with (very) little overhead. My CapnProto/LMDB data structures
let me write Python that is cleaner than the equivalent C++, with comparable
runtime performance.

The only issue I have with CapnProto is getting it to run on Windows 7 x64;
it works out of the box on Linux.
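
For reference, a rough sketch of what reads look like with pycapnp (the schema
and file names here are hypothetical):

```python
import capnp

# point.capnp (hypothetical schema):
#   struct Point { x @0 :Float64; y @1 :Float64; }
point_capnp = capnp.load('point.capnp')

# write a message to disk
p = point_capnp.Point.new_message(x=1.0, y=2.0)
with open('point.bin', 'wb') as f:
    p.write(f)

# read it back; fields are decoded lazily on access rather than parsed up front
with open('point.bin', 'rb') as f:
    q = point_capnp.Point.read(f)
    print(q.x, q.y)
```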

~~~
pcmoritz
Author here. From our perspective, CapnProto has similar characteristics to
Flatbuffers, and the reasons to prefer Arrow over it are the same: we would
need to develop a mapping from Python types to CapnProto from scratch, whereas
Arrow already has many facilities that are useful for us (tensor
serialization, code to deal with some Python types like datetimes, zero-copy
DataFrames, a larger ecosystem for interfacing with other formats like
Parquet, reading from HDFS, etc.). And it is designed for big data. So Arrow
was a very natural choice (and it also supports Windows). Wes is doing some
amazing work here!
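
For concreteness, here is roughly what the Arrow-based serialization from the
post looks like with the pyarrow API of that era (pa.serialize was later
deprecated; the example object is made up):

```python
import numpy as np
import pyarrow as pa

# a nested Python object with a large numpy payload
data = {'weights': np.random.rand(1000), 'step': 42, 'tags': ['a', 'b']}

buf = pa.serialize(data).to_buffer()  # serialize into an Arrow buffer
restored = pa.deserialize(buf)        # numpy arrays come back without a copy
```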

~~~
elcritch
It looks like Arrow uses FlatBuffers internally for its metadata [1]. Seems
like using Arrow provides a lot of scaffolding that would otherwise need to be
built for this particular use case.

1:
[https://arrow.apache.org/docs/metadata.html](https://arrow.apache.org/docs/metadata.html)

------
wdroz
It's sad that the standard Python multiprocessing module doesn't allow you to
select your own serializer.

~~~
splike
Indeed it is.

My workaround for the failures of pickle is to use pathos. Its API is almost
identical to the default multiprocessing module, which makes it easy to use
as a drop-in replacement. Pathos uses dill to serialise, which can serialise
many more kinds of objects, but you pay a price in performance. And again,
you can't swap out dill for the serialiser of your choice.
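
As a rough sketch of the drop-in swap (pool size and function are arbitrary):

```python
from pathos.multiprocessing import ProcessingPool

# dill (used by pathos under the hood) can pickle lambdas,
# which the stdlib multiprocessing pickler rejects
pool = ProcessingPool(nodes=4)
results = pool.map(lambda x: x * x, range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```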

------
karmakaze
For a moment I confused this with Apache Avro, which also does serialization.

------
Derbasti
This looks cool! How does it compare to dill?

~~~
subhobroto
Great question.

Not the author, but I am assuming it breaks two of their requirements:

> 4. ... it should not require reading the entire serialized object

> 5. ... It should be language independent

I am curious about their thoughts about CapnProto though:
[https://news.ycombinator.com/item?id=15485082](https://news.ycombinator.com/item?id=15485082)

~~~
pcmoritz
Author here. For CapnProto, see the answer to the other comment above.

Concerning dill: we had been using it for serializing functions and classes
(and then switched to cloudpickle because it supports some Python
functionality better and the community around it is very active and
responsive). cloudpickle/dill are great in that they support a very wide
variety of Python objects, especially code (functions, lambdas, classes); they
are less ideal for large data, because there is no zero-copy mechanism, the
format is not standardized, and serialization/deserialization can be slow,
sometimes slower than pickle. We do fall back to cloudpickle for objects we
don't support, like Python lambdas. We also use it to serialize class
definitions, so the data associated with a class is serialized using the
solution presented above and the code/methods are serialized using
cloudpickle. This combines the advantages of both solutions.
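
A hedged sketch of that split (the lambda and array are made up; pa.serialize
is the 2017-era pyarrow API):

```python
import cloudpickle
import numpy as np
import pyarrow as pa

# code path: cloudpickle handles objects plain pickle can't, e.g. lambdas
fn = cloudpickle.loads(cloudpickle.dumps(lambda x: x + 1))

# data path: large arrays go through Arrow for fast, copy-free transfer
arr = pa.deserialize(pa.serialize(np.arange(10)).to_buffer())

print(fn(arr[0]))  # 1
```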

------
RHSman2
This is such an awesome project.

