Fast Python Serialization with Ray and Apache Arrow (ray-project.github.io)
76 points by rshin on Oct 16, 2017 | 16 comments



Robert, this seems like an exciting project (I already looked at it when I stumbled upon previous announcements), but it's lacking a "what's in it for me" factor, and at this point I still don't know whether it's for me or not.

"Ray is a flexible, high-performance distributed execution framework." is certainly a nice tagline, but I'm pretty sure there are other projects in this domain - what's the USP ? who's using it and for what ?


For Ray, the main use case at the moment is parallelizing/distributing machine learning algorithms. People have been using it for parallelizing MC(MC)-style applications, doing hyperparameter search, and (pre-)processing data, and we are using it for reinforcement learning (and have a library for that, see http://ray.readthedocs.io/en/latest/rllib.html).

More broadly, it is useful for many parallel/distributed Python applications where low task latency (~1 ms) and high task throughput are requirements.
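
To give a flavour of the programming model, here is a minimal sketch of the task API (see the docs linked above for the authoritative version; details may differ slightly between releases):

    import ray

    ray.init()  # start Ray locally

    @ray.remote
    def square(x):
        # an ordinary Python function, executed asynchronously in a worker process
        return x * x

    # .remote() returns object IDs immediately; the tasks run in parallel
    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))  # [0, 1, 4, 9]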


Python is almost synonymous with shitty performance in my mind; am I just wrong about typical Python performance, or are you doing something special to make this a non-issue (e.g. the way numpy essentially shoves all the work into C), or is the flexibility Ray affords worth the performance penalty for your users?


There are two considerations here.

(1) Python single-threaded performance: here, most of the libraries we are using are implemented in C++ (like numpy and TensorFlow, plus Cython to speed up code, etc.). Ray is orthogonal to that.

(2) Python parallel performance: Here Python is mostly problematic because of its lack of support for threading (the GIL is one problem here); we handle this problem by using multiple processes and shared memory throughout. Efficient serialization makes this feasible.

The core of Ray is implemented in C++, so performance is not an issue there; all of the serialization is also implemented in C++.
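
As a rough illustration of the shared-memory approach (a sketch, not benchmarked code): a large numpy array is serialized once into the object store, and worker processes read it back without copying it.

    import numpy as np
    import ray

    ray.init()

    # serialize the array once into the shared-memory object store
    array = np.zeros(10**8)
    array_id = ray.put(array)

    @ray.remote
    def mean(x):
        # the worker receives a read-only view backed by shared memory
        return x.mean()

    print(ray.get(mean.remote(array_id)))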


Adding Java interop, as the authors aim to do, would make this extremely interesting for some of my use cases.


Apologies for being a Debbie Downer, but it looks like the authors looked into Protocol Buffers and FlatBuffers.

Did they consider CapnProto?

In my limited experience with efficient serdes for analytics, CapnProto would meet requirements 1-4.

Arrow is an in-memory format, and CapnProto would handle the on-disk (memmap) heavy lifting with (very) little overhead. My CapnProto/LMDB data structures let me write code in Python that is better than the C++ equivalent, with comparable runtime performance.

The only issue I have with CapnProto is getting it to run on Windows 7 x64, but it works out of the box on Linux.


Author here. From our perspective, CapnProto has similar characteristics to FlatBuffers, and the reasons to prefer Arrow over it are the same: we would need to develop a mapping from Python types to CapnProto from scratch, whereas Arrow already provides many facilities that are useful for us (tensor serialization, code to deal with some Python types like datetimes, zero-copy DataFrames, a larger ecosystem for interfacing with other formats like Parquet, reading from HDFS, etc.), and it is designed for big data. So Arrow was a very natural choice (and it also supports Windows). Wes is doing some amazing work here!
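
For the curious, the pyarrow serialization API the post builds on looks roughly like this (a sketch; check the pyarrow docs for the exact calls in your version):

    import numpy as np
    import pyarrow as pa

    # a nested Python object containing large numpy arrays
    data = {"weights": np.random.rand(1000, 1000), "step": 42}

    buf = pa.serialize(data).to_buffer()  # Arrow-based serialization to a buffer
    restored = pa.deserialize(buf)        # the numpy arrays are read back without copying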


It looks like Arrow utilizes FlatBuffers internally [1]. It seems like using the Arrow project provides a lot of scaffolding that would otherwise need to be built for this particular use case.

1: https://arrow.apache.org/docs/metadata.html


It's sad that the standard Python multiprocessing module doesn't allow you to select your own serializer.


Indeed it is.

My workaround for the failures of pickle is to use pathos. Its API is almost identical to the default multiprocessing module, which makes it easy to use as a drop-in replacement. Pathos uses dill to serialise, which can handle many more things, but you pay a price in performance. And again, you can't swap out dill for the serialiser of your choice.
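
Roughly what that looks like in practice (a sketch; assumes pathos and dill are installed):

    import dill
    from pathos.multiprocessing import ProcessingPool as Pool

    square = lambda x: x * x

    # the standard pickle module cannot serialise lambdas, but dill can
    payload = dill.dumps(square)
    restored = dill.loads(payload)

    # pathos pools use dill under the hood, so lambdas and closures can be mapped directly
    print(Pool(4).map(square, range(10)))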


Urg, when I looked inside the multiprocessing module it made my eyes bleed.

There are too many bits that are not customisable like that (amongst other things).


For a moment I confused this with Apache Avro, which also does serialization.


This looks cool! How does it compare to dill?


Great question.

Not the author, but I am assuming it breaks two of their requirements:

> 4. ... it should not require reading the entire serialized object

> 5. ... It should be language independent

I am curious about their thoughts on CapnProto though: https://news.ycombinator.com/item?id=15485082


Author here. For CapnProto see the answer to the other comment above.

Concerning dill, we had been using it for serializing functions and classes (and then switched to cloudpickle because it supports some Python functionality better and the community around it is very active and responsive). cloudpickle/dill are great in that they support a very wide variety of Python objects, especially code (functions, lambdas, classes). They are less ideal for large data, because there is no zero-copy mechanism, the format is not standardized, and serialization/deserialization can be slow, sometimes slower than pickle. We do fall back to cloudpickle for objects we don't support, like Python lambdas. We also use it to serialize class definitions, so the data associated with a class is serialized using the solution presented above and the code/methods are serialized using cloudpickle. This combines the advantages of both solutions.
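
To illustrate why cloudpickle is the fallback for code (not Ray's internals, just a small sketch):

    import pickle
    import cloudpickle

    f = lambda x: x + 1

    try:
        pickle.dumps(f)
    except Exception as e:
        print("pickle fails on lambdas:", e)

    # cloudpickle serializes the function's code itself, so it can be shipped to workers
    payload = cloudpickle.dumps(f)
    g = cloudpickle.loads(payload)
    print(g(41))  # 42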


This is such an awesome project.





