
Evaluating Ray: Distributed Python for Scalability - luord
https://blog.dominodatalab.com/evaluating-ray-distributed-python-for-massive-scalability/
======
Jupe
A quick browse and this looks interesting. Does anyone know if ray is hadoop-
like (meaning it brings code to the existing data), or does it serialize the
data to be processed amongst nodes? The docs indicate that it uses Apache
Arrow to serialize state (which I assume to be the accumulated state of the
processed data), but don't mention the actual data to be processed.

~~~
dwampler
It's a very different system. Hadoop was structured to compute over massive
data sets, especially at a time when hardware forced moving compute to the
data to optimize overall performance. (That's less true today; networks
sometimes are faster than disk access!) In Ray, when you annotate a function
to make it a remote task or annotate a class to make it a remote actor, the
data passed to the function (or the data and object state, for the actor) are
serialized together and moved to the node where execution will occur. Then
results (or object/actor state) are retrieved with the `ray.get` method. So
the "chunks" of data are relatively small compared to Hadoop; the work is
much more fine-grained.

~~~
Jupe
Thank you! Very informative answer.

I like this design, it seems much more "manageable" from a mere human's
perspective.

