
A Comparison of Distributed Machine Learning Platforms - ipsum2
http://muratbuffalo.blogspot.com/2017/07/a-comparison-of-distributed-machine.html
======
agibsonccc
I would encourage anyone to read these kinds of "evaluations" with a
critical eye.

When doing evaluations there are a lot of nuances.

I will comment on a real world example from our own experience.

Disclaimer: I am biased, as I run a company implementing my own platform. I
will try to stick to the things folks should look for when evaluating
platforms like these rather than inviting a "my tool vs. yours" debate.

For deep learning on Spark specifically, there are two workloads: training and
inference.

Your data pipeline here matters a lot. A job that pulls data from Hive will,
_surprise_, be bottlenecked by Hive. By extension, the kind of data and small
things like the number of records you fetch or cache on, say, HDFS matter a
TON.
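
As a purely illustrative sketch of the kind of knob this implies, here is what
pulling a Hive table once, repartitioning it, and caching it before training
looks like in pyspark; the table name, columns, and partition count are made
up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # The Hive read is the slow part, so do it once, repartition to match
    # the number of workers, and cache before the training loop iterates.
    df = (spark.table("features_table")        # hypothetical table
               .select("label", "features")    # hypothetical columns
               .repartition(64)
               .cache())
    df.count()  # materialize the cache up front, not mid-training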

On our platform, we implement 2 versions of distributed communications: spark
driver and parameter server. The parameter server is faster. The latter is
what mxnet and tensorflow do as well. Tensorflow uses grpc underneath. mxnet
writes its own.

While both Spark's computation graph and a deep learning framework's graph are
data flow graphs, the latter is optimized for linear algebra and, I would
argue, has very different requirements (for example GPUs, which are another
level of complexity).

When using these parameter servers, you should look at what guarantees these
things provide. For example, do you need consistent latency? Fault tolerance?
At what scale?

Something as simple as "do these parameter servers do p2p on each node?" matters.

Spark is not well designed for that; its workers are meant to be stateless.
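
To make the pattern concrete, here is a toy, purely illustrative asynchronous
parameter server in Python; it is not any framework's actual API and has none
of the guarantees mentioned above (no fault tolerance, no sharding, no network
layer):

    import threading
    import numpy as np

    class ToyParameterServer:
        """Toy asynchronous parameter server; illustration only."""

        def __init__(self, dim, lr=0.01):
            self.params = np.zeros(dim)
            self.lr = lr
            self.lock = threading.Lock()

        def pull(self):
            # Workers fetch a snapshot; by the time they push a gradient
            # back, this snapshot may already be stale.
            with self.lock:
                return self.params.copy()

        def push(self, grad):
            with self.lock:
                self.params -= self.lr * grad

Every question above (latency guarantees, fault tolerance, p2p between nodes)
is really about what replaces the single lock and the single in-memory array
in a production system.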

With Spark, another gotcha (especially on the linear algebra side) is on-heap
memory. When you are using, say, pyspark, how you handle data movement matters
a lot and, again _surprise_, ends up being your bottleneck.

We use pyjnius instead in our python interface for linear algebra:
[https://github.com/deeplearning4j/jumpy](https://github.com/deeplearning4j/jumpy)

We implemented this because we liked the idea of being able to pass numpy
pointers straight to the JVM rather than copying data. I believe pytorch and
co (which are straight C/C++ underneath) do this as well. That is the right
way to do it.
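
A minimal sketch of the zero-copy idea, not jumpy's actual API: numpy will
hand you the address of its underlying buffer, and a native (or JVM-side)
consumer can wrap that address instead of serializing the array:

    import numpy as np

    # Zero-copy idea only; jumpy's real API differs.
    arr = np.ascontiguousarray(np.random.rand(1024, 1024), dtype=np.float64)

    address = arr.ctypes.data   # raw pointer to the underlying buffer
    shape = arr.shape           # needed to interpret the buffer
    dtype = arr.dtype.str       # e.g. '<f8' for little-endian float64

    # A binding layer would pass (address, shape, dtype) across the FFI
    # boundary instead of pushing ~8 MB of copied data through py4j.
    print(address, shape, dtype)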

The default is py4j (which overall is more versatile but not meant for DL)

Don't even get me started on a normal etl pipeline vs "images", which is
typically what you see.

Another big complexity with, say, Spark is GPU usage. When using Mesos, you
can only reserve a whole GPU, not a fraction of one.

Another thing is worker setup. For deep learning specifically: what BLAS did
they use? How was it compiled? Was it optimized for the right CPU
architecture?
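
On the Python side, one quick (illustrative) sanity check is to print which
BLAS/LAPACK numpy was actually linked against on each worker; an unoptimized
reference BLAS will quietly dominate training time:

    import numpy as np

    # Prints the BLAS/LAPACK libraries numpy was built against
    # (e.g. OpenBLAS vs. an unoptimized reference implementation).
    np.show_config()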

How do the workers communicate with the GPU, if there is one?

If you're using a JVM, how do they handle the FFI?

How are workers configured for CPU/GPU? Things like OMP_NUM_THREADS on each
worker matter a lot.

If you're using Spark, did you configure k small workers or one big one? How
does that interact with your multi-threaded linear algebra?
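
As an illustrative example only (the numbers are made up, and the right values
depend on your BLAS and hardware), these are the kinds of Spark settings that
interact:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Illustrative values only: a few big executors whose BLAS/OpenMP
    # thread count matches the cores each executor actually owns.
    conf = (SparkConf()
            .set("spark.executor.instances", "4")
            .set("spark.executor.cores", "8")
            .set("spark.executorEnv.OMP_NUM_THREADS", "8"))

    spark = SparkSession.builder.config(conf=conf).getOrCreate()

The failure mode to watch for is oversubscription: many small executors on one
box, each spawning a full complement of BLAS threads.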

Hopefully this gives people an idea of how complicated these things get.

~~~
FractalNerve
Before reading the blog's summary of the paper, I thought they were going to
analyze the distributed algorithms[1][2] exclusively and draw conclusions on
how heterogeneous GPU/CPU hardware usage works (best) when spread over the
network/cloud. My guess was that MXNet would win, because Amazon [3][4] and
benchmarks favored it[5]. Unfortunately, surveys like this are too complex and
already out of date when released, due to the fast progress in our field.

 _> > @agibsonccc What exactly would YOU, your company and your team like to
see reviewed, and what would you do differently?_

A standard benchmark, like the open-source, automated benchmarks provided by
the Phoronix Test Suite[6] or the Benchmarks Game[7], is what we need.
Everybody should be able to contribute easily, or even anonymously, as with
Steam's hardware survey[8].

The main problem with distributing operations with Caffe, for example, is the
need to learn another system for a different programming flavor.

-- References

[1] [https://en.wikipedia.org/wiki/Distributed_algorithm](https://en.wikipedia.org/wiki/Distributed_algorithm)

[2] [https://en.wikipedia.org/wiki/Category:Distributed_algorithms](https://en.wikipedia.org/wiki/Category:Distributed_algorithms)

[3] [http://www.infoworld.com/article/3144025/cloud-computing/why-amazon-picked-mxnet-for-deep-learning.html](http://www.infoworld.com/article/3144025/cloud-computing/why-amazon-picked-mxnet-for-deep-learning.html)

[4] [http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html](http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html)

[5] See page 52 of [http://alex.smola.org/talks/NIPS15.pdf](http://alex.smola.org/talks/NIPS15.pdf)

[6] [https://www.phoronix-test-suite.com/](https://www.phoronix-test-suite.com/)

[7] [http://benchmarksgame.alioth.debian.org/](http://benchmarksgame.alioth.debian.org/)

[8] [http://store.steampowered.com/hwsurvey/](http://store.steampowered.com/hwsurvey/)

~~~
agibsonccc
Yes indeed! Thanks for the follow-up. I just wanted to give folks a survey of
the things they should look at when evaluating charts like these to see how
fast something is.

The casual reader won't know to look for these things (and it's not their
fault; a lot of this knowledge is _very_ field specific).

I've made comparisons to deep learning flow graphs and spark's computation
graphs in the past, but there are _very_ different considerations and use
cases for both.

I think what I was commenting on was more of the minute details that get lost
in summaries like this.

If anything, this is a pitch for an end-to-end specialized system that
understands your data pipeline all the way down to your linear algebra
calculations. That way, all of these variables are taken care of.

A simple example where this matters a ton is _just_ the ETL into a neural
network.

You're loading data from some disk (that may or may not be networked).

What's the input format of that data? What's the block size?

Say it's image blobs stored on HDFS. You know ahead of time your GPU can fit
an optimal batch size of 32 into memory because your image resolution is high.
You then need to go back and prepare the data for loading. To get optimal
loading performance, that data needs to be aligned with the HDFS block size
(otherwise you cause an additional HDFS seek) and saved in whatever your
favorite vector format is. Then, to evaluate how fast a "minibatch" will be
processed, you need to figure out how fast your architecture is in addition to
how the communication with the GPU is handled.
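
To make the alignment point concrete, here is a back-of-the-envelope sketch;
every number in it is an assumption, not a measurement:

    # Back-of-the-envelope only; all sizes are assumptions.
    hdfs_block_bytes = 128 * 1024 * 1024        # default 128 MB HDFS block
    image_bytes = 3 * 1024 * 1024               # ~3 MB per preprocessed image
    batch_size = 32                             # what the GPU can fit

    minibatch_bytes = batch_size * image_bytes  # 96 MB per minibatch
    batches_per_block = hdfs_block_bytes // minibatch_bytes

    # -> 1: the second minibatch straddles a block boundary and costs an
    # extra HDFS seek unless records are written out pre-grouped to block size.
    print(batches_per_block)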

I think overall this is just a hairy subject to cover in a small blog post.

While these "programming flavors" are very similar, they are optimized
differently in practice.

If you think about Spark, the "nodes" are more general-purpose operations,
with a ton of JVM-based code gen happening.

It knows about different "edges/ops" in addition to having a registered
schema. It's lazily initialized with a DAG built with code gen after the
declarations are done.

A "tensor graph" doesn't have types beyond knowing what device it's running on
and perhaps a data type like float/double/half.
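
As a tiny illustration of the lazy-DAG point (the data here is synthetic and
purely for show): nothing in the chain below executes until an action is
called, and `explain()` prints the physical plan that Spark's whole-stage code
gen produces.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Declarations only; no work happens yet.
    df = spark.range(1000).selectExpr("id", "id * 2 AS doubled")

    df.explain()  # prints the physical plan, including code-gen stages
    df.count()    # only now does the DAG actually execute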

------
JoeriHer
To be honest, I think the blog post lacks a lot of detail and gives only a
very basic overview of the field (or I'm missing the point). I've been working
on distributed optimization systems for over a year now [5], and the field has
evolved drastically in that time. The most profound contribution is from [1]
imho, where they show that asynchronous optimization can be defined in terms
of implicit momentum.
Nevertheless, during my research I found that local work matters a lot,
especially when you have large communication constraints. I addressed this in
my master's thesis [2], where I introduce two novel distributed optimization
techniques based on a redefinition of parameter staleness: staleness is
defined in terms of the distance (or difference) between two
parameterizations. In a sense, this allows you to automatically tune the
amount of work done by a worker. For instance, if multiple workers compute
gradients based on older parameterizations which were close to each other,
they do not inflict negative work. Imagine it as follows: at the start of an
optimization process you will have relatively large gradients compared to when
you are close to an optimum. As a result, having a lot of asynchronous workers
at the start of the optimization process (using DOWNPOUR) can really hurt
convergence, and can even cause divergence (for GIFs see [3] and [4]).

To counter this, I introduce AGN (Accumulated Gradient Normalization) as a
mechanism to compute a better single gradient update based on increased local
exploration. Basically, you compute a sequence of first-order gradients,
accumulate them, and then normalize by the number of "exploration steps" to
produce a better parameter-server update (this also reduces the magnitude of
the vector, reducing the amount of staleness injected into the system; see
Chapter 3 of my thesis).
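
A rough numpy sketch of that idea as described above; the gradient function,
learning rate, and step count are placeholders, and the thesis (Chapter 3) has
the precise formulation:

    import numpy as np

    def agn_update(theta, grad_fn, num_steps, lr):
        """Rough sketch of Accumulated Gradient Normalization as described
        above; see the thesis for the exact formulation."""
        local_theta = theta.copy()
        accumulated = np.zeros_like(theta)
        for _ in range(num_steps):
            g = grad_fn(local_theta)     # first-order gradient
            local_theta -= lr * g        # local exploration step
            accumulated += g
        # Normalizing by the number of exploration steps shrinks the update
        # pushed to the parameter server, limiting the staleness injected.
        return accumulated / num_steps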

Nevertheless, divergent behavior can still be observed as the number of
asynchronous workers increases. This can be countered by constructing a per-
weight learning rate based on the difference between the parameterization of
the central variable (the parameterization held by the parameter server) and
the current parameterization of the worker (see Chapter 4). Intuitively, this
mechanism will nullify the contributions of workers which are "too far" from
the current central variable.
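
A toy numpy rendering of that intuition, not the exact rule from Chapter 4:
the further a worker's copy has drifted from the central variable, the more
its contribution is damped, weight by weight:

    import numpy as np

    def staleness_scaled_update(central_theta, worker_theta, grad, lr):
        """Toy illustration only; the thesis defines the actual scheme."""
        # Per-weight distance between the central variable and the
        # worker's (possibly stale) parameterization.
        distance = np.abs(central_theta - worker_theta)
        per_weight_lr = lr / (1.0 + distance)  # damp far-away contributions
        return central_theta - per_weight_lr * grad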

[1] [http://stanford.edu/~imit/tuneyourmomentum/theory/](http://stanford.edu/~imit/tuneyourmomentum/theory/) (paper in the blog post)

[2] [https://github.com/JoeriHermans/master-thesis](https://github.com/JoeriHermans/master-thesis)

[3] Animation, convergence: [http://joerihermans.com/media/blog/ddl/downpour.gif](http://joerihermans.com/media/blog/ddl/downpour.gif)

[4] Animation, divergence: [http://joerihermans.com/media/blog/ddl/downpour_momentum.gif](http://joerihermans.com/media/blog/ddl/downpour_momentum.gif)

[5] [https://github.com/cerndb/dist-keras](https://github.com/cerndb/dist-keras)

~~~
arcanus
I'm not convinced that SGD is the answer for large-scale distributed problems.
It is certainly the best answer for single-node applications, but it might not
be the best for distributed training. The search for effective optimization
algorithms in machine learning is ongoing.

The big problems for SGD at massively parallel scale:

* SGD does not parallelize well

* SGD is a first order method affected by ill-conditioning

* SGD is too noisy when high accuracy is desired

