
Deep Learning with Spark and TensorFlow - mateiz
https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html
======
vonnik
So the cool thing here is that you can use Spark and TF to find the best model,
like Microsoft Research did with ResNets.

[http://www.wired.com/2016/01/microsoft-neural-net-shows-deep...](http://www.wired.com/2016/01/microsoft-neural-net-shows-deep-learning-can-get-way-deeper/)

They're showing you how to train different architectures simultaneously, and
then compare their results in order to select the best one. That's great as
far as it goes.

The drawback is that with this scheme, you can't actually train a given
network faster, which is what you want to do with Spark. What is the role of a
distributed runtime in training artificial neural networks? It's simple: NNs
are computationally intensive, so you want to spread the work over many
machines.

Spark can help you orchestrate that through data parallelism, parameter
averaging and iterative reduce, which we do with Deeplearning4j.

[http://deeplearning4j.org/spark](http://deeplearning4j.org/spark)
[https://github.com/deeplearning4j/dl4j-spark-cdh5-examples](https://github.com/deeplearning4j/dl4j-spark-cdh5-examples)

Data parallelism is an approach Google uses to train neural networks on tons
of data quickly. The idea is that you shard your data across a lot of
equivalent model replicas, have each replica train on a separate machine, and
then average their parameters. That works, it's fast, and it's how Spark can
help you do deep learning better.
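
To make the parameter-averaging idea concrete, here's a minimal PySpark
sketch. The toy linear model and the train_on_shard() helper are illustrative
assumptions, not DL4J's or the article's actual code -- each partition trains
its own replica on its shard, and the driver averages the resulting weights
once per epoch:

    # Each partition trains a replica on its shard; the driver averages the weights.
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="param-averaging-sketch")

    def train_on_shard(weights, examples):
        """A few local SGD steps for a toy linear model (squared error)."""
        w = weights.copy()
        for x, y in examples:
            grad = (np.dot(w, x) - y) * x
            w -= 0.01 * grad
        return w

    # Fake (features, label) pairs, sharded across 8 partitions.
    data = [(np.random.randn(10), np.random.randn()) for _ in range(1000)]
    rdd = sc.parallelize(data, numSlices=8).cache()

    weights = np.zeros(10)
    for epoch in range(5):
        bc = sc.broadcast(weights)  # ship the current weights to every worker
        replicas = rdd.mapPartitions(
            lambda part: [train_on_shard(bc.value, list(part))]).collect()
        weights = np.mean(replicas, axis=0)  # the parameter-averaging step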

------
hcrisp
Impressive, but it seems like an inversion of paradigms. A small
data-to-compute ratio is usually associated with high-performance computing
(HPC). Why use Spark when the data is small and is broadcast to each worker?
You have to pay the serialization/deserialization penalties of moving the data
from Python to the JVM and back again. In fact, the JVM isn't really needed
here at all, since all the computation is done in the pure-Python workers in an
embarrassingly parallel way. It seems to me that you would just move onto an
HPC cluster, use TensorFlow within an IPython.parallel paradigm, and be done
much sooner.
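
For what it's worth, the IPython.parallel (now ipyparallel) version of that
would look roughly like this. train_and_eval() here is just a stand-in for a
pure-Python TensorFlow run, and the sketch assumes a cluster has already been
started (e.g. with ipcluster):

    from ipyparallel import Client

    def train_and_eval(config):
        # Placeholder for a pure-Python TensorFlow training run on one worker.
        return (config, 0.0)

    rc = Client()                       # connect to an already-running ipcluster
    view = rc.load_balanced_view()
    configs = [{"learning_rate": lr} for lr in (0.001, 0.01, 0.1)]
    results = view.map_sync(train_and_eval, configs)   # one task per config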

~~~
rxin
The "broadcast" is pretty cheap because often you already have the data in
some distributed file system, or if on a single node the network bandwidth is
pretty high. The problem with a lot of the deep learning workloads is that it
is very compute intensive and as a result takes a long time to run. For
example, it is not uncommon to take a week to train some models.

~~~
gcr
Deep learning workloads are typically compute-intensive, but they also tend to
be extremely I/O-intensive, and convergence may depend on a synchronous step
where all the nodes must finish making their contribution to the model before
any of them can continue. (This may not be quite true -- see Google's
DistBelief paper -- but most frameworks work this way.) Oftentimes, adding more
machines to a cluster can make training proportionally slower.

~~~
rxin
Did you actually read the article? It was using Spark to parallelize
hyperparameter tuning, which is embarrassingly parallel.
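
In sketch form, that parallel sweep is just a map over configurations --
something like the following, where train_and_eval() stands in for the
single-machine TensorFlow training run described in the article:

    import itertools
    from pyspark import SparkContext

    sc = SparkContext(appName="hyperparam-search-sketch")

    learning_rates = [0.001, 0.01, 0.1]
    batch_sizes = [32, 64, 128]
    configs = list(itertools.product(learning_rates, batch_sizes))

    def train_and_eval(config):
        lr, batch = config
        # In the real setup, each worker trains a TF model with these
        # hyperparameters and returns its validation accuracy.
        accuracy = 0.0  # placeholder
        return (config, accuracy)

    # One independent task per configuration; no cross-task communication needed.
    results = sc.parallelize(configs, len(configs)).map(train_and_eval).collect()
    best_config, best_acc = max(results, key=lambda r: r[1])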

~~~
doobwa
Why not just use GNU Parallel (or something similar) instead of Spark?

~~~
orm
I think one reason would be fault tolerance. Is there a fault-tolerance layer
in GNU Parallel? Last time I checked their homepage (a few minutes ago), there
was no reference to fault tolerance.

Another reason is, perhaps, scheduling.

~~~
chimtim
What fault tolerance does Spark give you in this scheme? It cannot look into
TF's progress and checkpoint all of its state. Using Spark with TF seems like
overkill -- you need to manage and install two frameworks for what should
ideally be a 200-line Python wrapper or a small Mesos framework at most.

------
amelius
I have a question about neural networks.

Say, you are training a NN to recognize handwritten characters 0 and 1, and
you have 1000 training images for each character (so 2000 images in total).
All images are bitmaps with 0 for black and 1 for white.

Now, by accident, all the "0" training-images have an even number of black
pixels, and all the "1" training-images have an odd number of black pixels.

How do you know that the NN really learns to recognize 0's and 1's, as opposed
to recognizing whether the number of pixels in an image is even or odd?

~~~
Homunculiheaded
There's actually a case in the early history of perceptrons that brings up
this exact issue:

"There is a humorous story from the early days of machine learning about a
network that was supposed to be trained to recognize tanks hidden in forest
regions. The network was trained on a large set of photographs – some with
tanks and some without tanks. After learning was complete the system appeared
to work well when “shown” additional photographs from the original set. As a
final test, a new group of photos were taken to see if the network could
recognize tanks in a slightly different setting. The results were extremely
disappointing. No one was sure why the network failed on this new group of
photos. Eventually, someone noticed that in the original set of photos the
network had been trained on, all of the photos with tanks had been taken on a
cloudy day, while all of the photos without tanks were taken on a sunny day.
The network had not learned to detect the difference between scenes with tanks
and without tanks, it had instead learned to distinguish photos taken on
cloudy days from photos taken on sunny days!"[0]

The pragmatic answer is that this is why you have two hold-out sets: a cross
validation/dev set and a test set. Typically you keep 70% of the data for
training, 15% for CV, and 15% for test. Ideally you should shuffle the data
enough that there isn't any bias from the natural order of the data.

You train the model on the training data, and estimate how well the model
actually performs on the CV set, which the model did not see in training. You
continue to use the CV set while you tweak parameters, try out new models, etc.
At this point you may have "cheated" a bit, because you only kept things that
worked well on your CV data. Finally, when you say "this is done!", you try out
your model on the test data set.
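
In code, the split is just a shuffle and three slices. A toy illustration
using the 2000 labeled images from the question above (all names here are ad
hoc):

    import random

    # Stand-ins for the 2000 labeled "0"/"1" images from the question above.
    examples = [("image_%d" % i, i % 2) for i in range(2000)]
    random.shuffle(examples)  # remove any bias in the natural order first

    n = len(examples)
    train = examples[: int(0.70 * n)]               # 70% for training
    cv    = examples[int(0.70 * n): int(0.85 * n)]  # 15% for tuning / model selection
    test  = examples[int(0.85 * n):]                # 15% touched only once, at the end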

Of course, it's still possible that you would have the even/odd issue, and the
answer to this whole set of issues is "healthy skepticism" and checking for
these types of errors.

Take, for example, this Sentence Completion Challenge from Microsoft Research
[1].

They claim some astounding results on correctly predicting GRE-type questions
using a very simple model (LSA, for those who care). These results seemed
impossible! But it turns out they cheated by training the model _only_ on the
possible answers (which is akin to studying for the actual GRE by only
reviewing the possible answers that will be on the exam).

We tend to obsess over p-values and test validation scores as a substitute for
reasoning. But all research papers should be read as an argument a friend is
making to you, "I've done this incredible thing... ", and no single number
should replace reasoned inquisition into possible errors.

[0]
[http://watson.latech.edu/WatsonRebootTest/ch14s2p4.html](http://watson.latech.edu/WatsonRebootTest/ch14s2p4.html)

[1]
[http://research.microsoft.com/apps/pubs/?id=157031](http://research.microsoft.com/apps/pubs/?id=157031)

~~~
rahimiali
The tank anecdote is also famously apocryphal. Here's a good analysis of the
origin of that story: [http://www.jefftk.com/p/detecting-tanks](http://www.jefftk.com/p/detecting-tanks)

------
tachim
0.1% accuracy increments correspond to 10 images in the testing set; they
should be reporting standard error bars with those numbers.
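
For reference, a back-of-the-envelope standard error under the assumption of a
10,000-image test set (which is what "0.1% = 10 images" implies) and an
illustrative 95% accuracy:

    import math

    n = 10000           # assumed test-set size (0.1% of 10,000 = 10 images)
    accuracy = 0.95     # illustrative value, not from the article
    stderr = math.sqrt(accuracy * (1 - accuracy) / n)  # binomial standard error
    print("+/- %.2f%% (1 s.e.)" % (100 * stderr))      # ~0.22%, larger than a 0.1% gap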

------
elcct
That article reminded me of this:
[http://i.imgur.com/XQJ3ACO.jpg](http://i.imgur.com/XQJ3ACO.jpg)

~~~
obituary_latte
[http://i.imgur.com/boZRjbB.png](http://i.imgur.com/boZRjbB.png)

~~~
mindcrime
There's actually a little bit more info out there for would-be "Watson
builders".

[https://www.ibm.com/developerworks/community/blogs/InsideSys...](https://www.ibm.com/developerworks/community/blogs/InsideSystemStorage/entry/ibm_watson_how_to_build_your_own_watson_jr_in_your_basement7)

[http://www.theregister.co.uk/2011/02/21/ibm_watson_qa_system...](http://www.theregister.co.uk/2011/02/21/ibm_watson_qa_system/)

[http://learning.acm.org/webinar/lally.cfm](http://learning.acm.org/webinar/lally.cfm)

[http://www.cs.nmsu.edu/ALP/2011/03/natural-language-processi...](http://www.cs.nmsu.edu/ALP/2011/03/natural-language-processing-with-prolog-in-the-ibm-watson-system/)

Of course, there's still a big gap between "Download some stuff" and "Build
Watson", but at least there's a trickle of details on what happens in the "a
miracle happens here" step. :-)

~~~
obituary_latte
Yup - and very grateful for those.

To me, the linked graphic has lately represented pretty well what I'm faced
with on a daily basis. People seem to think that because a hammer exists, it's
easy to build a house.

