
BigDL: Distributed Deep Learning on Apache Spark
https://github.com/intel-analytics/BigDL
======
vonnik
Deeplearning4j does this already. It has a huge community, a Scala API and
does model import from Keras. It's important to note that Spark is not an
efficient computation layer -- it's best used for fast ETL. If you get that
wrong, you're going to be training slowly.

[https://deeplearning4j.org](https://deeplearning4j.org)
[https://github.com/deeplearning4j/ScalNet](https://github.com/deeplearning4j/ScalNet)
[https://deeplearning4j.org/model-import-keras](https://deeplearning4j.org/model-import-keras)
[https://gitter.im/deeplearning4j/deeplearning4j](https://gitter.im/deeplearning4j/deeplearning4j)
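The division of labour being described here (Spark shards and moves the data, workers do the math, a driver averages parameters between rounds) is roughly how Spark-based deep learning training works. A minimal pure-Python sketch of synchronous parameter averaging; all names are illustrative, not any library's real API:

```python
# Toy data-parallel training via synchronous parameter averaging, the pattern
# Spark-based deep learning libraries typically follow: Spark handles ETL and
# sharding, while the numeric work runs in each worker's native code.
# Everything here is illustrative, not a real BigDL/DL4J API.

def local_sgd_step(w, shard, lr=0.1):
    """One SGD step on a worker's shard for a 1-D least-squares model y = w*x."""
    grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    return w - lr * grad

def parameter_average_round(w, shards):
    """Each worker steps from the same broadcast w; the driver averages results."""
    updated = [local_sgd_step(w, shard) for shard in shards]  # parallel on a cluster
    return sum(updated) / len(updated)

# Ground truth w = 3; two "workers", each holding its own shard.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = parameter_average_round(w, shards)
print(round(w, 3))  # converges toward 3.0
```

The synchronization barrier at the end of every round is exactly where a slow ETL layer hurts: all workers wait for the slowest one before the next broadcast.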

~~~
agibsonccc
Chris, my cofounder, forgot to disclose that he works on the project :).

I'll do it for him.

I'd just like to say that as far as this niche is concerned, this is basically
an attempt at "non-GPUs on Spark".

We are heavily biased towards CUDA and distributed GPU applications:
[https://blogs.nvidia.com/blog/2016/10/06/how-skymind-nvidia-deep-learning/](https://blogs.nvidia.com/blog/2016/10/06/how-skymind-nvidia-deep-learning/)

I respect what Intel is trying to do here, but it's going to take a lot more
than "we built stuff" to get anyone to switch, let alone build a community
around it.

To be fair to intel, I can't wait to see what they do with accelerators and
phi, but I need to see more results first.

Competition in the space is definitely needed :D.

We have yet to see the FPGAs and the Nervana acquisition really play out as well.

It will take them a while to catch up either way.

~~~
sandGorgon
that sounds interesting.

So dl4j works with Spark?
[https://deeplearning4j.org/spark#how](https://deeplearning4j.org/spark#how)

Is it because Spark does "distributed computing" very efficiently? In that
case, would the apples-to-apples comparison be versus Spark+TensorFlow?
[https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html](https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html)

~~~
vonnik
Benchmarks we ran ourselves show that we're faster than TensorFlow using
multiple GPUs for a non-trivial image processing task:
[https://github.com/deeplearning4j/dl4j-benchmark](https://github.com/deeplearning4j/dl4j-benchmark).
That's the best apples-to-apples comparison we have for the moment.

When you're aiming to put deep learning into production, a bunch of other
things are important too, notably integrations. DL4J comes with integrations
for Hadoop, Kafka and Elasticsearch as well as Spark. In the inference stage,
we autoscale elastically as a microservice using Lagom and a REST API. Most
frameworks are just libs that don't solve problems deeper in the workflow. Our
tools range from data pipelines with DataVec (reusable data preprocessing) to
model evaluation with Arbiter, plus a GUI for heuristics during training.

[https://github.com/deeplearning4j/DataVec](https://github.com/deeplearning4j/DataVec)
[https://github.com/deeplearning4j/Arbiter](https://github.com/deeplearning4j/Arbiter)
[https://deeplearning4j.org/visualization](https://deeplearning4j.org/visualization)
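The point of "reusable data preprocessing" in a tool like DataVec is that transforms are declared once and replayed identically at training and inference time. A toy sketch of that idea in Python; the names are hypothetical, not DataVec's actual (Java/Scala) API:

```python
# A minimal "declare once, replay everywhere" preprocessing pipeline,
# sketching the idea behind DataVec-style tooling. Illustrative only.

class Pipeline:
    def __init__(self):
        self.steps = []

    def add(self, name, fn):
        """Register a named transform; returns self to allow chaining."""
        self.steps.append((name, fn))
        return self

    def run(self, record):
        """Apply every transform in declaration order to one record."""
        for _, fn in self.steps:
            record = fn(record)
        return record

# The same pipeline object can be applied at training time and at serving time.
pipe = (Pipeline()
        .add("strip", lambda r: r.strip())
        .add("split", lambda r: r.split(","))
        .add("to_float", lambda r: [float(v) for v in r]))

print(pipe.run(" 1.0,2.5,3.0 \n"))  # [1.0, 2.5, 3.0]
```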

~~~
agibsonccc
I have to correct Chris here. He is talking about a lot of features that are
in our enterprise version, SKIL.

We will offer a limited developer version of SKIL for free.

Think of SKIL as similar to gitlab or github enterprise.

In SKIL we also have auto-provisioning of a cluster and a higher-level
interface for running deep learning workloads. It auto-configures most of the
parameters, like the Spark worker native library path, and sets up things
like a training UI as well as installation of the MKL and cuDNN libraries.

Optionally, you can also run a version of this with DC/OS and co where there
is a packaged spark.

What we _do_ have in dl4j are the raw components you can use to build these
things, such as DataVec and dl4j-streaming, which covers our integration with
Kafka.

------
mmrezaie
Some projects, like SparkNet, DeepLearning4j+Spark, or even Sparkling Water,
are doing much the same thing. So how does this compare to them?

~~~
blueyes
This is a classic vanity deep-learning framework that Intel built due to NIH
syndrome. It's like DSSTNE. Doomed to be abandoned. I can't imagine a worse
way to position a deep learning library than to say: this only works on CPUs.
When you look at Intel's track record with software, especially their Trusted
Analytics Platform this year, BigDL's prospects are poor. I'm just waiting for
IBM to copy this move and come out with yet another deep learning lib: YADLL.

------
sandGorgon
Isn't spark more versatile than tensorflow at this point? It does graph
processing and deep learning.

Plus it's built for distributed processing.

Pyspark makes it easy to use.

~~~
nl
(Heavy Spark user here)

Comparing Spark and TensorFlow is sort of like comparing Numpy and Pandas.
There is some overlap, but they are pretty different things.

Spark is a big data manipulation tool, which comes with a somewhat-adequate
machine learning library. TensorFlow is an optimised math library with machine
learning operations built on it.

Spark doesn't support GPU operations (although as you note Databricks has
proprietary extensions on their own cluster). DeepLearning4J and various other
libraries do similar things.

However, if you are building your own neural network architectures then TF
(which has a highly optimised distributed training mode) is more useful.
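One way to see the "optimised math library with machine learning operations built on it" distinction: TensorFlow's core is a computation graph with automatic differentiation, and the ML layers sit on top of that. A toy reverse-mode autodiff sketch; nothing here is TensorFlow's real API:

```python
# A minimal reverse-mode autodiff node, sketching the computation-graph idea
# at the heart of TensorFlow. The real library adds optimised kernels, GPU
# placement, and a distributed runtime on top of this core. Illustrative only.

class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # (parent_node, local_gradient) pairs
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

def backward(out):
    """Propagate d(out)/d(node) back through the graph by the chain rule."""
    out.grad = 1.0
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, local in node.parents:
            parent.grad += node.grad * local
            stack.append(parent)

# y = w*x + b  ->  dy/dw = x, dy/db = 1
w, x, b = Node(2.0), Node(3.0), Node(1.0)
y = w * x + b
backward(y)
print(y.value, w.grad, b.grad)  # 7.0 3.0 1.0
```

Spark has no equivalent of this layer; its DAG tracks data lineage for fault tolerance, not gradients for optimisation.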

~~~
sandGorgon
For someone just getting started on Spark - what do you mean by "somewhat
adequate"? Because I see MLlib ([https://spark.apache.org/docs/2.0.2/mllib-guide.html](https://spark.apache.org/docs/2.0.2/mllib-guide.html))
and a quick glance shows me a lot of overlap with TensorFlow.

At google, their graph processing system (Expander) and deep learning
framework (tensorflow) are separate systems. Spark looks to be built from the
graph side (RDD) first and is now getting ML components.

How do you see Spark evolving?

~~~
nl
So..

MLlib seems awesome, but the devil is in the details. Examples that have burnt
me include things like: LogisticRegression only supports binary
classification, the LibSVM support only handles import, the GBT
implementation is weak compared to e.g. XGBoost, etc.

A lot of the time it is fine though.
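For what it's worth, the usual workaround for a binary-only classifier is a one-vs-rest reduction: train one binary model per class and predict with the highest scorer. A toy sketch using a stand-in classifier; `BinaryStub` and the function names are hypothetical, not real MLlib calls:

```python
# One-vs-rest reduction: the standard workaround when a library's logistic
# regression only handles binary labels. Everything here is illustrative.

class BinaryStub:
    """Trivial binary 'classifier': scores by distance to the positive-class mean."""
    def fit(self, X, y):
        pos = [x for x, label in zip(X, y) if label == 1]
        self.center = sum(pos) / len(pos)
        return self

    def score(self, x):
        return -abs(x - self.center)  # higher = more like the positive class

def one_vs_rest_fit(X, y, classes):
    # Train one binary model per class: class k vs. everything else.
    return {k: BinaryStub().fit(X, [1 if label == k else 0 for label in y])
            for k in classes}

def one_vs_rest_predict(models, x):
    # Predict the class whose binary model scores x highest.
    return max(models, key=lambda k: models[k].score(x))

X = [0.1, 0.2, 1.0, 1.1, 2.0, 2.1]
y = [0, 0, 1, 1, 2, 2]
models = one_vs_rest_fit(X, y, classes=[0, 1, 2])
print(one_vs_rest_predict(models, 1.05))  # 1
```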

Graph support... hmm. GraphX is OK, but there are lots of things that e.g.
NetworkX has that GraphX doesn't. In my experience, we've started a lot of
projects with GraphX and abandoned them because GraphX's implementations
didn't have the features we needed.

BTW, RDDs aren't graphs. I think you might be confusing the Spark directed-acyclic-graph (DAG) execution model with graph processing.

TensorFlow doesn't have as many general purpose ML algorithms. For example, I
don't think there is a Random Forest in TF, and for 90% of ML problems RF is
what you need.

But if you are doing Neural Network stuff then TF is exactly what you need.

~~~
Aeolos
Tensorflow has a GPU-accelerated Random Forest implementation:
[https://github.com/tensorflow/tensorflow/blob/v0.10.0rc0/ten...](https://github.com/tensorflow/tensorflow/blob/v0.10.0rc0/tensorflow/contrib/learn/python/learn/estimators/random_forest.py)

~~~
nl
Nice. I'm glad to be wrong.

I'll point out that this is TF Contrib Learn, not TF Learn[1], or one of many
other places where things might be implemented. Makes things a bit confusing.

[1] [http://tflearn.org/](http://tflearn.org/)

------
zero-x
Hmm, seems interesting, but wondering how it compares to H2O's Sparkling
Water. Been using that for clients and I love it.

~~~
happynewyear
We don't seem to hear much about 0xdata on hn.

------
ris
A rather transparent attempt to pull people away from GPUs and toward massive
farms of (Intel-powered) machines instead.

------
rustyconover
It does not appear that this uses a GPU at all. Which is okay of course but
may not win any speed contests.

~~~
sandGorgon
GPU acceleration in Spark is generally production-ready -
[https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html](https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html)

In fact, it looks like you can use TensorFlow models in Spark with GPUs -
[https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html](https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html)

~~~
agibsonccc
TensorFrames, despite the marketing, is already defunct. It hasn't seen a
commit since August.

[https://github.com/databricks/tensorframes](https://github.com/databricks/tensorframes)

Spark is just a data access layer here. It's not even remotely GPU-friendly.
Most people also still rely on Mesos or YARN for running distributed jobs. The
library you're using matters a lot. Mesos just added GPU support:
[http://mesos.apache.org/documentation/latest/gpu-support/](http://mesos.apache.org/documentation/latest/gpu-support/)

YARN can sort of support it with node labeling for job completion, but it's
still kind of hacky.

The real work in this space (without the marketing) is being done by IBM:
[http://www.slideshare.net/ishizaki/exploiting-gpus-in-spark](http://www.slideshare.net/ishizaki/exploiting-gpus-in-spark)

When Spark can run GPUs like this out of the box (without "production ready"
buzzwords), then we're talking. For now, Spark needs a companion library to
work with GPUs.

