I'll do it for him.
I'd just like to say that, as far as this niche is concerned, this is basically an attempt at "non-GPUs on Spark".
We are heavily biased towards CUDA and distributed GPU applications.
I respect what Intel is trying to do here, but it's going to take a lot more than "we built stuff" to get anyone to switch, let alone build a community around it.
To be fair to Intel, I can't wait to see what they do with accelerators and Phi, but I need to see more results first.
Competition in the space is definitely needed :D.
We have yet to see FPGAs and the Nervana acquisition really play out as well.
It will take them a while to catch up either way.
So DL4J works with Spark? https://deeplearning4j.org/spark#how
Is it because Spark does "distributed computing" very efficiently? In that case, would the apples-to-apples comparison be versus Spark+TensorFlow? https://databricks.com/blog/2016/12/21/deep-learning-on-data...
When you're aiming to put deep learning into production, a bunch of other things are important too, notably integrations. DL4J comes with integrations for Hadoop, Kafka and ElasticSearch as well as Spark. In the inference stage, we autoscale elastically as a micro-service using Lagom and a REST API. Most frameworks are just libs that don't solve problems deeper in the workflow. Our tools range from data pipelines with DataVec (reusable data preprocessing) to model evaluation with Arbiter, plus a GUI for heuristics during training.
We will offer a limited developer version of SKIL for free.
Think of SKIL as similar to gitlab or github enterprise.
In SKIL we also have auto-provisioning of a cluster and a higher-level interface for running deep learning workloads. It auto-configures most of the parameters, like the Spark worker native library path, and sets up things like a training UI as well as installation of the MKL and cuDNN libraries.
Optionally, you can also run a version of this with DC/OS and co., where there is a packaged Spark.
What we do have in DL4J is the raw components you can use to create these things, such as DataVec and dl4j-streaming, which covers our integration with Kafka.
Plus it's built for distributed processing.
PySpark makes it easy to use.
Comparing Spark and TensorFlow is sort of like comparing Numpy and Pandas. There is some overlap, but they are pretty different things.
Spark is a big data manipulation tool, which comes with a somewhat-adequate machine learning library. TensorFlow is an optimised math library with machine learning operations built on it.
Spark doesn't support GPU operations (although as you note Databricks has proprietary extensions on their own cluster). DeepLearning4J and various other libraries do similar things.
However, if you are building your own neural network architectures then TF (which has a highly optimised distributed training mode) is more useful.
At Google, their graph processing system (Expander) and deep learning framework (TensorFlow) are separate systems. Spark looks to be built from the graph side (RDD) first and is now getting ML components.
How do you see Spark evolving?
MLlib seems awesome, but the devil is in the detail. Examples that have burnt me include things like LogisticRegression only supporting binary classification, the LibSVM support only supporting import, the GBT implementation being weak compared to e.g. XGBoost, etc.
A lot of the time it is fine though.
Graph support... hmm. GraphX is ok, but there are lots of things that e.g. NetworkX has that GraphX doesn't. In my experience, we've started a lot of projects with GraphX and abandoned them because GraphX's implementations didn't have the features we needed.
BTW, RDDs aren't graphs. I think you might be confusing the Spark directed-acyclic-graph (DAG) execution model with graph processing.
TensorFlow doesn't have as many general purpose ML algorithms. For example, I don't think there is a Random Forest in TF, and for 90% of ML problems RF is what you need.
But if you are doing Neural Network stuff then TF is exactly what you need.
I'll point out that this is TF Contrib Learn, not TF Learn, or one of many other places where things might be implemented. Makes things a bit confusing.
I see that you worked with GraphX and abandoned it. This is disappointing - we were really looking forward to Spark GraphFrames with HBase as the OLTP data store for graph data.
In your situation, how did you overcome the problems in Spark? Did you use an accompanying toolkit to augment Spark, or did you build your own (hopefully not!)?
What specific graph operations do you want?
If the stuff you need is there, then you might be fine! Note that the set of pre-built algorithms in GraphFrames is pretty small (https://graphframes.github.io/user-guide.html#graph-algorith...). It is pre-release though.
Graph stuff is generally hard, so I don't think there is a magic bullet here.
I mean, even just the Spark-using-HBase bit is non-trivial to do in a way that provides adequate performance. There are 3(?) different connectors, with pluses and minuses for each one. Making sure data locality is working will depend on your YARN or Mesos setup, and debugging that is a nightmare.
In our case, we prefilter data in Spark then load into NetworkX. Works ok, mostly.
Our data sets have grown massively over the last few months and we now need a bigger solution. I think we will start off with a hosted solution like EMR - performance is not super critical right now (batch-mode training)... but developer productivity is key.
Spark is more focused on "counting at scale with a functional DSL". Hence its focus on things like ETL and columnar processing ala dataframes.
As far as Spark doing "deep learning" goes, what you should mean here is: "libraries in the ecosystem leverage Spark as a data access layer, while the real numerical compute happens elsewhere".
Spark can count things with functional programming. It's not meant for heavy numerical operations. They are working on this where they can, but you really can't beat a GPU or good ol' SIMD instructions on hardware.
I haven't used this feature - but are you sure?
In addition, IBM has multiple projects around GPU-aware Spark - https://github.com/IBMSparkGPU
Yes, as I said elsewhere there are plenty of projects to enable GPU usage via Spark. Have you actually tried them, though? I have (e.g. https://github.com/IBMSparkGPU/GPUEnabler/issues/25 ) and there are... issues.
In fact, it looks like you can use TensorFlow models in Spark with GPUs - https://databricks.com/blog/2016/12/21/deep-learning-on-data...
Spark is just a data access layer here. It's not even remotely GPU-friendly. Most people also still rely on Mesos or YARN for running distributed workloads. The library you're using matters a lot. Mesos just added GPU support:
YARN can sorta support it with node labeling, but it's still kinda hacky.
The real work in this space (without the marketing) is done by IBM:
When Spark can (without "production ready" buzzwords) run on GPUs like this out of the box, then we're talking.
For now, Spark needs a companion library to work with GPUs, though.