BigDL: Distributed Deep Learning on Apache Spark (github.com)
112 points by ubolonton_ on Dec 31, 2016 | 37 comments

Deeplearning4j does this already. It has a huge community, a Scala API and does model import from Keras. It's important to note that Spark is not an efficient computation layer -- it's best used for fast ETL. If you get that wrong, you're going to be training slowly.

https://deeplearning4j.org https://github.com/deeplearning4j/ScalNet https://deeplearning4j.org/model-import-keras https://gitter.im/deeplearning4j/deeplearning4j

Chris my cofounder forgot to disclose he works on the project :).

I'll do it for him.

I'd just like to say that, as far as this niche is concerned, this is basically an attempt at "non gpus on spark".

We are heavily biased towards cuda and distributed gpu applications: https://blogs.nvidia.com/blog/2016/10/06/how-skymind-nvidia-...

I respect what intel is trying to do here, but it's going to take a lot more than "we built stuff" to get anyone to switch, let alone build a community around it.

To be fair to intel, I can't wait to see what they do with accelerators and phi, but I need to see more results first.

Competition in the space is definitely needed :D.

We have yet to see fpgas and nervana acquisition really play out as well.

It will take them a while to catch up either way.

that sounds interesting.

so dl4j works with spark ? https://deeplearning4j.org/spark#how

is it because spark does "distributed computing" very efficiently ? In that case, would the apples-to-apples comparison be versus spark+tensorflow ? https://databricks.com/blog/2016/12/21/deep-learning-on-data...

Benchmarks we ran ourselves show that we're faster than TensorFlow using multi-GPUs for a non-trivial image processing task: https://github.com/deeplearning4j/dl4j-benchmark. That's the best apples to apples we have for the moment.

When you're aiming to put deep learning into production, a bunch of other things are important too, notably integrations. DL4J comes with integrations for Hadoop, Kafka and ElasticSearch as well as Spark. In the inference stage, we autoscale elastically as a micro-service using Lagom and a REST API. Most frameworks are just libs that don't solve problems deeper in the workflow. Our tools range from data pipelines with DataVec (reusable data preprocessing) to model evaluation with Arbiter and a GUI for heuristics during training.

https://github.com/deeplearning4j/DataVec https://github.com/deeplearning4j/Arbiter https://deeplearning4j.org/visualization

I have to correct chris here. He is talking about a lot of features that are in our enterprise version SKIL.

We will offer a limited developer version of SKIL for free.

Think of SKIL as similar to gitlab or github enterprise.

In SKIL we also have auto provisioning of a cluster and a higher level interface for running deep learning workloads. It auto configures most of the parameters like the spark worker native library path and setting up things like a training UI as well as installation of the mkl and cudnn libraries.

Optionally, you can also run a version of this with DC/OS and co where there is a packaged spark.

What we do have in dl4j is the raw components you can use to create these things such as datavec and dl4j-streaming which covers our integration with kafka.

Nothing to be "biased" about, Cuda is the industry standard.

Sure :D. It's still in my interest to disclose we partner with nvidia pretty closely though. I would hope to see competition here but we have a large vested interest in gpus succeeding. Thanks for the sentiment though!

Oh I don't like the situation either, I'm just saying there's no need to be guilty! :)

I started experimenting with DL4J about a month ago. The getting started example apps are actually pretty good. You can pretty much clone the repo and run them.

Some projects like SparkNet or DeepLearning4j+spark or even Sparkling Water are kinda doing the same thing. So, how does this compare to them?

This is a classic vanity deep-learning framework that Intel built due to NIH syndrome. It's like DSSTNE. Doomed to be abandoned. I can't think of a worse way to position a deep learning library than to say: this only works on CPUs. When you look at Intel's track record with software, especially their Trusted Analytics Platform this year, BigDL's prospects are poor. I'm just waiting for IBM to copy this move and come out with yet another deep learning lib: YADLL.

Isn't spark more versatile than tensorflow at this point? It does graph processing and deep learning.

Plus it's built for distributed processing.

Pyspark makes it easy to use.

(Heavy Spark user here)

Comparing Spark and TensorFlow is sort of like comparing Numpy and Pandas. There is some overlap, but they are pretty different things.

Spark is a big data manipulation tool, which comes with a somewhat-adequate machine learning library. TensorFlow is an optimised math library with machine learning operations built on it.

Spark doesn't support GPU operations (although as you note Databricks has proprietary extensions on their own cluster). DeepLearning4J and various other libraries do similar things.

However, if you are building your own Neural Network architectures then TF (which has a highly optimised distributed training mode) is more useful.

for someone just getting started on Spark - what do you mean "somewhat adequate" ? Because I see MLLib (https://spark.apache.org/docs/2.0.2/mllib-guide.html) and a quick glance shows me a lot of overlap with tensorflow.

At google, their graph processing system (Expander) and deep learning framework (tensorflow) are separate systems. Spark looks to be built from the graph side (RDD) first and is now getting ML components.

how do you see spark evolving ?


MLLib seems awesome, but the devil is in the detail. Examples that have burnt me include: LogisticRegression only supports binary classification, the LibSVM support only supports import, the GBT implementation is weak compared to eg XGBoost, etc.

A lot of the time it is fine though.

Graph support.. hmm. GraphX is ok, but there are lots of things that eg NetworkX has that GraphX doesn't. In my experience, we've started a lot of projects with GraphX and abandoned them because GraphX's implementations didn't have the features we needed.

BTW, RDDs aren't graphs. I think you might be confusing the Spark directed-acyclic-graph (DAG) execution model with graph processing.
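To make the distinction concrete, here is a minimal plain-Python sketch. It is not Spark's API: `ToyDataset` and its methods are hypothetical names invented for illustration. The point it tries to show is that the "graph" in Spark's execution model is the recorded lineage of transformations, while the data itself is just a collection. This toy only records a linear chain; in real Spark the lineage becomes a DAG once joins or unions are involved.

```python
class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API).

    It stores a source collection plus a lineage of pending steps.
    Nothing here models vertices or edges in the data.
    """

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Lazily record the step; no work is done yet.
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Lazily record the step; no work is done yet.
        return ToyDataset(self._source, self._steps + (("filter", pred),))

    def lineage(self):
        # The execution plan: this is the "graph" the scheduler cares about.
        return ["source"] + [kind for kind, _ in self._steps]

    def collect(self):
        # Only now is the recorded plan actually executed.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


numbers = ToyDataset(range(10))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.lineage())
print(evens_squared.collect())
```

Graph processing in the GraphX or NetworkX sense is a different thing: there the data itself is vertices and edges, and the algorithms walk those.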

TensorFlow doesn't have as many general purpose ML algorithms. For example, I don't think there is a Random Forest in TF, and for 90% of ML problems RF is what you need.

But if you are doing Neural Network stuff then TF is exactly what you need.

Tensorflow has a GPU-accelerated Random Forest implementation: https://github.com/tensorflow/tensorflow/blob/v0.10.0rc0/ten...

Nice. I'm glad to be wrong.

I'll point out that this is TF Contrib Learn, not TF Learn[1], or one of many other places where things might be implemented. Makes things a bit confusing.

[1] http://tflearn.org/

Spark 2.1.0, released this week, evidently supports multiclass logistic regression now!

thanks for that comment. Indeed we are looking at general purpose ML (gbm and logit regression being our primary usecases). I was not looking at RDD but rather Graphframes.

I see that you worked with GraphX and abandoned it. This is disappointing - we were really looking forward to Spark Graphframes with HBase as the oltp data store for graph data.

In your situation, how did you overcome the problems in Spark ? Did you use an accompanying toolkit to augment spark or did you build your own (hopefully not!).

I think it's very hard to give general advice in this area. You are best off prototyping a deep spike into what you need, and seeing where things don't work.

What specific graph operations do you want?

If the stuff you need is there, then you might be fine! Note that the set of pre-built algorithms in GraphFrames is pretty small (https://graphframes.github.io/user-guide.html#graph-algorith...). It is pre-release though.

Graph stuff is generally hard, so I don't think there is a magic bullet here.

I mean, even just the Spark-using-HBase bit is non-trivial to do in a way that provides adequate performance. There are 3(?) different connectors, with pluses and minuses for each one. Making sure data locality is working will depend on your YARN or Mesos setup, and debugging that is a nightmare.

In our case, we prefilter data in Spark then load into NetworkX. Works ok, mostly.

well, it is not very different from what Google Expander does - https://research.googleblog.com/2016/10/graph-powered-machin...

our data sets have massively grown over the last few months and now need a bigger solution. I think we will start off with a hosted solution like EMR - performance is not super critical right now (batch mode training)... but developer productivity is key.

Yes, certainly label propagation type algorithms are more suited to Spark than TensorFlow (although of course the fast matrix operations in TF could work well for this).
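For anyone unfamiliar with what "label propagation type algorithms" look like, here is a minimal plain-Python sketch, assuming a simple synchronous neighbour-majority-vote variant. It is not Expander, Spark, or TensorFlow code; the function name, the tie-breaking rule, and the toy graph are all assumptions made for illustration.

```python
from collections import Counter


def propagate_labels(edges, seeds, rounds=10):
    """Spread seed labels over an undirected graph by neighbour majority vote.

    edges: iterable of (node, node) pairs
    seeds: dict of node -> label; these labels are held fixed
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    labels = dict(seeds)
    for _ in range(rounds):
        updated = dict(labels)
        for node in sorted(neighbours):
            if node in seeds:
                continue  # seed labels never change
            # Vote using the previous round's labels (synchronous update).
            votes = Counter(labels[n] for n in neighbours[node] if n in labels)
            if votes:
                # Assumed tie-break: highest count, then smallest label.
                updated[node] = min(votes, key=lambda lab: (-votes[lab], lab))
        if updated == labels:
            break  # converged
        labels = updated
    return labels


# Two triangles joined by a single edge, one seed in each triangle.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
result = propagate_labels(edges, {"a": "red", "f": "blue"})
print(result)
```

Each round only needs a node's neighbours and their current labels, which is why this style of algorithm maps naturally onto a partitioned, message-passing system. The same update can also be written as repeated multiplication by a normalised adjacency matrix, which is where the fast matrix operations would come in.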

Spark is vastly different. Tensorflow focuses more on numerical computing and is a low level tool.

Spark is more focused on "counting at scale with a functional DSL". Hence its focus on things like ETL and columnar processing ala dataframes.

As far as spark doing "deep learning" what you should mean here is: "libraries in the ecosystem leverage spark as a data access layer for doing the real numerical compute"

Spark can count things with functional programming. It's not meant for heavy numerical operations. They are working on this where they can but you really can't beat a gpu or good ole simd instructions on hardware.

Spark doesn't do GPU acceleration, which is super important if you don't have a lot of spare CPU capacity. The one time I tried to train a DL model on CPU it was 48x slower, and with communication overhead it would have taken more than 48 cores to match the single GPU. And given you want to do hyperparameter search on top of that, those CPU cores start adding up.
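A rough back-of-the-envelope version of that arithmetic, as a sketch only. It treats the 48x figure as a per-core slowdown, as the comment implies. The parallel-efficiency values and the trial count are hypothetical, not measurements.

```python
import math


def cores_to_match_gpu(slowdown, parallel_efficiency):
    """CPU cores needed to match one GPU.

    slowdown: how many times slower one core is than the GPU
    parallel_efficiency: fraction of ideal speedup kept after
        communication overhead, in (0, 1]
    """
    if not 0 < parallel_efficiency <= 1:
        raise ValueError("parallel_efficiency must be in (0, 1]")
    return math.ceil(slowdown / parallel_efficiency)


SLOWDOWN = 48   # from the comment above
TRIALS = 20     # hypothetical number of concurrent hyperparameter trials

for efficiency in (1.0, 0.75, 0.5):  # hypothetical scaling efficiencies
    cores = cores_to_match_gpu(SLOWDOWN, efficiency)
    print(f"efficiency={efficiency}: {cores} cores per GPU-equivalent, "
          f"{cores * TRIALS} cores for {TRIALS} concurrent trials")
```

With perfect scaling you need exactly 48 cores. Any efficiency below 1.0 pushes the count higher, and running the search trials concurrently multiplies it again.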

that doesn't seem right. Databricks seems to have this in production. https://databricks.com/blog/2016/10/27/gpu-acceleration-in-d...


I haven't used this feature - but are you sure ?

This seems exactly right. The Databricks GPU stuff isn't generally available and is built on TF anyway.

hmm.. at least databricks claims that the GPU clusters are beta, but generally available



In addition, IBM has multiple projects around GPU aware spark - https://github.com/IBMSparkGPU http://www.spark.tc/gpu-acceleration-on-apache-spark-2/

Yes the Databricks stuff is available on their cluster. It isn't part of Spark the Open Source project.

Yes, as I said elsewhere there are plenty of projects to enable GPU usage via Spark. Have you actually tried them though? I have (eg https://github.com/IBMSparkGPU/GPUEnabler/issues/25 ) and there are... issues.

It's a proprietary add-on, not part of Apache Spark itself

Hmm, seems interesting but wondering how it compares to H2O's Sparkling Water. Been using that for clients and I love it.

We don't seem to hear much about 0xdata on hn.

A rather transparent attempt to pull people away from GPUs and instead use massive farms of (intel powered) machines.

It does not appear that this uses a GPU at all. Which is okay of course but may not win any speed contests.

gpu acceleration in spark is generally production ready - https://databricks.com/blog/2016/10/27/gpu-acceleration-in-d...

in fact looks like you can use tensorflow models in spark with GPU - https://databricks.com/blog/2016/12/21/deep-learning-on-data...

Tensorframes despite the marketing is already defunct. It hasn't seen a commit since august.


Spark is just a data access layer here. It's not even remotely gpu friendly. Most people also still rely on mesos or yarn for running distributed. The library you're using matters a lot. Mesos just added gpu support: http://mesos.apache.org/documentation/latest/gpu-support/

Yarn can sorta support it with node labeling for job completion but it's still kinda hacky.

The real work in this space (without the marketing) is done by IBM: http://www.slideshare.net/ishizaki/exploiting-gpus-in-spark

When spark can (without "production ready" buzzwords) run gpus like this out of the box then we're talking. For now spark needs a companion library to work with gpus though.

Intel is pushing their own chips for deep learning (i.e. Xeon and Xeon Phi). They claim that using MKL on these chips gets comparable performance to using Caffe/cuDNN on a high-end GPU.

Phi may yet win out. I saw a talk from an NLP researcher about Seq2Seq models; someone asked him what his wish for hardware was, and it was for GPU cores to not be as stuck in lock step with each other control-wise. Not sure if he knew about the Phis though.
