
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE - shanxS
http://blogs.aws.amazon.com/bigdata/post/TxGEL8IJ0CAXTK/Generating-Recommendations-at-Amazon-Scale-with-Apache-Spark-and-Amazon-DSSTNE
======
minimaxir
If you haven't kept up with Spark, and do not have Amazon-level workloads, the
built-in machine learning APIs are _extremely_ robust and well-documented
([https://people.apache.org/~pwendell/spark-nightly/spark-mast...](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.ml.html)),
even in the non-Scala languages like Python. There's even a Multilayer
Perceptron model creation function, which builds artificial neural networks
using the feed-forward/backpropagation approach everyone loves.
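
For instance, a minimal sketch of what that looks like (assuming Spark
2.0-style imports; the toy data and layer sizes are just for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import MultilayerPerceptronClassifier

    spark = SparkSession.builder.appName("mlp-sketch").getOrCreate()

    # Toy two-class dataset: a "features" vector column plus a numeric "label"
    train_df = spark.createDataFrame(
        [(Vectors.dense([0.0, 0.0]), 0.0),
         (Vectors.dense([1.0, 1.0]), 1.0),
         (Vectors.dense([1.0, 0.0]), 1.0),
         (Vectors.dense([0.0, 1.0]), 0.0)],
        ["features", "label"])

    # layers = [input size, hidden layer size, number of output classes]
    mlp = MultilayerPerceptronClassifier(layers=[2, 4, 2], maxIter=100, seed=42)
    model = mlp.fit(train_df)
    model.transform(train_df).select("features", "prediction").show()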

The new Spark DataFrames also make manipulating data almost as easy as with
Python Pandas / R dplyr. The new Berkeley edX course
([https://courses.edx.org/courses/course-v1:BerkeleyX+CS105x+1...](https://courses.edx.org/courses/course-v1:BerkeleyX+CS105x+1T2016/info))
is a very good explainer.
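
To give a (made-up) flavor of that DataFrame style:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("df-sketch").getOrCreate()

    # A toy purchases table; in practice you'd get this from spark.read.*
    purchases = spark.createDataFrame(
        [("alice", "book", 12.0), ("bob", "lamp", 30.0), ("alice", "pen", 2.0)],
        ["user_id", "item_id", "price"])

    # dplyr/Pandas-style group -> aggregate -> sort, chained in one expression
    (purchases
        .groupBy("user_id")
        .agg(F.sum("price").alias("total_spent"))
        .orderBy(F.desc("total_spent"))
        .show())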

After Spark 2.0.0 is released, it wouldn't surprise me if it really takes off
(as long as setting up a cluster gets a bit easier!).

~~~
nchammas
> as long as setting up a cluster is a bit easier!

If you're on AWS, it's already quite easy to set up a cluster today, no?

There's EMR of course, and there are tools like spark-ec2 [0] and Flintrock
[1].

There are a few more tools listed on Spark Packages that target different
cloud providers [2], too.

Disclaimer: I am the primary author of Flintrock and am a contributor to
spark-ec2.

[0] [https://github.com/amplab/spark-ec2](https://github.com/amplab/spark-ec2)

[1]
[https://github.com/nchammas/flintrock](https://github.com/nchammas/flintrock)

[2] [https://spark-packages.org/?q=tags%3Adeployment](https://spark-packages.org/?q=tags%3Adeployment)

~~~
minimaxir
Setting up a Spark cluster is easy relative to setting up other kinds of
clusters, but it's still not as easy as simply downloading a package in
R/Python.

~~~
nchammas
Well, the absolute easiest way to run Spark is to do it locally (e.g. you can
brew install it on a Mac and just go) or to pay for a proprietary service like
Databricks, which makes setting up a cluster take a few clicks.

That said, I think `flintrock launch my-cluster` is almost as easy as doing
`pip install ...`.

You do need an AWS account and you do need to set your preferences like region
and key name in a config file, but I don't see how you can get out of doing
even that without subscribing to some managed service like Databricks that
abstracts everything away and replaces it with a nice Web UI.
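
Roughly, the whole flow looks like this (the cluster name and slave count are
just examples):

    # one-time setup: region, key pair, instance type, etc.
    flintrock configure

    # launch a cluster with one master and two slaves
    flintrock launch my-cluster --num-slaves 2

    # SSH into the master node
    flintrock login my-cluster

    # tear everything down when you're done
    flintrock destroy my-cluster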

------
vonnik
We already built what Amazon was looking for: an extensible deep-learning
framework that works on distributed CPUs and GPUs heterogeneously, integrating
with Spark as an access layer to orchestrate multiple host threads.

[https://github.com/deeplearning4j](https://github.com/deeplearning4j)

DL4J may be the DL library with the most sophisticated Spark integration at
this point. The trick is to avoid using Spark as a computation layer, since it
doesn't do that well.

We're pushing a cuDNN wrapper tomorrow.

[http://deeplearning4j.org/quickstart](http://deeplearning4j.org/quickstart)

Unlike DSSTNE, Tensorflow or CNTK, our deep learning library is neutral, and
not designed with the intention of locking people into a cloud service.

~~~
scottlegrand2
Wow, defensive much? Afraid of a little competition maybe?

Anyway, DSSTNE itself is not in any way locked to any cloud service
whatsoever. It is an Apache-licensed library with dependencies on a C++11
compiler, a 7.0 or later CUDA Toolkit and Kepler or better GPU, a
C++11-friendly MPI library, netcdf, and libjsoncpp. That's it. Please stop
insinuating otherwise.

The article here shows how one could use DSSTNE with Spark for recommendations
at scale. Speaking of which, how's your sparse data and model-parallel
training support? Because as the author of DSSTNE, the poor support for these
features in other frameworks was what forced us to "roll our own" code in
the first place. Everyone else was optimizing for ImageNet-winning CNNs
(including NVIDIA). And there's nothing wrong with that, $25M+ companies have
been built from that, but it just wasn't the use case here.

Finally, cuSparse is cuSlow for datasets at Amazon. DSSTNE's hand-coded sparse
kernels stomp on cuSparse in the same way that Neon's convolution kernels
stomp on cuDNN. And there's nothing wrong with that either. cuSparse is a
great choice for other sparse data problems, just not Amazon's (and, I
suspect, many other companies') recommendation problems.

------
Havoc
Their suggestions aren't all that great in my experience. It's either
something I looked at/bought before or something completely arbitrary.

