
Bayesian Optimization for Collaborative Filtering with MLlib - Zephyr314
http://blog.sigopt.com/post/148703071378/sigopt-for-ml-bayesian-optimization-for
======
minimaxir
Wait, Spark has built-in model hyperparameter selection
([http://spark.apache.org/docs/latest/ml-tuning.html](http://spark.apache.org/docs/latest/ml-tuning.html)),
which is not mentioned in the article. What advantages does your service offer?

Relatedly, why are you advocating MLlib/RDDs when they have been deprecated in
favor of ML/DataFrames
([http://spark.apache.org/docs/latest/ml-guide.html](http://spark.apache.org/docs/latest/ml-guide.html))?

~~~
Zephyr314
Great questions!

The Spark hyperparameter selection uses an exhaustive grid search approach,
which can take a long time (complexity grows exponentially with number of
parameters to tune) and produce poor results when compared to other methods
[1]. Bayesian optimization is a great way to tune time consuming and expensive
functions like ML pipelines, where finding a good configuration in a small
number of total attempts is the only tractable way to tune the system.
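The exponential blow-up is easy to see with a toy count of grid points (the hyperparameter names and ranges below are illustrative, not taken from the post):

```python
from itertools import product

# Hypothetical search space for an ALS-style pipeline; each added
# hyperparameter multiplies the number of configurations to train.
grid = {
    "rank": [8, 16, 32, 64],
    "numIter": [10, 20, 30],
    "log_lambda": [-4, -3, -2, -1, 0],
}

# Exhaustive grid search trains one full model per combination.
configs = list(product(*grid.values()))
print(len(configs))  # 4 * 3 * 5 = 60 full trainings for one coarse grid
```

Adding one more 5-value parameter quintuples that cost, whereas a Bayesian approach keeps the evaluation budget fixed and chooses each configuration adaptively.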

The ALS implementation in MLlib still requires ratings as an RDD and hasn't
moved over to ML/DataFrames yet.

Edit: Looks like the original comment was edited, but this post does in fact
use the built-in MLlib ALS implementation.

[1]: [http://arxiv.org/abs/1603.09441](http://arxiv.org/abs/1603.09441)

~~~
minimaxir
> The ALS implementation in MLlib still requires ratings as an RDD and hasn't
> moved over to ML/DataFrames yet.

ALS has been available in ML/DataFrames since 1.3.0, per the documentation.
([http://spark.apache.org/docs/latest/api/scala/index.html#org...](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS))

> Edit: Looks like the original comment was edited, but this post does in fact
> use the built-in MLlib cross validation methods.

I edited the comment to correct my error that ALS was not mentioned as being
native to Spark. However, for hyperparameter cross-validation, I looked at the
code in the repository
([https://github.com/sigopt/sigopt-examples/blob/master/spark/...](https://github.com/sigopt/sigopt-examples/blob/master/spark/recommender_sys/src/main/scala/MovieLense.scala)),
and while Spark's CrossValidator is imported at the beginning of the Scala
file (from ML, not MLlib), _it is not used in the code_, in favor of SigOpt.

I'm confused.

~~~
idewanck
Thanks for pointing out the switch away from RDDs. Looking at the master
branch for MLlib on GitHub, it looks like ALS is still using RDDs?
[https://github.com/apache/spark/blob/master/mllib/src/main/s...](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala)

Admittedly, I'm not a Spark expert, though. Thanks for bringing this up!

~~~
minimaxir
MLlib operates on RDDs; ML operates on DataFrames.

RDDs themselves, however, have been superseded by DataFrames, since DataFrames
are faster and cover most RDD functionality for common use cases.

~~~
kod
RDDs are not obsolete. The reason DataFrames are faster is exactly because
they are more restrictive than RDDs. If you need to run arbitrary code, RDDs
are still more flexible.

~~~
dxbydt
Since every DataFrame has a lazy instance of an RDD, and several methods on
DataFrames simply call the corresponding method on the RDD (e.g. foreach), I
am not sure about the "faster" part of your assertion.

~~~
nl
They are _faster_ in the sense that many things you previously had to do in an
RDD lambda are now DataFrame operations (which are optimized by the Catalyst
compiler).

So if you want to do one of the operations in the sql.functions package [1],
then DataFrames (and Datasets) are very valuable.

If not, then they won't give you much benefit. However, you may still get a
little improvement from the Tachyon out-of-JVM-memory framework [2], which I
don't think the RDD version has access to.

[1] [http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.ht...](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions)

[2] [https://dzone.com/articles/Accelerate-In-Memory-Processing-w...](https://dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon)

------
apathy
I could give a shit about the hyperparameter tuning (CV... it Works For Me)
but your writeup of Gaussian processes and why they are called kriging in
spatial stats is awesome.

[http://blog.sigopt.com/post/130275376068/sigopt-fundamentals...](http://blog.sigopt.com/post/130275376068/sigopt-fundamentals-intuition-behind-gaussian)

~~~
mccourt
Dear apathy (love the name), I wrote that post and am really glad that you
liked it. I had another, more recent one that focuses on a different topic but
has a solid paragraph right at the beginning that also steps through some of
the history. It might be useful because it has a bunch of links tying together
content on GPs, kriging, and RKHS. It is linked below. Have a great day.

[http://blog.sigopt.com/post/147952139093/sigopt-fundamentals...](http://blog.sigopt.com/post/147952139093/sigopt-fundamentals-breaking-free-of-the-grid)

~~~
apathy
Great post. Why is the kernel trick so important for Gaussian processes? My
understanding is that operating in an RKHS is primarily needed to enable that
cute little maneuver.

Wait, never mind. If GPR is like an infinite-dimensional linear regression
then doing it within an RKHS means you don't actually have to bother with the
functional generalization and can get solutions to a potentially ill-behaved
loss function along a grid/cube/hypercube/whatever. Is this part of why SigOpt
works more efficiently than classical parameter space sampling designs?

Not-so-ninja edit: Time for me to re-read Rasmussen, I think.

~~~
mccourt
There are a couple of different points there, so lemme see if I can answer
each of them.

First off, the use of the term "kernel trick" appears, I think, primarily
within the machine learning community. It refers to the idea that some other
(likely more useful) representation of the data of interest exists, but that
the representation might be in a much larger, or even infinite-dimensional,
space. Fortunately, in the context of certain algorithms such as support
vector machines, that representation never appears by itself ... it only
appears when inner-producted (not a word) with another such representation
associated with some other piece of data. For certain representations, that
inner product can be represented by a reproducing (also called positive
definite) kernel, and thus can be computed without ever forming the larger
representation. This concept appears in Wikipedia
[https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_spa...](https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space),
both for the Hilbert space inner product and for the Mercer's series
representation. Unfortunately, as is the case with a lot of higher math on
Wikipedia, it's more useful as a reference for an expert than to get you
rolling on a new topic.
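Here's a toy numpy sketch of that idea (the points and kernel width are made up): for the Gaussian kernel, the feature representation phi lives in an infinite-dimensional space, yet its inner products are computed in one line without ever forming phi.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2) equals the inner product
    # <phi(x), phi(y)> in an infinite-dimensional feature space,
    # computed without ever constructing phi explicitly.
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(rbf_kernel(x, x))                      # 1.0: unit "length" in feature space
print(rbf_kernel(x, y) == rbf_kernel(y, x))  # True: symmetric, like any inner product
```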

As far as kernel methods in general, those pop up all over the place,
including in the context of Gaussian processes. Gaussian process regression
(GPR), also called kernel-based approximation in the numerical analysis
community, is one great example of that. I generally try to think of it not as
an infinite dimensional linear regression method, but rather as a constrained
optimization problem. There are infinitely many functions that pass through a
given set of data, thus asking for the "one" is ill-posed ... I am interested
in the most well-behaved one. We define the behavior of a function as its
RKHS-norm, and the function that minimizes that norm, but still respects the
observed data, is the solution to the GPR problem.
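A minimal numpy sketch of that constrained-optimization view, with made-up data and a Gaussian kernel: the GPR posterior mean reproduces the observations exactly, and among all interpolants it is the one of minimum RKHS norm.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Gaussian kernel matrix between 1-D point sets a and b.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# Observed data: any admissible function must pass through these points.
X = np.array([0.0, 1.0, 2.5])
y = np.array([1.0, -0.5, 2.0])

K = rbf(X, X) + 1e-10 * np.eye(len(X))  # tiny jitter for numerical stability
alpha = np.linalg.solve(K, y)

def posterior_mean(x_new):
    # s(x) = k(x, X) K^{-1} y is the minimum-RKHS-norm function that
    # respects the observed data (noise-free case).
    return rbf(np.atleast_1d(x_new), X) @ alpha

print(posterior_mean(X))  # reproduces y at the training points
```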

Regarding why SigOpt performs better than a good old-fashioned grid, that can
be for multiple reasons. Probably the most important one, at least in my mind,
is that SigOpt is not trying to create some perfect model of the function at
hand - SigOpt is only interested in optimizing that function. That drives
every decision we make, and actually it is something I often need to remind
myself of because I grew up in approximation theory. A basic design of
experiments is interested in understanding how the function works everywhere,
but we can save some expense by more swiftly discarding regions unlikely to
contain the optimum.

As far as why RKHS in particular work very well - that deals with the
optimality theorems underlying approximation using reproducing kernels.
Assuming you have a reasonable function on a reasonable domain, it probably
belongs to an RKHS - such functions can be represented very effectively by an
appropriate kernel. Now, determining that kernel can be a complicated task
(maximum likelihood estimation or cross-validation are common tools) but if
you have an acceptable kernel you can make strong statements about the quality
of the model. Because the quality of the model is constantly improving, the
GP-backed optimization tool is constantly providing a better representation of
the true function, and thus pointing towards the optimum more quickly.

There's another more fundamental reason why searching on a grid in higher
dimensions is trouble, and it deals with the fact that the ratio of the volume
of a sphere to that of a cube with the same radius decreases as the dimension
increases. This means that an increasing share of the volume lies away from
the center of a box as the number of dimensions grows. Using a grid to try to
fill that space becomes unacceptably costly. Of course, Gaussian processes have
their own issues for larger problems (>30 dimensions, maybe) but SigOpt has
more than just Gaussian processes behind the scenes. Also, there have been
improvements over the years in GP performance in higher dimensions (for
example,
[http://epubs.siam.org/doi/abs/10.1137/10080138X](http://epubs.siam.org/doi/abs/10.1137/10080138X)).
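You can compute that vanishing volume ratio directly from the standard unit-ball volume formula:

```python
from math import gamma, pi

def sphere_to_cube_ratio(d):
    # Volume of the unit d-ball, pi^(d/2) / Gamma(d/2 + 1), divided by
    # the volume of the enclosing cube [-1, 1]^d, which is 2^d.
    return (pi ** (d / 2) / gamma(d / 2 + 1)) / 2 ** d

for d in (2, 3, 10, 20):
    print(d, sphere_to_cube_ratio(d))
```

At d = 2 the ratio is pi/4 (about 0.79); by d = 10 it has already dropped below 0.3%, so grid points overwhelmingly land in the corners, far from the center.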

There are some pretty solid notes for graduate students on this topic at
[http://math.iit.edu/~fass/590/](http://math.iit.edu/~fass/590/). That was the
class I took to first learn this stuff as a student and I've tried to
contribute back to them over the years. It comes at the topic first from the
math side, not the stats side, but there's some stats stuff in there as well.
Chapter 8 of those slides contains that theorem regarding the minimum-norm
interpolant.

Hope that helps.

~~~
apathy
Reading the slides now, my goodness a lot of things just became clear to me.
What a wonderful collection of pearls. Thank you for "giving back" to the
topic in this way.

------
a1k0n
So SigOpt was tuning rank (number of latent factors), number of iterations to
run the algorithm (in my experience alternating least squares generally
converges within 10-20 iterations, but there'd be no downside to running it
longer unless it's overfitting), and the regularization strength.

What optimal parameters did it find for these?

~~~
Zephyr314
Great question. As you point out, all of these parameters (rank, num
iterations and the actual reg. term) can in some way contribute to the
regularization of the reconstruction of the ratings matrix. We thought it
would be interesting to include all of them, and here are the optimal
parameters SigOpt found for this experiment:

rank = 36

numIter = 30

log_lambda = -2.90405347693

------
Zephyr314
I'm one of the co-founders of SigOpt (YC W15) and am happy to answer any
questions about this post (or anything about SigOpt).

More info on the methods behind SigOpt can be found at
[https://sigopt.com/research](https://sigopt.com/research).

------
apathy
Oh, also, for students: [https://sigopt.com/edu](https://sigopt.com/edu)

I'm worried this is going to be like good Scotch for me.

------
blahi
There is a package, mlrMBO, created by the great folks who created mlr
(absolutely awesome for building pipelines; you will ditch caret in a
second!). It's not on Spark, obviously, but I thought some might find it useful.

[https://github.com/mlr-org/mlrMBO](https://github.com/mlr-org/mlrMBO)

------
tachim
How does SigOpt compare to GPs?

~~~
Zephyr314
Great question. SigOpt is a hosted, scalable ensemble of different Bayesian
optimization methods. Many of these methods use Gaussian Processes (GPs) as
part of the modeling aspect of their Sequential Model Based Optimization
(SMBO). We've written up a primer on Bayesian optimization [1] that goes over
some of the different methods, with lots of citations for diving deeper.

We've found that SigOpt compares very favorably [2] to other Bayesian
optimization approaches. In addition to this, our hosted platform allows
people to harness the full power of GP backed Bayesian optimization with just
a few lines of code [3] instead of the sometimes heavy administration required
by other methods.

[1]:
[https://sigopt.ninja/1470077644/pdf/SigOpt_Bayesian_Optimiza...](https://sigopt.ninja/1470077644/pdf/SigOpt_Bayesian_Optimization_Primer.pdf)

[2]: [http://arxiv.org/abs/1603.09441](http://arxiv.org/abs/1603.09441)

[3]: [https://sigopt.com/docs](https://sigopt.com/docs)
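For intuition, here is a bare-bones SMBO loop of the sort those references describe: a GP surrogate plus an upper-confidence-bound acquisition, on a 1-D toy objective. This is a generic sketch, not SigOpt's actual method, and the objective, kernel width, and UCB weight are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Expensive black-box objective to maximize (illustrative stand-in).
    return -(x - 0.3) ** 2

def rbf(a, b, gamma=10.0):
    # Gaussian kernel matrix between 1-D point sets a and b.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# Start from a couple of random evaluations.
X = list(rng.uniform(0, 1, 2))
Y = [f(x) for x in X]

for _ in range(8):
    Xa, Ya = np.array(X), np.array(Y)
    K = rbf(Xa, Xa) + 1e-8 * np.eye(len(Xa))   # jitter for stability
    Kinv = np.linalg.inv(K)
    cand = np.linspace(0, 1, 101)               # candidate configurations
    kx = rbf(cand, Xa)
    mu = kx @ Kinv @ Ya                         # GP posterior mean
    var = 1.0 - np.sum(kx @ Kinv * kx, axis=1)  # GP posterior variance
    ucb = mu + 2.0 * np.sqrt(np.maximum(var, 0.0))  # upper confidence bound
    x_next = cand[np.argmax(ucb)]               # most promising next point
    X.append(x_next)
    Y.append(f(x_next))

print(max(Y))  # near f(0.3) = 0 after only ~10 total evaluations
```

The acquisition function trades off exploiting the surrogate's mean against exploring where its variance is high, which is what lets these methods find good configurations in a small evaluation budget.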

~~~
tachim
Thanks. Do you guys do any meta-learning based on information from
optimization runs on the many instances that SigOpt runs on? E.g. if Alice
uses SigOpt for a neural net architecture and Bob does too, is the information
from Alice's run used to improve Bob's?

~~~
Zephyr314
We don't explicitly share information at the experiment level like that for
privacy reasons. All user data is explicitly isolated from other users, on top
of the fact that due to the black box nature of our optimizer we often do not
know what the underlying method or system being optimized actually is.

We do, however, run a rigorous evaluation framework [1] over our methods as we
iteratively improve (and compare to other techniques). This allows us to build
up our ensemble of optimization strategies to most efficiently tackle problems
that are most important to our users. As we see users leveraging our service
for certain types of problems (like mixed continuous/categorical + failure
regions) we do try to incorporate them more into our testing, roadmap, and
ensemble, but only at the meta level.

[1]: [http://arxiv.org/abs/1605.06170](http://arxiv.org/abs/1605.06170)

------
visarga
[https://sigopt.com/pricing](https://sigopt.com/pricing)

- Individual: $1,000/month

- Enterprise: Custom pricing

I am not a multi-million $ company, so I guess it's useless for me.

~~~
Zephyr314
We have a free academic tier as well:
[https://sigopt.com/edu](https://sigopt.com/edu)

------
idewanck
Post author here, happy to answer any questions as well.

