Relatedly, why are you advocating using MLLib/RDDs when they have been deprecated in favor of ML/DataFrames (http://spark.apache.org/docs/latest/ml-guide.html)?
Note that the DataFrame version of MLLib (which is still called MLLib btw, even if the package is different) isn't feature complete compared to the RDD version.
For example, the DataFrame version of LogisticRegression is binary only, while the RDD one allows multiple classes. I guess the suggested workaround is to use the OneVsRest classifier (in Spark 2.0), but that blows the memory out on fairly small tasks.
I like Spark a lot, but MLLib isn't as strong as it appears at first glance.
The Spark hyperparameter selection uses an exhaustive grid search approach, which can take a long time (complexity grows exponentially with the number of parameters to tune) and produce poor results compared to other methods. Bayesian optimization is a great way to tune time-consuming and expensive functions like ML pipelines, where finding a good configuration in a small number of total attempts is the only tractable way to tune the system.
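To make the exponential growth concrete, here's a small stdlib-Python sketch (the parameter names and candidate values are illustrative only, not from the post):

```python
from itertools import product

# Hypothetical hyperparameter grid (names and values are illustrative only).
grid = {
    "rank": [8, 16, 32, 64],
    "numIter": [10, 20, 30],
    "lambda": [0.01, 0.1, 1.0, 10.0],
}

# Exhaustive grid search must evaluate every combination: the cost is the
# product of the per-parameter grid sizes, so it grows exponentially with
# the number of parameters being tuned.
combos = list(product(*grid.values()))
print(len(combos))  # 4 * 3 * 4 = 48

# Adding one more parameter with k candidate values multiplies the cost by k.
grid["alpha"] = [0.2, 0.4, 0.6, 0.8, 1.0]
print(len(list(product(*grid.values()))))  # 48 * 5 = 240
```

If each combination means retraining a full ALS model, the difference between 48 and 240 expensive evaluations is exactly the budget a sample-efficient method tries to avoid spending.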
The ALS implementation in MLlib still requires ratings as an RDD and hasn't moved over to ML/DataFrames yet.
Edit: Looks like the original comment was edited, but this post does in fact use the built in MLlib ALS implementation.
ALS has been available in ML/DataFrames since 1.3.0, per the documentation. (http://spark.apache.org/docs/latest/api/scala/index.html#org...)
> Edit: Looks like the original comment was edited, but this post does in fact use the built in MLlib cross validation methods.
I edited the comment to correct my error that ALS was not mentioned as being native to Spark. However, for hyperparameter cross-validation, I looked at the code in the repository (https://github.com/sigopt/sigopt-examples/blob/master/spark/...), and while the Spark CrossValidation function is imported at the beginning of the Scala file (from ML, not MLlib), it is not used in the code, in favor of SigOpt.
Admittedly I'm not a Spark expert though, thanks for bringing this up!
RDDs themselves, however, have been superseded by DataFrames, since DataFrames are faster and cover most of the functionality of RDDs for common use cases.
So if you want to do one of the operations in the sql.functions package, then DataFrames (and Datasets) are very valuable.
If not, then they won't give you much benefit. However, you may still get a small improvement from the Tachyon off-heap memory framework, which I don't think the RDD API has access to.
Pushdown is the most obvious one. If I don't know what data store is underlying your RDD, I don't know your schema, and I don't know what column you're projecting, pushdown is impossible. I can't know that with an RDD, because all I know when you call map is that you're converting from type A to type B.
DataFrames make that class of optimization possible, because they have more information (your schema, the underlying store) and more limited operations (select a column, not an arbitrary map operation).
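As a toy sketch of the point (pure Python, emphatically not Spark internals; all names here are made up for illustration): a schema-aware `select` lets the engine read only the requested column, while an opaque `map` forces it to materialize whole rows because it cannot see inside the function.

```python
# Toy column store: the "data source" tracks which columns are actually read.
class ColumnStore:
    def __init__(self, columns):
        self.columns = columns          # dict: name -> list of values
        self.columns_read = set()       # instrumentation

    def read(self, name):
        self.columns_read.add(name)
        return self.columns[name]

store = ColumnStore({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# DataFrame-style: the engine sees the column name, so it can push the
# projection down and read only that column from the store.
def select(store, name):
    return store.read(name)

# RDD-style: map takes an opaque function row -> value; the engine cannot
# tell which fields f touches, so it must materialize every column first.
def rdd_map(store, f):
    rows = [dict(zip(store.columns, vals))
            for vals in zip(*(store.read(n) for n in store.columns))]
    return [f(r) for r in rows]

select(store, "a")
print(sorted(store.columns_read))   # only "a" has been read so far
```

After the `select`, the instrumentation shows a single column touched; any `rdd_map` call would touch all three, which is the pushdown opportunity DataFrames preserve and RDDs destroy.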
Wait, never mind. If GPR is like an infinite-dimensional linear regression then doing it within an RKHS means you don't actually have to bother with the functional generalization and can get solutions to a potentially ill-behaved loss function along a grid/cube/hypercube/whatever. Is this part of why SigOpt works more efficiently than classical parameter space sampling designs?
Not-so-ninja edit: Time for me to re-read Rasmussen, I think.
First off, the use of the term "kernel trick" appears, I think, primarily within the machine learning community. It refers to the idea that some other (likely more useful) representation of the data of interest exists, but that the representation might be in a much larger, or even infinite-dimensional, space. Fortunately, in the context of certain algorithms such as support vector machines, that representation never appears by itself ... it only appears when inner-producted (not a word) with another such representation associated with some other piece of data. For certain representations, that inner product can be represented by a reproducing (also called positive definite) kernel, and thus can be computed without ever forming the larger representation. This concept appears in Wikipedia https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_spa..., both for the Hilbert space inner product and for the Mercer's series representation. Unfortunately, as is the case with a lot of higher math on Wikipedia, it's more useful as a reference for an expert than to get you rolling on a new topic.
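A tiny pure-Python illustration of the idea described above, using the degree-2 polynomial kernel (my own example, not from the comment): the kernel evaluation agrees with the inner product of the explicit feature maps, without ever forming those maps.

```python
import math

# Explicit degree-2 polynomial feature map for 2-D input:
# phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), chosen so <phi(x), phi(y)> = (x . y)^2.
def phi(x):
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# The "kernel trick": k(x, y) = (x . y)^2 computes the same inner product
# directly in the original 2-D space, never building phi(x) or phi(y).
def k(x, y):
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
print(dot(phi(x), phi(y)))  # 121.0 (up to rounding)
print(k(x, y))              # 121.0
```

For this kernel the explicit space is only 3-dimensional, but the same identity is what makes infinite-dimensional representations (like the Gaussian kernel's) computable at all.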
As far as kernel methods in general, those pop up all over the place, including in the context of Gaussian processes. Gaussian process regression (GPR), also called kernel-based approximation in the numerical analysis community, is one great example of that. I generally try to think of it not as an infinite dimensional linear regression method, but rather as a constrained optimization problem. There are infinitely many functions that pass through a given set of data, thus asking for the "one" is ill-posed ... I am interested in the most well-behaved one. We define the behavior of a function as its RKHS-norm, and the function that minimizes that norm, but still respects the observed data, is the solution to the GPR problem.
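A minimal sketch of that minimum-norm interpolation view, in pure Python with a Gaussian kernel (the kernel choice, shape parameter, and data are illustrative assumptions, not SigOpt's method): the minimum-RKHS-norm interpolant takes the form s(x) = Σᵢ cᵢ k(x, xᵢ), with the coefficients solving K c = y where K[i][j] = k(xᵢ, xⱼ).

```python
import math

# Gaussian (RBF) kernel in 1-D; eps is an illustrative shape parameter.
def k(x, y, eps=1.0):
    return math.exp(-(eps * (x - y)) ** 2)

def solve(A, b):
    # Tiny Gaussian elimination with partial pivoting (illustrative only).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]                       # observed data (from f(x) = x^2)
K = [[k(a, b) for b in xs] for a in xs]    # kernel (Gram) matrix
c = solve(K, ys)

# The minimum-norm interpolant: a kernel expansion centered at the data.
def s(x):
    return sum(ci * k(x, xi) for ci, xi in zip(c, xs))

# It respects the observed data exactly (up to floating-point rounding).
print([round(s(x), 6) for x in xs])  # [0.0, 1.0, 4.0]
```

Among all functions in the RKHS passing through the data, this expansion is the one of smallest norm, which is the "most well-behaved one" in the sense above.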
Regarding why SigOpt performs better than a good old-fashioned grid, that can be for multiple reasons. Probably the most important one, at least in my mind, is that SigOpt is not trying to create some perfect model of the function at hand - SigOpt is only interested in optimizing that function. That drives every decision we make, and actually it is something I often need to remind myself of because I grew up in approximation theory. A basic design of experiments is interested in understanding how the function works everywhere, but we can save some expense by more swiftly discarding regions unlikely to contain the optimum.
As far as why RKHS in particular work very well - that deals with the optimality theorems underlying approximation using reproducing kernels. Assuming you have a reasonable function on a reasonable domain it probably belongs to a RKHS - such functions can be represented very effectively by an appropriate kernel. Now, determining that kernel can be a complicated task (maximum likelihood estimation or cross-validation are common tools) but if you have an acceptable kernel you can make strong statements about the quality of the model. Because the quality of the model is constantly improving, the GP-backed optimization tool is constantly providing a better representation of the true function, and thus pointing towards the optimum more quickly.
There's another more fundamental reason why searching on a grid in higher dimensions is trouble, and it deals with the fact that the ratio of the volume of a sphere and of a cube with the same radius decreases as the dimension increases. This means that there is increasingly much volume away from the center of a box in increasingly many dimensions. Using a grid to try and fill that space becomes unacceptably costly. Of course, Gaussian processes have their own issues for larger problems (>30 dimensions, maybe) but SigOpt has more than just Gaussian processes behind the scenes. Also, there have been improvements over the years in GP performance in higher dimensions (for example, http://epubs.siam.org/doi/abs/10.1137/10080138X).
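That shrinking ball-to-cube volume ratio is easy to compute directly from the standard unit-ball volume formula, Vₙ = π^(n/2) / Γ(n/2 + 1):

```python
import math

# Ratio of the volume of the unit ball to that of its enclosing cube
# [-1, 1]^n (volume 2^n). The ratio collapses as n grows: almost all of a
# high-dimensional box lies in its "corners", away from the center, which
# is what makes filling the space with a grid so costly.
def ball_to_cube_ratio(n):
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) / 2 ** n

for n in (2, 3, 5, 10, 20):
    print(n, ball_to_cube_ratio(n))
```

By n = 2 the ratio is already π/4 ≈ 0.785, and by n = 20 it is on the order of 10⁻⁸, so a grid fine enough to cover the corners wildly oversamples everywhere else.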
There's some pretty solid notes for graduate students on this topic at http://math.iit.edu/~fass/590/. That was the class I took to first learn this stuff as a student and I've tried to contribute back to them over the years. It comes at the topic first from the math side, not the stats side, but there's some stats stuff in there as well. Chapter 8 of those slides contains that theorem regarding the minimum-norm interpolant.
Hope that helps.
Thanks! Also go Big Red. ;-)
What optimal parameters did it find for these?
rank = 36
numIter = 30
log_lambda = -2.90405347693
More info on the methods behind SigOpt can be found at https://sigopt.com/research.
I'm worried this is going to be like good Scotch for me.
We've found that SigOpt compares very favorably to other Bayesian optimization approaches. In addition to this, our hosted platform allows people to harness the full power of GP backed Bayesian optimization with just a few lines of code instead of the sometimes heavy administration required by other methods.
We do, however, run a rigorous evaluation framework over our methods as we iteratively improve (and compare to other techniques). This allows us to build up our ensemble of optimization strategies to most efficiently tackle problems that are most important to our users. As we see users leveraging our service for certain types of problems (like mixed continuous/categorical + failure regions) we do try to incorporate them more into our testing, roadmap, and ensemble, but only at the meta level.
- Individual: $1,000/month
- Enterprise: Custom pricing
I am not a multi-million $ company, so I guess it's useless for me.