I’ll have to test it of course but this looks truly life changing. A few clarifications, if the author happens to be watching this thread:
- looks like the search space is “enumerated”, i.e. I define the grid spacing for each independent variable in search_space and it puts together an outer product? In other words, this will never be truly continuous?
- by the same token, constraints are applied by virtue of specifying the endpoints of each grid in the search_space? Is it possible for me to apply more complex constraints by generating my own enumerated list of search_space points and feeding that directly?
Seriously well done. Like I said above, this looks to me like an immediate staple library.
I am not sure I understand your second question. The entire search space is always an N-dimensional cube. So you just define a start and end point and the step size in each dimension.
Maybe you could do something like:
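(A rough, untested sketch of what I mean; I'm assuming the dict-of-numpy-arrays search_space style from the readme, and using nan as the infeasible marker, which is just one possible convention:)

    import numpy as np
    from gradient_free_optimizers import RandomSearchOptimizer

    def objective(para):
        x, y = para["x"], para["y"]
        # "complex" constraint: only points inside the unit circle are feasible
        if x**2 + y**2 > 1:
            return np.nan  # mark the point as infeasible
        return -(x**2 + y**2)

    # the search space itself stays a plain grid per dimension
    search_space = {
        "x": np.arange(-1, 1, 0.01),
        "y": np.arange(-1, 1, 0.01),
    }

    opt = RandomSearchOptimizer(search_space)
    opt.search(objective, n_iter=1000)

So you keep the cube-shaped grid, but reject the infeasible points inside the objective function itself.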
I would caution against using nan to always mean infeasible. Instead, users should catch experiments outside the feasibility region and return a special infeasible value. This increases visibility into the optimizer's behavior, because it leaves nan free to flag values inside the feasible region that are still problematic (due to bugs, numerical instability, etc.).
Not always easy to do, but it can work in some cases if the optimizer can't deal with infeasible points natively.
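For example (a toy sketch; INFEASIBLE is a hypothetical sentinel you'd pick well below any achievable score):

    import numpy as np

    INFEASIBLE = -1e12  # hypothetical sentinel, far below any real score

    def objective(para):
        x, y = para["x"], para["y"]
        if x**2 + y**2 > 1:       # outside the feasibility region
            return INFEASIBLE     # explicit, visible infeasible marker
        score = -(x**2 + y**2)
        # nan stays reserved for evaluations inside the region that go wrong
        return score if np.isfinite(score) else np.nan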
What I’m wondering is if your library converts the search_space to an enumerated list of candidate points somewhere in the backend, and if there’s a way for me to just construct that list and pass it directly for more complex cases.
Am I right in thinking this library doesn't contain a simplex / Nelder-Mead optimiser? To me that's the bread and butter of gradient-free optimisation.
Also no Powell algorithms like BOBYQA?
nlopt has these and many more, and a Python wrapper:
So, are those omitted because this library is modern and those are now considered out of date?
I don't know the architecture of your project, but is there any way it could include NLopt, and so make all of its algorithms available? Some of them could be useful, and I doubt it's a good use of your time to reimplement them all from scratch.
I think it will help my understanding of optimization algorithms if I implement them myself. I currently have my eye on the downhill simplex, DIRECT, and Powell's method. But they are quite different from what I have already implemented, so this could take some time.
It's a great intro because it covers a large breadth and gives you a sense of what class of techniques are useful for what (rather than just going super deep into one particular technique).
They also have a GitHub repo with Julia notebooks implementing most of the ideas: https://github.com/sisl/algforopt-notebooks
Two questions off the top of my head:
- Might be worth mentioning how Hyperactive compares to the many other similar packages, like Hyperopt, Optunity, and others? Also how your GFO package differs from optima, opytimizer, and the rest. Having some benchmarks or performance stats (or even just listing functionality differences) would be very helpful to anyone who's already been using those.
- One of the best universal 'works-out-of-the-box-most-of-the-time' global optimizers I've used recently is DLib's: http://blog.dlib.net/2017/12/a-global-optimization-algorithm... - any chance of implementing something similar?
I will also look into Dlib. If you like, you can open an official issue in my repository. We could discuss why it is useful and whether it should be implemented.
For this reason, I chose Mango for a recent project:
(a Bayesian optimizer with built-in support for both parallel optimization, via multiprocessing, and distributed optimization, via Celery)
Hyperactive can do parallel computing with multiprocessing, joblib, or a custom wrapper function.
Parallel computing can be done with all of its optimization algorithms. You could even pass a Gaussian process regressor like GPflow to the Bayesian optimizer (GFO and Hyperactive) to get GPU acceleration.
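If it helps, a rough sketch of the multiprocessing route (parameter names from memory, so double-check against the docs):

    from hyperactive import Hyperactive

    def model(opt):
        x = opt["x"]
        return -(x ** 2)  # toy objective

    search_space = {"x": list(range(-100, 101))}

    hyper = Hyperactive()
    # n_jobs runs several searches in parallel via multiprocessing
    hyper.add_search(model, search_space, n_iter=500, n_jobs=4)
    hyper.run()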
I have previously used this approach in a project where the objective function contained a half-hour long simulation, which was the bottleneck that made estimating a gradient intractable. When the optimization algorithm gave a batch of several points in the search space to evaluate, our objective function could prepare and run several instances of the simulation in parallel, and return when the whole batch was complete. From this, it was easy for us to also distribute simulation runs across several machines, without needing any changes to the optimizers. We would not have been able to easily achieve this with an optimization framework that tried to directly manage parallelization, because the steps necessary to prepare the input files for the simulation software had to be done serially.
For that project we tried: DIRECT, several variants of Nelder-Mead, and an evolutionary strategy. In hindsight, the Nelder-Mead variants worked best; once we accumulated enough simulation results it became clear that our objective function was convex and pretty well-behaved in the region of interest. Nelder-Mead was also trivial to extend to trying several extra points per batch to ensure that each of our several workstations had something to work on. (We didn't have access to a large cluster, and Nelder-Mead wouldn't generalize well to a large degree of parallelization in that manner.)
But as far as illustrating how the optimization framework would need to work to support a vectorized objective function, you can take any existing sample objective function that's written to take N scalar arguments and update it to take N vector arguments, where the length of the vectors is the number of points in the batch to be evaluated. For simple numpy functions, there might not even need to be any changes to the code of the objective function.
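A toy illustration of that rewrite (plain numpy, no particular framework assumed):

    import numpy as np

    # scalar version: called once per point
    def sphere(x, y):
        return -(x**2 + y**2)

    # batched version: x and y are vectors, one entry per point in the batch
    def sphere_batch(x, y):
        x, y = np.asarray(x), np.asarray(y)
        return -(x**2 + y**2)  # broadcasting scores the whole batch at once

    scores = sphere_batch([0.0, 0.5, 1.0], [2.0, 1.0, 0.0])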
Is there anything like this for gradient-free optimisers?
You probably also want to include some advantages/disadvantages of each algorithm. How robust against local minima is it? Up to how many dimensions does it work well? How is the convergence speed when started far from the optimum? Does it work well with few function evaluations? Etc.
But there is still a lot of work to be done.
...python3.8/site-packages/sklearn/gaussian_process/kernels.py:402: ConvergenceWarning: The optimal value found for dimension 0 of parameter k2__noise_level is close to the specified lower bound 1e-05. Decreasing the bound and calling fit again may find a better value.
Sklearn warnings are often difficult to silence. Just google "sklearn suppress warnings" to get a code snippet for this problem.
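Something along these lines usually does it (the category import is the part people tend to miss):

    import warnings
    from sklearn.exceptions import ConvergenceWarning

    # silence the GP kernel-bound warnings; narrow the scope if you still
    # want to see other convergence problems
    warnings.filterwarnings("ignore", category=ConvergenceWarning)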
Perhaps a bit of a far-out question - I would be interested in using this for optimizing real-world (i.e. slow, expensive, noisy) processes. A caveat is that the work is done in batches (e.g. N experiments at a time). Is there a mechanism by which I could feed in my results from previous rounds and have the algorithm suggest the next N configurations, sufficiently uncorrelated to explore promising space without bunching on top of each other? My immediate read is that I could use the package to pick the next optimal point, but would then have to lean on a random search for the remainder of the batch?
I think those wrapper classes will already answer some of your questions. If you have additional or unanswered questions you can open an issue in the Gradient-Free-Optimizers repository.
Also, love or hate Python, there's almost always a library for whatever it is you want to do.
Can you, though? How do the outcomes compare to gradient-based methods?
If there are more "must have" algorithms, you could open an issue in Gradient-Free-Optimizers.
I don't know if it's popular, heheh, it's been a while since I've been in maths school, so that's why I'm asking.
Bayesian optimization is very good if your objective function takes some time (> 1 second) to evaluate. If you look at the GIFs in the readme of Gradient-Free-Optimizers you will see how fast it finds promising solutions. It is almost scary how good it is (even compared to other SMBOs).
For example, the objective function could be the "score" of a particular stochastic simulation, which can be started with a varied initial random seed, or the result of a real physical experiment, which is naturally stochastic (and expensive to evaluate).
There is a tradeoff between getting a very accurate estimate of the objective function and its variance at a single point vs. exploring other points. Is there a search algorithm that manages this tradeoff automatically?
Note: In the past I've used the Tree of Parzen Estimators (kernel density estimators), wasting 3-4 evaluations per point, but I have a feeling it is sub-optimal. Is there an "optimal" algorithm, like the optimal algorithm for the multi-armed bandit problem (which is a similar problem)?
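(For reference, the replicate-averaging approach I mean, which burns k evaluations per point; the toy noisy objective is just for illustration:)

    import numpy as np

    def noisy_objective(x, rng):
        return -(x**2) + rng.normal(scale=0.5)  # stochastic score

    def averaged_objective(x, k=4, seed=0):
        # spend k evaluations to estimate mean and variance at one point
        rng = np.random.default_rng(seed)
        scores = [noisy_objective(x, rng) for _ in range(k)]
        return np.mean(scores), np.var(scores, ddof=1)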
Edit: BO does usually require some tuning for your use case. Its acquisition function sometimes samples replicates where there's high noise, especially if the first sample looks particularly "good". There's usually a hyperparameter that can be set to favor exploration vs. exploitation, i.e. to favor non-replicate samples. But I am not aware of an algo that can learn your preference along that axis.
You've precisely described the problem: the algorithm will get stuck on a point if the first sample looks good, because of the assumption of zero variance. Until it randomly hits a luckier sample (but not necessarily a better point).
Another related problem is that the boundaries of the parameter space have a bad score (objective function) but very low variance (they're always bad), which tricks the search function into believing that the interior points also have very low variance, which is incorrect.
If anyone knows of a library that handles those cases correctly, without requiring user-defined priors for each dimension, I'd be glad to hear about it.
This will cause the algorithm to "chase random noise", as morelandjs wrote below.
Which selects new points based on both the model's mean estimate and its variance. See around line 58.
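Generically the idea looks something like this (an upper-confidence-bound style rule, not the library's exact code):

    import numpy as np

    def ucb(mean, std, kappa=2.0):
        # trade off exploitation (high mean) and exploration (high std)
        return mean + kappa * std

    mu = np.array([0.2, 0.5, 0.4])        # model's mean estimate per candidate
    sigma = np.array([0.35, 0.05, 0.20])  # model's uncertainty per candidate
    next_point = np.argmax(ucb(mu, sigma))  # picks index 0: lowest mean, but most uncertain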
I'm afraid it never samples points more than once, since it treats already-sampled points as points with zero variance, and hence no expected improvement.
IMHO that's wrong. The variance of a single sample should be infinite (classical statistics), or similar to the variance of nearby points (Bayesian + model), or some pre-defined prior (not a great idea... I'd prefer some automatic method). But not zero.
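One concrete way to get the "similar to nearby points, plus noise" behavior with sklearn's GP (incidentally the same WhiteKernel whose bound produced the warning quoted upthread):

    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    # with an explicit noise term in the kernel, the predictive variance at an
    # already-sampled point no longer collapses to zero
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)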
Anyway, I guess I stand by my original suggestion that BO is the best tool for gradient-free optimization with slow and noisy fevals, but to my knowledge nobody has built a way to dial in the hyperparameters automatically. And there are quite a few. Entire companies exist for this purpose; SigOpt comes to mind.
If you like, you could provide other use cases and applications for optimization algorithms.
When there are large correlation lengths in some regions of the dataset and small correlation lengths elsewhere, it often overshoots (or undershoots) the curvature of the hypersurface.