Hacker News new | past | comments | ask | show | jobs | submit login
Gradient Free Optimizers: A collection of modern optimization methods in Python (github.com/simonblanke)
239 points by EntICOnc 51 days ago | hide | past | favorite | 75 comments

Be still my heart... could this be... the missing gradient-free optim library for Python? Particle swarm, evolutionary, AND Bayesian all in a single package? A no-nonsense, simple API with strong enough defaults that you can basically just call .search() with a function (I’m looking at you, nevergrad)??

I’ll have to test it of course but this looks truly life changing. A few clarifications, if the author happens to be watching this thread:

- looks like the search space is “enumerated”, I.e. I define the grid spacing for each independent variable in search_space and it puts together an outer product? In other words, this will never be truly continuous?

- by the same token, constraints are applied by virtue of specifying the endpoints of each grid in the search_space? Is it possible for me to apply more complex constraints by generating my own enumerated list of search_space points and feeding that directly?

Seriously well done. Like I said above, this looks to me like an immediate staple library.

I am glad you like my project. The search space is not continuous. But you can make the steps as small as you want. Like: np.arange(0, 10, 0.00001)

I am not sure i understand your second question. The entire search space is always a N-dimensional cube. So you just define a start and end point and the step size in each dimension.

They probably mean that you can search in n-dimensional cubes, but what they'd like is having a more general shapes of the grid, e.g., have restriction of the sort: a+b<1.

Okay that is interesting. You could realize that by making a restricted area in the search space by returning np.nan in the objective function for those cases. Gradient-Free-Optimizers can handle np.nan and np.inf just fine.

Maybe you could do something like:

If a+b>=1: return np.nan else: return score

This is at a high level how one of our Google internal black box optimizers behaves. The term used is "infeasible region" but it's the same idea as using a nan.

I would caution against using nan to always mean infeasible. Instead users should catch experiments outside the feasibility region and return a special infeasible value. This will increase visibility into the behavior of the optimizer, because it leaves nan to be used for values inside the region of constraint that are still problematic (due to bugs, numerical instability, etc)

If the volume of feasible space is small, the optimizer may have considerable trouble finding the feasible region(s) in the domain. The simple solution is to use a penalty function instead, i.e. add a term like M*max(0, 1-(a+b)) to the objective, with sufficiently large M. If the optimum is at the boundary of the feasible region, you might still get a solution that is slightly infeasible, which can be worked around with more special case logic...

Is there a subset of these constrained optimisation techniques that work when the constraint bounds are unknown before evaluation time? This applies to certain physical engineering problems where, due to the complexity of the environmental system you're modelling, you don't necessarily know what combination of input parameters lead to a non-viable solution. That is, a given solution might be numerically valid but the objective function could determine it to be non-viable due to not satisfying engineering constraints (e.g. it vibrates too fast).

Just adding a penalty to the objective works. Another alternative is to consider a two-dimensional lexicographic objective (total constraint violation, actual objective), so that reducing constraint violations is always preferable, and the actual objective is only compared between solutions that are equally good in terms of constraint violations.

Annnnnnnd you've added gradients to the gradient free optimizer. Every shovel is a hammer if try hard enough.

This would be like the barrier function [1] for gradient based methods; which produces some challenges for those techniques. Do you know how this affects gradient free techniques?

[1] https://en.m.wikipedia.org/wiki/Barrier_function

Another option in some cases is to just map the space to something else. Like in this case, x and y are unconstrained and are what the optimizer sees, a = x and b = (1 - a) - abs(y).

Not always easy to do, but can work in some cases if the optimizer cant deal with it natively.

Nope, that answers the question. I was asking about cases where the space is, for example, a triangle or something non-cuboidal. Think constraints like x1 + x2 < 1 or something like that.

What I’m wondering is if your library converts the search_space to an enumerated list of candidate points somewhere in the backend, and if there’s a way for me to just construct that list and pass it directly for more complex cases.

There's also Opytimizer [0] for almost every metaheuristic optimization algorithm under the Sun.

[0] https://github.com/gugarosa/opytimizer

I love gradient-free optimisers - there are so many interesting problems where there aren't analytical gradients available.

An I right in thinking this library doesn't contain a simplex / Nelder-Mead optimiser? To me that's the bread and butter of gradient-free optimisation.

Also no Powell algorithms like BOBYQA?

nlopt has these and many more, and a Python wrapper:


So, are those omitted because this library is modern and those are now considered out of date?

I am currently working on the Nelder-Mead optimiser. I will also look into "BOBYQA". I am always searching for interesting algorithms that my users need. If you have more suggestions you can open an issue in the repository.

I've used this, and it works nicely: https://github.com/numericalalgorithmsgroup/pybobyqa. I'd be happy if it were added to your project, then I could just use yours and have access to a bunch of alternatives with the same API.

That's good to hear. I don't have any other specific requests - for me it seems impossible to know what will work on my problems, so i just try things at random to see what works!

I don't know the architecture of your project, but is there any way it could include NLopt, and so make all of its algorithms available? Some of them could be useful, and i doubt it's a good use of your time to reimplement them all from scratch.

Unfortunately NLopt cannot be included into Gradient-Free-Optimizers. But i would like to implement multiple algorithms that are also preset in NLopt.

I think it will help my understanding of optimization algorithms if i implement them myself. I currently have my eyes on the Downhill-simplex, Direct and Powell's Method. But they are quite different from what i have already implemented, so this could take some time.

Came here to ask about the Powell method. I find Powell is typically what engineers come up with if asked to optimize something "by hand" without a mathematical model of the system they are optimizing.


Also: Gradient-Free-Optimizers is basically just the optimization-backend for a much larger project of mine: https://github.com/SimonBlanke/Hyperactive

For those who want to learn more, I really like the book "Algorithms for Optimization": https://algorithmsbook.com/optimization/#outline

It's a great intro because it covers a large breadth and gives you a sense of what class of techniques are useful for what (rather than just going super deep into one particular technique).

They also have a github in julia notebooks implementing most of the ideas: https://github.com/sisl/algforopt-notebooks

Looks very nice!

Two questions off top of my head:

- Might be worth mentioning how Hyperactive compares to tons of other similar packages, like Hyperopt, Optunity and a ton of others? Also how your GFO package is different from optima, opytimizer, and many others as well. Having some benchmarks or performance stats (or even just listing functionality differences) would be very helpful to anyone who's been already using those.

- One of the best universal 'works-out-of-the-box-most-of-time' global optimizers I've used recently is DLib's: http://blog.dlib.net/2017/12/a-global-optimization-algorithm... - any chances for implementing something similar?

Two very interesting questions! I should work soon on a comparison of Hyperactive and GFO to other packages. If some important features are missing, maybe i could add them.

I will also look into Dlib. If you like you can open an official issue in my repository. We could discuss why it is useful and if it should be implemented.

Given the rise of multi-core machines, parallel optimization seems pretty important. Which of these support parallel execution ?

For this reason, I chose Mango for a recent project: https://github.com/ARM-software/mango (bayesian optimizer with built-in support for both parallel optimization (via multiprocessing) and distributed (via celery))

Gradient-Free-Optimizers is a lightweight optimization package that serves as a backend for Hyperactive: https://github.com/SimonBlanke/Hyperactive

Hyperactive can do parallel computing with multiprocessing or joblib, or a custom wrapper-function. Parallel computing can be done with all its optimization algorithms. You could even pass a gaussian process regressor like GpFlow to the Bayesian Optimizer (GFO and Hyperactive) to get GPU acceleration.

Consider adding some affordance for parallel computing to Gradient-Free-Optimizers by allowing the user to provide a vectorized objective function instead of one that evaluates only a single point in the search space per function call. That leaves all the hard work of parallelization as an exercise for the user, and gives the user the flexibility to parallelize their objective function with whatever mechanism they wish.

I have previously used this approach in a project where the objective function contained a half-hour long simulation, which was the bottleneck that made estimating a gradient intractable. When the optimization algorithm gave a batch of several points in the search space to evaluate, our objective function could prepare and run several instances of the simulation in parallel, and return when the whole batch was complete. From this, it was easy for us to also distribute simulation runs across several machines, without needing any changes to the optimizers. We would not have been able to easily achieve this with an optimization framework that tried to directly manage parallelization, because the steps necessary to prepare the input files for the simulation software had to be done serially.

For that project we tried: DIRECT, several variants of Nelder-Mead, and an evolutionary strategy. In hindsight, the Nelder-Mead variants worked best; once we accumulated enough simulation results it became clear that our objective function was convex and pretty well-behaved in the region of interest. Nelder-Mead was also trivial to extend to trying several extra points per batch to ensure that each of our several workstations had something to work on. (We didn't have access to a large cluster, and Nelder-Mead wouldn't generalize well to a large degree of parallelization in that manner.)

Your parallel computing approach sounds intriguing! Could you provide an example script? I would like to look into this. If you like you could open an issue as a feature request and provide a code snipped there.

I don't have any code handy to share. The project in question was a decade ago and we started with optimization code written in MATLAB. The objective function then turned into a thin wrapper around a python2 script, because that could actually spawn multiple processes unlike the version of MATLAB we were working with. When we started distributing jobs to several machines, we used execnet: https://codespeak.net/execnet/ with hard-coded IP addresses and CPU core counts for each machine. So nothing pretty or particularly useful to share if I could dig it up from my archives.

But as far as illustrating how the optimization framework would need to work to support a vectorized objective function, you can take any existing sample objective function that's written to take N scalar arguments and update it to take N vector arguments, where the length of the vectors is the number of points in the batch to be evaluated. For simple numpy functions, there might not even need to be any changes to the code of the objective function.

For linear programming solvers, there is an annual competition to see which is best:


Is there anything like this for gradient-free optimisers?

I thought about a table in the readme that shows some kind of metric for each optimizer that describes its performance. I will look into that.

I'd recommend also implementing the DIRECT algorithm, which balances local and global search very well (but consequently will not refine as aggressively near the current [possibly only local] optimum as other algorithms):


You probably also want to include some advantages/disadvantages of each algorithm. How robust against local minima is it? Up to how many dimensions does it work well? How is the convergence speed when started far from the optimum? Does it work well with few function evaluations? Etc.

I will look into this algorithm. Thanks for the suggestion. I have some basic explanations of the optimization techniques and their parameters in a separate repository: https://github.com/SimonBlanke/optimization-tutorial

But there is still a lot of work to be done.

I just saw, that my project got a lot of new stars on github. A surprise to be sure, but a welcome one.

Cool! I'd considering adding CMA-ES [1], which as far as I know is the gold standard for derivative free optimization.

1. https://en.m.wikipedia.org/wiki/CMA-ES

Ah nevermind I see this was addressed in another comment.

Getting a lot of these errors when using the GPR. Seems to be an issue with the estimation of the noise level term. Scratching my head on this one. Anyone familiar with what's going on? I tried adding some random noise to the optimization function, and it had the same issue.

...python3.8/site-packages/sklearn/gaussian_process/kernels.py:402: ConvergenceWarning: The optimal value found for dimension 0 of parameter k2__noise_level is close to the specified lower bound 1e-05. Decreasing the bound and calling fit again may find a better value.

Yeah this is a warning from sklearns gaussian process regressor. Sklearn probably wants you to increase the "n_restarts_optimizer"-parameter of the gpr, but from my experience this warning does not correlate with bad results from Gradient-Free-Optimizers.

Sklearn warnings are often difficult to silence. Just google "sklearn suppress warnings" to get a code snippet for this problem.


This looks super interesting, I have previously considered using the Bayesian Optimization[0] package for some work, but the ability to switch out the underlying algorithms is appealing.

Perhaps a bit of a far out question - I would be interested in using this for optimizing real-world (ie slow, expensive, noisy) processes. A caveat with this is that the work is done in batches (eg N experiments at a time). Is there a mechanism by which I could feed in my results from previous rounds and have the algorithm suggest the next N configurations that are sufficiently uncorrelated to explore promising space without bunching on top of each-other? My immediate read is that I could use the package to pick the next optimal point, but would then have to lean on a random search for the remainder of the batch?

0: https://github.com/fmfn/BayesianOptimization

Yes it is quite easy to switch algorithms via the "gpr" parameter. You just have to write a wrapper class. I am currently working on a repository that discusses how to do that in detail: https://github.com/SimonBlanke/surrogate-models I will upload some examples (GPy and Gpflow) within the next 24 hours.

I think those wrapper-classes will already answer some of your questions. If you have additional or unanswered questions you can open an issue in the Grad-Free-Opt repository.

If you have an expensive but not high dimensional problem you might want to try https://github.com/DLR-SC/sqpdfo . It is based on Krigging, SQP and also handles constraints.

I am surprised CMA-ES is not included. It is the state of the art gradient free optimization algorithm last I checked.


Evolution strategy is included. The "Covariance matrix adaptation" is for making this algorithm work for continuous search spaces. But gradient-free-optimizers has discrete search space.

I hope to see gradient-free/black box optimization algorithms gain more popularity in machine learning. I'm always surprised when I see a review of ML optimization methods, and it's not even mentioned that you can optimize an ML model without using the gradient.

Also, love or hate Python, there's almost always a library for whatever it is you want to do.

> and it's not even mentioned that you can optimize an ML model without using the gradient.

Can you, though? How does the outcomes compare to gradient based methods?

In general GFO is not a good alternative when there are lots of model parameters or when it’s cheap to get the gradient. GFO could be used to train models, but kinda in the same way that you could find your way to Cleveland by just wandering around long enough. It sure would help a lot if you had a compass.

Yeah, that's what I thought. For deep learning models, when your parameter space has many thousands dimensions, it's really cheap to use gradient-based methods, when you can use stochastic gradient descent and backpropagation to quickly calculate good estimates of gradient. I simply don't see why eschewing this structure can lead to anything competitive.

SPSA seems to be missing. OpenAI's "evolutionary strategies" paper used a form of SPSA. https://en.wikipedia.org/wiki/Simultaneous_perturbation_stoc...

A book on Practical Evolutionary Algorithms with Python for anyone interested https://datacrayon.com/shop/product/practical-evolutionary-a...

Is the Nelder-Mead a.k.a. the other simplex method any good? I remember most of these other methods from my optimisation lessons, but I'm surprised to not see Nelder-Mead in this list.

I am currently working on the Nelder-Mead algorithm. I did not realize that it is that popular. This gives me motivation to implement it soon ;-)

If there are more "must have"-algorithms you could open an issue in Gradient-Free-Optimizers.

I like it because it was so easy to implement and it worked well, unless your objective function had very narrow and steep valleys.

I don't know if it's popular, heheh, it's been a while since I've been in maths school, so that's why I'm asking.

Can you use the methods in this library to do constrained optimization? Whenever I look for such gradient free optimizer it's always comes with some constraint.

One simple trick is doing a bijective transformation between your constrained input space and the unconstrained optimizer space. For example, if you have x>0 in your input space, you take the input from your optimizer and apply e^x to it. Or if you have a<x<b you can do a+b*logit(x). What exactly you choose will depend on your prior on how your function behaves.

This looks great! You have a lot of algorithms here. Which ones do you find to be the most promising for general use and why?

I think Random Restart Hill Climbing is good if your objective function evaluates fast (simple functions). It does not get stuck in local optima and also does local search very well.

Bayesian Optimization is very good if your objective function takes some time (> 1 second) to evaluate. If you look at the gifs in the readme of Gradient-Free-Optimizers you will see how fast it finds promising solutions. It is almost scary how good it is (even compared to other smbo).

Which algorithm would you recommend when the objective function is noisy (and nondeterministic)?

For example the objective function is the "score" of a particular stochastic simulation, which can be started with varied initial random seed, or the result of a real physical experiment, which is naturally stochastic (and expensive to evaluate).

There is a tradeoff between getting a very accurate estimation of the objective function + variance of a single point vs exploring other points. Is there a search algorithm that somehow manages this tradeoff automatically?

Note: In the past I've used Tree of Parzen Estimators (Kernel density estimators), wasting 3-4 evaluations per point, but I have a feeling it is sub-optimal. Is there an "optimal" algorithm, like the optimal algorithm for the multi-armed bandit problem[1] (which is similar)

[1] https://en.wikipedia.org/wiki/Multi-armed_bandit

Bayesian Optim is designed for that case specifically. It fits a surrogate model with uncertainty estimates and picks the next point with an understanding of that uncertainty. Look up the MEI acquisition function for more info.

Edit: BO does usually require some tuning for your use case. Its acquisition function sometimes samples replicates where there’s high noise, especially if the first sample looks particularly “good”. There’s usually a hyper parameter that can be set to favor exploration vs exploitation, I.e. to favor non-replicate samples. But I am not aware of an algo that can learn your preference along that axis.

> especially if the first sample looks particularly “good”.

You've precisely described the problem: the algorithm will get stuck on a point if the first sample looks good and the assumption of zero variance. Until it randomly hits a luckier sampler (but not necessarily better point).

Another related problem, is that the boundaries of the parameter space have a bad score (objective function), but very low variance (they're always bad), which confuses the search function into believing that the interior points also have a very low variance, which is incorrect.

If anyone knows of a library that handles those cases correctly, without providing user-defined priors for each dimensions, I'd be glad to hear

Correct me if I'm wrong, but it seems the bayesian_optimization.py optimizer in this library assumes that the sampled points are exact, ie their variance is zero. It doesn't seem to re-sample existing points.

This will cause the algorithm to "chase random noise", as morelandjs wrote below

I haven't looked at his source, but usually there's a white noise term in the covariance structure of the Gaussian process regressor that adds some statistical uncertainty at each evaluation point. So even when it evaluates a point of parameter space the GPR is still somewhat uncertain about the value of the optimization function at that point. So it should balance exploration versus exploitation taking that statistical uncertainty into account.

I would be very disappointed if that were the case.. no, it looks like it’s set up to capture variance. The BO algo wraps an “Expected Improvement Optimizer”:


Which selects new points based on both the model’s mean estimate and its variance. See around line 58

line 62: exp_imp[sigma == 0.0] = 0.0

I'm afraid it never samples points more than once, since it estimated already-sampled-points as points with variance zero, and no expected improvement.

IMHO that's wrong. Variance of a single sample should be infinite (classical statistics), or similar to the variance of nearby points (bayesian+model), or some pre-defined prior (not a great idea... I'd prefer some automatic method). But not zero.

Ah, good catch. So in the event the gpr predicts zero variance, the optimizer says EI is zero and thus won’t sample again. That may depend on the settings of the gpr.. if I’m not mistaken there are ways for gpr to model noise and not collapse to zero variance on every sampled point.

Anyway, I guess I stand by my original suggestion that BO is the best tool for gradient free optim with slow and noisy fevals, but to my knowledge nobody has built a way to dial in the hyper parameters automatically. And there are quite a few. Entire companies exist for this purpose, SigOpt comes to mind..

I'm wondering the same. I'd be concerned that Random Restart Hill climbing would essentially chase random noise.

You could be right. I must confess, that i have a (probably) very narrow understanding of typical optimization problems. Most of the objective functions i optimize have machine learning algorithms in it (to optimizer hyperparameters). Depending on the evaluation those can have low noise.

If you like you could provide other use cases and applications for optimization algorithms.

Thanks, that's useful. My impression of Bayesian parameter optimization using Gaussian processes is that it is quite good when the data has a more or less constant correlation length across the evaluation points as in your example.

When there are large correlation lengths in some regions of the dataset and small correlation lengths elsewhere, it often over (or under) shoots the curvature of the hypersurface.

Not OP, but you can't go wrong with randomized/stochastic hill-climbing.It's actually very competitive with complex, fancy genetic algorithms:


nice repo, i saw some comments expressing excitement and thought of the first time i saw pymoo, tpot, and finally automl. there are a lot of stuff laying around github but we just need to search better, like using topics (tags) and the gazer repo which kinda page-ranks the github repos.

I had the same excitement for tpot! That project was the reason i started creating mine.

Amazing collection! I'm learning about backprop in neural networks and I'm reading how that kind of optimization may not be biologically 'implementable'. Are non-gradient based optimizers an alternative proposal for biological learning? How do they compare to backprop?

Does this include Hill Climbing ?

Yes Hill Climbing + some of its variants are featured in this package.

What about genetic optimization?

Gradient-Free-Optimizers has a type of genetic algorithm: Evolution Strategy

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact