
A Global Optimization Algorithm Worth Using - blt
http://blog.dlib.net/2017/12/a-global-optimization-algorithm-worth.html
======
wenc
There's a whole class of deterministic Global Optimization solvers on COIN-OR
[0] that don't seem to get noticed outside the Operations Research/Applied
math community. (many are free and open source, e.g. Couenne)

These solvers (BARON [1], ANTIGONE [2], Couenne [3] etc.) are capable of
solving nonconvex global optimization problems that are structurally more
complicated (i.e. non-ML applications) than the ones presented in the article
linked above.

They employ a bunch of mathematically-rigorous methods such as interval
analysis, branch-and-cut and McCormick relaxations. There's a treasure trove
of little ideas from decades of global optimization research that I feel the
Deep Learning community could benefit from taking a peek at.

[0] [https://neos-server.org/neos/solvers/index.html](https://neos-server.org/neos/solvers/index.html)

[1]
[http://archimedes.cheme.cmu.edu/?q=baron](http://archimedes.cheme.cmu.edu/?q=baron)

[2]
[http://helios.princeton.edu/ANTIGONE/index.html](http://helios.princeton.edu/ANTIGONE/index.html)

[3] [https://www.coin-or.org/Couenne/](https://www.coin-or.org/Couenne/)

~~~
xoroshiro
Glad to see someone posted about COIN-OR!

I've been looking at open source solvers for a while now to solve an MIP, and
this one seems to be the best.

There is this annoying thing about the whole ecosystem though, aside from the
proprietary bits.

I started by looking at GLPK and wrote my MIP in GMPL. For some reason, a lot
of the tools I looked at don't "just work" with it; I have to export the model
to some other format first. That works, but it feels like a workaround.

I wish there were a lingua franca in this domain, one supported by all the
major solver implementations, but each one seems to have its own way of doing
things, or just provides an API ( _ahem_ google/or-tools).

I am happy I can make it work, but it may not be as easy for others.

~~~
tavert
> I wish there were a lingua franca in this domain, one supported by all major
> solver implementations

JuMP is kind of that in terms of modeling software -
[http://www.juliaopt.org/](http://www.juliaopt.org/)

Pyomo is also pretty useful if you'd rather use Python than Julia, though it's
not as well-documented and doesn't make installation of the open-source
solvers quite as easy for you.

~~~
xoroshiro
Thanks, but that's not really what I was looking for. I was looking for
something that feels closer to a DSL than an API. In OR, a language like AMPL
(IIRC, co-designed by Brian Kernighan) or GAMS is typically used, where you
define sets, variables, etc. It's something I feel I can more easily share
with co-workers from different backgrounds, since it's closer to the math and
follows the whole objective function, constraints, etc. organization of OR
problems.

If you want to get a feel for it, here's an example in GMPL, which is a subset
of AMPL used in GLPK:

[https://en.wikibooks.org/wiki/GLPK/GMPL_%28MathProg%29](https://en.wikibooks.org/wiki/GLPK/GMPL_%28MathProg%29)

~~~
tavert
JuMP and Pyomo are both exactly that. Pyomo is about as near as you can get to
an AMPL clone with Python syntax. JuMP is an optimization DSL built on Julia
macros.

~~~
xoroshiro
Maybe I was confusing my terms or something. I think of these as APIs as well,
albeit packaged into nice libraries that automate some of the work. So I guess
I'll try to describe my ideal scenario instead.

What I was thinking of was a common language that can be used by all solvers.
Similar to how C can be compiled by gcc or clang, this language can be used
directly with the solver binary. Something like `export CC=my_solver` and
`$CC -d data.dat -m model.mod`, if you will.

What a lot of these projects do feels like calling a solver by translating the
model into some format that's different for everyone, whereas I wanted the
solvers to all agree on one language.

Or at the very least, some standard similar to how languages have libraries
for something like XML. It would be nice if JuMP and Pyomo could read these
standardized formats.

Then again, it's probably a pain to implement and even C compilers implement
different extensions. But one can dream.

~~~
wenc
> whereas I wanted for the solvers to all agree on some language.

Umm, they (mostly) do. Pyomo supports the AMPL .nl format, which means you can
call any AMPL solver from Pyomo (most COIN-OR solvers have AMPL interfaces).
The AMPL .nl format is the de facto standard for open-source optimization
solvers because it's the only open standard, I believe [0].

So you would define your mathematical program in Pyomo (which isn't a terribly
challenging syntax compared to AMPL). When you invoke solve, Pyomo will spit
out an .nl file and call a solver (which reads the .nl file, solves the
problem, and outputs a .sol file which can be read back into Pyomo).

Incidentally this is exactly how it works in AMPL as well -- all the interop
is done through .nl and .sol files. I recommend reading [0] to understand how
this works.

[0] [https://ampl.com/resources/hooking-your-solver-to-ampl/](https://ampl.com/resources/hooking-your-solver-to-ampl/)

~~~
xoroshiro
Oh, wow. I couldn't really figure out which parts of AMPL were proprietary and
I guess I immediately dismissed it after that without understanding how it
worked.

As for Pyomo, while at first glance I dislike the syntax (it's not as easy to
share with others, at least compared to something like GNU MathProg, where the
recipient doesn't need to concern themselves with Python or Julia), I like
Python enough to maybe give it a try. I'll have to learn more about it though.
Thanks to you and the others!

------
beagle3
I do not have anything to contribute regarding this specific optimization
algorithm, but I want to recommend the dlib [0] library it comes from: a very
high-quality library with machine learning algorithms, numerical analysis,
image processing, networking, threading, GUI components, and a few other
things.

The (sole, afaik) library author and maintainer Davis King is also extremely
responsive and helpful. dlib and Davis deserve a lot more recognition.

[0] [http://dlib.net](http://dlib.net)

~~~
cjalmeida
Dlib saved me a few times and I have a lot to thank Davis for.

However, I just wish its Python bindings were published as binary wheels. Dlib
is very large and the compile times when installing from pip are painful.

In fact, I might as well contact him and help set this up!

------
petters
Of course, global optimization using the Lipschitz constant has been around
for decades. As far as I understand, this paper also estimates the constant.
Is that the novelty or is it the convergence proofs?

Another method for global optimization is interval arithmetic, but it is not a
black-box method.

~~~
kxyvr
I'm also curious what this adds that's new. I'll add one other bit: if anyone
ever claims they have an efficient global optimization algorithm for
continuous optimization, ask whether or not P=NP. We can encode the integer
constraint that x \in {0,1} as x(x-1)=0, so if we can find the global optimum
of a continuous problem efficiently, then we can do the same for a discrete
one. That's probably unlikely, so I'm always a bit skeptical.
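As a toy numerical illustration of that encoding (the objective and penalty weights here are made up), a quadratic penalty on x*(x-1) pushes a continuous search toward the discrete points 0 and 1:

```python
# Made-up continuous objective favoring x = 0.7 on [0, 1].
def f(x):
    return (x - 0.7) ** 2

# Soft version of the binary constraint x*(x-1) = 0, weight rho.
def penalized(x, rho):
    return f(x) + rho * (x * (x - 1)) ** 2

grid = [i / 10000 for i in range(10001)]   # crude global search on [0, 1]
for rho in (0.1, 1000.0):
    xstar = min(grid, key=lambda x: penalized(x, rho))
    print(f"rho={rho}: minimizer ~ {xstar:.3f}")
# As rho grows, the minimizer is driven to the nearest root of x*(x-1),
# so the continuous problem behaves like a discrete choice between 0 and 1.
```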

By the way, the parent mentioned interval arithmetic. If anyone is looking for
an optimization algorithm built on it, look up "interval Newton". It's a cool
algorithm.

~~~
wenc
> If anyone ever claims they have an efficient global optimization algorithm
> for continuous optimization, ask whether or not P=NP

Why? Continuous global optimization on convex problems can be solved in
polynomial time. Did you mean on nonconvex problems?

Also, x*(x-1)=0 is a complementarity constraint, and it is mathematically
degenerate (it violates constraint qualifications) when solved as a continuous
optimization problem. There are many ways to work around it [0, 1, 2], and
there are even solvers built for this class of problems (MPECs = Mathematical
Programs with Equilibrium Constraints), which I suspect only work well for
narrow classes of problems like Linear Complementarity Problems (LCPs).

I have found that, ultimately, for nonlinear nonconvex problems, continuous
optimization of complementarity-constrained problems has been empirically
unreliable, or the solution is often low quality (even if it can be shown that
the complementarity penalty method does not violate constraint
qualifications).

Bilinear constraints generally introduce nonconvexities and a lot of difficult
terrain for a continuous optimizer.

For general nonconvex problems, it's much more reliable to cast the problem as
an NP-hard MINLP.

And if you're solving an LCP, you might as well recast it as an MIP rather
than mucking around with an LCP. MIP solvers like Gurobi and CPLEX are so
amazingly performant these days that it's just a no-brainer.

[0]
[http://www.tandfonline.com/doi/abs/10.1080/10556780410001654...](http://www.tandfonline.com/doi/abs/10.1080/10556780410001654241)

[1]
[http://epubs.siam.org/doi/abs/10.1137/S1052623403429081](http://epubs.siam.org/doi/abs/10.1137/S1052623403429081)

[2]
[http://epubs.siam.org/doi/abs/10.1137/040621065](http://epubs.siam.org/doi/abs/10.1137/040621065)

[3]
[https://www.artelys.com/tools/knitro_doc/2_userGuide/complem...](https://www.artelys.com/tools/knitro_doc/2_userGuide/complementarity.html)

~~~
kxyvr
Oh I very much agree that good MIP solvers work well for complementarity
problems. And, I suppose my point about P=NP was that if we had a general
scalable global optimization solver, then it would have all sorts of
implications outside of continuous optimization. I didn't qualify convex vs
nonconvex and, of course, many convex problems can be solved in polynomial
time. Though, funny enough, not all. De Klerk and Pasechnik have a paper,
"Approximation of the stability number of a graph via copositive
programming," where they prove that NP-hard problems can be encoded as convex
problems. Pasechnik has a nice explanation here:

[https://mathoverflow.net/questions/92939/is-that-true-all-th...](https://mathoverflow.net/questions/92939/is-that-true-all-the-convex-optimization-problems-can-be-solved-in-polynomial-ti)

~~~
wenc
You are correct. SDPs are a clear counterexample.

------
alexbeloi
Anyone considering using this for continuous hyper-parameter optimization in
machine learning should be warned that the automatic Lipschitz constant
estimation (the novelty of the method) will __very likely not work well with
a noisy objective function__ (any stochastic learning algorithm).

edit: Any grad students looking for a paper idea might consider extending this
result to allow stochastic objective functions satisfying a stochastic
Lipschitz condition.

~~~
eesmith
Quoting from the linked-to page:

> However, if you want to use LIPO in practice there are some issues that need
> to be addressed. The rest of this blog post discusses these issues and how
> the dlib implementation addresses them. First, if f(x) is noisy or
> discontinuous even a little it's not going to work reliably since k will be
> infinity.

> ... Now each sample from f(x) has its own noise term, σi, which should be 0
> most of the time unless xi is really close to a discontinuity or there is
> some stochasticity.

Is that not what you were pointing out?

------
mjn
The strategy that the dlib implementation uses of combining a global
optimization method (LIPO) with one that works better locally (TR) reminds me
of another recent paper I ran across, which takes a black-box global
optimization algorithm (StoSOO) and improves local convergence behavior using
CMA-ES [1]. A main difference seems to be that StoSOO focuses on robustness to
noisy functions more than LIPO does.

[1]
[https://link.springer.com/article/10.1007/s10898-016-0482-9](https://link.springer.com/article/10.1007/s10898-016-0482-9)
/ [http://ssamot.me/papers/global-optimisation.pdf](http://ssamot.me/papers/global-optimisation.pdf)

------
aldanor
Very cool!

I'd like to share another global optimization algorithm 'worth using' that
I've recently discovered for myself: Hyperband [0]. It's extremely simple to
implement (i.e. a dozen lines of code), is based on successive halving
principle and is particularly suitable for iterative machine learning
algorithms where early stopping is applicable (i.e. NN, GBM).
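As a hedged sketch of the idea (the parameter names and toy objective below are mine, not from the paper): Hyperband runs several brackets of successive halving, starting many configurations on a small budget and promoting only the best-scoring fraction to larger budgets.

```python
import math
import random

def hyperband(sample_config, evaluate, max_resource=81, eta=3):
    """Sketch of Hyperband. sample_config() draws a random hyperparameter
    setting; evaluate(cfg, r) returns the loss after training with budget r."""
    s_max = int(math.log(max_resource) / math.log(eta) + 1e-9)
    budget = (s_max + 1) * max_resource
    best_loss, best_cfg = float("inf"), None
    for s in range(s_max, -1, -1):                       # one bracket per s
        n = int(math.ceil(budget / max_resource * eta ** s / (s + 1)))
        r = max_resource * eta ** (-s)
        configs = [sample_config() for _ in range(n)]
        for i in range(s + 1):                           # successive halving
            r_i = r * eta ** i
            scored = sorted((evaluate(c, r_i), c) for c in configs)
            if scored[0][0] < best_loss:
                best_loss, best_cfg = scored[0]
            # keep only the top 1/eta fraction for the next, larger budget
            configs = [c for _, c in scored[: max(1, len(scored) // eta)]]
    return best_cfg, best_loss

# Toy usage: tune one "learning rate"; the loss improves with budget r.
random.seed(0)
cfg, loss = hyperband(
    sample_config=lambda: random.uniform(0.0, 1.0),
    evaluate=lambda lr, r: (lr - 0.3) ** 2 + 1.0 / r,
)
```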

[0]
[https://people.eecs.berkeley.edu/~kjamieson/hyperband.html](https://people.eecs.berkeley.edu/~kjamieson/hyperband.html)

------
WeaselNo7
I don't speak fluent maths, but I'm extremely interested. Is anyone in a
position to explain what the initial equation for U(x) means? I'll be in your
debt forever!

~~~
contravariant
A Lipschitz function is one that doesn't change too quickly: the difference
between two values |f(x) - f(y)| is at most the distance between x and y times
a constant k, i.e. |f(x) - f(y)| <= k|x-y|. In particular this implies that:

f(x) - f(y) <= k|x-y|

hence

f(x) <= f(y) + k|x-y|.

Therefore the function:

u(x) = f(y) + k|x-y|

is an upper bound for the function f(x). Repeating this for multiple points
x1, x2, x3,... you can construct a stricter upper bound by taking the minimum:

U(x) = min_i f(xi) + k|x - xi|
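In code, the bound above can be sketched like this (pure Python, with sin as a stand-in objective whose Lipschitz constant is k = 1):

```python
import math

# Given samples (x_i, f(x_i)) and a Lipschitz constant k, the bound is
#   U(x) = min_i [ f(x_i) + k * |x - x_i| ]
def lipschitz_upper_bound(samples, k, x):
    return min(fx + k * abs(x - xi) for xi, fx in samples)

f = math.sin                    # |cos| <= 1, so k = 1 is a valid constant
samples = [(xi, f(xi)) for xi in (0.0, 1.5, 3.0)]

# With a valid k, U(x) >= f(x) everywhere between and beyond the samples:
for x in (0.3, 0.9, 2.2, 2.9):
    assert lipschitz_upper_bound(samples, 1.0, x) >= f(x)
```

Each extra sample tightens the minimum, which is why the bound gets stricter as points accumulate.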

~~~
WeaselNo7
Wonderful answer, very intuitive, thank you so much!

------
zawerf
Can someone explain the intuition behind how the lipschitz constant k is
chosen?

It seems like even in his demo video, k starts off underestimated, which makes
the upper bound not actually an upper bound (f is sometimes above the green
line U), and it is only slowly corrected towards the end. Why does it still
work?

------
AllegedAlec
Would it be possible to make the hyperparameters optimizable? Maybe it's the
biologist in me, but I can't really see a reason why you couldn't just let the
algorithm decide for itself which values of those parameters would work best.

~~~
archgoon
I don't understand; the entire article is entirely about methods of doing
that.

~~~
AllegedAlec
Maybe I wasn't clear, or misunderstood some of the article, but I thought the
article was about optimising the hyperparameters before the machine learning
part happens. I meant optimising the parameters during the machine learning
iterations.

------
splike
Am I wrong in thinking that this algorithm could only be applied to continuous
variables/hyperparameters?

~~~
alexbeloi
You're not wrong; the objective is assumed to be Lipschitz, which implies continuity.

~~~
jefft255
It's actually a stronger assumption than simply being continuous.

------
amelius
What is the worst-case time complexity in terms of N local minima, and D
variables?

~~~
robotresearcher
Not defined. It's continuous optimization.

------
kevinwang
Is the algorithm used here the Shubert-Piyavskii Method?

