
The Simulated Annealing Algorithm - luu
http://katrinaeg.com/simulated-annealing.html
======
drostie
You can get a lot of the way (just not to `exp((E_old - E_new) / Temp)`) by
explaining the origin of the algorithm:

People who work metals have a process for making metals more supple, called
"annealing". In annealing, you heat the metal up almost to the point where it
loses its shape, then let it cool slowly. In the atomic picture, this chunk of
metal has knots of stress and tiny mis-aligned crystals, but if we let the
atoms jiggle, the knots can just work themselves out and the crystals can
grow, because atoms want to align with their neighbors. It works the other way
too though: If atoms jitter randomly then at any time they can exhibit a
chaotic state, so if you suddenly stop that jitter ("lowering the temperature
too fast"), then you just get a random state. So there is a sweet spot. You
want just enough jitter to undo knots, but not so much jitter that you create
new ones. By slowly reducing the temperature, you get to pass some special
points where the neighbors have about the same impact on the crystal as the
jitter, and as you slowly pass that, the jitter starts undoing the knots for
you and then growing the crystals for you. You still often don't get
perfection, but you can get closer.

What does this have to do with optimization? Physicists like to think of this
"aligning with neighbors" process as minimizing some "interaction energy". But
you can minimize whatever function you want with the same approach: (1) take
any state you want, (2) let it randomly jitter to neighboring states, (3)
accept the new states when it makes your function smaller, but also (with some
probability p) when it makes your function larger, and (4) slowly lower p as
you allow the system to randomly jitter around.

You can also get a good idea of what's going on with a graph:

    
    
        \   _b    / _
         \_/ \   /  _|-- B
              \_/   _|-- A
    

Let's explain simulated annealing by looking at the barrier above (b), which
has two different energy costs to jump over it: B from the left and A + B
from the right.

Here we understand the jitter as having a typical energy E which lets it jump
up the graph, but it still tends to fall in the holes. When E > A + B then it
will jump from the deeper valley on the right into the left and back. But as
we get into the range B < E < A + B, then it's possible to jump from left to
right, but not easily from right to left. The more we allow the system to
jitter in this region as we slowly reduce E, the more likely the state ends up
on the right somewhere, and then from that point on the algorithm mirrors your
normal gradient descent algorithm.

~~~
bjornsing
> You can get a lot of the way (just not to `exp((E_old - E_new) / Temp)`) by
> explaining the origin of the algorithm

The exp is from the so-called Boltzmann factor [1] which (in somewhat
simplified terms) describes the likelihood that a physical system in
equilibrium at temperature T will transition from a state with energy E_old to
one with energy E_new. So sure, you can derive the whole algorithm from
classic "annealing" and physics alone.

1. [http://en.wikipedia.org/wiki/Boltzmann_distribution](http://en.wikipedia.org/wiki/Boltzmann_distribution)
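That factor is easy to see numerically (plain Python; the energies and temperatures here are invented just to show the effect):

```python
import math

def accept_prob(e_old, e_new, temp):
    """Metropolis rule: always accept improvements; accept a worse
    state with the Boltzmann factor exp((E_old - E_new) / T)."""
    if e_new <= e_old:
        return 1.0
    return math.exp((e_old - e_new) / temp)

# the same uphill move of +1 energy unit at two different temperatures
hot = accept_prob(0.0, 1.0, temp=10.0)   # jitter dominates: p ~ 0.905
cold = accept_prob(0.0, 1.0, temp=0.1)   # almost pure descent: p ~ 5e-5
```

At high temperature almost any move is accepted; at low temperature the rule degenerates into plain downhill search.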

~~~
drostie
I mean, I know where it comes from; I just don't think that there's a nice
hand-wavy way to get people to understand what's going on there: you really
need to start with how uncertainties multiply across a system, define entropy
as log(W) in the NVE ensemble, discover temperature hiding there too, then
switch to the NVT ensemble (i.e. connect to an infinite reservoir of energy at
a certain temperature) -- then you can get at Boltzmann factors.

Even then, as was pointed out to me recently, it's not clear that the "always
accept a lower energy, accept a higher energy proportional to the Boltzmann
factor" yields a steady-state Boltzmann distribution -- especially because
that would imply that the result is independent of the number of
configurations with a given energy (the "density of states"), which seems
surprising. If that doesn't hold then it's either a convenient approximation
or a bit of magical thinking...

------
Fede_V
It's very hard to have 'proofs' about convergence of global optimization
algorithms, because to do so, you have to make assumptions about the
smoothness of the fitness landscape, the types of minima, etc. This means
that when someone is interested in publishing a new optimization algorithm,
they just cherry-pick a subset of problems and show that their algorithm
works better on those, when used with a given set of hyper-parameters.

In theory, the no-free-lunch theorem means that there is no optimal algorithm
for an arbitrary fitness landscape, but in practice, almost every
optimization problem of actual interest is somewhat smooth.

Anyway, here is an interesting set of benchmarks by Andrea Gavana about
different optimization algorithms:
[http://infinity77.net/global_optimization/ampgo.html](http://infinity77.net/global_optimization/ampgo.html).
In my experience, as long as the fitness landscape is relatively smooth,
multi-start gradient based methods tend to perform better, assuming you can
get a somewhat decent approximation of the gradient (with AD, it's not too
difficult).
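A minimal sketch of the multi-start idea (a hand-rolled gradient descent with an analytic derivative standing in for AD; the test function and all parameters are invented for illustration):

```python
import random

def grad_descent(grad, x0, lr=0.01, iters=500):
    """Plain gradient descent from a single starting point."""
    x = x0
    for _ in range(iters):
        x -= lr * grad(x)
    return x

def multistart(f, grad, n_starts=20, lo=-3.0, hi=3.0, seed=0):
    """Multi-start local search: descend from several random initial
    points and keep the best local minimum found."""
    rng = random.Random(seed)
    starts = [rng.uniform(lo, hi) for _ in range(n_starts)]
    return min((grad_descent(grad, x0) for x0 in starts), key=f)

# tilted double well: local minimum near x = -0.96, global minimum near x = 1.04
f = lambda x: (x * x - 1) ** 2 - 0.3 * x
df = lambda x: 4 * x * (x * x - 1) - 0.3
x_star = multistart(f, df)
```

With enough restarts, at least one descent lands in the global basin, and the cheap local solver does the rest.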

------
ska
As introductions go, this is not very good.

For example, the first sentence "Simulated annealing is a method for finding a
good (not necessarily perfect) solution" is incorrect in theory, while correct
in practice. I feel that it is this dichotomy that brings the most insight to
the strengths and weaknesses of the approach. Further, the post doesn't give
any good intuition on why it works, or how it works, or how it compares to
alternatives... everything you would want to know. It also gets other details
not quite right, like the origin of the Gibbs measure (the name "simulated
annealing" comes from an analogy to the annealing process in metallurgy, but
the Gibbs formulation comes from probability theory, and arrived here via the
Ising model and electromagnetism). This isn't very important to a practical
intro, I guess, but it makes it harder to find the right places to read
deeper if you wish to.

It's fairly easy to prove that simulated annealing done "properly" will
converge to a uniform distribution over all global minima. The "properly" here
involves a logarithmic cooling schedule and a little thought should convince
you that this is in practice not going to help you any more than exhaustive
search will.
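A quick back-of-the-envelope comparison of the two schedules makes the point (a sketch; the constants and target temperature are arbitrary):

```python
import math

def log_schedule(k, c=1.0):
    """T_k = c / log(k + 2): the schedule the convergence proofs need."""
    return c / math.log(k + 2)

def geometric_schedule(k, t0=1.0, r=0.99):
    """T_k = t0 * r^k: what practical implementations typically use."""
    return t0 * r ** k

# steps needed to cool from T = 1 down to T = 0.01 under each schedule
geo_steps = next(k for k in range(10 ** 6) if geometric_schedule(k) < 0.01)
log_steps_needed = math.exp(1.0 / 0.01) - 2   # k must exceed e^100 - 2
slow = log_schedule(10 ** 6)                  # still ~0.07 after a million steps
```

The geometric schedule gets there in a few hundred steps; the logarithmic one needs on the order of e^100 steps, which is exactly why "proper" annealing buys you nothing over exhaustive search.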

It's very difficult to prove useful properties about convergence or
distribution under faster cooling: therein lies the rub.

I suspect a big part of the popularity of the approach is that it is extremely
easy to implement, but unfortunately naive application often doesn't work
very well on complicated energy state spaces.

~~~
carlob
> It's fairly easy to prove that simulated annealing done "properly" will
> converge to a uniform distribution over all global minima. The "properly"
> here involves a logarithmic cooling schedule and a little thought should
> convince you that this is in practice not going to help you any more than
> exhaustive search will.

Are you sure this is proven? What if ergodicity is broken?

~~~
ska
Too late to edit the parent, but you are right to call me out on this,
thanks! Clearly without some ergodicity constraint you cannot have a uniform
distribution.

I was sloppy with terminology, sorry. The original approach to this requires
weak and strong ergodicity conditions on an inhomogeneous Markov chain (but
not stationarity of the chain, i.e. you don't have to run it forever). This
was worked out in the mid 80s by Gidas and Mitra, if I recall correctly, and
there have been several interesting updates since then, along with other
probabilistic arguments.

I can dig out references if anyone asks.

~~~
carlob
The problem then is that proving ergodicity is extremely hard for anything
that is not a Markov chain to begin with.

IIRC ergodicity of hard spheres is not yet proved in full generality (n
spheres, d dimensions), and that's basically the simplest physical system
there is!

------
ColinWright
I nearly snorted my coffee through my nose at this:

    
    
        "Simulated annealing's strength is that it
         avoids getting caught at local maxima ..."
    

It's true that SA generally has better local-maximum avoiding properties than
simplistic hill-climbing, but I've generally found that shotgun stochastic
ballistic hill-climbing works at least as well.

But all these things work the same way:

* Do something random and see how good it is;

* Jiggle it at random and see if it's better;

* Consider changing.
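Those three bullets can be sketched directly (a toy version of shotgun hill-climbing; names, the test function, and all parameters are my own invention, with no momentum term, so it's closer to plain shotgun than the "ballistic" variant):

```python
import random

def hill_climb(f, x0, step=0.1, iters=300, rng=random):
    """Do something random, jiggle it, keep the change only if better."""
    x, e = x0, f(x0)
    for _ in range(iters):
        x_new = x + rng.uniform(-step, step)   # jiggle it at random
        e_new = f(x_new)
        if e_new < e:                          # consider changing
            x, e = x_new, e_new
    return x, e

def shotgun(f, n_starts=30, lo=-3.0, hi=3.0, seed=0):
    """'Shotgun': fire the climber from many random starting points
    and keep the best local optimum any of them finds."""
    rng = random.Random(seed)
    runs = [hill_climb(f, rng.uniform(lo, hi), rng=rng) for _ in range(n_starts)]
    return min(runs, key=lambda r: r[1])

# tilted double well again: the deeper valley is near x = 1
f = lambda x: (x * x - 1) ** 2 - 0.3 * x
x_best, e_best = shotgun(f)
```

Each climber gets stuck in whatever basin it starts in; the restarts are what give the method its local-optimum-avoiding property.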

~~~
tfgg
> shotgun stochastic ballistic hill-climbing works at least as well

This is the method that I'd like a lot of these stochastic optimization
algorithms compared to, especially genetic algorithms.

At least brute force sampling of the entire space is guaranteed to find the
global extremum at some point, an argument I've heard for random sampling &
minimising vs. GAs for random structure searching in materials science (a very
high dimensional problem). It doesn't code any assumptions about the surface
in, so can't be optimal, but that can be a benefit if you want to find
something surprising.

~~~
ColinWright

      >> shotgun stochastic ballistic hill-climbing
      >> works at least as well
    
      > This is the method that I'd like a lot of these
      > stochastic optimization algorithms compared to,
      > especially genetic algorithms.
    
      > At least brute force sampling of the entire space ...
    

Not sure you meant it to do so, but this sounds like you think that "shotgun
stochastic ballistic hill-climbing" is a brute force sampling of the entire
space. It's not, it's a stochastic optimization algorithm. Brute force
sampling of the entire space, or even reasonable portions of it, is completely
infeasible for the problems I deal with, so I'm not sure what you're saying
here. You might like to explain a little more.

~~~
tfgg
Now I've read your other comment with the terminology, I think "shotgun hill-
climbing" is the appropriate phrase for the sort of materials science thing
I'm talking about - you start atoms in random configurations, with minimal
bias, and then minimise using the forces on those atoms (available
analytically) with gradient descent or whatever optimiser (usually BFGS)
until you find local minima. There's no momentum involved, and the gradient
is available, but you still start in random positions and minimise to local
minima. That sort of method is in competition with other structure-finding
algorithms that use GAs, SA or other fudged hybrids.

My reference to "brute force" was that, as far as I understand, in the limit
of lots of attempts, "shotgun" will approach "brute force" -- imagine a
really nasty landscape with lots of tiny basins. I understand "stochastic
ballistic" modifies things slightly.

~~~
jmmcd
If gradient is available then it makes little sense to say you want people to
compare that method to GAs. GAs are for black box search.

~~~
ColinWright
Discrete gradients are always available - you can step in random directions
and re-evaluate giving an approximation. If your space is smooth(ish) (and
usually it's "smooth enough") then you can use the local approximations.

Yes, sometimes it's not possible, but usually local random search gives enough
to use the concept of "gradient" to help. Mostly I find that people doing the
optimisation/search don't know enough about the techniques they're using to
adapt them and transform their problem to get the best out of things.
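The simplest deterministic version of that approximation is central differences along each coordinate (a sketch; the function and evaluation point are made up, and a random-direction variant like SPSA works on the same principle):

```python
def fd_gradient(f, x, h=1e-5):
    """Central-difference approximation to the gradient of f at x (a list):
    step each coordinate by +/- h, re-evaluate, and take the slope."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

# quadratic bowl: the true gradient at (1, 2) is (2, 4)
f = lambda v: v[0] ** 2 + v[1] ** 2
g = fd_gradient(f, [1.0, 2.0])   # g is approximately [2.0, 4.0]
```

This costs two evaluations per dimension, which is why random-direction probing becomes attractive in high dimensions.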

~~~
jmmcd
Yes, but it still makes no sense to compare GA performance with a gradient-
based method on the problem GP is talking about, because an explicit gradient
is available.

------
eliben
As a tangent, I'm really happy to see that Python is slowly but surely
becoming the go-to demonstration language of posts like this. Wasn't always
this way. Python is so lucidly readable that it serves great as "executable
pseudocode". I mean, if you provide pseudocode for an algorithm, might as well
write Python - it's mostly as readable and a huge advantage is that you can
actually run and test it.

~~~
getsat
I think the widespread use of scipy/numpy is directly correlated. Python code
is generally very easy to port, too, so I'm very happy when it's used for
examples even though I don't "know" or write Python.

------
leni536
This method comes from statistical physics. Basically you want to generate a
Boltzmann distribution of states as a function of "energy". By defining the
Markov chain in a certain way you can achieve this. This is the Metropolis-
Hastings algorithm[1].

I believe it's somewhat hard to use this algorithm for optimum finding. You
have to tune the temperature; it works best if you have a good initial
temperature, and you need to cool the system down slowly (so you have to
choose a reasonable timescale for thermalization). Some physical insight
could help with both.

However, this algorithm is also used to actually simulate physical systems at
finite temperature where you do not need dynamics. You only simulate the
Boltzmann distribution, and you measure interesting mean values and
correlations. It's mostly used for describing phase transitions.

[1]
[http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_alg...](http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)
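For a sense of the sampling use, here is a Metropolis chain on the smallest possible system (a two-state toy of my own choosing, with energies 0 and 1); the fraction of time in the excited state should approach its Boltzmann weight:

```python
import math
import random

def metropolis_two_state(temp, n_steps=100_000, seed=0):
    """Metropolis sampling of a two-state system with energies 0 and 1.
    Propose flipping to the other state; accept with min(1, exp(-dE/T))."""
    rng = random.Random(seed)
    state, count_excited = 0, 0
    for _ in range(n_steps):
        proposed = 1 - state
        d_e = float(proposed - state)            # energy difference of the flip
        if d_e <= 0 or rng.random() < math.exp(-d_e / temp):
            state = proposed
        count_excited += state
    return count_excited / n_steps

p = metropolis_two_state(temp=1.0)
exact = math.exp(-1) / (1 + math.exp(-1))        # Boltzmann weight, ~0.269
```

No cooling at all here: the temperature stays fixed and the chain's long-run statistics are the object of interest, which is exactly the physics use leni536 describes.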

------
thearn4
Genetic algorithms and simulated annealing are probably some of my favorite
topics which ended up not being applicable in my day job, despite working very
heavily in non-linear optimization. Our problems are of the size where
gradient-based wins the day, hands down. But GA/SA are still very fun for my
side projects.

~~~
Fede_V
Simulated annealing can be gradient based - the simulated annealing step
refers to the probability of accepting a given minimum as a function of the
'temperature'.

~~~
a-priori
Simulated annealing is gradient based. It's an extension of gradient descent
and in the degenerate case (zero temperature) they're the same: it generates
random neighbouring states, and if the fitness of that state is better than
the current one then it jumps there. That is, it seeks the local minimum.

The key part that simulated annealing adds is that when the temperature is
non-zero there's also a chance of jumping to _worse_ states. The probability
depends on how much worse the new state is and what the temperature is. It's
unlikely to make the jump if it's much worse or if the temperature is too
cold. The idea here is to jump out of local minima to (hopefully) find the
global minimum.

The temperature is a knob that trades off between exploration (high
temperature) and exploitation (low temperature). You start off with a high
temperature and jump around a lot, hopefully seeking the general vicinity of
the global minimum. As you do so, you gradually decrease the temperature and
settle on a locally-optimal solution.

Let's say the fitness landscape looks like this:

    
    
        \   B     /
         \A/ \C  /
              \D/
    

If you are at A and you're seeking the lowest point, then there's a
probability that you will jump to B _even though it is worse_. The hope is
that you'll then discover C and finally D, the global minimum. There's no
route from A to D with monotonically increasing fitness, so gradient descent
won't find it. But simulated annealing might.
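That picture can be turned into a tiny runnable demo (the four states and their heights are invented to match the sketch; D is the global minimum and B the barrier):

```python
import math
import random

# the landscape from the sketch: A and D are valleys, B the barrier between them
heights = {"A": 1.0, "B": 2.0, "C": 1.5, "D": 0.0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

def anneal_states(start, t0=2.0, cooling=0.99, iters=2000, seed=0):
    """Random walk over the four states, accepting worse states with
    probability exp(-dH / T) while T slowly decreases."""
    rng = random.Random(seed)
    state, t = start, t0
    for _ in range(iters):
        cand = rng.choice(neighbors[state])
        d = heights[cand] - heights[state]
        if d <= 0 or rng.random() < math.exp(-d / t):
            state = cand
        t *= cooling
    return state

# from A, pure descent is stuck forever (B is uphill); annealing usually escapes
wins = sum(anneal_states("A", seed=s) == "D" for s in range(50))
```

Greedy descent started at A can never leave, since its only neighbour B is worse; most annealing runs end up at D.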

~~~
Fede_V
The local minimization step does not have to be gradient based. You could use
something like Nelder-Mead to do the local minimization step, and still use an
annealing schedule to decide whether to accept or reject that particular
minimum.

------
fiatmoney
There's a whole genre of stochastic optimization algorithms in the same family
- parallel tempering in particular has some nice theoretical properties (it
strictly dominates simulated annealing, assuming you have computational power
to run all the replicas, and has a light guarantee that "eventually" it will
converge), and there are adaptive methods to optimize the step sizes, number
of replicas, and temperature ranges.

------
fmax30
Aaah, good old simulated annealing. I remember using it for my computational
modelling based final year project (which led to a workshop paper [1]) where
brute forcing wasn't really a viable option.

[1]
[http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnum...](http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=6690725)

------
andy_ppp
If only I'd implemented the obvious extension to my graduate dissertation on
optimising taxi routes (using a slightly different mechanism, tabu search and
squeaky wheel)...

Unfortunately someone has fairly recently done it.

------
krazydad
Although she doesn't say so, it appears she is assuming the cost function is
normalized (range is 0-1), in this formula: exp((E_old - E_new) / Temp). Is
that the case?

------
graycat
Okay, I once looked into simulated annealing, and here I've just read all the
comments on this thread. So, apparently what I have to contribute is not here
yet.

Here I outline _non-linear duality theory and Lagrangian relaxation_ and
report a war story where this approach badly beat some efforts with simulated
annealing.

Basically non-linear duality theory has some surprising properties commonly
just too powerful to ignore.

Basically we're considering a case of _optimization_ or _mathematical
programming_.

So, let R denote the set of real numbers, and suppose positive integers m and
n are given. Let R^n be the usual n-dimensional real vector space of n-tuples.
To use matrix notation, we regard each n-tuple x in R^n as a matrix with n
rows and one column, that is, as n x 1.

Similarly for R^m.

We are given set C a subset of R^n and functions

f: R^n --> R

g: R^n --> R^m

We seek x in R^n to solve non-linear program (NLP)

min z = f(x)

subject to

g(x) <= 0

x in C

where the 0 here is the m x 1 matrix of zeros.

We let the _feasible region_ F, a subset of R^n, be

F = { x | x in C and g(x) <= 0 }

We regard g as m x 1 where for j = 1, 2, ..., m component j of g is

g_j: R^n --> R

where here g_j, borrowing TeX notation, is g with subscript j.

Okay, given 1 x m l, we let the _Lagrangian_

L( x, l ) = f(x) + l g(x)

Then for l >= 0 and x in F,

f(x) >= L( x, l )

and

z >= L( x, l )

Then the _dual_ is to find l to maximize

M( l ) = min_{x in C} L( x, l )

Then M( l ) <= z so that M( l ) is a lower bound on z.

In particular (H. Everett), if l >= 0, x attains the minimum in

M( l ) = min_{x in C} L( x, l ) = min_{x in C} ( f(x) + l g(x) ),

and g(x) = 0, then x in F and

M( l ) = f(x) + l g(x) = f(x)

so that M( l ) <= z <= f(x) = M( l ), and x solves NLP.

A practical special case: essentially, if you spend all the productive
resources and do so optimally, then you are optimal.

Of course, the usual notation used for l is lambda, and for a while around DC
Everett ran Lambda Corporation which did resource allocation problems for the
US DoD.

Theorem: M is concave.

Proof: For t in the interval [0,1] and m x 1 l_1 and l_2, we have

M( tl_1 + (1 - t)l_2) =

min_{x in C} L( x, t l_1 + (1 - t) l_2) =

min_{x in C} f(x) + ( t l_1 + (1 - t) l_2) g(x) =

min_{x in C} ( t + (1 - t)) f(x) + ( t l_1 + (1 - t) l_2) g(x) =

min_{x in C} t L( x, l_1 ) + (1 - t) L( x, l_2 )

>= t min_{x in C} L( x, l_1 ) + (1 - t) min_{x in C} L( x, l_2 )

= t M( l_1 ) + (1 - t) M( l_2 )

so that M is concave. Done.

Immediately we have supporting hyperplanes for our concave M:

Suppose y and l are given and

M( l ) = L( y, l )

Then for 1 x m u,

M( l + u ) = min_{x in C} L( x, l + u )

= min_{x in C} ( L( x, l ) + u g(x) )

<= L( y, l ) + u g(y)

= M( l ) + u g(y)

so that for 1 x m u

s( u ) = M( l ) + u g(y)

is a _supporting hyperplane_ for concave M( l ); that is,

M( l + u ) <= s( u ) = M( l ) + u g(y)

and s( 0 ) = M( l + 0 ) = M( l ).

So, here we have the basic theory of non-linear duality as needed by the
iterative algorithmic approach of _primal-dual Lagrangian relaxation_.

War Story:

Once some guys had a big resource allocation problem having to do with a
suddenly legal case of _cross selling_ for banks. They had formulated their
problem as a 0-1 integer linear program with ~40,000 constraints and 600,000
variables.

They had tried simulated annealing, ran for days, and quit with no idea how
close they were to optimality.

I looked at their formulation and found that, except for 16 of the
constraints, the problem was comparatively easy.

So, I wrote out

L( x, l ) = f(x) + l g(x)

where l was 1 x 16 and g represented the 16 troublesome constraints. The set C
represented the 0-1 constraints and the rest of the 40,000 constraints.

Then for each l, finding x to solve

M( l ) = min_{x in C} L( x, l )

was easy. Then I just had to find l to solve

max M( l )

subject to

l >= 0

and that was just to maximize a concave function.

So, I did some _primal-dual_ iterations:

    
    
              Set
    
                   k = 1
    
              Pick l^k >= 0
    
         Step 1 (Primal):
    
              Find x = x^k to solve
    
                   min_{x in C} L( x, l^k )
    
              Then
    
                   M( l^k ) = L( x^k, l^k )
    
         Step 2 (Termination)
    
              If x^k in F then
    
                   M( l^k ) <= z <= f( x^k )
    
              so that if
    
                    f( x^k ) - M( l^k )
    
               is sufficiently small, then
               terminate.  Else if k is
               too large, terminate.
    
              Step 3 (Dual):
    
              Find 1 x m u and w in R to solve
    
                   max w
    
                   subject to
    
                    w <= M( l^p ) + ( l^k + u - l^p ) g(x^p)
    
                   p = 1, 2, ..., k
    
              Of course, this is just a linear
              program.  Set
    
                    l^(k + 1) = l^k + u
    
                   k = k + 1
    
              Go to Step 1.
    

There are refinements: E.g., for the linear program to find l^{k + 1}, one
can hope to get better numerical performance by using _central cutting
planes_.

For the real problem, in 905 seconds on a 90 MHz processor, I ran 500 primal-
dual iterations and, then, had a feasible solution, from the bound
inequalities, guaranteed to be within 0.025% of optimality.

Worked fine! So, simulated annealing is not the only tool in the tool box!
Sometimes Lagrangian relaxation can work well, too!
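For anyone who wants to see the mechanics without 40,000 constraints, here is a toy version (not the banks problem; a tiny 0-1 knapsack with the single capacity constraint relaxed, and a plain subgradient step on l standing in for the cutting-plane LP above -- all names and numbers are illustrative):

```python
def lagrangian_relaxation(values, weights, cap, iters=200, step=0.1):
    """Relax the one 'troublesome' capacity constraint of a 0-1 knapsack
    (posed as minimizing -value).  For fixed multiplier l the inner
    minimization over x in {0,1}^n separates per item, so it is easy;
    we then climb the concave dual M(l) by subgradient steps."""
    l = 0.0
    best_bound = float("-inf")
    for _ in range(iters):
        # inner minimization: x_i = 1 exactly when its reduced cost is negative
        x = [1 if -v + l * w < 0 else 0 for v, w in zip(values, weights)]
        m = sum((-v + l * w) * xi for v, w, xi in zip(values, weights, x)) - l * cap
        best_bound = max(best_bound, m)          # M(l) <= z for every l >= 0
        g = sum(w * xi for w, xi in zip(weights, x)) - cap   # subgradient of M
        l = max(0.0, l + step * g)               # project onto l >= 0
    return best_bound

# toy instance: the optimum packs items 2 and 3 for value 9, so z = -9
bound = lagrangian_relaxation([6, 5, 4], [3, 2, 1], cap=3)
```

On this instance the dual bound reaches -9, matching the optimum exactly; in general Lagrangian relaxation gives a guaranteed lower bound, which is precisely the optimality certificate the simulated annealing runs were missing.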

~~~
jlouis
Introducing Lagrangian relaxation quickly in a comment on a simulated
annealing post is perhaps not the best way to convey a method. The machinery
of mathematical programming you have to introduce quickly becomes an
incomprehensible wall of text to most people. Better to write this in a blog
post somewhere and refer to that.

Indeed, there are often better solutions than simulated annealing, if the
problem you are facing has some nice properties. I've been doing optimization
on a highly nonlinear mathematical problem, but the objective function is so
nice it only has a single maximum in the region I'm looking at. So I just ran
Nelder-Mead, which terminates much more quickly.

~~~
graycat
For anyone interested in simulated annealing, what I wrote could be of more
interest to them than simulated annealing and, really, easier to read than a
comparably detailed description of simulated annealing.

