
Stochastic Superoptimization - qubitsam
http://blog.regehr.org/archives/923
======
pbsd
The ASPLOS paper makes a bold assertion regarding the first example:

> To the best of our knowledge, it is truly optimal: it is the fastest program
> for this function written in the 64-bit x86 (x86-64) instruction set.

This is an unwise claim to make, since x86-64 speed is a moving target, and
what is fast today may not be six months from now. For example, this sequence
should be slightly superior on modern Intel CPUs:

    
    
        mov edx, edx
        shl rcx, 32
        ; xor rcx, rdx
        ; mov rax, rcx
        lea rax, [rdx + rcx]
        mul rsi
        add rdi, r8
        adc rdx, 0
        add rax, rdi
        adc rdx, 0
        mov r8, rdx
        mov rdi, rax

~~~
nkurz
Wow, that's really embarrassing.

I checked against IACA
(<http://software.intel.com/en-us/articles/intel-architecture-code-analyzer>)
in case I was missing something, and it agrees that your
version is 2 cycles faster on Sandy Bridge. Worse, while both blocks are
slower on Nehalem than Sandy Bridge, yours appears to be 2 cycles faster on
the older architecture as well. Seeing as that came out in 2008, it's hard to
be charitable to their boasting.

Would you mind if I use your sample code to ask the authors?

~~~
pbsd
Sure, you can use that however you wish.

------
xyzzy123
Ha, OAAS (optimisation as a service). I like it. That's the first time I've
seen that idea.

Of course, if you standardise something based on say LLVM IR, that spins into
the idea of a marketplace of competing optimisation services... -Ofastly
versus -Ogoogle-optimizer-cloud ...

Sadly, I don't think it would be feasible to commercialise this. That doesn't
mean there aren't amusing possibilities:

"Although -Omechanical-turk does produce very good code, most people cannot
wait 3 weeks for that compiler pass to finish..."

~~~
sanxiyn
A year ago, Paul Graham selected "optimisation as a service" as a startup
idea. It's #6 in <http://paulgraham.com/ambitious.html>

~~~
xyzzy123
Oh, so it is!

The reasons I wasn't convinced a web-service-based IL optimisation pass would
be commercially viable were: loss of determinism in compilation, the law of
diminishing returns (throwing 10x effort might generally give only a few
percentage points gain), and the idea of parts of the compiler _literally_
being a black box being unpalatable to a lot of people. Also, IP and
confidentiality concerns.

OTOH PG's idea was slightly different; it reads to me like he's talking about
something that starts with source code.

------
Houshalter
Now run the superoptimizer on itself (or at least parts of it). Yes, I realize
this would quickly hit a point of diminishing returns since it would only try
to create code that produces the exact same output, not novel code that
produces even better output, but it's still an interesting idea to think
about.

And maybe you could judge the output code by its utility rather than how
closely it matches the output of the original algorithm, but that's a much
harder problem.

------
Xcelerate
(Warning! This is a long one!) This is a really intriguing concept. I've often
daydreamed about writing something like this, but didn't have the time or
expertise to go about attempting an implementation. It's so cool to see others
doing this.

What would be really cool with these optimizers is if the stored databases of
"patterns" could be shared. In other words, if the optimizer stumbles across a
particularly excellent implementation of a particular code goal, it would be
uploaded to a central database and cached for reuse and further mutation.
People all around the world would be working from, and contributing to, one
giant database of randomly discovered optimization techniques.

As a field of study, I also feel like stochastic optimization has a huge
potential for improvement over the coming years. Recently, I had the problem
of trying to space equally-sized spheres as far apart from each other as I
could -- in other words, if I have 361 points that I'm putting in a box, I
want the minimum distance between two points to be maximized (this is the
fitness function).

My first attempt was to give each point a "charge" and attempt a little
electrostatic simulation. This didn't work that well, for various reasons.
After stumbling around for a while, I decided to attempt a genetic mutation
algorithm. I had never written anything like this before, but I thought it
would be fun to give it a go.

Initially, my program randomly placed points in the box. Then, for each point,
it found its nearest neighbor and computed the distance to it. I then took the
average of all these distances, giving me the mean nearest-neighbor distance.
On the next iteration, all points whose nearest-neighbor distance was less
than this mean were perturbed. The process was repeated until no random
perturbation produced an improvement.

This worked alright, but it needed more. I kept modifying my program, but in
the process of modifying it, I introduced a bug that somehow actually made the
code converge to a larger minimum distance (which is the goal). Inspecting
my code, I realized the source of the "error": in the process of computing the
distance between a point and its nearest neighbor, I was setting an initial
"minimum" distance, which was supposed to be infinity. As closer distances
were found, the infinity was replaced, so eventually it ends up with the
closest distance. Well, it turns out that instead of infinity, I had been
setting it to something like 10 (for a 100^3 box). So if the nearest neighbor
distance was _larger_ than 10, it was still a value of 10 that was
contributing to the mean NN distance!

I soon realized that I could also modify the distances that I randomly
perturbed the points in each step, and that some values worked better than
others. Not only that, but that larger perturbations worked better at the
beginning of the convergence, and smaller perturbations worked better toward
the end of the convergence. So I created a heuristic that based the next
"perturbation distance" on the most recently successful distances.

But this wasn't enough. Eventually as the perturbation distance became
smaller, the convergence process would "stall". I found that by temporarily
increasing this distance, the system would find new mutation paths that would
get the program out of its local optimum and move it forward again. But how
frequently do I modify this distance?
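
The loop described above can be sketched roughly as follows (a toy
reconstruction with invented parameters, not the original research code):
perturb the points whose nearest-neighbor distance falls below the mean, keep
the change when the minimum distance improves, shrink the step as the search
converges, and kick the step back up when progress stalls.

```python
import math
import random

def nn_distances(pts):
    """Distance from each point to its nearest neighbor."""
    return [min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
            for i, p in enumerate(pts)]

def spread_points(n=20, box=10.0, iters=1000, seed=1):
    rng = random.Random(seed)
    pts = [[rng.uniform(0, box) for _ in range(3)] for _ in range(n)]
    initial = best = min(nn_distances(pts))
    step, stall = box / 5, 0
    for _ in range(iters):
        dists = nn_distances(pts)
        mean = sum(dists) / n
        # Perturb only the points closer to a neighbor than the mean.
        cand = [p if d >= mean else
                [min(box, max(0.0, c + rng.uniform(-step, step))) for c in p]
                for p, d in zip(pts, dists)]
        score = min(nn_distances(cand))
        if score > best:
            pts, best, stall = cand, score, 0
            step *= 0.97                 # converging: take smaller steps
        else:
            stall += 1
            if stall > 100:              # stalled: enlarge the step to escape
                step, stall = box / 5, 0
    return initial, best

start, end = spread_points()
print(f"min nearest-neighbor distance improved from {start:.2f} to {end:.2f}")
```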

The more code I wrote, the more I realized could be changed. There are a
ridiculous number of parameters that could be optimized _in my optimizer_,
and I was just sitting there tweaking them by hand and testing what worked
best. The thought occurred to me, why don't I write a program, that writes a
program, that optimizes the sphere placing problem? And then, darn, I may have
to write a program that even optimizes that one.

Well, I had to get this thing done -- since it was for my research -- so I
couldn't go off on a tangent, and I already had satisfactory results. But the sheer
number of techniques to optimize this thing left me highly intrigued, and I
will certainly be revisiting my program when I can find time to do so.

~~~
ISL
A lot of people have thought really hard about this problem. Plenty of open
problems remain:

<http://en.wikipedia.org/wiki/Packing_problem>

~~~
Someone
And for this particular problem:

<http://en.wikipedia.org/wiki/Close-packing_of_equal_spheres>

<http://demonstrations.wolfram.com/SpherePacking/>

<http://www.maa.org/devlin/devlin_9_98.html> (nb: that proof is for infinitely
many spheres. For finitely many, the problem is, AFAIK, still widely open. 361
likely is close to infinitely many, though, if your box isn't weirdly formed)

------
jacques_chester
Very cool. It reminds me a bit of the genetic programming work (aside: GP and
genetic algos are different schools of research) on evolving programs where
the genome is a string of assembler instructions.

Markov chain Monte Carlo would appear on its face to be better suited to this
problem than a pure GP approach; at least to speed up convergence to a
solution. It's actually quite different from GP -- no crossovers, more global
state etc. Maybe a flock/swarm optimiser approach would be a better analogy in
the nature-inspired computation world.

I'd be interested to know whether they experimented with other cost functions.
For example, in some schemes, some candidates which fall outside the
acceptable solution state (eg code that doesn't run, code that produces the
wrong answer) are simply culled. In some you try to perform "repairs". In this
case they've taken the other common approach of assigning a cost to
incorrectness.

~~~
Dn_Ab
I don't think this is MCMC. The way I read it, Monte Carlo was mentioned with
regards to the idea of searching widely and sparsely. This is counter to the
more targeted idea of MCMC, which generates a bunch of correlated samples to
more efficiently sample the distribution.

Another reason to suspect this is closer to genetic programming is that they
talk about mutation and fitness. I got the impression that the algorithm
searches a space of program trees via mutation, without keeping a population -
so no crossover. If they did use MCMC it would be as a very specialized form,
such as simulated annealing. Also, the Monte Carlo technique used in Go is
Monte Carlo tree search, not MCMC.

I find it amazing that randomization works so effectively; we must live in a
friendly universe. Many real world problems tend to have enough exploitable
structure that the idea of saving effort on optimization (xor measurement) via
randomization is a powerful one.

~~~
coherentpony
>This is counter to the more targeted idea of MCMC, which generates a bunch of
correlated samples to more efficiently sample the distribution.

The idea of MCMC is not to more efficiently sample a distribution through the
use of correlated samples. Sampling a Gaussian exactly is more efficient than
constructing a Markov chain with some random walk proposal (for example) to
sample it.

The idea of MCMC is to sample distributions that are considered intractable.
Correlation in samples skews moment calculations, and this is bad. In fact,
there is a whole field of research dedicated to designing proposals that allow
chains to reach stationarity more quickly.

~~~
Dn_Ab
Yeah I know that it is to aid in the computation of intractable inference, I
should have been more careful in my wording. I am not saying it works by
generating correlated samples, I'm saying because of how it works it will
exhibit autocorrelation. Essentially, MCMC will generate samples that are near
each other which will tend to be positively correlated. This property is why
one needs to tune the variance of the proposal, do burn in and test for
convergence. However, due to the nature of Markov chains, large enough samples
will have you sampling from P(x). The idea is to approximate the density of
P(x) while trying to reject as little as possible. Hence less wasteful. But
unlike simpler rejection methods samples will not be i.i.d.

I've also read the actual paper now, which I should have done in the first
place. I see how they make use of MCMC. And looking at it, I think my
suspicion that what they have is very close to annealing (sans Temp) is
accurate since they emulate density with cost and proposal with local edits.
The particulars of the more general (though less generally applicable) MCMC as
I understand it involve actual distributions. I'm not sure why the blog
emphasizes searching sparsely instead of sampling.
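
The cost-as-density view described above can be sketched in a few lines (toy
cost and proposal functions, invented here for illustration; the paper's
actual cost measure and edit moves are more involved): treat exp(-beta *
cost(x)) as an unnormalized density, propose local edits, and accept uphill
moves with the usual Metropolis probability.

```python
import math
import random

def cost(x):
    return (x - 3.0) ** 2                       # toy cost, minimized at x = 3

def propose(x, step=0.5):
    return x + random.uniform(-step, step)      # a "local edit"

def mcmc_minimize(x0, beta=2.0, iters=5000, seed=0):
    random.seed(seed)
    x = best = x0
    for _ in range(iters):
        y = propose(x)
        delta = cost(y) - cost(x)
        # Always accept improvements; accept regressions with
        # probability exp(-beta * delta), as in Metropolis sampling.
        if delta <= 0 or random.random() < math.exp(-beta * delta):
            x = y
        if cost(x) < cost(best):
            best = x
    return best

print(round(mcmc_minimize(-10.0), 2))  # close to 3.0
```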

~~~
coherentpony
>The idea is to approximate the density of P(x) while trying to reject as
little as possible. Hence less wasteful.

The idea is to approximate P(x) period. Rejecting as little as possible means
nothing other than one's choice of proposal was poor. There are published
works stating the optimal acceptance probability in high dimensions is 23.4%
(see the work of Roberts et al (1997)). I think if one wanted to reject as
little as possible one would have, albeit naïvely, tuned to a much higher
acceptance rate.

~~~
Dn_Ab
You misconstrued. Reject as little _as possible_ (<> reject little) was the
motivation with respect to more naive random walks which suffer at
dimensionality. I was not talking about leniency of proposal, where you are of
course correct.

~~~
coherentpony
>You misconstrued. Reject as little as possible (<> reject little) was the
motivation with respect to more naive random walks which suffer at
dimensionality.

Misunderstanding on my part; apologies.

Can you clarify the statement on dimensionality? Moment calculations converge
at a sharp rate of 1/sqrt(n). Riemann sums as approximations to integrals
(moments) suffer from the curse of dimensionality.

------
tikhonj
Following that paper, a friend and I implemented a simple MCMC-based
synthesizer for the GreenArrays chip (a very low-power and rather obtuse
architecture based on a variant of Forth).

Most of the code is up on GitHub in several different repos[1][2]. I'm
actually rather proud of some of the code, especially part of the F18A
interpreter (F18A is the name of the instruction set on GreenArrays chips).
It's rather short and neat.

[1]: <https://github.com/TikhonJelvis/array-forth>

[2]: <https://github.com/jacobt/mcmc-synthesis>

My version actually supports loops and even self-modifying code. This no doubt
has some performance implications--my current approach is to throttle each
test case to take at most as many steps as the input program does. For various
reasons, I found it easier to just allow loops than to try to stick to loop-
free code.

I found that their correctness measure--looking at the number of bits that
differ from the expected result--was basically useless for GreenArrays. I
actually have some graphs lying around which show how it basically doesn't
vary until it suddenly finds a correct program and shoots up to being 100%
correct immediately. It does not exhibit the nice hill-climbing behavior that
we want.
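
For reference, the measure in question is just the Hamming distance between
the candidate's output and the expected output; a minimal sketch (using the
18-bit word width mentioned below):

```python
def bit_difference(actual, expected, width=18):
    """Number of differing bits between two width-bit machine words."""
    mask = (1 << width) - 1
    return bin((actual ^ expected) & mask).count("1")

print(bit_difference(0b101010, 0b100110))  # 2 bits differ
```

As the comment above notes, a cost like this can sit flat across most of the
search space and then jump straight to zero, which gives the hill-climber
nothing to follow.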

For better or worse, I have not been able to get this style of approach to
perform as well as a similar synthesizer using an SMT solver (namely
Microsoft's Z3). The advantage with the MCMC system is that it's easily
parallelizable. It can also deal with certain programs that can choke up the
SMT-based version.

The SMT-based system has problems dealing with lots of bits, so we usually run
it pretending that the architecture uses ~4 bits instead of 18 (yes, it uses
18 by default!). So for problems requiring more precision, the MCMC approach
might be better. Similarly, the SMT system does not model memory terribly
efficiently, so problems needing a lot of mutated memory might be better for
the randomized system.

Also, I should add that I know literally nothing about machine learning and
even less about SMT solvers, so it's no surprise that the things I have
written don't perform terribly well. Basically, my only solace is that this
project's code is mostly rather elegant. I learned quite a bit about
programming at a high level of abstraction from this project--some of the
things I ended up doing rely on significantly more abstract reasoning than
anything I've done before.

Anyhow, it's cool to see people reading about something I've been playing
around with for a while now. Even if I find the whole field a trifle annoying
:P.

If you want to play around with this sort of synthesizer, I have uploaded both
a generic library based around MCMC[3]--it's surprisingly simple--and the
GreenArrays-specific code[4] to cabal, so they're both just a cabal install
away.

[3]: <http://hackage.haskell.org/package/mcmc-synthesis>

[4]: <http://hackage.haskell.org/package/array-forth>

------
nkurz
I've recently been trying to write some very fast x64 assembly, and have been
thinking about this quite a bit. I think one could get substantial improvement
even with the simpler approach of just reordering the code. With modern
processors, small changes in number of instructions have much less impact than
how well they interact with the superscalar nature of the CPU. And while out-
of-order processing helps diminish this, the order of instructions still
matters a lot.

Instead of trying lots of random orders and seeing if they produce the
right result, my thought is to generate a data dependency graph of the
instructions, and then enumerate all possible topological sorts of this DAG.
There's a classic typewritten 1974 Knuth paper on generating all the
sortings efficiently: <http://www.cs.ncl.ac.uk/publications/trs/papers/61.pdf>
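
For illustration, here is a naive backtracking enumeration of every
topological sort of a small hypothetical dependency DAG (the instruction
names and edges are invented; Knuth's algorithm achieves the same enumeration
far more efficiently):

```python
# Hypothetical 4-instruction DAG: each key maps to its prerequisites,
# i.e. b depends on a, and d depends on both b and c.
deps = {"a": set(), "b": {"a"}, "c": set(), "d": {"b", "c"}}

def topological_sorts(deps):
    """Yield every instruction ordering consistent with the graph."""
    def extend(order, remaining):
        if not remaining:
            yield list(order)
            return
        for node in sorted(remaining):
            if deps[node] <= set(order):   # all prerequisites already placed
                order.append(node)
                yield from extend(order, remaining - {node})
                order.pop()
    yield from extend([], set(deps))

for order in topological_sorts(deps):
    print(order)
```

For this graph there are exactly three valid orderings, each of which would
be a legal candidate schedule to time.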

Generating the dependency graph is difficult in the general case, but should
be simple enough if we presume branch-agnostic and alias-free code. As long as
the dependency graph is correct, all the orderings should work. Thus if one of
them doesn't, I'll know immediately that the graph is wrong. Since I only need
to generate the graph once, this step doesn't need to be particularly fast.

Then I'll directly generate assembly for each of these orderings. For speed,
and since I already have the machine code, I'm planning to write the bytes
directly to memory. Then instead of analyzing the code and estimating the
processor's front-end scheduling, I'll just call each function and time it
using RDTSC or a comparable on-die performance counter. As the timer is
cycle-accurate, I shouldn't need many, if any, iterations.

This should be straightforward in C, and apparently possible in the standard
scripting languages using libffi:
<http://stackoverflow.com/questions/6143042/how-can-i-call-inlined-machine-code-in-python-on-linux>.
Then I'll save and write out the fastest
one.
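
As a proof of concept of the write-bytes-and-call step, here is a minimal
Python sketch that skips libffi entirely (Linux x86-64 assumed; the machine
code bytes are a trivial stand-in, and a hardened kernel may refuse a
writable+executable mapping):

```python
import ctypes
import mmap
import platform

# x86-64 machine code for:  mov eax, 42 ; ret
code = bytes([0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3])

# Anonymous mapping that is readable, writable, and executable.
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(code)

# Take the mapping's address and cast it to a callable function pointer.
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
func = ctypes.CFUNCTYPE(ctypes.c_int)(addr)

if platform.machine() == "x86_64":   # only call on the matching ISA
    print(func())                    # 42
```

A real timing harness would put an RDTSC read on either side of the call
instead of timing from Python, since interpreter overhead would swamp a
cycle-level measurement.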

To keep it simple, it would be nice to outsource the heavy lifting to existing
libraries and programs. Their availability might dictate my choice of
language. I'm most comfortable with Python, Perl or C, but something else
might be fun if all the tools are available. Here's my current list of
potential tools:

For disassembly, DiStorm from Python or C. Haven't found anything great for
Perl. Maybe X86::Udis86? Maybe just parse 'objdump -d'? My needs are very
simple since I'll already have machine code.

For calling assembly, wrappers for libffi if using Perl or Python; nothing
needed for C. I guess this could be trickier if I want to consider different
alignments, but I think I'll start without.

For timing, I don't think it's usually possible to call RDTSC from Python or
Perl, but since I'm planning to be able to execute arbitrary assembly, this
shouldn't be any harder to add. It's a much more travelled path in C, though.

For generating all the possible topological sortings, I'll just write
something following Knuth's approach. It would be nice if I could generate and
call the functions from the sorting algorithm so I can monitor progress and
start getting results sooner.

I think my weakest part is generating the dependency graph. I'm willing to cut
corners on this by making lots of assumptions about the code, but it needs to
be right. I can parse the output from Intel's IACA if necessary, but would be
nice to have something more easily available. Are there libraries that do
this?

Suggestions and alternative approaches are greatly appreciated. My thought is that
once it works, it shouldn't be that hard to add some of the "substitute
different code" approaches. I like the idea of having a database of approaches
one could try out. Of course, what I really hope is that something exactly
like this already exists, and I'm just searching by the wrong terms.

------
fchollet
It would seem intuitive that the search process would be greatly sped up by
using a genetic search of the solution space rather than brute force search;
it's a bit odd that genetic algorithms aren't mentioned.

~~~
jacques_chester
Two things.

1. You're thinking of genetic programming, not genetic algorithms.

2. They used the Markov chain Monte Carlo method to perform the search. It's
a different method altogether. For example, note the absence of crossovers.

~~~
Houshalter
I think that it would be correct to call it a genetic algorithm. I thought
that genetic programming usually uses a tree structure in a functional
programming language.

~~~
jacques_chester
Most GP representations are of instruction trees, but some GP experiments have
been done with strings of assembler instructions. The commonality is that the
population of agents are programs and the test of fitness is a function
execution.

~~~
Houshalter
The problem with that is that it's really hard to crossover strings of
instructions in ways that don't completely break it, especially when code is
inserted or removed from one parent. And crossover is 90% of evolution (in
fact some claim that genetic programming works just as well or better with no
mutations whatsoever and only crossover being used.)

~~~
jacques_chester
It's hard, but it's been done. The nice side effect is that fitness functions
evaluate quickly.

------
taliesinb
And so an NKS prediction comes to pass.

------
LolWolf
Nice, that's pretty awesome.

------
asimpletune
This is pretty cool.

