
Stochastic Superoptimization [pdf] - eslaught
http://theory.stanford.edu/~aiken/publications/papers/asplos13.pdf
======
nkurz
There was earlier discussion of this paper here about a year ago:
[https://news.ycombinator.com/item?id=5509254](https://news.ycombinator.com/item?id=5509254)

------
cschmidt
Abstract:

We formulate the loop-free binary superoptimization task as a stochastic
search problem. The competing constraints of transformation correctness and
performance improvement are encoded as terms in a cost function, and a Markov
Chain Monte Carlo sampler is used to rapidly explore the space of all possible
programs to find one that is an optimization of a given target program.
Although our method sacrifices completeness, the scope of programs we are able
to consider, and the resulting quality of the programs that we produce, far
exceed those of existing superoptimizers. Beginning from binaries compiled by
llvm -O0 for 64-bit x86, our prototype implementation, STOKE, is able to
produce programs which either match or outperform the code produced by gcc
-O3, icc -O3, and in some cases, expert handwritten assembly.
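The search the abstract describes (a cost function mixing correctness and performance terms, explored with an MCMC sampler) can be sketched with a toy Metropolis search over straight-line "programs". This is a hypothetical illustration, not STOKE's actual code: the ISA, cost weights, and move set are all made up, and real STOKE operates on x86-64 with far richer transformations.

```python
import math, random

TESTS = [0, 1, 5, 13]                   # test inputs (the fast correctness check)
def target(x): return 2 * x + 1         # the behavior we want to match

# A toy ISA: each instruction applies one operation with a small constant.
OPS = {"add": lambda a, c: a + c,
       "mul": lambda a, c: a * c,
       "shl": lambda a, c: a << c}

def run(prog, x):
    for op, c in prog:
        x = OPS[op](x, c)
    return x

def cost(prog):
    # Correctness term: total error over the test inputs ...
    wrong = sum(abs(run(prog, x) - target(x)) for x in TESTS)
    # ... plus a performance term: shorter programs are cheaper.
    return wrong + 0.1 * len(prog)

def mutate(prog):
    prog = list(prog)
    kind = random.random()
    if kind < 0.4 and prog:                      # rewrite one instruction
        i = random.randrange(len(prog))
        prog[i] = (random.choice(list(OPS)), random.randint(0, 3))
    elif kind < 0.7 and len(prog) > 1:           # delete one instruction
        del prog[random.randrange(len(prog))]
    else:                                        # insert one instruction
        prog.insert(random.randrange(len(prog) + 1),
                    (random.choice(list(OPS)), random.randint(0, 3)))
    return prog

def mcmc(start, steps=20000, temp=1.0):
    cur, ccost = start, cost(start)
    best, bcost = cur, ccost
    for _ in range(steps):
        cand = mutate(cur)
        ncost = cost(cand)
        # Metropolis rule: always accept improvements,
        # sometimes accept regressions (to escape local minima).
        if ncost <= ccost or random.random() < math.exp((ccost - ncost) / temp):
            cur, ccost = cand, ncost
            if ccost < bcost:
                best, bcost = cand, ccost
    return best, bcost

random.seed(0)
start = [("add", 0)] * 6                # a deliberately bloated starting program
best, bcost = mcmc(start)
```

The single cost function is what lets the sampler trade correctness against speed during the search; only candidates that pass full verification are ever reported as optimizations.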

~~~
michaelochurch
What's "the loop-free binary superoptimization task"?

~~~
icegreentea
Superoptimization is finding -the- optimal program to perform a task, as
opposed to optimization which only looks for better programs. Loop-free
means... no loops.

That said, their approach does not actually give you the superoptimal
solution, since it is a stochastic search.

------
jules
Using MCMC for optimization is really strange. MCMC is for generating random
samples from a probability distribution, not for optimization. Some variant of
hill climbing would be more appropriate here.

~~~
pavpanchekha
Strange, but ingenious. Hill climbing is bad for these sorts of problems since it
gets stuck in local maxima, and as the charts in the evaluation show, a
program's cost function often has multiple maxima. So instead you need one of a
variety of methods that are guaranteed to eventually find a global maximum.
Luckily, MCMC gives you such a guarantee, with better asymptotic
probabilistic bounds than simulated annealing. The evaluation includes a
comparison of MCMC against annealing and hill climbing.
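The local-maxima point can be seen on a tiny made-up 1-D cost landscape: greedy hill climbing halts at the first local minimum, while the Metropolis rule's occasional uphill moves let it cross the barrier. (Hypothetical toy, minimizing cost rather than maximizing, which is the equivalent convention.)

```python
import math, random

# A 1-D cost landscape: local minimum at index 2, global minimum at index 7.
LANDSCAPE = [5, 3, 1, 3, 4, 3, 2, 0, 2, 5]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]

def hill_climb(i):
    # Greedy descent: only ever move to a strictly better neighbor.
    while True:
        best = min(neighbors(i), key=lambda j: LANDSCAPE[j])
        if LANDSCAPE[best] >= LANDSCAPE[i]:
            return i                    # no better neighbor: stuck
        i = best

def metropolis(i, steps=5000, temp=1.0):
    best = i
    for _ in range(steps):
        j = random.choice(neighbors(i))
        d = LANDSCAPE[j] - LANDSCAPE[i]
        # Accept uphill moves with probability exp(-d / temp).
        if d <= 0 or random.random() < math.exp(-d / temp):
            i = j
            if LANDSCAPE[i] < LANDSCAPE[best]:
                best = i
    return best

random.seed(1)
stuck = hill_climb(0)      # greedy search halts in the local minimum (index 2)
found = metropolis(0)      # random uphill moves let it reach the global one
```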

~~~
thisisdave
>with better asymptotic probabilistic bounds than simulated annealing

This seems unlikely to me. The Metropolis algorithm is just simulated
annealing run at a constant temperature, which means that simulated annealing
includes Metropolis MCMC as a special case.

Also, while simulated annealing is guaranteed to find the global optimum given
infinite time (almost sure convergence), I'm not aware of any such guarantees
for Metropolis.
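The special-case relationship is easy to make concrete: both methods use the same acceptance rule, and annealing differs only in letting the temperature vary with time. A minimal sketch (hypothetical function names, not from any particular library):

```python
import math, random

def accept(delta, temp):
    # Shared Metropolis acceptance rule: downhill moves always accepted,
    # uphill moves accepted with probability exp(-delta / temp).
    return delta <= 0 or random.random() < math.exp(-delta / temp)

def metropolis_step(delta, temp=1.0):
    return accept(delta, temp)                 # fixed temperature throughout

def annealing_step(delta, step, schedule=lambda k: 1.0 / (1 + k)):
    return accept(delta, schedule(step))       # temperature decays over time

# With a constant schedule, annealing reduces to plain Metropolis:
deltas = (1.0, -0.5, 2.0)
random.seed(0); a = [metropolis_step(d) for d in deltas]
random.seed(0); b = [annealing_step(d, k, schedule=lambda k: 1.0)
                     for k, d in enumerate(deltas)]
```

Seeding the generator identically before each run makes the two decision sequences comparable draw for draw.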

------
mturmon
I'm used to MCMC for statistics problems, and it seems to me the authors are
being too optimistic about its feasibility as a search strategy.

Also, from a technical point of view, it seems like they started out with an
optimization problem on a hard domain, and then made their problem even harder
by using a sampling algorithm with stochastic moves (i.e., MCMC). The
particular usefulness of MCMC is sampling, not optimization, and for this
problem, you don't really want to sample.

~~~
pavpanchekha
As per the evaluation, STOKE works great. So if they made a hard problem
harder, they clearly didn't make it too hard.

~~~
mturmon
Let's stipulate that it worked great on the problem they analyzed. That
problem was really very simple. If they had used a more appropriate technique,
they could have expanded the domain of applicability significantly.

Like I said, I've used MCMC for loosely-structured inference problems (but
more structured than the one treated in the paper). I was doing MAP estimation
(i.e., maximizing the objective function), just as in the paper.

When speaking about my work, I was once asked "Why MCMC on such a hard problem
if you're really just maximizing?" It's still a good question.

------
frozenport
How can they validate that the code is correct? At the extreme end, they may
LTO out a piece of code that isn't used by any loops but might include
critical error recovery.

~~~
kosievdmerwe
Looking through the summary (and working from memories from a while back, so I
may be confusing this superoptimizer with other superoptimizers), they have a
two-step process for verification. For certain fixed inputs they check that
the code outputs the same values (this can also be used as a hashing
mechanism). Then afterwards they do a symbolic comparison of the initial asm
and the final asm, by converting each asm sequence into SMT[1] statements,
feeding them the same input symbols, and checking that their output symbols
are the same. The tricky part here is doing the conversion, as there are a
great many x86-64/SSE instructions; once you've done it, however, you can just
use an off-the-shelf solver.

This is why it is important that the code is loop-free: otherwise you couldn't
use SMT and would essentially have to solve the halting problem. SMT, like
SAT, is an NP-complete problem, but the good thing is that in most cases
(according to the paper) the generated SMT problems are solved quickly, as
they weren't designed to be diabolical.

[1]
[http://en.wikipedia.org/wiki/Satisfiability_Modulo_Theories](http://en.wikipedia.org/wiki/Satisfiability_Modulo_Theories)
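The two-phase check described above can be sketched in miniature. This is a hypothetical stand-in: the "programs" are plain Python functions over 8-bit values, and exhaustive enumeration over the tiny domain plays the role that SMT solving plays for real x86-64 sequences.

```python
import random

MASK = 0xFF  # model 8-bit registers so the exhaustive check stays cheap

def original(x):
    return (x * 2) & MASK           # the target sequence: multiply by two

def rewrite(x):
    return (x << 1) & MASK          # the candidate rewrite: shift left by one

# Phase 1: fast screen on random test inputs. Cheap, and rejects most
# incorrect candidates without ever invoking the expensive check.
tests = [random.randrange(256) for _ in range(16)]
screened = all(original(x) == rewrite(x) for x in tests)

# Phase 2: only if the screen passes, run the full equivalence check.
# Here the 8-bit domain is small enough to enumerate outright; for real
# instruction sequences the tool instead emits SMT formulas over symbolic
# inputs and hands them to an off-the-shelf solver.
verified = screened and all(original(x) == rewrite(x) for x in range(256))
```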

~~~
gwern
And SMT solvers pop up in _another_ interesting paper! It's fascinating how
over the last decade, I've started seeing SMT appear in practically anything
and everything.

------
PaulJulius
I had Alex Aiken as my professor for the class I took on compilers last
Spring. I really enjoyed his teaching and I remember him discussing this
project during the last lecture. Very cool stuff. The class is still probably
my favorite class I've taken.

------
userbinator
That example in figure 1 really confused me for a moment, since it's a mix of
GAS and Intel syntax - no $ or %, but with size suffixes and the opposite
operand order.

It's interesting to observe that "brute force" approaches to optimisation are
yielding very good results, but I've always believed (and in this case, the
"in _some_ cases, expert handwritten assembly" phrase somewhat seems to
support this) that making compilers more _intelligent_ is the way to go --
make them "think" more like a human Asm programmer would.

For example, using their first sequence, 3 out of 11 instructions are moves
that do nothing more than unproductive "register shuffling", which suggests to
me that there's still room for improvement. As the saying goes, "the fastest
way to do something is to not do it at all." That shuffling is only there
because they constrained their register usage, and if I were implementing this,
I'd arrange the code that came before such that the right operands are
naturally in the right registers when instructions that require them (e.g.
mul) are used. To my knowledge, this technique is not implemented in any
compiler and none of the existing research on register allocation or
optimisation mentions anything like it; instead they all seem to suggest
"introduce extra moves and hope we can somehow remove them"... which doesn't
work as well as _planning ahead_ so you don't ever need them in the first
place. Moves should only be used for making copies of values, since
instructions that do other operations can implicitly "move" data as part of
their operation anyway. Applying this principle to their first example
eliminates two moves and yields this sequence:

    
    
        shlq 32, rcx
        movl edx, eax  <-- move + zero-extend
        xorq rcx, rax  <-- why xor? or probably works too
        mulq rsi
        addq r8, rdi
        adcq 0, rdx
        addq rax, rdi  <-- we want result in rdi anyway
        adcq 0, rdx
        movq rdx, r8   <-- this is still "room for improvement"
    

I think the same goes for optimisation as a separate pass - the idea of
generating horribly inefficient code (gcc -O0 is a great example) and then
trying to optimise it just doesn't make much sense to me; in my mind, if I
tell a compiler to do max size or speed optimisation, it should be selecting
and generating instructions that are pretty close to optimal already. I wonder
if this is a result of that famous "premature optimisation" quote...

"Fastest" is also something that can be highly dependent on the processor
model, so this is also important to keep in mind if you're compiling on one
machine but executing on another. If anyone would like to, I'd be really
interested in seeing the performance of my two-moves-less sequence above vs
the one in the paper (don't have a 64-bit machine to test this on at the
moment.)

~~~
msandford
A lot of times the "do-nothings" are actually important for software
pipelining. They enable the hardware scheduler to "figure out" that things are
independent and to overlap their execution. Too tightly coupled, and the
INCREDIBLY limited on-die smarts can't separate things.

------
onurgu
Is this implemented?

~~~
eslaught
Yes, the code in Figure 1 (right side) is the output of the STOKE
implementation.

~~~
onurgu
I saw that, but I am curious whether this is implemented in a compiler, say
gcc?

~~~
eslaught
It's not built into any compiler, but because it takes x86 machine code as
input, that's not really a limitation. You can run any compiler (GCC, LLVM,
MSVC, Intel, etc.) and run STOKE on the output to optimize the code. You can
even run it on binaries for which you have no access to the source code (e.g.
proprietary programs).

Given that STOKE takes tens of minutes to optimize a couple dozen lines of
assembly, you probably wouldn't want to include it as a general compiler
optimization. If you want the performance badly enough to use this tool, you
are probably willing to run a profiler to find the hotspots and focus the tool
on those specifically (which is how this tool is meant to be run).

That said, there are alternatives which are able to run at closer to normal
compiler speeds, such as the peephole superoptimizer:

[http://theory.stanford.edu/~aiken/publications/papers/osdi08...](http://theory.stanford.edu/~aiken/publications/papers/osdi08.pdf)
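A peephole superoptimizer of that flavor exhaustively enumerates short instruction sequences, shortest first, so the first match it finds is provably minimal (the completeness STOKE trades away for scale). A toy enumerator over a made-up three-instruction ISA, purely for illustration:

```python
from itertools import product

# Toy ISA: each instruction maps an 8-bit accumulator to a new value.
ISA = {
    "not": lambda a: (~a) & 0xFF,
    "inc": lambda a: (a + 1) & 0xFF,
    "dec": lambda a: (a - 1) & 0xFF,
}

def run(prog, a):
    for op in prog:
        a = ISA[op](a)
    return a

def superoptimize(target, max_len=3):
    # Enumerate programs shortest-first; the first match is minimal by
    # construction. Cost is exponential in length, hence "short sequences only".
    for n in range(1, max_len + 1):
        for prog in product(ISA, repeat=n):
            if all(run(prog, a) == target(a) for a in range(256)):
                return list(prog)
    return None

# Shortest sequence computing two's-complement negation without a neg op:
found = superoptimize(lambda a: (-a) & 0xFF, max_len=2)
```

It should rediscover the classic identity -a = ~a + 1. Exhaustive enumeration like this is only feasible for very short sequences, which is why the peephole approach amortizes the search into precomputed rewrite tables.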

