
Superoptimizing Compilers - dmmalam
http://superoptimization.org/wiki/Superoptimizing_Compilers
======
wbhart
The GMP (GNU Multi Precision library) guys found that the hand-written
assembly sequences (which were already 2-6x faster than those produced by a
decent C compiler) could be sped up by another factor of 2x by
superoptimisation, at least on out-of-order CPUs.

Superoptimisation isn't only feasible, it is actually quite practical. A
factor of 2x across an already highly optimised library can cut your
supercomputing budget in a grant by half. Whether you need 1 or 2 million CPU
hours can make or break entire projects.

These days, however, one can often do better with SIMD instructions. These
often don't superoptimise well because the individual atomic operations they
perform can't be rearranged by the programmer. However, lots of existing
supercomputers available for research still don't have CPUs with decent SIMD
support.

~~~
gopalv
SIMD auto-vectorization is an offshoot of this research.

But what excites me about super-optimization is making functional
optimization repeatable.

Let me give you an example of why I'd use it for more than HPC libs.

Assume you have a firewall system, which needs to determine whether a tuple of
(srcip, dstip, dstport) is a safe one or not.

All 3 variables fit inside registers (assume xmm).

A rules engine as traditionally written would be a hashtable of some sort,
always regenerated with a gperf hash (which already takes us partway into
super-optimal territory).

But what if going to RAM is 300x more expensive than a SIMD sequence over 3
registers run a few hundred times?

Then writing a functional chunk of code which produces the same outputs for
the same inputs, but without any storage system would be a killer feature.

That is sort of the problem where a super-optimizer can take a huge amount of
data and turn it into bit-twiddling checks which find hidden invariants in
each data-set.

Being able to take a bunch of rules describing subnet-to-subnet or port access
patterns and compile them into a storage-less ACL mechanism (like: if bit 11
of the port is 1, then DENY) would be exactly that kind of feature.
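
Very roughly, and purely as an illustration (the rule, the field layout and
the deny() function below are all invented, not taken from any real system),
the kind of output I have in mind is a branch-free, table-free check over
values that already sit in registers:

    #include <stdint.h>

    /* Hypothetical result of compiling one DENY rule ("traffic to
       10.0.0.0/8 on a privileged port") into pure register arithmetic:
       no hashtable, no trip to RAM. */
    static inline int deny(uint32_t srcip, uint32_t dstip, uint16_t dstport)
    {
        int to_private_net = (dstip >> 24) == 10;  /* dstip in 10.0.0.0/8? */
        int privileged     = dstport < 1024;       /* low (privileged) port? */
        (void)srcip;                               /* unused by this toy rule */
        return to_private_net & privileged;        /* 1 = DENY, 0 = allow */
    }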

I'm almost certain these problems were solved in the past using a super-
optimized FPGA as well (well, I would have done it that way if I had the
budget).

For me, being able to churn a dumb SWITCH loop through these systems is more
interesting than chucking well written code into this.

~~~
DanWaterworth
I don't think this will be as successful as you might hope. Consider how many
possible sets of rules there are. Call that n. The instruction sequence will
need to encode log2 n bits of data, but you have no ability to tune the
representation to say which inputs are likely.

Instructions aren't free, they still need to be loaded from RAM like data.
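
To put rough numbers on that counting argument (the figures are mine, purely
illustrative):

    \[
      n \text{ distinguishable rule sets} \;\Rightarrow\;
      \text{the generated code must encode at least } \lceil \log_2 n \rceil \text{ bits,}
    \]
    \[
      \text{e.g. } n = 2^{10000} \;\Rightarrow\;
      \text{at least } 10000 \text{ bits} \approx 1.25\ \text{kB of instructions,
      fetched from RAM like any table.}
    \]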

------
dustingetz
There is another way to super-optimize (this is not very related to the
approach outlined in OP).

See: "Morte: an intermediate language for super-optimizing functional
programs"
([https://news.ycombinator.com/item?id=8308282](https://news.ycombinator.com/item?id=8308282))

TLDR: it trades Turing completeness (no infinite loops) for the ability to
decide "do these two differently coded functions compute the same thing".

Thus "if two programs are equal then they generate identical executables. Such
a language would be immune to abstraction: no matter how many layers of
indirection you might add the binary size and runtime performance would be
unaffected." (quote from linked article)

~~~
hawkice
To rephrase/expand your tl;dr slightly: you can express Turing-complete
programs using Morte. But it only does a certain number of well-defined things
(like inlining) to reduce an expression and compare it, so some programs that
are 'the same' due to a more complex isomorphism could generate different
executables. This is fine, because program isomorphism in general is basically
the halting problem.

~~~
ttctciyf
> program isomorphism in general is basically the halting problem.

How far does this go? On the face of it, determining whether program X halts,
period, could be different from determining whether program X halts given
that some program Y halts. The fact that program Y halts would, I guess, count
as an additional axiom in the system, and depending on the details of program
Y it could be more or less helpful in determining that X halts?

~~~
drostie
Yep! If you want to learn a lot more about this stuff, search for the buzzword
"codata." Basically, we can start to define type systems which "wall off" a
section (which may or may not have all the things you want to compute in
practice; we don't know!) which follows simple recursive rules from a base
case up. This section is called "data", acting upon it programmatically is
called "recursion", and proofs about it are called "induction". This part of
the computer is _not_ Turing-complete and therefore we _can_ solve the halting
problem. Everything else is "codata" acted upon with "corecursion" and
reasoned about with "coinduction".

One super-awesome thing about the "data" used in codata systems is that, once
you've proven that something halts, there is no formal distinction between
lazy and eager evaluation -- it simply becomes a choice that the compiler
makes, perhaps guided by pragmas inserted by a smart programmer.

One not-so-awesome thing: consider the normal model of the natural numbers,
`data Nat = Zero | Succ Nat` where every number is either 0 or the successor
of another natural number. The obvious definition of _recursion_ involves
stripping off a `Succ` structure one level deeper. Anything which follows that
pattern is proven to halt, so it's a whitelisted computation that can be used
in other data-computations.

But now, consider the usual algorithm for GCD:

    
    
        gcd' m n | n == 0    = m
                 | m < n     = gcd' n m 
                 | otherwise = gcd' n (mod m n)
    

Is this corecursive or recursive? It's clearly recursive. But does the
computer know that? We have to clue it in. We need to tell the computer that,
beyond our definition of recursion here -- "it's recursive if we only go from
n to n - 1" -- a larger theorem holds: "it's recursive if we only go from n to
some number strictly less than n". Even then, we have to prove that
`mod m n < n`, and our system needs to be guided to understand that it's _okay_
if `m` increases as long as `n` is strictly decreasing (the arguments could
have been flipped, after all).

So these are your "additional axioms in the system." They come when your
"recursion" method is not the most obvious way to traverse the data structure;
in this case we're using argument swaps and jumping through the graph of N to
traverse two data structures simultaneously, and that creates a bunch of
headaches.
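
For concreteness (this is my sketch of the measure argument, not part of the
comment above): the obligation can be phrased as a measure on the second
argument that strictly decreases at every recursive call,

    \[
      \mu(m, n) = n, \qquad
      m < n \;\Rightarrow\; \mu(n, m) = m < n,
    \]
    \[
      n > 0,\ m \ge n \;\Rightarrow\; \mu(n,\ m \bmod n) = m \bmod n < n,
    \]

so every call strictly decreases a natural number and gcd' terminates; that
decreasing measure is exactly what the checker has to be told about.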

~~~
ttctciyf
Wow, thanks a lot for the detail! That is quite a bit more practical than I
imagined.

------
Strilanc
The site appears to be down. It's in Google's cache; it looks like just a
project description with some linked reports. Here's a cache of the final
report [1].

1:
[http://webcache.googleusercontent.com/search?q=cache%3Ahttp%...](http://webcache.googleusercontent.com/search?q=cache%3Ahttp%3A%2F%2Fsuperoptimization.org%2Fw%2Fimages%2Ff%2Ff1%2FWp4.pdf&ie=utf-8&oe=utf-8)

------
drallison
[http://en.wikipedia.org/wiki/Superoptimization](http://en.wikipedia.org/wiki/Superoptimization)
and/or the articles it cites should have been cited and consulted. Where
mechanical discovery is not appropriate and you know what you want to
accomplish (e.g., division by a constant integer at the cost of a
multiplication, or computing the number of one bits in a word), the "tricks"
described in Warren's _Hacker's Delight_ can be applied. Some, but not all,
optimizing compilers look for sequences for which there is some known
"trick".

~~~
agumonkey
Massalin's paper was surprisingly accessible and inspiring:
[http://en.wikipedia.org/wiki/Alexia_Massalin](http://en.wikipedia.org/wiki/Alexia_Massalin)

------
enos_feedler
I think it's worthless to find the "optimal" sequence of instructions these
days. The performance of a computer system depends far more on the runtime
context than on anything a compiler can infer from compile-time information.

I worked on production compilers at a major chip company, and the best thing
we can do is improve the context in which a compiler is deployed. Runtime
systems and compilers have to cooperate better. The compiler needs to have an
interface that is cheap (low overhead) for the runtime to call into and make
changes to programs. The runtime needs to sample more architectural perf
counters that matter and send them into the compiler so that it can make
smarter decisions.

~~~
clord
Also worked on compilers at a major chip company. Generally it's better to
have clear code than fast code. For that reason I would say that the future of
optimization is things like whole-program optimization that exploits language
features to improve the asymptotic costs of mechanically provable algorithms.
For a primitive example, consider taking a simple Haskell program that is
provably correct but needs O(n^3). Put it through some 'equivalent optimal
implementation' system that provably spits out an algorithm that does the same
task in O(n lg n).

We did stuff like this for simple matrices in Fortran and C, but I think it is
generalizable to more problems than that.

And of course, I agree that most of the time no one really cares except HPC.

~~~
meowface
The problem is that the subset of programs which can be proved correct (in
2015) is mostly restricted to uninteresting problems, solved problems, or
both.

~~~
clord
Not always true. For a trivial example, in Fortran code we'd often see people
wanting to write code that calls runtime methods directly because it leads to
clearer implementations. The language rules were such that we could prove
invariants on these runtime calls that were much stronger and farther-reaching
than what the general bitcode optimizer could do, resulting in a class of
matrix operations going from n^3 to n*lg(n), and we could also switch the
stride direction to make better use of cache because of the algorithm change.
This sort of thing was only possible because language knowledge was exploited;
generic optimizers will miss this stuff.

I think the future of languages like Haskell is bright in terms of
performance, because they are so loosey-goosey about what gets evaluated when,
which gives language-aware optimizers a lot of meat to chew on. An aggressive
optimizer could turn out versions of the same code that run screaming-fast
single-threaded, or that fully exploit massively parallel machines, or even
versions for all of those.

------
viraptor
If someone wants to try a real superoptimizer right now, there's the souper
project ([https://github.com/google/souper](https://github.com/google/souper)),
which does this for LLVM IR code, although its goal seems to be slightly
different: "It uses an SMT solver to help identify missing peephole
optimizations in LLVM's midend optimizers."

Still, it does compile some pretty big programs as long as you cache the
results (in redis).

------
ridiculous_fish
Ooh, I can write signum in three instructions on PowerPC:

    
    
        divw r4,r3,r3
        srawi r3,r3,31
        or r3,r3,r4
    

The equivalent in C:

    
    
        int signum(int x) { return (x / x) | (x >> 31); }
    

This uses the hilariously fun fact that integer division by zero silently
gives you zero on PPC.

~~~
unwind
Probably best to stick to assembly for that, since the C version has undefined
behavior (division by zero is UB, and right-shifting a negative number is
implementation-defined), so an optimizing compiler can do whatever it wants
with that code. :)

I'm not trying to be annoying, but I think there's value in not publishing
broken code without at least noting that it is broken. I realize that the code
still communicates _to humans_ how to compute signum(), but it's meaningless
when compiled, which is kind of important sometimes too.
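
For what it's worth, here is a sketch of a formulation without that UB
(portable C; this is my rewrite, not the parent's code):

    /* signum without division or shifts: comparisons are guaranteed to
       yield 0 or 1 in C, so this is well-defined for every int value. */
    int signum(int x)
    {
        return (x > 0) - (x < 0);
    }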

------
Ono-Sendai
It's possible to compute the fastest function for a finite domain (see my
proof here: [http://forwardscattering.org/post/13](http://forwardscattering.org/post/13)).

However, for an infinite domain I strongly suspect it's not possible, due to
the fact that it's impossible to tell in general (i.e. it's uncomputable)
whether two programs compute the same function. This is due to Rice's theorem,
and hence ultimately to the Halting problem.

It's possible there may be some kind of constructive proof but I find it
unlikely.

So in summary, general superoptimisation is probably impossible.

~~~
fiatmoney
Domains are almost never truly infinite. You tend to be bound by at least the
space of memory even in the brute-force case, and the meaningful domain for
any meaningful function (especially one small enough to be optimized
intensively) is much less.

~~~
Ono-Sendai
Domains are not infinite in practice, but I think it's important to establish
the theoretical behaviour and possibilities for the infinite case. They can
for example give us a good indication of the 'asymptotic' behaviours of
algorithms.

------
damianknz
At first I thought the super-optimizer had made a mistake because the results
are not the same, but I think the error is in the original, where line 6
should be

return -1;

Anyway, using assembler for a CPU I made up in my head just now, how about

shl r0,1

and r0,1

~~~
david-given
I don't think that'll work. Consider the positive integer 0b01: the shift
yields 0b10, then the and yields 0.

I've actually tried to come up with a simpler example. It's tougher than you'd
think, and you pretty much have to do weird stuff with flags. The
superoptimiser trick is pretty neat.

------
amelius
I wonder if superoptimization can be performed more effciently by using
statistical techniques based on the input code sequence. And of course,
memoization could be used when compiling the same piece of code twice e.g.
during development (actually I would expect _that_ feature in today's
optimizing compilers).

~~~
TheLoneWolfling
There are superoptimizers that do the first part.

Effectively, build a Markov chain generator from known "fast" executables,
then explore that as the input to the superoptimizer.

------
PaulHoule
I think the web server could use some superoptimization.

------
lectrick
Link 404's

