Better than brute force is using a model of your instruction set (for x86, start here [1]) and then using something like Z3 to find solutions. Here's a paper doing the same [2]. With these approaches you can generate vastly bigger pieces of code than with brute force, and with significantly less hand-tuned heuristic effort (which likely still loses out to current solvers anyway).
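For a flavor of the solver step, here is a minimal sketch using Z3's Python bindings (the candidate is hypothetical, standing in for one produced by enumerating instruction sequences): ask the solver for an input on which candidate and specification disagree; unsat means they agree on every input.

    # Verification step of a solver-aided superoptimizer (z3-solver package).
    from z3 import BitVec, Solver, unsat

    x = BitVec("x", 32)
    spec      = x & (x - 1)   # specification: clear the lowest set bit
    candidate = x & ~(-x)     # hypothetical candidate from the search

    s = Solver()
    s.add(spec != candidate)  # look for a distinguishing input
    if s.check() == unsat:
        print("equivalent on all 2**32 inputs")
    else:
        print("counterexample:", s.model())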
If you are familiar with that paper, I'm a little confused by the results section. If I'm reading p01 from figure 7 correctly, the "before" is x&(x-1) and the "after" is a single instruction x&(x<<1). What am I doing wrong?
(The other two wins listed are "recognize the Hacker's Delight trick and replace it with the straightforward implementation a human would write", which I like.)
Happy to see this! Superoptimizers are one of my all-time favorite topics.
My favorite trick when cutting down the search space of possible programs is to generate DAGs directly instead of an instruction sequence. This way, it's easier to write rules that avoid nonsensical programs and to some extent avoid generating multiple equivalent programs.
For example, to avoid programs that overwrite calculations, you make sure that each node in the generated DAG has something that depends on it.
In a suitable environment the DAG can then be executed directly without lowering to a sequence of instructions.
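As a toy illustration of that rule (my sketch, not the parent's actual generator): represent a candidate as a list of operation triples and reject any DAG containing a node the output doesn't depend on.

    # Toy dead-code filter for enumerated DAGs. A DAG is a list of (op, a, b)
    # triples; operand k < n_inputs names an input, otherwise it names triple
    # k - n_inputs. The last triple is the program's output.
    def is_dead_code_free(dag, n_inputs):
        used, stack = set(), [n_inputs + len(dag) - 1]  # start at the output
        while stack:
            k = stack.pop()
            if k < n_inputs or k in used:
                continue
            used.add(k)
            _, a, b = dag[k - n_inputs]
            stack.extend((a, b))
        return len(used) == len(dag)

    # x & (x - 1) over one input: node 1 = x - 1, node 2 = x AND node 1
    assert is_dead_code_free([("sub1", 0, 0), ("and", 0, 1)], n_inputs=1)
    # a DAG whose first node is never used is rejected
    assert not is_dead_code_free([("add", 0, 0), ("sub1", 0, 0)], n_inputs=1)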
This sounds a little bit like representing Boolean functions as binary decision diagrams (BDDs) -- although for a superoptimizer, presumably each node in the DAG would correspond to an instruction from the target instruction set, which might do something rather complicated, not just a primitive logical operation between a few bits.
(I first read about BDDs in Knuth's The Art of Computer Programming, Volume 4, Fascicle 1, then spent an enjoyable few weeks going down a rabbit hole of building a Boolean function to decide whether an input 2D grid of bits was a maze or not, then uniformly sampling mazes from it...)
The 1987 paper about superoptimizers is well worth the read, BTW. It's quite short. Even the first example on its own (writing the sign function without any conditionals) is a mind-bender, but it becomes straightforward once you see the trick: https://courses.cs.washington.edu/courses/cse501/15sp/papers...
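For reference, here is one standard branch-free formulation of the trick (the well-known C-style version, not necessarily the paper's exact instruction sequence), simulated on 32-bit two's-complement values:

    # Branch-free sign(x) on simulated 32-bit two's-complement integers.
    MASK = 0xFFFFFFFF

    def sign(x):
        x &= MASK
        is_neg = x >> 31                  # sign bit of x: 1 iff x < 0
        is_pos = ((-x) & MASK) >> 31      # sign bit of -x: 1 iff x > 0
        return is_pos - is_neg            # -1, 0, or 1, with no branches

    assert [sign(v) for v in (-5, 0, 7)] == [-1, 0, 1]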
By 'those optimisations' I assume you mean the specific rewrites, not superoptimisation? There has been some interesting recent work in rewrite-rule inference, e.g. Ruler. (I recall one such project did a survey and found that many, possibly a majority, of the rewrites it had generated automatically were not present in the hand-written corpora of gcc and clang. I'm not sure if there was subsequent work to supplement gcc and clang; this was a few years ago now; I think they were looking at gcc 7.)
Maybe one day computers will be fast enough for a `gcc -O4`
-O4
Extreme optimization. GCC uses AI and quantum
computation to warp space-time and invent
entirely new paradigms to make your programs 2% faster.
Warning: May destroy your current universe.
It's a shame it's a bit hard to actually build these days. Its target is people working on compilers, not end users. But there are some odd cases where I'd really like to use it myself, like trying to get a speedup in llama.cpp.
One can now compile against cuBLAS and other GPU libraries to help take some of the load, but the core of the project is indeed to run on the CPU and do all of the work without dependencies.
Bound by: yes. Currently restricted by: possibly not. For example, the performance difference between clang 11- and clang 15-compiled llama.cpp is ~10% (I tested a while ago; can't remember the exact number).
JetBrains was hiring mathematicians working on homotopy type theory (HoTT) to do something similar, if I recall correctly. The idea is that HoTT can be used to generate programs equivalent to the one in your buffer and do something like superoptimization. Maybe I am hallucinating.
Is superoptimization related to supercompilation, or are these unrelated compiler optimization techniques? From what I have read, it seems like Turchin [1] et al. are trying to prove equality using some sort of algebraic rewriting system (i.e., the "intensional" approach), whereas superoptimization uses an extensional technique: first try to prove disequality by testing on a small set of inputs, then apply formal verification to the remaining candidates.
Massalin (1987) [2] calls the first phase "probabilistic execution" and claims that nearly all of the functions which pass the PE test also pass the more rigorous Boolean verification test. Can you give any insight into the benefits of TT over more "automated" optimizations? I am curious if MLTT/HoTT is more suitable for certain compiler optimizations or offers additional expressive power for proving equivalence, or is the benefit mostly ergonomics?
Supercompilation is "top down": beginning with the given program code and trying to optimise it until it can't find any more improvements. It basically runs an interpreter to evaluate the code, but unknown values (e.g. runtime data) must be treated symbolically. It's similar to aggressively inlining (essentially "calling" functions), and aggressively optimising static calculations.
Supercompilation is good at removing unused scaffolding and indirection, e.g. for code that's written defensively/flexibly, supporting a bunch of fallbacks, override hooks, etc. A common problem with supercompilation is increased code size, since it replaces many calls to a single general-purpose function/method with many inlined versions (specialised to various extents).
Superoptimisation is "bottom up": generating small snippets of code from scratch, stopping when it finds something that behaves the same as the original code.
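A toy before/after (hypothetical code, not the output of any real supercompiler) showing the supercompilation side of that contrast: interpreter-like evaluation folds away everything static, leaving a residual program over only the unknown input.

    # Before: flexible, indirect code. The call target and trip count are static.
    def apply_n(f, n, x):
        for _ in range(n):
            x = f(x)
        return x

    def double(x):
        return 2 * x

    def pipeline(x):
        return apply_n(double, 3, x)

    # After (what a supercompiler would residualize): the loop, the calls, and
    # the static arithmetic are all evaluated away; only work on x remains.
    def pipeline_residual(x):
        return 8 * x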
There is a (long abandoned, though) GNU superoptimiser that supports multiple hardware architectures, including HP PA-RISC, Motorola 68k and 88k, and Intel i960, amongst others: https://www.gnu.org/software/superopt/
It is based on exhaustive search with backtracking (the kind of thing Prolog is best at): given a function that takes an input vector of integers and produces an output integer, it attempts to find the shortest instruction sequence that computes it.
It would be interesting to modify this for optimizing BrainF programs.
It has a very simple instruction set and there are lots of example programs available.
To avoid generating useless things like "++--", you could have the optimizer generate instructions that are translated to BrainF operations.
So instead of ">>>--", the optimizer would generate "Move +3; Add -2".
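A minimal sketch of that lowering (the IR names are made up): the search runs over (op, amount) pairs, so canceling runs like "++--" are never generated in the first place.

    # Lower the toy IR from the comment above to brainfuck.
    def lower(ir):
        out = []
        for op, n in ir:
            if op == "move":
                out.append((">" if n > 0 else "<") * abs(n))
            elif op == "add":
                out.append(("+" if n > 0 else "-") * abs(n))
        return "".join(out)

    assert lower([("move", 3), ("add", -2)]) == ">>>--"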
there might not be that many opportunities to optimize brainfuck programs, as the instruction set of the brainfuck virtual machine is so simple. arithmetic is unary, and staying within the brainfuck virtual machine doesn't provide a way to do any better than unary arithmetic.
on the other hand, there are many opportunities for optimization when compiling brainfuck to an actual real-world instruction set
e.g. suppose in your program you want to load the integer 36 into the current cell
a naive way to do this is to increment 36 times, e.g. `++++++++++++++++++++++++++++++++++++`.
if you can't guarantee that the initial value of the current cell is zero, then you could zero it first: `[-]++++++++++++++++++++++++++++++++++++`
if we're optimizing for code size, not instruction count, we could express this as a more complex, compact program: `[-]>[-]++++++[-<++++++>]<` -- assuming it is OK to use the cell one to the right of the pointer as working storage. this will run slower than the less compact naive program as it executes more instructions.
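as a toy follow-up (my sketch), a search over multiply-style encodings of exactly that shape, picking the smallest program for a given constant:

    # Toy size-optimizer for "load constant c": compare the naive
    # zero-then-increment encoding against multiply encodings where
    # a * b == c (using the cell to the right as working storage).
    def best_encoding(c):
        best = "[-]" + "+" * c
        for a in range(2, c):
            if c % a == 0:
                cand = "[-]>[-]" + "+" * a + "[-<" + "+" * (c // a) + ">]<"
                if len(cand) < len(best):
                    best = cand
        return best

    print(best_encoding(36))   # [-]>[-]++++++[-<++++++>]<  (25 ops vs. 39)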
if compiling to some reasonable instruction set, there's likely a single non-BF instruction that lets you load the constant 36 into a register.
extra credit: implementing the optimising brainfuck-to-something compiler in brainfuck itself
> arithmetic is unary, and staying within the brainfuck virtual machine doesn't provide a way to do any better than unary arithmetic.
It won’t be efficient (though who cares about that in a true Turing machine?), but you can do binary/octal/decimal/whatever, using a series of cells each containing a number in the right range.
As an optimization, you can postpone any carries until you want to compare or print such numbers. Addition would ‘just’ loop over the cells in the numbers to be added, adding cells pairwise. For example, binary 11001 + 10101 would yield 21102.
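A small sketch of that deferred-carry scheme, with ordinary Python lists standing in for the brainfuck cells:

    # Digits are stored most-significant first and may temporarily exceed
    # the base; carries are only resolved when comparing or printing.
    def lazy_add(a, b):
        return [x + y for x, y in zip(a, b)]

    def normalize(digits, base=2):
        out, carry = [], 0
        for d in reversed(digits):
            d += carry
            out.append(d % base)
            carry = d // base
        while carry:
            out.append(carry % base)
            carry //= base
        return out[::-1]

    assert lazy_add([1,1,0,0,1], [1,0,1,0,1]) == [2,1,1,0,2]
    assert normalize([2,1,1,0,2]) == [1,0,1,1,1,0]   # 25 + 21 = 46 = 0b101110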
Just as an exercise, you could use brainfuck with genetic algorithms and evolve the program that you want, then try to play with the fitness function to squeeze down whatever you don't want (code size, instruction set, etc.).
Then you could try to superoptimize yourself a solution and compare how close the evolved program was to the superoptimized program.
Could also be interesting with other esoteric languages like befunge or malbolge.
This is begging for genetic programming instead. I used to do GP experiments all the time when I was younger; it's such an easy algorithm and can produce some very interesting results.
Compared to deep learning, genetic algorithms in general feel like child's play!
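For concreteness, a deliberately minimal GP loop (all names and parameters here are made up) that evolves a loop-free brainfuck snippet toward a target cell value, with a length penalty playing the "squeeze down code size" role mentioned above:

    import random

    OPS, TARGET = "+-<>", 36

    def run(prog, n_cells=8):
        cells, ptr = [0] * n_cells, 0
        for c in prog:
            if   c == "+": cells[ptr] = (cells[ptr] + 1) % 256
            elif c == "-": cells[ptr] = (cells[ptr] - 1) % 256
            elif c == ">": ptr = min(ptr + 1, n_cells - 1)
            elif c == "<": ptr = max(ptr - 1, 0)
        return cells[0]

    def fitness(prog):
        return -abs(run(prog) - TARGET) - 0.1 * len(prog)  # error + size penalty

    def mutate(prog):
        i = random.randrange(len(prog) + 1)
        r = random.random()
        if r < 0.4 and prog:                       # point mutation
            j = random.randrange(len(prog))
            return prog[:j] + random.choice(OPS) + prog[j + 1:]
        if r < 0.7:                                # insertion
            return prog[:i] + random.choice(OPS) + prog[i:]
        return prog[:i] + prog[i + 1:]             # deletion

    pop = ["".join(random.choices(OPS, k=10)) for _ in range(100)]
    for _ in range(200):
        pop.sort(key=fitness, reverse=True)        # elitism + mutation
        pop = pop[:20] + [mutate(random.choice(pop[:20])) for _ in range(80)]
    print(pop[0], "->", run(pop[0]))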
I’m curious: Has there been any research on optimization in the opposite direction? Start with the most naively implemented correct function—a giant switch/nested if covering all possible inputs—and transform it piecewise in a correctness-preserving manner?
A similar problem comes up in digital design. Behaviour is often specified in terms of exhaustive truth tables, perhaps tables that were generated automatically from a hardware description language. While such a table can be directly translated into a circuit, doing so in a naive fashion is potentially extremely non-optimal. Optimizing it is a hard problem. Everything has been tried, from exhaustive search (suitable only for small numbers of terms), to rule-based fuzzy backtracking searches, to in recent years, deep learning. But at least when I was in school a couple of decades ago, the best known algorithm was still a person playing with the terms on graph paper, rearranging them until patterns emerged: https://en.wikipedia.org/wiki/Karnaugh_map
As I understand it, there are recent advances in SAT solvers giving practical algorithms for much larger numbers of terms, but I'm much behind the state of the art at this point.
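For a taste of the tabular cousin of the Karnaugh map (Quine-McCluskey), here is its core merge step as a sketch; a full minimizer still needs the prime-implicant covering step afterwards:

    # One Quine-McCluskey merge pass: combine implicants differing in
    # exactly one concrete bit, marking that bit as don't-care ('-').
    def merge_step(implicants):
        merged, used = set(), set()
        for a in implicants:
            for b in implicants:
                diff = [i for i in range(len(a)) if a[i] != b[i]]
                if len(diff) == 1 and '-' not in (a[diff[0]], b[diff[0]]):
                    merged.add(a[:diff[0]] + '-' + a[diff[0] + 1:])
                    used.update((a, b))
        return merged | (implicants - used)

    # f(a,b,c) with minterms {010, 011, 110, 111} minimizes to just "b":
    imps = {"010", "011", "110", "111"}
    print(merge_step(merge_step(imps)))   # {'-1-'}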
I've done this. In my experience it ends up stuck in local minima from which there are no obvious correctness-preserving improvements. Even adding Monte Carlo to escape valleys, this mode of exploration is still too timid to find anything close to optimal.
When you find a truly optimal solution, you end up with the least satisfying proof of correctness: "Here's this inscrutable formula; I verified that it produces the right output for all possible inputs."
Mostly yes. There is a lot of research on taking arbitrary programs and distilling constraints and invariants out of them or doing various forms of function simplification.
This reminds me of an undergrad paper I once wrote, in which I showed that Boolean functions of n variables need asymptotically no more than (and seldom fewer than) 2^n/n Boolean gates.
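That's the classic Shannon–Lupanov circuit-size bound; in the usual asymptotic notation (standard result, my phrasing):

    C(f) \le (1 + o(1))\,\frac{2^n}{n} \quad \text{for every } f\colon\{0,1\}^n \to \{0,1\},
    \qquad
    C(f) \ge (1 - o(1))\,\frac{2^n}{n} \quad \text{for almost every such } f.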
This is essentially calculating the Kolmogorov complexity of the final state of memory with respect to the constrained assembly (and giving the associated program). Since any program in the constrained assembly always halts, this is possible via brute force.
It also doesn't seem particularly interesting because it doesn't allow the programs to take input. Obviously that makes things much more difficult with respect to proving program equivalence.
There are surely ways to further constrain the set of programs to test, e.g. anything that outputs something not in the expected outcome should probably be an early return. I'd love for more work to be poured into this concept. There might be low-hanging fruit, and who knows, maybe an LLVM pass somewhere in the not-too-distant future.
Or Dijkstra's algorithm, which is a form of dynamic programming.
To solve the superoptimization problem, imagine every possible memory state as a node in a graph and each instruction as an edge between two states; the superoptimized program is the shortest path between the starting state and the end state.
You can even use A*, which is very similar to Dijkstra's algorithm but with a sense of direction, materialized by a heuristic function that estimates how close we are to the goal. Here, a simple heuristic could be the number of memory locations that differ from the goal, divided by two; that's because each instruction changes at most two memory locations.
Of course, dynamic programming is a space-time trade-off, so memory consumption is something to consider.
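A small sketch of that framing (the two-register instruction set is invented for illustration): because the heuristic never overestimates, the first path to reach the goal is a provably shortest program.

    import heapq
    from math import ceil

    START, GOAL = (0, 0), (3, 5)          # memory states: two registers

    def successors(state):                # each instruction is an edge
        a, b = state
        yield "inc a",  (a + 1, b)
        yield "inc b",  (a, b + 1)
        yield "swap",   (b, a)
        yield "a += b", (a + b, b)

    def h(state):                         # admissible: one instruction
        diff = sum(s != g for s, g in zip(state, GOAL))
        return ceil(diff / 2)             # changes at most 2 locations

    frontier, seen = [(h(START), 0, START, [])], set()
    while frontier:
        _, cost, state, prog = heapq.heappop(frontier)
        if state == GOAL:
            print(prog)                   # a shortest instruction sequence
            break
        if state in seen:
            continue
        seen.add(state)
        for instr, nxt in successors(state):
            heapq.heappush(frontier,
                           (cost + 1 + h(nxt), cost + 1, nxt, prog + [instr]))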
Why did you use the term "dynamic programming" when referring to the memoization part? The term of art "dynamic programming" is at play here, but it refers to the initial design of breaking the problem into recursive sub-problems.
Oh! I'd always imagined it was the combination of the recursive breakdown and the memoization together, but you're right it seems the DP part is just the recursive bit. Thanks for pointing this out.
Seems like the superoptimizer search graph would be a very good subject for machine-learning algorithms? Are there any good AI superoptimizers out there?
I've found GPT to be pretty good at cleaning up code. I've pasted in isolated blocks of code, asked it to tidy it up, and it often has nice suggestions.
There's a feature coming to Copilot where you can select code in-editor and prompt on it directly.
Superoptimization has absolutely nothing to do with what GPT is doing. It doesn't operate on source code, and it produces actually optimal code sequences (against some cost model) rather than doing some refactoring or tidying.
I'd hate to suggest that the comment about cleaning up code with GPT was written by GPT and posted without that crucial editing step. But the number of people who are using that tool instead of their wetware is certainly growing.
GPT's knowledge isn't up to date enough to know about the feature I mentioned. No, it wasn't written by GPT. It wasn't clear to me from the post that superoptimisation pertains to optimisation of machine code only.
>Good evening, GPT. Can you tell me what superoptimization is, in relation to computer programming?
ChatGPT:
Good evening! Superoptimization, in the context of computer programming, refers to the process of finding the optimal or most efficient implementation of a piece of code for a given task or problem. It involves automatically searching through all possible combinations of instructions and their arrangements to produce the most efficient code in terms of speed, memory usage, or other specific criteria.
Superoptimization is typically performed by specialized tools or algorithms known as superoptimizers. These tools analyze the code and exhaustively explore the space of all possible code sequences, considering both the individual instructions and their order. The goal is to find the shortest or fastest sequence of instructions that achieves the desired output or behavior.
Superoptimization is a computationally expensive process since it involves searching a vast solution space. Due to the exponential nature of the problem, it is often only feasible for small code snippets or performance-critical sections of code. Additionally, superoptimization may not always be practical or necessary because modern compilers and optimization techniques can often generate highly efficient code automatically.
Superoptimization techniques have been primarily used in areas such as compiler design, code generation, and optimizing low-level code for specific architectures. They can be beneficial in situations where manual optimization is not feasible or when there is a need for extreme performance optimizations in constrained environments.
It doesn't pertain to the optimization of machine code only, but the distinguishing feature between optimization and superoptimization is that superoptimization produces the shortest possible program given an instruction set. GPT doesn't do that. Copilot won't do that. Whatever cleanup feature you're describing is a non sequitur here.
[1] https://github.com/zwegner/x86-sat
[2] https://people.cs.umass.edu/~aabhinav/Publications/Unbounded...