
Search-based compiler code generation - JoshTriplett
http://jamey.thesharps.us/2017/06/search-based-compiler-code-generation.html
======
CalChris
I'd compare this to Unison [1,2] where they combine register allocation and
instruction scheduling into a single phase using combinatorial optimization.
Usually you have to choose one before the other.

[1]
[http://llvm.org/devmtg/2017-03//assets/slides/register_alloc...](http://llvm.org/devmtg/2017-03//assets/slides/register_allocation_and_instruction_scheduling_in_unison.pdf)

[2]
[https://www.sics.se/~rcas/publications/CastanedaCarlssonEa_C...](https://www.sics.se/~rcas/publications/CastanedaCarlssonEa_CC_2016.pdf)

------
kodablah
> So a practical implementation needs good heuristics for value-ordering

If I understand correctly, this would require some correlation between
disparate forms of optimization. I would fear that each optimization would
alter/rewrite insns significantly enough that there may not be enough
correlation to build those search shortcuts. Guess-and-check-cost across all
permutations might still be too expensive.
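To make the value-ordering point concrete, here's a minimal branch-and-bound sketch with made-up options and costs (none of this is from the article): trying the heuristically cheapest option first finds a good assignment early, and that early bound prunes most of the remaining permutations instead of guess-and-checking all of them.

```python
def search(choices, cost_of, partial=(), best=(float("inf"), None)):
    """choices: per-slot option lists; cost_of: option -> cost.
    Returns (cost, assignment) of the cheapest complete assignment."""
    bound = sum(cost_of(o) for o in partial)
    if bound >= best[0]:
        return best                      # prune: can't beat the best found so far
    if len(partial) == len(choices):
        return (bound, partial)          # complete assignment
    # value ordering: try the heuristically cheapest option first
    for opt in sorted(choices[len(partial)], key=cost_of):
        best = search(choices, cost_of, partial + (opt,), best)
    return best

# Toy example: three "values", each can be kept in a register, spilled,
# or rematerialized, with hypothetical costs.
COST = {"reg": 1, "remat": 2, "spill": 5}
options = [["reg", "spill", "remat"]] * 3
print(search(options, COST.get))  # → (3, ('reg', 'reg', 'reg'))
```

With a bad value ordering (most expensive first), the same search visits far more of the tree before it can prune, which is the cost blowup being worried about above.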

Along the same lines of determining which combination of optimizations yields
the best overall result, I wonder if one could train a NN to do it. Maybe
there's some work in the area; I don't keep up with research. But in general I
would imagine that, with a large enough training set, the NN could say a spill
in case X on arch Y is likely better than a re-eval. But then you get into
non-determinism, which scares devs. Just a thought.

------
titzer
I am not sure there is an optimal solution to this problem, since there are
many conflicting optimization criteria, and optimizations sometimes work
against each other. In particular, common subexpression elimination reduces
code size and redundancy, but can also increase register pressure and even
produce worse schedules.
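The CSE-vs-pressure tension can be shown on a toy straight-line IR (hypothetical instructions and names, just a sketch): folding the second `a*4` into the first saves a recompute, but it stretches `t1`'s live range across a region that was already busy, raising the peak number of simultaneously live values.

```python
def max_pressure(code):
    """code: list of (dest, operands) in schedule order; classic backwards
    liveness scan, counting the peak number of simultaneously live values."""
    live, peak = set(), 0
    for dest, ops in reversed(code):
        live.discard(dest)       # defined here, so dead above this point
        live.update(ops)         # operands must be live coming in
        peak = max(peak, len(live))
    return peak

# 'a' is live throughout; x and y occupy registers between the two a*4 sites.
recompute = [("t1", ("a",)), (None, ("t1",)),
             ("x", ()), ("y", ()), (None, ("x", "y")),
             ("t2", ("a",)), (None, ("t2",)), (None, ("a",))]
cse       = [("t1", ("a",)), (None, ("t1",)),
             ("x", ()), ("y", ()), (None, ("x", "y")),
             (None, ("t1",)), (None, ("a",))]   # second a*4 folded into t1
print(max_pressure(recompute), max_pressure(cse))  # → 3 4
```

One fewer multiply, one more register: on a machine with three free registers, the CSE'd version now spills.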

We use node splitting in the TurboFan compiler, but only do limited splitting
during the scheduling phase when it detects that the placement of a node that
has multiple uses would produce partially redundant code. Splitting these
nodes allows them to be scheduled independently (essentially, moved down the
schedule, avoiding placing them on paths where they would be redundant), but
of course duplicates code.
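A toy sketch of the partial-redundancy case (a made-up three-way CFG, not TurboFan's actual IR): a pure node used in b1 and b2 must be scheduled in "entry", the only block dominating both uses, so the path through b3 computes it for nothing. Splitting gives each use a private copy that sinks into the use's own block.

```python
# Immediate-dominator table for the toy CFG: entry branches to b1, b2, b3.
DOM = {"entry": None, "b1": "entry", "b2": "entry", "b3": "entry"}

def dom_chain(b):
    out = []
    while b is not None:
        out.append(b)
        b = DOM[b]
    return out

def common_dominator(blocks):
    # Naive intersection of dominator chains; fine for a sketch.
    first, *rest = (dom_chain(b) for b in blocks)
    return next(d for d in first if all(d in c for c in rest))

def evals_on_path(use_blocks, path, split):
    """How many times the node is computed along `path` (a list of blocks)."""
    placements = use_blocks if split else [common_dominator(use_blocks)]
    return sum(path.count(b) for b in placements)

uses = ["b1", "b2"]
print(evals_on_path(uses, ["entry", "b3"], split=False))  # → 1 (dead work on b3)
print(evals_on_path(uses, ["entry", "b3"], split=True))   # → 0 (copies sank away)
print(evals_on_path(uses, ["entry", "b1"], split=True))   # → 1 (computed when used)
```

The duplication cost is visible too: a path that reaches both b1 and b2 would execute two copies where the unsplit node executed once.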

In TurboFan, register allocation happens after instruction selection, which
happens after scheduling. Instruction selection depends on the schedule, since
it generally does not want to combine instructions across basic block
boundaries because applying a selection tile across basic blocks might pull
computations (even if they are just micro-ops) into a loop.

Rematerialization is yet another optimization that works against CSE. Instead
of spilling, the register allocator recomputes a value. This undoes CSE much,
much later, only when the register assignment is partially known.
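The spill-vs-remat decision reduces to a cost comparison once the allocator knows what's in registers. A sketch with invented costs (no real allocator's numbers): recompute the value at its use when that's cheaper than a store plus a reload and the operands are still available, which is exactly where the earlier CSE gets undone.

```python
SPILL_COST = 2 + 3            # hypothetical: store now + reload at next use

def evict(value, remat_cost, operands_available):
    """Pick a strategy for freeing `value`'s register."""
    if operands_available and remat_cost < SPILL_COST:
        return "remat"        # cheaper to redo the computation at the use
    return "spill"            # fall back to memory

print(evict("t1", remat_cost=1, operands_available=True))   # → remat
print(evict("t2", remat_cost=8, operands_available=True))   # → spill
print(evict("t3", remat_cost=1, operands_available=False))  # → spill
```

The third case is why this can only happen late: whether the operands are still live in registers isn't known until the assignment is (at least partially) decided.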

IMO closing that whole optimization loop is just prohibitively expensive (in
compile time).

On the other hand, measuring tiny differences in code quality seems to be no
longer possible on modern processors due to all the runtime nondeterminism
(turbo boost, cache state, reorder buffer state, interrupts, core migration),
so how would we validate optimizations like this with such a small impact?

~~~
CalChris
The point the Unison people make is that it is useful for inspiring and
evaluating heuristics, which certainly makes sense in an AOT context but
decidedly not in a JIT context.

How do you validate these optimizations? Dunno, I'd recommend looking at Agner
Fog's benchmark tools. But torturing the optimizer to make a test suite number
go down will _eventually_ end in tears.

