Hacker News new | comments | show | ask | jobs | submit login
Search-based compiler code generation (thesharps.us)
69 points by JoshTriplett 122 days ago | hide | past | web | 4 comments | favorite

I'd compare this to Unison [1,2] where they combine register allocation and instruction scheduling into a single phase using combinatorial optimization. Usually you have to choose one before the other.

[1] http://llvm.org/devmtg/2017-03//assets/slides/register_alloc...

[2] https://www.sics.se/~rcas/publications/CastanedaCarlssonEa_C...

> So a practical implementation needs good heuristics for value-ordering

If I understand correctly, this would require some correlation between disparate forms of optimization. I would fear that the each optimization would alter/rewrite insns significantly enough that there may not be enough correlation to build those search shortcuts. Too many guess-and-check-cost across all permutations might still be expensive.

Along the same lines of determining which combination of optimizations yields the best overall, I wonder if one could train a NN to do it. Maybe there's some work in the area, I don't keep up with research. But in general I would imagine that with a large enough trained set, the NN could say a spill in case X on arch Y is likely better than a re-eval. But then you get into non-determinism which scares devs. Just a thought.

I am not sure there is an optimal solution to this problem, since there are many conflicting optimization criteria, and optimizations sometimes work against each other. In particular, common subexpression elimination reduces code size and redundancy, but can increase register pressure and even make worse schedules.

We use node splitting in the TurboFan compiler, but only do limited splitting during the scheduling phase when it detects that the placement of a node that has multiple uses would produce partially redundant code. Splitting these nodes allows them to be scheduled independently (essentially, moved down the schedule, avoiding placing them on paths where they would be redundant), but of course duplicates code.

In TurboFan, register allocation happens after instruction selection, which happens after scheduling. Instruction selection depends on the schedule, since it generally does not want to combine instructions across basic block boundaries because applying a selection tile across basic blocks might pull computations (even if they are just micro-ops) into a loop.

Rematerialization is yet another optimization that works against CSE. Instead of spilling, the register allocator recomputes a value. This undoes CSE much, much later, only when the register assignment is partially known.

IMO closing that whole optimization loop is just prohibitively expensive (in compile time).

On the other hand, measuring tiny differences in code quality seems to be no longer possible on modern processors due to all the runtime nondeterminism (turbo boost, cache state, reorder buffer state, interrupts, core migration), so how would we validate optimizations like this with such a small impact?

The point that the Unison people make is that it is useful for inspiring and evaluating heuristics which certainly makes sense in an AOT context but then decidely not in a JIT context.

How do you validate these optimizations? Dunno, I'd recommend looking at Agner Fog's benchmark tools. But torturing the optimizer to make a test suite number go down will eventually end in tears.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact