Yeah, I'm happy to see more experimentation in the lightweight compiler space! M...

Yeah, I'm happy to see more experimentation in the lightweight compiler space! My one-pass pipeline does forward greedy regalloc integrated with the approach I outlined in my post.

The next sweet spot seems to be a two-pass pipeline so you get a full forward and backward dataflow pass: your forward pass does SSA generation integrated with the forward dataflow analyses and the backward pass does machine code generation integrated with backward regalloc and dead code elimination. SSA with backward codegen/regalloc gives you an optimal coloring since the backward order for structured code matches the reverse postorder of the corresponding reducible CFG, and the greedy backward approach gives you decent splitting/coalescing for free. This is basically LuaJIT's approach but it can be extended beyond traces to reducible control flow if you replace the linear trace ordering with the dominator tree ordering. With two passes you should also be able to do something like Cliff Click's style of global code motion (as used in the HotSpot server compiler) with the early-schedule/late-schedule steps integrated into the two passes.

The main limitation of a pure two-pass approach is that you can't do precise SSA generation for loops at the same time you're doing your forward dataflow pass, so you have to be conservative. That blocks most LICM opportunities, so an extra micro-pass per loop is easy to justify, either an SSA pre-pass per loop or something like LuaJIT's loop peeling where you fully peel off the first iteration while identifying loop invariant variables before processing the loop a second time. Then LICM happens automatically via CSE since the peeled iteration dominates the loop body.