I really enjoyed this. It's always surprising how focussing on a teeny tiny part of a topic can clarify so much around it. There's so much complexity in and around compilers, it's very intimidating
$ txr opt.txr data
movl %ecx, %eax
movl %ebx, %edi
movq %r12, %rdi
$ cat opt.txr
@(repeat)
@ (cases)@; eliminate wasteful reverse move
movl @x, @y
movl @y, @x
@ (output)
movl @x, @y
@ (end)
@ (or)@; fallback: output one line, advance by one line
@ line
@ (do (put-line line))
@ (end)
@(end)
The setup here is not general, because we want to iterate on the pattern matching; not just match a pattern, consume that part of the input, transform it, and move on.
The transformed output has to be pushed back into the input. (Iterating the tool on the entire file to a fixed point is possible, but inefficient.)
In the TXR Lisp compiler (which does this kind of optimizing on List structures, not text), I wrote a little macro called rewrite-case for this: this repeatedly applies multiple pattern matching to the start of a list, rewriting the matched part of the list. When a fixed point is reached, it then deposits one item to the output list, and marches down by one item.
> It would be better if we printed out just the first line after discovering that the first three lines do not satisfy any optimization patterns, then check lines two through four for any optimization patterns; if we did that, we could optimize the last three lines.
That's an absolute requirement; if no patterns match, the peephole must move by one instruction.
It's exactly like a lexical analyzer throwing away one character if it doesn't match any token. (Lex will do that by default and print the character to standard output.)
> Interestingly, we cannot actually say with certainty if moving %rdi into %r12 and back again is useless.
Well, yes; peephole optimization is generally done on basic blocks, not on entire programs.
You can peephole a program containing jump labels, but you have to make sure your patterns do not blindly match across them.
For instance, say we want to peephole out this wasteful register reverse move, but one instruction away:
Indeed, the xor-with-itself bit is so basic, it barely counts as an optimization (Clang, in fact, does this even at -O0). It should have been trivially possible to do at initial codegen.