Hacker News
What are the ways compilers recognize complex patterns? (langdev.stackexchange.com)
124 points by azeemba 59 days ago | 29 comments



Canonical forms, smart stuff, a lot of hardcoding.

Canonical forms are the really important part. Simple example: there are multiple ways a program might say “X * 2”. You could shift left by 1. You could multiply by 2. You could add X to itself. The idea of canonical forms is that in one pass, the compiler will pattern-match all the ways you can do this and reduce all of them to the same canonical form - say, left shift by 1. Then, subsequent passes that want to catch more complex uses of that construct only have to look for one version of it (left shift by 1) and not all three.
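
A minimal sketch of the idea in C (the LLVM IR in the comment is approximate, and the exact canonical form can vary by type and target):

    /* Three ways a programmer might write "x times 2". */
    unsigned double_by_mul(unsigned x)   { return x * 2; }
    unsigned double_by_shift(unsigned x) { return x << 1; }
    unsigned double_by_add(unsigned x)   { return x + x; }

    /* After canonicalization, the optimizer can represent all three
       the same way -- roughly "shl i32 %x, 1" -- so later passes only
       need to match that single pattern. */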

Here’s a more complex case. Ternary expressions in C and if-else statements have the same representation in the LLVM IR generated by clang: basic blocks and branches. There are multiple ways of representing the data flow (you could use allocas and stores/loads, or SSA data flow), but both the sroa and mem2reg passes will canonicalize to SSA. And, last I checked, LLVM says that the preferred canonical form of an if-then-else is a select (i.e. a conditional move) whenever the two are equivalent. So, no matter what you use to write the equivalent of std::min - macros, templates, whatever; coding style doesn’t matter - you will eventually end up with a select instruction whose predicate is a comparison. Then, if your CPU supports doing min in a single instruction, it’s trivial for the instruction selector to just look for that kind of select. This happens not because every way of writing min is hardcoded, but because multiple rounds of canonicalization (clang using basic blocks and branches for both if/else and ternaries, sroa and mem2reg preferring SSA, and if-conversion preferring select) get you there.
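
For example (a sketch; the IR in the comment is approximate):

    /* Two common ways to spell "min". */
    int min_ternary(int a, int b) { return a < b ? a : b; }
    int min_if(int a, int b) { if (a < b) return a; return b; }

    /* After canonicalization both should converge on roughly
           %cmp = icmp slt i32 %a, %b
           %sel = select i1 %cmp, i32 %a, i32 %b
       and a backend with a single-instruction min only has to match
       that icmp+select pair. */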

A lot of this is hardcoding, but it’s not the boring “hardcode everything” kind of approach; rather, it’s multiple phases that each produce increasingly canonical code, which makes subsequent pattern matching simpler.


This makes so much sense to me!

I once ended up using a similar approach for a small project at work[0]. It's a tool for cleaning up automatically generated shader files to make them easier to read and optimize. It works entirely by running a series of regex replacements in succession. I very quickly found that choosing the order of these passes carefully created a lot more opportunities for improvement, because I could make more assumptions about the form and patterns of the code. It also really helped that the original generated code was very formulaic.

[0] https://github.com/Agentlien/ShaderCleanup


So you hardcode canonical forms to simplify things down to a common form, so that you only need to hardcode one kind of form for the next pass?


Almost. Canonicalize constructs so that they will be easier to manage in later passes.

So, for example, in the x * 2 case: if you canonicalize it to a shift, the representation of that shift may have other properties (or tags or whatever) like “this operation has no side effects on memory”. A later pass might make sure the register is saved (or decide to discard the shift because its result is not used) without that pass having to know specifically about shifts.

Later passes could have different canonicalizations - say, coupling the shift and its store so that the pair gets hoisted out of a loop as a single thing (this is probably an unrealistic example, but it works for explanatory purposes).
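
In that spirit, here is a simplified sketch of just the hoisting part (plain loop-invariant code motion in C, not the coupled shift-and-store case described above):

    #include <stddef.h>

    /* Before: the loop-invariant shift is recomputed every iteration. */
    void fill_before(int *a, size_t n, int x)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = x << 1;
    }

    /* After hoisting, conceptually: the shift moves out of the loop,
       while the store stays because it depends on i. */
    void fill_after(int *a, size_t n, int x)
    {
        int t = x << 1;
        for (size_t i = 0; i < n; i++)
            a[i] = t;
    }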


Something I don’t really understand about if-then and ternary expressions is that ternary expressions seem like a really intuitive way to express a masked SIMD operation, while if-then does not. I think this must be wrong because they are equivalent, so I guess… why can’t I get over it?


Here’s the trouble with ternaries in C and similar languages: given p ? a : b, a will only execute if p is true and b will only execute if p is false. So that’s a branch. It takes compiler analysis (checking that neither a nor b has effects, or finding a way to move/eliminate those effects) to turn that into a conditional move or some kind of mask.
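
For example (a sketch; the hypothetical functions f and g stand in for anything with side effects):

    int f(void);
    int g(void);

    /* Only one of f() or g() may run, so this needs a branch (or a
       proof that speculating both calls is safe): */
    int pick_call(int p) { return p ? f() : g(); }

    /* Both operands are effect-free, so the compiler is free to
       evaluate both and emit a conditional move / select: */
    int pick_value(int p, int a, int b) { return p ? a : b; }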

So the issue is that ternaries are semantically defined in a way that makes them exactly like branches.

I think that might be an artifact of C being designed before conditional moves and masks were a thing. Maybe newer languages should have a ternary operator that mandates that both a and b execute, and then the compiler can treat it as a conditional move or mask or whatever from the start.
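
Hand-written C for that "both sides always execute" behaviour looks roughly like this (a scalar sketch; the function name is made up):

    /* Branchless select: a and b are both always evaluated.
       mask is all ones when p is nonzero, all zeros otherwise. */
    unsigned select_both(int p, unsigned a, unsigned b)
    {
        unsigned mask = -(unsigned)(p != 0);
        return (a & mask) | (b & ~mask);
    }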


For promoting a branch to a conditional move to be a speed advantage, a and b need to be cheap to compute (and even then p should probably be somewhat unpredictable--you're balancing the cost of computing the value you don't need against the cost of the branch predictor guessing wrong). If a and b are cheap to compute, and safe to speculatively execute, then hoisting them out of that branch into a conditional move instruction is already likely.

Note that this isn't the only common case of tiny basic blocks being created by expressions that can be eliminated: many instances of && and || can be converted to & and |, and it's generally even more beneficial to do so than ternary-to-conditional-move conversion.
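
A sketch of that && case in C:

    /* With &&, the right-hand side must not be evaluated when the left
       is false, which implies a branch: */
    int both_positive_sc(int a, int b) { return a > 0 && b > 0; }

    /* When both sides are cheap and side-effect free, the compiler can
       evaluate both and combine them with a single &, removing the
       branch: */
    int both_positive_eager(int a, int b) { return (a > 0) & (b > 0); }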


> For promoting a branch to a conditional move to be a speed advantage, a and b need to be cheap to compute (and even then p should probably be somewhat unpredictable--you're balancing the cost of computing the value you don't need against the cost of the branch predictor guessing wrong). If a and b are cheap to compute, and safe to speculatively execute, then hoisting them out of that branch into a conditional move instruction is already likely.

Agreed; hopefully I wasn't implying anything different.

> Note that this isn't the only common case of tiny basic blocks being created by expressions that can be eliminated: many instances of && and || can be converted to & and |, and it's generally even more beneficial to do so than ternary-to-conditional-move conversion.

That conversion is only beneficial as a canonical form. Definitely not beneficial for instruction selection, since it means grosser code on most CPUs (you have to do conditional moves or sets on the inputs to the &&/|| and then a logic op and then compare/branch, which has less ILP than just branching twice).


With the recent "EGG" (e-graphs-good) library, canonicalization just becomes "member of the same equivalence class" which makes the problem a lot easier and less error prone to make a rewriting system based optimizer.


This shows up as `sugar` and `desugar` in some materials. Like how `for i in x` is `sugar` for making an iterator and then a `while` loop.
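
The nearest C analogue (a rough sketch; it ignores that the loop variable's scope differs and that `continue` interacts with the increment differently):

    /* The sugar: */
    void sugar(int n)
    {
        for (int i = 0; i < n; i++) {
            /* body */
        }
    }

    /* Roughly what it desugars to: */
    void desugared(int n)
    {
        int i = 0;
        while (i < n) {
            /* body */
            i++;
        }
    }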


Not the same.

Desugaring usually means that the target IR lacks the construct that is the sugar.

Canonical form usually means that the target IR has multiple ways of saying the same thing (x*2, x+x, and x<<1 are all valid), but one of them is canonical, i.e. preferred by opt passes.


also as a "Lowering"


The C code is a low-level implementation of population count. AggressiveInstCombine.cpp first matches the IR for this and raises it into llvm.ctpop.i64.
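
For reference, a typical hand-rolled popcount loop looks something like the sketch below (whether this is the exact variant in the linked godbolt is an assumption on my part):

    #include <stdint.h>

    /* Kernighan's trick: each iteration clears the lowest set bit. */
    int popcount64(uint64_t x)
    {
        int count = 0;
        while (x) {
            x &= x - 1;
            count++;
        }
        return count;
    }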

https://godbolt.org/z/8ronKz3Eb

Later, an x86 backend can re-lower this into a popcnt instruction, a RISC-V backend into CPOP, an ARMv8 backend into CNT, and so on.

https://godbolt.org/z/4zvWs6rzr

It can also be re-lowered to roughly those instructions on machines lacking population count instructions.

https://godbolt.org/z/aYsc9dz7f


Imagine if x86 had had POPCNT since the beginning, implemented in microcode at first and optimised over time to be faster and use more of the available hardware. There would be no need for this sort of "decompiler" in a compiler, nor would software need recompilation for each CPU model.


If they somehow magically knew the future of the whole industry decades in advance, and were able to design the perfect ISA based on that, sure. But that's not how things go in practice. If they'd put in all the opcodes that they thought would be important in the future back in the '80s and implemented them in microcode, I guarantee that we'd be a whole lot worse off now, with loads of pointless opcodes for 5GL and semantic networks and all the other dead ends of the era, and no space for things that turned out to be important like POPCNT.


POPCNT was not from the future; it was from the past.

The Intel designers just assumed that it belonged to the operations that would not be frequently used in the applications expected for their processors.

This was partly because they were not personally familiar with such applications. Even though POPCNT actually has a very wide area of applicability, in the 1970s the only people concerned with the speed of executing POPCNT were some who worked on cryptographic applications, and at that time almost all such work was classified.

However, with such operations there is always a chicken-and-egg problem. When they are not implemented in hardware, programmers and compiler writers avoid expressing algorithms with them and use various workarounds to implement, in a different way, the algorithms that would benefit from them.

This leads to a low frequency of use for such operations, which is then used to justify not implementing them in hardware.

Correctly analyzing whether such operations would be worthwhile in hardware requires much more work, namely writing alternative versions of various algorithms and running them on simulated hardware, and this is almost never done.


There are also a few different algorithms to handle it: https://graphics.stanford.edu/~seander/bithacks.html#CountBi...
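
One of the branch-free variants from that page (the parallel/"SWAR" count for 32-bit values), for reference:

    #include <stdint.h>

    /* Counting bits set, in parallel: sum bits in 2-, 4-, then 8-bit
       groups, then gather the byte sums with a multiply. */
    uint32_t popcount32(uint32_t v)
    {
        v = v - ((v >> 1) & 0x55555555u);
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
        v = (v + (v >> 4)) & 0x0F0F0F0Fu;
        return (v * 0x01010101u) >> 24;
    }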

x86 started as basically a calculator chip. There was not a lot of need for it in that particular space/use case, and more importantly, not a lot of silicon space for it.

What is interesting, however, is how long it took to add it to the x86 instruction set. But at least we have it now.


C predates x86, and was written with the specific intent of being independent of the particular CPU architecture being compiled for.

At any point, the C standard could have introduced a standard POPCNT function that compilers could easily lower in whatever platform-relevant way they want; but such has never happened.
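
Compilers do expose it as a non-standard builtin, though; for example, GCC and Clang have __builtin_popcountll, so code today tends to look like this sketch:

    #include <stdint.h>

    /* Uses a compiler extension, not standard C; the compiler lowers it
       to a popcnt-style instruction or to a software fallback. */
    int popcount_u64(uint64_t x)
    {
        return __builtin_popcountll(x);
    }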


You got this backwards. If x86 had had POPCNT from the beginning but C did not have a standard popcount function, you would still need to express the popcount in source as in this question, so the compiler would still need to recognize the idiom. There would be no need for this sort of pattern matching if C had had a standard popcount function from the beginning, regardless of widespread hardware support.


Which is not at all unrealistic, because POPCNT had been recognized as useful and it had already been implemented in some of the earliest vacuum-tube computers (at the suggestion of Alan Turing). Those early computers, like the Ferranti Mark 1, were much simpler than the Intel 8086.

The name "population count" had been introduced for this instruction by the Cray-1, a couple of years before the launch of the Intel 8086.


"implemented in microcode at first" has massive downsides if the microcode isn't particularly fast. e.g. AMD Zen 1 and Zen 2 support BMI2 and thus pdep & pext, but they're microcoded, so up to hundreds of times slower than on Zen 3 or Intel; so much so that they're a good amount slower than some 90-instruction manual bitwise implementation (potentially doing 4 at a time via SIMD, at least; so even an equivalent microcoded version might be slower as the pdep/pext instructions are scalar), and completely utterly brokenly slow if you want to use it for something like a byte compress.

Similarly, using a microcoded POPCNT on just, say, integers in the range 0..15, would be pretty inefficient, while it's by far the best option on a properly implemented one.

All to say, yeah, it would mean that you wouldn't strictly need recompilation for each CPU model, but it would still be plenty beneficial. And, honestly, "compile for the absolute minimum x86-64 while still tuning for recent CPUs" is just extremely horrible - yes, old hardware will technically run it, but that's the hardware that needs the extra perf tuning the most but ends up getting the least.


probably 'a properly implemented one' would be using combinational logic that can calculate popcnt in a single cycle, or maybe a two-cycle pipelined implementation, which is not the same thing as a microcoded popcnt

here's a simple combinational 8-bit popcnt circuit i just simulated, just using ripple carry. unfortunately i think falstad's circuit.js doesn't have a way to simulate propagation delays; i think the point where you start wanting lookahead carry instead of ripple carry is when you're doing a 16-bit popcnt, because a 64-bit ripple-carry popcnt would have three more full adders of propagation delay over this one

https://www.falstad.com/circuit/circuitjs.html?ctz=CQAgjCAMB...

if you have a lot of popcnts to do, you can productively bitslice them so they're fast even without hardware support


I was indeed using 'a properly implemented one' for a non-microcoded implementation there.


okay, cool :)


That doesn't allow for the possibility of hardware and software co-evolving. In particular, that doesn't allow hardware to get better by providing ways to run common software operations faster.


POPCNT really doesn't have a place on a 16-bit processor, particularly one which can split its 16-bit registers into two 8-bit halves.


POPCNT might not have been very useful in the initial 8086 ISA of 1978, but it would have been at home in the NPX extension of the Intel ISA from 1980 (Numeric Processor Extension, i.e. the instruction set of the Intel 8087), which had operations on 64-bit integers and on the 64-bit significands of 80-bit floating-point numbers.


Meh?

There will always be new CPU instructions that aren’t already part of whatever language you’re using but that correspond to patterns of code people are already writing in that language. And the pattern matching isn’t rocket science. I wouldn’t characterize it as “decompilation”; that makes it seem more magical than it really is.

Popcnt may be a particularly amusing example but it’s far from the only one. A modern C compiler has countless patterns it recognizes, sometimes to match them to instructions, other times just to aid the compiler’s understanding of what’s going on. Usually the latter.


In general: a tree matcher.

In this case: hardcoded search.



