Here's a little gem that I don't see mentioned on there.
For bit unpacking in C, the natural inclination is to shift out the least significant bit with something like this:
u1 getbit(void) { u1 lsb = x & 1; x >>= 1; return lsb; }  /* u1 = unsigned byte, x = global bit buffer */
However, most assembly languages have a better alternative if you instead shift out the most significant bit. Suppose the bit buffer is in ECX. Then you do ADD ECX, ECX to left-shift ECX and (this is the important part) put the shifted-out bit into the carry flag. From there the carry flag can be shifted into the least significant bit of another register, say EAX, with ADC EAX, EAX, or you can branch on the carry with JC/JNC.
As a full example, here is the gamma decoder from apack/aplib, which despite its simplicity rewards careful study:
It would be a little adventure to try to figure out how it was conceived, as although it looks like something a superoptimiser would produce, it was around before superoptimisation, and I think it was generated by a human likely related to the demoscene.
This was a well-known trick for 8080s and Z80s as far back as the 1970s. Back then there were a lot of assembly programmers, and tricks like this were stock in trade.
Reminds me of optimizing microcode back in uni... We went from the reference implementation of 134 microinstructions down to 61, and it was far more cycle-efficient as well (progressive decoding and a Huffman-encoded micro representation, given the sample programs). Since the contest had two separate extra-credit goals, people were pissed that our team captured both. It took about 30 hours to implement.
Microinstructions are like a CPU's firmware: they implement the external-facing macroinstruction set in terms of a simpler set of operations that coordinate various internal state.
The abs, labs, and llabs functions compute the absolute value of an integer j. If the result cannot be represented, the behavior is undefined. 304)
304) The absolute value of the most negative number cannot be represented in two’s complement.
Now almost all code is cache-bound for performance, not cycle-bound. So these gems, while they may still work, may not be the best means of achieving optimal performance. Optimizing cache access is what brings the biggest speedups today -- see, for example, NumExpr[1].
Cache is extremely important, but most of these tricks operate strictly on values in registers, so cycles lost to cache misses, latency, contention, and so on aren't the bottleneck.
A lot of these are crucial enough that they've been codified in transistors now, and generally can execute in a cycle without consuming extra registers. There are still some very valid bit tricks though.
Well I read some and had the impression they were going for speed or low cycle count. Indeed, most surely still work (there doesn't seem to be anything that is processor-specific). Unfortunately my asm is rustier than I thought and I find it quite demanding to read and fully understand them.
I wouldn't go so far as to call them secrets; in my personal experience gurus tend not to keep secrets. They're just neat hacks that not many people know about nowadays.