
Assembly Gems - aburan28
http://dflund.se/~john_e/fr_gems.html
======
psykotic
Here's a little gem that I don't see mentioned on there.

For bit unpacking in C, the natural inclination is to shift out the least
significant bit with something like this:

    
    
        u1 getbit() { u1 lsb = x & 1; x >>= 1; return lsb; }
    

However, most assembly languages have a better alternative if you instead
shift out the most significant bit. Suppose that the bit buffer is in ECX.
Then you would do ADD ECX, ECX to left shift ECX and (this is the important
part) put the shifted-out bit into the carry flag. From there the carry flag
can be shifted into the least significant bit of another register, say EAX,
with ADC EAX, EAX, or you can branch on the carry with JC/JNC.

As a full example, here is the gamma decoder from apack/aplib, which despite
its simplicity rewards careful study:

    
    
        getbit:
            add    dl, dl
            jnz    .stillbitsleft
            mov    dl, [esi]
            inc    esi
            adc    dl, dl
          .stillbitsleft:
            ret
    
        getgamma:
            xor    ecx, ecx
        getgamma_no_ecx:
            inc    ecx
          .getgammaloop:
            call   getbit
            adc    ecx, ecx
            call   getbit
            jc     .getgammaloop
            ret
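To make the trick easier to study, here is a C sketch of the same scheme (the names, the struct, and the 0x80 initialization are mine, not from aplib): the byte-wide bit buffer carries a sentinel 1 bit, so the buffer becoming zero after a shift means all real bits are consumed, and the carry flag of ADD/ADC is modelled explicitly.

```c
#include <stdint.h>

typedef struct {
    const uint8_t *src; /* compressed input stream (esi) */
    uint8_t buf;        /* bit buffer with sentinel (dl) */
} bitreader;

static void br_init(bitreader *br, const uint8_t *src) {
    br->src = src;
    br->buf = 0x80; /* only the sentinel: the first getbit reloads */
}

static unsigned br_getbit(bitreader *br) {
    unsigned carry = br->buf >> 7;      /* add dl, dl: CF = old msb */
    br->buf = (uint8_t)(br->buf << 1);
    if (br->buf == 0) {                 /* buffer exhausted (ZF set) */
        uint8_t byte = *br->src++;      /* mov dl, [esi] / inc esi */
        /* adc dl, dl: the bit just shifted out was the sentinel (1),
         * so it becomes the new sentinel in the lsb */
        br->buf = (uint8_t)((byte << 1) | carry);
        carry = byte >> 7;              /* returned bit = new byte's msb */
    }
    return carry;
}

/* getgamma: msb-first gamma code with an interleaved continuation
 * bit after each data bit */
static unsigned br_getgamma(bitreader *br) {
    unsigned result = 1;                        /* xor ecx, ecx / inc ecx */
    do {
        result = (result << 1) | br_getbit(br); /* adc ecx, ecx */
    } while (br_getbit(br));                    /* jc .getgammaloop */
    return result;
}
```

Note how the sentinel makes the reload self-maintaining: when the buffer empties, the bit shifted out is always the sentinel, so the carry the ADC folds back in is always 1.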

------
userbinator
I think this is one of the more interesting ones for its _extreme_ terseness
and opaqueness:

[http://dflund.se/~john_e/gems/gem003a.html](http://dflund.se/~john_e/gems/gem003a.html)

It would be a little adventure to try to figure out how it was conceived:
although it looks like something a superoptimiser would produce, it predates
superoptimisation, and I think it was written by a human, likely someone from
the demoscene.

~~~
billforsternz
This was a well-known trick for 8080s and Z80s as far back as the 1970s. Back
then there were a lot of assembly programmers, and tricks like this were their
stock-in-trade.

~~~
bro-stick
Reminds me of optimizing microcode back in uni... We went from the reference
implementation of 134 microinstructions down to 61, and it was far more cycle-
efficient as well (progressive decoding and a Huffman-encoded micro
representation, given the sample programs). Since it was a contest with two
separate extra-credit goals, people were pissed that our team captured both.
It took about 30 hours to implement.

Microinstructions are like a CPU's firmware: they implement the external-
facing macroinstruction set in terms of a simpler set of micro-operations that
coordinate various internal state.

------
tasty_freeze
This one has a serious flaw:
[http://dflund.se/~john_e/gems/gem0032.html](http://dflund.se/~john_e/gems/gem0032.html)

The neg (two's complement) of 0x80000000 is 0x80000000. Thus his trick results
in an infinite loop for the most negative value.
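A quick way to see this (a sketch; the function name is mine, and the negation is computed in unsigned arithmetic, where wraparound is well defined, because negating INT32_MIN directly is undefined behavior in C):

```c
#include <stdint.h>

/* Two's-complement negation, 0 - x, in unsigned 32-bit arithmetic.
 * For 0x80000000 the subtraction wraps back to 0x80000000: the most
 * negative value is its own negation. */
uint32_t neg32(uint32_t x) {
    return (uint32_t)0u - x;
}
```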

~~~
mattst88
Indeed. The x86 equivalent gem
([http://dflund.se/~john_e/gems/gem0001.html](http://dflund.se/~john_e/gems/gem0001.html))
makes a point to note that. Odd that it's not mentioned on the m68k page.

For what it's worth, my copy of the C11 spec says

    
    
      The abs, labs, and llabs functions compute the absolute value of an integer j. If the result cannot be represented, the behavior is undefined. 304)
      
      304) The absolute value of the most negative number cannot be represented in two’s complement.

------
toolslive
If you're interested in this stuff, bit-twiddling hacks:
[https://graphics.stanford.edu/~seander/bithacks.html](https://graphics.stanford.edu/~seander/bithacks.html)

Some are due to an algorithmic trick, while others are a translation trick.
Some are both.
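As one example of a purely algorithmic trick from that page (a minimal sketch; the function name is mine): x & (x - 1) clears the lowest set bit, which gives Brian Kernighan's population count.

```c
#include <stdint.h>

/* Kernighan's method: x & (x - 1) clears the lowest set bit, so the
 * loop body runs exactly once per set bit. */
unsigned popcount32(uint32_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1; /* clear lowest set bit */
        n++;
    }
    return n;
}
```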

------
uxcn
How many of these are still relevant? x86 now has instructions for _crc_ and
_bswap_, as well as _nop_.
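For byte swapping in particular, there is usually no need to write asm at all anymore: a portable C expression like the following (a sketch; the function name is mine) is a pattern gcc and clang recognize and compile down to a single BSWAP on x86, and the __builtin_bswap32 intrinsic is available when you want to be explicit.

```c
#include <stdint.h>

/* Portable 32-bit byte swap: move each byte to the mirrored position. */
uint32_t bswap32(uint32_t x) {
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```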

~~~
dr_zoidberg
Nowadays almost all code is cache-bound for performance, not cycle-bound. So
these gems, while they may still work, may not be the best means to achieve
optimal performance. Optimizing cache access is what brings the biggest
speedups today -- see for example NumExpr[1].

[1] [https://github.com/pydata/numexpr](https://github.com/pydata/numexpr)

~~~
uxcn
Cache is extremely important, but most of these operate strictly on values in
registers. So, losing cycles due to cache misses, latency, contention, etc...
isn't a bottleneck.

A lot of these are crucial enough that they've been codified in transistors
now, and generally can execute in a cycle without consuming extra registers.
There are still some very valid bit tricks though.

~~~
dr_zoidberg
Well, I read some and had the impression they were going for speed or low
cycle count. Indeed, most surely still work (there doesn't seem to be anything
processor-specific). Unfortunately my asm is rustier than I thought, and I
find it quite demanding to read and fully understand them.

------
Plishar
Secrets of Assembly Programming Gurus!

~~~
fao_
I wouldn't go so far as to call them secrets, since in my personal experience
gurus tend not to keep secrets; they're just neat hacks that not many people
know about nowadays.

