Hacker News new | past | comments | ask | show | jobs | submit login
Assembly Gems (dflund.se)
78 points by aburan28 on Aug 29, 2015 | hide | past | favorite | 13 comments



Here's a little gem that I don't see mentioned on there.

For bit unpacking in C, the natural inclination is to shift out the least significant bit with something like this:

    u1 getbit() { u1 lsb = x & 1; x >>= 1; return lsb; }
However, most assembly languages have a better alternative if you instead shift out the most significant bit. Suppose that the bit buffer is in ECX. Then you would do ADD ECX, ECX to left shift ECX and (this is the important part) put the shifted-out bit into the carry flag. From there the carry flag can be shifted into the least significant bit of another register, say EAX, with ADC EAX, EAX, or you can branch based on the carry with JC/JNC

As a full example, here is the gamma decoder from apack/aplib, which despite its simplicity rewards careful study:

    getbit:
        add    dl, dl
        jnz    .stillbitsleft
        mov    dl, [esi]
        inc    esi
        adc    dl, dl
      .stillbitsleft:
        ret

    getgamma:
        xor    ecx, ecx
    getgamma_no_ecx:
        inc    ecx
      .getgammaloop:
        call   getbit
        adc    ecx, ecx
        call   getbit
        jc     .getgammaloop
        ret


I think this is one of the more interesting ones for its extreme terseness and opaqueness:

http://dflund.se/~john_e/gems/gem003a.html

It would be a little adventure to try to figure out how it was conceived, as although it looks like something a superoptimiser would produce, it was around before superoptimisation, and I think it was generated by a human likely related to the demoscene.


This was a well known trick for 8080s and Z80s as far back as the 1970s. Back then there were a lot of assembly programmers, tricks like this were stock in trade.


Reminds me of optimizing microcode back in uni... From the reference implementation of 134 microinstructions down to 61, and it was far more cycle-efficient as well (progressive decoding and Huffman encoded micro representation given sample programs). Since it was a contest for two goals of extra credit, people were pissed that our team captured both. It took about 30 hours to implement.

Microinstructione are like a CPU's firmware which implements the external-facing macroinstruction set into a simpler set of microinstructions which coordinate various internal state.


This one has a serious flaw: http://dflund.se/~john_e/gems/gem0032.html

The neg (two's complement) of 0x80000000 is 0x80000000. Thus his trick results in an infinite loop for the most negative value.


Indeed. The x86 equivalent gem (http://dflund.se/~john_e/gems/gem0001.html) makes a point to note that. Odd that it's not mentioned on the m68k page.

For what it's worth, my copy of the C11 spec says

  The abs, labs, and llabs functions compute the absolute value of an integer j. If the result cannot be represented, the behavior is undefined. 304)
  
  304) The absolute value of the most negative number cannot be represented in two’s complement.


If you're interested in this stuff, bit-twiddling hacks: https://graphics.stanford.edu/~seander/bithacks.html

Some things are due to an algorithmical trick, while others are a translation trick. Some are both.


How many of these are still relevant? x86 has an instruction for crc, bswap, as well as nop now.


Now almost every code to be run is cache-bound for performance, not cycle-bound. So this gems, while may still work, may not be the best means to achieve optimal performance. Optimizing cache access is what brings the biggest speedups today -- see for example NumExpr[1].

[1] https://github.com/pydata/numexpr


Cache is extremely important, but most of these operate strictly on values in registers. So, losing cycles due to cache misses, latency, contention, etc... isn't a bottleneck.

A lot of these are crucial enough that they've been codified in transistors now, and generally can execute in a cycle without consuming extra registers. There are still some very valid bit tricks though.


Well I read some and had the impression they were going for speed or low cycle count. Indeed, most surely still work (there doesn't seem to be anything that is processor-specific). Unfortunately my asm is rustier than I thought and I find it quite demanding to read and fully understand them.


Secrets of Assembly Programming Gurus!


I wouldn't go so far to call them secrets, as in my personal experience gurus tend not to keep secrets; they're just neat hacks that not many people know about nowadays.




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: