
Questions about Superoptimization - nkurz
http://www.oilshell.org/blog/2016/12/30.html?
======
nightcracker
My comments here are not to the posted article, but the one referenced at the
top:
[http://www.oilshell.org/blog/2016/12/23.html](http://www.oilshell.org/blog/2016/12/23.html)

    
    
        def LookupKind(id):
            return KIND_TABLE[id]

        def LookupKind(id):
            return 175 & id & ((id ^ 173) + 11)
    

> The second implementation is faster because it doesn't require any memory
> accesses.

This is not necessarily true. If this function is in a hot loop and the
entire lookup table fits in L1 cache, you can treat KIND_TABLE[id] as taking
only ~1-2 cycles.

> Jeff Dean also posed this question in his talk: How long does it take to
> quicksort 1 billion 4-byte integers? It surprised me that he estimated it by
> simply counting the number of branch mispredictions, which is described in
> this post.

This is no longer true. In state-of-the-art implementations quicksort is
implemented in a branchless fashion. See
[https://github.com/orlp/pdqsort](https://github.com/orlp/pdqsort) for details
(I'm the author of pdqsort, old username on HN). The trick is to replace
directly swapping misplaced elements in a loop with first filling a buffer of
indices of elements on the left that should be on the right, and another
buffer of indices of elements on the right that should be on the left. This
can be done in a branchless manner:

    
    
        int buffer_num = 0;
        const int buffer_max_size = 64;
        // The two statements below are alternative implementations of the
        // same step: record index i if elements[i] belongs on the other side.
        for (int i = 0; i < buffer_max_size; ++i) {
            // With a branch:
            if (elements[i] < pivot) { buffer[buffer_num] = i; buffer_num++; }
            // Branchless: always write, but only advance the index when the
            // comparison holds.
            buffer[buffer_num] = i; buffer_num += (elements[i] < pivot);
        }
    

These buffers can then also be swapped branchlessly.
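As a minimal sketch of that exchange step (illustrative only, not pdqsort's actual internals; the function and parameter names here are made up): once the two offset buffers are filled, the misplaced elements can be exchanged pairwise in a loop whose trip count is known up front, so the exchange itself contains no data-dependent branches.

```c
#include <stddef.h>

/* Hypothetical sketch: exchange the elements recorded in the two offset
 * buffers. offsets_l holds indices (into the left block) of elements that
 * belong on the right; offsets_r the converse. The loop bound depends only
 * on how many offsets were recorded, not on the element values, so there
 * are no data-dependent branches here either. */
static void swap_offsets(int *left, int *right,
                         const int *offsets_l, const int *offsets_r,
                         size_t num) {
    for (size_t i = 0; i < num; ++i) {
        int tmp = left[offsets_l[i]];
        left[offsets_l[i]] = right[offsets_r[i]];
        right[offsets_r[i]] = tmp;
    }
}
```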

~~~
Coding_Cat
> the entire lookup table fits in L1 cache you can treat KIND_TABLE[id] as
> only taking ~1-2 cycles.

You can do so, but on modern processors it's ~4 cycles to access the L1 cache
and

    
    
            return 175 & id & ((id ^ 173) + 11)
    

takes 3 cycles (cycle 1: do the first & and the ^; cycle 2: do the +; cycle
3: do the second &).
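To make that dependency chain concrete, here is the article's expression split into temporaries (illustrative only; a compiler sees the same dependency graph whether or not you write it this way):

```c
/* Illustrative sketch of the dependency chain in the superoptimized lookup.
 * t1 and t2 do not depend on each other, so a superscalar core can compute
 * both in the same cycle; t3 must wait for t2, and the final & must wait
 * for both t1 and t3 -- three dependent steps in total. */
static int lookup_kind(int id) {
    int t1 = 175 & id;   /* cycle 1 (independent of t2) */
    int t2 = id ^ 173;   /* cycle 1 */
    int t3 = t2 + 11;    /* cycle 2: depends on t2 */
    return t1 & t3;      /* cycle 3: depends on t1 and t3 */
}
```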

~~~
pmalynin
Actually on Intel CPUs ALU operations take 1/3 of a cycle and the processor
can schedule them back to back.

~~~
Coding_Cat
AFAIK this is not quite right. They take only 1/4th of a cycle on average,
but that is because of pipelining: if you have a dependency on the result of
an ALU operation you will still have to wait the full latency (1 cycle)
before you can continue.

This makes sense, as the clock is also somewhat of the 'driving force' for
pushing signals through the chip from one part to another. (Some
architectures have 'zero cost' operations, I believe, but these are usually
baked into the pipeline and have to be turned on or off depending on need.)

------
awirth
The nop and xchg in the GCC output are padding to make the next function
aligned. You can ignore everything after the ret.

The execution of movzx should be negligible. The big difference between the
code generated by GCC and the code generated by clang is that the clang code
does everything using 8-bit registers, and zero-extends at the end, whereas
the GCC code does everything with full 32-bit registers.

I actually had never seen the dil register before this (it's the low 8 bits
of the [er]?di register). It's pretty cool that clang uses this when it knows the
value in the first argument is byte-sized. I think it's also neat seeing how
the compilers use the commutativity of binary-and differently, and end up
anding in a different order. Overall I like the clang code better, as it seems
closer to the asm someone might write by hand.

~~~
chubot
OK thanks, do you know why it's 16-byte aligned? The Clang code is exactly 16
bytes but the GCC code is 20 bytes and it pads it out to 32. For i-cache
maybe? (I asked this on reddit too)

~~~
awirth
It's definitely an optimization thing, but I'm not actually sure off the top
of my head why. According to
[https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html),
-falign-functions is enabled at -O2 and higher,
unless you use -Os. Presumably something about aligned instruction
fetch/decoding can be faster.

------
powera
If you're considering doing this, it's important to consider what you will do
when you inevitably add additional values to the enum. If you have persistence
or partial rollouts (and networked services _always_ have partial rollouts)
this is likely to break horribly. If the app never changes or it's used
entirely in memory on a single process, you're fine.
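One way to guard against that (a hedged sketch, not anything from the article: the table contents and the `formula_matches_table` check here are invented for illustration) is to validate the superoptimized closed form against the authoritative table at startup, so a newly added enum value whose kind the formula was never derived for fails fast instead of silently misclassifying:

```c
#include <stdbool.h>

/* Hypothetical sketch: a toy kind table seeded from the article's formula,
 * plus a startup check that the closed form still reproduces every entry.
 * The real KIND_TABLE's contents are not public here, so this table is
 * fabricated for illustration only. */
enum { NUM_IDS = 64 };
static int kind_table[NUM_IDS + 1];

static int lookup_kind_formula(int id) {
    return 175 & id & ((id ^ 173) + 11);
}

/* Returns true iff the formula agrees with the table for every id.
 * Run once at startup (or in a unit test) after any enum change. */
static bool formula_matches_table(int num_ids) {
    for (int id = 0; id < num_ids; ++id)
        if (lookup_kind_formula(id) != kind_table[id])
            return false;
    return true;
}
```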

