

The 8-Byte Two-Step - rzezeski
http://zinascii.com/2014/the-8-byte-two-step.html

======
nkurz
Thanks for writing this. It's great to see more articles like this on HN.

The assembly in the article looks like it was compiled without optimization,
which is going to change the exact numbers quite a bit. Contrary to what it
sounds like, "without optimization" really means something like "output some
strange dialect of antiquated assembly that's 10 times slower than it needs to
be". It's really only useful if you are debugging the compiler. In particular,
it's pushing all the variables onto the stack even though it doesn't need to,
so that they are there in case you need to check them manually. 99% of the
time you want to be compiling with -O2, -O3, or -Ofast.

With gcc -O3 for x64, here's what they look like. The functions are shorter,
faster, and in some ways easier to understand:

    
    
      align_1:
            leal    7(%rdi), %eax
            andl    $-8, %eax
            ret
    

The first integer arg always comes in %rdi/%edi. The compiler has recognized
the idiom, and transformed it into (x + 7) & (0b11..11000). LEA is "load
effective address", and in this case is a slightly shorter way of writing
'ADD', but no faster.
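
For comparison, the C that presumably produced this -- my reconstruction,
not the article's exact source:

```c
#include <stdint.h>

/* Round x up to the next multiple of 8: add 7, then clear the low
   three bits. This is the idiom GCC recognized and compiled to the
   lea/and pair above. */
uint32_t align_1(uint32_t x) {
    return (x + 7) & ~(uint32_t)7;
}
```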

    
    
      align_2:
                    cmpl    $8, %edi
                    movl    $8, %eax
                    jbe     .L6
            .L5:
                    addl    $8, %eax
                    cmpl    %eax, %edi
                    ja      .L5
                    rep ret
            .L6:
                    rep ret
    

"If x < 8 (jbe == jump before or equal), return 8. Otherwise keep adding 8
until it's greater (jump after), and return that". The return value is
whatever is in %eax on return. The 'rep ret' is a funny way of saying return,
apparently done to be slightly faster on ancient AMD processors. At this point
it's cruft. And there is no real reason to have two return statements ---
that's the kind of wacky thing compilers like to do even when optimizing.
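
In C, the loop that assembly implements looks roughly like this (my
reading of the disassembly, not the article's source):

```c
/* Start at 8 and keep adding 8 until the result reaches x.
   Note it returns 8 for x == 0, unlike the mask version, which
   returns 0. */
unsigned align_2(unsigned x) {
    unsigned r = 8;
    while (x > r)
        r += 8;
    return r;
}
```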

    
    
      align_3:
                    movl    %edi, %edi
                    subq    $8, %rsp
                    cvtsi2sdq       %rdi, %xmm0
                    mulsd   .LC0(%rip), %xmm0
                    call    ceil
                    mulsd   .LC1(%rip), %xmm0
                    addq    $8, %rsp
                    cvttsd2siq      %xmm0, %rax
                    ret
            .LC0:
                    .long   0
                    .long   1069547520
            .LC1:
                    .long   0
                    .long   1075838976
    

The first 'movl' zeroes the upper 32 bits of %rdi. This is because x was
defined as an int, and not a long. In general, you might be better off
always using long unless there is a specific reason not to. Then we
convert the int to floating point. Floating point on modern systems uses
the 128-bit XMM registers that are also used for vectors. We put it into
%xmm0, because that is the register for the first floating point or
vector arg. Because multiplication is much faster than division, the
compiler replaces the division by the alignment with a multiplication by
a constant that is approximately (and depending on the value of the
alignment, this might be a significant fact) equal to 1/align --
decoding the bits, .LC0 is exactly 0.125 here. Keeping the result in
%xmm0, call ceil(), which also returns its result in %xmm0, and multiply
by .LC1, which is the alignment itself, 8.0. Convert that back to a
64-bit "signed integer quadword" (the siq suffix) in %rax and return it.

The trickiness with optimizing (which is probably why you used the mangled
unoptimized version) is that the compiler has a tendency to optimize them out
completely so you end up testing an empty loop. Your choices are either work
in assembly directly so what you see is what you get, or to be crafty and come
up with something that prevents the compiler from doing so: a checksum, or
perhaps a comparison that you know the result of but hopefully the compiler
doesn't.
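
A minimal sketch of the checksum trick (align_1 here stands in for
whatever function is being timed):

```c
/* Fold every result into a running sum so the compiler cannot
   prove the loop body is dead and delete it. */
unsigned align_1(unsigned x) { return (x + 7) & ~7u; }

unsigned bench(unsigned iterations) {
    unsigned sum = 0;
    for (unsigned i = 0; i < iterations; i++)
        sum += align_1(i);   /* each result feeds the checksum */
    return sum;              /* using the sum keeps the work live */
}
```

Printing or otherwise consuming the return value of bench() is what
denies the optimizer the chance to remove the loop.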

Here are the timings in cycles on a Haswell system. I compiled with
'-fno-inline'; allowing inlining would be faster, but might skew the
results more in this case, even though function calls are fast on x64.

baseline: 5.3 cycles for a loop that returns x and compares against zero.

align_1: 5.0 cycles including the loop. Yes, in this case adding the two
instructions of the function actually took fewer cycles than the baseline loop.
It's more than ILP voodoo, it's a whole way of life!

align_2: 62 cycles per iteration. Horrible, but not as bad as the unoptimized
mess. But it's worth pointing out that while you don't want to use a loop here,
there are much faster loops. Perhaps while (x % align != 0) { x++; }? This gets
you down to 11 cycles, only 6 more than the baseline. Interestingly, it's this
fast because the processor apparently can perfectly predict the number of loop
iterations after the first 30, which is a little scary.
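
Written out as a function, that faster loop is just (align assumed
nonzero):

```c
/* Increment until x is a multiple of align. */
unsigned align_loop(unsigned x, unsigned align) {
    while (x % align != 0)
        x++;
    return x;
}
```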

align_3: 19 cycles per call. Faster than the silly loop, but really not a way
you want to approach this problem unless you've really thought through all the
floating point details.

I put my code up at
[https://gist.github.com/nkurz/985470b01b999e67d04b](https://gist.github.com/nkurz/985470b01b999e67d04b).
At the bottom are more numbers provided by Likwid for things like number of
instructions, number of branch errors, etc.

~~~
duckingtest
>In general, you might be better off always using long unless there is a
specific reason not to

I don't agree. That's trading off useful cache for almost invisible micro
optimizations.

As a general rule, it's always better to use the smallest size possible.

~~~
nkurz
I phrase it as "might be" since I realize it's controversial, but I think that
rule is outdated. Yes, having a large static array would be a fine specific
reason to use the smallest size possible. But what would be the benefit when
you are dealing with the argument to a function in an example like this?

Most of the time the argument starts in a register, is passed in a register,
and returned in a register. Using a smaller size often just means that the
compiler adds some unneeded conversions as in this case. Usually this doesn't
matter, but when it does, the benefit is almost always in favor of the simpler
rule of always using 64-bit variables.

~~~
duckingtest
One argument? Sure, probably no difference. When you start using longs as
local variables and arguments, even if they're all in registers in one
function, if you call other functions inside they are going to be pushed on
the stack. It all adds up and suddenly you're getting L1 cache misses.

Anyway, unnecessary conversions mostly go away when you use link time
optimizations (-fwhole-program or -flto in gcc).

~~~
nkurz
A reasonable argument, although not one I seem to run up against, perhaps
because I'm rarely concerned about high performance when writing functions
with that many layers of subcalls.

By contrast, when trying to optimize inner loops, I frequently encounter cases
where the front-end limitation of 4 micro-ops per cycle is a limiting factor,
and getting rid of any extraneous instruction is a speedup. And rather than
worrying about a deep stack causing L1 data misses, I'm more concerned with
missing L1 instruction cache, or with the extra micro-ops causing me to miss
the ~1000 slot decoded micro-op cache.

These concerns are clearly at opposite ends of the performance spectrum, and
which should dominate probably depends on the problem at hand.

(I glanced at your comment history. Welcome to HN! You have good insights.
Please stick around.)

------
kabdib
The floating point version is dangerous and broken on platforms that have
64-bit ints. (Not sure what happens with ceil and negative numbers, it
_probably_ works, but I'd have to try it). And while not an issue here, since
this looks like it's user-space code, it's bad to use floats in a systems
programming environment (floats are often in registers that can't be modified
or maybe even _referenced_ in kernel contexts without some save/restore
shenanigans).

The loop version works poorly for large queue sizes, and if I saw this in
production code I'd replace it immediately.

I'd be okay with a division and re-multiplication (likely going to be
optimized by the compiler, though it's foolish to depend on this).

I don't see any problem with the mask/not/and solution. This has been a
programming idiom for decades, and should be no more mysterious than (say) a
doubly-linked list.

~~~
leif
I agree, the mask/not/and is standard within systems contexts. It needn't be
cognitive overhead. Not to be elitist, but if you don't have a copy of
Hacker's Delight or the ability to understand constructs like this, you have
no business writing systems-level code.

------
Dylan16807
Maybe you shouldn't care about cycles, but using a loop for something that can
be done without branches is a definite cognitive overhead.

If we want to avoid bit twiddling, how about this:

(mqhp->mq_maxsz + MQ_ALIGNSIZE - 1) / MQ_ALIGNSIZE * MQ_ALIGNSIZE;

I would expect a compiler to emit the same instructions as the original, but
at worst you would have an add and two shifts rather than jumps or complex
float operations.
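
A quick sanity check that the divide-and-multiply form matches the mask
form (assuming MQ_ALIGNSIZE is a power of two; 8 is my stand-in value):

```c
#define MQ_ALIGNSIZE 8u

/* Round up by dividing, rounding the quotient up, and multiplying back. */
unsigned round_div(unsigned x) {
    return (x + MQ_ALIGNSIZE - 1) / MQ_ALIGNSIZE * MQ_ALIGNSIZE;
}

/* Round up with the mask idiom from the article. */
unsigned round_mask(unsigned x) {
    return (x + MQ_ALIGNSIZE - 1) & ~(MQ_ALIGNSIZE - 1);
}
```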

~~~
to3m
Provided everything is unsigned, this should result in something somewhat
similar. I think this form also makes it clearer what's going on.

Assume MQ_ALIGNSIZE is a power of two. Then if N is the power, MQ_ALIGNSIZE is
1<<N. Then you can replace divisions by right shifts, and multiplications by
left shifts, and turn the expression into (value+(1<<N)-1)>>N<<N.

Consider x>>N<<N. The (unsigned - i.e., guaranteed zero-filling) right shift
discards the bottom N bits, and then the left shift puts the remaining bits
back where they were. The net result is to mask off the bottom N bits of x.

Another way of putting that would be x&~((1<<N)-1) - hopefully that is clear.
In our example, 1<<N is MQ_ALIGNSIZE. So we have x&~(MQ_ALIGNSIZE-1), or,
overall, (x+MQ_ALIGNSIZE-1)&~(MQ_ALIGNSIZE-1).

I suppose it's possible a compiler could transform this, or the divide-and-
multiply version, into the add-then-and version, given enough things that are
compile-time constants. But personally if I knew the alignment would be a
power of two - which it pretty much always is - I'd just write the code I
want. Life is usually simpler that way.
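
Concretely, with N = 3 (so 1<<N is 8), the two forms discussed above
agree for unsigned values:

```c
#define N 3

/* Shift form: the right shift discards the low N bits, the left
   shift restores the rest. */
unsigned round_shift(unsigned x) { return (x + (1u << N) - 1) >> N << N; }

/* Mask form: same result, written as add-then-and. */
unsigned round_and(unsigned x)   { return (x + (1u << N) - 1) & ~((1u << N) - 1); }
```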

------
greenyoda
I'd agree with his last sentence: "My guess, this was done more as an idiom of
systems programming than as an optimization."

~~~
syntheticnature
Doubly agreed -- similar patterns appear elsewhere to avoid off-by-one errors
and bitmasking is fairly intuitive to anyone regularly working at that level.
I was surprised at the author not recognizing this idiom, but you have to
learn it sometime!

I was further stunned by the seeming naivety of the author's align_2()
implementation, but then it does get the job done, eventually. My naive
approach would've been roughly align_3(), but using integer math and modulus.
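
That integer-math-and-modulus version might look something like this (my
guess at the shape, not the commenter's actual code):

```c
/* Add just enough to reach the next multiple of align; works for
   any nonzero align, not only powers of two. */
unsigned align_mod(unsigned x, unsigned align) {
    unsigned rem = x % align;
    return rem ? x + (align - rem) : x;
}
```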

On the other hand, younger team members were recently stunned by code I did
for translation of a binary protocol into more easily handled pieces using
what I think of as typical idioms, so this article might get sent around
Monday morning.

~~~
rzezeski
Hi, author here. My align_2 implementation is indeed very naive. Part of that
was to show just how much overhead a naive implementation could add, in this
case 3 orders of magnitude. I'm also new to systems programming. This article
should serve as a reminder to experienced hackers that idioms are often only
obvious _after_ they've been explained. My last 4 years were spent working on
a distributed database; I could bring up "obvious" idioms from that world that
might perplex a kernel hacker.

But you are right, this is indeed a very obvious line of code once you
understand it. Thank you for your comments.

~~~
syntheticnature
The key being "once you understand it." I look forward to sending this around
to make sure folks "get it" -- that which I grasp intuitively is magic for
them[1], and kernel hacking is rarely the deep topic people imagine it to be.
I doubt I have the right perspective to explain this as clearly, and if I
could, I lack the time to do so as deeply.

[1]Their words, heard within the past month.

------
jgh
It's a fun trick I've used for years, I think originally I gleaned it from
[http://graphics.stanford.edu/~seander/bithacks.html](http://graphics.stanford.edu/~seander/bithacks.html)
(or maybe not, now that I look through that page - though it is full of
goodies)

------
suprjami
This is really neat, thanks for the thorough explanation.

I have banned myself from using printfs to figure out things like this.
Instead I would use a debugger and breakpoints to view the live variables in
their different data formats.

In GDB, that's p/t for binary and p/d for decimal.

------
thaumaturgy
Thank you for the careful writeup and submission here. I recently picked up a
copy of _Beautiful Code_ , but I was hoping it would have more stuff like
this.

------
bitwize
Holy shit, mind blown.

I've done this before, but usually take modulo 8 rather than bitwise-AND
with ~7 as the final step.

~~~
syncsynchalt
Same, I'd write it out using modulo and would assume the compiler would figure
it out.

    
    
        uintptr_t p = (uintptr_t)x;
        p += 7;
        p -= p % 8;
    

Now that I've written it out I suppose it's not any clearer than the mask
method:

    
    
        uintptr_t p = (uintptr_t)x;
        p += 7;
        p &= ~(uintptr_t)7;

------
hardwaresofton
Another great article from that site:

[http://zinascii.com/2014/a-posix-queue-
implementation.html](http://zinascii.com/2014/a-posix-queue-
implementation.html)

------
toolslive
If you don't know history, you are doomed to reinvent it. The one thing you
should realize is that in the past, people were not more stupid. They just had
less, or different context. Regarding bit trickery, chess programming is full
of it.

Curiously, it's also the area where strong static typing breaks down, as it is
sometimes preferred to view the same set of bits as a value of a different
type to be able to manipulate it in another way.
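
The standard-blessed way to view the same bits as another type is the
memcpy idiom, sketched here for a double (the expected values happen to
match the .LC0/.LC1 constants from the align_3 disassembly above):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as a 64-bit integer without
   violating strict aliasing: copy the bytes, don't convert. */
uint64_t bits_of(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}
```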

------
Chromozon
TLDR: round a number N up to the nearest multiple of M

~~~
blt
... where M is a power of 2.

