
Reversing Bits in C - pvilchez
http://corner.squareup.com/2013/07/reversing-bits-on-arm.html
======
jandrewrogers
This article overlooks a major factor in bit-twiddling performance on modern
CPUs: saturation of the execution ports in a CPU core.

An Intel i7 core has six execution ports, three of which are ALUs of various
types. Depending on the specific instruction and the dependencies between
instructions, the CPU can execute up to 3 simple integer operations _every
clock cycle_ mixed with operations like loads and stores at the same time. For
most algorithms, particularly those that are not carefully designed, multiple
execution ports may be sitting idle for a given clock cycle. (Hyper-threads
work by opportunistically using these unused execution ports.)

Consequently, algorithms with a few extra operations but more operation
parallelism will frequently be faster than an equivalent algorithm where the
operations are necessarily serialized in the CPU.

Furthermore, the compiler and CPU may have a difficult time discerning when
instructions in some algorithms can be executed in parallel across execution
ports. Seemingly null changes to the implementation of such algorithms, such
as splitting the work across two accumulator variables and combining them at
the end, where any normal programmer would just use one variable to achieve
the same thing, can have a large impact on performance. I
once _doubled_ the performance of a bit-twiddling algorithm simply by taking
the algorithm and using three variables instead of one. The algorithm was
identical but the use of three registers exposed the available parallelism to
the CPU.
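For illustration, a minimal C sketch of the accumulator-splitting idea (a hypothetical example using GCC/Clang's `__builtin_popcountll`, not the commenter's actual algorithm):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: three independent accumulators let the CPU issue the
 * three adds per iteration to separate ALU ports in the same cycle, because
 * a, b, and c carry no data dependencies on one another. Assumes GCC/Clang
 * for __builtin_popcountll, and n divisible by 3 for brevity. */
uint64_t popcount_3acc(const uint64_t *p, size_t n)
{
    uint64_t a = 0, b = 0, c = 0;
    for (size_t i = 0; i < n; i += 3) {
        a += (uint64_t)__builtin_popcountll(p[i]);
        b += (uint64_t)__builtin_popcountll(p[i + 1]);
        c += (uint64_t)__builtin_popcountll(p[i + 2]);
    }
    return a + b + c;   /* combine the split accumulators at the end */
}
```

A single-accumulator version computes the same sum but serializes every add behind the previous one.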

~~~
stephencanon
This is an excellent point; however, there are a few things to keep in mind:
first, compilers can (and do) perform this optimization for you (ignoring
details about re-associating floating-point since we’re talking about bit
twiddling).

Second, bit-reversal never exists in a vacuum. There are other operations
taking place around it, which will fill in unused execution resources, thanks
to out-of-order execution. (And as you note, hyper threading will take
advantage of them too).

Third, even though there are six ports (actually, 8 ports and 4 ALUs in
Haswell![1]), that i7 can still only _retire_ 4 fused uops per cycle, so in
practice one thread cannot saturate all of the execution ports, no matter how
cleverly it is optimized.

All of this combines to mean that the fastest bit-reversal in isolation may
not be the fastest bit-reversal _in situ_ , which is much more important.
Actually evaluating that is much more complex, but it does tend to tip the
balance slightly away from chasing too much ILP, compared to what isolated
timings would suggest.

[1] [http://www.anandtech.com/show/6355/intels-haswell-
architectu...](http://www.anandtech.com/show/6355/intels-haswell-
architecture/8)

~~~
jandrewrogers
Both GCC and Clang are surprisingly mediocre at this kind of optimization. I
write a lot of extremely performance-sensitive integer algorithms, and those
compilers only seem to find the "obvious" parallel instruction schedules
about half the time, even in isolated contexts.

Fortunately, it is pretty simple to induce the desired optimization from the C
code without resorting to much cleverness. The compilers miss these
optimizations often enough that I frequently double check if I care. Still, it
requires fairly detailed knowledge of the microarchitecture.

I do not do microarchitecture optimization work very often. The last time I
did, it was to design a faster, better hash function to replace Google's
CityHash (and the result was faster and stronger). For most codes, memory
behaviors dominate with respect to performance.

~~~
caf
Have you published your hash function?

~~~
jandrewrogers
Not yet but will soon. It is not just one function but an entire family of
functions with some interesting aspects beyond just the algorithms.

The hash functions were algorithmically optimized around a scaffolding I
designed that guaranteed certain performance characteristics and easy
analyzability. It has produced many thousands of high quality hash functions.
Tens of thousands of CPU hours have been burned on the optimization process,
which is still running; the ones put into use so far were early samples
pulled out of that process, and the hash functions still in the pipeline are
statistically more robust than those earlier versions. Much easier than
trying to design them the old-fashioned way. At some point soon, since the
optimization is converging, I will evaluate the most promising parts of the
phase space, select the strongest and most aesthetic functions, and publish
those into the public domain.

~~~
csom
how does it compare to, say, tabulation as in

Mihai Patrascu, Mikkel Thorup: The Power of Simple Tabulation Hashing. J. ACM
59(3): 14 (2012)
[http://doi.acm.org/10.1145/2220357.2220361](http://doi.acm.org/10.1145/2220357.2220361)
(free version:
[http://arxiv.org/abs/1011.5200](http://arxiv.org/abs/1011.5200))

or also

Mihai Patrascu, Mikkel Thorup: Twisted Tabulation Hashing. SODA 2013: 209-228
[http://knowledgecenter.siam.org/0236-000005/](http://knowledgecenter.siam.org/0236-000005/)

------
rainforest
The multiplication trick reminds me of this StackOverflow answer[1] where an
SMT solver (z3) is used to derive mask and multiplier to extract chosen bits
from a byte.

[1] : [http://stackoverflow.com/questions/14547087/extracting-
bits-...](http://stackoverflow.com/questions/14547087/extracting-bits-with-a-
single-multiplication/14551792#14551792)

~~~
nkurz
That's really interesting, and an approach to such problems that I'd never
considered. I was excited that a "Code generator for bit permutations"
([http://programming.sirrida.de/calcperm.php](http://programming.sirrida.de/calcperm.php))
exists, but using a theorem prover is really another level of possibility. Now
I need to figure out how to apply it to the problem I'm currently thinking
about: [http://stackoverflow.com/questions/17880178/how-do-i-sum-
the...](http://stackoverflow.com/questions/17880178/how-do-i-sum-the-
four-2-bit-bitfields-in-a-single-8-bit-byte/)

~~~
pbsd
We can use the exact same approach used in the bit reversal trick of the
article:

    
    
      ((x * 0x01010101) & 0xC0300C03) % 1023
    

This is probably not gonna be faster than the naive approach, though.
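A sketch of how that expression works out in C (uint32_t arithmetic; the modulo folds because the mask places one 2-bit field in each 10-bit group and 2^10 = 1 mod 1023):

```c
#include <stdint.h>

/* Sum of the four 2-bit fields of a byte via the replicate/mask/mod pattern:
 * multiplying by 0x01010101 copies x into all four bytes of a 32-bit word,
 * the mask keeps a different 2-bit field in each copy at bit positions 0,
 * 10, 20, and 30, and since 2^10 = 1 (mod 1023) the modulo adds them up. */
unsigned sum4x2(uint8_t x)
{
    return ((x * 0x01010101u) & 0xC0300C03u) % 1023u;
}

/* The naive version it competes with, for cross-checking. */
unsigned sum4x2_naive(uint8_t x)
{
    return (x & 3u) + ((x >> 2) & 3u) + ((x >> 4) & 3u) + ((x >> 6) & 3u);
}
```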

~~~
pbsd
Thinking a little further about this, I believe using PSHUFB is the way to go,
at least for when the count is large. This is because we can do 2 iterations
in essentially one go (haven't tested the code, it's mostly a sketch):

    
    
      vmovdqa xmm0, [0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6]
      vmovdqa xmm15, [0x0f, 0x0f, ..., 0x0f]
      vmovdqu xmm7, [rdi]
      _loop_body:
      vpand   xmm8, xmm15, [rdi]
      vpsrlw  xmm9, xmm7, 4
      vpand   xmm9, xmm9, xmm15
      vpshufb xmm8, xmm0, xmm8
      vpshufb xmm9, xmm0, xmm9
      vpaddb  xmm8, xmm8, xmm9
    
      vpshufb xmm7, xmm8, xmm8 ; since sum <= 12, we already have the next sum in the vector!
                               ; xmm7[0] = xmm8[xmm8[0]]
      vpaddb  xmm8, xmm8, xmm7 ; add it
    
      vpextrb eax, xmm8, 0
      vmovdqu xmm7, [rdi + rax]
      add rdi, rax
      sub esi, 2
      jnz _loop_body
    

This is likely extendable to 32-byte vectors with AVX2; have not thought much
about that case.

~~~
nkurz
This is exciting, but I think I may have tricked you on a couple of details.
The actual distance to the next 'key' is 'sum + 5': 00 represents a 1 byte
encoding, not zero, so the minimum offset to the next 'key' is 5. Thus the
maximum offset is actually 17, which means one can't depend on having two keys
within a single 16 byte vector. I'm trying to figure out if there is some way
to compensate for this without halving the performance.

~~~
pbsd
Oh, I missed that. That makes things trickier, but I think we can still get
away with something like

    
    
      vmovdqu xmm7, [rdi + rax + 5 - 1]
      vpinsrb xmm7, xmm7, [rdi + rax + 0], 0
    

without too much of a performance penalty. The adjustments to offsets then can
be put into the shuffle tables, so there should be no further significant
performance loss.

~~~
nkurz
Yes, I think that should work to guarantee two per vector. I hadn't previously
considered trying to do that, and appreciate the suggestion and the sketch. I
think I have a slightly faster (7 cycle) approach doing one at a time using a
64-bit register as a lookup for the sum of the middle two fields, but this has
good promise. Especially if we can get out one farther ahead, so instead of
having the vector reload on the critical path, the unused portion of the
current vector and a preload can be 'slid' into place. Do you know if there is
a good way to simulate a PALIGNR but with a non-immediate operand? This might
get down to 9-10 cycles for two keys.

~~~
pbsd
I have no idea how to simulate a variable PALIGNR on Intel chips without
making the loop extremely slow.

On AMD (with XOP), it can be done using VPPERM, which can shuffle from 2
sources. We can do variable alignment like this:

    
    
      vpperm xmm0, xmm1, xmm2, [[0..31] + offset]
    

On second thought, we can possibly do something similar on Intel using 2
pshufb and a blend.

~~~
nkurz
_On second thought, we can possibly do something similar on Intel using 2
pshufb and a blend._

I tried for a bit, but haven't figured out how to make that work. PSHUFB needs
a different XMM operand for each 'rotate'. Loading this operand would take 6
cycles, and I haven't thought of a clever way of generating it in less.

I do greatly appreciate your help, though. Thanks!

------
mgraczyk
The author seems to misunderstand the idea of asymptotic complexity. All of
the reversal operations are O(1) because the number of bits being flipped is a
constant. If he were concerned with flipping the bits in an arbitrary
precision number, then his different solutions might deserve "Big-O"
classifications.

Second point: the reason the original solution is slow is that a mod
operation by a number that is not a power of two involves a floating point
divide, or several multiply accumulates at extended precision. Either of
those is slower than any of the other methods.

------
cnvogel
Interestingly, while x86-64 does not seem to have a single opcode for
reversing the bits in a byte, it does have an instruction to arbitrarily
shuffle the 16 bytes of a 128-bit SSE register: PSHUFB. It just blows my mind
how much data those SIMD instructions process or move around in relatively
few clock cycles.

[http://stackoverflow.com/a/9040426](http://stackoverflow.com/a/9040426)

[http://www.intel.com/content/www/us/en/processors/architectu...](http://www.intel.com/content/www/us/en/processors/architectures-
software-developer-manuals.html) (it's on page 1256 of 3251).

~~~
stephencanon
It’s actually shocking how _long_ it took Intel to add PSHUFB to SSE. Altivec
(PPC) had the even-more-powerful vperm (arbitrary shuffle mapping 32B to 16B)
way back in 1999.

~~~
chacham15
The VAX (circa 1977) had polynomial evaluation as an instruction[1]. What is
your point?

[1] [http://en.wikipedia.org/wiki/VAX](http://en.wikipedia.org/wiki/VAX)

~~~
stephencanon
Like my sibling posted, the crazy CISCy instructions aren’t comparable because
in general they were no faster than an equivalent sequence of simpler
instructions. That’s not the case for permute; there are no “simpler”
instructions that let you build an efficient permute. It’s one of the fundamental
building blocks for efficient vector code -- that’s why it’s shocking that it
was added to SSE so late.

------
daniel-cussen
In the GA144, lookup tables are pretty painful, so the way I implement reverse
there is:

reverse: a! 16 push . 2 dup . . begin +x 2* 2* unext +x 2* a . + nip ;

In Intel x86/64, the fastest way I know of is to use SIMD instructions, and
break the 64-bit word into 16 nibbles (4-bit pieces), and use PSHUFB to
perform a parallel lookup against another 128-bit xmm register. Then you
aggregate the nibbles in reverse order, using inclusive or and variants of the
shuffle instruction.
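A scalar sketch of the nibble-lookup step described here (plain C, not SIMD; PSHUFB performs 16 such table lookups at once, one per byte lane of an XMM register):

```c
#include <stdint.h>

/* Bit-reverse one byte with two 4-bit table lookups: the table maps each
 * nibble to its bit-reversed value, and the two reversed nibbles then swap
 * places. This is the per-lane operation that PSHUFB does 16-wide. */
static const uint8_t rev_nibble[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

uint8_t reverse_byte(uint8_t b)
{
    return (uint8_t)((rev_nibble[b & 0x0F] << 4) | rev_nibble[b >> 4]);
}
```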

~~~
keenerd
This does an 18 bit word, right?

~~~
daniel-cussen
Yep. I thought this would be a huge issue when using this, but first, it's
really necessary for the instruction set, and second, a lot of hardware uses
18-bit, including FPGA's (often packed w/ 18x18 multipliers and 18bit SRAMs,
in order to support 8b/10b SERDES) and 72-bit DDR3.

------
twoodfin
The article makes the point that the lookup table version is fast because the
table fits in D$, and that if the table were evicted it would be slower. This
is true, but the more interesting point is that by loading this table into D$,
you're potentially slowing down other operations.

It's an important conundrum of optimization that if you had 20 similarly
complex functions in a critical path, implementing and benchmarking each
individually with a lookup table could show excellent performance while
globally performance is terrible. And worse, it's _uniformly_ terrible, with
no particular function seeming to be consuming an inordinate amount of the
runtime or, for that matter, D$ misses.

~~~
stephencanon
If you had 20 similar functions, the tables would occupy 5k in total, using
only 1/6th of the L1 D$ on a typical "big" CPU. In actuality, temporal
locality is such that you don't often stride through all table entries
uniformly, so the actual cache pressure is even less.

The point that you're going after is a good one, but it's important to keep in
mind how enormous modern memory hierarchies are. It often is very reasonable
to trade memory and cache pressure for speed.
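For concreteness, the kind of 256-byte table being sized here, built once at startup (a sketch; the article's actual table may be laid out differently):

```c
#include <stdint.h>

/* One 256-entry byte-reversal table is 256 bytes, so twenty of them come to
 * 5120 bytes, the "5k" figure above. Built once from the shift-based
 * definition; every later reversal is a single load. */
static uint8_t rev_table[256];

void rev_table_init(void)
{
    for (int i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (int bit = 0; bit < 8; bit++)
            if (i & (1 << bit))
                r |= (uint8_t)(0x80 >> bit);
        rev_table[i] = r;
    }
}

uint8_t reverse_byte_lut(uint8_t b) { return rev_table[b]; }
```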

------
Scaevolus
I'm glad they noted that the lookup table's speed relies on it being in cache,
which most "benchmark magic bit-fiddling operations" posts ignore. (Although
it's temporal locality, not cache coherence, that's important for this.)

------
_ihaque
Along the same vein, Andrew Dalke wrote up an interesting series of blog posts
benchmarking different implementations of population count (counting the
number of set bits in a word):

[http://dalkescientific.com/writings/diary/archive/2008/07/03...](http://dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html)

[http://dalkescientific.com/writings/diary/archive/2008/07/05...](http://dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html)

[http://dalkescientific.com/writings/diary/archive/2011/11/02...](http://dalkescientific.com/writings/diary/archive/2011/11/02/faster_popcount_update.html)

The Stanford Bit Hacks page linked in the original article is also very
interesting reading for folks into this sort of stuff.

------
fjarlq
A great companion to this sort of thing is the book Hacker's Delight by Henry
S. Warren, Jr:

[http://www.hackersdelight.org/](http://www.hackersdelight.org/)

[http://www.amazon.com/Hackers-Delight-2nd-Edition-
ebook/dp/B...](http://www.amazon.com/Hackers-Delight-2nd-Edition-
ebook/dp/B009GMUMTM/)

------
kibwen
_" Intel x86/x64 processors don’t have this instruction, so this is definitely
not a portable solution."_

This stuck out to me. I know that RISC vs CISC is basically a meaningless
distinction nowadays, but I still naively expected that x86 would be more-or-
less a strict superset of ARM.

~~~
pbsd
Strictly speaking, AMD's XOP extensions do have an instruction that is close
enough: VPPERM. It can not only shuffle bytes, like the already mentioned
PSHUFB, but also reverse the bits within each byte. A single VPPERM
instruction can therefore reverse up to 128 bits at a time.

------
Symmetry
Very interesting, though you shouldn't be surprised by small differences
between O(1) and O(N) algorithms when N is only 8.

~~~
stephencanon
If N is 8, then O(N) _is_ O(1). For that matter, so is O(f(N)), for any
function f.

~~~
MichaelBurge
Is that true? I would agree that the time is bounded by a constant, but Big O
only makes sense at all as the size of the input increases without bound.

~~~
drivers99
If you define N<=8 from the beginning, then there exists some constant that is
the maximum time the function will take. That makes it O(1).

~~~
millstone
So every terminating function is O(1) since your computer has only a finite
number of possible states!

The real question is whether the input is big enough that the cost is
dominated by the asymptotic behavior, and not the constant coefficient. The
O(N) "obvious" algorithm was faster than the O(1) "3 ops 64 bit algorithm," so
I think the answer is no, it is not big enough. N=8 is sufficiently small
that the asymptotic complexity is irrelevant.
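For reference, the "3 ops 64 bit algorithm" under discussion is the mod-1023 trick as published on the Bit Twiddling Hacks page linked from the article (shown here in its published form; the article's exact constants may differ):

```c
#include <stdint.h>

/* Reverse the bits of a byte in three operations: the 64-bit multiply
 * spreads five copies of b through the word, the mask picks exactly one
 * copy of each source bit k at a position congruent to (7 - k) mod 10, and
 * the modulo by 1023 = 2^10 - 1 folds those bits down into one byte. */
uint8_t reverse_byte_mod(uint8_t b)
{
    return (uint8_t)(((b * 0x0202020202ULL) & 0x010884422010ULL) % 1023u);
}
```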

~~~
Dylan16807
I think the logical way to look at algorithmic efficiency starts by picking a
reasonable N. "Number of bits in a byte" is something that very rarely
changes, and never reaches high values, so it makes a bad N. The flip side of
this is that something like "number of bits in main memory" is very flexible
and reaches extremely large numbers, so it shouldn't be a constant.

If I wasn't making a point about different methods to flip bits, and I was
just naively classifying these byte flippers, I would probably call all of
these O(1). Or perhaps O(N) where N is the size of input in bytes.

------
KaeseEs
Great analysis, although I'm curious how the idea of doing a bunch of 64-bit
ops to accomplish byte arithmetic came about in the first place - was the
function in question not written by a firmware guy?

~~~
zwieback
I think this one goes back to PDP days and wasn't necessarily written to be
the fastest possible implementation. The PDP-10 could do a 36x36 multiply
into 72 bits. Not sure how the modulo operation performed, but there was a
DIV instruction.

~~~
binarymax
Down the rabbit hole says this came from HAKMEM (AI Memo 239), item 167, in 1972!

[http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item167](http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item167)

------
applecore
Interesting. What's the purpose of reversing the bits in a byte?

~~~
RodgerTheGreat
A common approach for performing a Fast Fourier Transform involves reversing
the bits in time-domain samples.

~~~
stephencanon
Expanding slightly, there’s a permutation that needs to happen in order to
efficiently perform a DFT in-place (and the same approach is often used even
when the transform is out-of-place). For power of two sizes (one of the most
common cases), that permutation is precisely the same as a bit reversal of the
indices.
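A short sketch of that permutation in generic C (not tied to any particular FFT library): each index swaps with its log2(n)-bit reversal.

```c
#include <stdint.h>

/* Reverse the low log2n bits of an index. */
unsigned bit_reverse_index(unsigned i, unsigned log2n)
{
    unsigned r = 0;
    for (unsigned b = 0; b < log2n; b++) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}

/* In-place bit-reversal permutation over n = 2^log2n samples: each pair
 * (i, rev(i)) is swapped exactly once, via the j > i guard. */
void bit_reverse_permute(float *x, unsigned log2n)
{
    unsigned n = 1u << log2n;
    for (unsigned i = 0; i < n; i++) {
        unsigned j = bit_reverse_index(i, log2n);
        if (j > i) {
            float t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
}
```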

------
robomartin
If you've ever dealt with graphics file manipulation code chances are you've
suffered the pain of changing the endian-ness of an image file. I never
understood why some of these operations are not implemented as machine
instructions that can run in one instruction cycle flat. There's nothing to
them, I've done exactly that on FPGA's. Yes, they can be a little
resource/routing intensive but not that bad.

~~~
picomancer
> changing the endian-ness

x86 has had the BSWAP instruction since the 486.

gcc has a __builtin_bswap16, __builtin_bswap32, and __builtin_bswap64 which
will presumably take advantage of these built-in instructions on x86 and any
other gcc-supported architectures where similar instructions exist (and fall
back to a reasonably fast and well-tested multi-instruction implementation
where they don't).
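A minimal sketch, assuming GCC or Clang (the builtins named above):

```c
#include <stdint.h>

/* Each builtin compiles down to a single byte-swap instruction (BSWAP on
 * x86, REV on ARM) where the target provides one, and a shift/mask fallback
 * elsewhere. Assumes a GCC-compatible compiler. */
uint32_t swap32(uint32_t x) { return __builtin_bswap32(x); }
uint64_t swap64(uint64_t x) { return __builtin_bswap64(x); }
```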

You should really RTFM every couple years, just to know what your processor
[1] and compiler [2] can do.

[1]
[http://www.intel.com/content/www/us/en/processors/architectu...](http://www.intel.com/content/www/us/en/processors/architectures-
software-developer-manuals.html)

[2] [http://gcc.gnu.org/onlinedocs/gcc/Other-
Builtins.html](http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)

~~~
robomartin
Oh, I RTFM. Not always working on Intel platforms though. And still:

[http://hardwarebug.org/2010/01/14/beware-the-
builtins/](http://hardwarebug.org/2010/01/14/beware-the-builtins/)

------
mzs
Oh man, this is one of my nits. I've written code like this. For example, in
some bit counting code of mine, I have a block comment in front of it all
with 57 non-blank lines. I have a copy of Hacker's Delight on my bookshelf,
but will the person after me know what that code does and how it works? I
really hope there was at least a comment in front of this code pointing to
one of the bit-hack web pages.

------
chacham15
The difference here between the obvious method and the best method is 55ns.
Is there a reason this problem deserves so much attention for as little a
difference in time? (I realize that it is 6.5x more, but if it isn't at the
center of some core loop, the multiplicative factor doesn't really matter.)
What use cases are there for this?

~~~
Renaud
I suppose the point was to show that it's pretty bad to resort to
copy/pasting clever bit hacks into libraries without understanding how they
work.

The fact that the code isn't necessarily obvious makes me think that whoever
used it was hoping for an optimisation of sorts.

Terseness can lead to obfuscation, and that's the wrong sort of optimisation.
So we can hope that the developer was going for speed instead, but the results
show that was a huge failure.

Maybe this won't affect performance in this particular library, maybe it's
called once or twice and it doesn't matter, but if this is part of the innards
of a game or a cryptographic function or some low-level network stack, it
could have very detrimental consequences on performance.

------
munificent
> That’s one mathematical operation, but a large number of CPU instructions.
> CPU instructions are what matter here, though, as we see, not as much as
> cache coherency.

I thought it was also a single CPU instruction, but multiple _clock cycles_.

~~~
wnissen
I believe the point is that it's not a single hardware instruction on a
32-bit CPU.

------
barbs
Ack! Light grey on white background! My eyes!! Seriously, that's really
annoying.

------
duedl0r
Why on earth does this article have so many upvotes? Running time analysis is
completely wrong... O(n) vs O(1) and such...tss..don't get me started..

