

Bit Twiddling Hacks - ahalan
http://www.graphics.stanford.edu/~seander/bithacks.html

======
julian37
This was posted a number of times before:

<http://news.ycombinator.com/item?id=2570269>

<http://news.ycombinator.com/item?id=513935>

<http://news.ycombinator.com/item?id=86419>

And also:

<http://news.ycombinator.com/item?id=1811104>

------
rgarcia
These are cool, but how much is automatically done for you by the compiler?
For example, if I replaced all of my calls to min(x, y) with

r = y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)

my code would quickly become unreadable. I could define a macro, but those are
supposedly evil[1].

[1] <http://www.parashift.com/c++-faq-lite/inline-functions.html#faq-9.5>

~~~
jandrewrogers
Many of the very simple algorithms are done by a good compiler; more complex
ones generally are not. It is pretty easy to turn these into clean,
generalized functions using templates in C++ and std::numeric_limits. I have a
substantial library of generalized algorithms constructed from bit-twiddling
primitives that does not use any macros.

The value of these examples is that if you understand the underlying
mechanisms behind them you can use those building blocks to construct even
more complex (and extremely fast) algorithms than demonstrated at the link.
Complex, high-level algorithms can be implemented using these bit hacks once
you wrap your head around what they are actually doing. The caveat is that the
code will be opaque for programmers that are not fluent in bit-hacking.

For most applications you won't see much performance benefit because these are
micro-optimizations. If you have a small kernel that is being executed a
hundred thousand or a million times per second, then algorithms constructed
this way can be a substantial performance optimization (easily an integer-factor
speedup; I frequently see 10x for components).

Algorithms built on bit-twiddling primitives tend to have two properties that
make them particularly efficient on modern processors. First, they make very
good use of superscalar CPUs, putting the parallel ALUs to work. Second, they
often take branching algorithms (e.g. small tight loops, ?: operator, if
statements) and convert them into branchless algorithms. If you are doing
performance and latency sensitive code, this method of building algorithms is
worth learning. For everyone else, it is a neat bit of computer science arcana
that really delves into the nature of integers.

------
AgentIcarus
If you enjoyed this, you'll also probably like Hacker's Delight:
<http://www.hackersdelight.org/>

------
Jach
I like Morton Numbers--they're a great example of the notion that
dimensionality is an interpretation, and you can compress higher dimensions
down to 1 while still retaining a 1-to-1 correspondence. However, good luck
with modifying the bit twiddling algorithms shown here when you want to deal
with Big Numbers that overflow C/C++'s built-in types. Here's a slow (I
haven't measured how slow) Python version that's insensitive to how big the
number is:

    
    
        def tomorton(x, y):
          x = bin(x)[2:]
          lx = len(x)
          y = bin(y)[2:]
          ly = len(y)
          L = max(lx, ly)
          m = 0
          for j in range(1, L + 1):
            # note: bit j-1 of x is x[lx - j], since bin() strings are
            # most-significant-bit first
            xi = int(x[lx - j]) if j - 1 < lx else 0
            yi = int(y[ly - j]) if j - 1 < ly else 0
            m += 2**(2*j)*xi + 2**(2*j + 1)*yi
          return m // 4  # floor division: the loop placed bits two positions high

~~~
jandrewrogers
There is a straightforward way to extend bit interleaving to bit strings of
arbitrary size using bit-twiddling. The outputs of the magic-constant steps
just have to be shifted in a regular pattern (which varies as a function of
dimensionality) so that each word in the array has its bits correctly ordered.

The magic constants used in the bit-twiddling example also have a very regular
derivation. Designing an algorithm that computes the correct constants for
Morton Numbers of arbitrary dimensionality is pretty simple.

There is a neat generalization of this algorithm that extends it to irregular
bit interleaving patterns at the cost of requiring a few more magic constants
(also derivable). I once wrote a compact engine that used algorithmically
generated transform constants to produce arbitrary bit interleaving patterns
in an arbitrary number of dimensions via these bit-twiddling algorithms.

~~~
Jach
Hey, you finally made the leap to HN! Did you independently discover a lot of
these things yourself or did you come across parts in books/papers? For me the
canonical example of deriving magic numbers is with the fast inverse square
root trick (my favorite paper on it is
<http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf> ) but I haven't seen many
other instances... I haven't gone nearly far enough yet but I like getting
pointed suggestions.

There were some recent submissions here about related topics like space-
filling curves and secret sharing that I've looked at before. When I started
looking into abstract algebra and came across finite fields I sort of laughed
at the Wikipedia page mentioning right at the beginning "Finite fields are
important in number theory, algebraic geometry, Galois theory, cryptography,
coding theory and Quantum error correction" with each topic being linked. It's
a fun way to procrastinate.

If your engine isn't part of some secret sauce I think it would be neat to
study it. Unless you know of an already existing open source equivalent that's
not hidden away in the corners of something like GCC?

~~~
jandrewrogers
I found most of them in code and on the 'net but came up with a couple myself.
When I started looking at bit interleaving some years back, I found a few
examples for 2- and 3-dimensions but not much explanation. If you step through
the algorithm and label every bit to watch how they move, it starts to become
obvious what is going on. This was how I learned. The constants mask
individual bits so that the sum of the shifts applied to a bit equals the
total shift. If a bit needs to be shifted 5 places to its final position, the
mask passes that bit through on the ">>4" and ">>1" steps and filters it out
of the others. Consequently,
constants and shifts for a particular transform do not have to be based on
powers of two, just numbers where the sum of some subset of shifts matches
every shift in the integer you are transforming.

Once you know that, you can generate any distribution you need. However,
Morton numbers take advantage of a property of their bit distributions where
intermediate steps never corrupt a bit that is not going to be masked off
anyway. This is not true for some other bit interleaving patterns. Extending
it to arbitrary patterns requires two masking operations with another magic
constant at each step which protects bits that would otherwise be destroyed in
the simple Morton algorithm.

Nothing like the transform engine I wrote is open source to the best of my
knowledge. There is no reason I could not open source it, I just haven't. It
is pretty efficient in that it performs a number of reductions and
simplifications down to the minimum number of steps and constants required to
produce an n-dimensional interleaved result. For irregular patterns, you end
up with quite a few no-op steps that can be eliminated; for regular patterns
you can reuse steps, saving memory. I'll probably write it up and put it on
the web at some point.

------
kennywinker
Premature optimization heaven!!

But seriously, I'm going to crawl through this and see if there is anything
that can speed up some drawing code I have. Very nice!

------
hornd
These are pretty neat. I use a few of them relatively often at work, but if I
ever saw

c = ((((c - 0x3f800000) >> r) + 0x3f800000) >> 23) - 127;

in production code I would be sad.

~~~
jonhohle
If properly commented, why would that make you sad?

I think this story is relevant:
<http://www.folklore.org/StoryView.py?project=Macintosh&story=Saving_Lives.txt>

Think of all of the time software wastes in people's lives by being bloated
and inefficient.

In Java, for example, the default HashMap implementation [0] uses a while loop
to compute the capacity of the array it will use as storage. Using the next
highest power of two algorithm in the page listed is faster than this loop.
Imagine if every JVM were microseconds faster every time a HashMap was
allocated or resized. On one server, computer, or phone it's meaningless. On
hundreds of millions of phones, PCs, and servers it adds up to meaningful
amounts of time and energy.

[0] <http://www.docjar.com/html/api/java/util/HashMap.java.html>

------
fredsanford
Search Google for Ratko Tomic's bit tricks. He was doing these optimizations
in the late '80s and early '90s.

How I miss the RIME/RelayNet C language forums...

