
Ten Ways to Check if an Integer Is a Power Of Two in C - caustic
http://www.exploringbinary.com/ten-ways-to-check-if-an-integer-is-a-power-of-two-in-c/
======
imurray
For interest rather than practical use:

Creating a 2GiB lookup table is ~10% faster on my machine than method #10.
That is, faster on the benchmark in the blog post. A 2GiB lookup table is
horrible, has a little setup cost dominated by calloc-ing the array, and would
trash caches with real code.

I also made a solution using a union to split x into two shorts and a 64KiB
lookup table that was ~15% slower. For more expensive functions, lookup tables
are an annoying baseline to beat. (Although still often good to avoid because
of dealing with setup, cache problems, etc.)
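
For anyone curious, here's a sketch of the two-shorts idea (reconstructed from
memory, assuming 32-bit unsigned and 16-bit shorts, so details may differ from
what I actually ran):

    
    
      static unsigned char pow2_16[1 << 16]; /* pow2_16[v] = 1 iff v is a power of two */
      
      static void init_pow2_16(void)
      {
          unsigned v;
          for (v = 1; v < (1u << 16); v <<= 1)
              pow2_16[v] = 1;
      }
      
      int is_power_of_two_16(unsigned x)
      {
          union { unsigned u; unsigned short h[2]; } t;
          t.u = x;
          /* exactly one half must be a power of two while the other is zero;
             the test is symmetric, so endianness doesn't matter */
          return (pow2_16[t.h[0]] && t.h[1] == 0)
              || (pow2_16[t.h[1]] && t.h[0] == 0);
      }
    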

~~~
lambda_cube
How were you accessing the lookup table? In a linear or random way? If you did
it in a linear way, locality and prefetching will help performance for your
lookup table. The great thing about #9 and #10 is that they are just as fast
when the sequence of numbers is random. I know you weren't serious about the
2GiB lookup table, but if a lookup table in general should be used as a
baseline, the benchmark should probably use random access. Do you agree?
(Special applications could use a linear access pattern, of course.)

(Also, you can shrink the size to 1/8 by just using 1 bit instead of one byte,
but that would need some more code of course.)

Edit: You can simulate random access by using a stride large enough to avoid
the cache. I guess that would be slightly worse than random, but close enough.

~~~
imurray
I completely agree with all of the above. I was doing a linear scan, and I
know that that is artificial. Your stride suggestion does slow things down
dramatically; thanks for the simple-to-implement idea.

(Shrinking the lookup table to 1/8 of its size, I don't know how to do that _fast_.)

The annoying thing about lookup tables is that they are hard to benchmark
properly. But superficially they often look like a good idea. (Here faster
than the fastest reported result, when tested naively.)

~~~
lambda_cube
> I was doing a linear scan, and I know that that is artificial.

 _May_ be artificial :). Don't forget that your application is the best
benchmark. If it uses linear access you can take advantage of that. Let's just
say that linear access is a special case and random access is a worst case
result. The behavior of the random access is the one to remember, IMO.

> Shrinking the lookup table by 1/8, I don't know how to do that fast.

Here is how I do bit vectors. I'm not suggesting that anyone should use a
lookup table for this problem, since there are very fast solutions that use
O(1) memory, but let's use this as an example since we're all familiar with
it.

Since you mention a 2GiB table I guess you used a byte vector. The article uses
an unsigned int as the type for the argument, which would need 4GiB for all
values; I guess you used int instead. We want to use every bit of memory in an
array to store boolean flags. It's probably faster to use the native word size
of the machine than byte-sized elements, so let's use unsigned int. I
assume that we use a 32-bit machine, just like in the article. If you have a
64-bit machine it will be obvious what to change.

We need 2^31 bits (1 << 31 in C), unsigned int is 32 bits and 32 is 2^5 so we
need 2^31/2^5 = 2^26 elements.

When looking up things in a bit vector we need to find the right element in
the array and the right bit in the array element that holds the boolean flag.
To find the right element we divide the input by the number of bits in each
element. To find the right bit we do mod (%) by the number of bits in each
element and then shift a bit flag by that amount and AND (&) with the array
element.

    
    
      unsigned table[1 << 26];
      unsigned is_power_of_two(int x)
      {
          if (x > 0) {
              return table[x / 32] & (1 << (x % 32));
          } else {
              return 0;
          }
      }
    

Since the constant 32 is a power of two, a good C compiler will change the
divide and modulo operations to shift and AND. If you use some other language
and/or your compiler doesn't optimize that you may want to do that
optimization by hand. x / 32 == x >> 5, if x >= 0. x % 32 == x & 31, if x >=
0. NB: It's not that simple for negative numbers!
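
So the hand-optimized version of the x > 0 branch above would be:

    
    
      return table[x >> 5] & (1 << (x & 31));
    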

The lookup table must be populated before use, of course; that is left as an
exercise for the reader ;-).
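
Or, to save you the trouble, a minimal version (assuming the table is at file
scope, as above, so C zero-initializes it):

    
    
      void populate_table(void)
      {
          unsigned i;
          for (i = 1; i <= (1u << 30); i <<= 1)   /* 1, 2, 4, ..., 2^30 */
              table[i / 32] |= 1u << (i % 32);
      }
    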

(imurray, if you think I explained things you already know, I did it for other
readers.)

> The annoying thing about lookup tables is that they are hard to benchmark
> properly.

If you want a general rule: if you can keep the table small, lookup tables
are fast, sometimes the fastest solution. But maybe your application (not an
artificial benchmark) doesn't use the function very often and the table won't
be in the cache. Then the computation has to be slower than fetching memory
with a cache miss for the lookup table to be worth it. Also, memory bandwidth
has been a bottleneck for a long time and it will only get worse, which
diminishes the value of lookup tables and of trading memory for computation
time. As always when it comes to performance and optimization: benchmark your
application with your data. General results may or may not apply in your
context.

Edit: If you want the answer to always come out as 0 for false and 1 for true,
you can do this instead:

    
    
      return (table[x / 32] >> (x % 32)) & 1;

------
omaranto
It seems bizarre to call the first group of methods "decimal based" when the
methods don't ever look at the decimal digits of the number being tested. I
would call them "arithmetic" or something like that.

~~~
rcfox
It's decimal as in base 10, as opposed to binary.

~~~
eru
But they aren't in base 10. The constant mentioned might as well have been put
into the code in hex or octal and nothing would have changed. (I don't know
whether C supports binary constants.)

~~~
rcfox
I didn't write the article; I'm just giving the rationale. You don't need to
take it out on me.

~~~
eru
No offense intended.

------
bad_user

         return ((x != 0) && !(x & (x - 1)));
    

This is beautiful.

~~~
gjm11
Unless x==0 is often true, it's probably better to put the conjuncts in the
other order:

    
    
      return (!(x & (x-1)) && (x != 0));
    

because that way you don't have to test x against 0 so often.

(On my machine it appears to be about 10% faster.)

[EDIT to clarify: 10% faster with the particular sample of _x_ values that I
tested, which happened to be all the integers from 0 up to 2^30-1 once each.
Of course if you only ever call it with x=0 then the original version will be
faster. Also, if this is really in your inner loop then you're probably doing
something wrong :-).]

~~~
vog
However, this way your code enters an undefined state in C, strictly speaking
(as far as I understand the standard).

So theoretically, an extremely aggressive optimizer would be allowed to
generate machine code that doesn't handle the case x==0 properly.

If (x != 0) is placed first, the optimizer wouldn't be allowed to do that, due
to short-circuit evaluation.

~~~
gjm11
It's OK if x is of an unsigned integer type. If x is of a signed type, though,
you're right: strictly, the value of x-1 is then undefined when x==0.

(In practice, of course, it's perfectly safe unless you're on a distinctly
exotic system, and if you are then you probably know you are.)

~~~
vog
I disagree.

Why should x-1 for x==0 be undefined for a signed type? The signed type can
happily represent -1. Rather, x-1 is undefined for unsigned types because
those have no representation of -1.

~~~
gjm11
Yow, what the hell was I thinking when I wrote that? Let me try again.

Preliminary note: I'm looking at a draft version of the C9X standard, because
that's what I have to hand.

When x is of an unsigned type, it's OK. (In particular, arithmetic overflow on
unsigned types is defined to work modulo (max value + 1).)

When x is of a signed type, the bit-patterns corresponding to negative numbers
are left somewhat up for grabs. It has to be either sign+magnitude, or
1's-complement, or 2's-complement. (I'm pretty sure this was not true in
earlier versions of the standard.)

However, the result of applying a bitwise logical operator to a negative value
is defined to be what you get by applying that logical operation to the bits.
In other words, it's undefined (actually, unspecified, which isn't quite the
same) what answer you get, but this isn't one of those cases where the
standard permits implementations to cause the Moon to explode or pornographic
email to be sent to your employer.

In particular, whatever value x&(x-1) has when x==0, it _has_ a value and the
whole expression (including the && (x!=0) bit) comes out false regardless.

So, I take back my earlier statement: according to my reading of (a draft of)
the (latest) C standard, my version is in fact guaranteed to do the right
thing in all implementations, even when x is of signed type.

I repeat that I haven't looked at earlier versions of the standard; I think
they were less specific about how signed integer types can be represented, and
I wouldn't be surprised if bitwise ops on negative numbers (at least) had
entirely undefined behaviour then.

------
aidenn0
The benchmarks are crap. The loop/function call overhead time dominates. They
should list the benchmark for when the implementation is "return x;" That
gives a baseline number. On my machine I have to run a few thousand runs
before I can distinguish between "return x;" and "return ((x != 0) && !(x & (x
- 1)));"

They both take about 10s for 2^32 iterations.
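
Something like this is what I mean by a baseline (a minimal sketch; swap the
body of check() for "return x;" to measure the loop/call overhead alone):

    
    
      #include <stdio.h>
      #include <time.h>
      
      int check(unsigned x)
      {
          return x != 0 && !(x & (x - 1));
      }
      
      int main(void)
      {
          volatile unsigned sink = 0;   /* stops the compiler deleting the loop */
          unsigned x = 0;
          clock_t t0 = clock();
          do {
              sink += check(x);
          } while (++x != 0);           /* wraps after all 2^32 values */
          printf("%.2fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
          return 0;
      }
    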

------
dasmoth
Recent Intel/AMD CPUs have a POPCNT instruction, which seems like the logical
way to do this. Would be interested to see how that performs compared to these
implementations.

~~~
vilya
FWIW, in gcc/g++ there are compiler intrinsics which (should) map to that
instruction on CPUs where it's available: __builtin_popcount,
__builtin_popcountl and __builtin_popcountll for unsigned ints, unsigned longs
and unsigned long longs respectively. Visual C++ provides equivalent functions
for Windows (but I can't remember what they're called).

It does seem odd that the article misses this approach out.
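
The check itself is then a one-liner (a sketch; you still need -mpopcnt or
-msse4.2 for the actual instruction to be emitted):

    
    
      /* a power of two has exactly one bit set; zero has none */
      int is_power_of_two(unsigned x)
      {
          return __builtin_popcount(x) == 1;
      }
    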

~~~
dasmoth
Unfortunately __builtin_popcount isn't emitting a popcnt instruction with the
GCC I've got here, even using -msse4.2. I believe that very recent GCC does
get this right.

~~~
neckbeard
GCC 4.4 isn't very recent, but it generates a popcnt with -msse4.2.

With GCC 4.5, using popcnt on my Core i7-860 takes the trivial loop mentioned
above from ~10.5s (with "Complement and Compare") to ~7.5s.

------
imurray
A _nasty_ solution, which assumes and abuses IEEE floating point format and
demonstrates a couple of things.

    
    
       int isPowerOfTwo (unsigned int x)
       {
           int exponent;
           union { unsigned int u; float f; } tmp;
           tmp.f = x;
           exponent = (tmp.u >> 23) - 127;
       
           return x == (1 << exponent);
       }
    

One can also cast to a double without losing precision, mask out the exponent
and then compare to 1.0. That solution is even nastier, and needs #define's to
deal with endianness.

Obviously the above solution is not a good idea! Amongst the several problems,
casting to floats is _really_ slow. Sometimes I store my integers in doubles
throughout my code because it saves conversions, and can be more convenient.
(Matlab users routinely store integers as doubles.)

What I found interesting/disconcerting was that the above function doesn't
compile reliably. When using 'gcc -Wall' I get isPowerOfTwo(0)==0, whereas
with 'gcc -Wall -O2' I get isPowerOfTwo(0)==1. clang has the same change in
behaviour with optimization levels.
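
Back to the double variant: with a 64-bit integer type the endianness
#defines can be dodged. A sketch, still assuming IEEE 754 doubles (it masks
the mantissa rather than comparing to 1.0, which comes to the same thing):

    
    
       int isPowerOfTwoDouble (unsigned int x)
       {
           union { double d; unsigned long long u; } tmp;
           if (x == 0) return 0;
           tmp.d = x;                    /* exact: any 32-bit value fits in a double */
           tmp.u &= 0xFFFFFFFFFFFFFull;  /* keep only the 52 mantissa bits */
           return tmp.u == 0;            /* zero mantissa <=> a power of two */
       }
    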

~~~
oldcigarette
gcc should do the right thing if you add a special case for zero; the exponent
will be negative in that case. With floats you might end up with some false
positives for larger integers though, say 2^30+1.

~~~
imurray
Thanks, I had confused myself. You're right: shifting by a negative amount (or
by far too much) gives _undefined behaviour_ in C. (The type punning stuff is
formally undefined too; a memcpy would be more portable, but gcc promises to
make the widely-used union trick work.)
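
The memcpy form, for reference:

    
    
       float f = x;
       unsigned u;
       memcpy(&u, &f, sizeof u);   /* needs <string.h>; typically compiles to a plain move */
    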

Regarding the second point, the posted code works fine with 1073741825U (the
literal for 2^30+1). The algorithm doesn't need the float to keep full
precision, because only the exponent is consulted.

------
ChristianMarks
The decrement test for a power of two can be modified to count bits; it runs
in time proportional to the number of set bits, at most log n.

    
    
      int bits(unsigned n)
      {
         int i = 0;
         while (n > 0) { n &= n - 1; i++; }   /* clears the lowest set bit */
         return i;
      }
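
The power-of-two test is then just bits(n) == 1.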

~~~
axylone
Much better ways to count bits set:
[http://graphics.stanford.edu/~seander/bithacks.html#CountBit...](http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive)

------
Someone
I haven't tried, but I would expect doing the linear search in the reverse
direction will be faster than binary search. On average it inspects just two
values; binary search inspects about five.
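
A sketch of what I mean (assuming 32-bit unsigned, powers checked largest
first):

    
    
      int is_power_of_two(unsigned x)
      {
          unsigned p;
          for (p = 1u << 31; p != 0; p >>= 1) {
              if (x == p) return 1;
              if (x > p)  return 0;   /* x sits between two powers; stop early */
          }
          return 0;                   /* x == 0 */
      }
    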

------
gjm11
Here's another. It's strictly inferior to the x&(x-1) one, but the idea that
makes it work is so pretty it seems worth mentioning.

    
    
      y = x | (x>>1);
      y |= y>>2;
      y |= y>>4;
      y |= y>>8;
      y |= y>>16;
      y |= y>>32; // only for 64-bit types
      return y == (x<<1)-1;
    

So what's going on here? Well, it's easy enough to locate the _lowest_ set bit
in x -- it's x & -x, the same basic idea as the x&(x-1) trick -- but what the
code above does is to locate the _highest_ by "filling" rightwards from that
bit. After the first line, the 1-bits in y are the 1-bits in x, and the bits
one to the right of them. After the second, it's the 1-bits in x shifted right
by 0..3 places. After the next, 0..7 places. And so on. Eventually, that comes
to "all the 1-bits in x, and everything any distance to their right". So,
e.g., 000101000100 -> 000111100110 -> 000111111111, which then doesn't change.

The nice thing is that the number of steps this takes goes like log(word
width). It's in the same spirit as the 0x33333333 popcount trick.

(On some processors -- all x86 ones since 386, for instance -- there's an
instruction that does something similar directly, but on at least some x86
processors it's rather slow.)
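
With gcc you can get at it via __builtin_clz (a sketch, valid only for x != 0
since __builtin_clz(0) is undefined):

    
    
      unsigned highest_bit(unsigned x)           /* x != 0 */
      {
          return 1u << (31 - __builtin_clz(x));  /* maps to BSR on x86 */
      }
    

A power of two then satisfies x == highest_bit(x) (checking for zero
separately).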

------
zxw
Is there a reason this won't work? It's the most 'readable' way I could come
up with.

    
    
        #(python code)
        import math
    
        def is_power_of_two(n):
    
            if n <= 0:
                return False
    
            power = round(math.log(n, 2))
            return 2 ** power == n

~~~
lambda_cube
This will probably give the wrong answer for some integer. I tried something
similar in Java. Since it was three years ago my memory is a little hazy. I
was working on a parallelizing compiler written in Java (but not for Java) and
I saw that the other programmers had used a method similar to yours; it used
log, anyway. I knew about #9 and #10 and worried that their method was
potentially wrong (and also inefficient). To check, I coded up something that
compared the log/floating-point method against #10 for all non-negative
integers, and it gave the wrong answer for exactly one value (out of 2
billion).

That was Java and your example is in Python, there could be some difference.
If you try and compare in Python, please tell us the result.

~~~
zxw
Here is the code from the first test.[0] It increments a variable and prints a
message if there is an inconsistency. I left it running till it reached
1,351,773,471 and didn't come up with any inconsistencies.

I then modified the test[1] to look for inconsistencies where they were most
likely to be found, i.e. ±1 of 2^n. I reached n = 1024 before Python
complained about a 'Result too large'.

[0] <http://paste.pound-python.org/show/10067/>

[1] <http://paste.pound-python.org/show/10068/>

Edit: just reread about the 1 in 2 billion chance, I'll leave the first test
running longer to make sure.

~~~
lambda_cube
Nice to see some experimentation. :)

I tested all 2^31 non-negative integers, which is 2147483648 values. If I
remember correctly, the value that was wrong was large, probably between 2^30
and 2^31. Java is pretty fast and I think this took tens of minutes. Python is
about 20 times slower so it may take hours for you.

~~~
zxw
Yeh it's fairly slow going. I'm at 3,706,382,752 and am going to call it a
day. Looks like the code works properly.

------
Lambent_Cactus
I hadn't considered #2, the Check All option. I love its refusal to be drawn
into over-engineering.

------
btilly
Most of the solutions explicitly assume 32 bit integers. These days most of us
have 64 bit integers available.

~~~
lambda_cube
Yes, but the two best solutions don't. The other solutions are mostly for
educational purposes, I guess. One of #9 or #10 is the one that should be in
some utility library.

------
rcfox
I think it would be better to list the worst- and average-case asymptotic
analyses of each method, rather than just the run-times. For instance, the
"Decimal-Based Approaches to Checking for Powers of Two" will take longer
depending on the size of your number.

------
mansr
One they missed: (x & -x) == x

~~~
Jabbles
Unfortunately this doesn't work for x=0.

~~~
rcfox
2^-infinity == 0

~~~
Jabbles
And 3 == 2^(log_2(3)).

Clearly we want to know if x is an _integer_ power of 2.

~~~
wnoise
2^(bitsize) == 0.

Edit: Yes, 1 << bitsize is undefined. But unsigned integers actually do have
well-defined semantics on overflow, and multiplying by 2 enough times really
does produce zero.
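
For example:

    
    
      unsigned x = 1;
      int i;
      for (i = 0; i < 32; i++)
          x *= 2;   /* each multiply wraps mod 2^32 */
      /* x == 0 here, for a 32-bit unsigned */
    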

~~~
rcfox
Actually, it's undefined behaviour. <http://blog.regehr.org/archives/213>

------
sharth
Use popcount?

return _mm_popcnt_u64(x) < 2;

------
kqueue
if x is unsigned, then:

(x & (x - 1)) == 0

