Hacker News new | past | comments | ask | show | jobs | submit login
Ten Ways to Check if an Integer Is a Power Of Two in C (exploringbinary.com)
114 points by caustic on July 26, 2011 | hide | past | favorite | 73 comments

For interest rather than practical use:

Creating a 2GiB lookup table is ~10% faster on my machine than method #10. That is, faster on the benchmark in the blog post. A 2GiB lookup table is horrible, has a little setup cost dominated by calloc-ing the array, and would trash caches with real code.

A also made a solution using a union to split x into two shorts and a 64kiB lookup table that was ~15% slower. For more expensive functions, lookup tables are an annoying baseline to beat. (Although still often good to avoid because of dealing with setup, cache problems, etc.)

How were you accessing the lookup table? In a linear or random way? If you did it in a linear way, locality and prefetching will help performance for your lookup table. The great thing about #9 and #10 is that they are just as fast when the sequence of numbers is random. I know you weren't serious about the 2GiB lookup table, but if a lookup table in general should be used as a baseline, the benchmark should probably use random access. Do you agree? (Special applications could use a linear access pattern, of course.)

(Also, you can shrink the size to 1/8 by just using 1 bit instead of one byte, but that would need some more code of course.)

Edit: You can simulate random access by using a stride great enough to avoid the cache. I guess that would be slightly worse than random, but close enough.

I completely agree with all of the above. I was doing a linear scan, and I know that that is artificial. Your stride suggestion does slow things down dramatically; thanks for the simple-to-implement idea.

(Shrinking the lookup table by 1/8, I don't know how to do that fast.)

The annoying thing about lookup tables is that they are hard to benchmark properly. But superficially they often look like a good idea. (Here faster than the fastest reported result, when tested naively.)

> I was doing a linear scan, and I know that that is artificial.

May be artificial :). Don't forget that your application is the best benchmark. If it uses linear access you can take advantage of that. Let's just say that linear access is a special case and random access is a worst case result. The behavior of the random access is the one to remember, IMO.

> Shrinking the lookup table by 1/8, I don't know how to do that fast.

Here is how I do bit vectors. I'm not suggesting that anyone should use a lookup table for this problem since there are very fast solutions that uses O(1) memory, but let's use this as an example since we're all familiar with it.

Since you mention a 2GiB table I guess you use a byte vector. The article uses an unsigned int as the type for the argument which would need 4GiB for all values, I guess you used int instead. We want to use every bit of memory in an array to store boolean flags. It's probably faster to use the native word size of the machine than to use byte size elements, so let's use unsigned int. I assume that we use a 32-bit machine, just like in the article. If you have a 64-bit machine it will be obvious what to change.

We need 2^31 bits (1 << 31 in C), unsigned int is 32 bits and 32 is 2^5 so we need 2^31/2^5 = 2^26 elements.

When looking up things in a bit vector we need to find the right element in the array and the right bit in the array element that holds the boolean flag. To find the right element we divide the input by the number of bits in each element. To find the right bit we do mod (%) by the number of bits in each element and then shift a bit flag by that amount and AND (&) with the array element.

  unsigned table[1 << 26];
  unsigned is_power_of_two(int x)
      if (x > 0) {
          return table[x / 32] & (1 << (x % 32));
      } else {
          return 0;
Since the constant 32 is a power of two, a good C compiler will change the divide and modulo operations to shift and AND. If you use some other language and/or your compiler doesn't optimize that you may want to do that optimization by hand. x / 32 == x >> 5, if x >= 0. x % 32 == x & 31, if x >= 0. NB: It's not that simple for negative numbers!

The lookup table must be populated before use of course, that is left as an exercise for the reader ;-).

(imurray, if you think I explained things you already know, I did it for other readers.)

> The annoying thing about lookup tables is that they are hard to benchmark properly.

If you want some general rule: If you can get the table small, lookup tables are fast, sometimes the fastest solution. But maybe you're application (not an artificial benchmark) doesn't use the function very often and the table won't be in the cache. Then the computation have to slower than fetching memory with a cache miss for the lookup table to be worth it. Also, memory bandwith has been a bottleneck for a long time and it will get even worse. This diminishes the value of lookup tables and trading memory for computation time. As always when it comes to performance and optimizations: benchmark your application with your data. General results may or may not apply in your context.

Edit: If you want the answer to always come out as 0 for false and 1 for true, you can do this instead:

  return (table[x / 32] >> (x % 32)) & 1;

> You can simulate random access by using a stride great enough to avoid the cache.

I think a good pattern is to add a large number co-prime to 2^32 on every iteration. That guarantees you actually hit all cache entries, while picking a "round" number like 1024 underutilizes the cache severely, which is unfair if you were trying to simulate random performance.

edit: Actually, in this case it doesn't really matter since your values aren't expected to be in-cache anyway.

Thank you for that little trick. Even if it doesn't matter for this case I will remember it for later. I know I've read stuff like that before, but I haven't learned enough of that part of discrete math to really understand why (but it seems kind of intuitive).

As a coincidence I got "A Concrete Introduction to Higher Algebra" by Lindsay Childs in the mail today. The concrete part of the book is that he uses properties about integers to introduce and teach concepts about algebra (and then he goes on to polynomials and other stuff). So now I got no excuse to not know that stuff any longer :-).

Great job; that was definitely a point of comparison that was missing from the original article. I'm surprised this actually is faster, but if it is only by 10%, then I still think the memory-less algorithm is preferable.

On modern-day systems main memory access is much slower than the processor, and on multi-core systems (which are the norm these days, though not all cores are always in use, of course) the memory bus can easily become a bottleneck. Benchmarks tend to hide that fact because they are often run on a single core on an otherwise idle system, so they usually have all memory bandwidth and cache memory to themselves, which isn't really representative of real-world systems.

For that reason, I believe that algorithms that do not depend on fast memory access to work efficiently are preferable to those that make use of large memory caches, at least if the performance difference isn't too great.

It seems bizarre to call the first group of methods "decimal based" when the methods don't ever look at the decimal digits of the number being tested. I would call them "arithmetic" or something like that.

Or "representation independent", although the ones that rely on the number being less than INT_MAX obviously rely on some knowledge of the representation.

It's decimal as in base 10, as opposed to binary.

But they aren't in base 10. The constant mentioned might as well have been put into the code in hex or octal and nothing would have changed. (I don't know whether C support binary constants.)

I didn't write the article; I'm just giving the rationale. You don't need to take it out on me.

No offense intended.

     return ((x != 0) && !(x & (x - 1)));
This is beautiful.

Unless x==0 is often true, it's probably better to put the conjuncts in the other order:

  return (!(x & (x-1)) && (x != 0));
because that way you don't have to test x against 0 so often.

(On my machine it appears to be about 10% faster.)

[EDIT to clarify: 10% faster with the particular sample of x values that I tested, which happened to be all the integers from 0 up to 2^30-1 once each. Of course if you only ever call it with x=0 then the original version will be faster. Also, if this is really in your inner loop then you're probably doing something wrong :-).]

However, this way your code enters an undefined state in C, strictly speaken. (as far as I unstand the standard)

So theoretically, an extremely aggressive optimizer would be allowed to generate machine code that doesn't handle the case x==0 properly.

If (x != 0) is placed first, the optimizer wouldn't be allowed to do that, due to short-circuit evaluation.

It's OK if x is of an unsigned integer type. If x is of a signed type, though, you're right: strictly, the value of x-1 is then undefined when x==0.

(In practice, of course, it's perfectly safe unless you're on a distinctly exotic system, and if you are then you probably know you are.)

I disagree.

Why should x-1 for x==0 be undefined for a signed type? The signed type can happily represent -1. Rather, x-1 is undefined for unsigned types because those have no representation of -1.

Yow, what the hell was I thinking when I wrote that? Let me try again.

Preliminary note: I'm looking at a draft version of the C9X standard, because that's what I have to hand.

When x is of an unsigned type, it's OK. (In particular, arithmetic overflow on unsigned types is defined to work modulo (max value + 1).)

When x is of a signed type, the bit-patterns corresponding to negative numbers are left somewhat up for grabs. It has to be either sign+magnitude, or 1's-complement, or 2's-complement. (I'm pretty sure this was not true in earlier versions of the standard.)

However, the result of applying a bitwise logical operator to a negative value is defined to be what you get by applying that logical operation to the bits. In other words, it's undefined (actually, unspecified, which isn't quite the same) what answer you get, but this isn't one of those cases where the standard permits implementations to cause the Moon to explode or pornographic email to be sent to your employer.

In particular, whatever value x&(x-1) has when x==0, it has a value and the whole expression (including the && (x!=0) bit) comes out false regardless.

So, I take back my earlier statement: according to my reading of (a draft of) the (latest) C standard, my version is in fact guaranteed to do the right thing in all implementations, even when x is of signed type.

I repeat that I haven't looked at earlier versions of the standard; I think they were less specific about how signed integer types can be represented, and I wouldn't be surprised if bitwise ops on negative numbers (at least) had entirely undefined behaviour then.

Check out Bit Twiddling Hacks if you'd like to see more: http://graphics.stanford.edu/~seander/bithacks.html

for more equally beautiful expressions, check out "Hacker's Delight".

And fast; Mesa uses this.

  static INLINE boolean
  util_is_power_of_two( unsigned v )
     return (v & (v-1)) == 0;

Returns True when v==0, and yet 0 is not a power of 2.

Further, this technique - correctly applied - is essentially #9 in the list.

According to the C spec, this is expression is undefined for v == 0.

So it might return true, or return false, or start NetHack, or wipe your disk. In other words: Strictly speaken, that function is not allowed to be called for v == 0.

Mesa is a GL renderer. POT only matters in 3D for things like texture sizes. Textures aren't allowed to have zero-sized dimensions. The check for v == 0 occurs far higher up the stack than where this function's used.

We're not writing bad code, we promise. :3

The benchmarks are crap. The loop/function call overhead time dominates. They should list the benchmark for when the implementation is "return x;" That gives a baseline number. On my machine I have to run a few thousand runs before I can distinguish between "return x;" and "return ((x != 0) && !(x & (x - 1)));"

They both take about 10s for 232 iterations.

Recent Intel/AMD CPUs have a POPCNT instruction, which seems like the logical way to do this. Would be interested to see how that performs compared to these implementations.

FWIW, in gcc/g++ there are compiler intrinsics which (should) map to that instruction on CPUs where it's available: __builtin_popcnt, __builtin_popcountl and __builtin_popcountll for unsigned ints, unsigned longs and unsigned long longs respectively. Visual C++ provides equivalent functions for Windows (but I can't remember what they're called).

It does seem odd that the article misses this approach out.

Unfortunately __builtin_popcnt isn't emitting a popcnt instruction with the GCC I've got here, even using -msse4.2. I believe that very recent GCC does get this right.

GCC 4.4 isn't very recent, but it generates a popcnt with -msse4.2.

GCC 4.5, using popcnt on my Core-i7 860 takes the trivial loop mentioned using "Complement and Compare" from ~10.5s to ~7.5s

Also, is there any pure C code that an optimizing compiler could make into a POPCNT instruction? I would be quite impressed if a compiler could recognize some of the snippets in the article, and compile them into a POPCNT and a compare with 1.

Intel has included BSF and BSR (bit scan forward / reverse) since 386, which would presumably also give you the magnitude of the power of 2.

A nasty solution, which assumes and abuses IEEE floating point format and demonstrates a couple of things.

   int isPowerOfTwo (unsigned int x)
       int exponent;
       union { unsigned int u; float f; } tmp;
       tmp.f = x;
       exponent = (tmp.u >> 23) - 127;
       return x == (1 << exponent);
One can also cast to a double without losing precision, mask out the exponent and then compare to 1.0. That solution is even nastier, and needs #define's to deal with endianness.

Obviously the above solution is not a good idea! Amongst the several problems, casting to floats is really slow. Sometimes I store my integers in doubles throughout my code because it saves conversions, and can be more convenient. (Matlab users routinely store integers as doubles.)

What I found interesting/disconcerting was that the above function doesn't compile reliably. When using 'gcc -Wall' I get isPowerOfTwo(0)==0, whereas with 'gcc -Wall -O2' I get isPowerOfTwo(0)==1. clang has the same change in behaviour with optimization levels.

gcc should do the right thing if you add a special case for zero. exponent will be negative in that case. For larger integers you might end up with some false positives with floats though for say 2^30+1.

Thanks, I had confused myself. You're right: shifting with a negative value (or far too big) gives undefined behaviour in C. (The type punning stuff is formally undefined too, a memcpy would be more portable, but gcc promises to make the widely-used union trick work.)

Regarding the second point, the posted code works fine with 1073741825U (the literal for 2^30+1). The algorithm doesn't need the float to keep full precision, because only the exponent is consulted.

The decrement test for a power of two can be modified to count bits and runs in log n time.

  int bits(unsigned n)
     int i = 0;
     while (n > 0) { n &= n-1; i++ } 
     return i;

I haven't tried, but I would expect doing the linear search in the reverse direction will be faster than binary search. It, on average, inspects just two values; binary search does about five.

Here's another. It's strictly inferior to the x&(x-1) one, but the idea that makes it work is so pretty it seems worth mentioning.

  y = x | (x>>1);
  y |= y>>2;
  y |= y>>4;
  y |= y>>8;
  y |= y>>16;
  y |= y>>32; // for 64-bit numbers;
  return y == (x<<1)-1;
So what's going on here? Well, it's easy enough to locate the lowest set bit in x -- it's x & -x, the same basic idea as the x&(x-1) trick -- but what the code above does is to locate the highest by "filling" rightwards from that bit. After the first line, the 1-bits in y are the 1-bits in x, and the bits one to the right of them. After the second, it's the 1-bits in x shifted right by 0..3 places. After the next, 0..7 places. And so on. Eventually, that comes to "all the 1-bits in x, and everything any distance to their right". So, e.g., 000101000100 -> 000111100110 --> 000111111111 which then doesn't change.

The nice thing is that the number of steps this takes goes like log(word width). It's in the same spirit as the 0x33333333 popcount trick.

(On some processors -- all x86 ones since 386, for instance -- there's an instruction that does something similar directly, but on at least some x86 processors it's rather slow.)

Is there a reason this won't work? It's the most 'readable' way I could come up with.

    #(python code)
    def is_power_of_two(n):
        import math

        if n <= 0:
            return False

        power = round(math.log(n, 2))
        return 2 ** power == n

This will probably give the wrong answer for some integer. I have tried something similar in Java. Since it was three years ago my memory is a little hazy. I was working on a parallelizing compiler written in Java (but not for Java) and I saw that the other programmers had used a method similar to yours, it used log anyway. I knew about #9 and #10 and worried that their method was potentially wrong (and also inefficient). To check if it was wrong I coded up something that compared the log-floating point method against #10 for all non-negative integers and the log-floating point method gave the wrong answer for one value (out of 2 billion).

That was Java and your example is in Python, there could be some difference. If you try and compare in Python, please tell us the result.

Here is the code from the first test.[0] It increments a variable and prints a message if their is an inconsistency. I left it running till it reached 1,351,773,471 and didn't come up with any inconsistencies.

I then modified the test[1] to look for inconsistencies where they were most likely to be found, ie ±1 of 2n. I reached n being 1024 before python complained about a 'Result too large'.

[0] http://paste.pound-python.org/show/10067/

[1] http://paste.pound-python.org/show/10068/

Edit: just reread about the 1 in 2 billion chance, I'll leave the first test running longer to make sure.

Nice to see some experimentation. :)

I tested all 2^31 non-negative integers, which is 2147483648 values. If I remember correctly, the value that was wrong was large, probably between 2^30 and 2^31. Java is pretty fast and I think this took tens of minutes. Python is about 20 times slower so it may take hours for you.

Yeh it's fairly slow going. I'm at 3,706,382,752 and am going to call it a day. Looks like the code works properly.

math.log uses floating point arithmetic. That will lead to trouble on large numbers. The article addresses the issue.

I'm not doing power == int(power) though which is what he warns against. I haven't done a lot of testing however as far I can tell it's working.

    >>> is_power_of_two(2 ** 31 - 1)
    >>> is_power_of_two(2 ** 31)
    >>> is_power_of_two(2 ** 31 + 1)

    >>> is_power_of_two(2 ** 548 - 1)
    >>> is_power_of_two(2 ** 548)
    >>> is_power_of_two(2 ** 548 + 1)

Yes. It seems to work for much larger numbers than this on my version of Python, but going via doubles leaves a bad taste.

I hadn't considered #2, the Check All option. I love its refusal to be drawn into over-engineering.

Most of the solutions explicitly assume 32 bit integers. These days most of us have 64 bit integers available.

Yes, but the two best solutions don't. The other solutions are mostly for educational purposes, I guess. One of #9 or #10 is the one that should be in some utility library.

I think it would be better to list the worst- and average-case asymptotic analyses of each method, rather than just the run-times. For instance, the "Decimal-Based Approaches to Checking for Powers of Two" will take longer depending on the size of your number.

One they missed: (x & -x) == x

That was #9:

    int isPowerOfTwo (unsigned int x)
      return ((x != 0) && !(x & (x - 1)));

This works for negative values of x, though.

It should be: ((x>0) && ((x & -x) == x))

This is method 10 in the article, since -x is equivalent to ~x+1.

Unfortunately this doesn't work for x=0.

2^-infinity == 0

And 3 == 2^(log_2(3)).

Clearly we want to know if x is an integer power of 2.

2^(bitsize) == 0.

Edit: Yes, 1 << bitsize is undefined. But unsigned integers actually do have well-defined semantics on overflow, and multiplying by 2 enough times really does produce zero.

Actually, it's undefined behaviour. http://blog.regehr.org/archives/213

2^(bitsize) == 2^(bitsize)

If you meant 1 << (bitsize), that's undefined behaviour in C

So make it x && blah. Do I really have to state the obvious?

You're a computer programmer (probably). Yes you have to state the obvious...


All of the loops could be written recursively, but that'd be rather atypical for solving this problem in C.

Though with gcc and tail recursion it wouldn't make a difference to the generated code.

Use popcount?

return _mm_popcnt_u64(x) < 2;

if x is unsigned, then:

(x & (x - 1)) == 0


Not only did you not bother to read the site, your version gives the wrong answer for x=0.

It is method #9.


Because the methods clearly get more complicated. Starting simply and naively and then improving is a perfectly standard way of teaching.

Articles like these, which provide detailed explanation of the methods involved and results of benchmarks help greatly.

The benchmarks would probably be somewhat misleading in a lot of cases here. I'm fairly certain that most compilers will replace a /2 with >>1 for example, so that isn't really going to end up being reflected when you compile it.

That depends on the CPU used. Modern CPU have more arithmetic units than logical units and can perform more divisions than bit shifts. There was a good video presentation of that, but I can't find the link now.

EDIT: http://blogs.msdn.com/b/shawnhar/archive/2007/03/19/a-story-... gives other reasons why multiplications can be faster than bit shifts.

That's multiplies, not divides. Modern CPUs are still faster at shifts than divides, even the pentium4, which lacked a single-cycle shift unit.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact