/// GetNextWithEqualBitCount
/// How to get the next higher integer with the same number of bits set.
/// This function is supposed (claim by original author) to be able to
/// wrap around and continue properly, but testing indicates otherwise.
/// Do not call this with x = 0; as that causes a divide by 0.
/// Doesn't work for an input of all bits set.
/// Do not assign the result of this to a type of less size than the input argument.
/// This function has certain real-world applications.
/// This function has been called 'snoob' by some.
If the wrap around does not work anyway, you can do the following and have it also work for all zeros and all ones. At least I think so; this is a bit ad hoc and I have not put much thought into it.
If you are willing to accept a conditional expression, this will make the wrap around work; I currently have no good idea how to make it work without a condition.
x = (mask == 0) ? x <<< popcount(x) : x ^ mask // <<< meaning a rotate left
The idea is as follows. To increase the number, you have to toggle some bit from zero to one, and to make the increase as small as possible, that bit should be as far to the right as possible. But to keep the number of ones the same, you also have to toggle one bit from one to zero. This bit must not be to the left of the bit you toggle from zero to one, or the number would decrease; within that constraint, it should be as far to the left as possible to keep the result as small as possible. In essence this means that you have to find the rightmost occurrence of 01 and turn it into 10.
x ^ (x >> 1) finds all the 01 and 10 edges, and edges = x & (x ^ (x >> 1)) keeps only the 01 edges. edge = edges & -edges clears all but the rightmost set bit - the two's complement first inverts all bits, then adding one turns the trailing ones of the inverted value back into zeros until it reaches the rightmost zero, which becomes a one - leaving us with a single set bit marking the rightmost 01 edge. mask = edge | (edge << 1) duplicates that bit into its left neighbor, and this is then used to flip the corresponding two bits with x ^= mask.
Eventually all set bits will end up on the left and there will be no 01 edge leaving x stuck in this state.
zeroOneEdges = x & (x ^ (x >> 1)) // This must be a signed arithmetic shift.
rightmostZeroOneEdge = zeroOneEdges & -zeroOneEdges
toggleMask = rightmostZeroOneEdge | (rightmostZeroOneEdge << 1)
rightmostOne = x & -x
shiftMask = rightmostZeroOneEdge - 1
onesToShift = x & shiftMask
shiftedOnes = onesToShift / rightmostOne // This must be an unsigned division.
x = ((x & ~shiftMask) ^ toggleMask) | shiftedOnes
Now it wraps around without a conditional expression, but just like the one from the library it will divide by zero if called with zero, so either do not call it with zero or add a check for zero and return zero in that case. It works for all ones.
My first attempt was on the right track but also missing a step. Not only must the rightmost 01 edge be turned into 10 but also all the ones to the right of the 01 edge must be shifted back to the right, i.e. <rest>01<ones><zeros> must become <rest>10<zeros><ones>. This shift is done with the division and it is important that it is an unsigned division. It is also important that the 01 edge detection shift is a signed arithmetic shift.
The algorithm from the library and my solution are quite similar and they might actually be identical with the one from the library being more optimized and shorter. I would not be surprised if the one from the library actually also works in principle but the implementation got a tiny detail about the signedness of one of the operations wrong, because that will result in the described issues, i.e. not working properly once the sign bit gets involved, either for all ones or during the wrap around.
EDIT: I had a close look at the algorithm from the library and it is indeed broken - in order to work as intended, the shift in ones = (ones >> 2) / smallest would have to sometimes perform a logical shift and sometimes an arithmetical shift depending on the circumstances. This should be fixable.
EDIT: Combining the ideas from both, this is my current best solution that handles everything including the wrap around properly, besides zero.
rightmostOne = x & -x
zeroOneEdges = x & (x ^ (x >> 1)) // This must be a signed arithmetic shift.
rightmostZeroOneEdge = zeroOneEdges & -zeroOneEdges
shiftedOnes = (x & (rightmostZeroOneEdge - 1)) / rightmostOne // This must be an unsigned division.
x = (x + rightmostOne) | shiftedOnes
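For concreteness, here is a sketch of that listing as real code, assuming 32-bit unsigned values and a platform where converting an out-of-range unsigned value to a signed type wraps and where right-shifting a negative signed integer sign-extends (true on the mainstream compilers); the function name and the comments are mine.

#include <cstdint>

// Sketch of the combined version above; precondition: x != 0,
// otherwise the division below is a division by zero.
uint32_t NextWithEqualBitCount(uint32_t x) {
    uint32_t rightmostOne = x & (0u - x);
    // The 01 edge detection needs an arithmetic shift, hence the signed cast.
    uint32_t zeroOneEdges = x & (x ^ (uint32_t)((int32_t)x >> 1));
    uint32_t rightmostZeroOneEdge = zeroOneEdges & (0u - zeroOneEdges);
    // Shifting the remaining ones back to the right needs an unsigned division.
    uint32_t shiftedOnes = (x & (rightmostZeroOneEdge - 1)) / rightmostOne;
    // The unsigned overflow in x + rightmostOne is what makes the wrap around work.
    return (x + rightmostOne) | shiftedOnes;
}

For example, NextWithEqualBitCount(0xE0000000u) wraps around to 0x7, and an input of all ones maps back to all ones.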
Super cool, I'll have to have a closer look sometime. To handle zero, if you have ctz, you can replace the divide with two bitwise shifts. This may sometimes be faster than the divide even if you don't have ctz.
Two shifts? I think ctz plus a single shift should do.
shiftedOnes = (x & (rightmostZeroOneEdge - 1)) >> ctz(x) // This must be a logical shift.
With all the fancy bit level instructions, this could probably be simplified in general, but I usually avoid them because I mostly see this as an exercise in making it work if you only have access to the basic arithmetic and logic operations and shifts.
You need two shifts in C++ because shifting a uint32_t by 32 is UB. Every actual computer probably gives you 0 if you shift 0 right by a too-large amount though...
I have no idea if this would be worth doing. One could also use a conditional move like CMOV on x86, if available on the target architecture, to avoid a jump. On the one hand I have the gut feeling that there should be a nice trick to do this; on the other hand this seems to require either a population count dependent shift or rotation, or reversing the word, which also has no really simple trick as far as I know and is usually done with a logarithmic number of shifts.
That appears to be a direct C++ translation of item 175 from HAKMEM [1]. This does not require a division and can be done much faster, see section 1.24 of [2].
The division performs a right shift, the version given in the FXT book replaces it with a while loop and has the note that one could use a bit scan and a shift. I would assume bit scan and shift is the fastest option - if accessible from the language used - but I am not sure if the loop could beat the division. Also the version from the EA library claims to - but fails to - perform the wrap around, the version in the FXT book does not try to wrap around and I would guess the version in HAKMEM also does not wrap around properly.
It can be used to enumerate all subsets of size N from a set of size M. Start with (1 << N) - 1, and then repeatedly call this function while the value is still less than 1 << M. Every subset of size N will be visited, in order.
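As a sketch, assuming a NextWithEqualBitCount function that behaves as discussed in this thread (e.g. the one sketched in an earlier comment), this lists all C(5,3) = 10 subsets of size 3 drawn from a set of size 5:

#include <cstdint>
#include <cstdio>

int main() {
    const unsigned N = 3, M = 5;
    // Start at the smallest value with N bits set; stop once a bit at position M or higher appears.
    for (uint32_t v = (1u << N) - 1; v < (1u << M); v = NextWithEqualBitCount(v)) {
        std::printf("0x%02X\n", v); // 0x07, 0x0B, 0x0D, 0x0E, ..., 0x1C
    }
    return 0;
}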
Years ago, I had a similar need and I consulted TAOCP to find the “standard” algorithm to do so, which is done by a binomial tree. The tree T is defined recursively as T(0) = nil and T(n) = [Cons(0, T(0)), Cons(1, T(1)), ..., Cons(n-1, T(n-1))]. There's no need to construct the tree explicitly; it's just a conceptual formalism that helps formulate an algorithm to traverse the tree.
The text did mention using bitwise operations to do so; in fact it says it presents a “remarkable sequence of seven bitwise operations that will convert any binary number to the lexicographically next t-combination” but I haven't been able to find those seven operations.
Take a 64-bit number and treat it like a bit field where each bit represents inclusion/exclusion of an element in a set S of size 64. Thus every number with M 1s represents a (distinct) M-sized subset of S.
Every bit in a bitfield represents a "thing". The bitfield represents a collection of distinct things. If bit "i" is 1, then the corresponding thing "i" is in the collection.
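As a small illustration of that convention (the element numbers are arbitrary), the usual set operations fall straight out of the bitwise operators:

#include <cstdint>

constexpr uint64_t a = (1ull << 3) | (1ull << 7);   // the set {3, 7}
constexpr uint64_t b = (1ull << 7) | (1ull << 42);  // the set {7, 42}

constexpr uint64_t both   = a & b;        // intersection: {7}
constexpr uint64_t either = a | b;        // union: {3, 7, 42}
constexpr bool     has3   = (a >> 3) & 1; // membership test: true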
This can be used to generate all bit sequences over a particular range that have the same Hamming weight. I have needed that before when testing error-correction code decoders to make sure they properly correct all of the possible error patterns that they should be able to.
Chess engine stuff, at the very least. With an integer mask representing an 8x8 board and 1s representing a single type of piece, you can find all subsets of pawn positions etc.
I've used this to enumerate grids with a certain number of open and closed cells (0 and 1 bits). It is simply the next permutation of a bit string. C++ offers this in its standard lib, but that version is slower by a factor of 1000 or so.
I've also tested this with arbitrary bit-width numbers (LLVM can do that) and it worked fine. I've not checked if the code is exactly the same though.
I'm going out on a limb to say "bloom filters", so it has applications in fast filtering, seen-counts, DB stuff like parallelism, discrete-different buckets, partially ordered sets being applied in a stream of events...
In addition to the stated problems with wraparound and zero (claimed fixed here), the implementation in this header file seems to have a potential correctness issue if you pass in a signed type and end up performing right shifts on values with the sign bit set, and your compiler decides this means you want sign extension.
Edit: translation back to C++ using other functions in this header and assuming that a bt_unsigned exists which is similar to the already included bt_signed:
template <typename T>
inline T GetNextWithEqualBitCount(T x_) {
    bt_unsigned x, smallest, ripple, ones;
    int tz, pc;
    x = (bt_unsigned)x_;
    tz = CountTrailing0Bits(x);
    pc = CountBits(x);
    if (tz + pc == sizeof(x)*8) {
        // All of the set bits are already at the top (or x is zero): wrap around.
        if (pc == 0) {
            return 0; // x >> tz would be a shift by the full width, which is undefined.
        }
        return (T)(x >> tz); // Move the ones back down to the bottom.
    }
    smallest = ((bt_unsigned)1) << tz;
    ripple = x + smallest;            // Carry the lowest run of ones one position up.
    ones = ((x ^ ripple) >> 2) >> tz; // The changed bits, shifted back to the bottom.
    return (T)(ripple | ones);
}
The 2nd branch here (the pc == 0 check) only exists because shifting a uint32_t right by 32 is undefined. You can avoid it if you do 2 shifts instead. The first branch is tougher to get rid of.
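A sketch of the two-shift idea (the helper name is mine): split the shift so that neither individual count can reach the type width, and the pc == 0 special case then falls out naturally, since 0 shifted right by anything is 0.

#include <cstdint>

// Right shift by n for n in [0, 32] without ever shifting a uint32_t by 32;
// each individual shift count is at most 16.
inline uint32_t ShiftRightUpTo32(uint32_t x, int n) {
    return (x >> (n / 2)) >> (n - n / 2);
}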
You could keep incrementing from the current value until the population count matches again, but in the worst case - probably from 2^62 to 2^63 with a single set bit, or wrapping around from 2^63 to 2^0 - that would still take on the order of 2^63 steps.
Seeing the project name EAStdC I thought for a moment this would be the C counterpart to EASTL (i.e. a modern replacement for the C standard library), but alas, it's all written in C++ and intended to be used from C++ (i.e. no C interfaces).
For anyone interested in this sort of stuff who isn't yet aware of the 'bit twiddling hacks' collection:
(disclaimer: I used to work on these EA* libraries)
The intention of EAStdC wasn't really to replace the platform libc but to augment it with portable implementations that had consistent behavior across the different platforms, as well as have some opt-in performance improvements (like a memcpy that used VMX), while also not polluting the global (C) namespace.
I wonder how many of these have actually been used vs someone thought of a cool bitwise trick for a problem that doesn't exist. Can anyone think of a use-case for TurnOffLowestContiguousBits?
I'm working on a game written in m68k Assembly. In cases where I'm trying to make every CPU cycle count, these kinds of novel bitwise operations really catch my interest. TurnOffLowestBit seems useful for an effect I've had in mind for a while: A graphic gets "erased" one pixel at a time from right to left, but with some random variance for each horizontal line. Reliably turning off the lowest bit might be a good solution. Maybe throwing in TurnOffLowestContiguousBits would help to introduce some variance in a visually interesting way.
Or maybe it wouldn't. But in order for me to consider these operations as potential solutions to a problem, I have to know they exist. Even if they beckon for me to invent a problem they can solve, I'm totally open to that kind of creative inspiration.
"Check out this neat bitwise thing you can do in only 12 CPU cycles!" That's definitely the jam of Assembly and/or demoscene coders, haha.
Obviously a very, very gross generalization (might get called out on it too :P) - but I think the low-hanging fruit for modern software is optimizing data layouts for fast processing.
Optimizing code requires a lot of work, and requires skills that people usually learn from experience - making it a rare and expensive skillset. Data-Oriented Design can be taught, and requires (IMO of course) a far less technical skillset.
This is very true. The towers of template and OO class hierarchies typical in C++ codebases make it impossible to control the memory layout. Polymorphism and vtables are such killers for performance. I think general programming can learn a lot of lessons from HPC and game engines.
This is interesting. It looks like it's returning the (bias-corrected) value of the mantissa.
I'm surprised that there isn't a function supplied by the compilers or in the processor that takes an int and returns the mantissa. After all, every int -> float conversion requires this step.
Modern CPUs have a count-leading-zeros instruction to calculate floor(log2(x)) efficiently. GCC and clang expose this with the __builtin_clz* family of intrinsics. MSVC exposes this with _BitScanReverse.
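For example, a floor(log2(x)) along those lines might look like this (a sketch for GCC/Clang; on MSVC one would use _BitScanReverse instead):

#include <cstdint>

// floor(log2(x)) via count-leading-zeros; undefined for x == 0, since the intrinsic is.
inline int FloorLog2(uint32_t x) {
    return 31 - __builtin_clz(x);
}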
Take a look at the C header <ieee754.h>. It's not on all systems but it provides a union for type punning between a double and its decomposed parts. IEEE754_DOUBLE_BIAS is provided for bias correction. The same are available for floats.
(Unfortunately, the "not on all systems" part is why I don't use it, but instead copy the union and the constant verbatim into the project if I need it.)
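Usage is roughly like this, assuming the glibc definitions of the union fields and the bias constant:

#include <ieee754.h>
#include <cstdio>

int main() {
    union ieee754_double u;
    u.d = 6.5; // 1.625 * 2^2
    int exponent = (int)u.ieee.exponent - IEEE754_DOUBLE_BIAS; // unbiased exponent: 2
    std::printf("exponent = %d\n", exponent);
    return 0;
}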
I think it may be a GNU extension to explicitly allow it, but I'm no language lawyer. This header is from glibc, which may explain why it's not available on some platforms.
I think you're right. Most of the top Google results seem to be from people who can't mentally separate C and C++. So when I tried to look it up around the time of the comment, I didn't get clear answers.
Are any of these still needed, though? I kinda assumed they would be like Duff's Device and be a nifty historical relic, but firmly in the "Let the compiler do this for you" category in modern C.
Many of those are not standard functions (or were not, until very recently). You have to implement them somehow, so having a ready-made collection that also performs well is very nice to have.
What code specifically do you write instead of the code in TFA, in order to have the optimizer generate the code in TFA? In general compiling with optimizations enabled has not produced particularly good code in my experience.
The general use-case is "high-performance and/or embedded systems." Games, and especially console games, happen to check both of those boxes. Another high-perf case is financial systems (eg: high-frequency trading). Other embedded systems would include things like robotics, automotive software, audio synthesizers and processing, etc.
Packing data into bits is a space optimization. On modern processors, accessing memory is hideously slow compared to doing logical operations, so if you can pack more info into fewer bits, you get higher information density, and need to access main RAM less often. Thus, it speeds things up, sometimes dramatically.
Sorry, I don't have reading material per se.
But just as an anecdotal example: I was using some of this trickery to implement a densely-packed Quadtree using Morton Codes. Morton Codes interleave the bits of the X and Y coordinates for a cell on a grid. So, in order to break an M-code apart and combine the parts back together, you need to "select the even bits" or "select the odd bits", which can be done with some tricks like these.
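For instance, the recombining direction can be done with the classic magic-number bit spreading (a sketch; the function names are mine):

#include <cstdint>

// Spread the lower 16 bits of v so that they occupy the even bit positions.
inline uint32_t Part1By1(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Morton code: x ends up in the even bits, y in the odd bits.
inline uint32_t Morton2D(uint32_t x, uint32_t y) {
    return Part1By1(x) | (Part1By1(y) << 1);
}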
Thanks for the resources, these are great! As an avid chess fan I've always played around with the idea of trying to make a basic chess engine. This might just be the push!!
In embedded programming it can be useful to control an animation on an LED matrix, since a lot of LED drivers and multiplexers enable pins based on the bits passed through their respective drivers. Example of how you might animate 8 LEDs connected to 8 pins:
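Something along these lines - a single lit LED stepping across the eight pins, one state per frame:

10000000
01000000
00100000
00010000
00001000
00000100
00000010
00000001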
You could turn that into an array to animate an LED going from left-to-right or you could just write a simple function that bit-shifts your current state to the right.
It's also useful in scrolling text: Imagine that 8x8 above contains a character like O. You could animate the O moving to the left by shifting the bits in each row to the left by one over and over again (according to whatever speed you've defined).
Anywhere that a byte/word/dword is used as a first class data structure, there is likely some operation on the data structure that can be accomplished with a bit trick. As others have said, this is most often the case where CPU cycles are tightly constrained.
You end up needing this stuff to write fast parsers of text formats, but you need a bunch of other stuff too, and there isn't really any introductory material online.
I just got a copy of Hacker's Delight, which seems to show the same sorts of tricks. I haven't read very much of it yet, but I wonder how many of these came from that book or a common source. It would be fun to read more about the history of these tricks.
Question from a frontend dev: Why can't the compiler apply these tricks automatically? Are there assumptions they're making that don't necessarily hold?
Clang and GCC can do some amazing optimizations. However, these types of tricks often come into play as a result of clever data representation, i.e., instead of representing data as a bunch of objects in an array, you’ve figured out how to represent it, or a subset, in some cleverly packed bits and/or ints (and then can reap the benefit of these tricks).
It’s trivial for a compiler to see (x / 2) and substitute (x >> 1) but it is much harder for a compiler to go from seeing that you have a bunch of Foo objects and realizing that they could be better represented as clever bit vectors.
Yes, this library is of uneven quality, and would benefit from a few hours of focused attention from a specialist. E.g. the various HighestBit functions below what you've called out also look inefficient relative to something using one of the builtin_clz intrinsics, even though there's an earlier use of builtin_clz...
I think a lot of times the way a header like this comes into being is that someone has a narrow need for an operation that is deemed reusable and abstract. The 64 bit version of a 32 bit bit utility doesn't have a need at the time. Then somebody comes in later trying to fill out more stuff.
On many platforms a shift by more than 31 bits often did nothing, even though mathematically it should do something. Intel, for example, masks the shift count to 5 bits for 32-bit operands, breaking a lot of 64-bit compiler hacks.
There is also the issue of cache and speed - expanding to full size would have different footprints, so perhaps they tested that too.
This is likely the result of working around those problems.