EABitTricks.h (github.com/electronicarts)
283 points by mmastrac on Dec 19, 2022 | 107 comments



My favorite so far

    /// GetNextWithEqualBitCount
    /// How to get the next higher integer with the same number of bits set.
    /// This function is supposed (claim by original author) to be able to 
    /// wrap around and continue properly, but testing indicates otherwise.
    /// Do not call this with x = 0; as that causes a divide by 0.
    /// Doesn't work for an input of all bits set.
    /// Do not assign the result of this to a type of less size than the input argument.
    /// This function has certain real-world applications.
    /// This function has been called 'snoob' by some.


If the wrap around does not work anyway, you can do the following and have it also work for all zeros and all ones. At least I think so; this is a bit ad hoc, without putting too much thought into it.

  edges = x & (x ^ (x >> 1))
  edge  = edges & -edges
  mask  = edge | (edge << 1)
  
  x ^= mask
If you are willing to accept a conditional expression, this will make the wrap around work; I currently have no good idea how to make it work without a condition.

  x = (mask == 0) ? x <<< popcount(x) : x ^ mask
The idea is as follows. To increase the number, you have to toggle some bit from zero to one, and to make the increase as small as possible, it should be as far to the right as possible. But to keep the number of ones the same, you also have to toggle one bit from one to zero. This one must not be to the left of the bit you toggle from zero to one, or the number would decrease; otherwise it should be as far to the left as possible to keep the increase as small as possible. In essence this means that you have to find the rightmost occurrence of 01 and turn it into 10.

x ^ (x >> 1) finds all the 01 and 10 edges, and edges = x & (x ^ (x >> 1)) keeps only the 01 edges. edge = edges & -edges clears all but the rightmost set bit - the two's complement first inverts all bits, then adding one turns all the ones on the right into zeros until it reaches the rightmost zero, which gets turned into a one - leaving us with one set bit indicating the rightmost 01 edge. mask = edge | (edge << 1) duplicates that bit, and this is then used to flip the corresponding two bits with x ^= mask.

Eventually all set bits will end up on the left and there will be no 01 edge leaving x stuck in this state.


THE ALGORITHM IS WRONG.

But I can no longer edit it. 1110 should become 10011 but my »solution« yields 10110. Will have to rethink this.


And here is the fixed version, I hope.

  zeroOneEdges         = x & (x ^ (x >> 1)) // This must be a signed arithmetic shift.
  rightmostZeroOneEdge = zeroOneEdges & -zeroOneEdges
  toggleMask           = rightmostZeroOneEdge | (rightmostZeroOneEdge << 1)
   
  rightmostOne = x & -x
  shiftMask    = rightmostZeroOneEdge - 1;
  onesToShift  = x & shiftMask
  shiftedOnes  = onesToShift / rightmostOne // This must be an unsigned division.
  
  x = ((x & ~shiftMask) ^ toggleMask) | shiftedOnes
Now it wraps around without a conditional expression but, just like the one from the library, it will divide by zero if called with zero, so either do not do that or add a check for zero and return zero in that case. It works for all ones.

My first attempt was on the right track but also missing a step. Not only must the rightmost 01 edge be turned into 10 but also all the ones to the right of the 01 edge must be shifted back to the right, i.e. <rest>01<ones><zeros> must become <rest>10<zeros><ones>. This shift is done with the division and it is important that it is an unsigned division. It is also important that the 01 edge detection shift is a signed arithmetic shift.

The algorithm from the library and my solution are quite similar and they might actually be identical with the one from the library being more optimized and shorter. I would not be surprised if the one from the library actually also works in principle but the implementation got a tiny detail about the signedness of one of the operations wrong, because that will result in the described issues, i.e. not working properly once the sign bit gets involved, either for all ones or during the wrap around.

EDIT: I had a close look at the algorithm from the library and it is indeed broken - in order to work as intended, the shift in ones = (ones >> 2) / smallest would have to sometimes perform a logical shift and sometimes an arithmetical shift depending on the circumstances. This should be fixable.

EDIT: Combining the ideas from both, this is my current best solution that handles everything including the wrap around properly, besides zero.

  rightmostOne         = x & -x
  zeroOneEdges         = x & (x ^ (x >> 1)) // This must be a signed arithmetic shift.
  rightmostZeroOneEdge = zeroOneEdges & -zeroOneEdges
  shiftedOnes          = (x & (rightmostZeroOneEdge - 1)) / rightmostOne // This must be an unsigned division.

  x = (x + rightmostOne) | shiftedOnes
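In case it is useful to anyone, here is a minimal C sketch of that pseudocode with the signedness spelled out (snoob_next is just a placeholder name, not the library's function, and the cast plus arithmetic right shift relies on the usual two's-complement behavior):

  #include <stdint.h>
  #include <stdio.h>

  // Combined algorithm from above; x must not be 0 (division by zero otherwise).
  static uint32_t snoob_next(uint32_t x)
  {
      uint32_t rightmostOne         = x & (0u - x);
      // The edge-detecting shift must be arithmetic, hence the signed cast.
      uint32_t zeroOneEdges         = x & (x ^ (uint32_t)((int32_t)x >> 1));
      uint32_t rightmostZeroOneEdge = zeroOneEdges & (0u - zeroOneEdges);
      // The division must be unsigned, which uint32_t operands guarantee.
      uint32_t shiftedOnes          = (x & (rightmostZeroOneEdge - 1u)) / rightmostOne;
      return (x + rightmostOne) | shiftedOnes;
  }

  int main(void)
  {
      printf("%x\n", snoob_next(0x0000000Eu)); // 1110 -> 10011 (0x13)
      printf("%x\n", snoob_next(0xE0000000u)); // wraps around: prints 7 (111)
      return 0;
  }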


Super cool, I'll have to have a closer look sometime. To handle zero, if you have ctz, you can replace the divide with two bitwise shifts. This may sometimes be faster than the divide even if you don't have ctz.


Two shifts? I think ctz plus a single shift should do.

  shiftedOnes = (x & (rightmostZeroOneEdge - 1)) >> ctz(x) // This must be a logical shift.
With all the fancy bit level instructions, this could probably be simplified in general, but I usually avoid them because I mostly see this as an exercise in making it work if you only have access to the basic arithmetic and logic operations and shifts.


You need two shifts in C++ because shifting a uint32_t by 32 is UB. Every actual computer probably gives you 0 if you shift 0 right by a too-large amount though...
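Something like this, as a sketch (ctz32 and shifted_ones are just illustrative names, not from the library):

  #include <stdint.h>

  // Trailing-zero count that yields 32 for zero; __builtin_ctz itself is
  // undefined at zero, hence the fallback branch in this portable sketch.
  static unsigned ctz32(uint32_t v)
  {
      return v ? (unsigned)__builtin_ctz(v) : 32u;
  }

  // Splitting the shift keeps each count at most 16, so the uint32_t shift
  // stays defined even when ctz32(x) == 32, i.e. when x == 0.
  static uint32_t shifted_ones(uint32_t x, uint32_t rightmostZeroOneEdge)
  {
      unsigned tz = ctz32(x);
      return ((x & (rightmostZeroOneEdge - 1u)) >> (tz / 2)) >> (tz - tz / 2);
  }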


Fortunately my encounters with C and C++ have been rare and brief.


> I currently have no good idea how to make it work without a condition.

Would this work? Or not worth computing both values each time?

    x = ((mask == 0) * x <<< popcount(x)) | ((mask != 0) * x ^ mask)


I have no idea if this would be worth doing. One could also use a conditional move like CMOV on x86, if available on the target architecture, to avoid a jump. On the one hand I have a gut feeling that there should be a nice trick to do this; on the other hand it seems to require either a population count dependent shift or rotation, or reversing the word, which also has no really simple trick as far as I know and is usually done with a logarithmic number of shifts.


Full marks for honesty, IMO. It'd be tempting to delete this before making it publicly visible.


That appears to be a direct C++ translation of item 175 from HAKMEM [1]. This does not require a division and can be done much faster, see section 1.24 of [2].

[1] http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item175

[2] https://jjj.de/fxt/fxtpage.html#fxtbook


> This does not require division...

From the link:

IDIVM D,C


"This" as in finding the next combination. Gosper's trick does.


The division performs a right shift, the version given in the FXT book replaces it with a while loop and has the note that one could use a bit scan and a shift. I would assume bit scan and shift is the fastest option - if accessible from the language used - but I am not sure if the loop could beat the division. Also the version from the EA library claims to - but fails to - perform the wrap around, the version in the FXT book does not try to wrap around and I would guess the version in HAKMEM also does not wrap around properly.
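For comparison, a quick sketch of the bit-scan-and-shift variant (not the FXT book's exact code; it assumes x != 0, no wrap-around, and that the next combination still fits in 32 bits, otherwise the shift count itself overflows):

  #include <stdint.h>

  static uint32_t next_same_popcount(uint32_t x)
  {
      uint32_t smallest = x & (0u - x);                           // rightmost set bit
      uint32_t ripple   = x + smallest;                           // carry into the zero above the low run of ones
      uint32_t ones     = (x ^ ripple) >> (2 + __builtin_ctz(x)); // shift the remaining ones back to the right
      return ripple | ones;
  }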


I wonder what the real-world applications for this are


It can be used to enumerate all subsets of size N from a set of size M. Start with (1 << N) - 1, and then repeatedly call this function until the value reaches or exceeds 1 << M. Every subset of size N will be visited, in order.
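A sketch of that loop, using Gosper's hack from HAKMEM rather than the library's function (the wrap-around issue discussed above never comes into play because the loop stops first):

  #include <stdint.h>
  #include <stdio.h>

  // Print all size-3 subsets of a size-5 set as bitmasks.
  int main(void)
  {
      const unsigned N = 3, M = 5;
      uint32_t x = (1u << N) - 1;                 // smallest subset: lowest N bits set
      while (x < (1u << M)) {
          printf("%u\n", x);                      // 7, 11, 13, 14, 19, 21, 22, 25, 26, 28
          uint32_t smallest = x & (0u - x);       // rightmost set bit
          uint32_t ripple   = x + smallest;       // propagate the carry
          uint32_t ones     = ((x ^ ripple) >> 2) / smallest; // ones shifted back right
          x = ripple | ones;
      }
      return 0;
  }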


That's certainly a very smart way of doing so.

Years ago, I had a similar need and I consulted TAOCP to find the “standard” algorithm to do so, which is done by a binomial tree. The tree T is defined recursively as T(0) = nil and T(n) = [Cons(0, T(0)), Cons(1, T(1)), ..., Cons(n-1, T(n-1))]. There's no need to construct the tree explicitly; it's just a conceptual formalism that helps formulate an algorithm to traverse the tree.

The text did mention using bitwise operations to do so; in fact it says it presents a “remarkable sequence of seven bitwise operations that will convert any binary number to the lexicographically next t-combination” but I haven't been able to find those seven operations.


Can you explain what you mean by "set" and "subset"? What do they contain? How they're represented? I can't make heads or tails of your answer as-is.


Take a 64 bit number and treat it like a bit field where each bit represents inclusion/exclusion of an element in a set S of size 64. Thus every number with M 1s represents a (distinct) M-sized subset of S


Every bit in a bitfield represents a "thing". The bitfield represents a collection of distinct things. If bit "i" is 1, then the corresponding thing "i" is in the collection.
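A tiny sketch of that representation (the names here are just illustrative):

  #include <stdint.h>
  #include <stdbool.h>

  typedef uint64_t Set;  // up to 64 "things", one bit per element

  static bool contains(Set s, unsigned i)  { return (s >> i) & 1u; }
  static Set  with(Set s, unsigned i)      { return s | ((uint64_t)1 << i); }
  static Set  set_union(Set a, Set b)      { return a | b; }
  static Set  intersection(Set a, Set b)   { return a & b; }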


Iterates through all possible N-combinations of a set that has M elements?


This can be used to generate all bit sequences over a particular range that have the same Hamming weight. I have needed that before when testing error-correction code decoders to make sure they properly correct all of the possible error patterns that they should be able to.


Chess engine stuff, at the very least. With an integer mask representing an 8x8 board and 1s representing a single type of piece, you can find all subsets of pawn positions etc.


It's effectively filtering by popcount [0] and crops up quite a bit.

[0] https://vaibhavsagar.com/blog/2019/09/08/popcount/


I misunderstood the task


I've used this to enumerate grids with a certain number of open and closed cells (0 and 1 bits). It is simply the next permutation of a bit string. C++ offers this in its standard lib, but that version is slower by a factor of 1000 or so.

I've also tested this with arbitrary bit-width numbers (LLVM can do that) and it worked fine. I've not checked if the code is the exact same tho.


>slower by factor of 1000

It's also for generic iterables.


I'm going out on a limb and saying "bloom filters", so it has applications in fast filtering, seen-count, DB stuff like parallelism, discrete-different buckets, partially ordered sets being applied in a stream of events...


snoob - I want to know why



It stands for "same number of one bits".


What is the correct implementation of this?


I think something like this would handle it

  const std = @import("std");
  
  fn snoob(x_: anytype) @TypeOf(x_) {
    const T = @TypeOf(x_);
    const U = comptime std.meta.Int(.unsigned, std.meta.bitCount(T));
    const x = @bitCast(U, x_);
    const tz = @ctz(x);
    const pc = @popCount(x);
    const L2U = comptime std.meta.Int(.unsigned, std.meta.bitCount(@TypeOf(tz))-1);
    if (tz + pc == std.meta.bitCount(T)) {
      if (pc == 0) {
        return 0;
      }
      return @bitCast(T, (x >> @intCast(L2U, tz)));
    }
    const smallest = @as(U, 1) << @intCast(L2U, tz);
    const ripple = x + smallest;
    const ones = ((x ^ ripple) >> 2) >> @intCast(L2U, tz);
    return ripple | ones;
  }
  
  pub fn main() void { std.debug.print("Hello, {d}!", .{snoob(@as(u32,5))}); }
In addition to the stated problems with wraparound and zero (claimed fixed here), the implementation in this header file seems to have a potential correctness issue if you pass in a signed type and end up performing right shifts on values with the sign bit set, and your compiler decides this means you want sign extension.

Edit: translation back to C++ using other functions in this header and assuming that a bt_unsigned exists which is similar to the already included bt_signed:

  template <typename T>
  inline T GetNextWithEqualBitCount(T x_) {
    bt_unsigned x, smallest, ripple, ones;
    int tz, pc;
    x = (bt_unsigned)x_;
    tz = CountTrailing0Bits(x);
    pc = CountBits(x);
    if (tz + pc == sizeof(x)*8) {
      if (pc == 0) {
        return 0;
      }
      return (T)(x >> tz);
    }
    smallest = ((bt_unsigned)1) << tz;
    ripple = x + smallest;
    ones = ((x ^ ripple) >> 2) >> tz;
    return (T)(ripple | ones);
  }
runnable demo: https://www.sololearn.com/compiler-playground/c6iz3kHpZ3zP


The 2nd branch here only exists because shifting a uint32_t right by 32 is undefined. You can avoid it if you do 2 shifts instead. The first branch is tougher to get rid of.


Have a look at popcount[0]. It's even a single instruction on many platforms [1].

[0] https://en.cppreference.com/w/cpp/numeric/popcount

[1] https://vaibhavsagar.com/blog/2019/09/08/popcount/


This gives you the number of set bits, not the next larger number with the same number of set bits.


Filter larger, zip Popcount, filter with same popcount, min.

If you need the full implementation spelled out I can do that when I get in from my commute home.


Step 1 is computing a set of ~2^64 things?


You could keep incrementing from the current value until the population count matches again, but in the worst case - probably from 2^62 to 2^63 with a single set bit or wrapping around from 2^63 to 2^0 - that would still take pretty much 2^64 steps.
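Something like this sketch, in other words - correct, but hopeless for large gaps:

  #include <stdint.h>

  // Walk forward one value at a time until the population count matches again.
  static uint64_t next_same_popcount_naive(uint64_t x)
  {
      int target = __builtin_popcountll(x);   // GCC/Clang intrinsic
      uint64_t y = x + 1;                     // wraps to 0 past the top, as described
      while (__builtin_popcountll(y) != target)
          y++;
      return y;
  }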


Lol, just got in and looked at this at a desktop - I thought we were filtering a list. My bad. Funny though.

Interesting task I guess really you want to find the first (from right) and move the first set bit left to fill it.


Seeing the project name EAStdC I thought for a moment this would be the C counterpart to EASTL (i.e. a modern replacement for the C standard library), but alas, it's all written in C++ and intended to be used from C++ (i.e. no C interfaces).

For anyone interested in this sort of stuff who isn't yet aware of the 'bit twiddling hacks' collection:

https://graphics.stanford.edu/~seander/bithacks.html

Also regarding the bit counting function:

    inline int CountBits(uint32_t x) // Branchless version
    { 
        #if defined(EASTDC_SSE_POPCNT)
            return _mm_popcnt_u32(x);
        #else
            x = x - ((x >> 1) & 0x55555555);
            x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
            x = (x + (x >> 4)) & 0x0F0F0F0F;
            return (int)((x * 0x01010101) >> 24);
        #endif
    }
Some compilers go as far as detecting the 'bit twiddling' version and replacing it with a popcnt instruction:

https://www.godbolt.org/z/bExMvfzEx


(disclaimer: I used to work on these EA* libraries)

The intention of EAStdC wasn't really to replace the platform libc but to augment it with portable implementations that had consistent behavior across the different platforms, as well as to offer some opt-in performance improvements (like a memcpy that used VMX), while also not polluting the global (C) namespace.



Wow! That's quite surprising to me (the godbolt link). Very cool!


There's this gem in EAProcess.h

  /// ExecuteShellCommand("su root\nrm /* -r");

https://github.com/electronicarts/EAStdC/blob/master/include...


No -rf ?


For those interested in this kind of stuff, there is the FXT library [1] and the accompanying book [2] by Jörg Arndt.

[1] https://www.jjj.de/fxt/

[2] https://www.jjj.de/fxt/fxtbook.pdf


I wonder how many of these have actually been used, versus someone thinking of a cool bitwise trick for a problem that doesn't exist. Can anyone think of a use-case for TurnOffLowestContiguousBits?


I'm working on a game written in m68k Assembly. In cases where I'm trying to make every CPU cycle count, these kinds of novel bitwise operations really catch my interest. TurnOffLowestBit seems useful for an effect I've had in mind for a while: A graphic gets "erased" one pixel at a time from right to left, but with some random variance for each horizontal line. Reliably turning off the lowest bit might be a good solution. Maybe throwing in TurnOffLowestContiguousBits would help to introduce some variance in a visually interesting way.

Or maybe it wouldn't. But in order for me to consider these operations as potential solutions to a problem, I have to know they exist. Even if they beckon for me to invent a problem they can solve, I'm totally open to that kind of creative inspiration.

"Check out this neat bitwise thing you can do in only 12 CPU cycles!" That's definitely the jam of Assembly and/or demoscene coders, haha.


Obviously a very very gross generalization (might get called out on it too :P) - But I think the low hanging fruit for modern software is optimizing data layouts for fast processing.

Optimizing code requires a lot of work, and requires skills that people usually learn from experience - therefore a rare and expensive skillset. Data-Oriented Design can be taught, and requires (IMO of course) a far less technical skillset.


This is difficult because lots of tools don't let developers choose their data layouts. For example you cannot declare an array of structs in Java.


This is very true. The towers of template and OO class hierarchies typical in C++ codebases make controlling the memory layout nearly impossible. Polymorphism and vtables are killers for performance. I think general programming can learn a lot of lessons from HPC and game engines.


    inline uint32_t Log2(uint32_t x)
    {
     const union { float f; uint32_t i; } converter = { (float)x };
     return (converter.i >> 23) - 127;
    }
This is interesting. It looks like it's returning the (biased corrected) value of mantissa.

I'm surprised that there isn't a function supplied by the compilers or in the processor that takes an int and returns the mantissa. After all, every int -> float conversion requires this step.


Modern CPUs have a count-leading-zeros instruction to calculate floor(log2(x)) efficiently. GCC and clang expose this with the __builtin_clz* family of intrinsics. MSVC exposes this with _BitScanReverse.

  static inline uint64_t floor_log2(uint64_t x) {
      return 63 - __builtin_clzll(x);
  }


And C++ (as of C++20) exposes this and more in the header <bit>.

https://en.cppreference.com/w/cpp/header/bit


Take a look at the C header <ieee754.h>. It's not on all systems but it provides a union for type punning between a double and its decomposed parts. IEEE754_DOUBLE_BIAS is provided for bias correction. The same are available for floats.

(Unfortunately, the "not on all systems" part is why I don't use it, but instead copy the union and the constant verbatim into the project if I need it.)
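For example, a sketch using that union (field names as in glibc's header, so this will not compile where <ieee754.h> is absent):

  #include <ieee754.h>   // glibc-specific
  #include <stdio.h>

  // floor(log2(x)) via the exponent field, for x > 0.
  static int floor_log2(unsigned int x)
  {
      union ieee754_double u;
      u.d = (double)x;
      return (int)u.ieee.exponent - IEEE754_DOUBLE_BIAS;
  }

  int main(void)
  {
      printf("%d\n", floor_log2(1000));   // prints 9
      return 0;
  }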


I thought that type punning via union is against the standard.

I personally do not think it should be ... It's a very handy tool.


I think it may be a GNU extension to explicitly allow it, but I'm no language lawyer. This header is from glibc; it may explain why it's not available on some platforms.


It’s legal in C.


I think you're right. Most of the top Google results seem to be from people who can't mentally separate C and C++. So when I tried to look it up around the time of the comment, I didn't get clear answers.


> It looks like it's returning the (biased corrected) value of mantissa.

I think you mean exponent?


Yes, oops.



I like this topic, so here are some more "bit hacks" links, from my collection:

  * http://aggregate.org/MAGIC/
  * http://graphics.stanford.edu/~seander/bithacks.html
  * https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer/109025#109025
  * https://github.com/bloomberg/bde/blob/main/groups/bdl/bdlb/bdlb_bitutil.h


Modified BSD License (3-Clause BSD license), so it is permissive. Not apparent from the header file alone.


Are any of these still needed, though? I kinda assumed they would be like Duff's Device - a nifty historical relic, firmly in the "Let the compiler do this for you" category in modern C.


Many of those are not standard functions (or were not, until very recently). You have to implement them somehow, so having a ready-made collection that also performs well is very nice to have.


When you're working on Frostbite you definitely need these.

Optimizers are smart, but aren't smart enough yet


How do you let the compiler do these things for you?


By having someone write the optimization passes to detect these :)


Compile with optimizations enabled.


What code specifically do you write instead of the code in TFA, in order to have the optimizer generate the code in TFA? In general compiling with optimizations enabled has not produced particularly good code in my experience.


Does anyone have any introductory reading material on where bit-wise programming becomes important knowledge? Is it mostly just used in game-dev?

They've always peaked my interest, but never personally had to work with it, and just don't know the use cases where these become important!


The general use-case is "high-performance and/or embedded systems." Games, and especially console games, happen to check both of those boxes. Another high-perf case is financial systems (eg: high-frequency trading). Other embedded systems would include things like robotics, automotive software, audio synthesizers and processing, etc.

Packing data into bits is a space optimization. On modern processors, accessing memory is hideously slow compared to doing logical operations, so if you can pack more info into fewer bits, you get higher information density, and need to access main RAM less often. Thus, it speeds things up, sometimes dramatically.

Sorry, I don't have reading material per se.

But just as an anecdotal example: I was using some of this trickery to implement a densely-packed Quadtree using Morton Codes. Morton Codes interleave the bits of the X and Y coordinates for a cell on a grid. So, in order to break an M-Code apart and combine them back together, you need to "select the even bits" or "select the odd bits", which can be done with some tricks like these.
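For instance, the even-bit select and its inverse can be done with the classic mask-and-shift steps (a sketch using the "binary magic numbers" recipe from the bithacks page, not necessarily the exact code I used back then):

  #include <stdint.h>

  // Spread the low 16 bits of v so they occupy the even bit positions.
  static uint32_t spread_even(uint32_t v)
  {
      v &= 0x0000FFFFu;
      v = (v | (v << 8)) & 0x00FF00FFu;
      v = (v | (v << 4)) & 0x0F0F0F0Fu;
      v = (v | (v << 2)) & 0x33333333u;
      v = (v | (v << 1)) & 0x55555555u;
      return v;
  }

  // Inverse: select the even bits of v and pack them back together.
  static uint32_t select_even(uint32_t v)
  {
      v &= 0x55555555u;
      v = (v | (v >> 1)) & 0x33333333u;
      v = (v | (v >> 2)) & 0x0F0F0F0Fu;
      v = (v | (v >> 4)) & 0x00FF00FFu;
      v = (v | (v >> 8)) & 0x0000FFFFu;
      return v;
  }

  // 2D Morton code: x in the even positions, y in the odd positions.
  static uint32_t morton2d(uint16_t x, uint16_t y)
  {
      return spread_even(x) | (spread_even(y) << 1);
  }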

(aside: it's "piqued" :)


Thank you for the thorough break down. All the replies have immediately framed my "understanding" much more.

I'm not exactly sure why, but I'm very intrigued nonetheless.

Aside: TIL, Thanks!


I don't know of any formal introductions to the topic but you can see them in action in these places too

* some of Daniel Lemire's work in high performance data structures

* the blog posts related to solving board games on https://nullprogram.com

* the articles about board representations on the chess programming wiki https://www.chessprogramming.org/Main_Page

* Any explanation of Quake III's fast inverse square root


Thanks for the resources, these are great! As an avid chess fan I've always played around with the idea of trying to make a basic chess engine... This might just be the push!!


Most obvious use case I can think of is embedded programming, e.g. when you're setting control registers.


In embedded programming it can be useful for controlling an animation on an LED matrix, since a lot of LED drivers and multiplexers enable pins based on the bits shifted through them. Example of how you might animate 8 LEDs connected to 8 pins:

    10000000
    01000000
    00100000
    00010000
    00001000
    00000100
    00000010
    00000001
You could turn that into an array to animate an LED going from left-to-right or you could just write a simple function that bit-shifts your current state to the right.

It's also useful in scrolling text: Imagine that 8x8 above contains a character like O. You could animate the O moving to the left by shifting the bits in each row to the left by one over and over again (according to whatever speed you've defined).
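A minimal sketch of the chase described above (in real code the pattern would go to the LED driver instead of stdout):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint8_t state = 0x80;                 // leftmost LED lit: 10000000
      for (int step = 0; step < 8; ++step) {
          printf("%02x\n", state);          // 80, 40, 20, 10, 08, 04, 02, 01
          state >>= 1;                      // move the lit LED one place right
      }
      return 0;
  }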


Thanks man, this was a very clear explanation!


Anywhere that a byte/word/dword is used as a first class data structure, there is likely some operation on the data structure that can be accomplished with a bit trick. As others have said, this is most often the case where CPU cycles are tightly constrained.


Not an answer to your exact question, but there's:

https://graphics.stanford.edu/~seander/bithacks.html

And if you like that, there's an entire book full of them, which is delightful:

https://www.amazon.ca/Hackers-Delight-2nd-Henry-Warren/dp/03...


Interacting with hardware registers, such as in a device driver, is another common use case.


or implementing file formats / network protocols


What danbruc commented [0] may be of interest to you.

[0]: https://news.ycombinator.com/item?id=34055831


Thanks!


You end up needing this stuff to write fast parsers of text formats, but you need a bunch of other stuff too, and there isn't really any introductory material online.


If you're writing a standard library from scratch, they can be important in data structure or algorithms that are expected to be in hot loops.


When dealing with networking it's extremely common.


NETWORKS. All use cases apply.


I just got a copy of Hacker's Delight, which seems to show the same sorts of tricks. I haven't read very much of it yet, but I wonder how many of these came from that book or a common source. It would be fun to read more about the history of these tricks.

https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0...


https://en.wikipedia.org/wiki/HAKMEM contains some early history but probably not complete. These type of things get re-invented a lot.


Question from a frontend dev: Why can't the compiler apply these tricks automatically? Are there assumptions they're making that don't necessarily hold?


Clang and GCC can do some amazing optimizations. However, these types of tricks often come into play as a result of clever data representation, i.e., instead of representing data as a bunch of objects in an array, you’ve figured out how to represent it, or a subset, in some cleverly packed bits and/or ints (and then can reap the benefit of these tricks).

It’s trivial for a compiler to see (x / 2) and substitute (x >> 1) but it is much harder for a compiler to go from seeing that you have a bunch of Foo objects and realizing that they could be better represented as clever bit vectors.


You can't really trust the compiler to emit sensible code ever


The compiler cannot always detect all the uses.


Another handy one: Bitwise Table Lookup

  /* Bits: selected, hovered */
  const colors = [
    grey,  // 00
    green, // 01
    blueA, // 10
    blueB  // 11
  ]
  const color = colors[selected << 1 | hovered]
https://blog.uidrafter.com/bitwise-table-lookup


Just imagine the performance of modern software if developers took the time to optimize it, like they did in the old days.


Bit hacks are but a tangent to performance optimization, of course.


Funny they didn't just extend the 32 bit reverse algorithm to 64 bits: https://github.com/electronicarts/EAStdC/blob/master/include...
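For reference, the extension is basically two 32-bit reverses with the halves swapped; a sketch using the standard mask-and-shift reverse, not necessarily the library's exact 32-bit code:

  #include <stdint.h>

  // Standard mask-and-shift 32-bit bit reverse (the usual bithacks recipe).
  static uint32_t reverse32(uint32_t x)
  {
      x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
      x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
      x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
      return (x << 24) | ((x & 0xFF00u) << 8) | ((x >> 8) & 0xFF00u) | (x >> 24);
  }

  // 64-bit extension: reverse each half and swap the halves.
  static uint64_t reverse64(uint64_t x)
  {
      return ((uint64_t)reverse32((uint32_t)x) << 32) | reverse32((uint32_t)(x >> 32));
  }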


Yes, this library is of uneven quality, and would benefit from a few hours of focused attention from a specialist. E.g. the various HighestBit functions below what you've called out also look inefficient relative to something using one of the builtin_clz intrinsics, even though there's an earlier use of builtin_clz...


I think a lot of times the way a header like this comes into being is that someone has a narrow need for an operation that is deemed reusable and abstract. The 64-bit version of a 32-bit bit utility isn't needed at the time. Then somebody comes in later trying to fill out more stuff.


I haven't tested this, but on Arm32 and some other 32-bit machines this isn't a bad way to do it.

If the compiler recognizes the 32-bit approach and rbit is available, the 64-bit variant is usually 2 rbits with the registers swapped.


If the compiler can recognize the 32-bit version, it really should be able to recognize the 64-bit version too, no?


Ideally yes. Only last year did Clang 13 learn the 32-bit one. GCC doesn't understand either. I haven't checked clang for the 64-bit one.


On many platforms a shift by more than 31 bits often did nothing, even though it arguably should do something. Intel, for example, masks the shift count to its low 5 bits, which broke a lot of 64-bit compiler hacks.

There is also the issue of cache and speed - expanding to full size would have different footprints, so perhaps they tested that too.

This is likely a result of working around those problems.



