Bit Twiddling Hacks (stanford.edu)
143 points by Cieplak on Dec 4, 2020 | 33 comments



These kinds of hacks are a major part of how we made Scala.js' 64-bit integers fast (the fastest known implementation of 64-bit integers on JavaScript). See [1] for the implementation, and [2], section 4.4, for the long explanation with benchmarks and the other tricks we use.

[1] https://github.com/scala-js/scala-js/blob/v1.3.1/linker-priv...

[2] https://lampwww.epfl.ch/~doeraene/thesis/doeraene-thesis-201...
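
As a flavour of the kind of trick involved (a generic sketch, not the actual Scala.js code): with a 64-bit value split into two 32-bit halves, the carry out of the low word can be recovered from an unsigned comparison instead of a branch:

    #include <cstdint>

    struct Long64 { std::uint32_t lo, hi; };   // hypothetical (lo, hi) representation

    Long64 add(Long64 a, Long64 b) {
        std::uint32_t lo = a.lo + b.lo;        // wraps modulo 2^32
        std::uint32_t carry = lo < a.lo;       // wrap-around implies a carry into the high word
        return { lo, a.hi + b.hi + carry };
    }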


My favorite thing about these is how they come full circle: programmers find convoluted ways to outsmart the compiler at some primitive operation, then CPU manufacturers add dedicated instructions for those operations, and eventually the compilers outsmart the programmer by deleting their smartass method and emitting a single instruction instead.

https://godbolt.org/z/nPdrcz


This of course refers to the POPCNT instruction, which is still quite tricky to get compilers to generate reliably, and which the RISC-V specification and all implementations to date mysteriously lack.

Compilers failing to generate POPCNT can cause 10x slowdowns in key operations.


Indeed, it's better to use intrinsics in new code rather than rely on compiler magic.

https://godbolt.org/z/n3z8nz

Portability is an issue, though: that works on GCC/Clang, but you need a different intrinsic for MSVC.
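
Something like this wrapper is a common way to paper over the difference (a sketch; the function name is mine):

    #include <cstdint>
    #ifdef _MSC_VER
    #include <intrin.h>   // __popcnt64 (x64 MSVC)
    #endif

    inline int popcount64(std::uint64_t x) {
    #ifdef _MSC_VER
        return (int)__popcnt64(x);
    #else
        return __builtin_popcountll(x);   // GCC/Clang builtin
    #endif
    }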


Any idea why this hasn't been addressed in the C standard? Like C21 could say "math.h shall have a popcnt() function with these semantics", and then the compiler treats that like the intrinsic?


The difficult part is not expressing it. Rather, it is persuading the compiler that the target implements the instruction. gcc and clang have to be convinced they are generating code for an amd64 chip recent enough to have it (for which "-mpopcnt" usually works). Common Linux distributions don't turn on any such flags. MSVC offers no way to persuade it, but it will issue the instruction anyway if you use its intrinsics, __popcnt16, __popcnt, and __popcnt64. Shipping Microsoft code doesn't; it is all supposed to work on the oldest amd64 chips ever made.

C++98 gave us std::bitset<N>(x).count(). With C++20 we have std::popcount(x). But getting the compiler to produce the instruction reliably and portably is no easier than before.
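
Both spellings in one place (the C++20 one needs <bit> and a recent -std flag):

    #include <bit>      // std::popcount, C++20
    #include <bitset>
    #include <cstdint>

    int count_98(std::uint64_t x) { return (int)std::bitset<64>(x).count(); }
    int count_20(std::uint64_t x) { return std::popcount(x); }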


I don't know why C doesn't have it yet, but TIL they did already standardize it in C++20

https://en.cppreference.com/w/cpp/numeric/popcount


Odd that GCC is emitting an xor eax, eax before popcount (in both the intrinsic and pattern cases!):

  popcnt(unsigned int):
          xor     eax, eax
          popcnt  eax, edi
          ret


It's a workaround for a bug in Intel CPUs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011


Interesting, thanks for the pointer.


At my last job interview I had to implement popcount and explain how to optimise it (amongst other things).

I was able to jot down a naïve implementation and the obvious optimisation based on a lookup table. I only vaguely remembered the Bit Twiddling treatment of the subject but, with a bit of nudging from the interviewer, I managed to implement and explain the variant that runs in O(set bits) (“Brian Kernighan's way”). I got the job.
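
For reference, the Kernighan variant clears the lowest set bit on every pass, so the loop runs once per set bit:

    unsigned kernighan_popcount(unsigned v) {
        unsigned c = 0;
        for (; v; ++c)
            v &= v - 1;   // clears the least significant set bit
        return c;
    }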

Now, it’s fashionable to deride this kind of coding interview as unrealistic and unhelpful. But in my first week on the job, by sheer coincidence, I had to use the function. Obviously there are existing, efficient implementations, including intrinsics. But knowing how to derive an efficient implementation certainly didn’t hurt. My job has since evolved into different responsibilities but low-level algorithmic knowledge is still important. I’m not sure testing for it in job interviews is generally a good idea, and designing good job interviews is certainly a big topic. But in my particular case it happened to be a relevant, fair test of my abilities.


I mean, it wouldn't hurt to know how to do all of this, plus all the data structures & algorithms stuff, plus all major and minor details about your language and environment of choice, plus anything else that someone deems important, but realistically most of us need a job before we can learn all of that. Even if you do learn it all, people tend to lose knowledge they don't use, so it may have an expiration date depending on how good your brain is. New technologies and advancements can also introduce more knowledge that becomes required for interviews. The pool of knowledge for interview questions might continue to grow until you'd have to study for a year or more to have a high chance of passing.


> but realistically most of us need a job before we can learn all of that

The interview was for a senior position.

> Even if you do learn it all, people tend to lose knowledge they don't use

Let me emphasise that my interview was not a knowledge test (and at any rate I don’t study for interviews). I wasn’t expected to know by heart how to implement popcount. The interviewer was trying to see me work. Successfully, I might add. — Another question I got concerned something I had no knowledge of, and I had to derive a solution myself. In fact, I failed to do so, but that didn’t prevent me from getting the job since the interviewer was satisfied with what they observed about my thought process.


DSA = data structures & algorithms


I’ve had that question twice in Google phone screen interviews, and IIRC the expected best answer has changed from the lookup-table thing to just using the hardware instruction.


I don't believe companies want to hire for knowing that popcnt exists. The interviewer should follow up with the hypothetical situation when either the language or the CPU doesn't have the instruction.

Nobody should ask this anyway; it has become the FizzBuzz type of question.


Long before leetcode, I used to ask an "algorithm" question in interviews. I don't remember it exactly, but IIRC it was about determining quickly if a number was 2^n-1 for any n. (Don't shoot me, this was a decade ago.)

There were four broad ways candidates answered it:

1) Hack out a for-loop-style bit count. This was good, because even though it wouldn't be optimized, it demonstrated they could understand the problem and at least conceptualize a solution.

2) Give the "leetcode" best-answer (it was some bitwise math trick; a likely form is sketched at the end of this comment). I'd then ask if they'd seen the problem before, and the answer was always yes. This was a mark in their favor, but no better than (1), and also a signal I needed to ask them a follow-up question that actually made them think.

3) Code a bit, but not quite arrive at a solution. I'd then probe them about their thought process. Not an automatic fail, sometimes our brains just don't walk down the corridors we want them to at a given time, especially during an interview. (One candidate couldn't hold the marker steady because his hands were shaking too much! Poor guy.)

4) Give up and say they had no idea. Obviously the worst case for the interviewee.

The purpose was not to check a candidate's recall and test-taking abilities, but rather to watch them reason through a novel problem, appropriately limited in scope for the interview timeframe. Case (2) served as a great short-circuit to block test-preppers lacking actual experience.
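
(The trick was most likely the observation that 2^n - 1 is a run of ones, so adding 1 clears every bit; the function name here is mine:)

    // True iff x == 2^n - 1 for some n >= 0 (0 counts, as 2^0 - 1).
    bool is_all_ones(unsigned x) {
        return (x & (x + 1)) == 0;
    }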


Yeah, it's a stupid question, which is why I was surprised that Google asked me it twice, separated by a couple of years.

What's funny is that 15 years later, I actually had a use for popcnt, put it together with something that seemed expensive at the time, and wound up with a C program that exhaustively searched a problem space in 0.12 sec. On a single core, in a VM, on a laptop. So much for me trying GPU programming on that one. (As in: feck it, there aren't that many possibilities, we'll just count them all.)


These are typically seen as tricks for embedded programming. But they have (at least) another useful application.

Years ago I was messing around with SAT-based bounded model checking of C programs that I was crafting. Many of these tricks allow you to remove branches/loops from your program, which is great because you end up with a smaller SAT formula, which (usually) can be solved faster.
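
For example, the page's branch-free min replaces a conditional jump with straight-line bit math (relying on -(x < y) being either all ones or all zeros):

    int branchless_min(int x, int y) {
        return y ^ ((x ^ y) & -(int)(x < y));
    }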


This list is super-great, except: all modern targets[0] (and many non-modern ones, including Alpha, Cray, CDC, Stretch) have a POPCOUNT instruction, which performs a key operation underlying a large portion of the listed operations much faster than the portable code presented here. It provides an order-of-magnitude speedup in numerous algorithms. The list should give it a comprehensive treatment.

At the moment, the only ISO Standard way to say "popcount" is in C++, with e.g.

  std::bitset<64>(x).count()
But even this is insufficient on MSVC, which targets the baseline amd64 ISA, which lacks it; and on gcc and clang on amd64, similarly, absent a -mpopcnt or -march=native or related option (which there are numerous other reasons to use).

Given -march=native or =nehalem or various other means, optimizers will happily rewrite the Kernighan loop into a straight-up POPCNT instruction. Anyway, all compilers provide it as a non-standard, therefore variously-spelled, intrinsic, but (except on MSVC) only actually produce it if the "-march=" or related command-line option enables it.

Historically, it has been common for instruction sets to start out lacking the instruction and then gain it in subsequent releases because of customer demand. Also historically, a key such customer has frequently been the US NSA. Thus, initial amd64, Alpha, POWER, and SPARC lacked it, but soon got it. x86 ("ia32") is perhaps the principal laggard. When a feature is added, at great expense, to a subsequent ISA revision, that says a lot about its importance.

[0] All except RISC-V, to date. Some would say this makes RISC-V non-modern. It appears in the (still) unratified B extension, which (therefore) nobody implements.


I once had to count the number of set bits (the Hamming weight) for an assignment. It had to be done in under 40 bitwise ops, without loops and with constants of up to 8 bits. The naive approach took way too many ops! I was stumped until I found the linked page, in particular 'Counting bits set, in parallel'[1]. Parallelism at the bit level?! That is when I realized low level programming was completely badass.

[1]: https://graphics.stanford.edu/~seander/bithacks.html#CountBi...
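
The 32-bit flavour of that section boils down to a few masked adds (roughly, following the linked page):

    unsigned parallel_popcount(unsigned v) {
        v = v - ((v >> 1) & 0x55555555);                  // sum adjacent bit pairs
        v = (v & 0x33333333) + ((v >> 2) & 0x33333333);   // sum pairs into nibbles
        v = (v + (v >> 4)) & 0x0F0F0F0F;                  // sum nibbles into bytes
        return (v * 0x01010101) >> 24;                    // add up the four byte counts
    }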


Again, the POPCNT instruction reveals its importance.


If you're into this sort of thing and like print books, "Hacker's Delight" by Henry S. Warren, Jr. is great.


Came here to say this. "Hacker's Delight" is essential reading if you are an embedded programmer working in small spaces.

And by small, I mean writing code to fit on a credit card, transit ticket, or SIM card. (Yes <2KB ROM budgets still exist.)


I recalled seeing a nicer sign() or sgn() function. Here it is:

    template <typename T> int sgn(T val) {
        return (T(0) < val) - (val < T(0));
    }
Not mine, from here https://stackoverflow.com/questions/1903954/is-there-a-stand...


These slides on Bit Hacks are interesting too if you're looking for a more verbose presentation on some of these:

https://ocw.mit.edu/courses/electrical-engineering-and-compu...

Also, curiously enough, I was on the page in the submission after a co-worker asked yesterday if there was a cleaner way to do:

    return direction === 'asc' ? value : -value;
Was fun to work through how https://graphics.stanford.edu/~seander/bithacks.html#Conditi... works. (no, we didn't change our javascript code to use the bit hack)
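
Translated out of JavaScript, the linked trick is essentially (two's complement assumed):

    // Negate v when negate is true, pass it through otherwise, without branching.
    int conditional_negate(int v, bool negate) {
        return (v ^ -(int)negate) + (int)negate;
    }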


Having written my fair share of bit-twiddling hacks in low-level code, my takeaway is that because the hack code doesn't resemble what it actually does, it's all too easy to create bugs. Unit testing the living daylights out of any bit-twiddling code path is a must.

Also helpful: being able to write integer literals in binary. Since I'm not a genius and I write code to be maintained by other not-geniuses, `mask = 0b_1010_1010` is clearer than `mask = 0xAA`.


Another simple bit-twiddling hack (it has a proper name which I unfortunately forgot) is SIMD without SIMD instructions: mask off the top bit of each lane (essentially where the carry would spill into the next lane), e.g. doing eight 7-bit additions at once:

    uint64_t a = (a0 & 0x7F) | ((uint64_t)(a1 & 0x7F) << 8) | ...;
    uint64_t b = (b0 & 0x7F) | ((uint64_t)(b1 & 0x7F) << 8) | ...;
    uint64_t c = (a + b) & 0x7F7F7F7F7F7F7F7F;


The name SWAR is sometimes used (https://en.wikipedia.org/wiki/SWAR).

Indeed, even 8-bit fields can be added in parallel, using the fact that ^ is like a + that does not produce a carry:

    uint64_t signmask = 0x8080808080808080;
    uint64_t sum_without_sign_bits = ((x & ~signmask) + (y & ~signmask));
    uint64_t sum_of_sign_bits = (x ^ y) & signmask;
    return sum_without_sign_bits ^ sum_of_sign_bits;


We use bit twiddling a lot to reduce conditional jumps. This website is a great starting point if you need to learn it. Highly recommend.


A trick I used to get the mask of the least significant set bit in a bit mask: `flag = bitmask & (-bitmask)`. Works with two's complement numbers only, of course. (And modern architectures have a dedicated instruction for this.)


The interleaving/Morton-code bit hacks are great. They give serious performance improvements if you need to compute Morton codes.
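
For the 2-D case, the magic-number version from that section spreads each coordinate's bits into the even positions and ORs the results together (a sketch; the helper name is mine):

    #include <cstdint>

    // Interleave the bits of x and y into a 32-bit Morton code
    // (x in the even positions, y in the odd ones).
    std::uint32_t morton2d(std::uint16_t x16, std::uint16_t y16) {
        std::uint32_t x = x16, y = y16;
        x = (x | (x << 8)) & 0x00FF00FF;
        x = (x | (x << 4)) & 0x0F0F0F0F;
        x = (x | (x << 2)) & 0x33333333;
        x = (x | (x << 1)) & 0x55555555;
        y = (y | (y << 8)) & 0x00FF00FF;
        y = (y | (y << 4)) & 0x0F0F0F0F;
        y = (y | (y << 2)) & 0x33333333;
        y = (y | (y << 1)) & 0x55555555;
        return x | (y << 1);
    }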


TIL these two lines (to get an integer absolute value without branching) are _patented_:

  int const mask = v >> sizeof(int) * CHAR_BIT - 1;
  r = (v ^ mask) - mask;



