
Popcount CPU instruction - mmastrac
https://vaibhavsagar.com/blog/2019/09/08/popcount/
======
jongalloway2
.NET Core 3 (in preview, due out Sep 23) includes support for popcount as part
of the BitOperations class: [https://docs.microsoft.com/en-
us/dotnet/api/system.numerics....](https://docs.microsoft.com/en-
us/dotnet/api/system.numerics.bitoperations). There's a recent post describing
hardware intrinsics in .NET Core:
[https://devblogs.microsoft.com/dotnet/hardware-intrinsics-
in...](https://devblogs.microsoft.com/dotnet/hardware-intrinsics-in-net-core/)

There are a lot of potential use cases, but one of the early discussions was
around quickly scanning for known HTTP headers. You can see that in use here:
[https://github.com/aspnet/AspNetCore/blob/caa910ceeba5f2b2c0...](https://github.com/aspnet/AspNetCore/blob/caa910ceeba5f2b2c02c47a23ead0ca31caea6f0/src/Servers/Kestrel/shared/KnownHeaders.cs#L646)

If you really need help falling asleep, here's the discussion going back to
2015:
[https://github.com/dotnet/corefx/issues/2209](https://github.com/dotnet/corefx/issues/2209)

Disclaimers: Microsoft employee, Nazgûl

~~~
vaibhavsagar
GHC also includes a `popCount` function that is implemented in terms of the
built-in instruction: [https://haskell-works.github.io/posts/2018-08-22-pdep-
and-pe...](https://haskell-works.github.io/posts/2018-08-22-pdep-and-pext-bit-
manipulation-functions.html#availability-on-ghc)

------
pacaro
I used to use "implement popcount" as an interview question (for C
programmers). I'm not super convinced that it was a good question, although it
definitely tells you quite quickly how comfortable a candidate is with bit
manipulation.

There are three or four reasonable ways to implement it in C, including one
weird trick, but which is most efficient tends to be very CPU-dependent (at
the time I worked on a project that targeted four or five esoteric CPUs, so I
had the luxury of being able to verify this).

~~~
castratikron
I remember in one interview I was asked to write a function that returns true
iff x is a power of two, so I wrote `return 1 == __builtin_popcount(x);`. They
liked that.
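
For context, both forms of that check can be sketched like this (a minimal
sketch; the function names are mine, not anything from the interview):

```cpp
#include <cstdint>

// A power of two has exactly one bit set, so popcount == 1 is the test.
// (This naturally rejects 0, whose popcount is 0.)
bool is_pow2_popcount(uint32_t x) {
    return __builtin_popcount(x) == 1;
}

// The classic bit trick: n & (n-1) clears the lowest set bit, so the
// result is 0 iff at most one bit was set. 0 must be excluded explicitly.
bool is_pow2_mask(uint32_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}
```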

~~~
nwallin
I did the same thing!

But they didn't like it. So I wrote the naivest possible popcount and returned
1 == naive_bitcount(n).

They didn't like that. I explained to them that the compiler will optimize out
the loop so this will be fast, and that they should check it on godbolt. (Did
I mention that I was coding on the whiteboard?) There's a loop in the source
code, but there isn't a loop in the machine code; it's like three
instructions.

They didn't like that either, so I worked out the n & (n-1) method from first
principles. I didn't do it optimally; there was some redundancy in my code,
which the interviewer smugly explained to me. I can't remember if I told him
"that's nice, but it's still an extra instruction compared to the trivial,
obvious, easy-to-maintain method I showed you ten minutes ago." I'm pretty
sure I didn't, too afraid to make a bad impression.

I didn't get the job. In hindsight, I'm glad I didn't. Those guys are dicks.

~~~
dooglius
popcnt is generally going to take multiple cycles though; I'd expect the
n&(n-1) method to be more performant. It's also vectorizable and more
portable.

~~~
dalke
__builtin_popcount, when compiled for an architecture with hardware popcount,
should process 4 bytes per cycle.

The n&(n-1) method to test for a power of 2 cannot be faster than that.

~~~
jcranmer
POPCNT has a latency of 3 cycles on most x86 hardware.

~~~
dalke
You are right. I was thinking of the number of operations.

Is the popcnt test slower than the n&(n-1) test? (Edit: Ooops! I see you
already addressed that at
[https://news.ycombinator.com/item?id=20918136](https://news.ycombinator.com/item?id=20918136)
.)

~~~
jcranmer
A popcnt/test/jmp cycle should have roughly 4 cycles of latency: a test+jmp
can be fused into one uop, and popcnt would be 3 cycles of latency. n&(n-1) ==
0 would compile down into a DEC, TEST, JMP, which is 2 cycles of latency
(again, test+jmp are fused into one uop), as dec is just 1 cycle of latency.

So n&(n-1) is faster for checking if it's a power of two.

~~~
dalke
In the following I confirm that n&(n-1) is faster than popcount (compiled with
"cc -march=haswell check_pow2.c -O3"), at 699 vs. 1016 microseconds,
respectively:

    
    
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/time.h>
      
      const int N = 2*1000*1000;
      
      int count_popcnt(unsigned int *data) {
        int sum = 0;
        int i;
        for (i=0; i<N; i+=8) {
          sum += (
                  (__builtin_popcount((data+i)[0]) == 1) +
                  (__builtin_popcount((data+i)[1]) == 1) +
                  (__builtin_popcount((data+i)[2]) == 1) +
                  (__builtin_popcount((data+i)[3]) == 1) +
                  (__builtin_popcount((data+i)[4]) == 1) +
                  (__builtin_popcount((data+i)[5]) == 1) +
                  (__builtin_popcount((data+i)[6]) == 1) +
                  (__builtin_popcount((data+i)[7]) == 1)
                  );
        }
        return sum;
      }
      
      int count_wegner(unsigned int *data) {
        int sum = 0;
        int i;
        for (i=0; i<N; i+=8) {
          sum += (
                  ((((data+i)[0]) & ((data+i)[0]-1)) == 0) +
                  ((((data+i)[1]) & ((data+i)[1]-1)) == 0) +
                  ((((data+i)[2]) & ((data+i)[2]-1)) == 0) +
                  ((((data+i)[3]) & ((data+i)[3]-1)) == 0) +
                  ((((data+i)[4]) & ((data+i)[4]-1)) == 0) +
                  ((((data+i)[5]) & ((data+i)[5]-1)) == 0) +
                  ((((data+i)[6]) & ((data+i)[6]-1)) == 0) +
                  ((((data+i)[7]) & ((data+i)[7]-1)) == 0)
                  );
        }
        return sum;
      }
      
      
      int main(void) {
        unsigned int data[N];
        FILE *f = fopen("/dev/urandom", "rb");
        if (!f) {
          perror("Cannot open /dev/urandom");
          exit(1);
        }
        if (fread(data, sizeof(unsigned int), N, f) != N) {
          perror("Cannot read enough items");
          exit(1);
        }
        /* Use only a few bits */
        for (int i=0; i<N; i++) {
          data[i] = data[i] & 31;
          /* Don't worry about handling 0 */
          if (data[i] == 0) {
            data[i] = i + 1;
          }
        }
      
        struct timeval time1, time2;
        int n;
        
        gettimeofday(&time1, NULL);
        n = count_popcnt(data);
        gettimeofday(&time2, NULL);
        
        printf(" popcnt: %d in %8ld us\n",
               n,
               (time2.tv_sec-time1.tv_sec) * 1000*1000 +
               (time2.tv_usec-time1.tv_usec));
      
        gettimeofday(&time1, NULL);
        n = count_wegner(data);
        gettimeofday(&time2, NULL);
      
        printf("n&(n-1): %d in %8ld us\n",
               n,
               (time2.tv_sec-time1.tv_sec) * 1000*1000 +
               (time2.tv_usec-time1.tv_usec));
               
        return 0;
      }
    

Thank you for teaching me something new!

~~~
nkurz
Besides just timing it (excellent!) did you take a look at what it compiles
to?

[https://godbolt.org/z/C0ZeE5](https://godbolt.org/z/C0ZeE5)

It's faster because both GCC and Clang now optimize loops with n&(n-1) to use
AVX2 SIMD! I haven't looked closely to confirm, but I think they may in fact
even do Harley-Seal for "naive popcount" loops.

C is no longer portable assembly. If you want to test whether a particular
algorithm is faster than another, you probably need to write assembly --- or
at least confirm that the compiler did what you thought it did.

~~~
dalke
I think I'll stick with paying people like you to deal with these sorts of
issues in my code. ;)

------
monday_
One use of popcount not mentioned in the OP is graph analysis. For example,
you can use the following

std::bitset<N>* graph = new std::bitset<N> [N];

to store the adjacency matrix for a graph with up to N vertices in N^2 bits of
memory. The nice thing is that std::bitset::count uses popcount to compute the
number of bits set to one. This makes some graph operations extremely fast
even for a pretty large N. For example, graph[i].count() will produce a degree
of a vertex and (graph[i] & graph[j]).count() will produce a number of
vertices adjacent to both i and j.
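
A small sketch of that idea (the size and helper names here are mine, not
from any particular library):

```cpp
#include <bitset>

// Adjacency matrix rows stored as bitsets: degree and common-neighbor
// counts both reduce to popcounts (std::bitset::count).
constexpr int N = 64;  // graph with up to N vertices

int degree(const std::bitset<N>* graph, int i) {
    return static_cast<int>(graph[i].count());  // popcount of row i
}

int common_neighbors(const std::bitset<N>* graph, int i, int j) {
    // AND the two rows, then popcount the intersection.
    return static_cast<int>((graph[i] & graph[j]).count());
}
```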

~~~
snovv_crash
Careful, std::bitset.count() doesn't use popcnt on msvc, only gcc and clang.

~~~
ncmncm
Popcount was not supported on the original AMD64 ISA. You have to tell your
compiler to generate code for a recent chip before it will generate the
instruction.

~~~
snovv_crash
That still doesn't work. Here is an example with codegen for AVX2, and an
intrinsic call to popcnt, that still doesn't use popcnt for
std::bitset::count() with a bitset size of 32 bits. Compare it to the
GCC-generated code.

[https://godbolt.org/z/27TmQY](https://godbolt.org/z/27TmQY)

~~~
ncmncm
I guess you're talking about MSVC. I find it hard to care much what that does.
Code where performance matters much is not built with it.

But you're right, their std lib not using their own intrinsic is pathetic.

~~~
snovv_crash
Sometimes we don't have a choice of toolchain used due to distribution targets
or other dependencies :-( I'd prefer to be using GCC / Clang for everything
too...

~~~
ncmncm
I wrote to a maintainer of the MSVC lib. He says their lib has to work on all
amd64s, but some (specifically, AMD before K10, and Intel before SSSE3) have
no popcount. He says their intrinsics are defined to emit exactly the
instruction named, unlike GCC's, so they can't use that in their library.

No explanation why they use the loop form, except that the code hasn't been
touched in a long, long time.

~~~
snovv_crash
Surely they can check if AVX is enabled and use the intrinsic if so?

~~~
ncmncm
That would involve changing code not touched since before AVX or even SSSE3
existed. Probably not even since before amd64 existed.

But it's hard to switch on use of a single instruction. Checking at the use
site consumes a branch predictor slot. Switching in a function pointer
interferes with inlining. Self-modifying code would have been the old way. The
modern way might be rewriting in the linker or loader, or JIT compiling.

I have discovered that compilers are extremely bad at recognizing hand-coded
byte-order swapping and dropping in movbe or bswap instructions. That GCC and
Clang recognize ham-handed popcount loops seems miraculous now.

------
ncmncm
The Bitmanip extension definition document for RISC-V explains a wide variety
of what were, for me, unfamiliar but terribly useful bitwise operations. Well
worth reading even if you have no particular interest in RISC-V. Many of the
instructions detailed have analogs in AVX, NEON, and recent POWER instruction
sets. Often these instructions are poorly explained in the official manuals.

------
zzo38computer
MMIX has a "SADD" instruction, which does popcount(x&~y). (You could use this
to count trailing zeros with two instructions, each taking only one clock
cycle.)
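
The trailing-zero trick presumably works like this (a sketch using a GCC
builtin in place of MMIX's SADD):

```cpp
#include <cstdint>

// (x - 1) & ~x sets exactly the bits below the lowest set bit of x, so
// its popcount is the trailing-zero count. (For x == 0 this conveniently
// yields 32.)
int ctz_via_popcount(uint32_t x) {
    return __builtin_popcount(~x & (x - 1));
}
```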

I thought JavaScript ought to have a built-in popcount function for the new
integer type, but last I checked, it doesn't. (It would be faster than
writing JavaScript code to emulate it.) I can think of uses for popcount on
arbitrary-length integers. Of course this won't work with negative numbers (I
don't care what it does if the input is negative, although they should define
what it does in that case), but for nonnegative numbers it would be good to
have.

I have also used the __builtin_popcount function in GNU C. I did not know that
it can detect an implementation of popcount and replace it; I have just used
the built-in function.

I also did not know of all of the uses mentioned in the linked document (but I
did know of some other uses).

~~~
zamadatix
WASM has ctz and it's honestly the better place for such a low level
instruction. BigInt is getting some of the bitwise ops
([https://github.com/tc39/proposal-
bigint/blob/master/ADVANCED...](https://github.com/tc39/proposal-
bigint/blob/master/ADVANCED.md)) but just the higher level ones at this time.

JavaScript does have clz32 from back before WASM was the new target for
compile-system-language-to-browser but not ctz32 and I doubt it'd be added as
that's no longer the focus.

------
geoffhill
I like popcount for converting a 2^N-bit uniformly-distributed random number
into an N-bit binomially-distributed one. Each bit of the input simulates a
random coin flip.
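
A minimal sketch of that trick (the function name is mine):

```cpp
#include <cstdint>

// popcount of a 64-bit uniform random word is a Binomial(64, 1/2) sample,
// since each bit is an independent fair coin flip.
int binomial_from_uniform(uint64_t r) {
    return __builtin_popcountll(r);  // number of "heads" among 64 flips
}
```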

~~~
eru
You're wasting a lot of random bits this way, aren't you?

~~~
adtac
Not if you already have 2^n bits at hand. In fact, if you have 2^n bits of
entropy, popcount is probably more efficient than generating n more bits
randomly.

------
kardos
The fact that GCC and Clang can identify a manual implementation of count-
bits-set and substitute it with an invocation of popcount is fascinating. What
other idioms of similar or higher complexity can these compilers recognize and
convert to instructions?

~~~
drfuchs
I about fell out of my chair when I saw gcc and clang both turn
(u>>s)|(u<<(32-s)) into a single "rorl" (rotate right) instruction.
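
That idiom, spelled out as a function (at -O2, GCC and Clang turn this into
a single rotate instruction):

```cpp
#include <cstdint>

// Rotate right by s bits. s must be in 1..31: s == 0 would make the left
// shift a shift by 32, which is undefined behavior in C and C++.
uint32_t rotr32(uint32_t u, unsigned s) {
    return (u >> s) | (u << (32 - s));
}
```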

~~~
ncmncm
Even pcc (on version 7 pdp-11 unix) knew that one.

------
majke
There is more! You can be _faster_ than __builtin_popcount()!

Wojciech Muła did an excellent SSSE3 implementation:

[http://0x80.pl/articles/sse-popcount.html](http://0x80.pl/articles/sse-
popcount.html)

My understanding is that the set-up time is slower than the POPCNT cpu
instruction, but throughput is higher. Useful if you need to count bits not on
a register, but on a whole array.

~~~
dalke
It looks like the conclusion is that the POPCNT instruction ("cpu") is faster
than any of the SSE implementations. Only AVX2 outperforms POPCNT, and only
for large enough bitstrings.

~~~
majke
On i7:

> AVX2 code is faster than the dedicated instruction for input size 512 bytes
> and larger

The difference is indeed tiny. Still - it's very cool the generic AVX2 code
can beat the instruction burned in the silicon!

------
hjorthjort
It's also an instruction in WebAssembly, which has a pretty sparse instruction
set. There are only 29 integer operations, and this is one of them. There are
also "clz" and "ctz" which count leading and trailing zeroes. Other than that
all the instructions are run-of-the-mill arithmetic, comparison, bitmasking
and shifting.

------
tom_mellior
For whatever it's worth, this article is a good example of why HN's (only
theoretical, by now) policy of keeping the original headline intact makes
sense. I wouldn't have clicked on an article with the title "You Won’t Believe
This One Weird CPU Instruction!", and I would not feel regret.

~~~
MaxBarraclough
> For whatever it's worth, this article is a good example of why HN's (only
> theoretical, by now) policy of keeping the original headline intact makes
> sense.

Did you mean _doesn 't_ make sense?

------
dboreham
Yikes I remember reading that thread on usenet. John Nagle, Rob Warnock and
Roger Shepherd. Good times. Roger oddly didn't mention the story about guys in
black suits and mirrored sunglasses showing up as the reason the T414 lacked
popcount but the T800 added it.

------
vchak1
Popcount is also very valuable in search (bitmap) indexes, for example roaring
bitmap by Lemire uses it quite a bit.

~~~
eesmith
Though, interestingly, if AVX2 is available then Lemire's CRoaring package
uses a Harley-Seal popcount implementation, at
[https://github.com/RoaringBitmap/CRoaring/blob/master/includ...](https://github.com/RoaringBitmap/CRoaring/blob/master/include/roaring/bitset_util.h#L286)
, rather than the POPCNT64 instruction.

Because it's faster -
[https://arxiv.org/abs/1611.07612](https://arxiv.org/abs/1611.07612) .

~~~
robocat
From the arxiv pdf: "We disabled Turbo Boost and set the processor to run at
its highest clock speed" for their comparison.

Doesn't their setup bias towards the AVX2 solution?

~~~
dalke
For what it's worth, I've tested my AVX2 copy of their code on a couple of
different machines, and found it faster than my POPCNT-based implementation.
(See my comments elsewhere here.)

Here's my laptop numbers for my benchmark:

    
    
      2048 Tanimoto (RDKit Morgan radius=2) 1000 queries
      chemfp search using threshold Tanimoto arena, index, single threaded (popcnt_128_128)
        threshold T=0.4 popcnt 19.33 ms/query  (T2: 19.81 ms check: 189655.09 run-time: 39.2 s) (allow specialized)
      chemfp search using threshold Tanimoto arena, index, single threaded (avx2_256)
        threshold T=0.4 avx2 12.63 ms/query  (T2: 14.04 ms check: 189655.09 run-time: 26.7 s) (allow specialized)
    

That's 19.33 ms/query on 2048-bit (256 byte) bitstrings using POPCNT unrolled
to two loops of 128 bytes each, and 12.63 ms/query using AVX2 specialized for
256 bytes.

My testing showed there was no advantage for a fully unrolled POPCNT
implementation. One thing to know is that there is only one execution port for
POPCNT on my Intel processor. An AMD processor with 4 execution ports (Ryzen,
I think?) may be faster. I don't have that processor, and my limited
understanding of the gcc-generated assembly suggests it isn't optimized for
that case.

~~~
robocat
> An AMD processor with 4 execution ports (Ryzen, I think?) may be faster.

FYI m0zg agreed: "AMD can retire 4 (!) popcounts per cycle per core, if your
code is able to feed it":
[https://news.ycombinator.com/item?id=20916023](https://news.ycombinator.com/item?id=20916023)

------
m0zg
Be aware though that Intel POPCNT instruction throughput is pretty low by
today's standards. When you need to popcount a lot of bits, on Intel it might
make sense to do it without the use of POPCNT. This oldie but goodie still
holds, on Intel: [https://danluu.com/assembly-
intrinsics/](https://danluu.com/assembly-intrinsics/)

In contrast, AMD can retire 4 (!) popcounts per cycle per core, if your code
is able to feed it.

There's vectorized popcount in some of the optional AVX512 instruction sets,
but for those to make it into consumer CPUs AMD would really have to hurt
Intel pretty bad, marketshare-wise. :-)

------
dclowd9901
We used a “popcount” style operation + bitwise AND to determine how similar
two people were by interest. We had a bitmap where each bit represented an
interest and 1 meant they had that interest and 0 meant they did not. Bitwise
AND filters out the non overlapping interests and popcount to measure the
“intensity” of their similarity (to tease out people who had almost every or
almost no interests selected matching others like that — bell curve kind of
thing).

~~~
gpderetta
generally, popcnt is useful to implement the Jaccard Distance [1]:

    
    
        J(X,Y) = |X∩Y| / |X∪Y|

So, implementing the sets X and Y as bitsets it becomes:

    
    
        J(X,Y) = popcnt(X&Y) / popcnt(X|Y)

[1]:
[https://en.wikipedia.org/wiki/Jaccard_index](https://en.wikipedia.org/wiki/Jaccard_index)

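Over a word-array bitset representation (my assumption; any packed layout
works), that formula can be sketched as:

```cpp
#include <cstddef>
#include <cstdint>

// Jaccard similarity of two bitsets stored as arrays of 64-bit words:
// popcount the intersection and the union word by word.
double jaccard(const uint64_t* x, const uint64_t* y, size_t nwords) {
    long inter = 0, uni = 0;
    for (size_t i = 0; i < nwords; i++) {
        inter += __builtin_popcountll(x[i] & y[i]);
        uni   += __builtin_popcountll(x[i] | y[i]);
    }
    return uni == 0 ? 0.0 : (double)inter / (double)uni;
}
```
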
~~~
dalke
If anyone needs that to go fast, for fixed-length (up to ~4k bits) try out my
code, chemfp at [http://chemfp.com/](http://chemfp.com/) . It's designed for
cheminformatics, but can work with any data set which can be described by a
fixed-length byte string and an identifier.

------
unhammer
[https://haskell-works.github.io/posts/2018-08-22-pdep-and-
pe...](https://haskell-works.github.io/posts/2018-08-22-pdep-and-pext-bit-
manipulation-functions.html) from the series [https://haskell-
works.github.io/](https://haskell-works.github.io/) shows how to use popcnt
(and rank-select) in a Haskell CSV parser

------
twoodfin
The same computer architect who first told me about popcount's desirability to
the NSA also believed that the massive simultaneous multithreading first
introduced to the HPC world commercially by Tera—now owner of the "Cray"
name—was also a big win for the spooks.

------
nabla9
Revisiting POPCOUNT Operations in CPUs/GPUs
[https://homes.cs.washington.edu/~cdel/papers/spost106s1-pape...](https://homes.cs.washington.edu/~cdel/papers/spost106s1-paper.pdf)

------
winrid
The coolest part of this for me is that compilers can recognize this stuff
and replace it with the machine instruction.

I know that optimizing compilers do some amazing things, but detecting this
particular case is pretty cool.

~~~
snailmailman
I highly recommend Matt Godbolt's talk "What has my compiler done for me
lately?"[0] and its follow-up[1]. Both are great presentations going into
detail on some of the crazy optimizations compilers will do. As someone who
knows very little about the inner workings of compilers, my mind was blown
when I saw some of them.

[0]
[https://www.youtube.com/watch?v=bSkpMdDe4g4](https://www.youtube.com/watch?v=bSkpMdDe4g4)

[1]
[https://www.youtube.com/watch?v=nAbCKa0FzjQ](https://www.youtube.com/watch?v=nAbCKa0FzjQ)

~~~
winrid
Fantastic. Adding to my watch list.

------
oconnor663
If you're building a binary tree incrementally, comparing the length of your
stack of partially completed subtrees to the popcnt of the number of leaves so
far tells you how many subtrees you need to merge :)

------
kristianp
The popcount instruction seems to be a meme on Hacker News. You're cool if
you know about it, and cooler if you've used it. That aside, this article is
a good one for applications of it.

~~~
ncmncm
Knowing when to use bitmap representation, ops, and popcount is a kind of
superpower. When it is the right thing, it makes your program 10x faster.

~~~
gmueckl
This can be said about almost all CPU instructions. x64 has tons of SIMD
intrinsics. You can do absolutely crazy stuff with them if you are willing to
invest the time.

~~~
ncmncm
True enough, but bitmaps are very easy to use portably, with just &, |, ~, ^,
and ++ operations. The only exotic operation needed very frequently is
popcount, but you get that portably from std::bitset (except, evidently, in
MSVC).

The compilers are perfectly happy to keep an unsigned and a bitset variable in
the same register, so that assigning from one to the other is a no-op.
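
That portable route looks like this (a sketch; on GCC and Clang this
compiles to a popcnt instruction when the target supports one):

```cpp
#include <bitset>
#include <cstdint>

// Portable popcount via std::bitset::count, no compiler builtins needed.
int popcount32(uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());
}
```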

------
ncmncm
It's kind of pathetic that the RISC-V core instruction set lacks popcount.

Besides all its immediate uses, it is essential to a fast integer logarithm.

~~~
eleson
It is part of the Bit Manipulation extension. So is count leading zeros.
Unfortunately it is still in draft.

The math extensions are not in draft.

~~~
ncmncm
Being in an extension means you can't count on it being there.

~~~
dralley
That's a feature (for hardware designers), not a bug.

RISC-V is not really designed to be a good high-performance ISA. It's aimed
squarely at the embedded market.

~~~
ncmncm
It's a feature for hardware designers working with 250 nm transistors.

Modern designers working with modern processes have the problem of discovering
what they can throw a million transistors at that will actually make real
programs faster. Hint: bitmanips!

------
thom
Popcount comes up in chess engine programming, where boards are regularly
represented as a series of bits for each type of piece. The Chess Programming
Wiki has a lot of information about both usage and implementation:

[https://www.chessprogramming.org/Population_Count](https://www.chessprogramming.org/Population_Count)

------
markh1967
I think the most probable reason for this instruction is calculating parity
bits. That would need to be done fast, so it makes sense that there would be
a CPU instruction to do most of the work.
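
The popcount-to-parity reduction is just the low bit of the count (a sketch;
GCC also offers __builtin_parity directly):

```cpp
#include <cstdint>

// Parity bit of a word: 1 if an odd number of bits are set, else 0.
int parity(uint32_t x) {
    return __builtin_popcount(x) & 1;
}
```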

~~~
kps
Parity is much easier than counting. It's so easy that you get it for free in
the x86 flags register. (… and because the 8008 was designed to run a
terminal.)

~~~
bogomipz
>"(… and because the 8008 was designed to run a terminal.)"

Could you elaborate on this? How does the 8008 being designed to run a
terminal relate to the parity and the flags register?

~~~
kps
Each iteration from the 8080 through x64 has a parity bit in the flags
register for backwards compatibility with the previous generation. The 8008
was a microprocessor implementation of the Datapoint 2200 architecture.

~~~
bogomipz
Excellent, thanks for the insights. Cheers.

------
MauranKilom
I've used _BitScanReverse (popcount-ish MSVC intrinsic) with a custom octree
indexing scheme, where higher level voxels have more unset bits from the left.
Neat stuff.

------
collyw
MySQL includes a native popcount function; Postgres didn't when I checked a
few years back. I actually needed it at the time.

~~~
jonatron
It's surprisingly easy to implement a popcount user defined function in C for
postgres. It would be a problem for hosted postgres with extension
restrictions though.

------
classichasclass
Power ISA has it too. I'm hard pressed to think of a modern architecture that
_doesn 't_ have it.

~~~
saagarjha
I don’t think ARM does, but there might be some NEON thing that you can use to
do something equivalent.

~~~
vaibhavsagar
ARM NEON includes it as VCNT:
[http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489h/CIHCFDBJ.html)

~~~
saagarjha
Huh, it looks like that only works on 1-byte values? That’s an interesting
choice.

~~~
m0zg
Worse, it's a fertile ground for "interesting" bugs, because VADDV (which sum-
reduces the result) reduces into an 8 bit uint. So if you e.g. accumulate two
or more quadword VCNTs into a uint8x16_t and then VADDV it, you could end up
with something other than the actual overall bit count (because 2 quadwords
can have _256_ bits set). Same with accumulating 8 or more VADDVs, except now
individual bytes could wrap around if you don't widen in between.

------
itodd
I've used this in an effort to calculate Hamming distances of encoded
equal-length strings of DNA.

------
Vizarddesky
Popcount is not at all unusual in ordinary bit-twiddling.

------
guytv
popcount reminds me of my first Map-Reduce program. It seems that I "reduced"
by popcounting.

------
amelius
Peculiar name. I would have named that "bitsum".

~~~
dalke
The original name was "sideways addition". There are many other names,
including Hamming weight. The Wikipedia entry for Hamming weight says the
HP-16C developers agree with you:

> Some programmable scientific pocket calculators feature special commands to
> calculate the number of set bits, e.g. #B on the HP-16C[3][17] and WP
> 43S,[18][19] #BITS[20][21] or BITSUM[22][23] on HP-16C emulators, and nBITS
> on the WP 34S.[24][25]

