Rust's integer intrinsics are impressive - miqkt
======
mrkgnao
> conspiracy theories around popcount and the NSA

Whoa. Some links (with sometimes not-very-well-thought-out allegations):

[https://groups.google.com/forum/#!topic/comp.arch/UXEi7G6WHu...](https://groups.google.com/forum/#!topic/comp.arch/UXEi7G6WHuU)

[https://www.schneier.com/blog/archives/2014/05/the_nsa_is_no...](https://www.schneier.com/blog/archives/2014/05/the_nsa_is_not_.html)

------
morecoffee
I like that it compiles to a single instruction, but lots of languages already
have this; it's pretty common.

Even boring old Java has had Integer.bitCount for many years.

~~~
joewalker
I don't know if
[http://grepcode.com/file/repository.grepcode.com/java/root/j...](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/lang/Integer.java?av=f#1444)
is representative of all the implementations of bitCount in Java, but that
source is clearly not written to use the popcnt instruction.

I think the point is that Rust makes it much easier to use that instruction.
It's possible but hard with GCC using __builtin_popcount(), but I'd guess
totally impossible in Java due to the lack of a JVM instruction for the same.

~~~
the8472
That's just the placeholder implementation that will work on any platform and
even with the interpreter.

If you look at the OpenJDK 9 sources you will notice that it is annotated as
an intrinsic candidate[0]. Earlier versions also have intrinsics for that[1];
they're just not annotated as such.

[0] [http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/23721aa1d87f/s...](http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/23721aa1d87f/src/java.base/share/classes/java/lang/Integer.java#l1676)

[1] [https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc#file-ho...](https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc#file-hotspot-jvm-intrinsics-L38)

------
bobbyi_settv
I did some experimentation with popcnt in C++ six years ago and was impressed
to find that the intrinsic in gcc was faster than the best inline assembly I
could come up with using the popcnt instruction:

[https://github.com/bobbyi/Fast-Bit-Counting](https://github.com/bobbyi/Fast-Bit-Counting)

------
Animats
Now that most CPUs have population count, it may as well be exposed at the
language level.

Doing it by table lookup results in questions such as "am I wasting too much
cache space on this?" and "is a 64K table causing cache misses?".
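
For reference, a minimal sketch of the table-lookup approach in Rust, assuming a 256-entry byte table rather than the 64K one mentioned above. The names `build_table`, `POPCOUNT_TABLE` and `popcount_table` are made up for illustration.

```rust
/// Build a 256-entry table where entry i holds the number of set bits in i.
const fn build_table() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut i = 1;
    while i < 256 {
        // bits(i) = bits(i / 2) + lowest bit of i
        table[i] = table[i / 2] + (i & 1) as u8;
        i += 1;
    }
    table
}

/// 256 bytes of lookup data, computed at compile time.
static POPCOUNT_TABLE: [u8; 256] = build_table();

/// Count the set bits of a 32-bit word with four table lookups, one per byte.
pub fn popcount_table(x: u32) -> u32 {
    x.to_le_bytes()
        .iter()
        .map(|&b| u32::from(POPCOUNT_TABLE[b as usize]))
        .sum()
}
```

A byte-wide table costs 256 bytes of cache and four lookups per word; a 16-bit table halves the lookups but needs 64K entries, which is the trade-off behind those questions.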

~~~
tjalfi
Fortran 2008 added a number of bit twiddling intrinsics[0].

Here are a few of the scalar intrinsics for bit counting.

    
    
      popcnt() - population count (number of set bits)
      leadz() - number of leading zero bits
      trailz() - number of trailing zero bits
      poppar() - parity of the population count
    

[0] ftp://ftp.nag.co.uk/sc22wg5/n1701-n1750/n1729.pdf

Edited to fix formatting.

------
Flow
Exactly what purpose do the surrounding instructions serve in this and
similar simple cases? Is it compiler dogma or a missed uncommon optimization?

    
    
            push    rbp
            mov     rbp, rsp
            popcnt  eax, edi
            pop     rbp
            ret

~~~
thegeomaster
GCC, for example, allows a "-fomit-frame-pointer" [1] optimization option
which would get rid of this. I'm not sure why this isn't done by default in
optimized builds. Maybe it has something to do with stack unwinding: if the
function panics for some reason, or triggers a CPU fault, there's no simple
way to get the correct backtrace if you don't have the frame pointers on the
stack.

[1]: [https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Option...](https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html)

~~~
jannic
rustc will omit the frame pointer if you add "-C debuginfo=0".

    
    
      example::count:
            popcnt  eax, edi
            ret
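
The blog post is no longer linked, so this is only a guess at the source behind that listing: a one-line wrapper around `count_ones()`. The function name `count` is taken from the `example::count` label; everything else is an assumption.

```rust
/// Counts the set bits of a 32-bit integer.
///
/// On x86-64 this can compile to a single `popcnt` only when that
/// instruction is enabled for the target, for example with
/// `-C target-cpu=native` on a CPU that has it or with
/// `-C target-feature=+popcnt`. Otherwise the compiler emits a software
/// fallback.
pub fn count(x: u32) -> u32 {
    x.count_ones()
}
```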

------
dnautics
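It's exposed as an intrinsic in LLVM. As far as I know C doesn't surface these
in math.h; GCC and Clang offer them as builtins such as __builtin_popcount()
and __builtin_clz(), although the result is undefined in some edge cases (all
zeros for clz and ctz), and the plain variants take an unsigned int. Julia
surfaces these as well.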

In case you're wondering what this could be useful for besides super secret
NSA stuff, and Bitcoin mining, here are a few suggestions:

1) hyperloglog. (Similar to Bitcoin). Keep an estimated count of items
streaming by, by hashing them and storing the highest lzcount of the hashes
for each category you're tracking. This will be ~ log2(category count)

2) converting from fixed point to floating point. The number of zeros in front
of your value represents the exponent of your value (or ones in the case of a
2's complement negative fixed point), which is critical to deducing the float
representation.
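
Both suggestions can be sketched with `leading_zeros()` in Rust. This is a simplified illustration, not a full HyperLogLog or a complete float conversion; the function names and the Q16.16 layout are assumptions made for the example.

```rust
/// Suggestion 1: keep the highest leading-zero count seen among the hashes
/// of one category. The running maximum grows roughly like log2 of the
/// number of distinct items.
pub fn observe(max_lz: u32, hash: u64) -> u32 {
    max_lz.max(hash.leading_zeros())
}

/// Suggestion 2: binary exponent (floor of log2) of an unsigned Q16.16
/// fixed-point value. Returns None for zero, which has no leading one bit.
pub fn fixed_exponent(raw: u32) -> Option<i32> {
    if raw == 0 {
        return None;
    }
    // Position of the highest set bit, counted from bit 0.
    let msb = 31 - raw.leading_zeros() as i32;
    // The binary point sits 16 bits up, so subtract the fractional width.
    Some(msb - 16)
}
```

For example, the Q16.16 encoding of 1.0 is `1 << 16`, whose highest set bit is bit 16, giving an exponent of 0.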

Along those lines, one of the things I've done is implement floating-point-like
datatypes, which make extensive use of lzcount and locount for tracking values
and also use tzcount to measure whether the values are exact or not.

[https://github.com/Etaphase/FastSigmoids.jl/blob/master/READ...](https://github.com/Etaphase/FastSigmoids.jl/blob/master/README.md)

------
pklausler
I've been using bit population count to traverse packed sparse arrays since
the CDC 6600, so it's handy for more things than Hamming distance. Always nice
to have in hardware, but pretty cheap to synthesize when it isn't.
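
A rough sketch of the packed sparse array idea in Rust, assuming a 64-slot bitmap plus a dense vector of the present values. The type `PackedSparse` and its methods are hypothetical names, not anything from the CDC 6600 era.

```rust
/// A sparse array of up to 64 slots: bit i of `bitmap` says whether slot i
/// is present, and the present values are stored contiguously in `values`.
pub struct PackedSparse<T> {
    bitmap: u64,
    values: Vec<T>,
}

impl<T> PackedSparse<T> {
    pub fn new() -> Self {
        PackedSparse { bitmap: 0, values: Vec::new() }
    }

    /// Number of occupied slots below `slot`, which is the index of
    /// `slot` in `values`. Requires slot < 64.
    fn rank(&self, slot: u32) -> usize {
        let below = self.bitmap & ((1u64 << slot) - 1);
        below.count_ones() as usize
    }

    /// Insert or overwrite the value stored at `slot`.
    pub fn insert(&mut self, slot: u32, value: T) {
        assert!(slot < 64, "slot out of range");
        let idx = self.rank(slot);
        if self.bitmap & (1u64 << slot) != 0 {
            self.values[idx] = value;
        } else {
            self.values.insert(idx, value);
            self.bitmap |= 1u64 << slot;
        }
    }

    /// Look up the value at `slot`, if the slot is occupied.
    pub fn get(&self, slot: u32) -> Option<&T> {
        if slot >= 64 || self.bitmap & (1u64 << slot) == 0 {
            return None;
        }
        Some(&self.values[self.rank(slot)])
    }
}
```

The population count of the bits below a slot gives its position in the dense vector, so no per-slot index needs to be stored.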

------
Someone
Reading [http://0x80.pl/articles/sse-popcount.html](http://0x80.pl/articles/sse-popcount.html),
I would think the hardware instruction is slower than the best software
implementation. Or has that changed on newer hardware?

~~~
jblow
This would not apply to the Rust method shown in the OP, since that one is
operating on only 32 bits at a time.

It's unclear how well the benchmarks in this linked article generalize to
other applications. If you are just popcounting in a tight loop, probably
pretty well, but who does that? In reality you have other things going on, so
if this method is occupying too many execution units or polluting your cache,
you would see the effect of that on the rest of the program. But it's program-
dependent, thus unclear.

~~~
the8472
> since that one is operating on only 32 bits at a time.

[https://doc.rust-lang.org/std/primitive.u64.html#method.coun...](https://doc.rust-lang.org/std/primitive.u64.html#method.count_ones)

[https://doc.rust-lang.org/std/primitive.u128.html#method.cou...](https://doc.rust-lang.org/std/primitive.u128.html#method.count_ones)

Of course to benefit from the SSE optimizations you would still have to call
it in a loop and the optimizer would have to recognize that and replace it
with a vectorized approach.
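
Such a loop might look like the sketch below. The helper name `count_ones_slice` is made up, and whether the optimizer actually vectorizes it depends on the compiler version and the enabled target features, so treat that part as an open question.

```rust
/// Total number of set bits across a slice of 64-bit words.
pub fn count_ones_slice(words: &[u64]) -> u64 {
    words.iter().map(|w| u64::from(w.count_ones())).sum()
}
```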

~~~
jackmott
You can also use the SIMD intrinsics explicitly.

------
dang
This submission originally linked to a blog post whose author (not the HN
submitter) asked us to delete it. We don't want to kill the thread, so we
removed the URL above.

------
__s
> Whereas GCC’s __builtin_popcount() only works on unsigned ints Rust’s
> count_ones() works on signed, too!

C's weak typing should handle signed values as well.
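
On the Rust side the quoted claim is easy to check. A small sketch, with `count_signed` as a made-up wrapper name; the count is taken over the two's complement bit pattern.

```rust
/// Counts set bits in the two's complement representation of a signed value.
pub fn count_signed(x: i32) -> u32 {
    x.count_ones()
}
```

Here `count_signed(-1)` is 32, since -1 is all ones in two's complement, which is the same answer as casting to `u32` first.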

------
jblow
_I was once asked to come up with a table lookup method for popcount on the
spot and could not come up with a solution._

Oh, Hacker News.

If someone can't solve a problem like this off the top of their head, does it
not act as a strong signal that they are a beginner and you should probably
look elsewhere for quality information?

~~~
aisofteng
Generally, yes. In this case, though, the post is about a feature of a tool
that exists regardless of who notices that it does.

