Rust's integer intrinsics are impressive
89 points by miqkt on June 10, 2017 | 36 comments



> conspiracy theories around popcount and the NSA

Whoa. Some links (with sometimes not-very-well-thought-out allegations):

https://groups.google.com/forum/#!topic/comp.arch/UXEi7G6WHu...

https://www.schneier.com/blog/archives/2014/05/the_nsa_is_no...


I like that it compiles to a single instruction, but lots of languages already have this; it's pretty common.

Even boring old Java has had Integer.bitCount for many years.
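
For reference, the Rust method the article is about looks like this (a sketch; on x86-64 with the popcnt target feature enabled, LLVM lowers it to a single instruction):

```rust
// count_ones() is Rust's popcount. With the popcnt CPU feature enabled,
// this whole function compiles down to a single popcnt instruction.
fn count(x: u32) -> u32 {
    x.count_ones()
}
```

e.g. count(14) is 3, since 14 = 0b1110.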


I don't know if http://grepcode.com/file/repository.grepcode.com/java/root/j... is representative of all the implementations of bitCount in Java, but it's probably obvious that it's not going to use the popcnt instruction.

I think the point is that Rust makes it much easier to use that instruction. It's possible but hard with GCC using __builtin_popcount(), but I'd guess totally impossible in Java due to the lack of a JVM instruction for it.


That's just the placeholder implementation that will work on any platform and even with the interpreter.

If you look at the openjdk9 sources you will notice that it is annotated as an intrinsic candidate[0]. Earlier versions also have intrinsics for it[1]; they're just not annotated as such.

[0] http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/23721aa1d87f/s... [1] https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc#file-ho...


> obvious that it's not going to use the popcnt opcode instruction ... I'd guess totally impossible in Java

I don't understand why you would guess at what Java can do when you can find out for certain?

    public class Test {
      
      private static int bitCount(int a) {
        return Integer.bitCount(a);
      }
      
      public static void main(String[] args) {
        while (true) {
          bitCount(14);
        }
      }
      
    }
Compiles bitCount to

    0x000000011e637980: sub    rsp,0x18
    0x000000011e637987: mov    QWORD PTR [rsp+0x10],rbp  ;*synchronization entry
                                                  ; - Test::bitCount@-1 (line 4)

    0x000000011e63798c: popcnt eax,esi            ;*invokestatic bitCount
                                                  ; - Test::bitCount@1 (line 4)

    0x000000011e637990: add    rsp,0x10
    0x000000011e637994: pop    rbp
    0x000000011e637995: test   DWORD PTR [rip+0xfffffffff10ed665],eax        # 0x000000010f725000
                                                  ;   {poll_return}
    0x000000011e63799b: ret 
There's your popcnt instruction. No need to guess.


There are methods in Java which have special VM implementations to take advantage of instructions like this. These methods have an ordinary Java code version for the interpreter and, on JIT compilation, are replaced with the other version. For a complete list of these functions see http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee275...


The JIT compiler might recognize this call and replace it with a popcnt instruction. I'm not sure if it does that, though.


Yeah, this is really nothing special; I'd imagine anything that compiles to native code would need all the same intrinsics anyway.


I did some experimentation with popcnt in C++ six years ago and was impressed to find that the intrinsic in gcc was faster than the best inline assembly I could come up with using the popcnt instruction:

https://github.com/bobbyi/Fast-Bit-Counting


Now that most CPUs have population count, it may as well be exposed at the language level.

Doing it by table lookup raises questions such as "am I wasting too much cache space on this?" and "is a 64K table causing cache misses?"
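
For comparison, a byte-table popcount might look like this (a sketch with hypothetical names, trading the 64K table for a 256-byte one):

```rust
// Precompute per-byte popcounts once, at compile time.
const fn build_table() -> [u8; 256] {
    let mut t = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        let mut v = i;
        let mut c = 0u8;
        while v != 0 {
            c += (v & 1) as u8; // count bits of i one at a time
            v >>= 1;
        }
        t[i] = c;
        i += 1;
    }
    t
}

const TABLE: [u8; 256] = build_table();

// Popcount a u32 by summing the table entries for its four bytes.
fn popcount_table(x: u32) -> u32 {
    x.to_le_bytes().iter().map(|&b| TABLE[b as usize] as u32).sum()
}
```

A 256-entry table fits in a fraction of one cache line's worth of L1, which sidesteps the 64K-table concern at the cost of four lookups per word.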


Fortran 2008 added a number of bit twiddling intrinsics[0].

Here are a few of the scalar intrinsics for bit counting.

  popcnt() - population count
  leadz() - leading zero count
  trailz() - trailing zero count
  poppar() - parity
[0] ftp://ftp.nag.co.uk/sc22wg5/n1701-n1750/n1729.pdf

Edited to fix formatting.
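
For what it's worth, the Rust equivalents (there's no separate parity method; one common way is via count_ones):

```rust
// Rust counterparts of the Fortran 2008 bit intrinsics listed above.
fn popcnt(x: u32) -> u32 { x.count_ones() }       // population count
fn leadz(x: u32) -> u32 { x.leading_zeros() }     // leading zero count
fn trailz(x: u32) -> u32 { x.trailing_zeros() }   // trailing zero count
fn poppar(x: u32) -> u32 { x.count_ones() & 1 }   // parity
```

For x = 0b1011_0000 these give 3, 24, 4, and 1 respectively.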


> be exposed at the library level.

TFTFY :-)


Exactly what purpose do the surrounding instructions serve in this and similar simple cases? Is it compiler dogma or a missed uncommon optimization?

        push    rbp
        mov     rbp, rsp
        popcnt  eax, edi
        pop     rbp
        ret


It's a function invocation following x86 C calling convention in assembly. push ebp pushes the base stack pointer of the caller of count() onto the stack, and mov ebp, esp sets the base stack pointer for the callee count(). pop ebp pops the base stack pointer and ret pops the return address into the program counter. count_ones() is replaced with its assembly counterpart, which stores the result in return register eax. This is as optimized as count() as a user-created function can be in assembly given the calling convention. https://en.m.wikipedia.org/wiki/X86_calling_conventions


Correction: rsp and rbp are 64-bit registers storing 64-bit stack memory addresses, while popcount eax, edi is taking the 32-bit parameter as input and returning the 32-bit return value. The calling convention is x64 C.


GCC, for example, allows a "-fomit-frame-pointer" [1] optimization option which would get rid of this. I'm not sure why this isn't done by default in optimized builds. Maybe it has something to do with stack unwinding: if the function panics for some reason, or triggers a CPU fault, there's no way to get a correct backtrace if you don't have the frame pointers on the stack.

[1]: https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Option...


rustc will omit the frame pointer if you add "-C debuginfo=0".

  example::count:
        popcnt  eax, edi
        ret


IIRC it's needed on x86 but not on x86_64


A function is being called, so it gets a stack frame at the start of the call and that stack frame goes away when it returns.

rbp points to the base of the current stack frame (register base pointer) and rsp points to the top of the current stack frame (register stack pointer).

What the function actually does is just the single popcnt instruction.


Yes, I get the reserve space for local variables and so on.

But this is a leaf-function and one that has no register trashing or temporary variables. Could this be inlined later on via link-time optimizations? Will the linker then remove that function prologue and epilogue?


Yeah, it can certainly be inlined or optimized away with various compiler options. However, as you might surmise, debugging without stack frames becomes rather difficult as functions corresponding to source code disappear into a goo of assembly.

The example with slightly different options (as mentioned in another comment) gives what you're looking for: https://godbolt.org/g/GlkEQK


gcc has the -fomit-leaf-frame-pointer option at least.


Those are the function prologue and function epilogue. https://en.wikipedia.org/wiki/Function_prologue


It's exposed as an intrinsic in llvm. C(99? I think) surfaces these in math.h, although the standard is silent on some edge cases (all zeros), and numbers are autopromoted to 32 bit integer. Julia surfaces these as well.

In case you're wondering what this could be useful for besides super secret NSA stuff, and Bitcoin mining, here are a few suggestions:

1) HyperLogLog. (Similar to Bitcoin.) Keep an estimated count of items streaming by: hash them and store the highest leading-zero count of the hashes for each category you're tracking. That max will be ~ log2(distinct item count).

2) Converting from fixed point to floating point. The number of leading zeros in front of your value determines the exponent (or leading ones, in the case of a 2's-complement negative fixed-point value), which is critical to deducing the float representation.
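
That normalization step can be sketched like this (assuming an unsigned u32 fixed-point value; the function name is made up):

```rust
// Find the exponent and normalized mantissa of a fixed-point value
// using leading_zeros; this is the core of a fixed -> float conversion.
fn normalize(x: u32) -> Option<(u32, u32)> {
    if x == 0 {
        return None; // zero has no leading one to normalize on
    }
    let lz = x.leading_zeros();
    let exp = 31 - lz;      // position of the highest set bit
    let mantissa = x << lz; // shift so the top bit is set
    Some((exp, mantissa))
}
```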

Along those lines, one of the things I've done is implemented floating point-like datatypes, which extensively uses lzcount and locount for tracking values and also will use tzcount to measure if the values are exact or not.

https://github.com/Etaphase/FastSigmoids.jl/blob/master/READ...
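
Going back to suggestion 1), here's a toy sketch of the leading-zeros observation HyperLogLog relies on, using std's default hasher (not a real HLL, just the core trick):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// The max leading-zero count of hashed items grows roughly like
// log2(number of distinct items seen), which is what HLL exploits.
fn max_leading_zeros<T: Hash>(items: &[T]) -> u32 {
    items
        .iter()
        .map(|item| {
            let mut h = DefaultHasher::new();
            item.hash(&mut h);
            h.finish().leading_zeros()
        })
        .max()
        .unwrap_or(0)
}
```

A real HLL keeps many such maxima in buckets and averages them to reduce variance; this just shows where leading_zeros comes in.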


I've been using bit population count to traverse packed sparse arrays since the CDC 6600, so it's handy for more things than Hamming distance. Always nice to have in hardware, but pretty cheap to synthesize when it isn't.
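
That traversal trick, sketched for a 32-slot HAMT-style node (names hypothetical): popcount of the mask bits below a slot gives the entry's index in the packed storage.

```rust
// A sparse 32-slot node that stores only the present entries.
struct SparseNode<T> {
    mask: u32,       // bit i set => slot i is occupied
    entries: Vec<T>, // packed entries, in slot order
}

impl<T> SparseNode<T> {
    fn get(&self, slot: u32) -> Option<&T> {
        let bit = 1u32 << slot;
        if self.mask & bit == 0 {
            return None; // slot is empty
        }
        // Rank: popcount of the mask bits below `slot`.
        let idx = (self.mask & (bit - 1)).count_ones() as usize;
        self.entries.get(idx)
    }
}
```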


Reading http://0x80.pl/articles/sse-popcount.html, I would think the hardware instruction is slower than the best software implementation. Or has that changed on newer hardware?


This would not apply to the Rust method shown in the OP, since that one is operating on only 32 bits at a time.

It's unclear how well the benchmarks in this linked article generalize to other applications. If you are just popcounting in a tight loop, probably pretty well, but who does that? In reality you have other things going on, so if this method is occupying too many execution units or polluting your cache, you would see the effect of that on the rest of the program. But it's program-dependent, thus unclear.


> since that one is operating on only 32 bits at a time.

https://doc.rust-lang.org/std/primitive.u64.html#method.coun...

https://doc.rust-lang.org/std/primitive.u128.html#method.cou...

Of course to benefit from the SSE optimizations you would still have to call it in a loop and the optimizer would have to recognize that and replace it with a vectorized approach.
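
i.e. something like this sketch, which LLVM can compile to one popcnt per word (and potentially vectorize on hardware with AVX-512 VPOPCNTDQ):

```rust
// Popcount across a whole slice of 64-bit words.
fn popcount_slice(words: &[u64]) -> u32 {
    words.iter().map(|w| w.count_ones()).sum()
}
```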


You can use the SIMD intrinsics explicitly as well.


This submission originally linked to a blog post whose author (not the HN submitter) asked us to delete it. We don't want to kill the thread, so we removed the URL above.


> Whereas GCC’s __builtin_popcount() only works on unsigned ints Rust’s count_ones() works on signed, too!

C's weak typing (implicit integer conversions) should handle signed just fine.


I was once asked to come up with a table lookup method for popcount on the spot and could not come up with a solution.

Oh, Hacker News.

If someone can't solve a problem like this off the top of their head, does it not act as a strong signal that they are a beginner and you should probably look elsewhere for quality information?


Generally, yes. In this case, though, the post is about a feature of a tool that exists regardless of who notices that it does.


[dead]


Being asked to solve it using lookup tables is not a tricky problem. It's a straightforward application of the bitwise operations common in the field of software engineering.


Indeed, it's a trivial lookup-table problem; it's hard to come up with an easier one. So I would guess the confusion comes from a lack of familiarity with bitwise operations, in which case why is the author presuming to write an article judging a certain implementation of bitwise operations impressive?

The intent of my comment is not to be mean, it's just ... there is a lot of noise on places like Hacker News and this article is part of that noise because, look, just about all compilers have had intrinsics for decades now at least, popcount is a very common one, so it's not surprising to see it turn up. It's not impressive as the title suggests, it's extremely common. And it's nothing specific about Rust because most production-quality languages do it. So both major elements of the article title are pretty much incorrect.

And it's fine not to know that when you're a beginner, I am not knocking that at all. But there's something about writing articles that then get broadcast, that give the wrong impression to other new people who are trying to learn. It's useful information that there is a popcount intrinsic in the Rust compiler, but this would be much more educational coming from someone who understands the context of all this stuff and can explain the real situation. Which may be the author of this article someday, maybe even someday very soon -- I don't wish to be inappropriately negative -- but it's not today.

I never liked going to school, and I think higher education is going to go through an existential crisis pretty soon, if it's not happening already. But one good thing about the old system is that at least there was this idea that you should work hard, and really learn the material, before you go presuming to teach people. And I think that's a very good idea. If you're inexperienced and there's a shortage of teachers and teaching needs to happen, then go for it -- but otherwise I think it is very important to keep in mind what one does and does not understand, and who understands it better, and to not presume to teach until one is in a good position to do so.

I know this goes a little bit against the current philosophy of "programming is great! Anyone can do it! Rah rah," but actually I think on closer inspection it doesn't. There's nothing wrong with participation, and community, and everyone contributing, etc. But it's important to keep an understanding of the difference between beginner contributions and advanced contributions, otherwise it seems possible to suffer a severe degradation of skill in the field over time, because how do people know what to shoot for if people of all expertise levels are teaching them and they can't tell the difference because they themselves are beginners?


FWIW, I think the story hit the front page simply because it was a positive story about Rust and a lot of people in this community reflexively upvote that sort of thing. I took your comment as well deserved criticism against the community.



