
I have been doing some exploration of how well Rust optimizes Iterators and have been quite impressed.

Writing an iterator to provide the individual bits supplied by an iterator of bytes means you can count them with

    fn count_bits<I: Iterator<Item = bool>>(it: I) -> i32 {
        let mut a = 0;
        for i in it {
            if i { a += 1 }
        }
        a
    }
Counting bits in an array of bytes would need something like this

    let p: [u8; 6] = [1, 2, 54, 2, 3, 6];
    let result = count_bits(bits(p.iter().cloned()));
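The bits adapter isn't shown in the post; a minimal sketch of such an adapter (an illustration, not necessarily the exact code behind the asm below) could look like:

    // A sketch of a bits adapter: wraps an iterator of bytes and yields
    // one bool per bit, most significant bit first (matching the asm
    // below, where the mask starts at 128 and shifts right).
    struct Bits<I> {
        inner: I,
        byte: u8,
        mask: u8, // 0 means "fetch the next byte"
    }

    fn bits<I: Iterator<Item = u8>>(inner: I) -> Bits<I> {
        Bits { inner, byte: 0, mask: 0 }
    }

    impl<I: Iterator<Item = u8>> Iterator for Bits<I> {
        type Item = bool;

        fn next(&mut self) -> Option<bool> {
            if self.mask == 0 {
                self.byte = self.inner.next()?;
                self.mask = 128;
            }
            let bit = self.byte & self.mask != 0;
            self.mask >>= 1;
            Some(bit)
        }
    }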
Checking what that generates in asm: https://godbolt.org/g/iTyfap

The core of the code is

    .LBB0_4:
        mov     esi, edx        ;edx has the current mask of the bit we are looking at
        and     esi, ecx        ;ecx is the byte we are examining
        cmp     esi, 1          ;check the bit to see if it is set (note using carry not zero flag)
        sbb     eax, -1         ;fun way to conditionally add 1
    .LBB0_1:
        shr     edx             ;shift mask to the next bit
        jne     .LBB0_4         ;if mask still has a bit in it, go do the next bit otherwise continue to get the next byte 
        cmp     rbx, r12        ;r12 has the memory location of where we should stop.   Are we there yet?
        je      .LBB0_5         ; if we are there, jump out. we're all done
        movzx   ecx, byte ptr [rbx]  ;get the next byte
        inc     rbx             ; advance the pointer
        mov     edx, 128        ; set a new mask starting at the top bit
        jmp     .LBB0_4         ; go get the next bit
    .LBB0_5:
Apart from magical bit counting instructions this is close to what I would have written in asm myself. That really impressed me. I'm still a little wary of hitting a performance cliff. I worry that I can easily add something that will mean the optimiser bails on the whole chain, but so far I'm trusting Rust more than I have trusted any other optimiser.

If this produces similarly nice code (I haven't checked yet) I'll be very happy

    for (dest, source) in self.buffer.iter_mut().zip(data) { *dest = source }



How much of this is the Rust compiler and how much is generic LLVM optimization passes?


It's the combination of LLVM, the Rust compiler providing enough information to LLVM, and very optimized library code. As well as plenty of testing to make sure this kind of stuff doesn't regress :)


The compiler itself currently does only some basic optimizations; the rest is all LLVM.

However, the compiler does have perfect aliasing info, and it does provide a subset of this info to LLVM (I don't think LLVM IR currently supports more fine grained aliasing info being provided from the compiler). This does help certain optimizations; though I'm unsure if this one is one of them.


Okay, wait. Every modern CPU has a popcount instruction, so any hand-coded implementation would use that, meaning the compiler output is actually pretty bad in an absolute sense.

But if you find popcount too "magical", the commonly-known fast way to count bits is via masking, shifts and adds, so that you do it in log(n) steps. Which also would perform much better than this solution.
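
For reference, that mask-shift-add approach (the classic SWAR popcount, sketched here for a u32) looks like:

    // Classic SWAR popcount: sum bits in pairs, then nibbles, then add
    // the per-byte counts, in O(log n) steps rather than one per bit.
    fn popcount32(mut x: u32) -> u32 {
        x = x - ((x >> 1) & 0x5555_5555);                  // 2-bit sums
        x = (x & 0x3333_3333) + ((x >> 2) & 0x3333_3333);  // 4-bit sums
        x = (x + (x >> 4)) & 0x0f0f_0f0f;                  // 8-bit sums
        x.wrapping_mul(0x0101_0101) >> 24                  // total lands in the top byte
    }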

So what you're really saying is "the compiler managed to make a pretty efficient representation of the naive solution" which is fine but it does not mean your code is fast.


> Okay, wait. Every modern CPU has a popcount instruction

What do you consider a "modern CPU"? Atom chips sold less than a decade ago didn't support popcnt. AMD shipped some C-series chips without support for it as recently as 2012. The low-end chips were likely to be sold later without refreshes to newer features in some cases, and those are also likely the ones to be repurposed for small and cheap x86 devices later. I wouldn't want the default compilation settings without specifying CPU extensions to use an extension that might not exist on my target platform.


This is why you have a fallback to a generic version.
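
In Rust, that dispatch can be sketched with std's runtime feature detection (count_ones serving as the portable fallback):

    // Sketch: use the popcnt instruction where the CPU supports it,
    // otherwise fall back to a portable implementation.
    #[cfg(target_arch = "x86_64")]
    fn popcount(x: u64) -> u32 {
        if is_x86_feature_detected!("popcnt") {
            // SAFETY: the popcnt feature was just verified at runtime.
            unsafe { popcount_hw(x) }
        } else {
            popcount_generic(x)
        }
    }

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "popcnt")]
    unsafe fn popcount_hw(x: u64) -> u32 {
        x.count_ones() // compiles to a single popcnt with the feature enabled
    }

    fn popcount_generic(x: u64) -> u32 {
        x.count_ones() // without popcnt, LLVM emits the mask-and-shift sequence
    }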


As I stated in another reply. I don't actually want to count bits at all. Counting is just the simplest thing I could do with the bits in the test case. Turning a stream of bytes into a stream of bits can be quite useful.


But if you are reading a stream of bits, you don't want to read one bit at a time either, because that is pathologically slow. You want to read n bits at a time (where n probably varies each call), at which point you're doing a very standard mask-and-shift with no magic...
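
A sketch of such a reader over a byte slice (an illustration, MSB-first, n up to 32):

    // Mask-and-shift bit reader: pulls n bits per call from a byte
    // slice, most significant bit first, instead of going bit by bit.
    struct BitReader<'a> {
        bytes: &'a [u8],
        acc: u64,    // bit accumulator
        filled: u32, // number of valid bits in acc
    }

    impl<'a> BitReader<'a> {
        fn new(bytes: &'a [u8]) -> Self {
            BitReader { bytes, acc: 0, filled: 0 }
        }

        // Read the next n bits (n <= 32); None once the input runs dry.
        fn read(&mut self, n: u32) -> Option<u32> {
            while self.filled < n {
                let (&byte, rest) = self.bytes.split_first()?;
                self.bytes = rest;
                self.acc = (self.acc << 8) | u64::from(byte);
                self.filled += 8;
            }
            self.filled -= n;
            Some(((self.acc >> self.filled) & ((1u64 << n) - 1)) as u32)
        }
    }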


That's really impressive, but I'm a little surprised LLVM didn't optimize that inner loop to a popcnt instruction.


You may have to specify a flag like "-C target-cpu=native" -- not all CPUs support popcnt, so LLVM can't generate it without knowing details about the target. I can't get the compiler on godbolt.org to generate it either way, though.


You need "-C opt-level=3 -C target-cpu=native -C target-feature=+ssse3"

I actually wrote about it last week: http://tmccrmck.github.io//post/rust-optimization-partii/


You can drop `-C target-feature=+ssse3` since `target-cpu=native` will cover that. The `-C opt-level=3` isn't necessary either, but if you drop that, it looks like there's a function that doesn't get inlined. However, even without opt-level=3, the popcnt instruction is still emitted.
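
For example, a function like this compiles to a single popcnt once target-cpu=native (on a popcnt-capable machine) is in effect:

    // With -C target-cpu=native on a popcnt-capable CPU, the body of
    // this function becomes a single popcnt instruction.
    pub fn count(x: u64) -> u32 {
        x.count_ones()
    }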


There's a little typo in your C implementation of popcount: you reference `x` when you probably meant `n`


Thanks. I fixed it.


Does Rust support FMV (function multi-versioning)?


> If this produces similarly nice code (I haven't checked yet) I'll be very happy

You didn't specify what the type of `buffer` was, so I picked a `u8`. This code[1]:

    pub struct Thing {
        buffer: Vec<u8>,
    }
    
    impl Thing {
        pub fn copy(&mut self, data: &[u8]) {
            for (dest, &source) in self.buffer.iter_mut().zip(data) {
                *dest = source;
            }
        }
    }
Produces this assembly:

    _ZN10playground5Thing4copy17hf523bcb10e2298f3E:
    	.cfi_startproc
    	pushq	%rax
    .Ltmp0:
    	.cfi_def_cfa_offset 16
    	movq	16(%rdi), %rax
    	cmpq	%rdx, %rax
    	cmovbeq	%rax, %rdx
    	testq	%rdx, %rdx
    	je	.LBB0_2
    	movq	(%rdi), %rdi
    	callq	memcpy@PLT
    .LBB0_2:
    	popq	%rax
    	retq
The call to `memcpy` is what makes me happy.

----

Your `count_bits` already exists as a combination of iterator adapters (`filter`[2] and `count`[3]):

    a_bit_iterator.filter(|&bit| bit).count()
Although if you have a numeric value, I'd suggest using `count_ones`[4], which can use the `popcnt` intrinsic.

If you wanted to count all the bits in an array, I'd suggest `map`[5] and `sum`[6]:

    let p = [1u8, 2, 54, 2, 3, 6];
    let result: u32 = p.iter().map(|b| b.count_ones()).sum();
If you wanted to keep the bit iterator, you could also use `flat_map`[7].
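
For example (an illustrative sketch, expanding each byte into its bits inline rather than via a named adapter):

    // Expand each byte into eight bools with flat_map, then count the
    // set bits with the same filter/count combination as above.
    let result = p
        .iter()
        .flat_map(|&byte| (0..8).rev().map(move |i| byte & (1u8 << i) != 0))
        .filter(|&bit| bit)
        .count();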

[1]: https://play.integer32.com/?gist=03f8ffbe3ade6ced4d315c8e020...

[2]: https://doc.rust-lang.org/std/iter/trait.Iterator.html#metho...

[3]: https://doc.rust-lang.org/std/iter/trait.Iterator.html#metho...

[4]: https://doc.rust-lang.org/std/primitive.u8.html#method.count...

[5]: https://doc.rust-lang.org/std/iter/trait.Iterator.html#metho...

[6]: https://doc.rust-lang.org/std/iter/trait.Iterator.html#metho...

[7]: https://doc.rust-lang.org/std/iter/trait.Iterator.html#metho...


>The call to `memcpy` is what makes me happy.

That's the sort of thing I was hoping to see.

>Your `count_bits` already exists as a combination of iterator adapters (`filter`[2] and `count`[3]):

That's the problem with simple examples. I don't actually want to count bits. It was just the minimum workload I could think of to generate a result from the conversion.

Seeing how well for (a, b) in ai.zip(bi) works, I'll probably be writing a bitmap glyph renderer that is basically if b { *a = color }
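
Something like this sketch, one glyph row at a time:

    // Sketch of the glyph loop: pair each destination pixel with a
    // glyph bit and write the colour only where the bit is set.
    fn draw_row(dest: &mut [u32], glyph_row: u8, color: u32) {
        let bits = (0..8).rev().map(|i| glyph_row & (1u8 << i) != 0);
        for (a, b) in dest.iter_mut().zip(bits) {
            if b {
                *a = color;
            }
        }
    }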



