
Hand-Coded Assembly Beats Intrinsics in Speed and Simplicity - ingve
http://danluu.com/assembly-intrinsics/
======
haberman
Previous discussion:
[https://news.ycombinator.com/item?id=8508923](https://news.ycombinator.com/item?id=8508923)

------
Aardwolf
Even more ideally you wouldn't need intrinsics. You'd just say what you want,
with all constraints. E.g. I have two integers, this is their signedness, this
is the range of expected inputs, this is the probability distribution of the
inputs, and this is what I want done with them, nothing more... now you,
compiler, figure it out.

As it is, the compiler too often generates suboptimal code because it has to
account for edge cases that it doesn't know are irrelevant here.

And then you only have the limited, not-actually-well-specified operators of
C (something as simple as "+" is not fully specified on signed integers), so
you can express even less of what you actually want to the compiler... You
can't even do a simple overflow check without bordering on undefined behaviour
that allows compilers to do whatever they want.

If you could tell better what you want, the compiler could better choose the
perfect CPU instructions for it.

So imho, a programmer shouldn't choose the CPU instruction as that doesn't
allow portability, but the programmer should have the ability to specify
things better than C now allows :)

~~~
ploxiln
The theory of the optimal compiler with perfect knowledge never really comes
true, and it's been around for a long, long time. Many years ago some people
said that Java would be faster than C, because it could optimize based on
information learned at run time, and there was even a microbenchmark or two to
prove it (against naive C, of course). They also said that Itanium's crazy
VLIW architecture would allow compilers to make more efficient use of
processor resources than hardware out-of-order instruction schedulers could.
It's been a decade since the height of Java and Itanium, and the results speak
for themselves.

This popcount thing is a great example. All your layers of abstraction didn't
save you from a bug in the lowest layer! (Granted, the only result is some
inefficiency, and we sacrifice that all the time.) Keep it simple: you can
use abstractions, but keep them transparent and debuggable. Or not. But don't
tell me I should use a high-level language and trust the compiler and runtime.
Never trust...

~~~
Ace17
"you can use abstractions but keep them transparent"

Wouldn't it defeat the purpose of abstraction?

~~~
ploxiln
Not all the purposes. Abstractions can still make the complexity more
manageable. Linux kernel developers write the vast majority of their code in C
but very often inspect the assembly instructions generated by the compiler.
Web developers have many templating languages with macro capabilities (or sass
for css), but still want the generated html to be basically readable, so they
can verify the templating is doing what they want, and they can debug things.

Some people really can and do ignore things below a particular abstraction,
but then there are others who must service that layer transition. And not just
the original authors of it. If you have an effective web company of more than,
perhaps, 30 developers, at least one of them is dealing with server connection
state statistics, occasionally problematic internet routes, occasionally
problematic switch configuration or hardware (even on an opaque cloud
platform). Someone has to deal with it, or else stuff is kinda flaky, and no
one seems to know why...

------
corysama
danluu knows much more about the subject matter than I do. Meanwhile...

If I take the main.cpp from
[http://www.strchr.com/media/crc32_popcnt.zip](http://www.strchr.com/media/crc32_popcnt.zip)
that he was comparing, paste it into a new, default VC2010 project, and
compile it as a Release build, then the mixed disassembly for the inner loop
of POPCNT_HardwareUnrolled() looks like

    
    
    		    cnt += __popcnt(*(DWORD*)buf) +
    			       __popcnt(*(DWORD*)(buf + sizeof(DWORD) )) +
    			       __popcnt(*(DWORD*)(buf + sizeof(DWORD) * 2 )) +
    			       __popcnt(*(DWORD*)(buf + sizeof(DWORD) * 3 ));
        00E71BE0  popcnt      ebx,dword ptr [edx+8]  
        00E71BE5  popcnt      esi,dword ptr [edx+0Ch]  
        00E71BEA  add         esi,ebx  
        00E71BEC  popcnt      ebx,dword ptr [edx+4]  
        00E71BF1  add         esi,ebx  
        00E71BF3  popcnt      ebx,dword ptr [edx]  
        00E71BF7  add         eax,ebx  
        00E71BF9  add         eax,esi  
    		    buf += sizeof(DWORD) * 4;
        00E71BFB  add         edx,10h  
        00E71BFE  dec         ecx  
        00E71BFF  jne         POPCNT_HardwareUnrolled+50h (0E71BE0h)  
    

which looks to me as good as or better than his inline asm

    
    
        00000001000013c3        popcntq %r10, %r10
        00000001000013c8        addq    %r10, %rcx
        00000001000013cb        popcntq %r11, %r11
        00000001000013d0        addq    %r11, %r9
        00000001000013d3        popcntq %r14, %r14
        00000001000013d8        addq    %r14, %r8
        00000001000013db        popcntq %rbx, %rbx
    

And it does not have the odd

    
    
        000000010000133e        movl    %ecx, %r10d
        0000000100001341        movl    %edx, %r11d
        0000000100001344        movl    %eax, %r14d
        0000000100001347        movl    %r8d, %ebx
    

that he was complaining about.

Maybe there's some compiler issue with arrays that makes using "int cnt[4];"
as the accumulator induce the seemingly extraneous movls?

~~~
corysama
Double-checked to make sure it wasn't a 64-bit issue:

    
    
    		    cnt += __popcnt(*(DWORD*)buf) +
    			       __popcnt(*(DWORD*)(buf + sizeof(DWORD) )) +
    			       __popcnt(*(DWORD*)(buf + sizeof(DWORD) * 2 )) +
    			       __popcnt(*(DWORD*)(buf + sizeof(DWORD) * 3 ));
        000000013F231DF0  popcnt      eax,dword ptr [r8+8]  
        000000013F231DF6  popcnt      ecx,dword ptr [r8+0Ch]  
    		    buf += sizeof(DWORD) * 4;
        000000013F231DFC  add         r8,10h  
        000000013F231E00  add         ecx,eax  
        000000013F231E02  popcnt      eax,dword ptr [r8-0Ch]  
        000000013F231E08  add         ecx,eax  
        000000013F231E0A  popcnt      eax,dword ptr [r8-10h]  
        000000013F231E10  add         r9d,eax  
        000000013F231E13  add         r9d,ecx  
        000000013F231E16  dec         rdx  
        000000013F231E19  jne         POPCNT_HardwareUnrolled+50h (13F231DF0h)

~~~
minimax
This also has the false dependency problem (using eax as the popcnt
destination register three times), and neither of your VC examples is using
the 8-byte popcnt instruction.

~~~
corysama
Switching to __popcnt64 uses the 8-byte popcnt, but I can't get rid of the
false dependency problem.

    
    
        RES64 cnt0=0, cnt1=0, cnt2=0, cnt3=0;
    	SIZE_T ndqwords = len / (sizeof(RES64) * 4);
    	for(; ndqwords; ndqwords--) {
    		cnt0 += __popcnt64(*(RES64*)buf);
    		cnt1 += __popcnt64(*(RES64*)(buf + sizeof(RES64) ));
    		cnt2 += __popcnt64(*(RES64*)(buf + sizeof(RES64) * 2 ));
    		cnt3 += __popcnt64(*(RES64*)(buf + sizeof(RES64) * 3 ));
    		buf += sizeof(RES64) * 4;
    	}
        cnt = cnt0 + cnt1 + cnt2 + cnt3;
    
    
        000000013FD61FF3  popcnt      rax,qword ptr [rdx]  
        000000013FD61FF8  add         rdx,20h  
        000000013FD61FFC  add         r9,rax  
        000000013FD61FFF  popcnt      rax,qword ptr [rdx-18h]  
        000000013FD62005  add         r10,rax  
        000000013FD62008  popcnt      rax,qword ptr [rdx-10h]  
        000000013FD6200E  add         r8,rax  
        000000013FD62011  popcnt      rax,qword ptr [rdx-8]  
        000000013FD62017  add         r11,rax  
        000000013FD6201A  dec         rcx  
        000000013FD6201D  jne         main+0C3h (13FD61FF3h)

~~~
corysama
This gets rid of the false dependency problem. Not a shining example of
simplicity or reliability, but it's interesting that it works.

    
    
        struct A { RES64 cnt0; RES64 cnt1; RES64 cnt2; RES64 cnt3; } a;
        RES64 cnt0=0, cnt1=0, cnt2=0, cnt3=0;
        SIZE_T ndqwords = len / (sizeof(RES64) * 4);
        for(; ndqwords; ndqwords--) {
            a.cnt0 = __popcnt64(*(RES64*)buf);
            a.cnt1 = __popcnt64(*(RES64*)(buf + sizeof(RES64) ));
            a.cnt2 = __popcnt64(*(RES64*)(buf + sizeof(RES64) * 2 ));
            a.cnt3 = __popcnt64(*(RES64*)(buf + sizeof(RES64) * 3 ));
            cnt0 += a.cnt0;
            cnt1 += a.cnt1;
            cnt2 += a.cnt2;
            cnt3 += a.cnt3;
            buf += sizeof(RES64) * 4;
        }
        cnt = cnt0 + cnt1 + cnt2 + cnt3;
    
        000000013F2320C0  popcnt      rax,qword ptr [r8]  
        000000013F2320C5  popcnt      rcx,qword ptr [r8+8]  
        000000013F2320CB  popcnt      rdx,qword ptr [r8+10h]  
        000000013F2320D1  popcnt      rbx,qword ptr [r8+18h]  
        000000013F2320D7  add         r11,rax  
        000000013F2320DA  add         rdi,rcx  
        000000013F2320DD  add         rsi,rdx  
        000000013F2320E0  add         r8,20h  
        000000013F2320E4  add         rbp,rbx  
        000000013F2320E7  dec         r10  
        000000013F2320EA  jne         main+0D0h (13F2320C0h)

------
mschuster91
One thing that has always interested me: how is backwards compatibility done
for older CPU instruction sets?

Take the new VFMADD* instructions, say. If I wanted to ship a binary that
supports post-2013 CPUs as well as earlier ones, my way of doing this would
be:

1) have a huge array of function pointers for every function that could use
said instructions

2) in main() check if the CPU supports the instructions, if yes: populate
array with fast functions, if not, populate with backwards-compatible
functions.

Naturally this comes with a performance hit at every call, since it adds at
least one indirection (or two, if you fill the arrays at compile time and
just switch the array pointer in main). Is this really how stuff gets done?

~~~
nteon
If you're using shared libraries in your program, you already have this extra
indirection - at compilation/link time the compiler/linker doesn't know where
in memory the shared library will be when the program is run, so every call
into a library goes through the procedure linkage table (PLT), which works by
looking up the 'real' function pointer in the global offset table (GOT).

So yea, populating this at runtime depending on the features supported by
your CPU is possible, and the Linux dynamic linker does this:
[http://man7.org/linux/man-pages/man8/ld.so.8.html](http://man7.org/linux/man-pages/man8/ld.so.8.html)
(check out the section on hardware capabilities).

~~~
pjc50
We should bring this point out next time there's a wave of comments saying
that static linking is the one true way.

~~~
qu4z-2
If you need to redirect all your function calls through a giant array, you can
_technically_ do that with static linking too. Just sayin'...

------
CyberDildonics
Before dropping down to assembly or even intrinsics, check out ISPC. It's a
little tricky to learn how to use varying and uniform, and the compiler
crashes here and there, but my god if it doesn't produce crazy fast programs.

