
Exploring calling conventions with x86 assembly - klixon
http://apoorvaj.io/exploring-calling-conventions.html
======
hDeraj
When calling a simple function like this within a large loop, would it make a
noticeable difference in speed to inline the computation vs. having a function
call? If so, what's the best practice for inlining a computation like this? I
imagine a macro would be the simplest solution but I'm interested to hear any
other techniques that are used.

~~~
unsignedqword
It can sometimes make a difference, but usually the compiler's optimizer does
a good job of deciding whether or not a function should be inlined.

If you want you can nudge the compiler in the direction you want via the
"inline" keyword, although the compiler won't always take this suggestion to
heart. MSVC has "__forceinline" but it too will not always comply.

Before the "inline" keyword, macros were the standard way to do this, IIRC.
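
To make the macro vs. inline function difference concrete, here is a minimal C sketch. The names MIN_MACRO, min_int, next_value and demo are made up for illustration and don't come from the article. It assumes a C99 compiler, and it only shows the semantic difference (how many times an argument gets evaluated), not any speed difference:

```c
#include <assert.h>

/* Macro version: the text is pasted at each use, so an argument with
   side effects can be evaluated more than once. */
#define MIN_MACRO(a, b) ((a) < (b) ? (a) : (b))

/* Function version: each argument is evaluated exactly once.
   `static inline` is only a hint; the compiler still decides. */
static inline int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Hypothetical helper with a side effect: counts its own calls. */
static int calls = 0;
static int next_value(void)
{
    return ++calls;
}

/* Compares the two approaches on an argument that has a side effect. */
static void demo(void)
{
    calls = 0;
    int m = MIN_MACRO(next_value(), 10); /* next_value() runs twice */
    assert(m == 2 && calls == 2);

    calls = 0;
    int f = min_int(next_value(), 10);   /* next_value() runs once */
    assert(f == 1 && calls == 1);
}
```

With optimisations on, a compiler will typically inline something as small as min_int on its own, but the only way to be sure for a given build is to look at the generated assembly.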

~~~
colejohnson66
There's something funny about a compiler being able to ignore something called
"force inline"

~~~
gruez
Sometimes it's not possible to inline a function at all, for example when the
call is recursive.

------
userbinator
If you have any experience with writing Asm you'll see how rigid calling
conventions are almost entirely an artificial construct of HLLs (and possibly
an artifact of early compilers), and it's possible to do so much more and so
much better --- which is partly the reason why Asm can be so fun. ;-)

When calling functions written in Asm from Asm, you get to decide exactly how
to do it: Pass arguments in registers in any order, on the stack, a
combination of both, _directly following the call instruction itself_ [1],
etc.; the limit is practically your imagination. You can choose the best way
to pass arguments for each function instead of being forced into one
suboptimal one for every function. Ditto for return values --- you can easily
return multiple values, in different registers, and also make use of HLL-
inaccessible "registers" like the flags (carry bit in particular is quite
useful).

I think the PC BIOS / DOS API is a pretty nice calling convention, clearly
designed for and by Asm programmers; all arguments are passed in registers and
CF is used to indicate success/error. These compiler-imposed calling
conventions like cdecl/stdcall/fastcall are just awfully inefficient in
comparison because of how much memory access they require, especially when
"fastcall" can only pass _two_ arguments in registers.

Incidentally, these 3 examples are also great at showing how compilers can be
so _very_ stupid at code generation. Observe that in all 3 cases, the return
value in eax after calling foo is written to memory --- then immediately read
from memory again, _into the exact same register_. This is not something that
should ever appear in human-written Asm, and I've actually made use of this
fact in marking a course assignment: asked to manually "compile" a short
function, some students cheated and used the compiler (with no optimisations,
i.e. the defaults), and it was dead easy to recognise.

It's funny to see the _entirely-register-based_ fastcall somehow still
managing to generate 5 totally useless memory accesses. If I really wanted to
write a fastcall min() function instead of just inlining it as I probably
should, it'd be 4 lines:

    
    
        mov eax, ecx
        cmp edx, ecx
        cmovl eax, edx
        ret
    

Likewise, cdecl and stdcall (only differing in one instruction):

    
    
        mov eax, [esp+4]
        cmp eax, [esp+8]
        cmovg eax, [esp+8]
        ret    ; ret 8 for stdcall
    

... and seeing WINAPI in GAS/AT&T syntax just feels very _very_ weird.

[1] Like this:

    
    
        call puts
        db "Hello world!", 0
        ; execution continues here
    

I believe it's not the fastest on modern CPUs, but it does save space and was
a very common technique on 8-bit CPUs like 8080/8085/Z80. It's also a good way
of confusing automatic disassemblers.

~~~
bjourne
You certainly can do it differently but I really doubt you can do it _better_.
:) The common wisdom learned from the "calling convention wars" (there are many
more than just cdecl, stdcall and fastcall) was that it just doesn't matter
all that much. The same amount of work has to be done and the only thing that
changes is if the caller or callee is the one doing it.

For example, if your convention mandates that the callee _must preserve_ RAX-
RDX, then it must push/pop those registers if it wants to use them. Which
leads to redundant push/pops if the registers aren't in use by the caller. But
if it is _free to clobber_ them, then the caller must push/pop them even if
the callee doesn't use them, leading to the exact same number of redundant
push/pops!

~~~
Narishma
By "doing it better", I don't believe parent is saying to create another
"better" calling convention, but instead to use no convention at all.

~~~
userbinator
Exactly. What I see from the "calling convention wars" is not that "it just
doesn't matter all that much", it's that there is no single optimal convention
in all cases. Some functions need to use more registers than others; some
arguments are used very early in the function and their values are not needed
after that (prefer these in a register), while others may be used later after
a bunch of computation that needs many registers (these might be better
staying on the stack.) Some instructions like multiply/divide require certain
registers (does your function start with a multiply or divide and is one of
the arguments the multiplicand/dividend? Use AX, EAX, or EDX:EAX for that
one.)

The short examples in the article are illustrative of "used early and not
needed afterwards" --- in cdecl/stdcall the caller writes the arguments into
memory, only to have the callee immediately read them back again. Ignoring the
extra memory accesses, even fastcall isn't optimal in this case --- it uses
ECX and EDX when what's really needed is for one of the arguments to be in EAX
since it may become the return value. In my "optimised" fastcall above, you
can see I had to spend an extra mov instruction just to get the return value
in the right place. It would be two instructions (cmp eax, ecx | cmovge eax,
ecx) otherwise. All this useless data movement, for what? Just to conform to
some arbitrary convention. These may be small things, but they can add up.

------
GoToRO
So why does it allocate extra space? "A value of 16 is subtracted from esp."
What's the purpose of that?

~~~
jmgao
There are some SSE instructions that fault if they're used with memory operands
that aren't 16-byte aligned. All of the ones I can think of have a version that
supports unaligned access at a performance penalty, so it's basically just a
choice by the ABI to require the stack to be 16-byte aligned at function call
boundaries, so that functions don't have to verify that their stack frame has
the proper alignment.
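
As a rough illustration of where a number like 16 can come from: if a function needs only a few bytes of locals but the frame size is rounded up to the alignment, you end up subtracting a full 16. The helper below (align_up is a hypothetical name, not taken from any compiler or ABI document) is just the usual round-up-to-a-power-of-two arithmetic, assuming that is roughly what the compiler does when sizing the frame:

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `align`.
   `align` must be a non-zero power of two (16 for the case above). */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

So a frame that needs 4 bytes would get align_up(4, 16) == 16 bytes, and one that needs 17 bytes would get 32. The exact amount a real compiler reserves also depends on what it pushes before and after, so treat this as a sketch of the idea rather than a description of any particular compiler.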

------
ninjabeans
How did he create that diff graphic?

~~~
Karliss
From the colors it looks like he simply ran Meld on text files containing the
assembly.

