
> How do things know where to look?

The compiler generates code that uses the correct register. So the compiler picks a register into which it will put the result and then generates code after the calling location that gets the result from the correct register.

And yeah, there are quite a few surprises. E.g. I found out that gcc is smart enough to perform tail call optimization: https://godbolt.org/g/MZDmwP




OK, so it's not a magical convention - it works it out bottom-up. First decide on registers for each parameter when you generate the code for the function, then based on that generate specific code for the instances where you call that function. Cool, thank you.

Also that example, heh. I tried going back through gcc versions to see if there was one where it didn't do TCO - nope. Also, I like how returning 0 is "xor eax, eax".


It is a convention, it's called a procedure call standard. Compilers which conform to the PCS can call functions compiled by other compilers (that's how you can use libraries for example). If there's a bug in the compiler that results in non-PCS compliance, well that's a "fun" bug to track down.


I think "procedure call standard" is an ARM term. Linux on commodity hardware uses the "System V AMD64 ABI calling convention" (https://en.wikipedia.org/wiki/X86_calling_conventions#System...).


>It is a convention, it's called a procedure call standard.

Is this the same thing as, or related to, the calling conventions that used to be used in Microsoft DOS and Windows native-language / C apps some years ago - things like "long far pascal" as function declaration/definition qualifiers, and the fact that C pushes function arguments onto the stack (before the call) from right to left, while Pascal does it from left to right? (Did some of that stuff, but it's been a while.)

I did read the surrounding comments to this one, and saw that some of the topic was about registers, not the stack.


Yes, "calling convention" is, I think, a more commonly used term for the same thing.


Suppose you're right and the registers are arbitrary. Then how would foreign function calls work? If you're compiling Rust code that calls into a C library, how does it know what registers to use?

So the choice of registers cannot be arbitrary, unless the compiler knows the function is only used within an object file.

The registers are predetermined by a convention unless you use the 'static' keyword to signal that the function is only used internally to a module, so the compiler has complete freedom to choose registers.


> Then how would foreign function calls work? If you're compiling Rust code that calls into a C library, how does it know what registers to use?

By using information kept with the function, or perhaps even encoded into the function name itself (as already happens when distinguishing between different calling conventions, or in the case of C++ name mangling)?

Coming from an Asm background, where there basically is no one "calling convention", and programmers would document which registers (almost always registers, rarely the stack --- and that can make for some great efficiency gains) are for what, I've always wondered why that idea didn't seem to go far.


> By using information kept with the function

How would you do that with dynamically linked code, inspect functions you're calling at runtime before laying out your arguments?

> perhaps even encoded into the function name itself

That would mean name mangling in C and assembly.

> Coming from an Asm background, where there basically is no one "calling convention"

Right, because you can lay out memory however you want since you're at the assembly level. Higher-level code (C on up) can't do that, so instead you've got standard calling conventions for inter-library calls (inside a compilation unit, the compiler is free to use alternate calling conventions since it has complete control over both sides of the call site; that's also how it can inline calls entirely).

> programmers would document which registers (almost always registers, rarely the stack --- and that can make for some great efficiency gains)

Some standard CCs (though not the old cdecl) also use registers, as far as they can, depending on the arch. The System V AMD64 ABI uses 6 x86_64 general-purpose registers for integer/pointer arguments and 8 SSE registers for FP arguments, with the rest going on the stack.


Win32 has various different calling conventions, and each function is annotated accordingly in the header files. It's all a bit of a mess, which is presumably why they drastically simplified it in the x64 transition.

(And realistically, for all but the most trivial functions, having one convention is probably a highly reasonable default. Trivial functions should probably be made available to the compiler for inlining anyway. Note also that on 32-bit x86, GCC lets you move some arguments from the stack into registers via an annotation (regparm), if you insist.)


> If you're compiling Rust code that calls into a C library, how does it know what registers to use?

Ah, interesting, I figured that the generated object files would just store some metadata on that basically.


That can work for statically-linked object files, but what about dynamically linked? You can load one with a function call, get back a function pointer, and invoke it like any other function. Trying to use some metadata would slow down the function call significantly, even if you tried to cache it somewhere.


That is the role of the operating system ABI, which defines the calling conventions shared by the programming languages on the OS.


> Also, I like how returning 0 is "xor eax, eax".

Why is it so different with different optimisation levels? The default emits quite a bit of code, -O1 is `mov eax, 0`


I used to be a native asm programmer in Z80 and 680x0, and one reason for using XOR rather than MOV is to do with condition codes: the XOR operation will most likely update the condition codes (notably, the Zero flag), whereas MOV will probably not.

Often you would not want the flags updated when simply clearing a register - you're hardly likely to test the Zero flag having just set something to zero, because it's obvious, and more importantly you may want to set something to zero while preserving the flags from a previous operation.

But often you don't care about the flags, so you can use the slightly shorter and/or faster XOR operation. It used to be generally shorter and faster because MOV had to encode the zero as an immediate operand, which made for a longer instruction to fetch and decode.

And that's why it changes with different optimisation levels - the compiler knows when the flags need to be preserved, and when they don't it can get away with using XOR.


It's been a while since I programmed low level, but I think on the 68k series they started to introduce caches and multi-stage instruction pipelines. By alternating instructions working on different things you could get a decent performance gain; if every instruction had to wait for the result of the previous one to complete, the pipeline wouldn't be running at its best. With careful planning you could insert 'free' instructions, but you would have to watch how the flags were altered. We used to spend quite a bit of time optimising code to this level, eking every bit of performance out of the hardware. Great fun.


Sure, things have moved on a lot since those days. I think in modern RISC architectures you can even specify whether the instruction should set the condition flags.


> > Also, I like how returning 0 is "xor eax, eax".

> -O1 is `mov eax, 0`

Simply because it is shorter: on x86-64 (and x86-32),

  xor eax,eax    ; 31 C0 (or 33 C0) - 2 bytes
  mov eax,0x0    ; B8 00 00 00 00   - 5 bytes

(which of the two xor encodings you get depends on the assembler; typically the first one is used).

Having privately analyzed some 256b demos, I cannot even imagine how one would come up with the idea of using `mov r32, imm32` to zero a register (except that people don't want to understand how the assembly code is internally encoded) - the canonical way is `xor` (`sub reg, reg` also works in principle, but `xor` is the idiom recommended by Intel).

EDIT: Here is an article about that topic: https://randomascii.wordpress.com/2012/12/29/the-surprising-...


It's not just shorter, it's also faster. But see my answer also: there are condition flag implications of using XOR and sometimes MOV will be preferable. The optimiser will always know best :)


> there are condition flag implications of using XOR and sometimes MOV will be preferable

If the condition flags have to be preserved, you are right. But otherwise, read the linked article (https://randomascii.wordpress.com/2012/12/29/the-surprising-...):

"On Sandybridge this gets even better. The register renamer detects certain instructions (xor reg, reg and sub reg, reg and various others) that always zero a register. In addition to realizing that these instructions do not really have data dependencies, the register renamer also knows how to execute these instructions – it can zero the registers itself. It doesn’t even bother sending the instructions to the execution engine, meaning that these instructions use zero execution resources, and have zero latency! See section 2.1.3.1 of Intel’s optimization manual where it talks about dependency breaking idioms. It turns out that the only thing faster than executing an instruction is not executing it."


It's fascinating how far down the rabbit hole goes these days. One might think machine code as emitted by compilers would be pretty close to where the buck stops, but no. Named registers are just an abstraction on top of a larger register pool, opcodes get JIT compiled and optimized to microcode instructions, execution order is mostly just a hint for the processor to ignore if it can get things done faster by reordering or parallelizing... And memory access is probably the greatest illusion of all.


What I also find rather interesting is the concept of macro-op fusion that Intel introduced with the Core 2 processors: it means, for example, that a cmp ... (or test ...) followed by a conditional jump can/will be fused into a single micro-op. In other words, a sequence of two instructions suddenly maps to one internal micro-op. If you are interested in the details, read section 8.5 in http://www.agner.org/optimize/microarchitecture.pdf


The lower optimization levels are supposed to be more straightforward translations of the high-level language code. You can imagine this might be useful if you are debugging at the assembly level.


On the other hand, I find O0 is significantly worse than what even a novice human Asm programmer would do if asked to manually compile code, and O1 would be around the same as a novice human.


Yes, I used to find that too. It's because, pre-optimization, on older architectures, the compiler outputs chunks of asm as if from a recipe book: loads of unnecessary memory accesses, pointless moving of data between registers, etc.

A proficient human coder, on the other hand, writes assembler that is partly optimized by default.

But few humans could write code like a seriously optimizing compiler, esp. on modern pipelined architectures - that stuff is unintelligible. Which is as it should be, because modern processors are not designed to be programmed directly by humans.



