The Windows x64 calling convention has been to use `RAX` for the return value and `RCX`, `RDX`, `R8` and `R9` for the first four arguments. Only if you have more than 4 arguments will you use the stack by default.
That has been the case for a long time. For that matter, the `__fastcall` calling convention on x86 was the same (for two args), with `ECX` and `EDX` for args and `EAX` for the return value.
Isn't this just fixing a Go specific limitation?
Richard Jones [not me! a famous GC researcher with the same name] maintains an excellent list of resources which the golang authors might like to read: https://www.cs.kent.ac.uk/people/staff/rej/gcbib/gcbibA.html
I find it funny to talk to much younger developers: they get excited about some "new" technology and are surprised I do not feel the same. Probably they think I am old and grumpy and can't see the value in it.
The truth is that a lot of it is just the same concepts, verbatim or rehashed, that have already been known under a lot of guises for a long time. Sometimes they are discarded by a new generation only to be rediscovered after a couple of years, when their software starts to suck for no obvious reason.
I suggest that any person trying to do hardcore performance optimization today should invest some time learning the tricks of developers who wrote software in the 80s. There is a wealth of useful knowledge hiding in plain sight.
I think I read somewhere that, at some point, it was all the rage to have self-modifying code that could patch conditional branch instructions into unconditional ones (I think it was used on game consoles to save space). These days profile-guided JIT compilers (or even AOT compilers) will de-virtualize calls to increase performance, which to me looks like a really similar technique.
I've always been interested in the subject, but articles that cover it competently are usually few and far between.
But I think it should still be possible to find old source code for things like games. Back when I was learning that stuff it was very difficult to get access to real game code, but today there are a lot of titles that have released their source code for everybody to pore over.
int64_t func(int64_t a, double b, int64_t c, double d, int64_t e, double f, int64_t g, double h)
For the same function with the System V x86_64 calling convention (e.g., Linux, BSD, macOS), you can put everything in registers rdi, rsi, rdx, rcx, r8, r9 and xmm0-7, with some to spare. rdi, rsi, rdx, rcx are used for a, c, e, g. xmm0, xmm1, xmm2, xmm3 are used for b, d, f, h. r8, r9 and xmm4-7 are still unused, but are available if more parameters are added to the function.
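To make the assignment above concrete, here is a minimal Go sketch that hands out System V argument registers for a list of integer and floating-point parameters. It is a simplification that only models scalar `int64_t`/`double` arguments (no structs, no varargs), and the names `Kind` and `assignSysV` are made up for illustration.

```go
package main

import "fmt"

// Kind is a hypothetical classification: integer-like or floating-point.
type Kind int

const (
	Int Kind = iota
	Float
)

// Argument registers in the order System V x86_64 hands them out.
var sysvGPR = []string{"rdi", "rsi", "rdx", "rcx", "r8", "r9"}
var sysvXMM = []string{"xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"}

// assignSysV returns, for each parameter, the register it would travel in,
// or "stack" once the matching register class is exhausted.
// The two register classes are counted independently of each other.
func assignSysV(params []Kind) []string {
	out := make([]string, len(params))
	nextGPR, nextXMM := 0, 0
	for i, k := range params {
		switch {
		case k == Int && nextGPR < len(sysvGPR):
			out[i] = sysvGPR[nextGPR]
			nextGPR++
		case k == Float && nextXMM < len(sysvXMM):
			out[i] = sysvXMM[nextXMM]
			nextXMM++
		default:
			out[i] = "stack"
		}
	}
	return out
}

func main() {
	// func(int64_t a, double b, int64_t c, double d, int64_t e, double f, int64_t g, double h)
	params := []Kind{Int, Float, Int, Float, Int, Float, Int, Float}
	fmt.Println(assignSysV(params)) // [rdi xmm0 rsi xmm1 rdx xmm2 rcx xmm3]
}
```

Because integers and floats draw from separate pools, all eight parameters fit in registers with r8, r9 and xmm4-7 left over.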
I haven't read the literature on this BTW, so I'm not terribly confident, but I'm guessing that on System V there are still enough scratch registers available for "general" code. High performance code that actually uses all the lanes of its 128-bit xmm registers should be fine on System V too, because you're doubling up or quadrupling up with those lanes, and often accumulating into a small subset of the argument registers anyway. So you just save yourself function call overhead.
Getting further off-topic, but if anyone is interested in how to apply "high performance" thinking to the common, mundane problems we have to solve daily, I highly recommend https://media.handmade-seattle.com/context-is-everything/. In particular, it's a window into practical SIMD, but it also makes some worthwhile "software philosophy" points as well.
The parent is correct, just incomplete :)
Those machines had an address space consisting of 256k 36-bit words. The registers shared that address space with core memory. Addresses 0-15 of the 18-bit address space referred to the registers, and addresses above that referred to core. The core memory had storage for words 0-15 as well, but it was never used because the registers overlaid it.
In fact, if you wanted to save a little money you could order your PDP-6 or your KA-10 PDP-10 without registers, and then register accesses would go to words 0-15 of core.
This worked both ways. If you put an address in 0-15 in an 18-bit memory address field it would access registers. In particular if you loaded an address in 0-15 into the program counter it would execute code out of the registers.
MACLISP and others would copy some short time-critical loops to the registers and run them there for speed.
I (not being aware of this technique) thought it was a very nice article. I learned quite a few things, and it was well written.
It's also possible that xchg %ax,%ax has been decoded from a multi-byte nop (to align memory) before the function call.
Typically, you use xchg %ax,%ax because you can replace it with a jump if you want.
Now if the author can provide some more details, we'll see if the call is aligned on some interesting boundary.
I always wonder how much aligning code helps in real code. It seems to waste a lot of bytes in the code segment. Aligning data, yes, that changes a lot, but the return position of a call in code?
Is a good post in regards to code alignment effects.
This ABI is unstable and will change between Go versions.
In that world, the calling convention is not considered public, so modifying it does not break any compatibility.
Is there an option to remove that runtime environment if you won’t be using coroutines and need a smaller disk footprint?
The 16% reduction in runtime only occurs because the runtime of the program is completely dominated by the call-overhead of the function (which would usually be prevented by inlining).
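A rough way to see how much of such a benchmark is pure call overhead is to compare a tiny function that the compiler is forbidden to inline against an identical one it is free to inline. The sketch below uses the real `//go:noinline` directive; the function names and the iteration count are arbitrary choices for illustration, and the timings will vary with Go version and hardware, so treat it as an experiment rather than a measurement.

```go
package main

import (
	"fmt"
	"time"
)

// addCall is kept out of line, so every iteration pays for a real call.
//
//go:noinline
func addCall(a, b int) int { return a + b }

// addInlinable is small enough that the compiler will normally inline it.
func addInlinable(a, b int) int { return a + b }

// sumWithCalls accumulates 0..n-1 through the non-inlined function.
func sumWithCalls(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total = addCall(total, i)
	}
	return total
}

// sumInlined does the same work through the inlinable function.
func sumInlined(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total = addInlinable(total, i)
	}
	return total
}

func main() {
	const n = 50_000_000

	start := time.Now()
	r1 := sumWithCalls(n)
	withCalls := time.Since(start)

	start = time.Now()
	r2 := sumInlined(n)
	inlined := time.Since(start)

	fmt.Println("results equal:", r1 == r2)
	fmt.Println("with calls:   ", withCalls)
	fmt.Println("inlinable:    ", inlined)
}
```

The body of the loop does almost nothing, so whatever gap shows up between the two timings is dominated by the call itself, which is exactly the situation where a cheaper calling convention looks best.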
Polymorphism, on the other hand, makes inlining harder, so it could be relevant for programs with interfaces that have one-liner implementations. Though Go could even overcome that, as all implementations of an interface are "seen" by the compiler.
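As an illustration of the interface case, here is a small Go sketch of a one-liner implementation called through an interface, next to a hand-written "devirtualized" path that guards on the concrete type. The types `Scaler`, `Doubler` and `Tripler` are invented for the example, and whether the Go compiler performs this transformation on its own is not something this sketch demonstrates.

```go
package main

import "fmt"

// Scaler is a hypothetical interface with one-liner implementations.
type Scaler interface {
	Scale(x int) int
}

type Doubler struct{}

func (Doubler) Scale(x int) int { return 2 * x }

type Tripler struct{}

func (Tripler) Scale(x int) int { return 3 * x }

// sumDynamic calls Scale through the interface, so each call is an
// indirect call that the compiler generally cannot inline.
func sumDynamic(s Scaler, xs []int) int {
	total := 0
	for _, x := range xs {
		total += s.Scale(x)
	}
	return total
}

// sumDevirtualized checks for one concrete type up front. Inside the
// guarded branch the call target is known statically, so the one-liner
// becomes a candidate for inlining. Other types take the dynamic path.
func sumDevirtualized(s Scaler, xs []int) int {
	if d, ok := s.(Doubler); ok {
		total := 0
		for _, x := range xs {
			total += d.Scale(x)
		}
		return total
	}
	return sumDynamic(s, xs)
}

func main() {
	xs := []int{1, 2, 3}
	fmt.Println(sumDynamic(Doubler{}, xs))       // 12
	fmt.Println(sumDevirtualized(Doubler{}, xs)) // 12
	fmt.Println(sumDevirtualized(Tripler{}, xs)) // 18
}
```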
In any case, this feature could be used to reduce the aggressiveness of inlining moderately small functions, giving us (marginally) smaller binary sizes and (maybe?) reduced compile times...
And because AMD64 doubles the number of GPRs and requires SSE (as well as doubles the number of SSER), all AMD64 calling conventions are register-based (the official AMD64 ABI uses 6 GPRs and 8 SSER, MS has a much more limited 4 GPR/SSER mix).
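To show what the "4 GPR/SSER mix" means in practice, here is a minimal Go sketch of the Microsoft x64 scheme, where each of the first four parameters owns one positional slot and the slot's register is picked by the parameter's type. As with any toy model, it only covers scalar integer and double arguments, and the names `ParamKind` and `assignMS` are hypothetical.

```go
package main

import "fmt"

// ParamKind is a hypothetical classification: integer-like or floating-point.
type ParamKind int

const (
	IntParam ParamKind = iota
	FloatParam
)

// Slot i can be filled from either table, depending on the argument type.
var msGPR = [4]string{"rcx", "rdx", "r8", "r9"}
var msXMM = [4]string{"xmm0", "xmm1", "xmm2", "xmm3"}

// assignMS returns the location of each parameter under the Microsoft
// x64 convention: position decides the slot, type decides which register
// of that slot is used, and everything past the fourth goes to the stack.
func assignMS(params []ParamKind) []string {
	out := make([]string, len(params))
	for i, k := range params {
		switch {
		case i >= 4:
			out[i] = "stack"
		case k == FloatParam:
			out[i] = msXMM[i]
		default:
			out[i] = msGPR[i]
		}
	}
	return out
}

func main() {
	// The eight-parameter signature quoted earlier in the thread:
	// (int64_t a, double b, int64_t c, double d, int64_t e, double f, int64_t g, double h)
	params := []ParamKind{
		IntParam, FloatParam, IntParam, FloatParam,
		IntParam, FloatParam, IntParam, FloatParam,
	}
	fmt.Println(assignMS(params)) // [rcx xmm1 r8 xmm3 stack stack stack stack]
}
```

Only four of the eight parameters travel in registers here, because a used slot consumes both its GPR and its XMM register.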