Faster software through register based calling (menno.io)
87 points by menn0 on Nov 25, 2021 | 47 comments



I'm sorry, but why is this new?

x64 calling convention has been to use `RAX` for the return value and `RCX, RDX, R8 & R9` for the first four arguments. Only if you have more than 4 arguments will you use the stack by default.

That has been the case for a long time. For that matter, the `__fastcall` calling convention on x86 was the same (for two args), with `ECX, EDX` for args and `EAX` for the return value.

Isn't this just fixing a Go specific limitation?


> Since its initial release, Go has used a stack-based calling convention based on the Plan 9 ABI, in which arguments and result values are passed via memory on the stack. This has significant simplicity benefits: the rules of the calling convention are simple and build on existing struct layout rules; all platforms can use essentially the same conventions, leading to shared, portable compiler and runtime code; and call frames have an obvious first-class representation, which simplifies the implementation of the go and defer statements and reflection calls. Furthermore, the current Go ABI has no callee-save registers, meaning that no register contents live across a function call (any live state in a function must be flushed to the stack before a call). This simplifies stack tracing for garbage collection and stack growth and stack unwinding during panic recovery.

https://go.googlesource.com/proposal/+/refs/changes/78/24817...


Lots of languages use registers for call and return and have aggressive garbage collection. The techniques for handling this efficiently are well known.

Richard Jones [not me! a famous GC researcher with the same name] maintains an excellent list of resources which the golang authors might like to read: https://www.cs.kent.ac.uk/people/staff/rej/gcbib/gcbibA.html


All true, but Go had previously chosen not to use all that knowledge in favor of simplicity in its calling convention and therefore implementation. But like many systems that did so, real world performance implications end up justifying additional complexity.


“Those who cannot remember the past are condemned to repeat it.”

I find it funny to talk to much younger developers: they get excited about some "new" technology and are surprised I do not feel the same. Probably they think I am old and grumpy and can't see the value in it.

The truth is that a lot of it is just the same concepts, verbatim or rehashed, that have already been known under a lot of guises for a long time. Sometimes they are discarded by a new generation, only to be rediscovered after a couple of years when their software starts to suck for no obvious reason.

I suggest that any person trying to do hardcore performance optimization today should invest some time learning the tricks of developers who wrote software in the 80s. There is a wealth of useful knowledge hiding in plain sight.


Although I can't claim a long tenure in the software development field, one of my favorite rediscovered performance tricks can be found in the similarity between very old-school self-modifying code and the modern-day profile-guided JIT optimizer.

I think I read somewhere that, at some point, it was all the rage to have self-modifying code that could patch conditional branch instructions into unconditional ones (I think it was used on game consoles to save space). These days profile-guided JIT compilers (or even AOT compilers) will devirtualize calls to increase performance, which to me looks like a really similar technique.
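To make the devirtualization idea concrete, here is a minimal hand-written sketch in Go of the kind of guarded direct call a profile-guided optimizer might produce. The types `Shape`, `Square` and `Rect` and both functions are made up for illustration, and the assumption that `Square` is the "hot" type stands in for what a real profile would tell the compiler.

```go
package main

import "fmt"

// Shape, Square and Rect are hypothetical types used only for this sketch.
type Shape interface{ Area() float64 }

type Square struct{ Side float64 }

func (s Square) Area() float64 { return s.Side * s.Side }

type Rect struct{ W, H float64 }

func (r Rect) Area() float64 { return r.W * r.H }

// TotalDynamic calls Area through the interface for every element.
func TotalDynamic(shapes []Shape) float64 {
	total := 0.0
	for _, s := range shapes {
		total += s.Area()
	}
	return total
}

// TotalGuarded writes out by hand what a devirtualizing optimizer might do
// if a profile said Square dominates: test for the hot type, take a direct
// path for it, and fall back to dynamic dispatch otherwise.
func TotalGuarded(shapes []Shape) float64 {
	total := 0.0
	for _, s := range shapes {
		if sq, ok := s.(Square); ok {
			total += sq.Side * sq.Side // direct path, no interface call
		} else {
			total += s.Area() // fallback: dynamic dispatch
		}
	}
	return total
}

func main() {
	shapes := []Shape{Square{2}, Rect{2, 3}, Square{3}}
	fmt.Println(TotalDynamic(shapes), TotalGuarded(shapes))
}
```

Both functions compute the same total; the only difference is that the guarded version replaces an indirect call with a cheap type test on the common path, much as the patched branch did in the self-modifying code.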


The value of JIT is that you do not need to pay any attention to it and it takes no effort to set up. It is still good to know it is there -- a lot of developers I work with did not understand why a Java application is slow for a bit before it becomes faster. They just take it as a fact of life. (Of course JIT is just a part of this, you also have classes loaded, various mechanisms implementing lazy initialization getting initialized the first time it is used, etc.)


Do you have any good links for starting points?

I've always been interested in the subject, but articles that cover it competently are usually few and far between.


I don't. I either learned these a long time ago or picked them up as a side effect of paying attention to this topic for almost a quarter of a century.

But I think it should still be possible to find old source code for things like games. Back when I was learning that stuff it was very difficult to get access to real game code, but today there are a lot of titles that have released their source code for everybody to pore over.


That's for Windows x64 (cdecl specifically), which has pretty much the worst calling conventions. You can use rcx/xmm0, rdx/xmm1, r8/xmm2, r9/xmm3, so for a C function signature

    int64_t func(int64_t a, double b, int64_t c, double d, int64_t e, double f, int64_t g, double h)

Even though 4 general purpose registers and 4 floating point registers are available, you end up using 2 general purpose registers (rcx and r8 for a and c) and 2 floating point registers (xmm1 and xmm3 for b and d). rdx and r9, and xmm0 and xmm2 are unused, and the last 4 arguments must use the stack -- note that clustering the ints or doubles doesn't change that.

For the same function with the System V x86_64 calling convention, (e.g., Linux, BSD, macOS), you can put everything in registers rdi, rsi, rdx, rcx, r8, r9, xmm0-7, with some to spare. rdi, rsi, rdx, rcx are used for a, c, e, g. xmm0, xmm1, xmm2, xmm3 are used for b, d, f, h. xmm4-7 are still unused, but are available if more parameters are added to the function.
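As a rough illustration of the two assignment rules described above, here is a small Go model (not a real ABI implementation). It only handles plain integer and floating-point scalars; the names `Kind`, `AssignWindows` and `AssignSysV` are made up for this sketch, and structs, varargs and shadow space are ignored.

```go
package main

import "fmt"

// Kind is a simplified parameter class: integer/pointer or floating point.
type Kind int

const (
	Int Kind = iota
	Float
)

var (
	winInt    = []string{"rcx", "rdx", "r8", "r9"}
	winFloat  = []string{"xmm0", "xmm1", "xmm2", "xmm3"}
	sysvInt   = []string{"rdi", "rsi", "rdx", "rcx", "r8", "r9"}
	sysvFloat = []string{"xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"}
)

// AssignWindows models the Windows x64 rule: the parameter's position picks
// the slot, and each of the first four slots uses either its integer or its
// floating-point register. Everything after slot four goes to the stack.
func AssignWindows(params []Kind) []string {
	out := make([]string, len(params))
	for i, k := range params {
		switch {
		case i >= len(winInt):
			out[i] = "stack"
		case k == Int:
			out[i] = winInt[i]
		default:
			out[i] = winFloat[i]
		}
	}
	return out
}

// AssignSysV models the System V rule: integer and floating-point parameters
// draw from two independent register sequences.
func AssignSysV(params []Kind) []string {
	out := make([]string, len(params))
	ni, nf := 0, 0
	for i, k := range params {
		switch {
		case k == Int && ni < len(sysvInt):
			out[i] = sysvInt[ni]
			ni++
		case k == Float && nf < len(sysvFloat):
			out[i] = sysvFloat[nf]
			nf++
		default:
			out[i] = "stack"
		}
	}
	return out
}

func main() {
	// Models func(int64 a, double b, int64 c, double d, int64 e, double f, int64 g, double h).
	sig := []Kind{Int, Float, Int, Float, Int, Float, Int, Float}
	fmt.Println("Windows x64:", AssignWindows(sig))
	fmt.Println("System V:   ", AssignSysV(sig))
}
```

For the eight-parameter signature, the Windows model puts a through d in rcx, xmm1, r8, xmm3 and the rest on the stack, while the System V model keeps all eight in registers.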

I haven't read the literature on this BTW, so I'm not terribly confident, but I'm guessing that on System V there are still enough scratch registers available for "general" code. High performance code where you're actually using all the lanes of your 128 bit xmm registers and so on, they should be fine on System V because you're doubling up or quadrupling up with those lanes, and often accumulating into a small subset of the argument registers anyway. So you just save yourself function call overhead.

Getting further off-topic, but if anyone is interested in how to apply "high performance" thinking to the common, mundane problems we have to solve daily, I highly recommend https://media.handmade-seattle.com/context-is-everything/. In particular, it's a window into practical SIMD, but it also makes some worthwhile "software philosophy" points as well.


The calling convention described in the parent is the Windows 64-bit calling convention. In the 64-bit calling convention used on all other platforms, the first six arguments are passed in registers (RDI, RSI, RDX, RCX, R8, R9).

The parent is correct, just incomplete :)


Actually this was new in ... Watcom C++ for DOS? Or even earlier.


It was the calling convention of MACLISP in the 60s and nobody seemed to describe it as exotic.


MACLISP had another register optimization that was possible on the PDP-6 and the earlier models of the PDP-10 (the ones with the KA-10 CPU).

Those machines had an address space consisting of 256k 36-bit words. The registers shared that address space with core memory. Addresses 0-15 of the 18-bit address space referred to the registers, and addresses above that referred to core. The core memory had storage for words 0-15, but it was never used because the registers overlaid it.

In fact, if you wanted to save a little money you could order your PDP-6 or your KA-10 PDP-10 without registers, and then register accesses would go to words 0-15 of core.

This worked both ways. If you put an address in 0-15 in an 18-bit memory address field it would access registers. In particular if you loaded an address in 0-15 into the program counter it would execute code out of the registers.

MACLISP and others would copy some short time-critical loops to the registers and run them there for speed.
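The overlay described above can be sketched as a toy model in Go. This is only an illustration of the addressing rule (addresses 0-15 hit the registers, everything else hits core); the `Machine` type and its methods are hypothetical, and nothing here models real PDP-10 instruction formats or timing.

```go
package main

import "fmt"

const numRegs = 16

// Machine is a toy model: 16 fast registers overlay the first 16 addresses.
type Machine struct {
	regs [numRegs]uint64 // visible at addresses 0-15
	core []uint64        // core memory; words 0-15 are shadowed by regs
}

// NewMachine creates a machine with the given number of core words.
func NewMachine(words int) *Machine {
	return &Machine{core: make([]uint64, words)}
}

// Load reads a word, routing low addresses to the registers.
func (m *Machine) Load(addr int) uint64 {
	if addr < numRegs {
		return m.regs[addr]
	}
	return m.core[addr]
}

// Store writes a word, routing low addresses to the registers.
func (m *Machine) Store(addr int, v uint64) {
	if addr < numRegs {
		m.regs[addr] = v
		return
	}
	m.core[addr] = v
}

func main() {
	m := NewMachine(64)
	m.Store(3, 7)  // lands in register 3, not in core word 3
	m.Store(20, 9) // lands in core
	fmt.Println(m.Load(3), m.core[3], m.Load(20))
}
```

Because instruction fetch used the same address path, a loop copied into addresses 0-15 would be fetched from the registers, which is the trick MACLISP exploited.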


I appreciate the comments pointing out that this is an old technique, and the criticism towards Go looks valid, but to me personally it is being overlooked that this is a very nice article.

I (not being aware of this technique), thought it was a very nice article, I learned quite a few things, it was well written.


I think it's pretty common in most bachelor's programs to learn the basics of this, hence the dismissive response? The article explains it pretty well, but I was curious how they handle structs and spilling for when parameters exceed available registers.


Seems like a lot of performance to forgo for the last 9+ years, since the technique has been commonplace since... computers.


Go's origin in Plan 9 tooling carried with it a certain preference for the portable and simple over the optimised. A good way to start a language, and they've been tuning it in various ways since.


Most x86 calling conventions used the stack. That's why you'd have "fastcall".


That's true if you interpret x86 narrowly to mean only 32-bit legacy environments. Unless I missed something big, the AMD64 System V ABI (x86_64) calling convention has been register based from its origin, likewise with the Windows equivalent.


Does that really matter? I have had 64-bit hardware and software since literally 2003. Running an x86 OS is closer to MS-DOS 5 than to the present day.


How many such low-hanging fruits are present in the Go compiler?


Probably a lot, as Go is trying to keep things simple. It would probably also benefit from a generational garbage collector.


The team thought about that obviously and the answer is probably not: https://go.dev/blog/ismmkeynote


Wait a second, how does go allocate memory? Is there no nursery?


The Go theory is that if you allocate on the stack most of the time, and use escape analysis to decide statically what can be freed without the GC, you end up not using the GC for most things that Java would put in the nursery in the first place.
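A minimal sketch of what that looks like in practice, assuming the usual behavior of the Go compiler's escape analysis (the `Point` type and both functions are made up for this example, and the actual decision is up to the compiler):

```go
package main

import "fmt"

// Point is a hypothetical type used only for this sketch.
type Point struct{ X, Y int }

// sumLocal's Point never leaves the function, so the compiler is free to
// keep it on the stack (or entirely in registers); no GC work is needed.
func sumLocal(x, y int) int {
	p := Point{x, y}
	return p.X + p.Y
}

// newPoint returns a pointer to p, so p outlives the call. When this
// function is not inlined, p is expected to be moved to the heap.
func newPoint(x, y int) *Point {
	p := Point{x, y}
	return &p
}

func main() {
	fmt.Println(sumLocal(1, 2))
	fmt.Println(newPoint(3, 4).Y)
}
```

Building with `go build -gcflags=-m` prints the compiler's escape and inlining decisions, so you can check what it actually chose for your version of Go.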


xchg %ax,%ax is the nop instruction. I would have expected to see <nop> written instead but the opcode (0x90) is shared by both.

It's also possible that xchg %ax,%ax has been decoded from a multi-byte nop (to align memory) before the function call.


Writing nop for xchg %ax,%ax loses information. The xchg %ax,%ax is two bytes. If you write "nop", I'd assume that it's the one-byte version.

Typically, you use xchg %ax,%ax because you can replace it with a jump if you want.


Note that it uses ax (16 bits) instead of rax (64 bits). I'd assume it has opcode 0x66 0x90, a 2 byte nop.

Now if the author can provide some more details, we'll see if the call is aligned on some interesting boundary.

I always wonder how much aligning code helps in real code. It seems to waste a lot of bytes in the code segment. Aligning data, yes, that changes a lot, but the return position of a call in code?
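For anyone following the encoding discussion, here is a small Go sketch that classifies a few well-known x86 nop encodings by their bytes and computes how much padding an alignment boundary would need. `DescribeNop` and `Padding` are made-up helpers, this is not a disassembler, and it deliberately ignores the longer prefixed multi-byte forms.

```go
package main

import "fmt"

// DescribeNop recognizes a few common x86 no-op encodings.
func DescribeNop(b []byte) string {
	switch {
	case len(b) == 1 && b[0] == 0x90:
		return "1-byte nop"
	case len(b) == 2 && b[0] == 0x66 && b[1] == 0x90:
		return "2-byte nop (xchg %ax,%ax)"
	case len(b) >= 3 && b[0] == 0x0F && b[1] == 0x1F:
		return "multi-byte nop (0F 1F /0)"
	default:
		return "not a recognized nop"
	}
}

// Padding returns how many filler bytes are needed to move addr up to the
// next multiple of align (0 if addr is already aligned).
func Padding(addr, align uint64) uint64 {
	return (align - addr%align) % align
}

func main() {
	fmt.Println(DescribeNop([]byte{0x90}))
	fmt.Println(DescribeNop([]byte{0x66, 0x90}))
	fmt.Println(Padding(0x1003, 16))
}
```

The 0x66 operand-size prefix is what turns the one-byte 0x90 into the two-byte form that disassemblers often print as xchg %ax,%ax.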


https://devblogs.microsoft.com/dotnet/loop-alignment-in-net-...

Is a good post in regards to code alignment effects.


That's an article worthy of its own HN post. Thanks


Versions don't seem to mean anything anymore. Changing the calling convention with a minor release? Even if this change doesn't affect foreign functions (I certainly hope so), it is all but guaranteed to break someone's code. What's so difficult nowadays with organizing a major release for such changes? Is it because of agile (TM)?


I don't think Go has ever specified that a specific calling convention must be used, and it doesn't expose it very easily to the end user. I would argue that if you are writing code that depends on assuming the calling convention when it is explicitly not specified, that is a faulty assumption on you.

https://go.googlesource.com/go/+/refs/heads/dev.regabi/src/c...

  This ABI is unstable and will change between Go versions.


Now you are blaming the victim. Surely there is a lot of FFI that will break.


FFI already goes through calling convention translation, since the Go and C runtimes aren't that compatible, even more so outside of Go's native environment (Plan 9).


Are you sure? This would only affect C code calling Go functions, right? Apart from that being very rare (the other way around is more common using CGo), from the register proposal it looks like they've discussed and had ways to address the CGo issue. I haven't tried this though.


I am not too sure really. Trying to understand it I realized I know very little about Go internals.


In golang, everything is recompiled from source into a single standalone binary. There is no concept of shareable precompiled library -- a library is to be reused from its source code.

In that world, the calling convention is not something considered public, so modifying it does not break any compatibility.


There used to be one, but they took it out.

https://github.com/golang/go/issues/28152


The other replies give the answers: the ABI has always been an implementation detail, and Go binaries are statically built from source, so this change won't matter. But just to add to that: I've been using Go for 4-5 years, and have never needed to know or care about its calling convention. It doesn't even affect people that write functions for Go in assembly ... because as the proposal says, "This will remain backwards compatible with existing assembly code that assumes Go’s current stack-based calling convention through Go’s multiple ABI mechanism." So the Go developers have put a fair bit of thought and effort into making this just work, even bending over backwards to make assembly libraries keep working.


I see way too many breaking changes in minor versions; the js ecosystem is rife with this. But it is pretty common and mostly why I try to stick with .NET and Java. It is just incredibly weird installing a minor update with a security fix AND a breaking feature in a minor version. So much needless stress...


The article mentioned that Go binaries have a fairly large minimum size due to the Go runtime environment.

Is there an option to remove that runtime environment if you won’t be using coroutines and need a smaller disk footprint?


In practice, because you only need the one binary, and not all of the usual dynamically loaded libraries, it can appear large and yet be small overall.


It illustrates that often as programmers we put code into functions purely for readability or for organizing stuff. How often have you written a function that is only called from one place (or in a loop in one place)? You could really just inline that function, but that makes your code a bit less readable. Recursive functions are different of course, and do need the stack (unless the recursion is a tail call?). What is a function anyway, right? They don't really "exist" in machine code. They are a higher-level language abstraction, and a compiler is free-ish to translate that into machine code however it sees fit (except when calling between precompiled objects).


While I do have a soft spot for these kinds of low-level optimizations, I wonder how relevant it will end up being in the real world.

The 16% reduction in runtime only occurs because the runtime of the program is completely dominated by the call-overhead of the function (which would usually be prevented by inlining). Polymorphism, on the other hand, makes inlining harder, so it could be relevant for programs with interfaces that have one-liner implementations. Though Go could even overcome that, as all implementations of an interface are "seen" by the compiler.

In any case, this feature could be used to reduce the aggressiveness of inlining moderately small functions, giving us (marginally) smaller binary sizes and (maybe?) reduced compile times...
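To show the kind of program where call overhead dominates, here is a minimal Go sketch with one helper that the compiler is told not to inline and one that it may inline. The function names are made up; `//go:noinline` is a real compiler directive. The sketch does not measure anything by itself, so treat any speed difference as something to benchmark on your own machine.

```go
package main

import "fmt"

// addCall is kept as a real function call by the noinline directive, so
// every iteration below pays the calling-convention cost.
//
//go:noinline
func addCall(a, b int) int { return a + b }

// addInline is small enough that the compiler will normally inline it,
// which removes the call entirely.
func addInline(a, b int) int { return a + b }

// SumCalls adds 0..n-1 through the non-inlined helper.
func SumCalls(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total = addCall(total, i)
	}
	return total
}

// SumInlined adds 0..n-1 through the inlinable helper.
func SumInlined(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total = addInline(total, i)
	}
	return total
}

func main() {
	fmt.Println(SumCalls(10), SumInlined(10))
}
```

With a loop body this small, almost all the time in SumCalls goes to argument passing and the call itself, which is the situation where a register-based convention helps most and where inlining would have hidden the difference anyway.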


It's important to note that pretty much all other software already passes the first N function args (and even small structs up to 16 bytes) through registers, simply because the standard ABI calling conventions define it that way. It's hard to say how big the performance regression would be if everything is passed through the stack (I'm very surprised that Go passes everything through the stack TBH).


Indeed the last time the standard calling convention was stack-based was 32b x86 environments (“cdecl”) due to the dearth of registers, and even then there were various register-based compiler-specific or internal “fastcall” conventions.

And because AMD64 doubles the number of GPRs and requires SSE (as well as doubles the number of SSER), all AMD64 calling conventions are register-based (the official AMD64 ABI uses 6 GPRs and 8 SSER, MS has a much more limited 4 GPR/SSER mix).



