%macro stackpush 0          ; save eight registers (8 x 8 = 64 bytes)
    push rdi
    push rsi
    push rdx
    push r10
    push r8
    push r9
    push rbx
    push rcx
%endmacro
%macro stackpop 0           ; restore in the reverse order of stackpush
    pop rcx
    pop rbx
    pop r9
    pop r8
    pop r10
    pop rdx
    pop rsi
    pop rdi
%endmacro
I guess that's one way of saving registers. Not particularly efficient, but it works…
You can thank AMD for mysteriously removing PUSHA/POPA from the opcode map in 64-bit mode (not even replaced by any new useful instructions, just made invalid).
I don't think they've given reasons for it, but it's likely because they just completely destroy any kind of out of order execution. They'd end up requiring a fence before either instruction to ensure that you only push or pop the correct values, and you couldn't speculate at all past them. Combine that with more than double the stack space needed (double the number of registers, and double the size of each) and it seems pretty wasteful. And then there's the fact that modern compilers are likely already avoiding those instructions because of the timing of saving extra registers that you aren't using in a given function, it probably just doesn't make much sense to keep them anymore.
> but it's likely because they just completely destroy any kind of out of order execution.
The same goes for the longer sequence of individual pushes or pops --- they all depend on the stack pointer. In fact, the single instruction only needs to adjust it once, by the total size of the registers pushed/popped (since it is a constant).
In other words, PUSHA/POPA already decode internally to a bunch of moves and one ALU op for the stack pointer, which can be scheduled OoO. All they needed to do for 64-bit mode was adjust the constant and emit more uops for the additional registers, but they didn't for some otherwise inexplicable reason. All the machinery to do it already existed.
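To make that argument a bit more concrete, here is a rough C model of it (not a description of any real microarchitecture). The `Machine` struct, the function names and the 64-word stack are all made up for illustration, and the model ignores the detail that real PUSHA stores the original stack pointer as one of its eight values. It only shows that eight stores at constant offsets plus a single stack pointer adjustment leave the same state as eight dependent pushes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NREGS 8

/* Hypothetical toy machine: a small stack of 32-bit words and a stack
 * pointer expressed as an index into it (the stack grows downward). */
typedef struct {
    uint32_t mem[64];
    size_t   sp;      /* index of the current top-of-stack word */
} Machine;

/* Eight separate pushes: every push updates sp, so each one depends on
 * the sp value produced by the previous one. */
static void push_one_by_one(Machine *m, const uint32_t regs[NREGS]) {
    for (int i = 0; i < NREGS; i++) {
        assert(m->sp > 0);
        m->sp -= 1;
        m->mem[m->sp] = regs[i];
    }
}

/* PUSHA-style: eight stores at constant offsets from the old sp, which do
 * not depend on each other, followed by one subtraction of the constant. */
static void push_all_at_once(Machine *m, const uint32_t regs[NREGS]) {
    assert(m->sp >= NREGS);
    size_t base = m->sp;
    for (int i = 0; i < NREGS; i++) {
        m->mem[base - 1 - (size_t)i] = regs[i];
    }
    m->sp = base - NREGS;
}
```

Starting both from the same empty stack and the same register values, the two functions should leave identical memory and an identical stack pointer; the only difference is how many times the stack pointer gets written.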
> And then there's the fact that modern compilers are likely already avoiding those instructions because of the timing of saving extra registers that you aren't using in a given function, it probably just doesn't make much sense to keep them anymore.
Compilers won't ever use PUSHA/POPA but lots of other code will --- BIOS, executable packers, OS state-saving code (the perfect example of what these instructions were for?), etc.
See the story of SAHF/LAHF for a similar and even more astoundingly bad decision.
ARMv7 (not AArch64) has the STM instruction that lets you push a register set, selected by bitmask, e.g.:
STMFD sp!, {r3-r7,lr}
(I believe the "FD" suffix is to do with the stack growing down - "full descending")
Of course this is just implemented with microcode so it's not really any more efficient than a series of PUSHes, except there's a bit less I-cache pressure.
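For anyone who hasn't met the bitmask form, here is a small C sketch of what a full-descending store/load multiple does to the stack. It is a toy model, not real ARM semantics in full: the `Arm` struct, `stmfd`, `ldmfd` and the word-indexed stack are names and simplifications invented for this example. The one architectural detail it does follow is that the lowest-numbered register in the set lands at the lowest address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of an ARM-style register file and stack. */
typedef struct {
    uint32_t r[16];     /* r0..r15; r14 is lr by convention */
    uint32_t mem[64];   /* word-addressed stack memory */
    size_t   sp;        /* word index of the top of stack (full descending) */
} Arm;

/* Count the registers selected by the mask. */
static unsigned count_bits(uint16_t mask) {
    unsigned n = 0;
    for (; mask; mask &= (uint16_t)(mask - 1)) n++;
    return n;
}

/* Models STMFD sp!, {mask}: move sp down once by the number of selected
 * registers, then store them in ascending register order. */
static void stmfd(Arm *a, uint16_t mask) {
    unsigned n = count_bits(mask);
    assert(a->sp >= n);
    a->sp -= n;
    size_t slot = a->sp;
    for (int i = 0; i < 16; i++)
        if (mask & (1u << i)) a->mem[slot++] = a->r[i];
}

/* Models LDMFD sp!, {mask}: reload the same set and move sp back up. */
static void ldmfd(Arm *a, uint16_t mask) {
    assert(a->sp + count_bits(mask) <= 64);
    for (int i = 0; i < 16; i++)
        if (mask & (1u << i)) a->r[i] = a->mem[a->sp++];
}
```

In this model the set {r3-r7,lr} is the mask 0x40F8 (bits 3 to 7 plus bit 14), so `stmfd(&a, 0x40F8)` moves the stack pointer down by six words and a matching `ldmfd` brings everything back.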
In the original ARM design the multiple register transfers were much more efficient than the equivalent single register transfers because of the simple architecture which had an inherent load delay (it couldn't fetch an instruction and a data word in a single cycle). When they switched to the Harvard model they lost their advantage.
Talking of the ARM reminds me of older heroic feats of assembly-programmed internet software in the Acorn days, such as Ben Dooks' all-asm TCP/IP stack and Jon Ribbens' web browser. Probably not been done that often...
Itanium does this automatically. You declare what registers you're using in the function prologue and it handles popping/pushing/renaming. No register window exceptions either.
Sadly it's dead now. In a cruel twist of fate, Intel killed it off shortly before we found out about all the speculative execution attacks and found that everything else is horrifically vulnerable.
It's inherently not vulnerable to the Spectre/Meltdown family of attacks. They rely on speculative, out-of-order execution on modern CPUs, but Itanium is an in-order core with very limited (and software-controlled) speculation.
It's actually not vulnerable to a bunch of other attacks as well (e.g. a buffer overflow cannot overwrite the return address on Itanium).
ARM has stmdb / ldmia which appear in practically every function prolog/epilog. It also has "banking" systems to swap to a different set of registers on interrupts, which saves time and stack space.
Even if there was one, you'd still be saving registers unnecessarily. If you only clobber one register in a procedure there's no need to save and restore all of them.
Those undocumented instructions don't exist on x86_64, and they could be used to do bizarre things on the processors which had them:
> As the two LOADALL instructions were never documented and do not exist on later processors, the opcodes were reused in the AMD64 architecture.[8] The opcode for the 286 LOADALL instruction, 0F05, became the AMD64 instruction SYSCALL; the 386 LOADALL instruction, 0F07, became the SYSRET instruction. These definitions were cemented even on Intel CPUs with the introduction of the Intel 64 implementation of AMD64.[9]
[snip]
> Because LOADALL did not perform any checks on the validity of the data loaded into processor registers, it was possible to load a processor state that could not be normally entered, such as using real mode (PE=0) together with paging (PG=1) on 386-class CPUs.[7]
The BeagleBone’s processor, the AM335x by TI, has two “PRU” coprocessors. They have “xin” and “xout” instructions that can copy a register bank in 1 cycle. Pretty handy for quick data gathering but tricky to use in C.