    %macro stackpush 0
        push rdi
        push rsi
        push rdx
        push r10
        push r8
        push r9
        push rbx
        ...

userbinator · on July 14, 2018

You can thank AMD for mysteriously removing PUSHA/POPA from the opcode map in 64-bit mode (not even replaced by any new useful instructions, just made invalid.)

simcop2387 · on July 14, 2018

I don't think they've given reasons for it, but it's likely because they just completely destroy any kind of out of order execution. They'd end up requiring a fence before either instruction to ensure that you only push or pop the correct values, and you couldn't speculate at all past them. Combine that with more than double the stack space needed (double the number of registers, and double the size of each) and it seems pretty wasteful. And then there's the fact that modern compilers are likely already avoiding those instructions because of the timing of saving extra registers that you aren't using in a given function, it probably just doesn't make much sense to keep them anymore.

userbinator · on July 14, 2018

> but it's likely because they just completely destroy any kind of out of order execution.

The same goes for the longer sequence of individual pushes or pops --- they all depend on the stack pointer. In fact, the single instruction needs to only adjust it once by the total number of registers pushed/popped (since it is a constant).

In other words, PUSHA/POPA already decode internally to a bunch of moves and one ALU op for the stack pointer, all of which can be scheduled OoO. All they needed to do for 64-bit mode was adjust the constant (by multiplying it by two) and emit more uops for the additional registers, but they didn't for some otherwise inexplicable reason. All the machinery to do it already existed.
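To make that argument concrete, here is a toy Python model (not a description of any real decoder). It compares a run of individual pushes against a hypothetical fused "push all" that stores each register at a fixed offset and adjusts the stack pointer once. The register order, the names `push_individually` and `push_all_at_once`, and the exclusion of the stack pointer itself are all assumptions made for this sketch.

```python
# Toy model: memory is a dict from address to value; registers are a dict.
WORD = 8  # bytes per register in 64-bit mode

# Register order is an arbitrary choice for this sketch, not an ISA definition.
REGS = ["rax", "rcx", "rdx", "rbx", "rbp", "rsi", "rdi"]


def push_individually(regs, mem, sp):
    """Separate PUSHes: each one decrements sp, then stores at the new sp."""
    for name in REGS:
        sp -= WORD
        mem[sp] = regs[name]
    return sp


def push_all_at_once(regs, mem, sp):
    """Hypothetical fused push: stores at fixed offsets, one sp adjustment."""
    for i, name in enumerate(REGS):
        # Each store address depends only on the original sp and a constant.
        mem[sp - WORD * (i + 1)] = regs[name]
    return sp - WORD * len(REGS)  # single ALU op on the stack pointer


regs = {name: i + 1 for i, name in enumerate(REGS)}
mem_a, mem_b = {}, {}
sp_a = push_individually(regs, mem_a, 0x1000)
sp_b = push_all_at_once(regs, mem_b, 0x1000)
```

Both versions leave the same stack contents and the same final stack pointer; the only difference in the model is how many times the stack pointer is updated.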

> And then there's the fact that modern compilers are likely already avoiding those instructions because of the timing of saving extra registers that you aren't using in a given function, it probably just doesn't make much sense to keep them anymore.

Compilers won't ever use PUSHA/POPA but lots of other code will --- BIOS, executable packers, OS state-saving code (the perfect example of what these instructions were for?), etc.

See the story of SAHF/LAHF for a similar and even more astoundingly bad decision.

agumonkey · on July 13, 2018

I wonder if there are cpus with 1-instr state save

ps: thank you all for the answers

rwmj · on July 13, 2018

ARMv7 (not AArch64) has the STM instruction that lets you push a register set, selected by bitmask, e.g.:

    STMFD sp!, {r3-r7,lr}

(I believe the "FD" suffix is to do with the stack growing down - "full descending")

Of course this is just implemented with microcode so it's not really any more efficient than a series of PUSHes, except there's a bit less I-cache pressure.
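As a rough illustration of the "selected by bitmask" part, the Python sketch below models what `STMFD sp!, {r3-r7,lr}` does to memory. The helper names `reglist_mask` and `stmfd` are invented for this example. It relies on two properties of ARM store-multiple: the register list is a 16-bit mask with one bit per register (`lr` is r14), and the lowest-numbered register always lands at the lowest address.

```python
def reglist_mask(*regnums):
    """Build the 16-bit register-list mask: bit n set means rN is in the list."""
    mask = 0
    for n in regnums:
        mask |= 1 << n
    return mask


def stmfd(mask, regs, mem, sp):
    """Model STMFD sp!, {...}: full-descending store-multiple with writeback."""
    nums = [n for n in range(16) if mask & (1 << n)]
    new_sp = sp - 4 * len(nums)          # decrement before storing
    for i, n in enumerate(nums):
        mem[new_sp + 4 * i] = regs[n]    # lowest register at lowest address
    return new_sp                        # "!" writes this back to sp


regs = [0x100 + n for n in range(16)]    # arbitrary register contents
mem = {}
mask = reglist_mask(3, 4, 5, 6, 7, 14)   # {r3-r7, lr}
sp = stmfd(mask, regs, mem, 0x8000)
```

Here the mask comes out as `0x40F8`, six words are stored, and the stack pointer drops by 24 bytes.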

jlarcombe · on July 14, 2018

In the original ARM design the multiple register transfers were much more efficient than the equivalent single register transfers because of the simple architecture which had an inherent load delay (it couldn't fetch an instruction and a data word in a single cycle). When they switched to the Harvard model they lost their advantage.

Talking of the ARM reminds me of older heroic feats of assembly-programmed internet software in the Acorn days, such as Ben Dooks' all-asm TCP/IP stack and Jon Ribbens' web browser. Probably not been done that often...

PeCaN · on July 13, 2018

Itanium does this automatically. You declare what registers you're using in the function prologue and it handles popping/pushing/renaming. No register window exceptions either.

emteycz · on July 14, 2018

Isn't Itanium obsolete? I'd love to learn otherwise.

PeCaN · on July 14, 2018

Sadly it's dead now. In a cruel twist of fate, Intel killed it off shortly before we found out about all the speculative execution attacks and found that everything else is horrifically vulnerable.

emteycz · on July 14, 2018

Do you mean that Itanium would have been safe?

PeCaN · on July 14, 2018

It's inherently not vulnerable to the Spectre/Meltdown family of attacks. They rely on speculative, out-of-order execution on modern CPUs, but Itanium is an in-order core with very limited (and software-controlled) speculation.

It's actually not vulnerable to a bunch of other attacks as well (e.g. a buffer overflow cannot overwrite the return address on Itanium).

pjc50 · on July 13, 2018

ARM has stmdb / ldmia which appear in practically every function prolog/epilog. It also has "banking" systems to swap to a different set of registers on interrupts, which saves time and stack space.

pwg · on July 14, 2018

Another one: the old Z-80 CPU had two sets of registers, and a single instruction to swap between the main and the alternate register set.

http://landley.net/history/mirror/cpm/z80.html (search for "alternate registers").
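A small Python sketch of the idea, for anyone who hasn't seen it. The class `Z80Regs` is made up for illustration and only tracks the register pairs. One detail worth noting: on the Z80 the exchange is split across two instructions, `EXX` for BC/DE/HL and `EX AF,AF'` for the accumulator and flags, so a full swap takes both.

```python
class Z80Regs:
    """Toy model of the Z80 main and alternate register pairs."""

    def __init__(self):
        self.main = {"AF": 0, "BC": 0, "DE": 0, "HL": 0}
        self.alt = {"AF": 0, "BC": 0, "DE": 0, "HL": 0}

    def exx(self):
        """EXX: exchange BC, DE, HL with BC', DE', HL'."""
        for pair in ("BC", "DE", "HL"):
            self.main[pair], self.alt[pair] = self.alt[pair], self.main[pair]

    def ex_af(self):
        """EX AF,AF': exchange AF with AF'."""
        self.main["AF"], self.alt["AF"] = self.alt["AF"], self.main["AF"]


cpu = Z80Regs()
cpu.main.update(AF=0x1234, BC=0x0001, DE=0x0002, HL=0x0003)
saved = dict(cpu.main)

cpu.exx(); cpu.ex_af()      # e.g. on entry to an interrupt handler
cpu.main["HL"] = 0xBEEF     # handler clobbers the now-active set
cpu.exx(); cpu.ex_af()      # swap back: original values reappear untouched
```

Nothing is copied to memory; the "save" is just a change of which set is active, which is why it only works one level deep.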

nineteen999 · on July 16, 2018

Sadly it appears that none of the Z80 C compilers (at least the ones I've used, Hi-Tech C and sdcc) are smart enough to use it.

saagarjha · on July 13, 2018

Even if there was one, you'd still be saving registers unnecessarily. If you only clobber one register in a procedure there's no need to save and restore all of them.

kijiki · on July 13, 2018

x86 has LOADALL and SAVEALL.

https://en.wikipedia.org/wiki/LOADALL

msla · on July 13, 2018

Those undocumented instructions don't exist on x86_64, and they could be used to do bizarre things on the processors which had them:

> As the two LOADALL instructions were never documented and do not exist on later processors, the opcodes were reused in the AMD64 architecture.[8] The opcode for the 286 LOADALL instruction, 0F05, became the AMD64 instruction SYSCALL; the 386 LOADALL instruction, 0F07, became the SYSRET instruction. These definitions were cemented even on Intel CPUs with the introduction of the Intel 64 implementation of AMD64.[9]

[snip]

> Because LOADALL did not perform any checks on the validity of the data loaded into processor registers, it was possible to load a processor state that could not be normally entered, such as using real mode (PE=0) together with paging (PG=1) on 386-class CPUs.[7]

scandinavian · on July 13, 2018

What about PUSHAD and POPAD for x86? Those are not undocumented, right?

jamieiles · on July 13, 2018

There is no PUSHA in long mode, though; you need to push individual regs.

elcritch · on July 13, 2018

The BeagleBone's processor, TI's AM335x, has two “PRU” coprocessors. They have “xin” and “xout” instructions that can copy a register bank in 1 cycle. Pretty handy for quick data gathering but tricky to use in C.

blattimwind · on July 13, 2018

Sure. It's called windowed registers and you have one instruction each to move the window right/left.