
    %macro stackpush 0
        push rdi
        push rsi
        push rdx
        push r10
        push r8
        push r9
        push rbx
        push rcx
    %endmacro
    
    %macro stackpop 0
        pop rcx
        pop rbx
        pop r9
        pop r8
        pop r10
        pop rdx
        pop rsi
        pop rdi
    %endmacro
I guess that's one way of saving registers. Not particularly efficient, but it works…



You can thank AMD for mysteriously removing PUSHA/POPA from the opcode map in 64-bit mode (not even replaced by any new useful instructions, just made invalid.)


I don't think they've given reasons for it, but it's likely because they just completely destroy any kind of out of order execution. They'd end up requiring a fence before either instruction to ensure that you only push or pop the correct values, and you couldn't speculate at all past them. Combine that with four times the stack space needed (twice as many registers, each twice as wide, so 128 bytes instead of 32) and it seems pretty wasteful. And then there's the fact that modern compilers are likely already avoiding those instructions because of the timing of saving extra registers that you aren't using in a given function, it probably just doesn't make much sense to keep them anymore.


> but it's likely because they just completely destroy any kind of out of order execution.

The same goes for the longer sequence of individual pushes or pops --- they all depend on the stack pointer. In fact, the single instruction only needs to adjust it once, by the total size of the registers pushed/popped (since it is a constant).

In other words, PUSHA/POPA already decode internally to a bunch of moves and one ALU op for the stack pointer, which can be scheduled OoO. All they needed to do for 64-bit mode was adjust the constant and emit more uops for the additional registers, but they didn't, for some otherwise inexplicable reason. All the machinery to do it already existed.
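
To make the "moves plus one stack-pointer adjustment" point concrete, here is a rough Python model (not real microcode, and all names are made up for illustration). It ignores details such as PUSHA also storing the pre-push SP, and assumes 8-byte slots. It only shows that a chain of individual pushes and a bulk store at fixed offsets followed by a single SP update leave the same memory and the same final SP:

```python
SLOT = 8  # assumed slot size in bytes (64-bit registers)

def push_each(values, sp, mem):
    """One push per register: every step reads and writes SP."""
    for value in values:
        sp -= SLOT
        mem[sp] = value
    return sp

def push_bulk(values, sp, mem):
    """Store each register at a fixed offset from the old SP, then adjust SP once."""
    for i, value in enumerate(values):
        mem[sp - SLOT * (i + 1)] = value   # stores do not depend on each other
    return sp - SLOT * len(values)         # single ALU op on the stack pointer

values = [0x11, 0x22, 0x33, 0x44]          # hypothetical register contents
mem_each, mem_bulk = {}, {}
sp_each = push_each(values, 0x1000, mem_each)
sp_bulk = push_bulk(values, 0x1000, mem_bulk)
```

In the bulk version every store address is known from the starting SP alone, which is the property the comment above is relying on.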

> And then there's the fact that modern compilers are likely already avoiding those instructions because of the timing of saving extra registers that you aren't using in a given function, it probably just doesn't make much sense to keep them anymore.

Compilers won't ever use PUSHA/POPA but lots of other code will --- BIOS, executable packers, OS state-saving code (the perfect example of what these instructions were for?), etc.

See the story of SAHF/LAHF for a similar and even more astoundingly bad decision.


I wonder if there are CPUs with a single-instruction state save.

PS: thank you all for the answers


ARMv7 (not AArch64) has the STM instruction that lets you push a register set, selected by bitmask, e.g.:

    STMFD sp!, {r3-r7,lr}
(I believe the "FD" suffix is to do with the stack growing down - "full descending")

Of course this is just implemented with microcode so it's not really any more efficient than a series of PUSHes, except there's a bit less I-cache pressure.
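
For anyone who hasn't seen it, here is a small Python sketch of what that STMFD does, assuming 32-bit registers and the usual ARM rule that the lowest-numbered register ends up at the lowest address. The function names are hypothetical, it only models the memory effect:

```python
def reg_list_mask(*numbers):
    """Build the 16-bit register-list bitmask, bit n set means rn is included."""
    mask = 0
    for n in numbers:
        mask |= 1 << n
    return mask

def stmfd(sp, reg_mask, regs, mem):
    """Model of STMFD sp!, {...}: decrement SP first, then store, with writeback."""
    selected = [n for n in range(16) if reg_mask & (1 << n)]
    sp -= 4 * len(selected)              # full descending: SP moves down before the stores
    for i, n in enumerate(selected):     # lowest-numbered register at the lowest address
        mem[sp + 4 * i] = regs[n]
    return sp                            # the "!" writes this value back to SP

regs = [100 + n for n in range(16)]      # made-up register contents
mask = reg_list_mask(3, 4, 5, 6, 7, 14)  # {r3-r7, lr}; lr is r14
mem = {}
new_sp = stmfd(0x8000, mask, regs, mem)
```

Six registers are selected, so SP drops by 24 bytes in one go.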


In the original ARM design the multiple register transfers were much more efficient than the equivalent single register transfers because of the simple architecture which had an inherent load delay (it couldn't fetch an instruction and a data word in a single cycle). When they switched to the Harvard model they lost their advantage.

Talking of the ARM reminds me of older heroic feats of assembly-programmed internet software in the Acorn days, such as Ben Dooks' all-asm TCP/IP stack and Jon Ribbens' web browser. Probably not been done that often...


Itanium does this automatically. You declare what registers you're using in the function prologue and it handles popping/pushing/renaming. No register window exceptions either.


Isn't Itanium obsolete? I'd love to learn otherwise.


Sadly it's dead now. In a cruel twist of fate, Intel killed it off shortly before we found out about all the speculative execution attacks and found that everything else is horrifically vulnerable.


Do you mean that Itanium would have been safe?


It's inherently not vulnerable to the Spectre/Meltdown family of attacks. They rely on speculative, out-of-order execution on modern CPUs, but Itanium is an in-order core with very limited (and software-controlled) speculation.

It's actually not vulnerable to a bunch of other attacks as well (e.g. a buffer overflow cannot overwrite the return address on Itanium).


ARM has stmdb / ldmia which appear in practically every function prolog/epilog. It also has "banking" systems to swap to a different set of registers on interrupts, which saves time and stack space.


Another one: the old Z-80 CPU had two sets of registers, and a single instruction (EXX) to swap BC, DE and HL between the main and the alternate register set. AF has its own swap instruction, EX AF,AF'.

http://landley.net/history/mirror/cpm/z80.html (search for "alternate registers").
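
A toy Python model of the idea, in case it helps. On the Z80, EXX exchanges BC, DE and HL with their shadow copies and EX AF,AF' exchanges the accumulator and flags. The class and method names here are just for illustration:

```python
class Z80Regs:
    """Main and alternate (shadow) register pairs of a Z80, nothing else."""

    def __init__(self):
        self.main = {"AF": 0, "BC": 0, "DE": 0, "HL": 0}
        self.alt = {"AF": 0, "BC": 0, "DE": 0, "HL": 0}

    def exx(self):
        """EXX: swap BC, DE and HL with BC', DE' and HL'."""
        for pair in ("BC", "DE", "HL"):
            self.main[pair], self.alt[pair] = self.alt[pair], self.main[pair]

    def ex_af(self):
        """EX AF,AF': swap the accumulator/flags pair with its shadow."""
        self.main["AF"], self.alt["AF"] = self.alt["AF"], self.main["AF"]

cpu = Z80Regs()
cpu.main.update(AF=0x1234, BC=0x0001, DE=0x0002, HL=0x0003)

# "Interrupt entry": switch banks instead of pushing anything.
cpu.exx()
cpu.ex_af()
cpu.main["HL"] = 0xBEEF      # handler scribbles on its own bank

# "Interrupt exit": switch back, the original values reappear.
cpu.exx()
cpu.ex_af()
```

Nothing touches the stack, which is why this was popular for fast interrupt handlers. The catch is that there is only one alternate set, so it does not nest.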


Sadly it appears that none of the Z80 C compilers (at least the ones I've used, Hi-Tech C and sdcc) are smart enough to use it.


Even if there was one, you'd still be saving registers unnecessarily. If you only clobber one register in a procedure there's no need to save and restore all of them.


x86 has LOADALL and SAVEALL.

https://en.wikipedia.org/wiki/LOADALL


Those undocumented instructions don't exist on x86_64, and they could be used to do bizarre things on the processors which had them:

> As the two LOADALL instructions were never documented and do not exist on later processors, the opcodes were reused in the AMD64 architecture.[8] The opcode for the 286 LOADALL instruction, 0F05, became the AMD64 instruction SYSCALL; the 386 LOADALL instruction, 0F07, became the SYSRET instruction. These definitions were cemented even on Intel CPUs with the introduction of the Intel 64 implementation of AMD64.[9]

[snip]

> Because LOADALL did not perform any checks on the validity of the data loaded into processor registers, it was possible to load a processor state that could not be normally entered, such as using real mode (PE=0) together with paging (PG=1) on 386-class CPUs.[7]


What about PUSHAD and POPAD for x86? Those are not undocumented, right?


There is no PUSHA in long mode though; you need to push individual regs.


The BeagleBone’s processor, the AM335x by TI, has two “PRU” coprocessors. They have “xin” and “xout” instructions that can copy a register bank in 1 cycle. Pretty handy for quick data gathering but tricky to use in C.


Sure. It's called windowed registers and you have one instruction each to move the window right/left.
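
A very rough Python sketch of the windowing idea, loosely modeled on SPARC-style in/local/out registers. The window count, the direction the pointer moves and all the names are assumptions for illustration, and there is no overflow/underflow trap handling:

```python
class WindowedRegisterFile:
    """Overlapping register windows: a callee's 'in' registers are the caller's 'out' registers."""

    def __init__(self, n_windows=4):
        self.n_windows = n_windows
        self.phys = [0] * (n_windows * 16)   # 16 fresh physical registers per window
        self.cwp = 0                         # current window pointer

    def _index(self, kind, n):
        offset = {"in": 0, "local": 8, "out": 16}[kind] + n
        return (self.cwp * 16 + offset) % len(self.phys)

    def read(self, kind, n):
        return self.phys[self._index(kind, n)]

    def write(self, kind, n, value):
        self.phys[self._index(kind, n)] = value

    def save(self):
        """One instruction on call: slide the window, no registers are copied."""
        self.cwp = (self.cwp + 1) % self.n_windows

    def restore(self):
        """One instruction on return: slide the window back."""
        self.cwp = (self.cwp - 1) % self.n_windows

rf = WindowedRegisterFile()
rf.write("out", 0, 42)        # caller passes an argument
rf.write("local", 0, 7)       # caller's private value
rf.save()
arg_seen_by_callee = rf.read("in", 0)
rf.write("local", 0, 999)     # callee uses its own locals
rf.restore()
caller_local_after_return = rf.read("local", 0)
```

The caller's locals survive the call without a single store to memory, at least until the windows run out.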



