> The mechanism works only under certain conditions. It must use general purpose registers, and the operand size must be 32 or 64 bits. The memory operand must use a pointer and optionally an index. It does not work with absolute or rip-relative addresses.
> It seems that the CPU makes assumptions about whether memory operands have the same address before the addresses have been calculated. This may cause problems in case of pointer aliasing.
Or from the PDF:
• The instructions must use general purpose registers.
• The memory operands must have the same address.
• The operand size must be 32 or 64 bits.
• You may have a 32 bit read after a 64 bit write to the same address, but not vice versa.
• The memory address must have a base pointer, no absolute address, and no rip-relative address. The memory address may have an index register, a scale factor, and an offset no bigger than 8 bits.
• The memory operand must be specified in exactly the same way with the same unmodified pointer and index registers in all the instructions involved.
• The memory address cannot cross a cache line boundary.
• The instructions can be simple MOV instructions, read-modify instructions, or read-modify-write instructions. It also works with PUSH and POP instructions.
• Complex instructions with multiple μops cannot be used.
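To make the list a bit more concrete, here is a rough sketch of how I read those conditions, written as a C predicate over a write followed by a read. Everything here is made up for illustration: the `mem_access` struct, the `operand_ok` and `may_mirror` names, the 64-byte line size, and the reading of "offset no bigger than 8 bits" as a signed 8-bit displacement are my assumptions, not something taken from the PDF. It also does not model the register class, the instruction type, or whether a register was modified between the two instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory access (names are illustrative). */
typedef struct {
    int base_reg;    /* base register number, -1 for absolute or rip-relative */
    int index_reg;   /* index register number, -1 if there is none */
    int scale;       /* scale factor: 1, 2, 4 or 8 */
    int32_t disp;    /* displacement (the "offset" in the list) */
    int size_bits;   /* operand size in bits */
    uintptr_t addr;  /* effective address, needed for the cache line check */
} mem_access;

enum { CACHE_LINE = 64 };  /* assumed cache line size in bytes */

/* Conditions that apply to a single operand. */
static bool operand_ok(const mem_access *m) {
    assert(m != NULL);
    if (m->base_reg < 0) return false;                  /* needs a base pointer */
    if (m->size_bits != 32 && m->size_bits != 64) return false;
    if (m->disp < -128 || m->disp > 127) return false;  /* assumed signed 8 bits */
    uintptr_t first = m->addr / CACHE_LINE;
    uintptr_t last  = (m->addr + (uintptr_t)(m->size_bits / 8) - 1) / CACHE_LINE;
    return first == last;                               /* no line crossing */
}

/* Conditions that relate a write to a later read of the same location. */
static bool may_mirror(const mem_access *w, const mem_access *r) {
    if (!operand_ok(w) || !operand_ok(r)) return false;
    /* The operand must be specified in exactly the same way. */
    if (w->base_reg != r->base_reg || w->index_reg != r->index_reg ||
        w->scale != r->scale || w->disp != r->disp) return false;
    if (w->addr != r->addr) return false;               /* same address */
    /* 32 bit read after 64 bit write is fine, the reverse is not. */
    return r->size_bits <= w->size_bits;
}
```

Note that two accesses to the same address through different base registers fail the check here, which is how I understand the "specified in exactly the same way" condition.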
On a 64-bit platform, you would need an algorithm that needs more than 15 registers (so it has to spill to the stack) and that needs to execute faster than cache access latency allows. There might be some, but I doubt many would fall into this category.
Taking pending writes from the store buffer before they have retired is something else and has (obviously) been done since we first had OOO execution.
This mechanism refers back to the operand's original value in the register file, because that can be done with less latency than searching around in the store buffer, whose latency I think isn't much different from L1.
Thanks, I had missed that distinction.
(Potentially related: assuming it's of benefit, are modern compilers smart enough to repurpose %rsp (is this even allowed?) if I use a block of memory as a stack inside a hot loop?)