This isn't directly related to Zen 2 (sorry), but it's something I've been wondering about:

How do processors that split ops into uops implement precise interrupts? I sort of understand how the ROB is used to implement precise interrupts even with pipelining and OOO, but I don't quite see how processors map uops back to the original instruction sequence.




First, most uops can't throw exceptions, so fusing a shift and an add instruction together doesn't require any complex tracking here.

If a uop throws an exception (let's say a fused add+ld), each uop can carry a tag that lets you trace back to the PC (instruction address) of, say, the start of the sequence, so you know which "instruction" to report to the Privileged Architecture as having excepted. For many reasons, you need to store a list of the PCs of the in-flight instructions somewhere anyway (although it is heavily compressed), so having a small ID tag to help reconstruct a given uop's PC isn't too onerous.

Ideally, multiple instructions may map to a single uop, but at most one of them can throw an exception. The hard case is something like load-pair uops, since each load can throw an exception. Some machines, if a fault is encountered, will refetch the pair and re-execute them as independent, unfused loads. Other designs will just pay the pain of tracking which of the pair excepted and do some simple arithmetic off of that.
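
To make the "simple arithmetic" option concrete, here is a minimal sketch in Python. It assumes (my assumption, not something any real design is confirmed to do) that the fused uop records the PC and encoded length of the first load, and that the second load immediately follows it in memory. The names `FusedLoadPair` and `faulting_pc` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FusedLoadPair:
    """Hypothetical fused uop covering two adjacent load instructions."""
    first_pc: int   # address of the first load
    first_len: int  # encoded length of the first load, in bytes


def faulting_pc(uop: FusedLoadPair, faulting_half: int) -> int:
    """Return the PC of the load that faulted (half 0 or half 1)."""
    if faulting_half not in (0, 1):
        raise ValueError("faulting_half must be 0 or 1")
    # The second load starts right after the first one (assumed adjacency).
    return uop.first_pc + (uop.first_len if faulting_half == 1 else 0)


pair = FusedLoadPair(first_pc=0x1000, first_len=4)
print(hex(faulting_pc(pair, 0)), hex(faulting_pc(pair, 1)))
```

The only extra state the hardware would have to keep in this model is one bit saying which half faulted.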


Instructions such as shift and add that have memory operands can throw exceptions on x86/amd64. (This was one of the motivations for RISC: separating loads/stores from ALU ops made exception handling cleaner.)

Heh, random note: I just looked up shift instructions on x86 and there are 6 different ones, not RISC. But today there are well over 1000 instructions, so a few shift variants are peanuts.


Correcting myself: the issue was uops, not instructions. But this turns out to be similar anyway: Intel, at least, nowadays keeps the memory addressing part in uops (in most cases) and doesn't split instructions into load/store uops + ALU ops.


I thought there were two different uop ISAs on big x86 cores these days, with two different purposes. One is pretty close to the original instructions, just decoded and fixed width (on AMD at least, this is what's in the uCode ROM). Those are then cracked to another ISA that the ROB knows about, because instructions will cross functional unit boundaries.

So for, say, 'rol mem_addr, shift', the instruction would be cracked into the inner ISA as something like:

    ld  reg_temp0, mem_addr
    rol reg_temp0, shift
    st  reg_temp0, mem_addr
This is all hearsay though; I could have certainly misheard/misremembered.


My understanding is that, on ISAs that require it, the uops are marked with the PC of the instruction and sit in the ROB in program order, so you can reverse the ROB. Intermediate results aren't fully committed until all of the uops have completed, so there's always the possibility of rollback.


I'm not an expert but I believe what accomplishes this task is the reorder buffer. This allows the instruction execution and its side effects to be separated.


The parent's ROB is the reorder buffer. AIUI it causes the instructions to be retired in order (with exceptions stored until retirement, then exposed). The original question, though, is how a particular u-op is mapped back to the original macro-instruction, so we know which macro-instruction excepted.

And I don't know. I guess if each u-op were tagged with the instruction address within the process, that would do, but that's carrying around at least 32 bits, which is quite a large tag.

Alternatively, tag indirectly, which is more likely: you can have maybe 256 instructions 'hot' at any time, so an 8-bit tag on each u-op could point to a 32- or 64-bit table entry (edit: holding the actual address of the macro-op). And the window for the ROB, and for the other thing that does instruction issue, is ~200 instructions, so that sounds more plausible.
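
A rough sketch of the indirect tagging idea in Python, purely as a toy model. The `PcTable` class, its method names, and the 8-bit default are all my own assumptions for illustration, not a description of any real core.

```python
from typing import List, Optional


class PcTable:
    """Hypothetical table of in-flight macro-op addresses, indexed by a small tag."""

    def __init__(self, tag_bits: int = 8) -> None:
        self.size = 1 << tag_bits
        self.entries: List[Optional[int]] = [None] * self.size
        self.next_tag = 0

    def allocate(self, pc: int) -> int:
        """Record a macro-op address and return the small tag its u-ops carry."""
        tag = self.next_tag
        if self.entries[tag] is not None:
            raise RuntimeError("table full: decode would have to stall")
        self.entries[tag] = pc
        self.next_tag = (tag + 1) % self.size
        return tag

    def lookup(self, tag: int) -> Optional[int]:
        """Map a u-op's tag back to the originating macro-op address."""
        return self.entries[tag]

    def release(self, tag: int) -> None:
        """Free the entry once the macro-op has retired."""
        self.entries[tag] = None


table = PcTable()
tag = table.allocate(0x401000)
uops = [("ld", tag), ("div", tag)]  # both u-ops carry only the 8-bit tag
print(hex(table.lookup(uops[1][1])))
```

Each u-op carries 8 bits instead of a full 32- or 64-bit address, and the full address is only read out when something actually goes wrong.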

All speculation on my part though!


Your alternative is sort of close. What happens is that every x86 instruction is assigned a ROB entry. Every uop that has results (stores are handled separately) is assigned a clean register out of the physical register file, and the address of this register (or multiple registers in case of multi-uop instructions) is stored in the corresponding ROB entry. The ROB acts like an in-order circular list -- the retire phase drains it in order from the oldest first, retiring the oldest instruction if and only if all the corresponding PRF registers have been written to. This is the point where any and all side-effects are made visible.
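
Here is a toy Python model of the retirement rule described above, as a sketch only. The names (`RobEntry`, `Core`, `dispatch`, `writeback`, `retire`) are invented for illustration, and it ignores stores, exceptions and register freeing entirely.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Set


@dataclass
class RobEntry:
    pc: int               # address of the x86 instruction
    dest_regs: List[int]  # physical registers its uops will write


class Core:
    def __init__(self) -> None:
        self.rob: Deque[RobEntry] = deque()  # oldest entry on the left
        self.ready: Set[int] = set()         # physical registers already written

    def dispatch(self, pc: int, dest_regs: List[int]) -> None:
        """Allocate a ROB entry in program order."""
        self.rob.append(RobEntry(pc, dest_regs))

    def writeback(self, reg: int) -> None:
        """A uop finished (possibly out of order) and wrote its register."""
        self.ready.add(reg)

    def retire(self) -> List[int]:
        """Drain from the oldest entry while all of its registers are written."""
        retired = []
        while self.rob and all(r in self.ready for r in self.rob[0].dest_regs):
            retired.append(self.rob.popleft().pc)
        return retired


core = Core()
core.dispatch(0x10, [40])
core.dispatch(0x14, [41, 42])  # a two-uop instruction
core.dispatch(0x18, [43])
core.writeback(43)             # youngest finishes first
print(core.retire())           # nothing retires: the oldest is not done yet
```

Even though the youngest instruction completed first, nothing becomes visible until everything older than it has completed too.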


I realise you're describing a dataflow engine. Suddenly Tomasulo's algorithm (which is what this is about?) is starting to fall into place.

Which is a) amazing and b) OMG the frigging complexity of something that has to run at sub-ns speeds. It's like sausages, the closer you look the less there is to enjoy.

Thanks!


Thanks. The bit I'm not sure about is how to prevent ending up in a state in the middle of a single instruction if one instruction gets split into multiple uops. Like if one instruction gets split into two uops and the first one completes but the second one raises an exception.


OK, let me try (Tuna-Fish, put me right at any point).

> Like if one instruction gets split into two uops and the first one completes but the second one raises an exception.

That's not a problem. It's just one of n exception types that the instruction can raise. Suppose there's a macro division instruction (say an x64 instruction, if something like this exists) where one operand can be fetched from memory; you could have

  r2 <- r3 / ^r4
where ^r4 fetches the contents of memory at address held in r4.

suppose it's split up into u-ops

  tr6 <- ^r4  ;; tr6 is temporary register 6, invisible to programmer
  r2 <- r3 / tr6
you could have a division by zero at u-op 2, or an invalid address exception at u-op 1. Either of those is a valid exception for the original single macro-op.

Extrapolating from what Tuna-Fish said, the ROB is a list of macro instructions. I assume each instruction will be tagged with its actual macro-op address, and each u-op must link back to the originating macro-op so macro-op retirement can take place, so we have a small (8-bit? because the ROB queue is small) pointer from each u-op back to the macro-op in the ROB.

Follow the 8-bit u-op ptr to the ROB, get the originating macro-op address, raise exception at that address.
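
A small Python sketch of that idea, to show why a half-finished instruction never becomes visible. It's a simplification under my own assumptions: every u-op has already executed, results for architectural registers sit in a per-instruction `pending` dict until retirement, and temporaries like tr6 are never committed at all. `MacroOp` and `retire` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MacroOp:
    pc: int                                                # macro-op address
    pending: Dict[str, int] = field(default_factory=dict)  # uncommitted results
    fault: Optional[str] = None                            # set by any of its u-ops


def retire(rob: List[MacroOp], arch_regs: Dict[str, int]) -> Optional[Tuple[int, str]]:
    """Commit in program order; stop and squash at the first faulting macro-op."""
    while rob:
        head = rob[0]
        if head.fault is not None:
            report = (head.pc, head.fault)
            rob.clear()  # drop the faulting op and everything younger
            return report
        arch_regs.update(head.pending)  # side effects become visible only here
        rob.pop(0)
    return None


arch = {"r1": 0, "r2": 5}
rob = [
    MacroOp(0x100, {"r1": 7}),
    # r2 <- r3 / ^r4: the load into tr6 completed, then the divide faulted
    MacroOp(0x104, {}, fault="divide error"),
    MacroOp(0x108, {"r2": 99}),
]
print(retire(rob, arch), arch)
```

The older instruction at 0x100 commits, the faulting one is reported by its macro-op address, and r2 keeps its old value because neither the faulting instruction nor anything younger ever reached commit.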

Assuming I'm right, and assuming I understood your question correctly. I'll have to read his answer more carefully again.

edit: swapped ^ for asterisk as deref operation, as stars interpreted as formatting. Edit 2: slightly clearer.


Perhaps there are two copies of the program pointer, one at the "top of the pipe" updated by instruction decode and branch prediction, and one at the "bottom of the pipe" updated by the ROB. Then uops only need to carry the amount by which the program counter is advanced, and certain events can cause a pipeline flush and copy the bottom of pipe version of the counter to the top of the pipe.
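
Roughly what I have in mind, as a Python toy. Everything here (`TwoPcCore`, `fetch`, `retire_one`, `flush`) is a made-up name, and the sketch only handles straight-line code: a taken branch would have to carry its target rather than a length.

```python
from typing import List


class TwoPcCore:
    """Toy model with a fetch-side PC and a retire-side PC."""

    def __init__(self, start_pc: int) -> None:
        self.fetch_pc = start_pc        # "top of the pipe"
        self.retire_pc = start_pc       # "bottom of the pipe"
        self.in_flight: List[int] = []  # instruction lengths, in program order

    def fetch(self, length: int) -> None:
        """Decode one instruction; only its length travels down the pipe."""
        self.in_flight.append(length)
        self.fetch_pc += length

    def retire_one(self) -> None:
        """Retire the oldest instruction and advance the bottom PC by its length."""
        self.retire_pc += self.in_flight.pop(0)

    def flush(self) -> int:
        """Squash everything in flight and restart fetch at the bottom PC."""
        self.in_flight.clear()
        self.fetch_pc = self.retire_pc
        return self.retire_pc


core = TwoPcCore(0x1000)
for length in (3, 5, 2):
    core.fetch(length)
core.retire_one()         # the 3-byte instruction retires
print(hex(core.flush()))  # PC of the oldest instruction that did not retire
```

After the flush, the bottom PC is the address of the oldest unretired instruction, which is the address you'd want to report for a fault on it.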

But that's also all speculation :)


> But that's also all speculation :)

How appropriate!

But, your idea I think will not work in general where there's branching.


I see what they mean then. I read the question a bit too fast.

I was mostly assuming that issuing the operation reserved a spot in the outstanding buffer, which would ensure that the write or commit is sequenced after execution by walking through the buffer in order.

But you're right that there are still more questions as to how some of that data is tracked through the pipeline.


I'd assume the simplest thing to do would be to flush the pipeline as if you had a branch mispredict; the interrupt can then be delivered instead of an alternate branch target.


The simplest solution is a dummy uop “check_int”.



