
Where do interrupts happen? - matt_d
https://travisdowns.github.io/blog/2019/08/20/interrupts.html
======
pjc50
This is fantastic for reminding us of how the "programmer's model" of a CPU
differs from what's actually happening, which can be considerably more
complicated. We humans basically cannot handle anything other than an in-order
model where the processor fixes everything up to maintain that appearance.

It seems like you could also model this mentally (although I'm not sure you
could instrument it) as: "On receiving an interrupt, do not give any more
instructions to the execution units; wait for execution to drain, then
transfer control flow to the interrupt handler, and mark the instruction after
the last drained instruction as the interrupt return point."

~~~
BeeOnRope
Well, one of the key points is that interrupts (apparently) do not drain
currently executing instructions: almost all outstanding instructions are
thrown away, including already-executed but unretired instructions. Only the
next-to-retire instruction is apparently allowed to complete (possibly because
interrupts are only checked each time an instruction retires).

------
rrss
Thanks for writing this, BeeOnRope!

I'm not all the way through it yet, but I'm having trouble understanding the
"eligible" column. There are a few nops with nonzero entries, and I haven't
been able to figure out the pattern yet. I'd appreciate a clarification if you
have a moment to spare.

~~~
BeeOnRope
Sure. Here's the section from the post for context:

<quote>

Here, the eligible column indicates when the instruction is eligible for
retirement in the sense that it has finished executing. nop instructions have
no dependencies and hence basically execute instantly (in fact, do not execute
at all), and hence are eligible on cycle 0. The other instructions become
eligible as defined by their position in the mov -> add -> ... dependency
chain. The actual column shows when the instruction actually retires. You can
build it by basically scanning down the eligible column and retiring up to 4
uops if their eligible cycle is less than or equal to the current cycle.

</quote>

Basically, instructions [1] need to execute before they can retire.
Instructions execute out of order, when their inputs become ready and some
other constraints are met (e.g., their execution port is available). This post
[2] is a good introduction.

So here "eligible" is the cycle at which an instruction would be expected to
complete execution (starting arbitrarily at cycle 0). If you have a chain of
dependent mov instructions, they will be eligible at 0, 4, 8, 12, ... since
this chain has latency 4 per link. If you have a bunch of independent
instructions with throughput 4/cycle, they would become eligible at
0,0,0,0,1,1,1,1,2,2,2,2 since 4 can start every cycle and they take 1 cycle
each.

"Eligible" is relevant here because it defines when the instruction can
retire. Unlike execution, retirement happens in order, so you can imagine the
retirement unit maintaining a pointer to the oldest (i.e., "higher up" in
source order) instruction, and asking "is it eligible yet?" every cycle. If
yes, that instruction will be retired _in addition to up to the 3 subsequent
in-order instructions if they were also eligible_.

So from the eligible-cycle analysis we can reconstruct the retirement pattern
(sometimes more than one pattern is possible; I didn't touch on that in the
post).
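To make that scan concrete, here's a toy sketch (a hypothetical helper of my own, not the code actually used for the post), assuming a 4-wide retire:

```python
def retire_schedule(eligible, width=4):
    """Given each instruction's eligible cycle (in program order),
    return the cycle it actually retires, retiring at most `width`
    consecutive in-order instructions per cycle."""
    actual = []
    i, cycle = 0, 0
    while i < len(eligible):
        retired = 0
        # retire the oldest instructions whose eligible cycle has passed
        while i < len(eligible) and retired < width and eligible[i] <= cycle:
            actual.append(cycle)
            i += 1
            retired += 1
        cycle += 1
    return actual

# latency-4 dependency chain: each link retires as it becomes eligible
print(retire_schedule([0, 4, 8, 12]))   # [0, 4, 8, 12]
# independent instructions all eligible at once: retired 4 per cycle
print(retire_schedule([0] * 8))         # [0, 0, 0, 0, 1, 1, 1, 1]
```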

Let me know if that makes sense?

\---

[1] Actually uops are what execute, but they are equivalent for this example.

[2] [https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-
introdu...](https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-
to-dataflow-graphs/)

~~~
rrss
Thanks.

Sorry if I'm being thick, but I'm still having trouble seeing where the
pattern for nops is coming from, in this example from the post:

    
    
                           /- eligible
                           |  /- actual
        mov  rax, [rax]  ; 0  0                 
        nop              ; 0  0
        nop              ; 0  0  
        add  rax, 0      ; 1  1
        nop              ; 0  1  
        nop              ; 1  1  <------ ?
        nop              ; 0  1  
        mov  rax, [rax]  ; 6  6 
    

I would think that the fourth nop is eligible only at cycle 1 because only
four uops can start each cycle, but then the fifth nop should also be eligible
at cycle 1. What's different about the fourth nop from the third and fifth?

~~~
BeeOnRope
You are not being thick. It was a typo on my part: I intended for all the nops
to have 0 in their eligible column.

As you point out, putting them all at 0 isn't even really accurate, because
only 4 instructions can allocate (rename) per cycle, so the true eligible
pattern should be 0-0-0-0-1-1-1-1-2-2-2-2 or something like that (it's more
complicated than that because the mov/add also needs to allocate). I'm
basically ignoring that by assuming the front-end (including allocation) can
keep up, or equivalently that the scheduler is infinitely large and always
full.
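One crude way to model that limit (my own sketch, not something from the post): with a width-4 front-end, instruction k can't have allocated before cycle k // 4, so clamp each eligible cycle to at least that:

```python
def alloc_limited(eligible, width=4):
    """Clamp eligible cycles to respect a `width`-wide allocate stage:
    instruction k cannot be eligible before cycle k // width."""
    return [max(e, k // width) for k, e in enumerate(eligible)]

# a stream of nops (all "instantly" eligible) under 4-wide allocation
print(alloc_limited([0] * 12))   # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```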

I am going to fix this to be more accurate but I'll have to unroll the loop a
bit more for the difference between retire and eligible to make sense.

BTW, this is just a simple version of what llvm-mca's -timeline view does:

[https://godbolt.org/z/ZrGXrH](https://godbolt.org/z/ZrGXrH)

The timeline view simulates a simple CPU pipeline, and with -dispatch=4
-mcpu=skylake it's pretty close to Skylake. Unfortunately, llvm-mca models
"unlimited retire bandwidth": if 100 instructions all become eligible in the
same cycle, it will retire all 100. So it's not too helpful in any case where
more than 4 would retire per cycle (but it works for this example because the
dep chains prevent that from happening).

I didn't find any argument you can pass to limit retire bandwidth. In practice
this isn't a problem for most performance-oriented investigation, because the
retire limit isn't usually the bottleneck, so the performance reported by the
tool is likely to be the same even though it simulates unlimited bandwidth.

~~~
BeeOnRope
I fixed the typo, but also tried to improve the way cycles are accounted,
adding two columns in support of that. If you have a moment can you check out:

[https://travisdowns.github.io/blog/2019/08/20/interrupts.htm...](https://travisdowns.github.io/blog/2019/08/20/interrupts.html#lookhere)

and see if it makes sense to you? I'll upgrade the other examples to that
format if it does.

I've added you to the Thanks section at the bottom.

~~~
rrss
Yeah, that makes a lot of sense, thanks.

Thanks again for writing these articles, I very much appreciate them.

------
ggm
As a degenerate case, can anyone design a simple 'meta machine' or constraint-
set in the instructions which imposes specific costs on the interrupt decision
logic? Can one code to this? Is this useful or just entertaining?

It might (for instance) be analogous to designing solutions which fit in L1
cache, for the outcome in time to execute.

Or it might affect the side channel consequences of some forms of crypto.

(or am I missing the obvious "this is in fact, how it works" answer. I am sure
I recall in simpler times reading that writing interrupt _handlers_ demanded
some consideration of what was fixed or mutable in the throw)

~~~
ebeip90
I’m not sure that this addresses your question directly, but the x86 MMU is
Turing complete.

[http://beza1e1.tuxen.de/articles/accidentally_turing_complet...](http://beza1e1.tuxen.de/articles/accidentally_turing_complete.html)

~~~
ggm
It's turtles all the way down.

------
aidenn0
One thing I have noticed with statistical profilers is that conditional jumps
tend to be highlighted a lot.

On the other hand, if I can massage the compiler into using CMOV instead of
e.g. JNE, I often end up with much faster code, so the clustering around that
instruction ends up being useful sometimes anyways.

~~~
hamsham
Out of curiosity, how have you managed to do that? I've noticed even with GCC
or Clang's -O3 option I'm still getting jumps rather than conditional moves.

~~~
fsfod
Clang\LLVM has some heuristics where it will convert cmovs generated by
earlier passes back in to jumps
[https://reviews.llvm.org/D34769](https://reviews.llvm.org/D34769). Also the
compilers can't always prove its safe to evaluate both sides of a ternary
operator which would be needed to use a cmov. Here's a compiler explorer
sample for clang and GCC that shows all these problems
[https://godbolt.org/z/IpZfZv](https://godbolt.org/z/IpZfZv), you change the
-x86-cmov-converter option to true re-enable clangs cmov converter which will
disable cmov in the last function.

~~~
BeeOnRope
Yeah I have spent a long time fighting to get cmov into code that wants it.

The cmov1 case in your example ends up getting a cmov1 because the compiler
was able to apply the cmov to the pointer rather than to the read value
(essentially transforming it to `return (cond ? v1 : v2)->i`).

It's fragile, though: if you change it a bit, it stops working:

[https://godbolt.org/z/hs71Jb](https://godbolt.org/z/hs71Jb)

It isn't clear if it stops working because the heuristics no longer favor it
(you'd need 2 cmov-y things to handle the +1), or because the compiler no
longer sees that the transformation is possible at all.

gcc even fails if you add +1 to _both_ sides, which should be almost as simple
as the original:

[https://godbolt.org/z/y7PIcU](https://godbolt.org/z/y7PIcU)

I have tried the "unconditionally dereference" trick many times to convince
the compiler to emit a cmov, but it doesn't always work because the compiler
might re-optimize it back to doing the read only inside the
branch/conditional, so later phases don't see the read as safe.

