I don't know for sure at all, but I don't think it's impossible that speculative...

jhgb · on June 28, 2021

I don't see how this has anything to do with speculation? In most cases where you care about this, you don't have to speculate about whether all the loop iterations are needed. For example, in matrix multiplication every one of those iterations will be needed.

tsimionescu · on June 28, 2021

What I'm thinking is that the processor has an instruction stream that looks like this:

  loop: 
    instr_1
    instr_2
    ...
    instr_n 
    jcond loop

Now, assuming the loop is not unrolled, the processor would need to speculate that `jcond loop` will be taken in order to execute two copies of instr_1 in parallel. I'm saying that it may be able to do that, though I am by no means sure.
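To make the rolled-versus-unrolled distinction concrete, here is a rough C sketch. The function names and the summation body are made up for illustration, and what a given CPU or compiler actually does with either version is not something this snippet can show; it only shows where the branch sits relative to the two copies of the loop body.

```c
#include <stddef.h>

/* Rolled loop: every iteration ends in a conditional backward branch,
   playing the role of `jcond loop`. To overlap iteration i+1 with
   iteration i, the hardware has to predict that branch as taken. */
long sum_rolled(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* Hand-unrolled by two (hypothetical): two copies of the body sit in
   the same basic block with no branch between them, and they use
   separate accumulators so they do not depend on each other. */
long sum_unrolled2(const long *a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];       /* first copy of the body */
        s1 += a[i + 1];   /* second copy of the body */
    }
    if (i < n) {          /* leftover element when n is odd */
        s0 += a[i];
    }
    return s0 + s1;
}
```

Both functions return the same sum; the only difference is whether a branch separates the two copies of the body.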

jhgb · on June 28, 2021

Oh, I see what you mean -- I was talking (and thinking) about the unrolled version, so I couldn't see how speculation could help there. But I imagine that the kind of long chains you might want to run in parallel within a single basic block typically wouldn't get executed that far past a branch, if the only purpose is to not waste time after a branch misprediction. Plus, from what I understand, you'd still be wasting execution units here, just not by idling them but rather by repeatedly speculating the "I'm done" branch.

EDIT: I just found that the idea that I had in my head actually exists and is called "modulo scheduling".
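For readers who haven't run into the term: modulo scheduling is a compiler technique for software pipelining, where one pass through the loop kernel runs different stages of several consecutive iterations. Below is a hand-written C sketch of the resulting loop shape (prologue, kernel, epilogue). The function names and the load/multiply/store example are assumptions for illustration; a real compiler does this on machine instructions, not on C source.

```c
#include <stddef.h>

/* Plain loop: load, multiply, store for one element at a time. */
void scale_plain(const int *in, int *out, size_t n, int k) {
    for (size_t i = 0; i < n; i++) {
        int x = in[i];     /* stage 1: load */
        int y = x * k;     /* stage 2: multiply */
        out[i] = y;        /* stage 3: store */
    }
}

/* Hand-pipelined version: each kernel pass stores element i,
   multiplies element i+1 and loads element i+2. */
void scale_pipelined(const int *in, int *out, size_t n, int k) {
    if (n < 2) {           /* too short to fill the pipeline */
        scale_plain(in, out, n, k);
        return;
    }
    /* Prologue: fill the pipeline. */
    int y = in[0] * k;     /* element 0 loaded and multiplied */
    int x = in[1];         /* element 1 loaded */
    size_t i = 0;
    /* Kernel: three stages taken from three different iterations. */
    for (; i + 2 < n; i++) {
        out[i] = y;        /* store, element i */
        y = x * k;         /* multiply, element i+1 */
        x = in[i + 2];     /* load, element i+2 */
    }
    /* Epilogue: drain the last two elements. */
    out[i] = y;
    out[i + 1] = x * k;
}
```

The three statements in the kernel belong to different elements and do not depend on each other's results within the same pass, which is what lets them overlap.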