Hacker News

> me but cannot tell the boundary of instructions until they are all decoded

Not fully decoded though, since it's enough to look at the lower bits to determine instruction size.
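
To make this concrete: the RISC-V base encoding scheme lets a fetcher determine instruction length from the low bits of the first 16-bit parcel alone, before any real decoding. A minimal sketch in Python (lengths per the standard encoding scheme; longer/reserved encodings are omitted):

```python
def rv_insn_length(parcel: int) -> int:
    """Instruction length in bytes, determined from the low bits of the
    first 16-bit parcel, per the RISC-V base encoding scheme."""
    if parcel & 0b11 != 0b11:
        return 2                           # compressed (C extension)
    if parcel & 0b11100 != 0b11100:
        return 4                           # standard 32-bit encoding
    if parcel & 0b111111 == 0b011111:
        return 6                           # 48-bit encoding
    if parcel & 0b1111111 == 0b0111111:
        return 8                           # 64-bit encoding
    raise ValueError("reserved or longer encoding")

assert rv_insn_length(0x0001) == 2   # c.nop (compressed)
assert rv_insn_length(0x0013) == 4   # nop (addi x0, x0, 0)
```

So the size is known from bits [6:0] at most, but a straddling instruction still needs its second parcel fetched before it can be decoded in full.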

> Sure, it can be done, but that doesn't mean that it doesn't cost you in mispredict penalties

What does decoding have to do with mispredict penalties?

> Example: branch and jump offsets are painfully small

Yes, that's what the 48-bit instruction encoding is for. See e.g. what the scalar efficiency SIG is currently working on: https://docs.google.com/spreadsheets/u/0/d/1dQYU7QQ-SnIoXp9v...




> Not fully decoded though, since it's enough to look at the lower bits to determine instruction size.

It is not about decoding, which happens later, it is about 32-bit instructions crossing a cache line boundary in the L1-i cache, which happens first.

Instructions are fetched from the L1-i cache in bundles (i.e. cache lines), and the size of the bundle is fixed for a specific CPU model. In all RISC CPUs, the size of a cache line is a multiple of the instruction size (mostly 32 bits). The RISC-V C extension breaks this alignment, which incurs a performance penalty for high-performance CPU implementations, but is less significant for smaller, low-power implementations where performance is not a concern.

If a 32-bit instruction crosses the cache line boundary, another cache line must be fetched from the L1-i cache before the instruction can be decoded. The performance penalty in such a scenario is prohibitive for a very fast CPU core.

P.S. Even worse if the instruction crosses a page boundary, and the page is not resident in memory.
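
The straddling condition itself is simple modular arithmetic. A sketch (the 64-byte line and 4 KiB page sizes are illustrative assumptions, not architectural constants):

```python
L1I_LINE = 64   # bytes; a typical L1-i line size (assumption)
PAGE = 4096     # bytes; a common page size (assumption)

def crosses(addr: int, insn_len: int, block: int) -> bool:
    """True if an instruction starting at addr spills past the next
    block boundary (cache line or page)."""
    return (addr % block) + insn_len > block

# With 4-byte alignment guaranteed, a 32-bit instruction can never
# straddle a 64-byte line; with 2-byte (C-extension) alignment it can:
assert not any(crosses(a, 4, L1I_LINE) for a in range(0, 4096, 4))
assert crosses(62, 4, L1I_LINE)    # straddles a cache line
assert crosses(4094, 4, PAGE)      # straddles a page as well
```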


I don't think crossing cache lines is particularly much of a concern? You'll necessarily be fetching the next cache line in the next cycle anyway to decode further instructions (not even an unconditional branch could stop this I'd think), at which point you can just "prepend" the chopped tail of the preceding bundle (and you'd want some inter-bundle communication for fusion regardless).

This does of course delay decoding this one instruction by a cycle, but you already have that for instructions which are fully in the next line anyways (and aligning branch targets at compile time improves both, even if just to a fixed 4 or 8 bytes).
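
Compile-time alignment of branch targets amounts to emitting padding (e.g. compressed NOPs) up to the next boundary. A small sketch (the 8-byte boundary is an example value, not a requirement):

```python
def pad_bytes(target_addr: int, align: int = 8) -> int:
    """Bytes of NOP padding needed so a branch target starts on an
    align-byte boundary (align must be a power of two)."""
    return (-target_addr) % align

assert pad_bytes(0x1006) == 2   # pad 0x1006 up to 0x1008
assert pad_bytes(0x1008) == 0   # already aligned
```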


> I don't think crossing cache lines is particularly much of a concern?

It is a concern if a branch prediction has failed, and the current cache line has to be discarded or has been invalidated. If the instruction crosses the cache line boundary, both lines have to be discarded. For a high-performance CPU core, it is a significant and, most importantly, unnecessary performance penalty. It is not a concern for a microcontroller or a low power design, though.


Why does an instruction crossing cache lines have anything to do with invalidation/discarding? RISC-V doesn't require instruction cache coherency so the core doesn't have much restriction on behavior if the line was modified, so all restrictions go to just explicit synchronization instructions. And if you have multiple instructions in the pipeline, you'll likely already have instructions from multiple cache lines anyways. I don't understand what "current cache line" even entails in the context of a misprediction, where the entire nature of the problem is that you did not have any idea where to run code from, and thus shouldn't know of any related cache lines.


Mispredict penalties == latency of pipeline. Needing to delay decoding/expansion until after figuring out where instructions actually start will necessarily add a delay of some number of gates (whether or not this ends up increasing the mispredict penalty by any cycles of course depends on many things).

That said, the alternative of instruction fission (i.e. what RISC-V avoids requiring) would add some delay too (I have no clue how these compare though, I'm not a hardware engineer). And RISC-V does benefit from instruction fusion, which can similarly add latency, and whose requirement other architectures could decide to try to avoid (though it'd be harder to keep avoiding it as hardware potential improves while old compiled binary blobs stay unchanged), so it's complicated.
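
As one concrete instance of the fusion mentioned above, a decoder could pattern-match a commonly cited RISC-V pair, lui rd, hi followed by addi rd, rd, lo, into a single "load immediate" macro-op. A hedged sketch, with field offsets per the RV32I base instruction format (this is an illustration of the matching check only, not a real decoder):

```python
def is_lui_addi_pair(insn0: int, insn1: int) -> bool:
    """True if insn0;insn1 is the fusible pair lui rd,hi ; addi rd,rd,lo."""
    OP_LUI, OP_ADDI = 0x37, 0x13
    rd0 = (insn0 >> 7) & 0x1F            # lui destination register
    rd1 = (insn1 >> 7) & 0x1F            # addi destination register
    rs1 = (insn1 >> 15) & 0x1F           # addi source register
    funct3 = (insn1 >> 12) & 0x7         # must be 000 for addi
    return ((insn0 & 0x7F) == OP_LUI and
            (insn1 & 0x7F) == OP_ADDI and funct3 == 0 and
            rd0 == rd1 == rs1)

assert is_lui_addi_pair(0x000012B7, 0x00428293)      # lui x5,1 ; addi x5,x5,4
assert not is_lui_addi_pair(0x00428293, 0x000012B7)  # wrong order
```

The point being: this check sits in the decode path too, so fusion is not free either; it just shifts where the extra gates go.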


Ah, that makes sense, thanks. I think in the end it all boils down to both the ARM and the RISC-V approaches being fine, with slightly different tradeoffs.



