>> Aligning the code means compiler will insert NOPs before the code you want to align. That increases binary size and might cost you performance if you insert a lot of nops in the hot path. In the end executing nops doesn’t come for absolutely free. You need to fetch and decode it.
This statement makes me wonder if a variant of NOP would be useful: igNOP, "ignore ops until next alignment". It would tell the CPU to treat every remaining instruction in the current block as a no-op. I somewhat doubt this would help, as I suspect the real cost today is not fetching and decoding the extra NOPs but some other bottleneck. A "nice" consequence would be that you could pack extra data after the igNOP and before the next alignment boundary, to be unpacked elsewhere. It would probably cause headaches for debugging and raise security concerns... Can anyone comment on this?
This functionality essentially already exists in the form of multi-byte NOPs: https://stackoverflow.com/questions/25545470/long-multi-byte.... Because of the way the decoder fetches instructions, any approach that requires the decoder to act conditionally on anything other than individual instruction length is likely impossible.
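To make the multi-byte NOP idea concrete, here is a rough Python sketch of how padding up to an alignment boundary can be covered with a few long NOPs instead of many single-byte ones. The byte sequences are the recommended 1 to 9 byte forms from the Intel manual's NOP entry; the helper name `nop_padding` and the greedy longest-first strategy are my own assumptions, not what any particular assembler does.

```python
# Recommended multi-byte NOP encodings for x86, keyed by length in bytes
# (as listed in the Intel SDM entry for NOP).
MULTIBYTE_NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}


def nop_padding(addr, align=16):
    """Return a list of NOP encodings that pad from addr to the next multiple of align.

    Hypothetical helper: it greedily picks the longest encoding first, so the
    decoder sees as few instructions as possible.
    """
    remaining = (-addr) % align  # bytes needed to reach the boundary
    nops = []
    while remaining > 0:
        size = min(remaining, 9)
        nops.append(MULTIBYTE_NOPS[size])
        remaining -= size
    return nops


if __name__ == "__main__":
    for insn in nop_padding(0x401003):
        print(insn.hex(" "))
```

For an address ending in 3 and 16-byte alignment, the 13 bytes of padding come out as one 9-byte NOP plus one 4-byte NOP, so the decoder handles two instructions rather than thirteen.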
While in theory NOP decoding could be a bottleneck, I think it would be a really rare occurrence. Usually a hot loop is going to be fed from the LSD (loop stream detector) or the DSB (decoded uop cache), so the NOPs no longer need to be decoded. It would be interesting to see a benchmark that illustrates a case where excessive alignment actually causes a slowdown.
A whole other approach to alignment would be to strategically lengthen earlier instructions so that the desired alignment is achieved. This avoids adding any executed NOPs.
It's not as hard as it sounds: there is lots of redundancy in the x86 encoding, so you can often add REX prefixes, make offsets longer or add an offset where it doesn't exist, etc, etc.
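As an illustration of that redundancy, here is a small Python table of equivalent 64-bit-mode encodings of `add eax, 1`, from 3 to 7 bytes. The encodings themselves are standard x86; the table name `ADD_EAX_1` and the `stretch` helper are made up for this sketch, and a real assembler would have to make this choice per instruction across the whole preceding block.

```python
# Equivalent 64-bit-mode encodings of `add eax, 1`, keyed by length in bytes.
ADD_EAX_1 = {
    3: bytes.fromhex("83c001"),          # add r/m32, imm8
    4: bytes.fromhex("4083c001"),        # same, with a redundant empty REX prefix
    5: bytes.fromhex("0501000000"),      # add eax, imm32 (accumulator short form)
    6: bytes.fromhex("81c001000000"),    # add r/m32, imm32
    7: bytes.fromhex("4081c001000000"),  # same, with a redundant empty REX prefix
}


def stretch(extra):
    """Return the encoding that is `extra` bytes longer than the shortest one.

    Hypothetical helper: raises KeyError if no encoding of that length is listed.
    """
    shortest = min(ADD_EAX_1)
    return ADD_EAX_1[shortest + extra]


if __name__ == "__main__":
    for extra in range(5):
        enc = stretch(extra)
        print(len(enc), enc.hex(" "))
```

So a single `add` can absorb up to 4 bytes of padding on its own, and every byte absorbed this way is part of an instruction that had to execute anyway.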
This is just a relative jump instruction, and on Intel it can be just 2 bytes, although an igNOP might fit into one.
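A quick Python sketch of what "igNOP as a short jump" would look like in bytes, assuming x86 `jmp rel8` (opcode `EB`, displacement counted from the end of the jump). The function name `jump_to_alignment` and the `int3` (`0xCC`) filler are my own choices for illustration; the filler region is where the "extra data" from the original idea would go.

```python
def jump_to_alignment(addr, align=16, filler=0xCC):
    """Return bytes that skip from addr to the next multiple of align.

    Hypothetical helper: emits `jmp rel8` over the gap when there is room,
    and falls back to a single-byte NOP when only one byte of padding is needed.
    """
    pad = (-addr) % align
    if pad == 0:
        return b""       # already aligned
    if pad == 1:
        return b"\x90"   # a 2-byte jump does not fit, use a plain NOP
    disp = pad - 2       # rel8 is relative to the end of the 2-byte jmp
    if disp > 127:
        raise ValueError("gap too large for a rel8 jump")
    return bytes([0xEB, disp]) + bytes([filler]) * disp


if __name__ == "__main__":
    print(jump_to_alignment(0x401003).hex(" "))
```

The skipped bytes are never decoded on the taken path, but they still sit in the same cache lines as the code around them.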
It wouldn't solve the icache problem though. Also, writing to data placed inline with code would be super problematic performance-wise, since it modifies icache lines that are currently being executed from.