This functionality essentially already exists in the form of multi-byte NOPs: https://stackoverflow.com/questions/25545470/long-multi-byte.... Because of the way the decoder fetches instructions, any approach that requires the decoder to act conditionally on anything other than individual instruction length is likely impossible.
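To make the multi-byte NOP idea concrete, here is a rough sketch of how padding to an alignment boundary can be built from the 1 to 9 byte NOP encodings that Intel recommends in the SDM. The helper name nop_padding and the greedy "longest NOP first" strategy are my own assumptions for illustration; real assemblers choose their own sequences and may split the padding differently.

```python
# Recommended multi-byte NOP encodings (1 to 9 bytes), as listed in the
# Intel SDM entry for the NOP instruction.
NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}


def nop_padding(offset, alignment=16):
    """Return NOP bytes that advance `offset` to the next multiple of `alignment`.

    Hypothetical helper: greedily emits the longest available NOP so the
    padding is covered by as few instructions as possible.
    """
    remaining = -offset % alignment  # bytes needed to reach the boundary
    out = bytearray()
    while remaining:
        n = min(remaining, 9)
        out += NOPS[n]
        remaining -= n
    return bytes(out)


if __name__ == "__main__":
    # Pad from offset 5 up to a 16-byte boundary: 11 bytes, i.e. a 9-byte
    # NOP followed by a 2-byte NOP instead of eleven 1-byte NOPs.
    print(nop_padding(5, 16).hex(" "))
```

The point is that the decoder only has to deal with two instructions here rather than eleven, which is why long NOPs are the usual answer for alignment padding.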
While NOP decoding could in theory be a bottleneck, I think it would be really rare. A hot loop is usually fed from the LSD (loop stream detector) or the DSB (decoded uop cache), so the NOPs have already been decoded and never hit the legacy decoders again. It would be interesting to see a benchmark that shows a case where excessive alignment actually causes a slowdown.