Thank you! That's very interesting.

jasonwatkinspdx · on Dec 22, 2020

Another key feature of these architectures is that they had a vector length register. This let you write strip-mined loops that move through arbitrarily sized vectors in chunks as wide as the hardware's vector registers, without knowing that width until runtime. This means that, unlike MMX/SSE, the same binary works on machines with different numbers of lanes.

This idea has been resurrected recently in the RISC-V vector extension and ARM's Scalable Vector Extension (SVE). There the general idea is an instruction that takes the number of elements you still want to process, writes the minimum of that count and the hardware vector register length to a register, and sets the masking appropriately if the requested count is smaller. This makes for a very straightforward strip-mined loop, with no separate branch to check for and handle the remainder in the last iteration.
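
To make the loop shape concrete, here's a rough Python model of it. This is only a sketch: `setvl`, `vlmax` and `vector_add` are names I made up for illustration, the list slice stands in for a single vector instruction, and real hardware does the min-and-mask step in one instruction (RISC-V's `vsetvli` is the closest match to what I described).

```python
def setvl(requested: int, vlmax: int) -> int:
    """Hypothetical model of a set-vector-length instruction.

    Grants min(requested, vlmax) elements for the next vector operation.
    """
    return min(requested, vlmax)


def vector_add(a, b, vlmax):
    """Strip-mined elementwise add.

    vlmax stands in for the hardware vector length, which the loop
    only learns at run time.
    """
    if vlmax < 1:
        raise ValueError("vlmax must be at least 1")
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")

    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        # Ask for everything that is left; the "hardware" grants at most vlmax.
        vl = setvl(n - i, vlmax)
        # One simulated vector instruction over vl elements. On the final
        # trip vl is simply smaller, so no separate remainder loop is needed.
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out
```

The same function gives the same answer whether you call it with `vlmax=4` or `vlmax=16`, which is the binary-compatibility point: nothing in the loop body depends on the lane count.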