Do they? An arch with built-in lane predication (like AVX-512) could easily implement wide SIMD on top of a narrower ALU and then skip the masked-out lanes. The actual runtime would then depend on the number of unmasked lanes.
I'm not up to date on GPU architectures, but I wouldn't be surprised if they do this sort of thing.
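To make the idea concrete, here is a toy cost model in Python. It is only a sketch and does not describe any real CPU or GPU: the 16-lane vector, the 4-lane ALU, and both function names are assumptions made up for illustration. The first function skips a chunk only when all of its lanes are masked out; the second assumes the hardware could pack the active lanes densely, which is the case where runtime tracks the number of unmasked lanes most directly.

```python
def masked_op_cycles(mask: int, vector_lanes: int = 16, alu_lanes: int = 4) -> int:
    """Cycles for one masked vector op when fully masked-out chunks are skipped.

    Hypothetical model: the vector is issued in chunks of `alu_lanes`,
    and each chunk costs one cycle unless every lane in it is masked out.
    """
    chunk_mask = (1 << alu_lanes) - 1
    cycles = 0
    for start in range(0, vector_lanes, alu_lanes):
        # The chunk is issued only if at least one of its lanes is active.
        if (mask >> start) & chunk_mask:
            cycles += 1
    return cycles


def compacted_op_cycles(mask: int, vector_lanes: int = 16, alu_lanes: int = 4) -> int:
    """Cycles if active lanes could be packed densely before being issued.

    Hypothetical model: cost is ceil(active_lanes / alu_lanes).
    """
    active = bin(mask & ((1 << vector_lanes) - 1)).count("1")
    return -(-active // alu_lanes)  # ceiling division


if __name__ == "__main__":
    # Columns: mask, chunk-skipping cycles, compacted cycles.
    for mask in (0xFFFF, 0x00FF, 0x1111, 0x0001, 0x0000):
        print(f"{mask:#06x}", masked_op_cycles(mask), compacted_op_cycles(mask))
```

The interesting case is a sparse mask like `0x1111`: chunk skipping saves nothing because every chunk still has one active lane, while the compacting model needs a single cycle. Which of the two (if either) a given piece of hardware resembles is exactly the open question.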