Traditionally, autovectorization processes a prefix of the array with a vectorized loop and handles the remaining elements with a scalar loop. For example, 108 items might be processed by 6 iterations of a vectorized loop that handles 16 elements at a time, and then the last 12 (108 - 6×16) are done with 12 iterations of a scalar loop.
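To make that split concrete, here is a rough sketch in plain C of the loop shape a vectorizer traditionally produces. It is only an illustration: `add_one_split`, `LANES`, and the two iteration counters are names I made up, and the inner 16-step loop merely stands in for what would be a single vector instruction in real generated code.

```c
#include <stddef.h>

enum { LANES = 16 };  /* assumed vector width, matching the 16-element example */

/* Adds 1.0f to each of n floats and reports how many trips each loop made.
 * Hypothetical helper for illustration, not actual compiler output. */
void add_one_split(float *a, size_t n, size_t *vec_iters, size_t *scalar_iters)
{
    size_t i = 0;
    *vec_iters = 0;
    *scalar_iters = 0;

    /* Main loop: each trip stands in for one 16-wide vector operation. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++)
            a[i + lane] += 1.0f;
        ++*vec_iters;
    }

    /* Scalar remainder loop: one element per iteration. */
    for (; i < n; i++) {
        a[i] += 1.0f;
        ++*scalar_iters;
    }
}
```

With `n = 108` this makes 6 trips through the main loop and 12 through the remainder loop; with `n = 12` the main loop never runs and all 12 elements go through the scalar loop.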

As far as I understand, this change makes those 12 leftover elements get handled by a single masked iteration. That matters most if, instead of 108 elements, you have just 12: what was previously 12 loop iterations is now a single bit of code.
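A masked tail could look roughly like the sketch below, again in plain C with made-up names (`add_one_masked`, `VEC_WIDTH`). The `active` flag plays the role of a per-lane mask bit; on real hardware the whole tail block would be one masked vector operation rather than a loop, and I'm assuming that is the shape the change produces rather than quoting its output.

```c
#include <stdbool.h>
#include <stddef.h>

enum { VEC_WIDTH = 16 };  /* assumed vector width */

/* Adds 1.0f to each of n floats using full-width trips plus at most one
 * masked trip for the tail. Hypothetical helper for illustration only. */
void add_one_masked(float *a, size_t n, size_t *full_iters, size_t *masked_iters)
{
    size_t i = 0;
    *full_iters = 0;
    *masked_iters = 0;

    /* Full-width trips, same as the traditional main loop. */
    for (; i + VEC_WIDTH <= n; i += VEC_WIDTH) {
        for (size_t lane = 0; lane < VEC_WIDTH; lane++)
            a[i + lane] += 1.0f;
        ++*full_iters;
    }

    /* Tail: one trip in which only the first (n - i) lanes are active. */
    if (i < n) {
        size_t remaining = n - i;
        for (size_t lane = 0; lane < VEC_WIDTH; lane++) {
            bool active = lane < remaining;  /* per-lane mask bit */
            if (active)
                a[i + lane] += 1.0f;         /* inactive lanes touch nothing */
        }
        ++*masked_iters;
    }
}
```

For `n = 12` this does zero full trips and one masked trip, and elements past index 11 are left alone; for `n = 108` it does 6 full trips and one masked trip.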