GCC Lands AVX-512 Fully-Masked Vectorization

gumby · 2023-06-19T17:38:04

Note an interesting implication of the mask: at the most recent C++ committee meeting, the library working group working on simd<T> discussed whether == should return a boolean, as it typically does, or a mask.

I don't know if they decided, and if so how, based on this trip report: https://www.think-cell.com/en/career/devblog/trip-report-sum...

nequo · 2023-06-19T14:31:49

Does someone have an ELI5 on the difference between fully masked and not fully masked vectorization?

adrian_b · 2023-06-19T14:57:11

Fully masked means using AVX-512 in the way it is supposed to be used.

Whenever a loop processes data whose size is not a multiple of the AVX-512 vector size and/or which is not suitably aligned, it may be necessary to have a loop prologue and/or a loop epilogue.

For AVX-512 it is always possible to implement the prologue and/or epilogue simply by generating different SIMD lane masks for the first and/or the last iterations (when a lane is masked the operation is not executed for that lane, so a non-aligned vector with any number of elements smaller than the maximum number can be processed).

For legacy reasons the current behavior of gcc is far from optimal. Instead of computing the required masks it inserts additional code as prologue and/or epilogue, consisting of AVX or AVX2 instructions.

This work corrects this behavior, resulting in improved performance for loops with few iterations, where the execution time of the prologue or epilogue code becomes non-negligible.
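To make the idea concrete, here is a minimal scalar sketch of a fully masked loop, assuming 16 lanes of 32-bit integers per 512-bit vector. It is plain C that emulates the masks, not what gcc actually emits, and the names `LANES`, `tail_mask`, `masked_add_block` and `add_fully_masked` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16  /* assumption: 16 x 32-bit lanes, as in one 512-bit vector */

/* Hypothetical helper: bit i is set when lane i is still inside the array. */
uint16_t tail_mask(size_t remaining)
{
    return remaining >= LANES ? (uint16_t)0xFFFF
                              : (uint16_t)((1u << remaining) - 1u);
}

/* Scalar stand-in for one masked vector add: inactive lanes are not touched. */
void masked_add_block(int32_t *dst, const int32_t *a, const int32_t *b,
                      uint16_t mask)
{
    for (int lane = 0; lane < LANES; ++lane)
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];
}

/* Every iteration, including the last partial one, goes through the same
   masked body, so there is no separate epilogue loop.
   Returns the number of emulated vector iterations. */
size_t add_fully_masked(int32_t *dst, const int32_t *a, const int32_t *b,
                        size_t n)
{
    size_t iterations = 0;
    for (size_t i = 0; i < n; i += LANES, ++iterations)
        masked_add_block(dst + i, a + i, b + i, tail_mask(n - i));
    return iterations;
}
```

With n = 18, the first iteration runs with all 16 lanes active and the second with only the low 2 lanes active; elements past the end of the array are never read or written.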

slashdev · 2023-06-19T15:02:14

While this is true, the masking is more important for control flow in the loops. You can sometimes replace if/else statements with the masked registers allowing loops with simple conditionals to be vectorized.
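A rough sketch of what replacing an if/else with a mask looks like, written as scalar C that emulates one 16-lane vector iteration. The function name `if_converted_block` and the example condition are invented for illustration; this is not a claim about what any particular compiler generates.

```c
#include <stdint.h>

#define LANES 16  /* assumption: 16 x 32-bit lanes per vector */

/* Emulates one vector iteration of
       y[i] = (x[i] > 0) ? x[i] * 2 : -x[i];
   with the branch turned into a per-lane selection. */
void if_converted_block(int32_t *y, const int32_t *x)
{
    uint16_t mask = 0;
    int32_t then_val[LANES], else_val[LANES];

    /* Step 1: a vector compare produces one mask bit per lane. */
    for (int lane = 0; lane < LANES; ++lane)
        mask |= (uint16_t)((x[lane] > 0) << lane);

    /* Step 2: both sides of the if/else are computed for all lanes. */
    for (int lane = 0; lane < LANES; ++lane) {
        then_val[lane] = x[lane] * 2;
        else_val[lane] = -x[lane];
    }

    /* Step 3: the mask picks a side per lane, like a masked move or blend. */
    for (int lane = 0; lane < LANES; ++lane)
        y[lane] = ((mask >> lane) & 1) ? then_val[lane] : else_val[lane];
}
```

Each of the three inner loops stands in for a single vector instruction, so the data-dependent branch disappears from the loop body.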

adrian_b · 2023-06-19T15:08:48

What you say is right, but it is not related to this gcc announcement, which refers strictly to replacing the AVX2 prologue and epilogue code, which was still being generated even when the loop was vectorized with AVX-512 instead of AVX2.

I do not know if gcc is already smart enough to use masking for control flow, but the ispc compiler, which is also free, does it, allowing one to write programs in CUDA style for CPUs with AVX-512, like AMD Zen 4.

alecco · 2023-06-19T15:13:49

Actual ELI5:

SIMD, or Single Instruction, Multiple Data, allows the computer to do the same operation (like adding numbers) on many pieces of data all at once. This is just like if you had multiple pairs of numbers and you could add all pairs together at the same time.

So instead of doing:

    1 + 5 = 6
    2 + 6 = 8
    3 + 7 = 10
    4 + 8 = 12

All separately, the computer can do in one go:

    (1,2,3,4) + (5,6,7,8) = (6,8,10,12)

But what if we don't want to add all pairs of numbers? What if we wanted to keep one of the numbers from the first group the same, and not add anything to it? That's where "masking" comes in.

Using the mask (true/yes, true/yes, false/no, true/yes), we tell the computer to only add the pairs where the mask says 'yes', and skip where it says 'no'.

So, instead of getting (6,8,10,12), we get (6,8,3,12) because we told the computer to skip adding anything to the third number in the first group.

    sum_masked(mask=(true,true,false,true), a=(1,2,3,4), b=(5,6,7,8)) = (6,8,10,12)

This saves a lot of work, and the savings matter more as the vectors get larger. AVX-512 vectors are very large, so this is significant.
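For anyone who wants to try the masked add above, a minimal C version could look like this. The signature of `sum_masked` and the fixed width `N` are assumptions made for the sketch; real AVX-512 code would use a mask register rather than an array of bools.

```c
#include <stdbool.h>

#define N 4  /* four lanes, matching the example above */

/* Adds b to a only where the mask is true.
   Lanes whose mask entry is false keep the value from a. */
void sum_masked(const bool mask[N], const int a[N], const int b[N], int out[N])
{
    for (int i = 0; i < N; ++i)
        out[i] = mask[i] ? a[i] + b[i] : a[i];
}
```

Called with mask (true, true, false, true), a = (1,2,3,4) and b = (5,6,7,8), it fills out with (6,8,3,12).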

(with rephrasing from https://chat.openai.com/?model=gpt-4)

nequo · 2023-06-19T18:16:51

  >   sum_masked(mask=(true,true,false,true), a=(1,2,3,4), b=(5,6,7,8)) = (6,8,10,12)

Your explanation reads very clearly but should the result here be (6,8,3,12) instead of (6,8,10,12)?

alecco · 2023-06-19T20:52:15

Indeed! thanks.

dzaima · 2023-06-19T15:02:16

Traditionally, autovectorization works by doing some prefix of the array vectorized, and the remaining elements with a scalar loop, e.g. 108 items may be processed by 6 iterations of a vectorized loop that processes 16 elements at a time, and then the last 12 (108-6×16) are done with 12 iterations of a scalar loop.

This change, as far as I understand, handles those 12 with a single masked iteration. This is especially important if, instead of 108 elements, you had just 12: what previously took 12 loop iterations is now a single bit of code.
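The iteration counts in this example can be checked with a small sketch, assuming 16 elements per vector. The function names are invented here, and the numbers count loop iterations only, not cycles.

```c
#include <stddef.h>

#define LANES 16  /* assumption: 16 elements per vector iteration */

/* Traditional scheme: full vector iterations, then one scalar
   iteration per leftover element. */
size_t iterations_with_scalar_tail(size_t n)
{
    return n / LANES + n % LANES;
}

/* Fully masked scheme: full vector iterations, plus at most one
   masked iteration covering the leftover elements. */
size_t iterations_fully_masked(size_t n)
{
    return (n + LANES - 1) / LANES;
}
```

For 108 elements this gives 6 + 12 = 18 iterations against 7, and for 12 elements it gives 12 against 1.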

fulafel · 2023-06-19T14:58:18

Masking supported by the instruction set makes it easier to do per-lane control flow, like "if else", familiar from the shader programming model (and Intel's ISPC) where you write a function that operates on one datum, and it gets compiled into code where individual SIMD lanes may be masked depending on which side of a data-dependent branch they are executing.

(I don't know what the GCC change is though)

gpderetta · 2023-06-19T14:51:00

AVX-512 is the first Intel ISA extension that supports masking (i.e. each lane can be enabled or disabled independently), which allows for somewhat generalized vectorization. Without masking, vectorization needs to be performed ad hoc and masking possibly emulated (which is hard to do for loads and stores).