The other extensions do not trigger it, not even AVX256.
AVX-512 is not always a win, and you don't even know until you try on the particular hardware.
Agner Fog's microarchitecture.pdf says about Ryzen: “There is no penalty for mixing AVX and non-AVX vector instructions on this processor.”
Not sure if it applies to Zen 2, but I've been using one for a year for my work (AVX 1 & 2 included); I think I would have noticed.
Just because they are split doesn't mean they run sequentially. Zen 1 can handle up to 4 floating-point micro-ops per cycle, and there are 4 floating-point execution units, each 128 bits wide (that's excluding load/store; these 4 EUs only compute).
Native 256-bit instructions are even faster due to fewer micro-ops and potentially more in-flight instructions, but I'm pretty sure even on Zen 1 AVX is faster than SSE.
However, when instruction decode/retire is the bottleneck, AVX can be faster. I remember this being the case on Intel Sandy Bridge (first-gen AVX, double pumped, retires 3 instructions/cycle), where AVX can sometimes be faster (usually it's not that different).
With recent CPUs from both Intel and AMD able to decode/retire at least 4 instructions per cycle, this has largely ceased to be the case.
Yes. Another possible reason is instructions without SSE equivalents. I remember working on some software where the AVX2 broadcast load instruction helped substantially.
For general guidelines on when to use AVX-512, this (older) post remains the best guide I've seen: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...
Among other things, it depends on the workload and the exact processor. You can find plenty of cases where AVX-512 makes things faster. You can also find cases where the entire system slows down because it is running sections of AVX-512 code here and there—apparently, on certain Intel processors, the CPU needs to power up the top 256 bits of the register files and interconnects, and to get full speed for AVX-512 it will alter its voltage and clock speed. This reduces the speed of other instructions and even of other cores on the same die (which may be surprising).
While the specifics may be new, the generalities seem familiar—it has long been true that a well-intentioned improvement to a small section of your code base can improve performance locally while degrading overall system performance. The code that you're working on occupies a smaller and smaller slice of your performance metrics, and meanwhile the whole system is slowing down. There are many reasons this can happen; dynamic frequency scaling with AVX-512 is just one more.
Not certain Intel processors: all of them. The CPU will reduce its clocks, reducing its overall performance.
Using AVX256 or AVX512 can easily be a net negative for performance, depending on the software, the input data, and the other processes running on the system.
Unless you are certain you have enough data to offset the downclock, don't use them.
When that's the case, especially if the numbers being crunched are 32-bit floats, there's not much point in doing it on the CPU at all; GPGPUs are way more efficient for such tasks.
However, imagine sparse matrix * dense vector multiplication. If you rarely have more than 4 consecutive non-zero elements in the rows of the input matrix, with large gaps between them, moving from SSE to AVX or AVX-512 will decrease performance—you'll just be wasting electricity multiplying by zeros.
This also ignores the fact that none of these penalties come into play if you use AVX-512 instructions with 256-bit or 128-bit vectors. (That still brings significant benefits: a much nicer set of shuffles, dedicated mask registers, etc.)