
I am not talking about ease of use, but about the downclock.

The other extensions do not trigger it, not even AVX256.

AVX512 is not always a win, and you don't even know until you try it on particular hardware.

The 256-bit vector instructions do trigger a downclock, but not as severe as the AVX512 downclock.

You are 100% right, it is AVX1 I was thinking about (though nowadays I am not sure whether that triggers a downclock either).

I don’t think that applies to modern AMD processors, though.

Agner’s microarchitecture.pdf says about Ryzen “There is no penalty for mixing AVX and non-AVX vector instructions on this processor.”

Not sure if it applies to Zen 2, but I’ve been using one for a year for my work, AVX 1 & 2 included; I think I would have noticed.

AMD processors used to implement AVX instructions by double-pumping them, using only 128-bit vector ALUs. This means there's no clock penalty, but there's also no speedup over an SSE instruction by doing so. I don't know if this is still the case with the newest µarchs though.

> but there's also no speedup over an SSE instruction by doing so

Just because they are split doesn’t mean they run sequentially. Zen 1 can handle up to 4 floating-point micro-ops/cycle, and there are 4 floating-point execution units, each 128 bits wide (that’s excluding load/store, these 4 EUs only compute).

Native 256-bit units are even faster due to fewer micro-ops and potentially more in-flight instructions, but I’m pretty sure even on Zen 1 AVX is faster than SSE.

It depends. If the EUs are actually the bottleneck, then SSE and AVX wouldn't differ in speed.

However, when instruction decode/retire is the bottleneck, AVX can be faster. I remember this being the case on Intel Sandy Bridge (first-gen AVX, double pumped, retire 3 instructions/cycle), where AVX can sometimes be faster (usually it's not that different).

With recent CPUs from both Intel and AMD able to decode/retire at least 4 instructions per cycle, this really ceases to be the case.
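
To make the bottleneck argument concrete, here is a toy model in Python. Everything in it is an assumption for illustration: the `cycles` helper, the widths (3 or 4 instructions decoded per cycle, 4 micro-ops executed per cycle), and the idea that a double-pumped 256-bit instruction costs exactly 2 micro-ops. It ignores loads, stores, latency and dependencies, so treat it as a sketch of the reasoning, not a model of any real core.

```python
def cycles(n_floats, floats_per_instr, uops_per_instr, decode_width, exec_width):
    """Cycles to process n_floats when limited by either decode or execution.

    All parameters are hypothetical knobs, not measured hardware values.
    """
    instrs = n_floats / floats_per_instr      # instructions the front end must decode
    uops = instrs * uops_per_instr            # micro-ops the execution units must run
    decode_cycles = instrs / decode_width
    exec_cycles = uops / exec_width
    return max(decode_cycles, exec_cycles)    # the slower stage sets the pace


N = 960  # number of 32-bit floats to process (arbitrary)
for decode_width in (3, 4):
    sse = cycles(N, 4, 1, decode_width, 4)    # 128-bit SSE: 4 floats, 1 micro-op
    avx = cycles(N, 8, 2, decode_width, 4)    # double-pumped AVX: 8 floats, 2 micro-ops
    print(f"decode width {decode_width}: SSE {sse:.0f} cycles, AVX {avx:.0f} cycles")
```

With a decode width of 3 the SSE version is front-end bound and the double-pumped AVX version comes out ahead; at a width of 4 both are limited by the execution units and tie.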

> AVX can be faster

Yes. Another possible reason for that is instructions without SSE equivalents. I remember working on some software where the AVX2 broadcast load instruction helped substantially.

Why link to a 5 year old thread? There has to be more recent work.

There is more recent work. This blog post by Travis Downs is the most detailed analysis of transition behavior I've seen: https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

For general guidelines on when to use AVX-512, this (older) post remains the best guide I've seen: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...

So, are programs that are compiled with those instructions faster or slower? In my experience they have been faster.

Short answer: Yes, faster. Long answer: It depends, and you may be measuring the wrong thing.

Among other things, it depends on the workload and the exact processor. You can find plenty of cases where AVX512 makes things faster. You can also find cases where the entire system slows down because it is running sections of AVX512 code here and there. Apparently, for certain Intel processors, the processor needs to turn on the top 256 bits of the register files and interconnects, and to get full speed for AVX512 it will alter the processor’s voltage and clock speed. This reduces the speed of other instructions and even other cores on the same die (which may be surprising).

While the specifics may be new, the generalities seem familiar: it has long been true that a well-intentioned improvement to a small section of your code base can improve performance locally while degrading overall system performance. The code that you’re working on occupies a smaller and smaller slice of your performance metrics, and meanwhile, the whole system is slowing down. There are so many reasons that this can happen; dynamic frequency scaling with AVX512 is just one more.

> apparently, for certain Intel processors

Not certain Intel processors: all of them. The CPU will reduce its clocks, reducing its overall performance.

Using AVX256 or AVX512 is easily a net negative on performance, depending on the software, input data and other processes running in the system.

Unless you are certain you have enough data to offset the downclock, don't use them.
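
A back-of-the-envelope way to think about "enough data to offset the downclock", sketched in Python. The function names and the numbers (a 2x speedup on the vectorized part, the clock dropping to 85% of nominal) are made up for illustration, and the model pessimistically assumes the reduced clock applies to the whole run rather than only around the wide instructions.

```python
def net_speedup(vector_fraction, vector_speedup, clock_ratio):
    """Overall speedup versus the original code running at full clock.

    vector_fraction: share of the original runtime that gets vectorized (0..1)
    vector_speedup:  per-cycle speedup of that share (hypothetical)
    clock_ratio:     reduced clock / nominal clock (hypothetical)
    """
    relative_cycles = (1 - vector_fraction) + vector_fraction / vector_speedup
    relative_time = relative_cycles / clock_ratio   # fewer cycles, but slower ones
    return 1 / relative_time


def break_even_fraction(vector_speedup, clock_ratio):
    """Share of runtime that must be vectorized just to match the original speed."""
    return (1 - clock_ratio) / (1 - 1 / vector_speedup)


p = break_even_fraction(vector_speedup=2.0, clock_ratio=0.85)
print(f"break-even vectorized share: {p:.2f}")
print(f"net speedup at 10% vectorized: {net_speedup(0.10, 2.0, 0.85):.2f}")
print(f"net speedup at 80% vectorized: {net_speedup(0.80, 2.0, 0.85):.2f}")
```

Under these assumed numbers, code that spends only a small share of its time in the wide instructions ends up slower overall, which is the case being warned about.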

Different causes, similar consequences.

Porting SSE to AVX code (with equivalent instructions and proper vzeroupper) will increase performance in most cases (the only case where it can be slower, off the top of my head, is on Sandy Bridge). The same is not true for AVX to AVX512.

It will increase performance if you have a sufficient amount of dense data as input.

When that’s the case, especially if the numbers being crunched are 32-bit floats, there’s not much point in doing it on the CPU at all; GPGPUs are way more efficient for such tasks.

However, imagine sparse matrix * dense vector multiplication. If you rarely have more than 4 consecutive non-zero elements in rows of the input matrix, and large gaps between non-zero elements, moving from SSE to AVX or AVX512 will decrease performance; you’ll just be wasting electricity multiplying by zeros.
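
A small Python sketch of the "multiplying by zeros" effect. It assumes a blocked layout where each vector multiply covers one aligned block of `width` columns; the `lane_utilization` helper and the example row are hypothetical, and real sparse formats differ.

```python
def lane_utilization(nonzero_cols, width):
    """Return (vector_ops, useful_fraction) for one sparse row.

    nonzero_cols: distinct column indices holding non-zero values
    width:        floats per vector (4 for SSE, 8 for AVX, 16 for AVX512)
    """
    # Every aligned block containing at least one non-zero costs one vector multiply.
    blocks = {col // width for col in nonzero_cols}
    vector_ops = len(blocks)
    useful_fraction = len(nonzero_cols) / (vector_ops * width)
    return vector_ops, useful_fraction


# One run of 4 non-zeros, then small runs separated by large gaps.
row = [0, 1, 2, 3, 40, 41, 100]
for width, name in ((4, "SSE"), (8, "AVX"), (16, "AVX512")):
    ops, useful = lane_utilization(row, width)
    print(f"{name}: {ops} vector multiplies, {useful:.0%} of lanes useful")
```

For this row the number of vector multiplies stays at 3 for all three widths, while the useful share of lanes drops from 7/12 to 7/24 to 7/48, so the wider registers only add wasted work.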

So in some sense very similar to SKX behavior? The first iteration of the instruction implementation requires judicious use of instructions, while later implementations do not (this is something to be upset about: those "later implementations" should have been available quite some time ago).

This is also ignoring the fact that none of these penalties come into play if you use the AVX512 instructions with 256-bit or 128-bit vectors. (This still has significant benefits due to the much nicer set of shuffles, dedicated mask registers, etc.)

AVX to AVX512 will "increase performance in most case"s. https://www.researchgate.net/figure/Speedup-from-AVX-512-ove...
