(Intel Hotchips slide: https://images.anandtech.com/doci/15984/202008171757161.jpg)
I don't think anyone ever said that.
What people are really saying is that AVX-512, in its current form, is limited by clock speed, thermals, and most importantly market segmentation.
And given that all three of those things improve only slightly over the current three-year roadmap, the instruction set is effectively rendered useless. (But the latest Hot Chips information seems to suggest they are well aware of it, so the roadmap can change.)
And even more importantly, that is in reference to Intel's current stack of issues.
Yeah, Torvalds said pretty much exactly that, in his typically boisterous form.
That it is "only useful in special cases", "only exists to make Intel look better on benchmarks", and that "the transistors would be better spent elsewhere".
Are you disagreeing with the parent comment or not? Are you maybe saying that the instruction set is useless in chips that are in computers right now, but will be useful in chips released in the future?
I mostly agree, but for me the frequency scaling is almost irrelevant. Here’s a link, expand “Other settings” and you’ll see: https://store.steampowered.com/hwsurvey
If you implement AVX2, the code will run on 76% of PCs. When you want good performance, such optimization is a good use of limited resources.
AVX512 is only at 0.42%. That's why it's useless to implement AVX512 except for servers, where you have full control over the hardware. And with the current state of Intel, it's not necessarily a good idea even for servers: it's quite possible that in a couple of years you'll want to switch to AMD.
From what? Automatic vectorization is very limited; it's only good for purely vertical algorithms.
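For example (a minimal sketch with made-up function names): the first loop below is purely vertical, and gcc/clang will auto-vectorize it at -O3, while the second, with its data-dependent lookup and ordered float reduction, typically stays scalar:

    #include <cstddef>

    // Vertical: every element is independent, same operation per lane.
    // Auto-vectorizes cleanly at -O3 (-mavx2 just picks the width).
    void scale_add(float* out, const float* a, const float* b, std::size_t n) {
        for (std::size_t i = 0; i < n; i++)
            out[i] = a[i] * 2.0f + b[i];
    }

    // Not vertical: data-dependent table lookup plus an ordered float
    // reduction. Compilers leave this scalar unless you allow reassociation
    // (-ffast-math) and the target has gathers -- and even then it's often
    // not profitable.
    float lut_sum(const float* table, const int* idx, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; i++)
            sum += table[idx[i]];
        return sum;
    }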
> and dynamically choose depending on CPU capabilities
Only the Intel compiler does that; the rest of them don't.
Sometimes I ship multiple versions of binaries. Other times I do runtime dispatch myself; it's only a couple of lines of code: __cpuid(), then cache some function pointers or abstract-class pointers. It's even possible to compile these implementations from the same C++ source, using templates, macros, and/or something else.
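Roughly like this (a sketch, MSVC-flavored since __cpuid was mentioned; on gcc/clang you'd use <cpuid.h>'s __get_cpuid_count instead, and the transform_* kernels are hypothetical):

    #include <intrin.h>   // __cpuid / __cpuidex (MSVC)
    #include <cstddef>

    // Hypothetical kernels, built for different targets (possibly stamped
    // out from the same C++ template, one translation unit per /arch).
    void transform_sse2(float* p, std::size_t n);
    void transform_avx2(float* p, std::size_t n);

    using TransformFn = void (*)(float*, std::size_t);

    static TransformFn select_transform() {
        int r[4];
        __cpuid(r, 0);
        if (r[0] >= 7) {               // is leaf 7 available?
            __cpuidex(r, 7, 0);
            if (r[1] & (1 << 5))       // CPUID.(7,0):EBX bit 5 = AVX2
                return transform_avx2; // production code should also check
        }                              // OSXSAVE/XGETBV for OS YMM support
        return transform_sse2;
    }

    // Resolve once and cache the pointer; every later call is direct.
    static const TransformFn transform = select_transform();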
It would have been nice to see the vcore and thermal values graphed as part of the benchmark. Do they increase faster for AVX-512 vs the other instruction sets?
I've had problems in the past with Sandy Bridge, where AVX hit thermal throttles before SSE did. I ended up having to disable AVX in my build because of it. Presumably, the same behaviour would be seen here, now that the vector unit is wider and there are more densely packed transistors flipping.
You use more power (and get hotter temps: these are exactly proportional, so you can mostly just talk about them as one) with wider vectors because you are doing more work. When you look at it on a per-element basis, you use less power per element with wider vectors. E.g., you might use 1 pJ per element for a 256-bit FMA but only 0.8 pJ for a 512-bit FMA.
Of course, since you can do 2x as many total elements in 512 bits on a 2-FMA machine, you can be more efficient yet use more total power, so you can hit TDP or thermal limits with 512-bit code that you wouldn't hit with 256-bit, but it should still be faster and more efficient per element.
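To put rough numbers on that (using the made-up pJ figures above): a 256-bit single-precision FMA is 8 lanes × 1 pJ = 8 pJ per instruction, while a 512-bit FMA is 16 lanes × 0.8 pJ = 12.8 pJ. So each 512-bit instruction draws ~60% more energy while each element costs 20% less, and since a 2-FMA machine retires twice as many elements per cycle, total power goes up even as efficiency improves.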
All of this assumes you can usefully use the 2x more work with the larger vectors. Sometimes the scaling is worse: e.g., for lots of short arrays, when a lookup table is involved, when additional shuffling or transposition is required with larger vectors, etc. In that case you could end up less efficient with larger vectors.
If the only issue with AVX-512 is thermal downclocking because you end up using more power, it's almost definitely because you are getting more work done per time. A few AVX-512 instructions in a mostly scalar workload is not going to significantly increase power dissipation and therefore should not induce thermal downclocking, while a heavily utilized AVX-512 kernel will burn power, but should also be doing work twice as fast per instruction.
If I had a little background daemon that used AVX-512 just because it seemed cool or showed up in the Hot Chips presentation, and that bonked my overall system performance, that would be annoying.
It's also annoying because Intel benchmarks with no mitigations. So what can happen is you think you should be seeing X performance, and then with mitigations applied and some AVX-512 instructions hitting, you get Y performance.
And it looks like they've reduced how often a single instruction will cause a lockup as the core shifts to a different power level. But until they've eliminated that issue, it's still scary to toss in a few AVX-512 instructions.
> There is a transition period (the rightmost of the two shaded regions, in orange) of ~11 μs where the CPU is halted: no samples occur during this period. For fun, I'll call this a frequency transition.
It stops executing for roughly 35 thousand cycles (~11 μs at ~3.2 GHz). I call that a "lockup" "as it shifts".
From later in the same post:
> Here, we have the worst case scenario of transitions packed as closely as possible, but we lose only ~20 μs (for 2 transitions) out of 760 μs, less than a 3% impact. The impact of running at the lower frequency is much higher: 2.8 vs 3.2 GHz: a 12.5% impact in the case that the lowered frequency was not useful (i.e., because the wide SIMD payload represents a vanishingly small part of the total work).
Interestingly enough, this is another feature that is supposed to have been improved on server Icelake. The frequency transition halt time is now pretty much negligible. The "core frequency transition block time" goes from ~12 us on CLX (similar to the number quoted above) to ~0 us on ICX.
(Slide with frequency transition info: https://images.anandtech.com/doci/15984/202008171754441.jpg)
Yes, if you insert a single 512-bit FMA that runs every so often in your code you will get a 15% performance hit from the lower frequency, but that's much less likely than the old case where people who were trying to use AVX-512 for memcpy and the like would slow down scalar code.
Now that the older and bigger case is fixed, this case remains the last sticking point. Because you still can't trust the CPU to do the right thing when there are a small number of heavy instructions. Even if they cut the halting time to 0, it's still bad for a single instruction to cause a prolonged downclock.
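If you want to watch that happen on your own machine, here's a rough sketch (assumes an AVX-512 CPU and -O2 -mavx512f; the loop count and FMA cadence are arbitrary, and the methodology is far cruder than the linked article's):

    #include <immintrin.h>
    #include <chrono>
    #include <cstdio>

    // Dependent scalar chain, so wall time tracks core frequency.
    static unsigned long scalar_work(long iters) {
        unsigned long x = 1;
        for (long i = 0; i < iters; i++)
            x = x * 3 + 1;
        return x;
    }

    int main() {
        using clk = std::chrono::steady_clock;
        const long N = 500'000'000;

        // Baseline: pure scalar.
        auto t0 = clk::now();
        volatile unsigned long sink = scalar_work(N);
        double scalar_ms =
            std::chrono::duration<double, std::milli>(clk::now() - t0).count();

        // Same chain, plus one 512-bit FMA every ~65k iterations: a
        // vanishingly small part of the work, but possibly enough to keep
        // the core pinned in the slower license.
        __m512 v = _mm512_set1_ps(1.0f);
        auto t1 = clk::now();
        unsigned long x = 1;
        for (long i = 0; i < N; i++) {
            x = x * 3 + 1;
            if ((i & 0xFFFF) == 0)
                v = _mm512_fmadd_ps(v, v, v);
        }
        sink = x;
        volatile float fsink = _mm512_reduce_add_ps(v);  // keep v alive
        double mixed_ms =
            std::chrono::duration<double, std::milli>(clk::now() - t1).count();

        (void)sink; (void)fsink;
        printf("scalar: %.0f ms, with occasional 512-bit FMA: %.0f ms\n",
               scalar_ms, mixed_ms);
    }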
Some of that is from overclocking. Some is from old and/or defective power supplies. Some is from motherboard VRMs. And some, like the original Ryzen 1700X, is from bad SOC-internal power management.
At any rate, I have read forum posts reporting system failures caused by AVX, with either overcurrent or clock-rate changes crashing the system.
So it seems to have support. I wouldn't call that "misinformation."
There are plenty of people with both Intel and Ryzen systems that are straight-up broken and don't power on at all.
Defects happen. Improperly designed systems happen. Misconfigurations happen. While those situations are unfortunate for the small percentage of people experiencing them, they shouldn't be used to judge the platform's capabilities as a whole.
> So it seems to have support. I wouldn't call that "misinformation."
Using anonymous, largely unverifiable anecdotes posted on web forums as evidence for a population-wide problem is a textbook case of selection bias.
And if you're OK with that, let me throw in my own anecdote.
All of my systems, which include:
* Skylake, Coffee Lake, Haswell/Broadwell, Sandy Bridge/Ivy Bridge, and Westmere/Nehalem Intel processors,
* Zen 2, K10, and K8 AMD processors,
...have been able to reliably execute supported vector instructions (SSE, AVX, etc.) for extended periods of time without any problems.
It's just a term probably used by the designers to define the particular "regimes" a chip can operate in: if it is doing lightweight stuff, it can run at a lower voltage, "closer to the edge" if you will, because the worst-case dI/dt event is relatively small, so the maximum voltage drop is limited. This saves power and keeps the chip cooler.
If the chip wants to suddenly start running some heavy-duty 512-bit-wide floating-point multiplies, it needs to inform the power delivery subsystem of its intent and then wait for permission: this permission arrives in the form of a license, like "OK, you are now licensed to run L1 operations (heavy 256-bit or light 512-bit)".
Sometimes this license change can be satisfied by just bumping up the voltage without changing the frequency. Other times there might not be enough headroom to raise the voltage enough while staying within the operating envelope, so a combination of a voltage change and a lowered frequency is used, which is how you get license-based downclocking.
This earlier article explains in some detail the nature of these license transitions, including voltage levels and halted periods.
Excerpted from the Stack Overflow answer:
There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the box: when the chip says "3.5 GHz turbo", they are referring to the single-core L0 turbo. L1 is a lower speed sometimes called AVX turbo or AVX2 turbo, originally associated with AVX and AVX2 instructions. L2 is a lower speed than L1, sometimes called "AVX-512 turbo".
In my experience, the limiting factor tends to be power delivery. I had a 6950X that I overclocked, and I could get it to consistently hard crash just by starting an AVX task. My filesystems did not appreciate that! I eventually spent a ton of time debugging why that happened, and it was just that the power supply couldn't keep the 12V rail at 12V. Turns out that Intel knows what sort of equipment is out on the market and designed their chips accordingly. I did upgrade the power supply (1200W Corsair -> 850W Seasonic Prime) and got stable AVX without downclocking. But the whole experience killed overclocking for me; it just isn't worth the certainty that your computer will crash for no good reason at random times.
The term "license" comes from Intel terminology, I didn't invent it.
It has nothing to do with legal/software licensing, but rather the chip having permission (a "license") from the power management unit and voltage regulator to run certain types of instructions that might place a large amount of stress on the power delivery components.
Doing computation produces heat; doing more computation at once produces more heat. If you build a CPU that does 512 bits of math at a time at its max clock, then doing math only 32 bits at a time can run at a higher clock, because the chip stays a bit cooler. It's just physics.
I doubt Intel has validated the hardware outside its "license" levels. This looks like a constraint baked in much earlier in the design than anything they could reasonably expect to bin for.
* Intel Upgrade Service
I'd like to be able to recommend the right CPU to customers, but there just isn't any information out in public about this...
Even if I were to, say, test with some cloud VMs, even then there are confounding issues. The different VM size categories aren't just different in the CPU type only, there are other differences that'll make the benchmark difficult to interpret. Memory type, throttling, HT on/off, etc...
Why is it so difficult for Microsoft to simply say "AVX-512 supported" somewhere in their documentation?
This is like every TV vendor saying "HDMI" instead of "HDMI 2.1" or whatever. Just because the port looks the same doesn't mean that they're identical! Versions matter.
They used to publish reference designs with specific servers and components.
Well, if we're optimizing a terrible idea anyway...
The FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE counter _seems_ to be the hardware performance counter tracking 512-bit packed 32-bit-float instructions.
You wouldn't even need a debugger, just a reboot: turn on hardware performance counters in the BIOS and run a profiler that can read them. I don't know for sure that this is the counter we're looking for; there are a few other 512B counters there.
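On Linux, perf stat -e fp_arith_inst_retired.512b_packed_single ./app should do it, if your kernel knows that event alias. Failing that, here's an in-process sketch with perf_event_open (the 0xC7 event / 0x80 umask raw encoding is my reading of the Skylake-X event tables, so verify it for your part, and run_suspect_code is a stand-in for whatever you're inspecting):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    extern void run_suspect_code();  // hypothetical workload under test

    int main() {
        // FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE as a raw event:
        // event 0xC7, umask 0x80 (Skylake-X; double-check for your CPU).
        perf_event_attr attr{};
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x80C7;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        run_suspect_code();
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        printf("512-bit packed-single ops retired: %llu\n",
               (unsigned long long)count);
        close(fd);
    }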
Or, if SQL Server were open source, you could just go to their GitHub/GitLab/whatever repo and enter the name of a commonly used AVX-512 intrinsic in the search field?
Also, SQL server is definitely not open source lmao.
Pls correct me if wrong.
These days, we have full-fat EPYC or Threadripper to use, and even then it's only 256-bit vector units. CUDA is also way better, and NVidia has advanced dramatically, proving that CUDA is easier to code than people once thought. (Back in 2015, it was still "common knowledge" that CUDA was too hard for normal programmers.)
Intel's Xeon Phi was a normal CPU processor. It could run normal Linux, and it scaled like a GPU: each PCIe x16 card added another ~60 Xeon Phi cores to your box.
It was a commercial failure, but I wouldn't say it was worthless. NVidia just ended up making a superior product, by making CUDA easier-and-easier to use.
It was definitely a really cool hardware architecture, but the software ecosystem just wasn't there.
Turns out, performance-critical code is hard to write whether or not you have Linux. And I'm not entirely sure how Linux made things easier at all. I guess it's cool that you could run GDB, have filesystems, and all that stuff, but was that really needed?
CUDA shows that you can just run bare-metal code and have the host manage a huge number of the issues (even cudaMalloc is globally synchronized and "dumb" as a doornail: probably host-managed, if I were to guess).
Or alternatively, an AMD EPYC (64-cores / 128x PCIe lanes).
Yeah, NVidia CUDA makes a better coprocessor for deep learning and matrix multiplication. But a CPU-based coprocessor for adding extra cores to a system seems like it'd be better for some class of problems.
SIMD compute is great and all, but I kind of prefer to see different solutions in the computing world. I guess the classic 8-socket setup with Xeon 8180s is more straightforward (though expensive).
A Xeon Phi on its own motherboard is just competing with regular ol' Xeons. Granted, at a cheaper price... but it's too similar to normal CPUs.
Xeon Phi was probably trying to do too many unique things. It used HMC memory instead of GDDR5x or HBM (or DDR4). It was a CPU in a GPU form factor. It was a GPU (ish) running its own OS. It was just really weird. I keep looking at the thing in theory, and wondering what problem it'd be best at solving. All sorts of weird decisions, nothing else was ever really built like it.
If you can make your code so parallelizable it runs well on a Phi, it'll run extremely well on future CPUs because clocks won't get much higher, but core counts will.