Hacker News
Ice Lake AVX-512 Downclocking (travisdowns.github.io)
126 points by ingve 3 months ago | 82 comments

Hopefully this behavior change will help improve AVX-512 uptake and end the somewhat ridiculous conception people have that the instructions are entirely useless. Intel's HotChips presentation on Icelake-SP also indicated that the behavior will be significantly better on server chips, but the behavior is more instruction dependent with 3 license levels and only 512-bit instructions that utilize the FMA unit being subject to downclocking (by ~15-17% as opposed to ~27-29% on SKX derived chips).

(Intel Hotchips slide: https://images.anandtech.com/doci/15984/202008171757161.jpg)

> end the somewhat ridiculous conception people have that the instructions are entirely useless.

I don't think anyone ever said that.

What people are really saying is that AVX-512, in its current form, is limited by clock speed, thermals, and most importantly market segmentation.

And given that all three of these things are only slightly improved in the current three-year roadmap, the instructions are rendered useless. (But the latest Hot Chips information seems to suggest they are well aware of it, so the roadmap can change.)

And even more importantly, that is in reference to the current stack of Intel issues.

> I don't think anyone ever said that.

Yeah, Torvalds said pretty much exactly that, in his typically boisterous form.

That it is "only useful in special cases", "only exists to make Intel look better on benchmarks", and that "the transistors would be better spent elsewhere".


I think some subtlety has been lost on me. The parent comment said that people had the "ridiculous conception ... that the instructions are entirely useless". Then you said that no one thinks that, followed by some stuff I'll admit I didn't really understand, leading to the conclusion "... render the instruction useless".

Are you disagreeing with the parent comment or not? Are you maybe saying that the instruction is useless in chips that are in computers right now but will be useful in chips that will be released in future?

> the instructions are entirely useless

I mostly agree, but for me the frequency scaling is almost irrelevant. Here’s a link, expand “Other settings” and you’ll see: https://store.steampowered.com/hwsurvey

If you implement AVX2, the code will run on 76% of PCs. When you want good performance, such optimization is a good use of limited resources.

AVX512 is only at 0.42%. That’s why it’s useless to implement AVX512 except for servers where you have full control over hardware. And with current state of Intel, not necessarily a good idea even for servers, it’s quite possible in a couple of years you’ll want to switch to AMD.

Eh? Compilers can generate two versions of code -- AVX and non-AVX -- and dynamically choose depending on CPU capabilities. It has been this way for ages.

> Compilers can generate two versions of code

From what? Automatic vectorization is very limited, it’s only good for pure vertical algorithms.

> and dynamically choose depending on CPU capabilities

Only the Intel compiler does that; the rest of them don't.

Sometimes I ship multiple versions of binaries. Other times I do runtime dispatch myself; it's only a couple of lines of code, __cpuid() then cache some function pointers, or abstract class pointers. It's even possible to compile these implementations from the same C++ source, using templates, macros, and/or something else.

Makes sense - the main difference between ICL and ICL-X would seem to be 2 FMA units.

I don't think the book is closed. Thermal and TDP downclocking are still present.

It would have been nice to see the vcore and thermal values graphed as part of the benchmark. Do they increase faster for AVX-512 vs the other instruction sets?

I've had problems in the past with Sandy Bridge, where AVX hit thermal throttles before SSE. I ended up having to disable AVX in my build because of it. Presumably, the same behaviour would be seen here now that the vector unit is wider and there are more densely packed transistors flipping.

The book is definitely not closed, but the other limits are somewhat less problematic than the license-based downclocking.

You use more power (and get hotter temps: these are exactly proportional, so you can mostly just talk about them as one) with wider vectors because you are doing more work. When you look at it on a per-element basis, you use less power per element with wider vectors. E.g., you might use 1 pJ per element for a 256-bit FMA but only 0.8 pJ for a 512-bit FMA.

Of course, since you can process 2x as many total elements with 512-bit vectors on a 2-FMA machine, you can be more efficient yet use more total power, so you can hit TDP or thermal limits with 512-bit code that you wouldn't with 256-bit, but it should still be faster and more efficient per element.

All of this assumes you can usefully use the 2x more work with the larger vectors. Sometimes the scaling is worse: e.g., for lots of short arrays, when a lookup table is involved, when additional shuffling or transposition is required with larger vectors, etc. In that case you could end up less efficient with larger vectors.

To reiterate, the problem with the existing license-based downclocking is that a few AVX-512 operations can drop your frequency for subsequent scalar loads on the same core, so you need to carefully analyze the overall application to make sure that you have enough AVX-512 work over which you can amortize the loss in performance in the rest of your code that is affected by the frequency drop.

If the only issue with AVX-512 is thermal downclocking because you end up using more power, it's almost definitely because you are getting more work done per time. A few AVX-512 instructions in a mostly scalar workload is not going to significantly increase power dissipation and therefore should not induce thermal downclocking, while a heavily utilized AVX-512 kernel will burn power, but should also be doing work twice as fast per instruction.

Doing work faster is almost always going to consume more power and if you're already at the power or temperature limit (which is how most CPUs/GPUs operate now) then the frequency will have to reduce. This isn't automatically bad; ultimately what matters is performance. Did you really see lower performance with AVX than with SSE?

The problem is that the downclocking affects other cores. So a performance improvement in this one task can hurt performance on other threads, which is what happened to me.

Yeah, I think Intel is trying to fix this with Speed Select but it's complex enough that no one will probably use it.

At least on most new cores, the frequency is per-core. This isn't true on, for example, some Skylake client cores - but these don't have much SIMD related downclocking either.

The license-based downclocking is per-core on most new chips, but according to the linked blog post that doesn't really exist on Ice Lake anyway. Pretty sure thermal and TDP-based downclocking are across the entire package (all cores and the GPU) which in some sense might make them more objectionable than license-based downclocking.

It does not affect other cores on all CPUs; on most newer ones it doesn't.

Isn't the problem that if you have just a small % of your workload, or some random binaries on your system, executing a few AVX-512 instructions, they bonk the rest of your performance?

If I had a little background daemon that used AVX-512 because it was cool or in the Hot Chips presentation, and that bonked my overall system performance, that would be annoying.

It's also annoying because Intel benchmarks with no mitigations. So what can happen is you think you should be seeing X performance, and then with mitigations applied and some AVX-512 instructions hitting, you get Y performance.

Yes, that has been the problem – but the TLDR from this post is that it is no longer a meaningful problem, at least on the Ice Lake client CPUs, since AVX-512 no longer causes almost any downclocking.

Compilers are not at a point where they do a great job of leveraging SIMD in general, and definitely not at a point where they leverage AVX-512, but hand-written intrinsics with AVX-512 can attain amazing performance.

It definitely helps.

And it looks like they've reduced how often a single instruction will cause a lockup as the core shifts to a different power level. But until they've eliminated that issue, it's still scary to toss in a few AVX-512 instructions.

There is no "lockup". Quit spreading misinformation.


> There is a transition period (the rightmost of the two shaded regions, in orange) of ~11 μs where the CPU is halted: no samples occur during this period. For fun, I'll call this a frequency transition.

It stops executing for 35 thousand cycles. I call that a "lockup" "as it shifts".

This isn't the real problem though...the problem with the existing AVX-512 implementations is the "relaxation time" causes subsequent scalar code to be slow.

From later in the same post:

> Here, we have the worst case scenario of transitions packed as closely as possible, but we lose only ~20 μs (for 2 transitions) out of 760 μs, less than a 3% impact. The impact of running at the lower frequency is much higher: 2.8 vs 3.2 GHz: a 12.5% impact in the case that the lowered frequency was not useful (i.e., because the wide SIMD payload represents a vanishingly small part of the total work).

Interestingly enough, this is another feature that is supposed to have been improved on server Icelake. The frequency transition halt time is now pretty much negligible. The "core frequency transition block time" goes from ~12 us on CLX (similar to the number quoted above) to ~0 us on ICX.

(Slide with frequency transition info: https://images.anandtech.com/doci/15984/202008171754441.jpg)

Either way, it's a big problem that certain single instructions can cause this transition. When the transition is based on a usage threshold of heavy instructions, it's not so bad. And with this revision, more of the transitions are based on threshold. But there are still some instructions that cause an immediate frequency change, if I'm reading the articles right.

No...the whole point is that the single instruction induced halt for downclocking isn't a real issue. Even in the pathological case where you insert a single instruction spaced 760 us apart in order to induce the maximum number of clock shifts, the total performance degradation due to the clock halts is only 3% (the frequency drop that is induced by the use of these instructions has a much larger impact.) Furthermore, on Icelake-SP, the halted time due to frequency transitions is supposed to go to 0, which makes this aspect of the problem entirely irrelevant.

Yes, if you insert a single 512-bit FMA that runs every so often in your code you will get a 15% performance hit from the lower frequency, but that's much less likely than the old case where people who were trying to use AVX-512 for memcpy and the like would slow down scalar code.

But they fixed the old case, by having a minimum number of heavy instructions before changing clocks. If you have some instructions there just for the occasional memcpy, it will be a little slow during the memcpy but it won't downclock and the overall impact will be very small.

Now that the older and bigger case is fixed, this case remains the last sticking point. Because you still can't trust the CPU to do the right thing when there are a small number of heavy instructions. Even if they cut the halting time to 0, it's still bad for a single instruction to cause a prolonged downclock.

It is definitely a problem of defective hardware. But there are plenty of people out there with both Intel and Ryzen systems who disable every C-State and every Turbo setting because any change in power levels causes their system to crash. Lockup, bluescreen, kernel panic or just computation errors.

Some of that is from overclocking. Some is from old and/or defective power supplies. Some is from motherboard VRMs. And some, like the original Ryzen 1700X, is from bad SOC-internal power management.

At any rate, I have read forum posts reporting system failures caused by AVX. Either overcurrent or clockrate changes crashing it.

So it seems to have support. I wouldn't call that "misinformation."

> But there are plenty of people out there with both Intel and Ryzen systems who disable every C-State and every Turbo setting because any change in power levels causes their system to crash.

There are plenty of people with both Intel and Ryzen systems that are straight-up broken and don't power on at all.

Defects happen. Improperly designed systems happen. Misconfigurations happen. While those situations are unfortunate for the small percentage of people experiencing them, they shouldn't be used to judge the platform's capabilities as a whole.

> So it seems to have support. I wouldn't call that "misinformation."

Using anonymous, largely unverifiable anecdotes posted on web forums as evidence for a population-wide problem is a textbook case of selection bias.

And if you're OK with that, let me throw in my own anecdote.

All of my systems, which include:

* Skylake, Coffee Lake, Haswell/Broadwell, Sandy Bridge/Ivy Bridge, and Westmere/Nehalem Intel processors,

* Zen 2, K10, and K8 AMD processors,

...have been able to reliably execute supported vector instructions (SSE, AVX, etc.) for extended periods of time without any problems.

Does the word "license" actually mean "even though the hardware is capable, you didn't pay enough to not downclock" or something else here?

No, nothing like that.

It's just a term probably used by the designers to define the particular "regimes" a chip can operate in: if it is doing lightweight stuff it can run at a lower voltage, "closer to the edge" if you will, because the worst case dI/dt event is relatively small so maximum voltage drop is limited. This saves power and keeps the chip cooler.

If the chip wants to suddenly start running some heavy-duty 512-bit wide floating point multiplies, it needs to inform the power delivery subsystem of its intent and then wait for permission: this permission arrives in the form of a license, like "ok, you are now licensed to run L1 operations (heavy 256-bit or light 512-bit)".

Sometimes this license change can be satisfied by just bumping up the voltage without changing the frequency. Other times there might not be enough headroom to increase the voltage enough and still stay within the operating envelope so a combination of voltage and lowered frequency is used, which is how you get license-based downclocking.

This earlier article [1] explains in some detail the nature of these license transitions, including voltage levels and halted periods.


[1] https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

The three links in the third paragraph of the article describe the meaning of license in this context:




Excerpted from the Stack Overflow answer:

There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the box: when the chip says "3.5 GHz turbo", they are referring to the single-core L0 turbo. L1 is a lower speed sometimes called AVX turbo or AVX2 turbo, originally associated with AVX and AVX2 instructions. L2 is a lower speed than L1, sometimes called "AVX-512 turbo".

Clear as mud. If it means "we can't make these instructions work at nominal clock" rather than "we won't let you", why can server chips do it?

The server chips tend to be sold with a lower base clock than the consumer equivalents. If you underclock your consumer CPU to whatever the server equivalent currently is, you aren't going to see much throttling under any workloads. But you're also going to be the guy that has a 2.2GHz processor when all your friends claim to have a 5GHz processor.

In my experience, the limiting factor tends to be power delivery. I had a 6950x that I overclocked and I could get it to consistently hard crash just by starting an AVX task. My filesystems did not appreciate that! (I eventually spent a ton of time debugging why that happened, and it was just that the power supply couldn't keep the 12V rail at 12V. Turns out that Intel knows what sort of equipment is out on the market, and designed their chips accordingly. I did upgrade the power supply (1200W Corsair -> 850W Seasonic Prime) and got stable AVX without downclocking. But the whole experience killed overclocking for me. It just isn't worth the certainty that your computer will crash for no good reason at random times.)

Server chips have the same license-based limitations. In fact, the limitations first appeared on those chips and in some generations only on those chips.

The term "license" comes from Intel terminology, I didn't invent it.

It has nothing to do with legal/software licensing, but rather the chip having permission (a "license") from the power management unit and voltage regulator to run certain types of instructions that might place a large amount of stress on the power delivery components.

Note that HEDT Skylake-X all let you choose the clock speed for each license in the bios. It is definitely not a "we won't let you".

Server chips are sometimes already down clocked for stability or are binned for ability to handle more heat/voltage/clock, and so may have more headroom in some cases.

Doing computation produces heat, and doing more computation at once produces more heat. If you build a CPU doing 512 bits of math at a time and get it to max clock, you'll still be able to do math 32 bits at a time at a higher clock, because it's a bit cooler. It is just physics.

They can't.

A Google search doesn’t do much to define the term, but from context I believe it means a frequency allowance that’s baked into the CPU rather than one that’s enforced dynamically depending on power/thermals.

I doubt Intel has validated the hardware outside its “license”. This appears to be a much earlier design constraint than anything they could reasonably expect to bin.

The license is an open-loop, predetermined safe operating area for AVX. The other limits, like RAPL and temperature, are closed-loop control systems. I think we can imagine that the static SOA exists for all these parts, but in the newer part it's so large that it doesn't tread much on the territory of the closed-loop systems. Eventually the SOA will be large enough to be irrelevant in practice.

This is an excellent description. May I use, with credit, part of this text, perhaps edited or paraphrased, as a clarifier in the blog post?

Sure, feel free. I certainly don't need to be credited for the idea that semiconductors have limits.

Well yes, but you gave a good bit more detail than "that semiconductors have limits" ;).

That's what I assumed it means too. And the article said nothing to clarify, but now I learn from comments here it means something quite different...

If it does that's terrible

Sun or SGI used to do approximately that. They sold downclocked/locked core chips that could later be unlocked for a fee (from memory, time-limited fee).

Intel did that on low-end desktop CPUs once, but soon discontinued it after criticism.

* Intel Upgrade Service


At Hot Chips 32 this week, Intel mentioned that Tiger Lake Xeon with Sunny Cove core would only downclock if AVX-512 usage hit TDP limits.

TigerLake client uses Willow Cove cores. I would expect IceLake Xeon to use the Sunny Cove cores (same as IceLake client) and TigerLake Xeon to use Willow Cove, but I haven't seen evidence that is the case.

Well, other than precedent. We have never seen Intel suddenly surprise the market by using a different core type in a chip family's Xeon SKUs than was used in the mobile and desktop SKUs. And part of that (I conjecture, having been in the rodeo with them since 486 days) is that they like to use the consumer versions of the chips to suss out errata to be fixed before the Xeons are taped out and released. It is true that some features do differ, such as support for ECC RAM and perhaps activation of TSX or extra FMA unit, but these are really just variations on the same core family.

Given how late Ice Lake Xeons are, Tiger Lake Xeons are irrelevant.

Or more likely the 10nm process yield issues are the reason Ice Lake Xeons are late. One possibility is that Ice Lake Xeons are less relevant because they will be on the market for only a short period before being replaced by Sapphire Rapids. They got the Tiger Lake Xeons running in the lab in June, and it should take them about a year to get them into servers. And I doubt they would have manufacturing issues with Tiger Lake Xeons when they can manufacture Ice Lake Xeons. As for Intel being irrelevant in servers, I doubt they would go from over 90% market share with on-paper inferior products to irrelevant in a year.

That is MUCH better, it seems? Because then you don't randomly throw away lots of performance just because some AVX-512 instructions hit, so there is less risk in using AVX-512.

I wonder how many clock cycles that'd take ;-)

They made it sound like some instructions were more power hungry than others. The impression I got is that the unit can run some kinds of streams without reduction of clock.

One thing I really want to know is whether SQL Server's new vector-accelerated "Batch Mode" uses AVX2 only or if it also has AVX-512 code paths?

I'd like to be able to recommend the right CPU to customers, but there just isn't any information out in public about this...

There are a lot of different factors that can affect performance. I'd advise you to always benchmark.

I can't benchmark with a CPU I don't have, and I can't advise a customer to go off and buy a $50K server just for a "quick benchmark".

Even if I were to, say, test with some cloud VMs, even then there are confounding issues. The different VM size categories aren't just different in the CPU type only, there are other differences that'll make the benchmark difficult to interpret. Memory type, throttling, HT on/off, etc...

Why is it so difficult for Microsoft to simply say "AVX-512 supported" somewhere in their documentation?

This is like every TV vendor saying "HDMI" instead of "HDMI 2.1" or whatever. Just because the port looks the same doesn't mean that they're identical! Versions matter.

"AVX-512 supported" doesn't say much either.

They used to publish reference designs with specific servers and components.

you could rent an avx-512 vm in the cloud for a few hours?

This is a terrible suggestion, but you could always attach a debugger to find out

> This is a terrible suggestion

Well, if we're optimizing a terrible idea anyway...


The FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE performance counter _seems_ to be the hardware performance counter tracking 512-bit AVX-512 single-precision (32-bit float) instructions.

You wouldn't even need a debugger, just a reboot: turn on hardware performance counters in the BIOS and run a profiler that can read them. I don't know for sure if that's the performance counter we're looking for; there are a few other 512B counters there.

Or just disassemble the binary and grep for avx-512 or avx2 instructions ?

Or if SQL server is open source, just go to their github/gitlab/whatever repo, and enter the name of a commonly-used avx512 intrinsic on the search field ?

I figured it would be easier to find the specific area by debugging than disassembling, since it's likely easily tens to hundreds of megs.

Also, SQL server is definitely not open source lmao.

Given the process improvements in Tiger Lake - I wonder if this improves further, or at least all levels become somewhat faster?

I don't know this area, but to clarify (or mess things up, depending on my understanding), the downclocking is for FP-heavy work. AFAIU it doesn't occur for any integer work (maybe an exception for div?), and 512-wide simd integer instructions could be very, very useful.

Pls correct me if wrong.

Did the Xeon Phi also downclock when using AVX-512?

I haven't seen any numbers on that but there's literally zero reason to run a Xeon Phi without using AVX-512, so I'd assume no design considerations were taken to optimize the clock frequency for a non-AVX-512 use case.

There's almost zero reason to run a phi in general.

The Phi was an interesting computer. AVX512 on 60 cores back in 2015 was pretty nuts. CUDA wasn't quite as good as it is today (there have been HUGE advancements in CUDA recently).

These days, we have full-fat EPYC or Threadripper to use, and even then it's only 256-bit vector units. CUDA is also way better, and NVidia has advanced dramatically, proving that CUDA is easier to code than people once thought. (Back in 2015, it was still "common knowledge" that CUDA was too hard for normal programmers.)

Intel's Xeon Phi was a normal CPU processor. It could run normal Linux, and scale just like a GPU (each PCIe x16 slot added another 60 Xeon Phi cores to your box).

It was a commercial failure, but I wouldn't say it was worthless. NVidia just ended up making a superior product, by making CUDA easier-and-easier to use.

I was using CUDA heavily in 2015, and I also looked at the first/second gen of the Xeon Phi at the time. I thought it was much harder to program for than cuda was at the time (and certainly that gap has widened). I recall things like a weird ring topology between cores that you may have had to pay attention to, the memory hierarchies (you kind of do this with CUDA, but I remember it being NUMA-like), as well as the transfers to and from the host CPU were harder/synchronous compared to CUDA.

It was definitely a really cool hardware architecture, but the software ecosystem just wasn't there.

Xeon Phi was supposed to be easy to program for, because it ran Linux (albeit an embedded version, but it was straight up Linux).

Turns out, performance-critical code is hard to write, whether or not you have Linux. And I'm not entirely sure how Linux made things easier at all. I guess it's cool that you could run GDB, had filesystems, and all that stuff, but was that really needed?


CUDA shows that you can just run bare-metal code, and have the host manage a huge amount of the issues (even cudaMalloc is globally synchronized and "dumb" as a doornail: probably host-managed if I were to guess).

That's right -- I always wished they made a Phi with PCIe connections out to other peripherals. Imagine a Phi host that could connect to a GPU to offload things it was better at.

That looks like 28 cores, and I think Phi went to 72 cores (or 144 with HT). Of course, the Phi was clocked much lower. The AMD is definitely more comparable.

Well... they did. That's basically called a Xeon 8180. :-)

Or alternatively, an AMD EPYC (64-cores / 128x PCIe lanes).

Now I'm remembering... They had the Phi as a coprocessor in a PCI slot, effectively giving it the same issues as a GPU. But the second gen (Knights Landing) made the Phi the host processor, while removing almost all ability to attach external devices. It had potential, I think, but it was a weird transition from v1 to v2.

I actually found the Coprocessor more interesting.

Yeah, NVidia CUDA makes a better coprocessor for deep learning and matrix multiplication. But a CPU-based coprocessor for adding extra cores to a system seems like it'd be better for some class of problems.

SIMD compute is great and all, but I kind of prefer to see different solutions in the computer world. I guess that the classic 8-way socket with Xeon 8180 is more straightforward (though expensive).


A Xeon Phi on its own motherboard is just competing with regular ol' Xeons. Granted, at a cheaper price... but it's too similar to normal CPUs.

Xeon Phi was probably trying to do too many unique things. It used HMC memory instead of GDDR5x or HBM (or DDR4). It was a CPU in a GPU form factor. It was a GPU (ish) running its own OS. It was just really weird. I keep looking at the thing in theory, and wondering what problem it'd be best at solving. All sorts of weird decisions, nothing else was ever really built like it.

Agreed! That's why I was bummed when the second-gen was a host system. Didn't fit well to my use case.

I always suggested making software run well on a Phi would be valid research for making it run well on future Xeon Scalable and Core i9.

If you can make your code so parallelizable it runs well on a Phi, it'll run extremely well on future CPUs because clocks won't get much higher, but core counts will.

Good point, but from what I recall the phi had two different memory types, and you had to specify which you were targeting. This didn't necessarily translate to the server CPUs.

Yes. 200 MHz down from base clock.
