AMD Threadripper 3970X under heavy AVX2 load: Defective by design? (level1techs.com)
225 points by franzb 35 days ago | 194 comments



“Unable to perform AVX2 instructions correctly under heavy load” is also a common “WTF Intel!?”–inducing phenomenon. I’m certain SREs who work at companies with more than 1 million servers have a bunch of hair-pulling stories.

Most (all?) Intel server CPUs in fact decrease clock speed when executing AVX2 (and some other) instructions to keep things a bit more sane. Vlad from Cloudflare wrote about this, specifically about AVX-512, back in 2017: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...

Then there is the PROCHOT signal, which is supposed to protect the CPU from getting too hot but keeps getting raised under lopsided AVX2 loads, not because the CPU is too hot but because the voltage regulation gets whacked.

You may wonder what an AVX2-heavy load looks like. RSA multiplication is a good candidate. AES constructions and modes (CBC with SHA, GCM) are implemented with AVX2/BMI2 as well.
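
To make that concrete, here's a minimal sketch (mine, not from the article) of the kind of lopsided instruction mix that causes trouble: a loop that is nothing but back-to-back 256-bit FMAs. The function name and constants are made up; compile with something like -O2 -mavx2 -mfma. Real workloads (Prime95 FFTs, RSA, AES-GCM) mix in loads and stores, but the power profile is similar enough.

    #include <immintrin.h>

    /* Hypothetical micro-stressor: every iteration is one 256-bit FMA,
       i.e. 4 double-precision multiply-adds per instruction, with nothing
       else to dilute the power draw. */
    double avx2_hammer(long iters) {
        __m256d a   = _mm256_set1_pd(1.0000001);
        __m256d b   = _mm256_set1_pd(0.9999999);
        __m256d acc = _mm256_set1_pd(1.0);
        for (long i = 0; i < iters; i++)
            acc = _mm256_fmadd_pd(acc, a, b);   /* acc = acc*a + b */
        double out[4];
        _mm256_storeu_pd(out, acc);
        return out[0] + out[1] + out[2] + out[3];
    }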


I ran into this issue on one of my builds. Aida64 has a benchmark (floating point photo or something?) that uses AVX instructions. Pressing the "run benchmark" button would instantly black-screen crash my machine with 100% certainty.

I debugged this problem over a number of years... I replaced the RAM, I replaced the motherboard, I eventually replaced the CPU... still happened no matter what I did. Even if I underclocked the machine and kept the voltages the same, instant crash. It was maddening.

Exasperated, I eventually busted out an oscilloscope and looked at the waveform on the 12V supply to the CPU. When starting the AVX benchmark, there was a huge brownout. That basically explained everything; my power supply essentially turned off when the CPU started drawing a ton of power. I replaced the power supply and got lucky -- it handled it fine and I could run the benchmark. I even got some overclock out of it.

After this whole experience, I've never looked at computers the same way again. You can buy high-spec brand-name components, and it's all just a crapshoot. Maybe your computer won't crash in the middle of an important task. Maybe it will. There isn't much you can do but cross your fingers.


If there is one thing in a computer you should not cheap out on, it is the power supply. Buy from Seasonic and buy more power supply than you currently need.


My bad power supply was a Corsair 1300W model (noted because others in the thread are offering them glowing praise, whereas I think they're just a nice logo on whatever is cheap this week in Shenzhen.) The working one is indeed a Seasonic ;)

I overbuy power supplies because I think, but have not tested, that it will convince them to run the fan at lower speed. This strategy has resulted in quiet, for sure, but I'm not sure that it gets you reliability.


> I overbuy power supplies because I think, but have not tested, that it will convince them to run the fan at lower speed. This strategy has resulted in quiet, for sure, but I'm not sure that it gets you reliability.

This theory is sound. High-end power supplies tend to not turn on the fan until reaching 30-50% of their rated maximum output. Fanless power supplies are usually internally using the same components as regular high-end models with 80-100% higher output ratings, but with larger heatsinks. A high-quality oversized power supply with a fan will be silent most of the time, but keep itself cool when necessary.

(I have a semi-passive 750W power supply that has spun up its fan exactly once, when I accidentally shorted two of its power rails together and melted some wires while burning out the PCIe slot I was measuring the power draw through. The motherboard and wires had to be replaced, but the CPU, PSU and PCIe card were all fine.)


This. I've had great luck with Corsair (who used to source from Seasonic) and after some of my experiences with lower-quality sources (Thermaltake... eww) I'll never go back.

Spending days trying to chase down an intermittent reboot just to find that your 750W PSU is actually shitting the bed around ~350W is a terribly unsatisfying journey.


This is not universally true. I bought a new 1000W Seasonic Prime Ultra Titanium for use in a hashcat rig not too long ago, and the PSU would immediately switch off when all the cards drew power to 100% (the cards drew no more than 800W or so at load). There is a Newegg reviewer who experienced similar issues in a system with dual RTX cards. I ended up buying an EVGA supply instead, and that has been solid, even at 24x7 operation.


I always spring for an 80+ Gold or better efficiency rating. My theory is that higher efficiencies require tighter manufacturing tolerances, which means more care has to go into manufacturing and therefore you get a better quality as a result. That theory has yet to fail me.


Is Corsair typically a reliable brand? I usually use them for desktop builds but simply out of habit.


Corsair used to (and still might, but I haven't needed to replace my power supply in many many years) use Seasonic as their OEM. So for a long time, Corsair was just rebranded Seasonic.

These days I don't know if that's entirely true, so I would at least look at reviews to make sure.


I'm not sure if any of Corsair's current high-end PSUs are Seasonic; they've definitely done some that aren't. But whether they're using Seasonic or not, Corsair's fairly involved in the design process and set their own specs rather than just taking an ODM's existing design and adding their own logo. They also have a serious test lab where they do their own design validation, and I believe they also do failure analysis on returned products (since that lab's only a few rooms away from where RMAs land).


I'm sure these days they're far more involved in the creation of their power supplies than they were a decade ago when I last had a need to really research any of this.

Anecdotally, Corsair has always been a top-notch company to me. Even when their products go bad, their support is awesome. I had a set of memory sticks appear to go bad, but I couldn't tell if it was my northbridge or the RAM itself. I was a cash-strapped kid, so I didn't have another machine to test with. Corsair sent out a set of known-good RAM sticks for me to test with, at completely no charge to me. When the test revealed that it was indeed the RAM that was going bad, they offered to let me keep the test sticks they sent me until my replacement came in the mail.


https://www.orionpsudb.com/corsair

The OEM for most of their PSUs is CWT, but they are "Corsair custom" designs instead of off-the-shelf. Seasonic is only used for relatively few of Corsair's models.


Ah, my information is a bit out of date because thanks to the quality of Corsair power supplies, I haven't actually needed to buy a new one in a long time.


Corsair is good, I have been using them for high end desktops and low end servers for quite some time.

Only one out of ~ 10 I bought ever failed on me, and that was after 2 years of no usage.


Corsair's higher end stuff is good, yes.


You say not to cheap out, and then recommend Seasonic. I'd say avoid a brand that was notoriously the cheapest option in recent history.


Some Seasonic units are probably among the best consumer power supplies you can find (e.g. https://www.anandtech.com/show/11252/the-seasonic-prime-tita...)


At the very least, Anandtech seems to rate Seasonic very highly: https://www.anandtech.com/show/11252/the-seasonic-prime-tita...


It's not just Anandtech. All reviewers have nothing but praise for Seasonic. They have the best reputation in the industry at the moment and all their PSUs have a 10 year warranty to back it up.

Jonnyguru's last Seasonic review: https://www.jonnyguru.com/blog/2018/07/03/seasonic-prime-ult...

Literally nothing in the bad & mediocre summary sections, and with nothing but recommendations for the entire series:

> Our look at the PRIME Ultra Titanium series has now at last come to a conclusion. We can now definitively say there’s not a bad performer in the bunch, and while all do have some very minor drawbacks I can’t see any reason not to recommend any of them.


They actually have a 12-year warranty, at least for the PRIME TX ones.


You'd be surprised to hear that, apart from their own brand (which is excellent), many other top-tier power supplies from the best brands are actually made by Seasonic.


A lot of brands, Seasonic among them, vary from model to model. The last time I felt like I could just blindly stick with a brand was PC Power & Cooling, and they've since been absorbed by OCZ, so... ?


I had weird issues a few times, and it was always the power supply. Before, I would never think of it as responsible for the issue, but now I always do. Each time a friend has a weird, unexplainable problem with a gaming PC, I suggest trying another power supply, and it has always fixed the issue.


For me it's the same, but with one additional step.

Test the RAM in a known stable machine. If it passes an overnight memtest, then replace the flaky machine's PSU.


RAM has always been my "go to" "I bet this is it" problem, but I've debugged my own machines and other people's machines and I haven't seen bad RAM in ages.

I avoid XMP, however, and typically manually specify the voltage that is listed in the RAM's spec sheet and run at the speed that the CPU manufacturer recommends. (So the RAM might be sold as 3200 but Intel says 2400 is the max, so I run it at 2400.)


I get your desire to be conservative here, but this is not good practice with modern CPUs. For example, AMD desktop CPUs are designed with DDR4 speeds well north of 3000MHz in mind[1]. The problem is that JEDEC profiles simply don't go that high for some reason, so you have to run with XMP profiles if you want the chip to perform as the manufacturer intended.

[1] https://www.anandtech.com/show/14525/amd-zen-2-microarchitec...


I look at Intel ARK and use what they say the CPU is rated for. I am a couple generations behind, so not at the bleeding edge. (Haven't used an AMD CPU, but I would find their equivalent if I did.)


Keep this in mind for when you do: The first boot will be in a "failsafe" 2133 MT/s configuration, but the fabric clock is intended to ramp all the way up to 1800MHz.

You need 3600 MT/s to match that while keeping 1:1 sync, which unfortunately means XMP.
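
For anyone following along, the 1:1 arithmetic is just (assuming the usual DDR4 double-data-rate naming):

    DDR4-3600  ->  3600 MT/s / 2       = 1800 MHz memory clock (MCLK)
    1:1 mode   ->  FCLK = UCLK = MCLK  = 1800 MHz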


You can do a little bit to help by ensuring you shop for the right kind of high-spec: workstation rather than gaming, so that the extra money goes toward useful engineering and QA rather than RGB LEDs. But even then, it often ends up that the best you can hope for is a long warranty with a quick and easy replacement process.


I've had enough failed server and workstation components to believe this is snake oil. It seems like a lot of "Enterprise" hardware (particularly hard drives) is mostly the same chipset/components with a longer warranty and a more boring PCB color. Or more annoyingly, differentiated by feature lockout in some firmware.

Gaming hardware has larger economies of scale which leads to more people running the hardware and submitting RMAs. Gamers are also more likely to overvolt their hardware, leading to most consumer components being overspec'ed for heat and power.

As such, the second revision of a gaming board imo is more refined than a workstation/server board that almost never gets a new revision.

On a more practical level, consumer CPUs usually have higher clockrates and newer architectures and gaming NVidia GPUs are so much cheaper than the enterprise product lines that you can buy multiple. My workflow is faster with the cheap stuff, sometimes twice as fast.


To some extent, this depends on what kind of component you're talking about. I agree with what you're saying about consumer vs workstation GPUs because the price disparity is so large, but consumer GPUs don't usually get the full RGB treatment anyways.

I think you're also overestimating how big the gaming desktop components market is, and underestimating how fragmented it is. For motherboards, the economies of scale apply to the OEM custom motherboards used by Dell, HP, etc., which have far more volume than any one consumer retail motherboard model. You're probably also vastly overestimating how many gaming-oriented motherboards make it past revision 1.0. I just checked Gigabyte's top gaming product line, and out of 75 models listed, all but 8 were revision 1.0. Browsing through a similar number of their most recent mainstream consumer motherboards didn't turn up any that went beyond rev 1.0.

Some gamer-oriented motherboards are over-specced for power delivery and cooling. It's definitely common to find VRM heatsinks that are oversized gaudy sculptures, but that doesn't help product reliability. There are also tons of examples of gamer-oriented boards that advertise a large number of VRM phases, but have obvious shortcomings in their power delivery design once the heatsinks are removed and an EE takes a close look at things. And then there's the motherboard at issue here. When gamer boards are overbuilt, it's not thorough, it's usually just the most visible aspects that get beefed up (and the price).

RAM is a really obvious area where paying a bit extra for workstation-class ECC DIMMs gets you a real reliability advantage, but paying extra for high-speed consumer modules with RGB LEDs and big heatsinks does not.

RGB LEDs are still thankfully pretty uncommon on SSDs, but the handful of examples show that the impact varies from compromised performance to catastrophic instability. Most gamer-oriented SSDs with no LEDs but with flashy heatsinks also fail to outperform or outlive the best pedestrian-looking models.

Enterprise and NAS hard drives are definitely diverging from mainstream consumer hard drives in more ways than simple firmware feature gating. Some of the technologies they're using to drive density beyond consumer HDD needs have a questionable impact on reliability, but filling them with helium does help reliability, and you don't find that on a WD Black or Seagate Barracuda.


Oh, WD Purple surveillance HDD brochures are just marketing bullshit™:

"Using AllFrame™ technology, WD Purple™ drives improve video capturing and help to reduce errors, pixelation, and video interruptions that can occur in a video recorder system."

"WD Purple 8TB, 10TB, 12TB & 14TB capacities feature AllFrame AI technology that enables not only recording up to 64 cameras, but also supports an additional 32 streams for Deep Learning analytics within the system."

"WD Purple drives with AllFrame AI technology feature a workload rating up to 360TB/year to support the Deep Learning analytics that is featured in AI capable NVRs."

https://documents.westerndigital.com/content/dam/doc-library...


Good call. After learning the hard way in my youth, now when I build a machine I don't try to save money on the power supply. Instead I do a bunch of research, stick to the premium models, and I haven't regretted it.


I would say this is why burn-in has been ingrained into me as extremely important, especially target-workload-specific burn-in. I see way too many COTS things cobbled together and then cursed for failing in some scenario that should have already been tested for (and specifically built for).


This is a deep dive on frequency scaling and IPC throttling related to AVX512 instructions. The consequences are quite large, surprisingly complicated, and persist for an eternity, which is why you really have to coax the compiler if you want it to issue these instructions. https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
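
As a concrete illustration of the "coaxing" (my example, not from the post): a plain loop like the one below will be auto-vectorized at -O3, but GCC/Clang only emit 512-bit (zmm) code if you pass something like -march=skylake-avx512 -mprefer-vector-width=512; even with an AVX-512 -march they default to 256-bit vectors, precisely because of the frequency behavior described there.

    /* Auto-vectorizable loop: ymm by default, zmm only on explicit request. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }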



Prime95 has been a bad test for years now. Ever since the introduction of AVX2, it has been known that running Prime95 on an overclocked Haswell-E will cause rapid electromigration and processor degradation. It's fine for a few minutes, but you should not attempt the 24-hour Prime95 runs people used to do; that was in the days before AVX was a thing.

https://rog.asus.com/articles/usa/rog-overclocking-guide-cor...

It also does not test all parts of the processor equally; it really just slams the cache and the AVX units. You can be "Prime95 stable" and still crash in other things. It's just not a good test anymore, in multiple respects.

Same thing for Furmark on GPUs. And nowadays GPUs are actually smart enough to realize they're running a power virus, and the power management pulls back the power. So again, it doesn't demonstrate anything. Zen2 is supposed to work the same way: if it realizes you're redlining the chip, it should clock down somewhat.


No sequence of instructions should cause the machine to reboot.


That's an excusable outcome if you're overclocking or otherwise tweaking critical operating parameters beyond their safe default ranges.


Except the 'reboot' command? I'm always curious how it works.


Normally you'd ask the chipset so that it can reset all the peripherals at the same time.

On Intel you can also intentionally "triple fault" the CPU and that'll cause it to reset. The way that works is you tell the CPU that it doesn't have interrupt handlers anymore, then cause a software interrupt. The CPU will first try to call your interrupt handler (but that'll fail because you told it there are no handlers); the error during that interrupt then raises a "double fault" exception that's supposed to handle such errors (but once again, there are no handlers), so now you're in a "triple fault" condition and the only way to recover is to reset, so that's what the CPU does.
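
For the curious, the classic sketch looks something like this (ring-0 / freestanding code only, it really will reset the box, and the names are just illustrative):

    #include <stdint.h>

    struct __attribute__((packed)) idt_ptr {
        uint16_t limit;   /* size of the IDT minus one; 0 leaves no room for any descriptor */
        uint64_t base;
    };

    static void triple_fault_reset(void) {
        struct idt_ptr empty = { 0, 0 };
        __asm__ volatile ("lidt %0" : : "m"(empty));  /* CPU now has no usable interrupt handlers */
        __asm__ volatile ("int3");                    /* breakpoint -> no handler -> #GP -> no handler
                                                         -> #DF -> no handler -> triple fault -> reset */
    }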


>electromigration and processor degradation on Haswell-E

So what you are really saying is Intel made another defective product.


No, I'm just using that as an officially documented example of Prime95 being problematic.

AMD is subject to electromigration as well. I've seen a couple guys on Reddit already degrade their Zen2 chips from pushing 1.4V through them just during normal usage, not even Prime95. Literally weeks of usage.

Anti intel circlejerk ahoy.


"Power virus" is a propaganda term invented by hardware manufacturers as an excuse to avoid fixing their defective hardware.


Designing chips that can be safely run at redline 24/7 will inevitably involve a significant reduction in voltage (and thus clocks) for the 99.99999% of users who aren't running them at redline 24/7.

Would you buy a GPU that you could safely run Furmark on 24/7 if it was limited to half or 3/4 of the clocks? That's basically what you're proposing. I seriously doubt that you would actually lay money down for that shit. The market as a whole certainly wouldn't; they would buy the card that's faster under actual real-world workloads and/or didn't require manual overclocking to achieve its potential in real-world workloads.

The good news is that you can have it both ways. Modern power management is smart enough to figure out when you're running a power virus and it will downclock to those safe levels, while us normies can get competitive clocks when running our real-world code. It works pretty well most of the time.


> Would you buy a GPU that you could safely run Furmark on 24/7 if it was limited to half or 3/4 of the clocks?

Given that modern GPUs have complex frequency scaling algorithms that take temperature and power usage into account when setting clock speed, there's a good chance that a modern GPU will do exactly that.

> Modern power management is smart enough to figure out when you're running a power virus and it will downclock to those safe levels, while us normies can get competitive clocks when running our real-world code.

Which is perfectly fine, as long as the clock speed behavior is disclosed (which it is).

What isn't fine is a product running at such a high speed from the factory that running certain workloads damages it.


If you look back at the OP, he specified "while overclocking".

I believe that's where the issue lies. Overclocking usually also means increasing the voltage.


That sounds like it’s a great test actually.


What part of damaging your processor due to current draw, while not actually demonstrating any particular level of processor stability, sounds like a great test?

Nice contrarianism.

Yes, you can run it, and it should run successfully. Yes, sustained testing will rapidly damage your processor and cause notable performance degradation (reduced maximum overclocks/increase in required voltages) within a matter of days, and yes, you can still crash in Aida64 or other stress tests because it's not testing the whole processor very well.

It's far outside the normal current draw, even compared to other AVX2 loads like video encoding. It's an AVX load that fits entirely inside instruction cache and will hammer the AVX2 units nonstop at the maximum possible rate with absolutely no other instructions going to any of the other units. That's not how any other workload behaves, even video encoding. CPUs are not designed around that, they're designed for some reasonably normal mix of instructions.

Yes, it's a good idea to stress your processor a bit more than normal to prove it's stable, but if you make your 'proof charge' too large you will still damage the firearm.


I don't understand, when I buy a processor I expect it to work 100% load for all of the instruction set. Anything less is a defective product, and should be returned.

Furthermore, it should be able to do this 24/7 for years at a time.

This test, assuming it actually does kill the processor or stops executing, clearly indicates it should be returned.


It's fine at normal clocks. Not good for the chip, but it won't kill it within the normal operating lifetime (say, 3-5 years). That's why chips ship with a little more voltage than they need, so that they can suffer some electromigration and still function.

Overclocking? Yeah, it'll kill it quick, and that's your problem, because you're operating it outside normal specifications (more voltage than designed). If that's your standard, that you should be able to overclock and then redline the chip 24/7 for an unlimited period of time and it should never kill the chip, then you're going to be disappointed.

The fact that Threadripper is having problems is definitely a fault but it's something that you really shouldn't be doing anyway (and it sounds like the problem is the motherboard not the processor).

More generally: this is like arguing that you should be able to buy a Camry, run it redlined 24/7, and get the inevitable engine failures covered by warranty. If the manufacturer finds out you're doing that, they're not going to cover it, because it's not normal usage. Somehow people think it's a reasonable expectation for processors.


I agree Prime95 is overkill, but a processor should be able to handle it.

This is like the cooling argument all over again - "your cooling system is good enough for normal workloads". No, it must be good enough for permanent full load! Like a CPU under Prime95. If your heatsink lets it overheat, it's defective. You saved a few pennies on the copper and now it's a defective product.

Same for Prime95: the CPU must be able to run it for days on end and last ~5 years. That doesn't seem like too much to ask for, imo. Intel says their chips will run at 98 degrees for years, so why would a full load like Prime95 kill them in days?

One other thing, overclocking these days is actually undervolting combined with raising clocks. The old school "just raise the voltage" is a bad way to do it on any Intel chips post-Sandy Bridge...


If you overclock the CPU, you lose the warranty, even if it's a K-series part made for overclocking. And stock settings are probably safe.


Yeah I did mean at stock settings.

If you overclock, you're responsible. Losing warranty sounds acceptable to me. Better than manufacturers doing everything to make overclocking impossible.

To me, it's a great way to get more performance out of an old chip, the warranty would be expired at that point.

It is hard to prove a CPU was overclocked though, so I can see some people taking advantage of that.


Did you notice that in the original article he is running the stock clock speed? If a CPU can't handle any instructions it supports at stock clock and voltage for any period of time it is defective, full stop.


Change the word "processor" for "car" and the absurdity of your argument should be fairly obvious. Most ordinary cars will provide reliable transport for many years, but will literally catch fire after a few hot laps on a racetrack. It makes no sense to design a passenger car with brakes, suspension, cooling and tyres that are massively over-engineered for normal use just in case the owner wants to go endurance racing.


The 3970X is a race car in your analogy, and thus should not catch fire.


Damaging the processor? A CPU being unstable while overclocked doesn't mean it is damaging the CPU.


The newest AMD CPUs decrease clock speed based on parameters like heat, which heavy AVX2 load will drive up. So AMD also decreases clock speed when executing AVX2, indirectly, though in a very different fashion: continuously, rather than as a distinct mode.


Intel's TurboBoost is such a marketing mess. It used to be more sane when they stated base clocks - the guaranteed frequency you pay for. Anything above was just extra performance when your CPU is not running hot/at power limits.

Nowadays it's just "up to XX GHz", and people expect it to run at those clocks.

AVX really pushes the CPU; an Intel Core/Xeon will hit the power draw limit pretty fast. They don't decrease clock speed so much as fall back to base clocks. The ones you paid for. Anything above that is just a bonus.

That said, you should never trigger PROCHOT even under full stress load with AVX! If you do, you need better cooling.

It's a last resort throttling feature for when your processor is hot enough to boil water :/


I do prefer Intel's solution of slowing down over AMD's solution of crashing. I guess the good news is if this does turn out to be a power delivery related bug, it's fixable with firmware.


Maybe it wasn't very clear in my comment, but "correctness issues" more-or-less literally means the instruction's result is incorrect. Like you ask 2*2 and it gives you 0xfff1309f. So depending on how you rely on that result, you may crash. Extending my examples: bad RSA multiplication means failed authentication. A bad AES construction means encrypting things to garbage, impossible for the recipient to decrypt. And if you rely on a result from such an instruction to do memory references, you're definitely going to crash.


Slowing down and producing the correct answers is clearly superior to crashing and/or incorrect answers.

I still don’t understand what it is about avx2 that results in these kinds of issues - is it really just a matter of increased number of execution units running at once causing weird power and heat issues?


> is it really just a matter of increased number of execution units running at once causing weird power and heat issues?

That's been the root cause of all the previous AVX weirdness I'm aware of. It's probably the case for these Threadrippers, too. Modern CPUs have very high power density and run at pretty low voltage, which results in insane current delivery requirements.


Not increased number of execution units, but increased number of transistors, yeah.

A normal multiply does 1 number at a time (32 bit for simplicity). AVX2 can do up to 8 multiplications at a time. That's a huge amount more transistors firing all at once and that causes the voltage to start to droop.

AVX-512 takes that even further and now it's 16 multiplications per unit, oh and Intel moved from 2x256-bit FMA units per core on Haswell/Broadwell to 2x512-bit units on Skylake-X, so it's potentially 32 multiplications at a time - twice the AVX2 peak.

Basically to prep for that much power being drawn all at once, the chip has to switch to a higher-voltage mode to account for the voltage droop caused by all those transistors switching at once in one place. It takes time for the regulator (on-chip Fully Integrated Voltage Regulator or motherboard VRM) to wind up the voltage far enough.

https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

https://travisdowns.github.io/assets/avxfreq1/fig-volts256-1...

At this level behavior is intensively analog and thermals/voltage both significantly affect transistor current draw and switching time, which feeds back into thermals and power consumption/voltage droop.

This gets even more problematic on 7nm/10nm type nodes and especially in GPUs where you are doing a huge amount of vector arithmetic all the time. Essentially it is no longer possible to design processors that are 100% stable under all potential execution conditions, or even under normal operating conditions, so you have to have power watchdog circuitry that realizes when it's getting close to brownout/missing its timing conditions and slows itself down to stay stable. That's why AMD introduced clock stretching in a big way with Zen2 (despite the fact that it's nominally been around since Steamroller). NVIDIA piloted this with Pascal, AMD piloted it with Vega and brought it to CPU with Zen2. You simply cannot design the processor to be 100% stable at competitive clocks anymore. You have to have power management that's smart enough to withstand small transient power conditioning faults.

https://semiengineering.com/managing-voltage-drop-at-107nm/

https://semiengineering.com/power-delivery-affecting-perform...

https://www.realworldtech.com/steamroller-clocking/


I’m curious whether this behavior is defined by something in hardware, microcode, boot-time BIOS flags, or higher-level kernel/hypervisor/application code.


On unlocked Intel CPUs you can change the AVX multiplier in the BIOS.


Oh neat (I haven’t messed with overclocking in years) - is it just AVX that you can tailor? (Beyond the old school bus multipliers)


You know I think that cloudflare blog post is a great example of the engineering approach behind boringssl. It's optimized for actual workloads, where you decrypt or authenticate a short message and then move on to different activities, and engaging AVX512 doesn't actually pay off in reality. OpenSSL is optimized to produce the biggest number from `openssl speed` so of course in that light it makes perfect sense to enable AVX512. But if you're trying to use these libraries in realistic workloads you will begin to appreciate the boringssl approach.


Most performance motherboards with Intel unlocked K models will downclock the maximum boost when using AVX instructions. The reason is very high power draw and temperatures. For example, my i5 9600K runs at 5 GHz turbo boost on all cores but 4.7 GHz when using AVX. If I disable that option, it crashes under prolonged usage like benchmarks.

Edit: To be clear, the i5 9600K is sold as 3.7 GHz with boost up to 4.6 GHz on a single core. So there is a difference with the AMD case, in that this doesn't happen at the settings Intel sells it at.


AVX offset is a configurable parameter with unlocked (K- or X- series) processors. You can run with 0 offset at all but the highest overclocks.

At some point it does stop being worth it though, because the power/voltage implications of 5 GHz AVX are so severe/potentially damaging to the chip. It is a lot of current and current kills chips. SiliconLottery does all their validation with a 200 MHz offset.


There's no real reason to try to hit 0 offset anyway. If you get it stable, it usually implies you could just increase the multiplier and the offset by some n>0 for greater overall performance.


The problem is that real-world code (including games) often includes at least some AVX instructions, so the AVX number is often more representative of "real" performance. The way AMD does it where it's a smooth transition based on current/thermals is definitely better than Intel's "whoops, AVX instruction, pump the brakes!".

But yes, if you can get higher clocks at least some of the time then you might as well.

The other downside is that flipping between power states can cause problems/crashes too. It shouldn't, but it can.


Agreed, I was only speaking from an overclocker's perspective.


My 3970X on the ASRock Taichi (with default settings, generally) does not seem to reproduce this issue at this time - the system remains operational despite the FMA3 path being used (I'm assuming this is behind the AVX2 flag? disabling FMA3 leads to a plain AVX path) while running an all-core test with 16K FFTs in Prime95.

Either a slight background workload (Windows seems to be trying to use half a core for an OS update) resolves this, or this board does not have a broken power design?


Thanks for sharing your experience. What version of Prime95 are you using? Make sure to use the latest one (v29.8 build 6). The only CPU options I see in the torture test settings are: (1) Disable AVX-512 (grayed out since unsupported on this CPU), (2) Disable AVX2, (3) Disable AVX. There's nothing about FMA3.


Yeah - I explicitly updated to the latest one; the FMA3 setting is one that existed in prior builds in local.txt, so I toggled it off there just to be sure I was hitting an AVX2 code path (in case it didn't mean the UI saying FMA3 in each worker window), but it seems to interpret AVX2 as synonymous with FMA3, I guess.


Same here on a 1950X. I built from source; when I toggled AVX2 it shows "using type-1", and I don't know what that is:

https://pastebin.com/tPuYzYC0


Consider trying a couple different versions/builds of prime95

If you look at just the list of issues fixed in the v29.x series, there is more than one that is close to the code path that is problematic here:

https://www.mersenneforum.org/showpost.php?p=508842&postcoun...


I'm curious too, because looking at pictures of the VRM it looks like it has less output filtering. My guess is that the board firmware tells the CPU to be less aggressive until the VRM can deliver the power, because it's a less overbuilt design (still 90A power stages though...). That probably means more of the phases are always active, which in this case is probably an advantage, as it means there is no long ramp to turn on a phase. I'm going to guess the fact that the phases are doubled also means the board is less likely to turn off phases, as it would lose two, not one.


I'm having the same results. If I disable AVX (AVX1) and FMA3 then I just get no usage of AVX or anything. If I just disable FMA3 then it just uses AVX.

I'm on a 3900X.


He's not alone; I've had similar problems with my 3960X.

It seems to be a power delivery issue, and fortunately fixable if you disable all spread-spectrum and VRM power-saving options, but the Zen series seems a tricky beast.

I've had machine crashes triggered by using the "wrong" CPU frequency governor under Linux. It's amazing, in a horrible way.


"A tricky beast" definitely seems fair. These things have to scale per-core power consumption from ~13W down to ~3W on the fly to stay within their 280W limit.

To my knowledge, AMD hasn't implemented any proactive measures that are as severe as Intel's AVX512 strategy, where the instructions get split and handled by the narrower vector units for a surprisingly long time while powering up the full-width vector units (and dropping CPU clocks). This AMD instability is only using AVX2: 256-bit SIMD rather than 512-bit. But spread across so many cores, that's still a lot of FPUs to be lighting up.


So basically Intel underclocks itself to do the AVX stuff while AMD underclocks itself ... less?


Core clock speeds are just some of the many variables that need to be adjusted on the fly. Everyone wants to provide as much performance as possible within the platform power limits. But a sudden change in the instruction stream can cause a big swing in a CPU core's power draw even if the core's clock speed and target voltage remain constant. And new instructions can show up a lot faster than the voltage regulators can respond. So handling transients is tricky, but it also looks like this case may be less about transients and more about a particular motherboard not being able to supply stable power for a workload that's tuned to push the limits as much as possible. Gigabyte may have missed their mark when considering what the worst-case power draw situation is that they need to design and test for.


Intel CPUs have two different clock "ladders", so to speak, which define the range of frequencies the dynamic frequency selection (boost) can use. Specifically, as soon as an AVX2 instruction is encountered, the entire thing switches to the AVX2 clock ladder, which drops base and maximum frequencies by a GHz or so. Ladder selection is quite coarse, both in the cores affected (all cores, for smaller CPUs) and in time (I believe they can only change every 100 ms or so).


Why are the adjustments so coarse here? What's the limit to making them finer, which I assume would make sense?


You have to switch to a higher-voltage state to account for the voltage droop that you will cause when the AVX units fire off.

https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

(specifically: https://travisdowns.github.io/assets/avxfreq1/fig-volts256-1... )

> So one theory is that this type of transition represents the period between when the CPU has requested a higher voltage (because wider 256-bit instructions imply a larger worst-case current delta event, hence a worst-case voltage drop) and when the higher voltage is delivered. While the core waits for the change to take effect, throttling is in effect in order to reduce the worst-case drop: without throttling there is no guarantee that a burst of wide SIMD instructions won’t drop the voltage below the minimum voltage required for safe operation at this frequency.


I assume this is due to AVX2 being rare in general-purpose software, so the design "one core is running AVX2 -> switch large part of CPU to AVX2 mode" doesn't hinder use cases where you are hammering all cores with AVX2 code.


I thought that @BeeOnRope had shown that light AVX2 instructions do not clock down at all; it was only heavy AVX-256 (FP/FMA) and AVX-512 instructions that did.


I believe it depends on which CPU generation you have.


Yeah, I'm referring to the latest generation, of which this AMD should fit that description.


That would suggest the instability might be resolved by adding threads one at a time, but does prime95 have an option like that?


That's something you can do outside of prime95; the schedutil cpufreq governor does that by default, which I believe is why I had hard crashes with ondemand and not schedutil.

(Ok, it actually ramps up frequencies more slowly rather than locking out cores, but the effect is the same.)


You can probably bodge it by running a bunch of copies with limited workers, then running your main copy, then killing the starters. Or you could fiddle with the CPU affinity.


That's an interesting thought!


I remember when Buildzoid of AHOC did the mobo breakdowns of the TR4 boards and thinking that while some of the super high end boards were probably good for this sort of beating, the mid and low range might struggle hard with the 64/128 part if it ever came into being (3990X was just a rumor at that time). But it looks like they need more caps to handle the transient response time, and probably also some firmware fixes to slow ramp because I don't think all the SMD caps in the world are going to handle that sort of ramp. It's just not possible to get them close enough to the actual CPU without literally putting them under the IHS.


Here's BuildZoid review of the motherboard I'm using (GIGABYTE TRX40 Aorus Xtreme):

https://www.youtube.com/watch?v=HMUWzDSAS9c

A very interesting watch if you're interested in electronics in general and in power delivery in particular (the whole YouTube channel is awesome to be honest).


That was one of the few that I thought could handle the 64/128 part. However, look at the output filtering: it's roughly the same as (if not a cap or two larger than) what you'd find on X299... which runs at a higher voltage and thus has lower amperage requirements. I have yet to find a back-of-board shot, but unless it has a ton of SMD AL-poly caps back there, that board would still struggle with the ramp described in the thread. Even then, I'm not sure the socket resistance wouldn't cause enough V-droop to cause a crash anyway.


Tangent: is there a reason that CPUs’ instruction sets aren’t designed with explicit “hint”-ops in them to let a compiler assert “I’m going to execute some instructions that’re going to draw a lot of power about 1000 cycles from now, so start ramping up for it now”? That’d basically eliminate what Intel calls “license-switching” costs.

Do they just not trust compiler authors to be able to emit these kinds of hints? Is there too much legacy code that would come without the hints for it to be worthwhile? Would it just not be worth it given how often OS context switches could drop the CPU directly from regular code in one process to AVX2 code in another process?


There is no explicit "warmup" instruction but you can perform a dummy AVX instruction wherever you want to start warming it up.

> I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector. The result is that the throughput for 256-bit vectors is 4-5 times slower during this warm-up period. If you know in advance that you will need to use 256-bit instructions soon, then you can start the warm-up process by placing a dummy 256-bit instruction at a strategic place in the code. My measurements showed that the upper half of the units is shut down again after 675 µs of inactivity.

https://www.agner.org/optimize/blog/read.php?i=628

Agner also notes that previous generations would just flatly stall all instructions (including non-vector) once they saw 256-bit instructions, which is probably why there is no "warmup" instruction. If there is an explicit stall of all instructions then there are very few reasons to incur that until the last possible second.
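
Agner's "dummy instruction" trick might look roughly like this (my sketch; the helper name is made up, and it only matters on cores that power-gate their upper vector lanes):

    #include <immintrin.h>

    /* Issue one throwaway 256-bit op well before the real vector kernel so the
       core starts powering up its upper lanes; the ~14 us warm-up then overlaps
       with whatever scalar work runs in the meantime. Compile with -mavx. */
    static inline void avx_warmup(void) {
        volatile __m256 sink = _mm256_set1_ps(1.0f);
        sink = _mm256_add_ps(sink, sink);   /* dummy 256-bit instruction */
        (void)sink;
    }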


I think this would be seen as an overly-leaky abstraction, for several reasons: (1) You don't want the chip's correctness to be at the mercy of the particular instruction stream it's given; (2) the problem is potentially solvable at the microcode level, which makes (1) remain viable.

But, if such "hint" ops were seen as the only viable way to get correct, high-performance behavior in these cases, and there was a commercial case to be made for it, I could see processor vendors adding them to the IA.


Probably because the better approach would be to stall the processor (or at least stall on power intensive instructions) until the requisite power has arrived. 1000 instructions is probably too far out (in terms of instructions) for compilers to accurately add the requisite instructions. Also, it requires code change, unlike the stalling approach. The only disadvantage to the stalling approach would be if the workload is constantly switching between low power and high power, but that can be solved by making the switching interval longer/shorter in the firmware.


Why is it “better” if the stall results in lower performance-per-watt? That’s what people buy these high-end CPU SKUs for, after all. They aren’t just concerned with CapEx (the cost of the CPU) but also OpEx (the aggregate cost of electricity for the CPUs, PSUs and cooling in their studio/data center.) You could also “fix” the problem by just allowing the customer to lock the CPU into AVX2 mode and thereby never switching out at all—but of course that’d be dumb; they’d be drawing more power, and also not going as fast as they could, when executing non-AVX2 code.

Another approach I haven’t seen from either Intel or AMD yet, is to copy ARM’s big.LITTLE architecture: to set up separate “AVX2 cores” that can execute AVX instructions and most basic ALU ops but not e.g. branches, put those on the other side of the die with in-wafer thermal insulation between them and the regular cores; and then throw workloads between regular and AVX2 cores in a way where the AVX2 cores heating up doesn’t mean that the non-AVX2 cores are heating up, and the CPU can go back to full Turbo Boost as soon as the workload is thrown back to the regular cores, because the respective regular core is actually quite cool.

(The performance effects of this are possible to loosely estimate, I think, by writing code that synchronously sets up a GPGPU pass on some data, executes it, retrieves the result, and then returns to executing CPU instructions for a while, in a loop; and then executing this code on an Intel CPU using its on-die IGPU as the GPU. The CPU and IGPU form somewhat-separate thermal domains already, though there’s no explicit insulation.)


Because it doesn't rely on both the user and developers to do some yet to be determined 'right thing' for the CPU to behave in a consistent manner. While you are correct that some people want absolute max performance per watt, I would argue that no one wants a CPU that behaves unpredictably or inconsistently. An option to disable one or more power saving features that result in this issue (assuming that's what the issue is) for those who just need maximum performance could be a desired solution for some. While others, possibly most, would be willing to take a short term performance hit from a stall to maintain the best overall performance per watt characteristics.


Stalling is the way to go.

I have my doubts about this being a power issue, because localized power analysis is pretty advanced in IC design and they'd totally have tested for this.

Assuming it is a power issue though, nearly all power is used in dynamic switching, so simply inserting stall cycles will resolve it.

Power rails on silicon chips have quite a lot of capacitance, at least enough for many clock cycles, so you don't even need to make every cycle in every bit of hardware a possible stall cycle - it would be enough to simply gate instruction issue I expect and let functional units drain.


> Probably because the better approach would be to stall the processor (or at least stall on power intensive instructions) until the requisite power has arrived.

Sorry to be pedantic, but... physics requires a degree of pedantry.

1. This is probably an ENERGY issue, not a power issue. Energy is the integral of power (or power is the derivative of energy: power is the change of energy over time).

2. Electronics can only really measure voltage easily. Everything else is converted to voltage to be measured. For example, to measure current, you insert a shunt resistor and measure the voltage across it. To measure power, you need to measure voltage and multiply it with current. To measure energy, you take the integral of your power measurement. (No joke: that's how it typically works.)

3. With a capacitor, you can measure your energy level. Capacitors are relatively inaccurate however (+/- 10% tolerances at the electronics level, and probably worse tolerances at the nanometer chipscale level). The voltage from a capacitor does roughly correlate to its energy level.

-----------

Issue #1: Capacitors suck at energy storage, but they're all that engineers have at these speed and sizes.

#2: Voltage is the only thing that can be measured reliably, but it takes a long time (more than a few nanoseconds, aka 10s or 100s of clock cycles) to measure voltage. Measuring other variables, such as "Energy" (voltage level of a capacitor), or "Power" (change of energy over time, or voltage * current), takes even longer to measure.

#3: Computers are fast. 4GHz computer means that you have 0.25 nanoseconds to make a decision per clock tick. Waiting on the results of any calculation on #2 will naturally take dozens or hundreds of clock ticks to accomplish.

#4: That's best case scenario, assuming that capacitors are large enough to actually hold the energy you need to perform your calculations. But in an AVX2 situation, where your CPU suddenly will use up more Power (energy/time) than expected, you need to communicate to your power-circuitry to increase the voltage or current delivery to the chip.

#5: Sending commands to the power circuits / VRMs, or worst case to the Power Supply, can take many milliseconds (millions of nanoseconds) before the Power Supply responds. A 200W chip at 1.25V will draw 160 Amps of current (0.160 coulombs of electrons per millisecond). A 1 Farad capacitor will lose 0.16 volts per 0.16 coulombs of electrons drawn, and hint-hint, you can't physically fit a 1-Farad capacitor in a chip (https://qph.fs.quoracdn.net/main-qimg-dddab4aa574936f8a61437...).

#6: This means that the only reasonable approach with today's technology, is to wait for the VRMs or Power Supply to respond. Note: The Power Supply will naturally notice the voltage drop and automatically send more electrons down to the motherboard. But the time-scale the PSU operates is on the order of milliseconds. A millisecond is 1-million nanoseconds or ~4-million clock cycles AFTER the AVX2 units started getting used. If you run out of energy before the power gets to your chip... guess what? You get a brownout. Random cells in your chip were now turned off and the state of the CPU is a jumbled mess now.

--------

Where, how, and why to place capacitors around a chip is one of the most difficult PCB-issues an electronic engineer deals with. There's a LOT of "rules of thumb", making PCB-design / capacitors a mystic subject matter for many EEs.

Given the difficulty of the subject of power-delivery, energy calculations, and high-speed signals involved (a chip operates at the GHz band, drawing ~100 Amps on the clock-tick, and then 0 Amps between clock ticks!), I'm almost certain that there's a mistake in the capacitors somewhere.

It's a mystic subject that's very difficult to simulate, especially across varying workloads (AVX vs. normal 64-bit code, etc.). That's my bet for what's going wrong here.

Honestly, power-delivery is so awesome and complicated. I'm almost surprised that anything works at all these days.
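
To put some (entirely made-up) numbers on that time-scale argument, here's a toy droop calculation; real decoupling networks are distributed and far more complicated, this just shows why waiting milliseconds for the VRM or PSU is not an option:

    #include <stdio.h>

    int main(void) {
        double P  = 200.0;   /* chip power, watts                            */
        double V  = 1.25;    /* core voltage, volts                          */
        double I  = P / V;   /* sustained current: ~160 A                    */
        double C  = 1e-3;    /* pretend 1 mF of local decoupling (generous)  */
        double dt = 10e-6;   /* 10 us before the regulation loop responds    */
        double droop = I * dt / C;   /* dV = I*dt/C                          */
        printf("droop after %.0f us: %.2f V\n", dt * 1e6, droop);
        /* ~1.60 V of droop -- more than the entire 1.25 V rail, so local
           capacitance alone can't bridge even a 10 us response gap here.    */
        return 0;
    }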


Last time they trusted compilers to become smart, Itanium happened.


sTRX4 can take one of three CPUs ranging from 24 to 64 cores. The TRX40 Aorus Xtreme that's mentioned in the thread should be the absolute top of the line even if it was launched before the 64 core monster was available. So I'd expect it to work just fine with the 3970X which is a 32 core part.

But I wonder if this is a Gigabyte issue who have a history of playing around with the power delivery and using "fake phases" (to the point where they now have to advertise their boards as having 16 "real phases"). As far as I can tell many (most?) reviewers benchmarked the board with 24 core CPUs and most likely skipped on the power intensive tests.


To me this smells 100% like a transient issue; Gigabyte probably assumed that more phases would wipe out the transient response time. The issue is that if you're shutting down phases (which, as Buildzoid mentions, you have to do to get efficiency), you lose that response-time advantage until the phase is spun back up, which is the longest part of the power-system cycle. Normally output filtering is designed to make up that gap, but I think in the case of these chips that's not happening. I suspect it's a lot more complicated, because I have a sneaking suspicion that adding more caps wouldn't (completely) solve the problem.


This mobo has 16 real phases, according to BuildZoid's analysis.


I imagine it is. I meant they had such issues in the past (hence the "this time for real" approach) so it may be that they cut different corners in order to meet some other targets that are marketable.


That would be a motherboard issue not a CPU one, though. Has anyone attached an oscilloscope?



Great thread, thanks for sharing. Really makes you want to order an oscilloscope and some CPUs to putz around with.

Everything in computer science is such a rabbit hole, great field to be in.


If you're interested in a relatively inexpensive oscilloscope to play around with, check out the Rigol 1054Z (or 1074Z Plus if you want MSO capabilities). It's four channels and has more than enough features for playing around with, especially for the price. Using the Riglol website, you can unlock all software options on it, including increasing the bandwidth to 100 MHz and the memory depth to 24 Mpts.

It was the first 'real' scope I bought and I still use it a fair amount despite having upgraded to a Rohde and Schwarz MSO model. I've been amazed how much I use an oscilloscope after getting one, from measuring ripple on power supplies to diagnosing serial communication issues.


> Using the Riglol website, you can unlock all software options on it, including increasing the bandwidth to 100 MHz and the memory depth to 24 Mpts.

It's interesting how Rigol did the locking of those extra features. The unlock key for a feature set for your scope has to be signed by a Rigol private key using an elliptic curve signature system.

But they are only using a 56 bit private key. That was quickly brute forced, and key generators proliferated.

They used a good library for the cryptography stuff, and except for the short key seem to have used it well and knew what they were doing. This suggests that the choice of a weak key was deliberate.

Each family of scopes has its own private key. As new families came out with their own private keys, Rigol continued to use 56 bits. When major firmware upgrades came out in existing families, where they could have easily changed to a longer private key, they kept the same 56-bit key that was now widely circulated on the net.

It seems pretty clear that they are not interested in stopping people from free unlocking.


It seems like a decent price discrimination strategy to me. They make advanced capabilities more price-accessible to particularly interested hobbyists and more popular among that market, and probably aren't losing much revenue from corporate and academic institution sales.


Thanks for the recommendations! I have a DSLabs DScope (100 MHz, 2-channel FPGA scope) and while it’s handy I’d prefer to have a proper hardware scope someday. Rigol’s scopes look like they nicely fit in between the basic DSO/FPGA stuff and the “proper” 4-5 digit priced test bench gear.

Any recommendations for learning resources that could help with understanding DC power supply analysis for non-EE types? While refurbishing laptops and working with microcontrollers I’ve run into some odd things where ruling out transient power supply issues would probably be helpful.


The low end Rigols make good entry-level scopes, and have a surprising amount of capability for the price.

As for learning resources, I came across a decent article on the subject when I was starting out (1), and most of the oscilloscope manufacturers have whitepapers on SMPS diagnostics; the Tektronix one I read a while back (2) gave a good overview. A lot of the whitepapers have a manufacturer-specific focus, but they still have good information that can be applied to almost any oscilloscope.

If you want to get really into the power supply and do high-side measurements you'll need an isolated differential probe, which can cost as much as an inexpensive oscilloscope, but for DC output measurement you shouldn't need anything special. Current probes are a lot more affordable if you're interested in looking at loads or current fluctuations/harmonics, but that's more useful after you've figured out a bit more what specific properties you're trying to measure.

1: https://www.testandmeasurementtips.com/test-switching-power-...

2: https://download.tek.com/document/3GW_23612_7.pdf

Edit: I forgot to mention that the EEVblog forums are a good resource also, but they sometimes aren't as friendly as they could be towards people just starting out.


Wow neat, thanks! Happy to have helped spawn this little thread here. Will definitely check out Rigol.


I'm really confused the OP there didn't try a different sTRX4 motherboard.


OP here. I wish I had! Unfortunately that's my work machine and I don't have the time to somehow get my hands on a new fairly rare, very expensive motherboard, disassemble my workstation and rebuild it just to see if my motherboard is the culprit. From the experience of other owners of the 3970X, this problem happens with other TRX40 motherboards.


At least EU Amazon is extremely good about RMAs, e.g. international express delivery (next day) even without returning the defective component first. Return shipment is also covered, i.e. free for the customer. Dunno if they'd do that for high-priced stuff, though.

p.s. lovely/lively oscilloscope shots!


Makes sense. I've done that with Amazon Prime in the past: returned a mobo I considered defective and fair game to return.


> fortunately fixable if you disable all spread-spectrum

If these are on by default, are they required for FCC certification conformance? Presumably the device has not been EMI/EMC tested in the mode where spread-spectrum was disabled.


They are, but I'm not physically in the USA. At any rate, isn't a computer case basically a Faraday cage?

The frequencies are up in the microwave band, so I can't imagine there's much range.


> computer case basically a Faraday cage?

If it was designed well ;)

But that's only effective for addressing radiated emissions, not conducted ones.


Hmm are we having another "AMD motherboards are crap" moment?

Or is it simply that delivering 200+W at load through a CPU socket can't be reliably done at consumer prices?

Has anyone had this problem with less high-end CPUs? Something at 65-95 W?


Anything with the sTRX4 socket should be able to handle at least 250-280 W. But it's not out of the question that many motherboard designs borrow heavily from AM4 boards, which have much lower requirements, to keep costs down. This may be the result.
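
(Rough numbers of my own, ignoring VRM conversion losses: at typical core voltages that wattage translates into a lot of current through the socket and power stage.)

    #include <stdio.h>

    int main(void) {
        /* Back-of-the-envelope socket/VRM current for a 280 W package
           at assumed core voltages, ignoring conversion losses. */
        const double watts = 280.0;
        const double vcore_low = 1.1, vcore_high = 1.3;   /* assumed range */
        printf("~%.0f A at %.1f V, ~%.0f A at %.1f V\n",
               watts / vcore_high, vcore_high,
               watts / vcore_low, vcore_low);             /* ~215 A to ~255 A */
        return 0;
    }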

The "CPU defective by design" in the title might be a bit misguided since the suggested workarounds do not address a CPU issue but a motherboard one.


I agree (my fault, sorry) but the fact is that several people are encountering the exact same issue with a variety of motherboards. So either all those motherboards are "defective" or the problem is more complex than it looks.

My motherboard is the highest-end consumer motherboard GIGABYTE has ever built (as far as I know), and I wouldn't say that they tried to keep costs down (it's listed at $849 on PCPartPicker, and it was introduced at $999), and the power delivery stage is insane. Here's BuildZoid's _in-depth_ review and analysis of its VRM: https://www.youtube.com/watch?v=HMUWzDSAS9c


I learned a long time ago to never get the most expensive motherboard (ASUS Republic of Gamers, many years back): they're not as heavily tested as the cheaper ones. They are only ostensibly better; in practice, the mid-level "workstation-class" motherboards that offer similar capabilities minus the fancier options are put through their paces and far more thoroughly tested and debugged. Just look at the BIOS changelogs and even the hardware revisions for high-but-not-flagship motherboards compared to the flagship models. When you buy the top-of-the-line, you're on your own.

I now literally go out of my way not to get the flagship motherboard models, even if it means holding off on a purchase until a lower-spec'd model comes out, and have never regretted it since. (I also will never again buy Gigabyte motherboards, either.)


I have an ASUS ROG mobo. I can't do a soft reset of the computer with RAID enabled. For some reason on restart, it fails to recognize my SSD (I have an SSD as my boot drive and 3 HDD in a RAID array for data). Without fail, I always get "No boot drive detected". I have to do a hard power cycle every time. Annoying when Windows updates require multiple restarts sometimes...


I see Gigabyte being singled out more than the rest, so I wonder if this is another "fake phases" debacle like the one that happened to them in the past. They now advertise the board as having "real phases". It's not out of the question that they took a shortcut elsewhere in order to meet some efficiency expectation or similar.

Just because it's high end doesn't mean they won't cut corners or fail to test properly. My ROG motherboard was (at the time) the most expensive motherboard for the socket, yet it behaved worse overall than many mid/low-end motherboards I owned, an experience shared with other owners of the same board. Even if this is a problem with the CPU that gets mitigated by changing motherboard parameters, it should have been caught during motherboard design and testing. I can't imagine Gigabyte's engineers noticing this stuff and saying "just ship it like that, nobody will notice". So the best interpretation I can come up with is that they missed it during testing (the worst is that the marketing department said "we have to put it out there fast, all else be damned", and all else was damned).


> Just because it's high end doesn't mean they won't cut corners or fail to test properly. My ROG motherboard was (at the time) the most expensive motherboard for the socket.

High end in consumer stuff, and especially anything gaming related, is basically a scam... look mah, 16000 DPI! And blingy LEDs! (But no engineering.)

Sure they might drop an over-specced part somewhere in it but it's just marketing when the rest of the product is crap and still has no proper engineering behind it.

It's incredibly frustrating. I think there was a time when you could generally assume that expensive = high end = actually good, but now it's just a cheap thing with crazy markup and a premium part (but nowhere near premium enough to justify the markup) or two somewhere (where it probably doesn't matter much anyway) along with other gimmicks. Now it's just expensive = expensive, good or maybe not.


I see similar behavior with a 105W 3950X when running prime95. I haven't experienced instability otherwise over the 2 months I've had this build running.

Google's stressapptest runs fine for long durations, building a kernel with make -j32 succeeds (and can boot it), every parallelized archiver like 7z,pbzip2,pigz,xz is ok, and even gaming on a Windows VM using 8c/16t + GPU passthrough works well.

This is my first Ryzen system so I chalked it up to a possible carry over of DDR4 issues from earlier generations. I didn't investigate further since the 3950x had just been released and I couldn't find any other reports of prime95-only instability until now. And just to reiterate, it is perfectly stable otherwise.

A heavy AVX2 workload breaking things fits better, so I'll have to try to collect more data.


I'm still eyeing the 3950x as my next upgrade so I hope this gets resolved through a BIOS update or something. What motherboard are you using for it? (I'm assuming stock settings, no overclock or PBO, AMD doesn't consider PBO to be stock)


I picked the ASRock X570 Taichi since I needed 3 PCIe slots and it was on sale for ~$260. Very happy with it and yeah I'm not overclocking or using any of the quasi-overclock settings either.


I have an AMD GPU, a 380X... and to be honest I'm fairly sure AMD power management is crap in general. The 380X in particular draws more than the 380, but AMD didn't bother raising the power limit, causing some weird issues (some games, for example, get unstable unless I use an overclocking app to reduce the voltage and raise the power limit... mind you, the regular 380 already shipped with the power limit effectively maxed out by default, because it was already too low...).

Then we get the R480 melting cables for the same reason... and so on.


Hmm whatever is melting cables is drawing more power than the cables are specced for. I doubt increasing the power limit will help with that...


You are correct. But what I meant is that the source of the problem was the same.

In this case, AMD wanted to keep using a 6-pin cable despite it not being appropriate; on the 380 it just slowed things down.

On the 380X it's unstable. It still doesn't draw enough to melt anything, but you need hackery to make the GPU usable.

With the 480 they outright made a GPU that used more power than the cable specs allow, and insisted on using the same power delivery design the 380 had... Their "fix" was to ship patches that just make the GPU run slower and misbehave like the 380X does.


Wow, that's ridiculous...


> Then we get the R480 melting cables for the same reason... and so on.

RX480 wasn't melting cables, it was melting motherboards. The stock VBIOS was pulling more power than the PCIe spec allowed you to pull from the slot. It was pulling over 100W, vs the 75W spec, not massive but probably enough to push some older/weaker boards over the edge (most of which were just ready to fail anyway and were going to fail from a solid 75W draw too).

https://www.tweaktown.com/news/52871/amds-radeon-rx-480-draw...

It's pretty hard to melt a cable. You can overdraw the connectors by roughly twice their rated power safely, and the cables will do more than that as long as you're not using splitters or some other hack.

The 295x2 pulled almost 500W through a pair of 8-pins and the slot (nominally that's 375W rated) while overclocking.


The stock VBIOS?! Wth were they thinking when programming it...


Basically, the chips drew more power than expected. You can tell from the fact that they only put one 6-pin on there. That gives a total board power limit of 150W, which the card actually exceeds pretty easily (pulls about 200W). They made the mistake of overdrawing the slot, the fix was to switch to overdrawing the power cable instead (which can tolerate it much more easily, you can very safely do up to twice the rated power).

Part of it was likely that NVIDIA's performance was so good. Some anecdotal rumors from the AMD vlogosphere suggest that AMD thought the RX 480 would be competitive with the GTX 1080 - I consider this kinda dubious because the RX 480 is basically a GTX 980 tier chip, so that would mean AMD thought NVIDIA wouldn't make any progress at all from a node shrink, which seems unlikely. Maybe they figured the RX 480 would be a lot faster or more efficient than it actually ended up being.

Polaris was the last generation with "old style" voltage control where you just set a target and go; it is possible that they figured the chips would hit close to 2 GHz like NVIDIA's but ran into validation problems and didn't hit the expected clocks. This could possibly be the reason they nuked Big Polaris and ended up lengthening the pipeline so much in Vega to try and get the clocks up (along with adding the Pascal-style "smart" power management).

Anyway, regardless, the point is that it seems likely that AMD was forced to push the RX 480 much farther than they intended to, at the last moment. Like, after the PCBs had already gone out for manufacturing, and it was too late to switch the 6-pin to an 8-pin (which would have solved the whole problem, that would have allowed 225W rated total board power).

It was still a bit of a showoff move to run from only a 6-pin, there is no way that the card would have been significantly less than 150W total board power, but perhaps defensible as a compatibility move - although some marginal PSUs still might not have handled it.
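
For reference, a quick tally of the nominal budgets in play here (standard PCIe ratings, not what the connectors can physically tolerate):

    #include <stdio.h>

    int main(void) {
        /* Nominal PCIe power ratings behind the 150 W / 225 W figures above. */
        const int slot_w = 75, six_pin_w = 75, eight_pin_w = 150;
        printf("slot + 6-pin: %d W\n", slot_w + six_pin_w);    /* 150 W */
        printf("slot + 8-pin: %d W\n", slot_w + eight_pin_w);  /* 225 W */
        return 0;
    }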


> Or is it simply that delivering 200+W at load through a CPU socket can't be reliably done at consumer prices?

It could be that the spike is the problem not the end load. The scenario in play here is going from entirely idle to maximum load in an instant. So the CPU is going to shoot straight from ~20-30w usage to 280w (or even higher if the power management is a bit sluggish at reducing the clocks). That's a pretty drastic swing if anything isn't entirely up to spec, assuming the spec even handles this properly.

In theory the 3970X could even spike all the way up to 430w. The individual cores top out at 13.5w, so if they all end up running at single-core turbo frequencies even for a split second that's going to be brutal.
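
(The ~430 W figure is just the per-core boost power multiplied out; a quick sketch, assuming all 32 cores briefly draw their single-core boost power at once before package limits react:)

    #include <stdio.h>

    int main(void) {
        /* Worst-case transient sketch: every core momentarily at its
           single-core boost power, before any package-level limiting. */
        const int cores = 32;
        const double per_core_boost_w = 13.5;
        printf("%.0f W\n", cores * per_core_boost_w);  /* 432 W */
        return 0;
    }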

> Has anyone had this problem with less high-end CPUs? Something at 65-95 W?

My 3700X in an old X370 board has been flawless. As has many others, Ryzen 5/7/9 are _widely_ recommended and have been for a while. Any systemic issues in the "regular" consumer end would have cropped up by now.



The original Ryzen (1000 series) shipped plenty of units with hardware defects that could be exposed by running parallel compiles. The so-called segfault bug:

https://www.phoronix.com/scan.php?page=article&item=new-ryze...


From your link:

> AMD has confirmed this issue doesn't affect EPYC or Threadripper processor

The CPU Michael was using was a Ryzen 7, not a Threadripper. But I don't know if TRs were later found to have the same problem, which wouldn't surprise me.


IMO the likely cause was some kind of cache bug due to manufacturing. Threadripper processors were binned for tighter cache timings (they had the same timings as the 2000 series desktop chips) and thus didn't suffer from it.

EPYC was on a different stepping entirely and didn't share dies with the consumer or HEDT processors. I suspect they likely had the tighter cache timings although I'm not sure on that.

Regardless, it's also possible that they were just binned out, or nobody ever encountered it. EPYC was not a very high-volume product; most people did test installations and said "yeah, we'll wait for Rome". Quad-NUMA on a package and lower-than-Intel IPC on a pretty poor node was not a winning formula.

Once the problem was realized, desktop Ryzen processors started getting binned for it as well. A few slipped through here and there, so it wasn't a manufacturing change, I think binning is the most likely explanation.


I can reproduce this problem on my non-Threadripper Ryzen 5 3600 Zen 2 CPU. I don't think it's specific to TR.

With AVX2 enabled, Prime95's torture test is only stable when I use 3 workers or less. With 4 workers one of them will abort due to an error within 20 seconds. The more workers, the sooner a crash; with 5 workers it happens within 10 seconds, and with 6 workers it happens within 2-3 seconds.

If I play with the tests on and off for a while, seemingly increasing the quiescent temperature of the CPU, the whole experience actually becomes a bit more stable. My motherboard is a B450M board (B450 chipset).


I cannot reproduce it on Ryzen 5 3600, Gigabyte B450M DS3H, Prime95 on Linux. (FMA3 FFT length 16K)


What motherboard make/model?


Gigabyte B450M DS3H. Currently running with latest BIOS (F50).


Note that @DerAlbi thinks the CPU is fine; instead he suspects the VRM on his Gigabyte mobo (based on his oscilloscope readings):

https://forum.level1techs.com/t/3970x-prime95-stability/1532...


Thanks for the info!


Do you have PBO or overclocking enabled? PBO is not considered stock by AMD; ensure that it's disabled in the BIOS (some motherboards incorrectly enable it by default).


I don't do any manual overclocking, but "core performance boost" is enabled which allows the CPU to ramp up about 500 MHz during heavy load instead of staying fixed at base frequency. I'm guessing this is the same setting you're referring to?

Add.: OK, I finally found a couple of "Precision Boost Overdrive" menus deep down in the overclocking settings of this board's BIOS, and I've disabled everything PBO I can find while letting the "core performance boost" remain in auto mode. Without doing any extended testing this seems to have solved the problem, as I can now let Prime95 run through the AVX2 code path with 6 workers without any crashes. Thanks for the hint!


Do some more investigating. Drop the memory clocks to stock, drop the CPU clocks to, perhaps, 3 GHz, and see if the same issues happen. If they do, there's a systemic issue that needs to be addressed. If the issue disappears, try raising the clock incrementally until the issue reappears. Get a Kill-a-watt and look at power usage for each frequency and graph the results.


> Finally, a note on CPU temperatures: At idle the CPU hovers around 39-50 °C and tops around 72-78 °C under full load. I’m using the best air cooling setup I could think of and get my hands on, but it’s still air cooling, and my system is installed in a closed case (but with extreme attention to airflow).

I know air coolers can be competitive, but it says right on the outside of the 2950X box that you should use liquid cooling.


Can anyone suggest a different CPU load-testing tool other than prime95, that might catch things prime95 wouldn't?

I have a machine running a 1950X and I get random ffmpeg segfaults anywhere from six to eight hours in to an encoding session with all 16 cores fully loaded, but the machine is prime95 stable for a week+, so I suspect it's an AVX/AVX2 issue.


Intel's Linpack benchmark is quite a good perf/stability/throttling test for win/linux/mac: https://software.intel.com/en-us/articles/intel-mkl-benchmar...

Not the easiest one to configure though.


There's a nice, simple to use wrapper for Windows:

https://www.majorgeeks.com/files/details/intelburntest.html

https://www.techpowerup.com/download/intelburntest/

(Not sure which is the canonical source...)

I've run it for 5 hours (20 iterations) at the maximum stress level but it passed. In my limited experience, it's less punishing than Prime95.


Linpack only runs on Intel CPUs, so it's not useful for Threadripper.


I don’t know if we’re talking about the same Linpack test but I’m able to run it on my AMD TR 3970X without issues.


The linpack package in AUR, at any rate. If there are multiple software suites under the same name then I have no idea.


That, or it's a DDR4 issue possibly? My post over at Level1Techs forum (story's link) lists a bunch of stress tests (but none as punishing as Prime95, in my experience).


Memory issues were my initial guess, but I'm running ECC RAM at stock speeds (although, at the time, I had to go direct to the manufacturer and wait a few months to get my hands on 2666 MHz ECC DDR4 modules). Bumping up my DDR4 voltage just a tad did appear to help, but if it were really a DRAM voltage issue I'd expect Prime95 to catch it. I really didn't want to go down the oscilloscope rabbit hole (but I commend you for doing so!), although once upon a time I would have relished it. I guess I'm too old to have patience for things that don't work as advertised, but then again, as I've gotten older I've found that fewer and fewer things work as intended, anyway :/


> I really didn't want to go down the oscilloscope rabbit hole (but I commend you for doing so!)

To be clear, I'm not the one who plugged a scope to my motherboard, another 3970X owner did. I wish I had a scope! (The Rohde & Schwarz RTB2000 is my dream!)


Do you really mean segfault (i.e., SIGSEGV), or do you just mean that the program crashes in general?

Either way, you might consider bringing this to the attention of the ffmpeg developers [0]. If they don't have a fix already, they may appreciate your help in root-causing the bug.

[0] https://lists.ffmpeg.org/mailman/listinfo/ffmpeg-devel/


Yes, a literal SIGSEGV, but the dump doesn't indicate anything amiss. It's not an ffmpeg issue, because other long-running software also crashes some hours in when run alongside an encode job (e.g. I've had rav1e crash too). I prefer not to send software developers on a wild goose chase unless I have some sort of valid repro case, which I don't, not really.


Any chance you're running out of memory? Sometimes C/C++ code fails to properly handle failed memory allocations in those situations, leading to a misguided attempt to dereference a null pointer.
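
A contrived sketch of that failure mode (not ffmpeg's code, just an unchecked allocation turning into a SIGSEGV):

    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        /* Request far more than the machine can provide so malloc
           returns NULL on most systems. */
        char *buf = malloc((size_t)1 << 48);
        /* Missing "if (buf == NULL)" check: if the allocation failed,
           the write below dereferences a null pointer -> SIGSEGV. */
        memset(buf, 0, 4096);
        free(buf);
        return 0;
    }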


You can run non-AVX, AVX, AVX2 and AVX-512 tests in Prime95, look under the options (assuming you use the latest version)


You can test AVX with newer versions of Prime95. You probably shouldn't run that for a week with small FFTs, though.


I never realized prime95 was still updated! I seem to remember there was a time when the newest prime95 releases were several years old and simply assumed that was still the case. I just ran whatever copy I had in my downloads archive at the time; I just checked and it was version 29.3 build 1.


Apparently chips now detect a Prime95-like workload and modify their behaviour accordingly.


As far as I know, this is a legend. Chips do change their behavior when encountering an AVX, AVX2 or AVX-512 instruction, though.
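
For illustration, the kind of code that triggers that behavior is any sustained run of 256-bit FMA instructions. A synthetic sketch (not Prime95's actual code), built with GCC/Clang using something like -mavx2 -mfma:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* A tight 256-bit FMA loop of the sort that trips AVX2-specific
           frequency/power management. Values chosen only to keep the
           accumulator finite and the loop from being optimized away. */
        __m256d a   = _mm256_set1_pd(1.0000001);
        __m256d b   = _mm256_set1_pd(0.9999999);
        __m256d acc = _mm256_set1_pd(1.0);
        for (long i = 0; i < 400000000L; i++)
            acc = _mm256_fmadd_pd(a, acc, b);   /* acc = a*acc + b */
        double out[4];
        _mm256_storeu_pd(out, acc);
        printf("%g\n", out[0]);
        return 0;
    }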


So, does this mean threadripper is unusable and we shouldn't buy them?


If your workload consists exclusively of maximally parallel AVX2 code, then you probably do want to hold off a bit, yes.

For everything else, though, there doesn't seem to be any cause for concern. It's currently a single stress test in a single very specific configuration that's failing.

For reference the tests that pass:

    MemTest86 v8.3, 4 passes (~8 hours), RAM at 3466 MT/s [PASS]
    MemTestPro 7.0 (paid version of HCI MemTest), 700% (~5 hours), RAM at 3466 MT/s [PASS]
    AIDA64 6.20.5300 System Stability Test (full system test except local disks) [PASS]
    IntelBurnTest v2.54 (based on Intel Linpack), maximum stress level, all available RAM, 20 runs [PASS]
    OCCT 5.5.3 test, large data set, 64 threads, AVX2 [PASS]
    Google stressapptest 1.0.9, all available RAM [PASS]
    Prime95 v29.8 build 6 torture test, 64 threads, min/max FFT = 4K, in-place FFTs, AVX2 (~1 hour) [PASS]
    Prime95, min/max FFT = 8K, in-place FFTs, AVX2 [PASS]
    Prime95, min/max FFT = 16K, in-place FFTs, without AVX2 [PASS]
The tests that fail:

    Prime95, min/max FFT = 16K (in-place FFTs or not), with AVX2 [INSTANT FAIL]

Note that an AVX2 workload is in both the passing and failing camps. So it's not as clear-cut as "AVX2 doesn't work".


The link is in response to @DerAlbi, who suspects the VRM of his Gigabyte mobo, not the CPU. DerAlbi is technical enough to use an oscilloscope, so he's more likely to have a clue than many.

Read DerAlbi’s (recent) reply on thread:

https://forum.level1techs.com/t/3970x-prime95-stability/1532...

I.e. if an AMD CPU suits your needs, you should get one (unless you are doing Prime95 as a load!), and perhaps avoid a MoBo with that VRM. Or wait a couple of days for the truth to percolate up through the murk.


When it becomes reachable through WASM or browser JavaScript optimization (if it isn't already), that seems like a big problem (easy DoS, probably a tricky RCE). Ideally browser vendors will hold off on that; besides this, there are the Intel throttling issues for some AVX2 instructions that could be an annoyance. But AVX2 is good for benchmarks...


I'm not sure why I'm being downvoted. This is a legitimate question. Does this bug make the cpu unusable? Does the work around for it slow down the per core performance to the point of unusability?


Whether the issue is with the CPU hardware, the mainboard design (VRM, etc), mainboard BIOS, kernel, or the Prime95 app itself still appears to be an open question.

Based on oscilloscope analysis of the VRM output in a linked thread elsewhere in the comments it looks like the board’s VRM design, or its configuration by the board’s BIOS, may be the most likely suspect.

But there are less-researched reports of similar issues on other boards as well, which makes things a bit more murky.

Given the uncertainties there it may put some people off from buying into the TR/sTRX40 platform in general. But to offer a blanket recommendation to avoid is a bit premature.


It's a little early to start condemning things. It might be motherboard power delivery, a prime95 bug or something else. If it is the CPU there could be a firmware correctable flaw. This is the bleeding edge; these devices have only been available for 13 weeks and someone is indeed bleeding, it seems.

It could even be an uncorrectable flaw in the device design, in which case AMD will likely do as they did in 2017: replace the parts.


> This is the bleeding edge; these devices have only been available for 13 weeks [...]

Shouldn't these devices have been tested by AMD and motherboard manufacturers for months before they were released? Including on workloads like Prime95 which are well known to uncover system instabilities.

Keep in mind also that this is not the first time in Zen/Zen 2 history that AMD has shipped buggy products: there was the segfault bug, the random number generator instruction bug, and Threadrippers that were unbootable on Linux kernels that were recent at the time of release. The worst part, to me personally, is that they were all "surface bugs" that could have been detected by AMD with a little bit of testing.

I've been wondering why Supermicro isn't releasing any Threadripper motherboards (while they do for Xeon-W). Maybe this explains it: the consumer CPU business at AMD is a circus and Supermicro wants none of it?


Unless it shows up in other workloads, or your workload is specifically Prime95, it doesn't make it unusable, it just means you can't do Prime95.


I overclocked a Ryzen 3950X on a Gigabyte board and never encountered any problems in the various stress tests I ran. That machine convinced me that AMD's currently marketed consumer CPUs are superior to Intel's offerings, dollar for dollar. (I also benchmarked a comparable Intel chip. Both machines were cooled with a Corsair H100, i.e. liquid pumped through a radiator.)


> So, does this mean threadripper is unusable and we shouldn't buy them?

There's only a couple of datapoints. No official confirmation of an issue. This could be the fault of the CPU, Chipset, specific motherboard, or a bad batch.


Probably the motherboard, or VRM brownout to be more exact. That said, I'm glad I did not pick up a 3970X like I was planning to, yet. AMD is pretty great with its warranty; motherboard manufacturers can be a chore.


It appears this could just be a prime95 bug from reading the comments.


There is a beautiful oscilloscope shot that shows it's a power delivery issue: the voltage drops sharply, and no software bug should cause that.

It appears to be a VRM issue, either a design flaw or incorrect setup by the BIOS.


I ran the same test on my Ryzen 3900X with no issues: MSI X570 ACE motherboard, Seasonic PSU.


Great to see L1T posted here.



