Most (all?) Intel server CPUs in fact decrease clock speed when executing AVX2 (and some other) instructions to keep things a bit more sane. Vlad from Cloudflare wrote about this, specifically about AVX-512, back in 2017: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
Then there is the PROCHOT signal, which is supposed to protect the CPU from getting too hot, but it keeps getting asserted under lopsided AVX2 loads, not because the CPU is too hot but because the voltage regulation gets whacked.
You may wonder: what is an example of an AVX2-heavy load? The big-integer multiplication behind RSA is a good candidate. AES constructions and modes (CBC with SHA, GCM) are implemented with AVX2 and BMI2 as well.
I debugged this problem over a number of years... I replaced the RAM, I replaced the motherboard, I eventually replaced the CPU... still happened no matter what I did. Even if I underclocked the machine and kept the voltages the same, instant crash. It was maddening.
Exasperated, I eventually busted out an oscilloscope and looked at the waveform on the 12V supply to the CPU. When starting the AVX benchmark, there was a huge brownout. That basically explained everything; my power supply essentially turned off when the CPU started drawing a ton of power. I replaced the power supply and got lucky -- it handled it fine and I could run the benchmark. I even got some overclock out of it.
After this whole experience, I've never looked at computers the same way again. You can buy high-spec brand-name components, and it's all just a crapshoot. Maybe your computer won't crash in the middle of an important task. Maybe it will. There isn't much you can do but cross your fingers.
I overbuy power supplies because I think, but have not tested, that it will convince them to run the fan at lower speed. This strategy has resulted in quiet, for sure, but I'm not sure that it gets you reliability.
This theory is sound. High-end power supplies tend not to turn on the fan until reaching 30-50% of their rated maximum output. Fanless power supplies usually use the same internal components as regular high-end models rated for 80-100% more output, just with larger heatsinks. A high-quality oversized power supply with a fan will be silent most of the time, yet can keep itself cool when necessary.
(I have a semi-passive 750W power supply that has spun up its fan exactly once, when I accidentally shorted two of its power rails together and melted some wires while burning out the PCIe slot I was measuring the power draw through. The motherboard and wires had to be replaced, but the CPU, PSU and PCIe card were all fine.)
Spending days trying to chase down an intermittent reboot just to find that your 750W PSU is actually shitting the bed around ~350W is a terribly unsatisfying journey.
These days I don't know if that's entirely true, so I would at least look at reviews to make sure.
Anecdotally, Corsair has always been a top-notch company to me. Even when their products go bad, their support is awesome. I had a set of memory sticks appear to go bad, but I couldn't tell if it was my northbridge or the RAM itself. I was a cash-strapped kid, so I didn't have another machine to test with. Corsair sent out a set of known-good RAM sticks for me to test with, at completely no charge to me. When the test revealed that it was indeed the RAM that was going bad, they offered to let me keep the test sticks they sent me until my replacement came in the mail.
The OEM for most of their PSUs is CWT, but they are "Corsair custom" designs instead of off-the-shelf. Seasonic is only used for relatively few of Corsair's models.
Only one out of ~10 I bought ever failed on me, and that was after 2 years of no usage.
Jonnyguru's last Seasonic review: https://www.jonnyguru.com/blog/2018/07/03/seasonic-prime-ult...
Literally nothing in the bad & mediocre summary sections, and with nothing but recommendations for the entire series:
> Our look at the PRIME Ultra Titanium series has now at last come to a conclusion. We can now definitively say there’s not a bad performer in the bunch, and while all do have some very minor drawbacks I can’t see any reason not to recommend any of them.
Test the RAM in a known stable machine. If it passes an overnight memtest, then replace the flaky machine's PSU.
I avoid XMP, however, and typically manually specify the voltage that is listed in the RAM's spec sheet and run at the speed that the CPU manufacturer recommends. (So the RAM might be sold as 3200 but Intel says 2400 is the max, so I run it at 2400.)
You need 3600 MT/s to match that while keeping the 1:1 sync, which unfortunately means XMP.
Gaming hardware has larger economies of scale, which means more people running the hardware and submitting RMAs. Gamers are also more likely to overvolt their hardware, which has led to most consumer components being overspec'ed for heat and power.
As such, the second revision of a gaming board imo is more refined than a workstation/server board that almost never gets a new revision.
On a more practical level, consumer CPUs usually have higher clock rates and newer architectures, and gaming NVIDIA GPUs are so much cheaper than the enterprise product lines that you can buy multiple. My workflow is faster with the cheap stuff, sometimes twice as fast.
I think you're also overestimating how big the gaming desktop components market is, and underestimating how fragmented it is. For motherboards, the economies of scale apply to the OEM custom motherboards used by Dell, HP, etc., which have far more volume than any one consumer retail motherboard model. You're probably also vastly overestimating how many gaming-oriented motherboards make it past revision 1.0. I just checked Gigabyte's top gaming product line, and out of 75 models listed, all but 8 were revision 1.0. Browsing through a similar number of their most recent mainstream consumer motherboards didn't turn up any that went beyond rev 1.0.
Some gamer-oriented motherboards are over-specced for power delivery and cooling. It's definitely common to find VRM heatsinks that are oversized gaudy sculptures, but that doesn't help product reliability. There are also tons of examples of gamer-oriented boards that advertise a large number of VRM phases, but have obvious shortcomings in their power delivery design once the heatsinks are removed and an EE takes a close look at things. And then there's the motherboard at issue here. When gamer boards are overbuilt, it's not thorough, it's usually just the most visible aspects that get beefed up (and the price).
RAM is a really obvious area where paying a bit extra for workstation-class ECC DIMMs gets you a real reliability advantage, but paying extra for high-speed consumer modules with RGB LEDs and big heatsinks does not.
RGB LEDs are still thankfully pretty uncommon on SSDs, but the handful of examples show that the impact varies from compromised performance to catastrophic instability. Most gamer-oriented SSDs with no LEDs but with flashy heatsinks also fail to outperform or outlive the best pedestrian-looking models.
Enterprise and NAS hard drives are definitely diverging from mainstream consumer hard drives in more ways than simple firmware feature gating. Some of the technologies they're using to drive density beyond consumer HDD needs have a questionable impact on reliability, but filling them with helium does help reliability, and you don't find that on a WD Black or Seagate Barracuda.
"Using AllFrame™ technology, WD Purple™ drives improve video capturing and help to reduce errors, pixelation, and video interruptions that can occur in a video recorder system."
"WD Purple 8TB, 10TB, 12TB & 14TB capacities feature AllFrame AI technology that enables not only recording up to 64 cameras, but also supports an additional 32 streams for Deep Learning analytics within the system."
"WD Purple drives with AllFrame AI technology feature a workload rating up to 360TB/year to support the Deep Learning analytics that is featured in AI capable NVRs."
It also does not test all parts of the processor equally; it's really just slamming the cache and the AVX units. You can be "Prime95 stable" and still crash in other things. It's just not a good test anymore, in multiple respects.
Same thing for Furmark on GPUs. And nowadays GPUs are actually smart enough to realize they're running a power virus and the power management pulls back the power. So again, it doesn't demonstrate anything. Zen2 is supposed to work the same way, if it realizes you're redlining the chip then it should be clocking down somewhat.
On Intel you can also intentionally "triple fault" the CPU, which causes it to reset. The way that works is you tell the CPU that it doesn't have interrupt handlers anymore, then cause a software interrupt. The CPU first tries to call your interrupt handler (which fails, because you told it there are no handlers); the error during interrupt delivery then raises a "double fault" exception that's supposed to handle such errors (but once again, there's no handler), so now you're in a "triple fault" condition, and the only way to recover is to reset, so that's what the CPU does.
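For the curious, a minimal bare-metal sketch of that sequence (hypothetical, GCC/Clang on x86-64; lidt is privileged, so this only works from ring 0, i.e. kernel or boot code, never a normal program):

    #include <stdint.h>

    /* Descriptor for the lidt instruction: a zero-limit IDT tells the
     * CPU there are no valid interrupt handlers at all. */
    struct idtr {
        uint16_t limit;
        uint64_t base;
    } __attribute__((packed));

    static void triple_fault(void) {
        struct idtr empty = { .limit = 0, .base = 0 };
        __asm__ volatile ("lidt %0" : : "m"(empty)); /* forget all handlers */
        __asm__ volatile ("int $3"); /* fault -> double fault -> triple fault -> reset */
    }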
So what you are really saying is Intel made another defective product.
AMD is subject to electromigration as well. I've seen a couple guys on Reddit already degrade their Zen2 chips from pushing 1.4V through them just during normal usage, not even Prime95. Literally weeks of usage.
Anti-Intel circlejerk ahoy.
Would you buy a GPU that you could safely run Furmark on 24/7 if it was limited to half or 3/4 of the clocks? That's basically what you're proposing. I seriously doubt that you would actually lay money down for that shit. The market as a whole certainly wouldn't; they would buy the card that's faster under actual real-world workloads and/or didn't require manual overclocking to achieve its potential in real-world workloads.
The good news is that you can have it both ways. Modern power management is smart enough to figure out when you're running a power virus and it will downclock to those safe levels, while us normies can get competitive clocks when running our real-world code. It works pretty well most of the time.
Given that modern GPUs have complex frequency scaling algorithms that take temperature and power usage into account when setting clock speed, there's a good chance that a modern GPU will do exactly that.
> Modern power management is smart enough to figure out when you're running a power virus and it will downclock to those safe levels, while us normies can get competitive clocks when running our real-world code.
Which is perfectly fine, as long as the clock speed behavior is disclosed (which it is).
What isn't fine is a product running at such a high speed from the factory that running certain workloads damages it.
I believe that's where the issue lies. Overclocking usually also means increasing the voltage.
Yes, you can run it, and it should run successfully. Yes, sustained testing will rapidly damage your processor and cause notable performance degradation (reduced maximum overclocks / increased required voltages) within a matter of days. And yes, you can still crash in Aida64 or other stress tests, because it's not testing the whole processor very well.
It's far outside the normal current draw, even compared to other AVX2 loads like video encoding. It's an AVX load that fits entirely inside the instruction cache and hammers the AVX2 units nonstop at the maximum possible rate, with absolutely no other instructions going to any of the other units. No other workload behaves like that, not even video encoding. CPUs are not designed around that; they're designed for a reasonably normal mix of instructions.
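To illustrate what that instruction mix looks like (a hedged sketch, not Prime95's actual code; the function name is made up, compile with GCC/Clang and -mavx2 -mfma):

    #include <immintrin.h>

    /* A tiny loop that fits in a few cache lines and issues nothing but
     * 256-bit FMAs, so the vector units get essentially no idle cycles. */
    void hammer_avx2(long iters) {
        __m256 acc[8];
        for (int j = 0; j < 8; j++)
            acc[j] = _mm256_set1_ps(1.0f + j * 1e-4f);
        const __m256 m = _mm256_set1_ps(0.5f);
        const __m256 c = _mm256_set1_ps(1.0f);

        for (long i = 0; i < iters; i++)
            for (int j = 0; j < 8; j++)   /* 8 independent chains hide FMA latency */
                acc[j] = _mm256_fmadd_ps(acc[j], m, c);  /* converges to 2.0, stays finite */

        volatile float sink = _mm256_cvtss_f32(acc[0]);  /* keep the work observable */
        (void)sink;
    }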
Yes, it's a good idea to stress your processor a bit more than normal to prove it's stable, but if you make your 'proof charge' too large you will still damage the firearm.
Furthermore, it should be able to do this 24/7 for years at a time.
This test, assuming it actually kills the processor or makes it stop executing, clearly indicates it should be returned.
Overclocking? Yeah, it'll kill it quick, and that's your problem, because you're operating it outside normal specifications (more voltage than designed). If that's your standard, that you should be able to overclock and then redline the chip 24/7 for an unlimited period of time and it should never kill the chip, then you're going to be disappointed.
The fact that Threadripper is having problems is definitely a fault but it's something that you really shouldn't be doing anyway (and it sounds like the problem is the motherboard not the processor).
More generally: this is like arguing that you should be able to buy a Camry, run it redlined 24/7, and get the inevitable engine failures covered by warranty. If the manufacturer finds out you're doing that, they're not going to cover it, because it's not normal usage. Somehow people think it's a reasonable expectation for processors.
This is like the cooling argument all over again - "your cooling system is good enough for normal workloads". No, it must be good enough for permanent full load! Like a CPU under Prime95. If your heatsink lets it overheat, it's defective. You saved a few pennies on the copper and now it's a defective product.
Same for Prime95: the CPU must be able to run it for days on end and last ~5 years. Doesn't seem like too much to ask, imo. Intel says their chips will run at 98 degrees for years, so why would a full load like Prime95 kill them in days?
One other thing, overclocking these days is actually undervolting combined with raising clocks. The old school "just raise the voltage" is a bad way to do it on any Intel chips post-Sandy Bridge...
If you overclock, you're responsible. Losing warranty sounds acceptable to me. Better than manufacturers doing everything to make overclocking impossible.
To me, it's a great way to get more performance out of an old chip, the warranty would be expired at that point.
It is hard to prove a CPU was overclocked though, so I can see some people taking advantage of that.
Nowadays it's just "up to XX GHz", and people expect it to run at those clocks.
AVX really pushes the CPU; an Intel Core/Xeon will hit the power draw limit pretty fast. They don't decrease clock speed so much as fall back to base clocks. The ones you paid for. Anything above that is just a bonus.
That said, you should never trigger PROCHOT even under full stress load with AVX! If you do, you need better cooling.
It's a last resort throttling feature for when your processor is hot enough to boil water :/
I still don’t understand what it is about AVX2 that results in these kinds of issues - is it really just a matter of the increased number of execution units running at once causing weird power and heat issues?
That's been the root cause of all the previous AVX weirdness I'm aware of. It's probably the case for these Threadrippers, too. Modern CPUs have very high power density and run at pretty low voltage, which results in insane current delivery requirements.
A normal multiply does 1 number at a time (32-bit, for simplicity). AVX2 can do up to 8 multiplications at a time. That's far more transistors firing all at once, and that causes the voltage to start to droop.
AVX-512 takes that even further and now it's 16 multiplications per unit, oh and Intel moved from 1x256-bit unit per core on Haswell/Broadwell to 2x512-bit units on Skylake-X, so it's potentially 32 multiplications at a time - 4x as much as AVX2.
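A minimal intrinsics sketch of that width difference (assuming GCC/Clang with -mavx2; the function names are mine):

    #include <immintrin.h>
    #include <stdint.h>

    /* Scalar: one 32-bit multiply per instruction. */
    int32_t mul1(int32_t a, int32_t b) { return a * b; }

    /* AVX2 (vpmulld): 8 32-bit multiplies per instruction, i.e. roughly
     * 8x the multiplier transistors switching in the same clock cycle. */
    void mul8(const int32_t a[8], const int32_t b[8], int32_t out[8]) {
        __m256i va = _mm256_loadu_si256((const __m256i *)a);
        __m256i vb = _mm256_loadu_si256((const __m256i *)b);
        _mm256_storeu_si256((__m256i *)out, _mm256_mullo_epi32(va, vb));
    }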
Basically to prep for that much power being drawn all at once, the chip has to switch to a higher-voltage mode to account for the voltage droop caused by all those transistors switching at once in one place. It takes time for the regulator (on-chip Fully Integrated Voltage Regulator or motherboard VRM) to wind up the voltage far enough.
At this level, behavior is intensely analog: thermals and voltage both significantly affect transistor current draw and switching time, which feeds back into thermals and power consumption/voltage droop.
This gets even more problematic on 7nm/10nm type nodes and especially in GPUs where you are doing a huge amount of vector arithmetic all the time. Essentially it is no longer possible to design processors that are 100% stable under all potential execution conditions, or even under normal operating conditions, so you have to have power watchdog circuitry that realizes when it's getting close to brownout/missing its timing conditions and slows itself down to stay stable. That's why AMD introduced clock stretching in a big way with Zen2 (despite the fact that it's nominally been around since Steamroller). NVIDIA piloted this with Pascal, AMD piloted it with Vega and brought it to CPU with Zen2. You simply cannot design the processor to be 100% stable at competitive clocks anymore. You have to have power management that's smart enough to withstand small transient power conditioning faults.
Edit: To be clear, the i5 9600K is sold as 3.7 GHz with boost up to 4.6 GHz on a single core. So there is a difference from the AMD case, in that this doesn't happen at the settings Intel sells it at.
At some point it does stop being worth it though, because the power/voltage implications of 5 GHz AVX are so severe/potentially damaging to the chip. It is a lot of current and current kills chips. SiliconLottery does all their validation with a 200 MHz offset.
But yes, if you can get higher clocks at least some of the time then you might as well.
The other downside is that flipping between power states can cause problems/crashes too. It shouldn't, but it can.
Either a slight background workload (Windows seems to be trying to use half a core for an OS update) resolves this, or this board does not have a broken power design?
If you look at just the list of issues fixed in v29.x series there is more than one that is close to the code path that is problematic here:
I'm on a 3900X.
It seems to be a power delivery issue, and fortunately fixable if you disable all spread-spectrum and VRM power-saving options, but the Zen series seems a tricky beast.
I've had machine crashes triggered by using the "wrong" CPU scheduler under Linux. It's amazing, in a horrible way.
To my knowledge, AMD hasn't implemented any proactive measures that are as severe as Intel's AVX512 strategy, where the instructions get split and handled by the narrower vector units for a surprisingly long time while powering up the full-width vector units (and dropping CPU clocks). This AMD instability is only using AVX2: 256-bit SIMD rather than 512-bit. But spread across so many cores, that's still a lot of FPUs to be lighting up.
(specifically: https://travisdowns.github.io/assets/avxfreq1/fig-volts256-1... )
> So one theory is that this type of transition represents the period between when the CPU has requested a higher voltage (because wider 256-bit instructions imply a larger worst-case current delta event, hence a worst-case voltage drop) and when the higher voltage is delivered. While the core waits for the change to take effect, throttling is in effect in order to reduce the worst-case drop: without throttling there is no guarantee that a burst of wide SIMD instructions won’t drop the voltage below the minimum voltage required for safe operation at this frequency.
(Ok, it actually ramps up frequencies slower rather than lock out cores, but the effect is the same.)
A very interesting watch if you're interested in electronics in general and in power delivery in particular (the whole YouTube channel is awesome to be honest).
Do they just not trust compiler authors to be able to emit these kinds of hints? Is there too much legacy code that would come without the hints for it to be worthwhile? Or would it just not be worth it, given how often OS context switches could drop the CPU directly from regular code in one process into AVX2 code in another?
> I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector. The result is that the throughput for 256-bit vectors is 4-5 times slower during this warm-up period. If you know in advance that you will need to use 256-bit instructions soon, then you can start the warm-up process by placing a dummy 256-bit instruction at a strategic place in the code. My measurements showed that the upper half of the units is shut down again after 675 µs of inactivity.
Agner also notes that previous generations would just flatly stall all instructions (including non-vector ones) once they saw 256-bit instructions, which is probably why there is no "warmup" instruction: if there's an explicit stall of all instructions anyway, there are very few reasons to incur it until the last possible moment.
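A hedged sketch of the warm-up trick Agner describes (the function name is mine; assumes GCC/Clang with -mavx):

    #include <immintrin.h>

    /* Issue a throwaway 256-bit op well before the real vector work, so the
     * upper halves of the execution units are powered up by the time the hot
     * loop starts (Agner measured ~14 us to warm up, ~675 us to power down). */
    static inline void avx_warmup(void) {
        volatile __m256 dummy = _mm256_set1_ps(1.0f); /* dummy 256-bit instruction */
        dummy = _mm256_add_ps(dummy, dummy);          /* volatile keeps it from being elided */
        (void)dummy;
    }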
But, if such "hint" ops were seen as the only viable way to get correct, high-performance behavior in these cases, and there was a commercial case to be made for it, I could see processor vendors adding them to the IA.
Another approach I haven’t seen from either Intel or AMD yet, is to copy ARM’s big.LITTLE architecture: to set up separate “AVX2 cores” that can execute AVX instructions and most basic ALU ops but not e.g. branches, put those on the other side of the die with in-wafer thermal insulation between them and the regular cores; and then throw workloads between regular and AVX2 cores in a way where the AVX2 cores heating up doesn’t mean that the non-AVX2 cores are heating up, and the CPU can go back to full Turbo Boost as soon as the workload is thrown back to the regular cores, because the respective regular core is actually quite cool.
(The performance effects of this are possible to loosely estimate, I think, by writing code that synchronously sets up a GPGPU pass on some data, executes it, retrieves the result, and then returns to executing CPU instructions for a while, in a loop; and then executing this code on an Intel CPU using its on-die IGPU as the GPU. The CPU and IGPU form somewhat-separate thermal domains already, though there’s no explicit insulation.)
Sorry to be pedantic, but... physics requires a degree of pedantry.
1. This is probably an ENERGY issue, not a power issue. Energy is the integral of power (or power is the derivative of energy: power is the change of energy over time).
2. Electronics can only really measure voltage easily. Everything else is converted to voltage to be measured. For example, to measure current, you insert a shunt resistor and measure the voltage across it. To measure power, you measure voltage and multiply it with current. To measure energy, you take the integral of your power measurement. (No joke: that's how it typically works.)
3. With a capacitor, you can measure your energy level. Capacitors are relatively inaccurate however (+/- 10% tolerances at the electronics level, and probably worse tolerances at the nanometer chipscale level). The voltage from a capacitor does roughly correlate to its energy level.
Issue #1: Capacitors suck at energy storage, but they're all that engineers have at these speed and sizes.
#2: Voltage is the only thing that can be measured reliably, but it takes a long time (more than a few nanoseconds, aka 10s or 100s of clock cycles) to measure voltage. Measuring other variables, such as "Energy" (voltage level of a capacitor), or "Power" (change of energy over time, or voltage * current), takes even longer to measure.
#3: Computers are fast. 4GHz computer means that you have 0.25 nanoseconds to make a decision per clock tick. Waiting on the results of any calculation on #2 will naturally take dozens or hundreds of clock ticks to accomplish.
#4: That's the best-case scenario, assuming the capacitors are large enough to actually hold the energy you need to perform your calculations. But in an AVX2 situation, where your CPU suddenly uses more power (energy/time) than expected, you need to tell your power circuitry to increase the voltage or current delivered to the chip.
#5: Sending commands to the power circuits / VRMs, or worst case to the Power Supply, can take many milliseconds (millions of nanoseconds) before the Power Supply responds. A 200W chip at 1.25V will draw 160 Amps of current (0.160 coulombs of electrons per millisecond). A 1 Farad capacitor will lose 0.16 volts per 0.16 coulombs of electrons drawn (worked numbers after this list), and hint-hint, you can't physically fit a 1-Farad capacitor in a chip (https://qph.fs.quoracdn.net/main-qimg-dddab4aa574936f8a61437...).
#6: This means that the only reasonable approach with today's technology is to wait for the VRMs or Power Supply to respond. Note: The Power Supply will naturally notice the voltage drop and automatically send more electrons down to the motherboard. But the time-scale the PSU operates on is on the order of milliseconds. A millisecond is 1 million nanoseconds, or ~4 million clock cycles AFTER the AVX2 units started getting used. If you run out of energy before the power gets to your chip... guess what? You get a brownout. Random cells in your chip have now turned off and the state of the CPU is a jumbled mess.
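The arithmetic in #5, worked through (same assumed numbers: 200 W at 1.25 V, a ~1 ms supply response, and an impossibly generous 1 F of capacitance):

    I  = P / V  = 200 W / 1.25 V   = 160 A
    dQ = I * dt = 160 A * 1 ms     = 0.16 C
    dV = dQ / C = 0.16 C / 1 F     = 0.16 V

So even with a fantasy 1 F capacitor, the rail sags 0.16 V (about 13% of 1.25 V) while waiting one millisecond for the PSU to react; real on-package capacitance is many orders of magnitude smaller.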
Where, how, and why to place capacitors around a chip is one of the most difficult PCB-issues an electronic engineer deals with. There's a LOT of "rules of thumb", making PCB-design / capacitors a mystic subject matter for many EEs.
Given the difficulty of the subject of power-delivery, energy calculations, and high-speed signals involved (a chip operates at the GHz band, drawing ~100 Amps on the clock-tick, and then 0 Amps between clock ticks!), I'm almost certain that there's a mistake in the capacitors somewhere.
It's a mystic subject that's very difficult to simulate, especially across varying workloads (AVX vs normal 64-bit code, etc. etc.). That's my bet for what's going wrong here.
Honestly, power-delivery is so awesome and complicated. I'm almost surprised that anything works at all these days.
I have my doubts about this being a power issue, because localized power analysis is pretty advanced in IC design and they'd totally have tested for this.
Assuming it is a power issue though, nearly all power is used in dynamic switching, so simply inserting stall cycles will resolve it.
Power rails on silicon chips have quite a lot of capacitance, at least enough for many clock cycles, so you don't even need to make every cycle in every bit of hardware a possible stall cycle - it would be enough to simply gate instruction issue I expect and let functional units drain.
But I wonder if this is a Gigabyte issue; they have a history of playing around with power delivery and using "fake phases" (to the point where they now have to advertise their boards as having 16 "real phases"). As far as I can tell many (most?) reviewers benchmarked the board with 24-core CPUs and most likely skipped the power-intensive tests.
Everything in computer science is such a rabbit hole, great field to be in.
It was the first 'real' scope I bought and I still use it a fair amount despite having upgraded to a Rohde and Schwarz MSO model. I've been amazed how much I use an oscilloscope after getting one, from measuring ripple on power supplies to diagnosing serial communication issues.
It's interesting how Rigol did the locking of those extra features. The unlock key for a feature set for your scope has to be signed by a Rigol private key using an elliptic curve signature system.
But they are only using a 56 bit private key. That was quickly brute forced, and key generators proliferated.
They used a good library for the cryptography stuff, and except for the short key seem to have used it well and knew what they were doing. This suggests that the choice of a weak key was deliberate.
Each family of scopes has its own private key. As new families came out with new private keys, Rigol continued to use 56 bits. When major firmware upgrades came out in existing families, where they could have easily changed to a longer private key, they kept the same 56-bit key that was by now widely circulated on the net.
It seems pretty clear that they are not interested in stopping people from free unlocking.
Any recommendations for learning resources that could help with understanding DC power supply analysis for non-EE types? While refurbishing laptops and working with microcontrollers I’ve run into some odd things where ruling out transient power supply issues would probably be helpful.
As for learning resources, I came across a decent article on the subject when I was starting out (1), and most of the oscilloscope manufacturers have whitepapers on SMPS diagnostics; the Tektronix one I read a while back (2) gave a good overview. A lot of the whitepapers have a manufacturer-specific focus, but they still have good information that can be applied to almost any oscilloscope.
If you want to get really into the power supply and do high-side measurements you'll need an isolated differential probe, which can cost as much as an inexpensive oscilloscope, but for DC output measurement you shouldn't need anything special. Current probes are a lot more affordable if you're interested in looking at loads or current fluctuations/harmonics, but that's more useful after you've figured out a bit more what specific properties you're trying to measure.
Edit: I forgot to mention that the EEVblog forums are a good resource also, but they sometimes aren't as friendly as they could be towards people just starting out.
p.s. lovely/lively oscilloscope shots!
If these are on by default, are they required for FCC certification conformance? Presumably the device has not been EMI/EMC tested in the mode where spread-spectrum was disabled.
The frequencies are up in the microwave band, so I can't imagine there's much range.
If it was designed well ;)
But that's only effective for addressing radiated emissions, not conducted ones.
Or is it simply that delivering 200+W at load through a CPU socket can't be reliably done at consumer prices?
Has anyone had this problem with less high-end CPUs? Something at 65-95 W?
The "CPU defective by design" in the title might be a bit misguided since the suggested workarounds do not address a CPU issue but a motherboard one.
My motherboard is the highest-end consumer motherboard GIGABYTE has ever built (as far as I know), and I wouldn't say that they tried to keep costs down (it's listed at $849 on PCPartPicker, and it was introduced at $999), and the power delivery stage is insane. Here's BuildZoid's _in-depth_ review and analysis of its VRM: https://www.youtube.com/watch?v=HMUWzDSAS9c
I now literally go out of my way not to get the flagship motherboard models, even if it means holding off on a purchase until a lower-spec'd model comes out, and have never regretted it since. (I also will never again buy Gigabyte motherboards, either.)
Just because it's high end doesn't mean they won't cut corners or fail to test properly. My ROG motherboard was (at the time) the most expensive motherboard for the socket. Yet it behaved overall worse than many mid/low-end motherboards I owned, an experience shared with other owners of the same board. Even if this is a problem with the CPU that is mitigated by changing parameters of the motherboard it should have been caught during the motherboard design and testing. I can't imagine Gigabyte's engineers noticing this stuff and saying "just ship it like that, nobody will notice". So the best interpretation I can have is that they missed it during testing (the worst is that the marketing department said "we have to put it out there fast, all else be damned", and all else was damned).
High end in consumer stuff, and especially anything gaming-related, is basically a scam... look mah, 16000 DPI! And blingy LEDs! (But no engineering.)
Sure, they might drop an over-specced part somewhere in it, but it's just marketing when the rest of the product is crap and still has no proper engineering behind it.
It's incredibly frustrating. I think there was a time when you could generally assume that expensive = high end = actually good, but now it's just a cheap thing with crazy markup and a premium part (but nowhere near premium enough to justify the markup) or two somewhere (where it probably doesn't matter much anyway) along with other gimmicks. Now it's just expensive = expensive, good or maybe not.
Google's stressapptest runs fine for long durations, building a kernel with make -j32 succeeds (and the kernel boots), every parallelized archiver like 7z, pbzip2, pigz, and xz is OK, and even gaming on a Windows VM using 8c/16t + GPU passthrough works well.
This is my first Ryzen system so I chalked it up to a possible carry over of DDR4 issues from earlier generations. I didn't investigate further since the 3950x had just been released and I couldn't find any other reports of prime95-only instability until now. And just to reiterate, it is perfectly stable otherwise.
A heavy AVX2 workload breaking things fits better, so I'll have to try to collect more data.
Then we get the R480 melting cables for the same reason... and so on.
In this case, AMD wanted to keep using a 6-pin cable despite it not being appropriate; on the 380 it just slowed the card down.
On the 380X, it is unstable. It still doesn't draw enough to melt things, but you need hackery to make the GPU usable.
With the 480 they outright made a GPU that used more power than the cable specs would allow, and insisted on using the same power delivery design the 380 had... Their "fix" for the issue was patches that just make the GPU run slow and misbehave like the 380X does.
RX480 wasn't melting cables, it was melting motherboards. The stock VBIOS was pulling more power than the PCIe spec allowed you to pull from the slot. It was pulling over 100W, vs the 75W spec, not massive but probably enough to push some older/weaker boards over the edge (most of which were just ready to fail anyway and were going to fail from a solid 75W draw too).
It's pretty hard to melt a cable. You can overdraw the connectors by roughly twice their rated power safely, and the cables will do more than that as long as you're not using splitters or some other hack.
The 295x2 pulled almost 500W through a pair of 8-pins and the slot (nominally rated for 375W total: 2 x 150W + 75W) while overclocking.
Part of it was likely that NVIDIA's performance was so good. Some anecdotal rumors from the AMD vlogosphere suggest that AMD thought the RX 480 would be competitive with the GTX 1080. I consider this kinda dubious, because the RX 480 is basically a GTX 980 tier chip, so it would mean AMD thought NVIDIA wouldn't make any progress at all from a node shrink. Maybe they figured the RX 480 would be a lot faster or more efficient than it actually ended up being.
Polaris was the last generation with "old style" voltage control where you just set a target and go, it is possible that they figured the chips would hit close to 2 GHz like NVIDIA's but ended up with validation problems and didn't hit the expected clocks. This could possibly be the reason they nuked Big Polaris and ended up lengthening the pipeline so much in Vega to try and get the clocks up (along with adding the Pascal-style "smart" power conditioning management).
Anyway, regardless, the point is that it seems likely that AMD was forced to push the RX 480 much farther than they intended to, at the last moment. Like, after the PCBs had already gone out for manufacturing, and it was too late to switch the 6-pin to an 8-pin (which would have solved the whole problem, that would have allowed 225W rated total board power).
It was still a bit of a showoff move to run from only a 6-pin, there is no way that the card would have been significantly less than 150W total board power, but perhaps defensible as a compatibility move - although some marginal PSUs still might not have handled it.
It could be that the spike is the problem, not the end load. The scenario in play here is going from entirely idle to maximum load in an instant, so the CPU shoots straight from ~20-30W to 280W (or even higher if the power management is a bit sluggish at reducing the clocks). That's a pretty drastic swing if anything isn't entirely up to spec, assuming the spec even handles this properly.
In theory the 3970X could even spike all the way up to ~430W. The individual cores top out at 13.5W, so if all 32 of them end up running at single-core turbo frequencies even for a split second (32 x 13.5W ≈ 432W), that's going to be brutal.
> Has anyone had this problem with less high-end CPUs? Something at 65-95 W?
My 3700X in an old X370 board has been flawless. As have many others'; Ryzen 5/7/9 are _widely_ recommended and have been for a while. Any systemic issues in the "regular" consumer end would have cropped up by now.
> AMD has confirmed this issue doesn't affect EPYC or Threadripper processor
The CPU Michael was using was a Ryzen 7, not a Threadripper. But I don't know if TRs were later found to have the same problem, which wouldn't surprise me.
EPYC was on a different stepping entirely and didn't share dies with the consumer or HEDT processors. I suspect they likely had the tighter cache timings although I'm not sure on that.
Regardless, it's also possible that they were just binned out, or nobody ever encountered it. EPYC was not a very high-volume product; most people did test installations and said "yeah, we'll wait for Rome". Quad-NUMA on a package and lower-than-Intel IPC on a pretty poor node was not a winning formula.
Once the problem was realized, desktop Ryzen processors started getting binned for it as well. A few slipped through here and there, so it wasn't a manufacturing change, I think binning is the most likely explanation.
With AVX2 enabled, Prime95's torture test is only stable when I use 3 workers or less. With 4 workers one of them will abort due to an error within 20 seconds. The more workers, the sooner a crash; with 5 workers it happens within 10 seconds, and with 6 workers it happens within 2-3 seconds.
If I play with the tests on and off for a while, seemingly increasing the quiescent temperature of the CPU, the testing actually becomes a bit more stable. My motherboard uses the B450 chipset.
Add.: OK, I finally found a couple of "Precision Boost Overdrive" menus deep down in the overclocking settings of this board's BIOS, and I've disabled everything PBO I can find while letting the "core performance boost" remain in auto mode. Without doing any extended testing this seems to have solved the problem, as I can now let Prime95 run through the AVX2 code path with 6 workers without any crashes. Thanks for the hint!
I know air coolers can be competitive, but it says right on the outside of the 2950X box that you should use liquid cooling.
I have a machine running a 1950X, and I get random ffmpeg segfaults anywhere from six to eight hours into an encoding session with all 16 cores fully loaded, but the machine is Prime95-stable for a week+, so I suspect it's an AVX/AVX2 issue.
Not the easiest one to configure though.
(Not sure which is the canonical source...)
I've run it for 5 hours (20 iterations) at the maximum stress level, but it passed. In my limited experience, it's less punishing than Prime95.
To be clear, I'm not the one who put a scope on their motherboard; that was another 3970X owner. I wish I had a scope! (The Rohde & Schwarz RTB2000 is my dream!)
Either way, you might consider bringing this to the attention of the ffmpeg developers. If they don't have a fix already, they may appreciate your help in root-causing the bug.
For everything else, though, there doesn't seem to be any cause for concern. It's currently a single stress test in a single very specific configuration that's failing.
For reference, my test results:
MemTest86 v8.3, 4 passes (~8 hours), RAM at 3466 MT/s [PASS]
MemTestPro 7.0 (paid version of HCI MemTest), 700% (~5 hours), RAM at 3466 MT/s [PASS]
AIDA64 6.20.5300 System Stability Test (full system test except local disks) [PASS]
IntelBurnTest v2.54 (based on Intel Linpack), maximum stress level, all available RAM, 20 runs [PASS]
OCCT 5.5.3 test, large data set, 64 threads, AVX2 [PASS]
Google stressapptest 1.0.9, all available RAM [PASS]
Prime95 v29.8 build 6 torture test, 64 threads, min/max FFT = 4K, in-place FFTs, AVX2 (~1 hour) [PASS]
Prime95, min/max FFT = 8K, in-place FFTs, AVX2 [PASS]
Prime95, min/max FFT = 16K, in-place FFTs, without AVX2 [PASS]
Prime95, min/max FFT = 16K (in-place FFTs or not), with AVX2 [INSTANT FAIL]
Read DerAlbi’s (recent) reply in the thread:
I.e. if an AMD CPU suits your needs, you should get one (unless you are doing Prime95 as a load!), and perhaps avoid a MoBo with that VRM. Or wait a couple of days for the truth to percolate up through the murk.
Based on oscilloscope analysis of the VRM output in a linked thread elsewhere in the comments it looks like the board’s VRM design, or its configuration by the board’s BIOS, may be the most likely suspect.
But there are less-researched reports of similar issues on other boards as well, which makes things a bit more murky.
Given the uncertainties, it may put some people off buying into the TR/sTRX40 platform in general. But a blanket recommendation to avoid it is a bit premature.
It could even be an uncorrectable flaw in the device design, in which case AMD will likely do as they did in 2017: replace the parts.
Shouldn't these devices have been tested by AMD and the motherboard manufacturers for months before they were released? Including on workloads like Prime95, which are well known to uncover system instabilities.
Keep in mind also that this is not the first time in the Zen/Zen2 history that AMD has shipped buggy products: there was the segfault bug, the random number generator instruction bug, and the unbootable Threadripper on recent (at the time of release) Linux kernel. The worst part, to me personally, is that they were all "surface bugs" that could have been detected by AMD with a little bit of testing.
I've been wondering why Supermicro isn't releasing any Threadripper motherboards (while they do for Xeon-W). Maybe this explains it: the consumer CPU business at AMD is a circus and Supermicro wants none of it?
There are only a couple of data points and no official confirmation of an issue. This could be the fault of the CPU, the chipset, the specific motherboard, or a bad batch.
It appears to be a VRM issue, either a design flaw or incorrect setup by the BIOS.