
AMD Threadripper 3970X under heavy AVX2 load: Defective by design? - franzb
https://forum.level1techs.com/t/amd-threadripper-3970x-under-heavy-avx2-load-defective-by-design/153883
======
bdd
“Unable to perform AVX2 instructions correctly under heavy load” is also a
common “WTF Intel!?”–inducing phenomenon. I’m certain SREs who work at
companies with more than 1 million servers have a bunch of hair pulling
stories.

Most (all?) Intel server CPUs in fact decrease clock speed when executing AVX2
(and some other) instructions to keep things a bit more sane. Vlad from
Cloudflare wrote about this, more specific to AVX-512 back in 2017:
[https://blog.cloudflare.com/on-the-dangers-of-intels-
frequen...](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-
scaling/)

Then there is the PROCHOT signal, which is supposed to protect the CPU from
getting too hot but keeps getting raised under lopsided AVX2 loads, not because
the CPU is too hot but because voltage regulation gets whacked.

You may wonder what an example of an AVX2-heavy load is. RSA multiplication is
a good candidate. AES constructions or modes (CBC with SHA, GCM) are
implemented in AVX2-BMI2 as well.

~~~
jrockway
I ran into this issue on one of my builds. Aida64 has a benchmark (floating
point photo or something?) that uses AVX instructions. Pressing the "run
benchmark" button would instantly black-screen crash my machine with 100%
certainty.

I debugged this problem over a number of years... I replaced the RAM, I
replaced the motherboard, I eventually replaced the CPU... still happened no
matter what I did. Even if I _underclocked_ the machine and kept the voltages
the same, instant crash. It was maddening.

Exasperated, I eventually busted out an oscilloscope and looked at the
waveform on the 12V supply to the CPU. When starting the AVX benchmark, there
was a huge brownout. That basically explained everything; my power supply
essentially turned off when the CPU started drawing a ton of power. I replaced
the power supply and got lucky -- it handled it fine and I could run the
benchmark. I even got some overclock out of it.
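
For anyone who wants to check a scope capture for this kind of dip, here is a
minimal Python sketch. It assumes the capture has been exported as a list of
voltage samples taken at a fixed sample period. The function name, the 11.4 V
default threshold (12 V minus 5%) and the sample trace are all assumptions made
up for illustration, not measurements from my machine.

```python
def find_brownouts(samples_v, sample_period_s, threshold_v=11.4):
    """Return (start_time_s, duration_s, min_voltage) for each dip below threshold_v.

    threshold_v defaults to 12 V minus 5%, an assumed limit; adjust to taste.
    """
    dips = []
    start = None    # index where the current dip began, or None if not in a dip
    lowest = None   # lowest voltage seen in the current dip
    for i, v in enumerate(samples_v):
        if v < threshold_v:
            if start is None:
                start, lowest = i, v
            else:
                lowest = min(lowest, v)
        elif start is not None:
            # Voltage recovered: close the dip that was in progress.
            dips.append((start * sample_period_s, (i - start) * sample_period_s, lowest))
            start = None
    if start is not None:
        # Capture ended while still below the threshold.
        dips.append((start * sample_period_s,
                     (len(samples_v) - start) * sample_period_s, lowest))
    return dips


# Synthetic trace (not real data): one dip lasting two samples.
trace = [12.0, 12.0, 10.5, 9.8, 11.9, 12.0]
print(find_brownouts(trace, sample_period_s=1.0))
```

A real capture would have far more samples and a much shorter sample period,
but the idea is the same: a dip that lines up with the start of the benchmark
points at power delivery rather than at the CPU or RAM.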

After this whole experience, I've never looked at computers the same way
again. You can buy high-spec brand-name components, and it's all just a
crapshoot. Maybe your computer won't crash in the middle of an important task.
Maybe it will. There isn't much you can do but cross your fingers.

~~~
dman
If there is one thing in a computer you should not cheap out on, it is the
power supply. Buy from Seasonic and buy more power supply than you currently
need.

~~~
zymhan
You say not to cheap out, and then recommend Seasonic. I'd say avoid a brand
that was notoriously the cheapest option in recent history.

~~~
ricketycricket
At the very least, Anandtech seems to rate Seasonic very highly:
[https://www.anandtech.com/show/11252/the-seasonic-prime-
tita...](https://www.anandtech.com/show/11252/the-seasonic-prime-titanium-
power-supply-review)

~~~
kllrnohj
It's not just Anandtech. All reviewers have nothing but praise for Seasonic.
They have the best reputation in the industry at the moment and all their PSUs
have a 10 year warranty to back it up.

Jonnyguru's last Seasonic review:
[https://www.jonnyguru.com/blog/2018/07/03/seasonic-prime-
ult...](https://www.jonnyguru.com/blog/2018/07/03/seasonic-prime-
ultra-750-titanium-power-supply/6/)

Literally nothing in the bad & mediocre summary sections, and with nothing but
recommendations for the entire series:

> Our look at the PRIME Ultra Titanium series has now at last come to a
> conclusion. We can now definitively say there’s not a bad performer in the
> bunch, and while all do have some very minor drawbacks I can’t see any
> reason not to recommend any of them.

~~~
dispat0r
They actually have a 12-year warranty, at least for the PRIME TX ones.

------
t0mas88
Most performance motherboards with Intel unlocked K models will downclock the
maximum boost when using AVX instructions. The reason is very high power draw
and temperatures. For example, my i5 9600K runs at 5 GHz turbo boost on all
cores but 4.7 GHz when using AVX. If I disable that option, it crashes under
prolonged load such as benchmarks.

Edit: To be clear, the i5 9600K is sold as 3.7 GHz with boost up to 4.6 GHz on
a single core. So there is a difference with the AMD case in that this doesn't
happen at the settings Intel sells it at.

~~~
paulmd
AVX offset is a configurable parameter with unlocked (K- or X- series)
processors. You can run with 0 offset at all but the highest overclocks.

At some point it does stop being worth it though, because the power/voltage
implications of 5 GHz AVX are so severe/potentially damaging to the chip. It
is a _lot_ of current and current kills chips. SiliconLottery does all their
validation with a 200 MHz offset.
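
As a rough illustration of the arithmetic, here is a minimal Python sketch. It
assumes a 100 MHz base clock and an offset expressed in multiplier bins, which
is how boards typically expose it; the function name is made up for the
example.

```python
def avx_clock_mhz(multiplier: int, avx_offset_bins: int, bclk_mhz: float = 100.0) -> float:
    """Core clock while AVX code is running, for a given multiplier and AVX offset."""
    if avx_offset_bins < 0 or avx_offset_bins > multiplier:
        raise ValueError("offset must be between 0 and the multiplier")
    return (multiplier - avx_offset_bins) * bclk_mhz


# The 9600K example above: 5.0 GHz all-core, 4.7 GHz under AVX is an offset of 3.
print(avx_clock_mhz(50, 3))   # 4700.0
# A 200 MHz offset at the same multiplier is 2 bins.
print(avx_clock_mhz(50, 2))   # 4800.0
```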

~~~
leetcrew
there's no real reason to try and hit 0 offset anyways. if you get it stable,
it usually implies you could just increase the multiplier and offset by some
n>0 for greater overall performance.

~~~
paulmd
The problem is that real-world code (including games) often includes _at least
some_ AVX instructions, so the AVX number is often more representative of
"real" performance. The way AMD does it where it's a smooth transition based
on current/thermals is definitely better than Intel's "whoops, AVX
instruction, pump the brakes!".

But yes, if you can get higher clocks _at least some of the time_ then you
might as well.

The other other downside is that flipping between power states can cause
problems/crashes too. It shouldn't but it can.

~~~
leetcrew
agreed, I was only speaking from an overclocker's perspective.

------
ntauthority
My 3970X on the ASRock Taichi (with default settings, generally) does not seem
to reproduce this issue at this time - the system remains operational despite
the FMA3 path being used (I'm assuming this is behind the AVX2 flag? disabling
FMA3 leads to a plain AVX path) while running an all-core test with 16K FFTs
in Prime95.

Either a slight background workload (Windows seems to be trying to use half a
core for an OS update) resolves this, or this board does not have a broken
power design?

~~~
franzb
Thanks for sharing your experience. What version of Prime95 are you using?
Make sure to use the latest one (v29.8 build 6). The only CPU options I see in
the torture test settings are: (1) Disable AVX-512 (grayed out since
unsupported on this CPU), (2) Disable AVX2, (3) Disable AVX. There's nothing
about FMA3.

~~~
ntauthority
Yeah - I explicitly updated to the latest one; the FMA3 setting is one that
existed in prior builds in local.txt so I toggled it off there just to be sure
I was hitting an AVX2 code path (in case it didn't mean the UI saying FMA3 in
each worker window), but it seems to interpret AVX2 as being synonymous to
FMA3 I guess.

~~~
hrgiger
Same here on a 1950X. I built from source, and when I toggle AVX2 it shows
"using type-1"; I don't know what that is:

[https://pastebin.com/tPuYzYC0](https://pastebin.com/tPuYzYC0)

------
Filligree
He's not alone; I've had similar problems with my 3960X.

It seems to be a power delivery issue, and fortunately fixable if you disable
all spread-spectrum and VRM power-saving options, but the Zen series seems a
tricky beast.

I've had machine crashes triggered by using the "wrong" CPU scheduler under
Linux. It's amazing, in a horrible way.

~~~
leeter
I remember when Buildzoid of AHOC did the mobo breakdowns of the TR4 boards
and thinking that while some of the super high end boards were probably good
for this sort of beating, the mid and low range might struggle hard with the
64/128 part if it ever came into being (3990X was just a rumor at that time).
But it looks like they need more caps to handle the transient response time,
and probably also some firmware fixes to slow ramp because I don't think all
the SMD caps in the world are going to handle that sort of ramp. It's just not
possible to get them close enough to the actual CPU without literally putting
them under the IHS.

~~~
derefr
Tangent: is there a reason that CPUs’ instruction sets aren’t designed with
explicit “hint”-ops in them to let a compiler assert “I’m going to execute
some instructions that’re going to draw a lot of power about 1000 cycles from
now, so start ramping up for it now”? That’d basically eliminate what Intel
calls “license-switching” costs.

Do they just not believe in compiler authors to be able to emit these kinds of
hints? Is there too much legacy code that would come without the hints for it
to be worthwhile? Would it just not be worth it given how often OS context
switches could drop the CPU directly from regular code in one process to AVX2
code in another process?

~~~
gruez
Probably because the better approach would be to stall the processor (or at
least stall on power intensive instructions) until the requisite power has
arrived. 1000 instructions is probably too far out (in terms of instructions)
for compilers to accurately add the requisite instructions. Also, it requires
code change, unlike the stalling approach. The only disadvantage to the
stalling approach would be if the workload is constantly switching between low
power and high power, but that can be solved by making the switching interval
longer/shorter in the firmware.
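
To make the switching-interval point concrete, here is a toy Python model. It
is a sketch under heavy assumptions, not how any real CPU firmware works: each
instruction is just flagged heavy or light, the core enters a "heavy" mode on
the first heavy instruction, and it only leaves that mode after `hold_cycles`
consecutive light instructions. All names are hypothetical.

```python
def count_mode_switches(trace, hold_cycles):
    """Count power-mode switches for a trace of heavy/light instruction flags.

    trace: iterable of booleans, True = power-hungry (e.g. AVX2) instruction.
    hold_cycles: consecutive light instructions required before leaving heavy mode.
    """
    heavy_mode = False
    light_run = 0
    switches = 0
    for is_heavy in trace:
        if is_heavy:
            light_run = 0
            if not heavy_mode:
                heavy_mode = True   # this is where the core would stall and ramp
                switches += 1
        elif heavy_mode:
            light_run += 1
            if light_run >= hold_cycles:
                heavy_mode = False
                switches += 1
                light_run = 0
    return switches


# A workload that keeps flipping: 1 heavy instruction, then 9 light ones, 100 times.
trace = ([True] + [False] * 9) * 100
print(count_mode_switches(trace, hold_cycles=5))    # 200: switches on every burst
print(count_mode_switches(trace, hold_cycles=50))   # 1: enters heavy mode and stays
```

With the short hold the toy core pays the switching cost twice per burst; with
the long hold it pays once and then just stays in the slower mode, which is the
trade-off being tuned.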

~~~
derefr
Why is it “better” if the stall results in lower performance-per-watt? That’s
what people buy these high-end CPU SKUs for, after all. They aren’t just
concerned with CapEx (the cost of the CPU) but also OpEx (the aggregate cost
of electricity for the CPUs, PSUs and cooling in their studio/data center.)
You could also “fix” the problem by just allowing the customer to lock the CPU
into AVX2 mode and thereby never switching out at all—but of course that’d be
dumb; they’d be drawing more power, and _also_ not going as fast as they
could, when executing non-AVX2 code.

Another approach I haven’t seen from either Intel or AMD yet, is to copy ARM’s
big.LITTLE architecture: to set up separate “AVX2 cores” that can execute AVX
instructions and most basic ALU ops but not e.g. branches, put those on the
other side of the die with in-wafer thermal insulation between them and the
regular cores; and then throw workloads between regular and AVX2 cores in a
way where the AVX2 cores heating up doesn’t mean that the non-AVX2 cores are
heating up, and the CPU can go back to full Turbo Boost as soon as the
workload is thrown back to the regular cores, because the respective regular
core is actually quite cool.

(The performance effects of this are possible to loosely estimate, I think, by
writing code that synchronously sets up a GPGPU pass on some data, executes
it, retrieves the result, and then returns to executing CPU instructions for a
while, in a loop; and then executing this code on an Intel CPU using its on-
die IGPU as the GPU. The CPU and IGPU form _somewhat_ -separate thermal
domains already, though there’s no explicit insulation.)

~~~
blihp
Because it doesn't rely on both the user and developers to do some yet to be
determined 'right thing' for the CPU to behave in a consistent manner. While
you are correct that some people want absolute max performance per watt, I
would argue that no one wants a CPU that behaves unpredictably or
inconsistently. An option to disable one or more power saving features that
result in this issue (assuming that's what the issue is) for those who just
need maximum performance could be a desired solution for some. While others,
possibly most, would be willing to take a short term performance hit from a
stall to maintain the best overall performance per watt characteristics.

------
nottorp
Hmm are we having another "AMD motherboards are crap" moment?

Or is it simply that delivering 200+W at load through a CPU socket can't be
reliably done at consumer prices?

Has anyone had this problem with less high-end CPUs? Something at 65-95 W?
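
For a sense of scale, a back-of-the-envelope Python sketch of the current
involved. The core voltages are illustrative assumptions, and it pretends the
whole 200 W is delivered at core voltage and ignores VRM losses, so treat the
results as order-of-magnitude only.

```python
def socket_current_amps(package_power_w: float, core_voltage_v: float) -> float:
    """Current needed to deliver package_power_w at core_voltage_v (I = P / V)."""
    return package_power_w / core_voltage_v


# Assumed core voltages, not measured values.
for vcore in (1.0, 1.2, 1.4):
    print(f"{vcore:.1f} V -> {socket_current_amps(200, vcore):.0f} A")
```

Under those assumptions 200 W works out to somewhere between roughly 140 A and
200 A.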

~~~
speeder
I have an AMD GPU, a 380X... and to be honest I am just sure AMD power
management is crap in general. 380X in particular has bigger demand than 380,
but AMD didn't bother raising power limit, causing some weird issues (some
games for example get unstable unless I use some overclocking app to reduce
voltage and raise power limit... mind you, the normal 380 people already used
it with maximum power limit by default, because it was already too low...).

Then we get the RX 480 melting cables for the same reason... and so on.

~~~
nottorp
Hmm whatever is melting cables is drawing more power than the cables are
specced for. I doubt increasing the power limit will help with that...

~~~
speeder
You are correct. But what I meant is that the source of the problem was the
same.

In this case, AMD wanted to keep using a 6-pin cable despite it not being
appropriate. On the 380 it just slowed the card down.

On the 380X, it is unstable; it still doesn't draw enough to melt things, but
you need hackery to make the GPU usable.

With the 480 they outright made a GPU that used more power than the cable specs
would allow, and insisted on using the same power delivery design the 380
had... Their "fix" for the issue was to release patches that would just make
the GPU run slow, and make it misbehave like the 380X does.

~~~
jotm
Wow, that's ridiculous...

------
frou_dh
The original Ryzen (1000 series) shipped plenty of units with hardware defects
that could be exposed by running parallel compiles. The so-called segfault
bug:

[https://www.phoronix.com/scan.php?page=article&item=new-
ryze...](https://www.phoronix.com/scan.php?page=article&item=new-ryzen-
fixed&num=1)

~~~
ComputerGuru
From your link:

> AMD has confirmed this issue doesn't affect EPYC or Threadripper processor

The CPU Michael was using was a Ryzen 7, not a Threadripper. But I don't know
if TRs were later found to have the same problem, which wouldn't surprise me.

~~~
paulmd
IMO the likely cause was some kind of cache bug due to manufacturing.
Threadripper processors were binned for tighter cache timings (they had the
same timings as the 2000 series desktop chips) and thus didn't suffer from it.

EPYC was on a different stepping entirely and didn't share dies with the
consumer or HEDT processors. I suspect they likely had the tighter cache
timings although I'm not sure on that.

Regardless, it's also possible that they were just binned out, or nobody ever
encountered it. EPYC was not a very high-volume product, most people did test
installations and said "yeah, we'll wait for Rome". Quad-numa on a package and
lower-than-Intel IPC on a pretty poor node was not a winning formula.

Once the problem was realized, desktop Ryzen processors started getting binned
for it as well. A few slipped through here and there, so it wasn't a
manufacturing change; I think binning is the most likely explanation.

------
daneel_w
I can reproduce this problem on my non-Threadripper Ryzen 5 3600 Zen 2 CPU. I
don't think it's specific to TR.

With AVX2 enabled, Prime95's torture test is only stable when I use 3 workers
or less. With 4 workers one of them will abort due to an error within 20
seconds. The more workers, the sooner a crash; with 5 workers it happens
within 10 seconds, and with 6 workers it happens within 2-3 seconds.

If I play with the tests on and off for a while, seemingly increasing the
quiescent temperature of the CPU, the whole experience and testing actually
becomes a bit more stable. My motherboard uses the B450M chipset.

~~~
Scramblejams
What motherboard make/model?

~~~
daneel_w
Gigabyte B450M DS3H. Currently running with latest BIOS (F50).

~~~
robocat
Note that @DerAlbi thinks the CPU is fine, instead he suspects the VRM on his
Gigabyte mobo for this issue (based on his oscilloscope readings):

[https://forum.level1techs.com/t/3970x-prime95-stability/1532...](https://forum.level1techs.com/t/3970x-prime95-stability/153206/21)

~~~
daneel_w
Thanks for the info!

------
johnklos
Do some more investigating. Drop the memory clocks to stock, drop the CPU
clocks to, perhaps, 3 GHz, and see if the same issues happen. If they do,
there's a systemic issue that needs to be addressed. If the issue disappears,
try raising the clock incrementally until the issue reappears. Get a Kill-a-
watt and look at power usage for each frequency and graph the results.
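
The step-up procedure can be sketched in Python like this. It is only an
outline: `is_stable` is a hypothetical callback standing in for "set this clock
in the BIOS, run the torture test, and report whether it passed", which in
practice is a manual step. The clock range and step size are assumptions too.

```python
def first_unstable_clock(is_stable, start_mhz=3000, stop_mhz=4500, step_mhz=100):
    """Raise the clock step by step; return the first clock that fails, or None."""
    for clock_mhz in range(start_mhz, stop_mhz + 1, step_mhz):
        if not is_stable(clock_mhz):
            return clock_mhz
    return None


# Stand-in predicate for demonstration only; a real one would ask the person
# running the test whether Prime95 survived at this clock.
def pretend_stable(clock_mhz):
    return clock_mhz < 3600


print(first_unstable_clock(pretend_stable))   # 3600
```

Note the wall power at each step as you go; if the first failure lines up with
a jump in power draw, that points toward power delivery rather than the CPU
itself.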

------
cma
> Finally, a note on CPU temperatures: At idle the CPU hovers around 39-50 °C
> and tops around 72-78 °C under full load. I’m using the best air cooling
> setup I could think of and get my hands on, but it’s still air cooling, and
> my system is installed in a closed case (but with extreme attention to
> airflow).

I know air coolers can be competitive, but it says right on the outside of the
2950X box that you should use liquid cooling.

------
ComputerGuru
Can anyone suggest a different CPU load-testing tool other than prime95, that
might catch things prime95 wouldn't?

I have a machine running a 1950X and I get random ffmpeg segfaults anywhere
from six to eight hours in to an encoding session with all 16 cores fully
loaded, but the machine is prime95 stable for a week+, so I suspect it's an
AVX/AVX2 issue.

~~~
AHTERIX5000
Intel's Linpack benchmark is quite a good perf/stability/throttling test for
win/linux/mac: [https://software.intel.com/en-us/articles/intel-mkl-
benchmar...](https://software.intel.com/en-us/articles/intel-mkl-benchmarks-
suite)

Not the easiest one to configure though.

~~~
Filligree
Linpack only runs on Intel CPUs, so it's not useful for Threadripper.

~~~
franzb
I don’t know if we’re talking about the same Linpack test but I’m able to run
it on my AMD TR 3970X without issues.

~~~
Filligree
The linpack package in AUR, at any rate. If there are multiple software suites
under the same name then I have no idea.

------
ehutch79
So, does this mean threadripper is unusable and we shouldn't buy them?

~~~
ehutch79
I'm not sure why I'm being downvoted. This is a legitimate question. Does this
bug make the cpu unusable? Does the work around for it slow down the per core
performance to the point of unusability?

~~~
topspin
It's a little early to start condemning things. It might be motherboard power
delivery, a prime95 bug or something else. If it is the CPU there could be a
firmware correctable flaw. This is the bleeding edge; these devices have only
been available for 13 weeks and someone is indeed bleeding, it seems.

It could even be an uncorrectable flaw in the device design, in which case AMD
will likely do as they did in 2017: replace the parts.

~~~
boris
> This is the bleeding edge; these devices have only been available for 13
> weeks [...]

Shouldn't these devices have been tested by AMD and motherboard manufacturers
for months before they were released? Including on workloads like Prime95
which are well known to uncover system instabilities.

Keep in mind also that this is not the first time in the Zen/Zen2 history that
AMD has shipped buggy products: there was the segfault bug, the random number
generator instruction bug, and the unbootable Threadripper on recent (at the
time of release) Linux kernel. The worst part, to me personally, is that they
were all "surface bugs" that could have been detected by AMD with a little bit
of testing.

I've been wondering why Supermicro isn't releasing any Threadripper
motherboards (while they do for Xeon-W). Maybe this explains it: the consumer
CPU business at AMD is a circus and Supermicro wants none of it?

------
m0zg
Probably the motherboard, or VRM brownout to be more exact. That said, I'm
glad I did not pick up a 3970X like I was planning to, yet. AMD is pretty
great with its warranty, motherboard manufacturers can be a chore.

------
citilife
It appears this could just be a prime95 bug from reading the comments.

~~~
xxs
There is a beautiful oscilloscope shot that shows it's a power delivery issue:
the voltage drops sharply, and no software bug should cause that.

It appears to be a VRM issue, either a design flaw or incorrect setup by the
BIOS.

------
maljx
I ran the same test on my Ryzen 3900X with no issues, on an MSI X570 ACE
motherboard with a Seasonic PSU.

------
grokas
Great to see L1T posted here.

