
Reducing the Performance Gap of Intel's MKL on AMD Threadripper - smartmic
https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications-1637/
======
Teknoman117
It's an unfortunate trend. I got a PR merged which removed an "Is
GenuineIntel" check in the ZFS on Linux crypto layer. It checked for
GenuineIntel, and only if found did it check cpuid. Dunno who originally
checked that in, but it seemed pretty ridiculous to me. AMD has had an AES-NI
implementation since Bulldozer.

I feel it would have been acceptable to say that if you had a mutant x86 CPU
which set the AES-NI bit in CPUID and then had a non-compliant implementation,
that was on you.
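The check being argued for here is the feature bit, not the vendor string. A minimal sketch of that idea (the helper name is mine, and it reads Linux's /proc/cpuinfo rather than issuing CPUID directly):

```python
def has_aes_ni(cpuinfo_text):
    """True if the 'aes' feature flag is advertised, regardless of
    whether vendor_id says GenuineIntel or AuthenticAMD."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "aes" in line.split(":", 1)[1].split()
    return False

# On a live Linux box: has_aes_ni(open("/proc/cpuinfo").read())
```

Both a post-Bulldozer AMD chip and a modern Intel chip would report the flag; a vendor check adds nothing but a way to miss one of them.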

~~~
PedroBatista
Probably some guy trying to clean his ToDo list, but in recent times as we
scratch more than the surface, there’s a real possibility that guy’s paycheck
was “Genuine Intel” too.

------
m0zg
AMD needs to do something like
[https://github.com/intel/mkl-dnn](https://github.com/intel/mkl-dnn) and
[https://github.com/opencv/dldt](https://github.com/opencv/dldt), either by
talking to Intel to get them to accept contributions (vastly preferable), or,
failing that, by forking the libs and implementing their own support. With
recent work on efficient NN architectures the CPUs are pretty viable for most
_inference_ (not training) tasks. DLDT/OpenVINO perf is particularly
impressive. It does, however, extensively use JIT kernel generation through
[https://github.com/herumi/xbyak](https://github.com/herumi/xbyak) depending
on the supported features for a particular chip (search for "mayiuse()" in the
code to see examples). It's anyone's guess what it will detect and generate on
an AMD chip however. Now that AMD is raking in more cash, it'd be a great time
to invest some of that back into their dev ecosystem.

~~~
gnufx
The relevant free low-level library in this area is libxsmm, which drove MKL
to improve for small matrix work by being faster a while ago. I don't know how
it performs on AMD CPUs, or how much a priori detailed micro-architecture
knowledge it requires, but I doubt the maintainer would refuse changes for
AMD. It was referenced with no interest in
[https://news.ycombinator.com/item?id=16600347](https://news.ycombinator.com/item?id=16600347)

How specific to Intel CPUs is MKL-DNN (now renamed)? When I looked at it, the
CPU code seemed fairly generic SIMD.

~~~
m0zg
MKL-DNN is not MKL, though. I think MKL-DNN is fairly specific to Intel CPUs
if you want maximum performance. It'll necessarily make certain assumptions
about instruction throughput and latency, something that's pretty much certain
to be different on AMD in some cases. In super-tight kernels such minute
differences often translate into much lower performance. So if AMD wants to do
a good job, they'll need, at least in some cases, to detect AMD CPUs and JIT
their own specialized kernels that perform well on AMD.

~~~
gnufx
Sure, but it wasn't immediately obvious it was that specific when I looked,
hence the question.

------
Abishek_Muthian
10th-gen Intel (Ice Lake) has more instructions under AVX-512, and they are
not limited to just Xeon processors.

So wouldn't comparing against the latest Ice Lake CPU using Intel MKL, instead
of the Xeon-W 2175, yield an even larger performance gap for workloads taking
advantage of the new instruction subset? i.e. AVX-512 F, CD, VL, DQ, BW, IFMA,
VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES.

~~~
wmf
The server/workstation version of Ice Lake is not available yet and I wouldn't
recommend buying a quad-core Ice Lake laptop for high performance computing.

~~~
Abishek_Muthian
Btw, the AMD Ryzen 3900X used in the comparison isn't a server CPU either.

------
eberkund
Would there be any merit in the OS hiding the CPU brand? Keep exposing which
specific CPU features/instruction sets are supported, but don't reveal whether
it is AMD/Intel/VIA. What's the benefit of exposing the brand? I find it hard
to believe there are many "optimizations" out there that do different things
based on the specific CPU model or brand despite the instruction sets being
the same. If I'm wrong and there is a substantial amount of software that
requires that, then perhaps it can be hidden behind some kind of
request-access popup.

~~~
paol
First, the OS can't hide that because it's not a middleman in the execution of
program code. When a program is running on a CPU it has access to the
instructions that tell it what CPU it is.

Second, there are already many ways for programs to get granular CPU feature
information. It's entirely the software's fault if it decides to ignore that
information and instead use a big "if" switch keyed solely on the
manufacturer.
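To make the distinction concrete, here's a toy dispatch sketch (function and kernel names are mine, not from any real library), contrasting feature-keyed selection with the vendor-keyed "if" being criticized:

```python
def pick_kernel(flags):
    # Feature-keyed dispatch: use the best kernel the CPU advertises.
    if "avx2" in flags:
        return "avx2_kernel"
    if "sse2" in flags:
        return "sse2_kernel"
    return "generic_kernel"

def pick_kernel_vendor_gated(vendor, flags):
    # Vendor-keyed dispatch: any CPU that isn't GenuineIntel falls
    # through to the slow path, even if it advertises AVX2.
    if vendor == "GenuineIntel":
        return pick_kernel(flags)
    return "generic_kernel"
```

An AMD chip advertising AVX2 gets the fast kernel from the first function and the generic one from the second, which is exactly the behavior the article measured.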

~~~
pmjordan
_First, the OS can't hide that because it's not a middleman in the execution
of program code. When a program is running on a CPU it has access to the
instructions that tell it what CPU it is._

Technically, it could, using virtualisation extensions to x86. CPUID can be
trapped by a hypervisor, and the hypervisor can supply whatever result it
likes. This is how Qemu/KVM simulates different CPU models.

Mainstream OSes don’t use virtualisation extensions for regular processes of
course, and hypercalls would be even more expensive than regular syscall based
context switches, so this is probably not something that is going to change in
the short to medium term.

~~~
jotm
Could Windows 10 with Hyper-V enabled do it? They claim native performance
when running it.

~~~
Filligree
No, they're just lying. You can get close, but you'll never achieve quite the
same performance.

------
aloknnikhil
We honestly don't know how this impacts the results, as was noted at the end.
I'd like to see the numbers from the same run with the env variable set on an
Intel CPU. Perhaps this debug code does something drastically different?
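For reference, the variable discussed in the linked article is `MKL_DEBUG_CPU_TYPE`, an undocumented debug switch, so none of its behavior is guaranteed. A minimal timing harness to repeat the measurement on any CPU might look like this:

```python
import os
# Undocumented debug switch from the linked article; it must be set
# before numpy (and hence MKL) is first imported to take effect.
os.environ.setdefault("MKL_DEBUG_CPU_TYPE", "5")

import time
import numpy as np

n = 1024
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b                      # dgemm via whatever BLAS numpy links
dt = time.perf_counter() - t0
print(f"{n}x{n} matmul: {dt:.4f}s, {2 * n**3 / dt / 1e9:.1f} GFLOPS")
```

Running this with and without the variable, on both an Intel and an AMD box, would answer the question above; on a numpy build that doesn't link MKL the variable is simply ignored.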

------
shaklee3
Has anyone seen similar benchmarks, but with BLIS? OpenBLAS is interesting,
but BLIS would possibly perform better.

~~~
edraferi
The article says they’d like to include BLIS in the benchmark, but nobody has
written a good NumPy Conda package with BLIS hooks yet.

~~~
lhl
The article was specific to AMD's BLIS fork, but just as an FYI, using
standard BLIS w/ numpy (which I believe would still outperform OpenBLAS) is
pretty straightforward:

    conda create -c conda-forge -n numpy-blis numpy "blas=*=blis"

~~~
wjnc
Are there any reads you can recommend on how to recover from (possibly) failed
experiments like these under pip / conda / Linux in general? I've been doing R
and Linux for two decades, but I'm still perfectly capable of totally breaking
my python/jupyter workflow by messing up dependencies without knowing how to
recover. Would love to learn to remedy that gap and reduce the risk.

~~~
lhl
You can actually see in my pasted command that I created a new venv for the
blis install. It's as simple as "activate" or "deactivate" to switch around.

That's basically all there is to it, but here are some docs for you to
reference from the top page of search:

[https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)

[https://uoa-eresearch.github.io/eresearch-cookbook/recipe/2014/11/20/conda/](https://uoa-eresearch.github.io/eresearch-cookbook/recipe/2014/11/20/conda/)

------
gnufx
As for a previous article like this -- it's pointless. Just use BLIS, like AMD
does, which is also infinitely faster than MKL on non-x86 systems.
[https://github.com/flame/blis](https://github.com/flame/blis)

------
RL_Quine
That's some seriously scummy stuff from Intel.

~~~
creato
See the caveat at the end:

> It is also possible that the resulting code path has some precision loss or
> other problems on AMD hardware. I have not tested for that!

Intel doesn't test it either, which is why they disable it.

You could argue that they should test and enable on AMD CPUs, but this isn't
clearly scummy behavior (to me), it's not a zero cost decision they made
solely to screw over AMD. Buying the hardware and testing it has a cost. It
may not even be possible to test it to the same degree they test their own
hardware. There may be modes, subtle differences between architecture
versions, etc. that are hard to know about to test for.

~~~
nemetroid
Until the opposite has been shown, the assumption should be that a CPU that
indicates that it supports AVX2 does support AVX2.

~~~
roenxi
That is not the case; this is at the boundary of software and hardware, and
real-world concerns dominate. It is quite possible that a library highly
optimised for a specific hardware set could have ghastly bugs on hardware it
isn't tested on, and testing has a cost.

Intel is being reasonable and in a sense prudent by only supporting hardware
they understand very well. It isn't like they are sending the lawyers out to
sue anyone writing high performance code on AMD chips - that would be
unreasonable.

~~~
the8472
The usual approach (e.g. taken in the linux kernel, llvm, hotspot jvm) is to
look at the cpuid capability flags and rely on them _by default_ and only
blacklist specific CPU models when they don't implement advertised features
correctly, not the other way around.
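A sketch of that default-trust-plus-blacklist pattern (the blacklist entries and function name here are hypothetical, not real errata):

```python
# Hypothetical blacklist of (vendor, family, model) tuples known to
# advertise AVX2 but implement it incorrectly. Empty trust-by-default
# is the point: only documented-broken parts get listed.
AVX2_BLACKLIST = {("HypotheticalVendor", 0x17, 0x01)}

def may_use_avx2(vendor, family, model, flags):
    # Trust the advertised capability flag unless this exact model is
    # known-broken -- the kernel/LLVM/JVM approach described above.
    if "avx2" not in flags:
        return False
    return (vendor, family, model) not in AVX2_BLACKLIST
```

Under this policy an unfamiliar but correctly-advertising CPU gets the fast path by default, and a misbehaving model can be fenced off with a one-line errata entry.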

------
moonbug
MKL is, and always has been, a marketing tool for Intel. They develop it to
make their CPUs look good. Why should they help the other team?

This really isn't news to anyone who cares about linear algebra library
performance; it's just that it has passed from general consciousness in the
decade since AMD was last relevant. For performance portability to non-Intel
CPUs, OpenBLAS and its ilk have always been the way to go.

You might think it's a sharp practice, in which case, you're free to vote with
your wallet.

~~~
yarg
At the very least, because they're legally obliged to. The antitrust laws
prohibit companies in Intel's position from doing what they're doing (and they
also agreed not to in the settlement of the AMD lawsuit).

~~~
kbumsik
> At very least, because they're legally obliged to.

Are you sure that law also applies to Intel MKL? Intel MKL is proprietary
software by Intel, and its system requirements clearly state that it is for
Intel 64 and IA-32 only. Do you mean that all proprietary software made by a
hardware manufacturer should support competitors' hardware as well?

~~~
lawrenceyan
That's false. Intel's MKL library has been repeatedly marketed as a tool for
all processors, including non-Intel ones. Combined with this fact, Intel's
choice to purposefully remove performance enhancements for AMD processors is
very clearly in violation of antitrust law.

~~~
Teknoman117
Except it's not against antitrust law. As long as they document that it may
not run well on non-Intel processors, they are in the clear, at least in the
US.

Is it scummy? Sure, but that was the ruling when this came up for their
compiler.

------
seriesf
The real “big caveat” here is to make sure that this actually improves your
real workload in real life with real inputs. Zen and Zen+ have a lot of edge
cases, for example their pdep/pext support is just crazy slow and how slow it
is depends on the input values (it gets slower if more bits are set, which is
bananas). But this also goes for genuine intel hardware. MKL is not always
optimal. Even -march=native does not always produce the best code. Your
program may be faster with the latest instructions disabled.

~~~
partingshots
If you go to the MKL library, it specifically says it speeds up performance on
non-Intel processors as well. That they would state this and still sabotage
the performance of AMD processors by directly disabling performance
enhancements for them shows me that this was a calculated, willful decision.

~~~
seriesf
Or it shows that the MKL developers are aware of some obscene edge case that
kills the performance of a major customer’s application. This article tested
ONE thing.

~~~
lhl
It's certainly not an "obscene edge case" - MKL cripples AMD on every single
matrix operation tested here:
[https://github.com/flame/blis/blob/master/docs/graphs/large/l3_perf_epyc_nt1.pdf](https://github.com/flame/blis/blob/master/docs/graphs/large/l3_perf_epyc_nt1.pdf)

