
Intel MKL on AMD Zen - todsacerdoti
https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
======
segfaultbuserr
Related thread from a few days ago, including discussions about Intel MKL.

* Intel's “cripple AMD” function (2019)

[https://news.ycombinator.com/item?id=24307596](https://news.ycombinator.com/item?id=24307596)

~~~
zepearl
MKL = Math Kernel Library?

[https://software.intel.com/content/www/us/en/develop/tools/m...](https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html)

~~~
bodono
Yes, and this highlights something that bugs me about many tech articles -
using acronyms without ever defining them.

~~~
waynecochran
I hate the overuse of acronyms (IHTOOA). One way to get back is just to start
making up your own -- throw a few random TLA's out at your next meeting. It's
fun.

Also, I would like to push for an anti-acronym day. A full day where you have
to use full and complete terminology. Ironic thing is, you don't add that many
syllables when you say "Proof of Concept" vs POC and similar such absurd
overused abbreviations.

~~~
jcranmer
To be fair, in the supercomputing community (or, more generally, "high
performance computing"--HPC), certain acronyms for libraries are very well
known: MPI, HDF5, FFTW, BLAS, MKL--no one is going to bat an eye at seeing
those names without an acronym expansion.

~~~
BenjiWiebe
And I've only had a passing curiosity about supercomputing, a few years ago,
and the only acronym you listed that I have no idea about is HDF5.

~~~
Jarwain
That's funny. I've never really touched supercomputing, and the only one I
recognize is HDF5.

------
singhrac
Wow, why is Intel supporting Zen kernels in MKL? That seems... really
interesting. One of the few places where Intel still has a somewhat clear
advantage is in high performance numerical code that can't be easily
multiprocessed (e.g. off-the-shelf ML code) because MKL is much faster than
AMD alternatives.

Also, does anyone know if one can use patchelf on e.g. Python (or the numpy
compiled section?) to get MKL/Zen support? I don't have my Zen CPU on me to
test.
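Before reaching for patchelf, one first has to know which BLAS a given numpy install actually bundles. A minimal sketch of how one might check (the library-name substrings and the wheel directory layout are assumptions based on common manylinux wheels, not anything from the article):

```python
import glob
import os

def classify_blas(libnames):
    """Guess a BLAS backend from shared-library file names."""
    for name in libnames:
        base = os.path.basename(name).lower()
        if "mkl" in base:
            return "mkl"
        if "openblas" in base:
            return "openblas"
        if "blis" in base:
            return "blis"
    return "unknown"

def numpy_blas_libs():
    """Collect shared objects vendored alongside a numpy install."""
    import numpy
    pkg = os.path.dirname(numpy.__file__)
    # manylinux wheels usually stash BLAS in a numpy.libs/ directory
    return (glob.glob(os.path.join(pkg, "..", "numpy.libs", "*.so*"))
            + glob.glob(os.path.join(pkg, "core", "*.so*")))
```

Something like `print(classify_blas(numpy_blas_libs()))` then tells you whether there is an MKL to patch at all.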

~~~
nl
I think the real question one should be asking is "why is AMD's support for
high-performance computing so bad?"

This applies both to CPUs (where they have to rely on Intel for this) and to
GPUs (where their offerings are under-resourced, buggy, and lagging behind
what NVidia offers).

~~~
slavik81
I work for AMD on our GPU math libraries. If you have specific complaints, I
will gladly listen. I can't promise more than that, but I will read and
consider whatever you write.

I'm a true believer in ROCm HIP. It has tremendous potential. I joined the
company specifically because I wanted to help ensure its success... mostly by
writing fast, reliable code, but also by listening to our users.

If you (or anyone else) would prefer to respond privately, my email is my HN
username at gmail.

------
rasz
>Good news: Intel seems to be adding Zen kernels

How is that good news when your investigation shows that in reality it's
"Intel seems to be adding cripple-Zen kernels" when compared to spoofing an
Intel CPU? 382 GF/s vs 430 GF/s.

~~~
rblatz
I’m an AMD fanboy, and just built a new Ryzen box. I’d be curious to see if
there is a difference in accuracy running the two kernels on AMD hardware.

Maybe it is faster but less accurate to run the Intel kernel on AMD hardware.

~~~
Tuna-Fish
There is no difference. The operations these kernels use are all well-defined
and work identically on all CPUs that implement them. In general, FP/SIMD is
not nearly as much of a crapshoot as people seem to expect it to be. Beyond
timings, there are generally no user-space visible differences in operation
between any AVX/SSE instructions in Intel/AMD cpus.
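The determinism point above can be illustrated in a few lines: an IEEE-754 operation's result is a pure function of its inputs, so the same kernel produces bit-identical answers on any conforming CPU; only a *different* kernel (i.e. a different evaluation order) can round differently. A small sketch:

```python
import struct

def bits(x):
    """Exact bit pattern of a double, for bitwise comparison."""
    return struct.pack("<d", x).hex()

a, b, c = 0.1, 0.2, 0.3

# Same operations in the same order: always bit-identical on any
# IEEE-754 CPU -- which is why the Intel kernel gives the same answer
# on AMD hardware as it does on Intel hardware.
assert bits((a + b) + c) == bits((a + b) + c)

# A different evaluation order is a different computation and may
# round differently; kernel choice, not CPU vendor, drives this.
print(bits((a + b) + c) == bits(a + (b + c)))  # False
```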

------
gnufx
Why this obsession with MKL on AMD hardware? I've long made measurements of
OpenBLAS on "large"-dimension serial DGEMM, at least. Even on Intel hardware
from Westmere through SKX, with the exception of KNL, it was always at least
within the typical noise level of HPC jobs relative to MKL's performance, and
it was always better than ACML on the generations of Opterons we used.

There are results for AVX2 systems with older versions of OpenBLAS and BLIS
(which is AMD's BLAS) at
[https://github.com/flame/blis/blob/master/docs/Performance.m...](https://github.com/flame/blis/blob/master/docs/Performance.md)
I haven't seen or made measurements for EPYC2 yet (and don't have results with
the current versions online for Haswell and SKX) but I'd be surprised if AMD
BLIS doesn't perform equally well on EPYC2. OpenBLAS serial DGEMM is very
similar to MKL on SKX, for instance, with BLIS not far behind. I think AMD
work has contributed to BLIS Haswell performance; their BLAS certainly
supports Intel hardware decently, as well as aarch64, at least.

If you're interested in small dimension matrix multiplication on AVX2
hardware, consider libxsmm and AMD's recent "SUP" support in BLIS, e.g.
[https://github.com/flame/blis/blob/master/docs/PerformanceSm...](https://github.com/flame/blis/blob/master/docs/PerformanceSmall.md)
MKL only got good at small dimensions because of libxsmm.

Off-topic for AMD, but in lieu of detailed figures, here are the first few
points from measurements to hand for SKX serial square DGEMM with BLIS 0.7,
OpenBLAS 0.3.10, and MKL 2021.1-beta06, using the framework behind the figures
on the BLIS site:

    
      size BLIS OpenBLAS MKL
      2400 90.8 98.3     94.5
      2352 91.4 97.4     93.6
      2304 90.6 98.3     93.9
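For anyone wanting to reproduce numbers in this ballpark: square DGEMM throughput is conventionally reported as 2·n³ floating-point operations over the elapsed time. A rough timing sketch (using numpy's `@` as a stand-in for calling the BLAS directly; this is an assumption for illustration, not gnufx's framework):

```python
import time
import numpy as np

def dgemm_gflops(n, repeats=3):
    """Time C = A @ B for n x n doubles and report best-case GFLOP/s."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b                       # dispatches to the linked BLAS dgemm
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9  # 2*n^3 flops per square matmul
```

E.g. `dgemm_gflops(2400)` times the same size as the first row of the table; taking the best of several repeats reduces warm-up and scheduling noise.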

------
jeffbee
If you were going to use Zen, why would you choose MKL? Why not OpenBLAS?

~~~
shepardrtc
To add to that, I believe OpenBLAS surpassed MKL in benchmarks a while ago.
Even if I had an Intel CPU, I would probably use OpenBLAS if I had the choice.

~~~
danieldk
I benchmarked some of my large transformer networks over the last few days,
and MKL is still 50% faster than OpenBLAS.

What's even worse in real-world applications is that OpenBLAS misbehaves when
an application uses threads. This is also described in the OpenBLAS FAQ:

 _If your application is already multi-threaded, it will conflict with
OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread as
following._

[https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded](https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded)

~~~
gnufx
So what are the results with libxsmm and current AMD BLAS, as that must be for
small dimensions?

The reason it's serial BLAS that mainly matters is that HPC codes are usually
parallelized above the BLAS; why do you want the nesting? Swapping in threaded
OpenBLAS or BLIS is something you might do with basically serial stuff like
vanilla R, e.g.
[https://loveshack.fedorapeople.org/blas-subversion.html#_add...](https://loveshack.fedorapeople.org/blas-subversion.html#_addendum_example)
OpenBLAS threading has been somewhat buggy,
but the main problem with its OpenMP support currently seems to be that using
OMP_PLACES kills it.

------
arthur2e5
The new mkl_serv_intel_cpu_true() function seems to have been known since
Agner Fog's 2019 update. I am quite surprised that no changes to the feature
indicator were needed, though.

If you are publishing an application, I still recommend using the
intel_dispatch_patch.zip.

------
satya71
patchelf is such a nice tool. I use it all the time to replace hard-coded
library paths.
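For scripting that kind of binary surgery, the relevant patchelf invocations can be built up front and only run once inspected. A small sketch (the helper names are made up; the flags are patchelf's own):

```python
import subprocess

def set_rpath_cmd(binary, rpath):
    """patchelf --set-rpath: replace the hard-coded library search path."""
    return ["patchelf", "--set-rpath", rpath, binary]

def replace_needed_cmd(binary, old_lib, new_lib):
    """patchelf --replace-needed: swap one DT_NEEDED entry for another."""
    return ["patchelf", "--replace-needed", old_lib, new_lib, binary]

# e.g. subprocess.run(set_rpath_cmd("./app", "/opt/mylibs"), check=True)
```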

------
tinus_hn
Quake/Quack all over again. Ironically now it’s ATI/AMD getting cheated.

