Hacker News

In a previous life, almost a decade ago, I fought very similar fights with OpenMP and MKL using R. It's painful, and you need to pay heed to all the small details pointed out in the docs, as in the OP's case. However, it's worth noting that OpenBLAS is as fast as MKL, at least if you compile it yourself for your system (I would expect that system-provided ones with system detection would be as good, but that wasn't always the case back then). I benchmarked this extensively for all my R use cases and for several systems I cared about back then. So there is usually no need to use MKL in the first place.



> OpenBLAS

OpenBLAS is incompatible with application threads. Most Linux distributions provide a multi-threaded OpenBLAS that burns in a fire if you use it in multi-threaded applications. Even though OpenBLAS' performance is great, I'd be careful about giving a general recommendation to rely on OpenBLAS. As with this MKL example, you have to be aware of its threading issues, read the documentation, and compile it with the right flags (in a multi-threaded application: single-threaded, but with locking).
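For reference, a minimal build sketch of that configuration, assuming the `USE_THREAD` and `USE_LOCKING` flags documented in OpenBLAS's Makefile.rule (check your version; the install prefix is just an example):

```shell
# Build a single-threaded OpenBLAS that is still safe to call from
# multiple application threads: USE_THREAD=0 disables OpenBLAS's own
# thread pool, USE_LOCKING=1 protects its internal buffers with locks.
make USE_THREAD=0 USE_LOCKING=1
make PREFIX=/opt/openblas-serial install
```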

> it's worth noting that OpenBLAS is as fast as MKL

This depends highly on the application. MKL, for example, provides batched GEMM, which is used by libraries like PyTorch. So if you use PyTorch for machine learning, performance is still much better with MKL. That is, of course, unless you have an AMD CPU, in which case you have to override Intel's CPU detection if you do not want abysmal performance:

https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html

https://www.agner.org/optimize/blog/read.php?i=49

The BLAS/LAPACK ecosystem is a mess. I wish that Intel would just open source MKL and properly support AMD CPUs.
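To make the batched-GEMM point above concrete: in PyTorch this is `torch.bmm` (or `torch.matmul` on 3-D tensors), which MKL serves with `cblas_?gemm_batch`. A NumPy sketch of the same operation shape; NumPy here is only for illustration, and whether it actually dispatches to a batched BLAS call depends on the build:

```python
import numpy as np

# Batched GEMM: one call multiplies a whole stack of small matrices.
# MKL exposes this as cblas_?gemm_batch; in PyTorch it backs torch.bmm.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 8, 16))   # batch of 64 (8x16) matrices
b = rng.standard_normal((64, 16, 4))   # batch of 64 (16x4) matrices

c = np.matmul(a, b)                    # one batched call -> shape (64, 8, 4)

# Equivalent loop of 64 individual GEMMs, which the batched API avoids:
c_loop = np.stack([a[i] @ b[i] for i in range(64)])
assert c.shape == (64, 8, 4)
assert np.allclose(c, c_loop)
```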


> OpenBLAS is incompatible with application threads. Most Linux distributions provide a multi-threaded OpenBLAS that burns in a fire if you use it in multi-threaded applications.

Can you explain what you mean by this? Are you saying there's a correctness issue here? I only recall running into issues with MPI, where you (typically) run one MPI rank (process) per CPU core. If you combine that with a multi-threaded BLAS library, you'll suddenly have N^2 BLAS threads fighting over the CPUs, and performance goes down the drain. The solution, like you say, is to use a single-threaded OpenBLAS, or the OpenMP build of OpenBLAS with OMP_NUM_THREADS=1 set.

I guess with threads you'll have the same issue: if you launch N CPU-bound threads that all call BLAS, you get the same N^2 problem as with MPI.
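A sketch of the usual mitigation in Python, pinning the BLAS pool to one thread per process before the library loads (these environment variable names are the ones OpenBLAS, OpenMP, and MKL actually read):

```python
import os

# These must be set before NumPy (and hence the BLAS) is imported:
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

# Now each application thread that calls into BLAS gets exactly one
# BLAS worker, avoiding the N^2 oversubscription described above.
a = np.ones((100, 100))
print((a @ a)[0, 0])   # -> 100.0
```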


> Can you explain what you mean by this?

There is a nice description of this:

https://github.com/xianyi/OpenBLAS/issues/2543

At a previous employer, we saw various issues, including crashes and non-deterministic results. Usually, these issues would go away when switching to MKL.


One of the more painful issues is hanging (lockup) at full CPU usage. At my workplace, we initially introduced a timeout to work around the hang while trying to determine its cause. It happened within multi-threaded R code. Various build flags for OpenBLAS were tried, to no avail. Setting OPENBLAS_NUM_THREADS=1 does make the problem go away, at the expense of performance.

That R code has since been ported to Python, but we faced the same issue again when using ThreadPoolExecutor, so we had to change it into ProcessPoolExecutor instead.
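A sketch of that workaround, with a hypothetical `work` function standing in for the real BLAS-heavy job; each worker process gets its own private OpenBLAS state, so application threads can no longer deadlock against each other:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def work(seed):
    # Hypothetical stand-in for the real BLAS-heavy job.
    rng = np.random.default_rng(seed)
    m = rng.standard_normal((200, 200))
    # trace(M M^T) is the sum of squared entries, so always positive here.
    return float(np.trace(m @ m.T))

if __name__ == "__main__":
    # Processes instead of threads: OpenBLAS is not shared between jobs.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(work, range(4)))
    print(len(results))  # -> 4
```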


Debian and Fedora provide serial, OpenMP, and pthreads versions of libopenblas. Are you sure OpenBLAS doesn't detect nested OpenMP? I thought it did, though I'd normally use the serial version outside something like R. But if you mix simple low-level pthreads with high-level OpenMP, you can expect problems. OpenBLAS is generally fine: competitive with MKL on Intel hardware and far faster on ARM and POWER. For PyTorch, presumably you want libxsmm (which is responsible for MKL's current small-matrix performance). On AMD hardware, I don't understand why people avoid AMD's support, which is just a version of BLIS and libflame. (BLIS' OpenMP story seems better than OpenBLAS'.)

The linear algebra story on GNU/Linux distributions would be less of a mess without proprietary libraries like MKL. The Debian approach works fine, in my significant experience running heterogeneous HPC systems. Fedora has cocked up its policy by not listening to such experience, but you can do the Debian-style thing with the approach of https://loveshack.fedorapeople.org/blas-subversion.html (and see the old R example there refuting the MKL story). That's one example of the value of dynamic linking.


> On AMD hardware, I don't understand why people avoid AMD's support, which is just a version of BLIS and libflame.

A year ago, I benchmarked a transformer network with libtorch linked against various BLAS libraries (numbers are in sentences per second, MKL with CPU detection override on AMD, 4 threads):

Ryzen 3700X - OpenBLAS: 83, BLIS: 69, AMD BLIS: 80, MKL: 119

Xeon Gold 6138 - OpenBLAS: 88, BLIS: 52, AMD BLIS: 59, MKL: 128

I guess people avoid AMD's support because MKL is just much faster? AMD BLIS has added batched GEMM support since then; I haven't had time to try it out yet.


I was thinking of the usual complaint about Intel not supporting AMD hardware that is common in HPC.

We don't know what that example was actually measuring, except that it apparently wasn't the same thing for BLIS and MKL. On the basis of only that, it's not reasonable to say "just much faster", in particular for what I care about. I have Zen 2 measurements (unfortunately only in a VM) using the BLIS test/3 framework. MKL came out nearly as fast as vanilla BLIS 0.7 and OpenBLAS on serial DGEMM, less so on the rest of double-precision level 3, and nowhere close with S, C, and Z. Similarly for one- and two-socket OpenMP. At least in that "2021" version of MKL, there's only a Zen DGEMM kernel.


Have you set the environment variable OPENBLAS_CORETYPE to specify the CPU?
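For what it's worth, the override looks like this; `ZEN` is the core name OpenBLAS uses for AMD Zen, `OPENBLAS_VERBOSE=2` asks it to print the core type it selected, and `benchmark.py` is a hypothetical driver script:

```shell
# Force the Zen kernels instead of relying on autodetection, and have
# OpenBLAS report which core type it actually picked:
OPENBLAS_CORETYPE=ZEN OPENBLAS_VERBOSE=2 python benchmark.py
```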


I went further than that: I profiled with perf and checked that the right kernels were used.


> The BLAS/LAPACK ecosystem is a mess. I wish that Intel would just open source MKL and properly support AMD CPUs.

Given that their latest compilers are based on LLVM, that seems like a fair trade between the closed- and open-source worlds.


> OpenBLAS is incompatible with application threads.

I’ve never had any issues using it in OpenMP codes (either compiling it myself or using the libopenblas_omp.so present in some distros). What do you mean by “burn in a fire”?


> OpenBLAS is incompatible with application threads

R is single-threaded.


> i would expect that system provided ones with system detection would be as good, but that wasn't always the case back then

Also in a previous life, I recall running into distro OpenBLAS packages that were not compiled with DYNAMIC_ARCH=1 (which enables OpenBLAS's runtime CPU target architecture selection, similar to e.g. MKL) but were instead compiled for some lowest-common-denominator x86_64 arch. I filed some bug(s?), and IIRC this problem has since been fixed.
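The fix amounts to building with runtime dispatch enabled; a sketch using OpenBLAS's documented make flag (the install prefix is illustrative):

```shell
# DYNAMIC_ARCH=1 compiles kernels for many microarchitectures and
# selects among them at runtime, instead of baking in one target:
make DYNAMIC_ARCH=1
make PREFIX=/opt/openblas-dynamic install
```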



