Reducing the Performance Gap of Intel's MKL on AMD Threadripper (pugetsystems.com)
222 points by smartmic 41 days ago | 138 comments



It's an unfortunate trend. I got a PR merged which removed an "Is GenuineIntel" check in the ZFS on Linux crypto layer. It checked for GenuineIntel, and only if found did it check cpuid. Dunno who originally checked that in, but it seemed pretty ridiculous to me. AMD has had an AES-NI implementation since Bulldozer.

I feel it would have been acceptable to say that if you had a mutant x86 CPU which set the AES-NI bit in CPUID and then had a non-compliant implementation, that was on you.


Probably some guy trying to clear his ToDo list, but these days, as we scratch below the surface, there's a real possibility that guy's paycheck was "Genuine Intel" too.


That feels like an easy string to search for across open source projects. Are there any other low-hanging fruit in core libraries?


AMD needs to do something like https://github.com/intel/mkl-dnn and https://github.com/opencv/dldt, either by talking to Intel to get them to accept contributions (vastly preferable), or, failing that, by forking the libs and implementing their own support. With recent work on efficient NN architectures the CPUs are pretty viable for most _inference_ (not training) tasks. DLDT/OpenVINO perf is particularly impressive. It does, however, extensively use JIT kernel generation through https://github.com/herumi/xbyak depending on the supported features for a particular chip (search for "mayiuse()" in the code to see examples). It's anyone's guess what it will detect and generate on an AMD chip however. Now that AMD is raking in more cash, it'd be a great time to invest some of that back into their dev ecosystem.


The relevant free low-level library in this area is libxsmm, which drove MKL to improve its small-matrix work by being faster a while ago. I don't know how it performs on AMD CPUs, or how much a priori detailed micro-architecture knowledge it requires, but I doubt the maintainer would refuse changes for AMD. It was mentioned, to little interest, in https://news.ycombinator.com/item?id=16600347

How specific to Intel CPUs is MKL-DNN (now renamed)? When I looked at it, the CPU code seemed fairly generic SIMD.


MKL-DNN is not MKL though. I think MKL-DNN is fairly specific to Intel CPUs if you want maximum performance. It'll necessarily make certain assumptions about instruction throughput and latency, something that's pretty much certain to be different on AMD in some cases. In super tight kernels such minute differences often boil down to much lower performance. So if AMD wants to do a good job, they'll need, at least in some cases, to detect AMD CPUs and JIT their own specialized kernels that perform well on AMD.


Sure, but it wasn't immediately obvious it was that specific when I looked, hence the question.


Would there be any merit in the OS hiding the CPU brand? Keep exposing which specific CPU features/instruction sets are supported, but don't reveal whether it is AMD/Intel/VIA. What's the benefit of exposing the brand at all? I find it hard to believe there are many "optimizations" out there that, despite the instruction sets being the same, do different things based on the specific CPU model or brand. If I'm wrong and there is a substantial amount of software that requires that, then perhaps it can be hidden behind some kind of request-access popup.


First, the OS can't hide that because it's not a middleman in the execution of program code. When a program is running on a CPU it has access to the instructions that tell it what CPU it is.

Second, there are already many ways for programs to get granular CPU feature information. It's entirely the software's fault if it decides to ignore that information and instead use a big "if" switch keyed solely on the manufacturer.


> First, the OS can't hide that because it's not a middleman in the execution of program code. When a program is running on a CPU it has access to the instructions that tell it what CPU it is.

Technically, it could, using virtualisation extensions to x86. CPUID can be trapped by a hypervisor, and the hypervisor can supply whatever result it likes. This is how Qemu/KVM simulates different CPU models.

Mainstream OSes don’t use virtualisation extensions for regular processes of course, and hypercalls would be even more expensive than regular syscall based context switches, so this is probably not something that is going to change in the short to medium term.


Could Windows 10 with Hyper-V enabled do it? They claim native performance when running it.


No, they're just lying. You can get close, but you'll never achieve quite the same performance.


10th-gen Intel (Ice Lake) has more instructions under AVX-512, and they are not limited to just Xeon processors.

So wouldn't comparing against the latest Ice Lake CPU running Intel MKL, instead of the Xeon W-2175, yield an even larger performance gap for workloads taking advantage of the new instruction subsets? i.e. AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES.


The server/workstation version of Ice Lake is not available yet and I wouldn't recommend buying a quad-core Ice Lake laptop for high performance computing.


Btw, the AMD Ryzen 3900X used in the comparison isn't a server CPU either.


We honestly don't know how this impacts the results as was noted at the end. I'd like to see the numbers from the same run w/ the env variable set on an Intel CPU. Perhaps this debug code does something drastically different?


Has anyone seen similar benchmarks, but with BLIS? OpenBLAS is interesting, but BLIS would possibly perform better.


Based on Zen1 Epyc performance it would seem that BLIS might be significantly more performant even without AMD-specific optimizations (I know AMD has their own fork, but check out the 64 core results for "vanilla" BLIS: https://github.com/flame/blis/blob/master/docs/Performance.m...)


AMD maintain a fork which is merged back as they make it available, so mainline BLIS typically has that work soon after. That was last done in October, after the current release (0.6.0). The FLAME people encourage you to use the current master branch, but you might worry about correctness, as it was failing BLAS tests recently on POWER and the generic target.


Something is still weird with the GEMM benchmarks. That link you sent has BLIS winning all around, but the absolute numbers for BLIS are poor. MKL gets around 2 TFLOPS on Xeons, and this BLIS benchmark is only hitting around 30-50 GFLOPS. Are AMD chips hitting over 1 TFLOPS anywhere?


Everything on that benchmark page is listed as GFLOPS/core, so at 30 GFLOPS that Epyc 7551 should be pushing around 2 TFLOPS. You can take a look at reproducing them if you have a similar system: https://github.com/flame/blis/blob/master/docs/Performance.m...

I ran into some compile issues, but I care more about numpy so I gave this script a try: https://markus-beuckelmann.de/blog/boosting-numpy-blas.html with some quick/dirty results on my Ryzen 3700X (8C16T) workstation (BLIS doesn't seem to perform very well on this small test). You can see performance basically double on MKL when MKL_DEBUG_CPU_TYPE=5 is used. I'll probably continue to stick w/ OpenBLAS for now:

  ## blis 0.6.0 h516909a_0
  # conda activate numpy-blis
  # conda run python bench.py
  Dotted two 4096x4096 matrices in 2.30 s.
  Dotted two vectors of length 524288 in 0.08 ms.
  SVD of a 2048x1024 matrix in 0.94 s.
  Cholesky decomposition of a 2048x2048 matrix in 0.24 s.
  Eigendecomposition of a 2048x2048 matrix in 6.36 s.

  ## libopenblas 0.3.7 h5ec1e0e_4
  # conda activate numpy-openblas
  # conda run python bench.py
  Dotted two 4096x4096 matrices in 0.41 s.
  Dotted two vectors of length 524288 in 0.02 ms.
  SVD of a 2048x1024 matrix in 0.55 s.
  Cholesky decomposition of a 2048x2048 matrix in 0.14 s.
  Eigendecomposition of a 2048x2048 matrix in 5.53 s.

  ## mkl 2019.4 243
  # conda activate numpy-mkl
  # conda run python bench.py
  Dotted two 4096x4096 matrices in 1.53 s.
  Dotted two vectors of length 524288 in 0.02 ms.
  SVD of a 2048x1024 matrix in 0.51 s.
  Cholesky decomposition of a 2048x2048 matrix in 0.29 s.
  Eigendecomposition of a 2048x2048 matrix in 4.79 s.

  # export MKL_DEBUG_CPU_TYPE=5
  # conda run python bench.py
  Dotted two 4096x4096 matrices in 0.33 s.
  Dotted two vectors of length 524288 in 0.02 ms.
  SVD of a 2048x1024 matrix in 0.29 s.
  Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
  Eigendecomposition of a 2048x2048 matrix in 3.29 s.
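
In case it helps anyone reproduce this: the operations above are simple enough that a rough stand-in for bench.py looks something like the sketch below (my own approximation, not the actual script from the linked post). Note that MKL_DEBUG_CPU_TYPE generally needs to be in the environment before the MKL-linked numpy is loaded, hence the export before running it.

  import time
  import numpy as np

  def timed(label, fn):
      t0 = time.time()
      fn()
      print(f"{label} in {time.time() - t0:.2f} s.")

  rng = np.random.RandomState(0)
  A = rng.rand(4096, 4096); B = rng.rand(4096, 4096)
  v = rng.rand(524288); w = rng.rand(524288)
  C = rng.rand(2048, 1024)
  D = rng.rand(2048, 2048)
  E = D @ D.T + 2048 * np.eye(2048)   # positive definite, for Cholesky

  timed("Dotted two 4096x4096 matrices", lambda: A @ B)
  timed("Dotted two vectors of length 524288", lambda: v @ w)
  timed("SVD of a 2048x1024 matrix", lambda: np.linalg.svd(C, full_matrices=False))
  timed("Cholesky decomposition of a 2048x2048 matrix", lambda: np.linalg.cholesky(E))
  timed("Eigendecomposition of a 2048x2048 matrix", lambda: np.linalg.eig(D))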


Oh, just for fun, I manually replaced the BLIS libs w/ the AMD BLIS 2.0 fork (Ubuntu MT binary on an Arch system: https://developer.amd.com/amd-aocl/blas-library/) and it seemed to work ok. Slightly faster than BLIS 0.6.0 but not actually faster than OpenBLAS on this test...

  ## AMD BLIS 2.0 https://developer.amd.com/amd-aocl/blas-library/
  # conda activate numpy-amdblis
  Dotted two 4096x4096 matrices in 2.33 s.
  Dotted two vectors of length 524288 in 0.08 ms.
  SVD of a 2048x1024 matrix in 0.78 s.
  Cholesky decomposition of a 2048x2048 matrix in 0.23 s.
  Eigendecomposition of a 2048x2048 matrix in 5.84 s.


Thanks. I completely missed that legend. Does it actually scale linearly by cores? I would think there's a plateau after a certain number of cores.


That benchmark page I referenced shows scaling on a SkylakeX at 1, 26, and 52 cores, on a Haswell at 1, 12, and 24 cores, and on Zen1 at 1, 32, and 64 cores, so you can probably map out a ballpark scaling coefficient from that.

I'm sure there are theoretical and practical limits, but it's probably completely dependent on the implementation and each system architecture; that's beyond my pay grade. Maybe start here: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...


If you want to understand implementation and performance, that page won't help you. Read the papers referenced on the BLIS web site, and/or run their benchmark code, which is now referenced somewhere in a README.


To illustrate this, let's assume the GEMM used avx512 perfectly. Roughly that's 2.5e9 (clock) * 16 (ops per cycle) * 2 (FMA per cycle) * 64 (cores). That puts Intel's peak performance over 2TFLOPs, which they're showing. That's about 80x higher than AMD on BLIS...


The article says they’d like to include BLIS in the benchmark, but nobody has written a good Numpy Conda package with BLIS hooks yet


The article was specific to AMD's BLIS fork, but just as an FYI, using standard BLIS w/ numpy (which I believe would still outperform OpenBLAS) is pretty straightforward:

  conda create -c conda-forge -n numpy-blis numpy "blas=*=blis"
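
One sanity check worth doing in each env, since it's easy to end up with a numpy that silently linked against something other than you intended: numpy can report what it was built against.

  import numpy
  numpy.show_config()   # the BLAS/LAPACK sections should mention blis, openblas or mkl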


Are there any reads you can recommend on how to recover from (possibly) failed experiments like these under pip / conda / Linux in general? I've been doing R and Linux for two decades, but am still perfectly capable of totally breaking my python/jupyter workflow by messing up dependencies without knowing how to recover. Would love to learn to remedy that gap and reduce the risk.


You can actually see in my pasted command that I created a new venv for the blis install. It's as simple as "activate" or "deactivate" to switch around.

That's basically all there is to it, but here are some docs for you to reference from the first page of search results:

https://docs.conda.io/projects/conda/en/latest/user-guide/ta...

https://uoa-eresearch.github.io/eresearch-cookbook/recipe/20...


When dealing with purely Python packages: always, always work in Python virtual envs. Conduct these kinds of experiments in separate venvs. If worst comes to worst, a reset consists simply of deleting the venv directory and installing from scratch.

If an experiment involves system packages it's a little trickier. You can learn about Docker (which does for the whole system what venv does for python packages), but in most cases you may get away with just taking note of what packages you have installed/uninstalled, so you can revert the process later if needed.
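
For the pure-Python case, a minimal sketch of that throwaway-env workflow using only the standard library (the env name is made up, and the bin/ paths assume Linux/macOS):

  import venv, subprocess, shutil

  env_dir = "throwaway-env"                  # hypothetical experiment environment
  venv.create(env_dir, with_pip=True)        # isolated env, nothing touches your main setup
  subprocess.run([f"{env_dir}/bin/pip", "install", "numpy"], check=True)
  # ... run the experiment with throwaway-env/bin/python ...
  shutil.rmtree(env_dir)                     # recovery is just deleting the directory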


Never work as root; then recovery is just undoing whatever changes you made to get it working.


Why would you need a Conda package to measure the performance of BLIS or another implementation? Assuming an ELF system, either preload an alternative library, or make a trivial shim for libblas.so.3, as the Fedora blis package does. See the somewhat old https://loveshack.fedorapeople.org/blas-subversion.html for example.


because all of the people "benchmarking" cpus nowadays are just clueless self-advertisers...


As with a previous article like this -- it's pointless. Just use BLIS, like AMD does, which is infinitely faster than MKL on non-x86 systems (where MKL doesn't run at all). https://github.com/flame/blis


That's some seriously scummy stuff from Intel.


Is the "MKL" mentioned by the article referring to https://docs.anaconda.com/mkl-optimizations/ (or something similar)?

If yes, then there is apparently active engagement by Intel (therefore involving people & money), and I don't see anything wrong if they optimize their code for their CPUs and then ignore other manufacturers (AMD) - mooost probably they just don't have a reason to improve their competition, and trying to do that anyway (without having full insight into the "foreign" CPU architectures) might generate even worse performance, which in turn would backfire (e.g. "aaahh Intel is sabotaging AMD CPUs").

So, personally I would have no problem if, when using iMKL ("Intel MKL"), the code ran faster on Intel CPUs by default and was conservative on other CPUs - but if I owned an AMD CPU, then I would expect AMD to implement some similar optimization for their own CPUs (aMKL?).


What you are talking about isn't the situation. They are intentionally detecting CPU features wrongly so as to not use features AMD CPUs have. Nobody is expecting special AMD optimization attention.


Ok, if it's like that ("intentionally detecting cpu features wrong") then I would agree.

Can a "CPU feature" (e.g. sse2, avx, aes, etc...) be implemented the same way between different architectures?


(Until someone has a more detailed answer) Both Intel and AMD are on the same x86 architecture, and SSE2 etc. are just extensions of the x86 architecture; you can check which feature flags a CPU has.

And the call/use of these features is the same for AMD or Intel


They explicitly look for the "GenuineIntel" product string and then basically disable all the efficient code paths if it isn't found, totally against their own advice for ISA extension detection [1].

https://www.agner.org/optimize/blog/read.php?i=49
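
For what it's worth, checking the advertised feature bits instead of the vendor string is trivial; a minimal sketch of doing it from user space on Linux (reading the flags the kernel exposes in /proc/cpuinfo rather than issuing CPUID directly):

  # The "flags" line mirrors the CPUID feature bits the kernel detected.
  with open("/proc/cpuinfo") as f:
      for line in f:
          if line.startswith("flags"):
              flags = set(line.split(":", 1)[1].split())
              break
  print("avx2" in flags, "aes" in flags)   # True on any CPU that advertises them, Intel or AMD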


I'm curious if AMD could produce an internal engineering sample that simply copies all the relevant product strings from the competing vendor. They could check for statistically relevant performance discrepancies from just that modification.


Sounds like user-agent strings all over again...


don't need to make samples, you can fake the identifiers with virtualization and measure the delta.


But that has virtualization overhead. You don't want to measure that/risk being accused of measuring that.


Run both processes in a virtual environment then. Hell, that's where a lot of machines are running anyway. At least in server environments.


Also, the overhead these days is pretty marginal. I've been running macOS on an AMD CPU emulating a Pentium (since macOS also looked for Intel), and it beat all the Mac Pros of that time in Cinebench etc.


yes :) RDRAND is implemented differently


> they just don't have a reason about why they should improve their competition

I don't think this is quite true.

While their compiler team does care a lot about benchmarks, ultimately, they sell compilers. Some users won't use their compilers if the generated code runs too poorly on AMD's CPUs: oftentimes 95th percentile performance is more important than median or mean.


Their compiler exists to sell their own chips. They lost an FTC case about this, and their solution was to add small print saying that when running compiled code on non-Intel CPUs it "may" not perform as well.

There are ways to patch ICC-produced binaries to disable this GenuineIntel check.

https://www.ftc.gov/news-events/press-releases/2010/08/ftc-s...


Mmmhh, I have doubts about this - their business selling compilers isn't as important as the one selling CPUs. There would of course be an indirect long-term effect, etc.


Maybe, but please note this part of the article: "It is also possible that the resulting code path has some precision loss or other problems on AMD hardware. I have not tested for that!"


I'm a lot more willing to believe that Intel is attempting to cook benchmarks, than it being a decision made out of caution to save face for AMD if the results are incorrect. They sat for an excruciatingly long time making no obvious improvements to their processors, and then suddenly more than halved the price of everything the moment there was competition.


Yeah, some companies might deserve the benefit of the doubt but Intel isn't one of them. They shouldn't be judged in a vacuum; Intel is a company with a long and nasty history of underhanded bullshit.


> than it being a decision made out of caution to save face for AMD if the results are incorrect.

It's not about "saving face" for anyone, it's about the possibility of bug reports that need to be investigated, and possibly new code paths need to be added if a fix is necessary.

> and then suddenly more than halved the price of everything the moment there was competition.

This would have happened regardless of how much or little their products improved in the prior years. The price is what enough people are willing to pay, and that goes down if there's an alternative.


eg. "AMD shipped Ryzen 3000 with a serious microcode bug in its random number generator." "Windows users couldn't successfully launch Destiny 2, and Linux users in many cases couldn't even get their system to boot." https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd...


The article currently states the Destiny 2 issue was unrelated to the bug in the random number generator.



See the caveat at the end:

> It is also possible that the resulting code path has some precision loss or other problems on AMD hardware. I have not tested for that!

Intel doesn't test it either, which is why they disable it.

You could argue that they should test and enable on AMD CPUs, but this isn't clearly scummy behavior (to me), it's not a zero cost decision they made solely to screw over AMD. Buying the hardware and testing it has a cost. It may not even be possible to test it to the same degree they test their own hardware. There may be modes, subtle differences between architecture versions, etc. that are hard to know about to test for.


I don't understand. Shouldn't Intel be using the CPUID feature detection system they created?

https://software.intel.com/en-us/articles/how-to-detect-new-...

They don't need to test to see if it works on AMD. The processor advertises it. In their own words:

> It is important to understand that a new instruction is supported on a particular processor only if the corresponding CPUID feature flag is set. Applications must not assume support of any instruction set extension simply based on, for example, checking a CPU model or family and must instead always check for _all_ the feature CPUID bits of the instructions being used.

Indeed, my 3900X has the AVX2 flag set. So why would it not get used?

I don't think the precision loss argument really holds water. If you can't trust their AVX2 implementation, why can you trust any other instruction on the CPU to not have similar problems?

Unless I am misunderstanding something, it really looks like they're not following their own advice and artificially limiting the use of AVX2 to only Intel processors.

edit: Also, we really have no reason to trust Intel's honesty. Don't forget:

https://web.archive.org/web/20110312082557/http://www.compil...


Many instructions (that likely see heavy use in MKL) are approximations; the exact implementation and behavior of these approximations varies across platforms (possibly including across architecture versions).

The docs you quoted describe the necessary conditions for using an instruction to avoid crashes/illegal-instruction issues. That doesn't cover the kinds of more subtle differences that exist.

For example, I just made a plot of |1/x - rcpps(x)|: https://snipboard.io/Rei74t.jpg


I'm very much not buying it.

Although of course an instruction that computes an approximation can vary across different micro-architectures and vendors, there has to be some definition of what the baseline is. Without it, the instruction may as well output junk.

Indeed, I just checked and Intel's "Intel 64 and IA-32 Architectures Software Developer’s Manual" contains a bound on the error of the approximation in rcpps:

    The relative error for this approximation is:

        |Relative Error| ≤ 1.5 ∗ 2^−12

In fact, the document goes on to further define a lot of the behavior of the instruction in edge cases. Because that's kind of important for an architecture. If the extended instruction set did different things on every different processor, it would be useless.

So I don't think this is really the issue. I don't think Intel MKL defines what the error is for many operations, so it is presumably not defined, and therefore it'd be hard for AMD's implementation to run afoul of it.

It's also not going to be reproducibility/consistency, since they cover this in a separate issue: https://software.intel.com/en-us/articles/consistency-of-flo...


That's the maximum error. The error is often much less than that (as the plot shows). Code might unintentionally be relying on the error being much smaller than the spec requires, and on another architecture, the instruction might not meet this (unintentional!) requirement.
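
To make that concrete, here's a toy numpy sketch: the two "vendors" below are completely made up, both stay within the documented 1.5 * 2^-12 bound, yet a downstream comparison that someone might have unknowingly relied on flips between them.

  import numpy as np

  SPEC_BOUND = 1.5 * 2**-12              # documented max relative error for rcpps

  def rcp_vendor_a(x):                   # hypothetical vendor A: round-to-nearest at 14 bits
      return np.float32(np.round((1.0 / x) * 2**14) / 2**14)

  def rcp_vendor_b(x):                   # hypothetical vendor B: truncate at 14 bits
      return np.float32(np.floor((1.0 / x) * 2**14) / 2**14)

  x = np.float32(7.0)
  for name, rcp in [("A", rcp_vendor_a), ("B", rcp_vendor_b)]:
      rel_err = abs(rcp(x) - 1.0 / x) * x
      print(name, float(rcp(x)), bool(rel_err < SPEC_BOUND))      # both within spec
  # ...but code that quietly assumed rcp(x) * x >= 1.0 only holds on one of them:
  print(rcp_vendor_a(x) * x >= 1.0, rcp_vendor_b(x) * x >= 1.0)   # True False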

Of course an issue stemming from something like this is rare/unlikely. But Intel also can't assume that the consequences for a related error are small.

It seems like most people commenting irately about this in this thread are hobbyists or are otherwise working on low consequence software. Yeah, for these uses, it would probably make sense for intel to just look at the CPUID and leave it at that. But they can't know that, and they aren't optimizing solely for these use cases.


I am suggesting that if Intel does not care about their code running on AMD processors, then it should not arbitrarily change behavior based on the presence of the string “GenuineIntel.” End of story.

But I suspect Intel does care, in the sense that they certainly care about how they perform and benchmark against AMD.

Let’s put it another way. What happens if Intel enables AVX2 for AMD and it doesn’t work?

1. Precision is not as good as intel?

1.a. Many developers care and it impacts AMD negatively. (Probably nothing happens since nothing is actually broken per se.)

1.b. Few developers care and it suggests Intel is not optimizing for the best performance trade-off.

2. Programs using MKL crash on AMD processors.

2.a. There is a bug in MKL: it probably gets fixed with little fanfare.

2.b. There is a bug on AMD processors, and it probably gets fixed in microcode.

3. AMD processors produce wrong results. There seems like there can only be one outcome here, and it’s probably worse for AMD than Intel. The Intel FDIV bug proves that people take these things very seriously. At best it could be fixed in microcode, and at absolute worst AVX2 could be disabled entirely in an update.

The best argument for why Intel would still rationally avoid AVX2 on AMD outside of the above is to ensure their customer’s code runs correctly on AMD processors. However, there’s still some problems:

- They have to contend with their own processors, too. If they write code they know depends on non-public details of the architecture, they could very well break themselves.

- What is the definition for “break?” Intel surely has test suites for their software, but just because the software’s precision may be worse (still pure conjecture) when running under AMD does not suggest it is broken. All we can say is that users may rely on the behavior of precision on their specific configuration, but the problem is that’s not specific to AMD. Intel alludes to the fact that precision can be different amongst Intel processors in several places on MKL’s documentation. Nothing stops someone from improperly relying on this.

By playing fast and loose with specifying an architecture then writing software that disregards it, I’m not sure how defensible a position Intel is sitting in from pretty much any perspective.

> “otherwise working on low consequence software”

No offense but it is hard to respectfully respond to something like this, even though I know it wasn't directed at me personally. I hope you don’t think about coworkers with this mindset.


> Let’s put it another way. What happens if Intel enables AVX2 for AMD and it doesn’t work?

The "safe" thing to do is rather than work through all of these cases you've considered, is to just give up on it.

> No offense but it is hard to respectfully respond to something like this, even though I know it wasn't directed at me personally. I hope you don’t think about coworkers with this mindset.

I didn't intend it to be disrespectful. It's not a value judgement. Maybe these aren't the best terms, but I've worked on both what I would call "low consequence" software, and "high consequence" software. The tolerance for decision making about things like swapping out hardware is different. In some cases, this can be done with little hesitation. In other cases, it would cost millions of dollars to test the system to the point where it could be trusted.

Maybe the solution to the internet pitchfork wielders is for intel to simply stop trying to serve both of these sets of users with the same software. But I suspect they simply don't care about this problem enough to do anything.


But they don't give up; they fall back to SSE, which ALSO might have exactly the same issues as the AVX2 instructions.


They probably fall back to SSE2, which is the lowest SSE level guaranteed on any 64-bit x86 processor. That might be their minimum/baseline code path, and they don't want to build/maintain any other code paths.


Your argument might be worth considering if Intel didn't use the same "cripple AMD" check across a lot of other libraries and their own compiler too - where there are no such "numerical approximation" concerns.

They've been doing it long before MKL [1].

---

[1] https://www.agner.org/optimize/blog/read.php?i=49


And how exactly does that plot show that AMD's AVX instructions are faulty?


It doesn't. It shows that the behavior of instructions like this is very unpredictable and subtle, and that code using it that was tested on one processor might not behave the same on another processor. It doesn't mean one is faulty and the other is not.


A processor would be faulty if the error or behavior fell outside of the architecture specification. The behavior is not unpredictable because the behavior is defined by a specification.


I completely disagree. Your plot doesn't show that CPU instructions are unpredictable (which is not correct). Yes, IEEE floats are not the same as real numbers (which is what your plot is showing): they have inherent precision errors, and float operations are not even associative.

That being said, IEEE floats are carefully and consistently defined, and are perfectly predictable. The unpredictability you claim is not due to stochastic errors or faulty implementations by CPU vendors; the errors are part of the IEEE definition and are deterministic.


The '1/x' part of this experiment is described by an IEEE spec. The reciprocal instructions are not. This experiment doesn't cover the difference between theoretical real numbers and float computations, it's the difference between two different ways of working with floats. One of them is IEEE specified, the other is not.

In practice, using these kinds of instructions (which are not specified by IEEE) can give massive performance advantages.


And how is the difference between IEEE-754 1/x and the SSE reciprocal instruction on the same CPU relevant to Intel disabling AVX/AVX2 on AMD CPUs?

So your logic is: given that IEEE-754 1/x and SSE rcpps yield different results on a single Intel CPU, then... AMD's AVX implementation cannot be the same as Intel's AVX implementation and therefore, Intel is perfectly correct in disabling AVX on AMD?


The logic is:

1. Some instructions definitely have implementation defined behavior that will vary across platforms. rcpps is just one example to illustrate the point.

2. Therefore, if you test something on an intel CPU, it might not behave the same on an AMD CPU (and vice versa). This doesn't mean AMD's implementation is bad. (For all you know, it might work on intel because you are accidentally exploiting a bug, and that bug won't exist on AMD, and break your software.)

3. Therefore, if you don't test something using these on an AMD CPU, it could be risky to enable it.

As an aside, I don't necessarily think this logic is the only factor that should go into this decision and I'm not sure it's the right one for intel to make. But there is some logic to it that isn't simply intel screwing over AMD, which is the only point I've been trying to make here.

I think what you can fairly say about intel here is they are being lazy and overly risk averse, but not anti-competitive (in this instance!).


This is a pretty long thread for you to not notice that your data doesn't show an implementation defined behavior difference because it's ... a graph of just one implementation.

Approximating 1/x this way has been used in the overwhelming majority of floating point division units for a long time, even before the SSE instruction came about, so I'm pretty dubious that either Intel or AMD is changing their answers. Perhaps you could prove it.


He’s right. The values are actually different. Here’s a graph of two: http://const.me/tmp/vrcpps-errors-chart.png AMD is Ryzen 5 3600, Intel is Core i3 6157U.

Over the complete range of floats, AMD is more precise on average, 0.000078 versus 0.000095 relative error. However, Intel has 0.000300 maximum relative error, AMD 0.000315. Good news is both are well within the spec. The documentation says “maximum relative error for this approximation is less than 1.5*2^-12”, in human language that would be 3.6621E-4.

Source code: https://gist.github.com/Const-me/a6d36f70a3a77de00c61cf4f6c1...


As you point out, the spec includes threshold values for errors, and the actual epsilon error values vary between vendors as well as within the CPUs from the same vendor.

He's wrong, because 1) this discussion is about his false claim that Intel is justified in disabling AMD's AVX/AVX2 implementation because it fails to meet the spec, and 2) his plot doesn't show any variance between AMD and Intel.


My claim is not that AMD doesn't meet the spec, as I've repeatedly stated in this thread. The claim is that it's different, and so you can't assume testing on one processor is sufficient to guarantee the results on another. And again, that this particular instruction is just one of many examples where intel and AMD will differ in their results.

The point of my plot was to show that the behavior of instructions like this is super weird and random, it's not something that we should expect to be consistent across vendors. It's probably not consistent across all of intel's CPUs either, but they're probably aware of and test for all of these.


The LHC@Home BOINC project had issues due to this. Work units calculated on Intel CPUs would frequently not match the same work unit calculated on AMD CPUs. They traced it down to the subtle differences in results from certain instructions, as shown in the parent.

Since the LHC@Home project simulated protons circulating in the LHC, the small differences added up over the typically large number of time steps.

IIRC their solution was to pay a small price and use software implementations of those instructions, but I can't find the reference right now.
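
As a toy illustration of why a last-bit difference matters in that kind of simulation (this is just a generic chaotic map, nothing to do with the actual SixTrack code):

  # Two runs of the same chaotic iteration, differing only by a ~1e-15 perturbation
  # in the starting value (standing in for a last-bit difference between CPUs).
  def run(x0, steps=100):
      x = x0
      for _ in range(steps):
          x = 3.9 * x * (1.0 - x)    # logistic map, chaotic at r = 3.9
      return x

  print(run(0.5), run(0.5 + 1e-15))  # after 100 steps the two trajectories no longer agree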


That indicates numerically unstable code on his/her part though. Scientific results shouldn't depend on unspecified epsilon values that fall within the spec.

If they're getting different results on different CPUs which both implement the spec correctly, they need to fix their code anyways.


Besides using a deterministic floating-point implementation, how can you avoid that in sims like this? A slightly different force on the particle at the current time step will cause it to end up in a slightly different place.

edit: They do run sims with slightly different initial conditions to get the physics rather than simulation artifacts. The issue here is that for a given set of initial conditions, results computed on different machines would not agree.


All floating point implementations are deterministic. You do get the same results on each run on a given CPU. There is no stochastic floating point implementation in AMD or Intel.

I understand that they don't agree when run on different CPUs. But they both work correctly, because IEEE floats and operations are defined up to a certain level of precision by specification.

What I'm saying is, if they obtain different results across CPUs within their accuracy target for results, their code is simply broken and they need to change their code to either use a numerically more stable algorithm or higher-precision floats. No scientific software should rely on unspecified numerical behavior that the spec leaves open and that can change in the future even with the same vendor.

You also have to remember that there are all sorts of people at CERN, from undergrads learning to code on the go to software engineers with no physics background. Just because a piece of code made it into a CERN repository at some point doesn't mean it's a gold standard, the word of gods (which appears to be your premise), or that CPU vendors are to blame for any problems.


> All floating point implementations are deterministic.

The entire point here though is that implementations are allowed to produce different results, and that code that needs to produce the exact same result using different implementations need to take this into account.

The application (SixTrack) had, AFAIK, only been used in compute clusters before. When they started using it in LHC@Home, running on a large variety of user hardware, this issue was exposed.

> What I'm saying is, if they obtain different results across CPUs within their accuracy target for results, their code is simply broken and they need to change their code to either use a numerically stable algorithm or higher-precision floats.

Right, that's what they discovered and that's what their solution was as I mentioned in my original post.


> and that code that needs to produce the exact same result

... is broken. "The same result" is a complicated question when floating point computations are involved. Even in an IEEE 754 environment, the compiler can easily cause non-bit-exact differences over time on the same chip.
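
The standard two-line demonstration, entirely within IEEE doubles on a single machine:

  a, b, c = 0.1, 0.2, 0.3
  print((a + b) + c == a + (b + c))   # False: same doubles, different grouping
  print((a + b) + c, a + (b + c))     # 0.6000000000000001 vs 0.6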


I agree that code that depends on exact results from things that are known to depend on implementation defined behavior is broken code.

But the reality is that a ton of scientific and high performance code does it, and simply declaring it broken and not dealing with it will not do you any favors if you are trying to work with such systems.

It's similar to how microsoft has to maintain decades of bugs because a ton of software depends on them and would break if they fixed them. Yeah, microsoft would be "correct" to fix them and declare the code buggy. But that would piss off a lot of people that don't care about this argument and probably lose microsoft money. It's the same situation here.


I deal with it by using the numerical skills I picked up in grad school. There are no surprises here for anyone who's studied scientific computing. You appear not to understand the issues involved. Robust code doesn't have a problem with different compilers or implementation-defined behavior that's within bounds.


Good for you? I didn't originally write most of the code I have to work with. Just like intel doesn't write the code that depends on MKL.


This makes no sense at all. The differences that you're seeing fall within the spec. You're basically saying that Intel is justified in crippling AMD CPUs if AMD's floating point instructions don't implement Intel's out-of-spec quirks as well. This is a flimsy argument, the FUD that Intel has been propagating for years now, and I'm not sure why you keep on pushing the Intel FUD so hard again and again.

There is basically nothing which guarantees that the numerical errors that are within the error tolerance of the spec won't change in new iterations of Intel CPUs either.

If the correctness of your calculation depends on behavior that is left unspecified within the spec, this means you need to change the code to a more stable algorithm or use higher precision floats anyway.


> There is basically nothing which guarantees that the numerical errors that are within the error tolerance of the spec won't change in new iterations of Intel CPUs either.

You're absolutely right. The difference is that if a customer of a library like MKL comes to intel and reports a problem like this, intel is going to help troubleshoot and fix the problem, and they're going to have a lot of internal technical documentation to help understand and fix the differences.

> The differences that you're seeing fall within the spec. ... This is a flimsy argument, the FUD that Intel has been propagating for years now, and I'm not sure why you keep on pushing the Intel FUD so hard again and again.

The problem is that it's really hard to know if code depends only on the requirement in the spec, or depends on more specific behavior present in the machine(s) you tested on. Call it FUD if you want. A few people in this thread have reported spending significant effort on troubleshooting implementation defined behavior. Yeah, code that depends on this is bad, but bad code exists and sometimes people with the bags of money care about it.


I can't tell if you're being naive or deliberately perpetuating Intel FUD. You're being part of the problem by doubling down on some ambiguous comment (which is not even based on any real test) as if there actually is a problem with AMD's AVX implementation.

As a physicist, I can tell you for sure that no, this is not why Intel disables it. Intel had quite a lot of success by purposely crippling icc/ifort and MKL on AMD, and used the GenuineIntel check as a legal barrier to prevent AMD (and AMD users) from coming up with a workaround, which allowed Intel to essentially kill the competition in HPC.

It is a well-known fact that Intel has been actively working to cripple AMD's performance across the board. They invented the GenuineIntel check exclusively for that purpose, which was part of a bigger picture filled with false advertisements, bribes (the most famous Intel bribe cases involved Dell), smear campaigns, lawsuits, and so on and so forth. Intel has a long history of playing dirty against competitors.


> I can't tell if you're being naive or deliberately perpetuating an Intel FUD. You're being part of the problem by doubling down on some ambiguous comment (which is not even based on any real test)

I have never actually seen Intel comment on this, I just personally have experienced headaches porting numerical code between two different architectures (not AMD <-> Intel though), where on the surface the architectures appear to be the same and straightforward to swap between, but in practice they are not.

This doesn't mean one of the architectures had "problems" and the other did not. It's not about one architecture being inferior to the other; they're simply different.


I'm not sure what you're trying to imply here. I'm not sure if it's relevant either, as this is about the implementation of the same instruction set (AVX/AVX2) on Intel and AMD, whereas you say "not Intel <-> AMD though". Can you be more specific?

In any case, I never saw any real reason which warrant disabling an entire instruction set, such as AVX. Which is why you don't see such artificial crippling in open source implementations of LAPACK/BLAS/sundials/etc, and people (including me) have been using the same fortran code for many decades across many architectures.

And in case this is what you're trying to imply, no, they don't really give different numerical results on different CPUs.


My relevant experience is in porting numerical code between CPUs and GPUs. Some of the issues that have caused problems are:

- Different precision of approximate math (transcendental functions, reciprocals, etc.)

- Different rounding behavior of the intermediate result in multiply-add instructions.

- Different handling of exception cases (inf, nan, etc.).

- Aside from correctness differences, some optimization strategies that make things faster on one processor make them slower on another. This happens even within different generations of x86 hardware.

> Which is why you don't see such artificial crippling in open source implementations of LAPACK/BLAS/sundials/etc

Are they as fast as MKL? If so, just use them?

If not, why not? Maybe the reason is you can do better if you optimize for specific CPUs, with different latencies of various instructions?


Porting between different instruction sets is a very different thing from this situation.

> Aside from correctness differences, some optimization strategies that make things faster on one processor make them slower on another. This happens even within different generations of x86 hardware.

This is the one notably relevant part and, yeah, that's fine. Follow the CPUID features. Nobody expects it to be absolutely optimal on AMD. But let it use the code that was optimized for Intel chips with the same features.


Then it truly is irrelevant! You're not even talking about CPUs vs CPUs.

The differences you are quoting come from differences in libraries (sin, exp, etc. will give different results depending on the libm implementation; that's normal and it has nothing to do with CPU instructions!), not from the implementation of IEEE instructions (assuming that you're talking about IEEE floats; otherwise, you shouldn't expect them to behave the same in the first place!).

> Are they as fast as MKL? If so, just use them?

I (and a lot of other people) do use them, when I have a choice. Sometimes they are faster, sometimes they aren't. When there is a significant disparity, however, it usually is because of GenuineIntel checks.

> If not, why not?

Because scientific software geared toward applications is usually closed-source proprietary or too complicated to be modified (remember that users aren't interested in becoming software engineers, in addition to their own jobs as researchers) to add new alternative backends and you don't get to choose.


"Sometimes they are faster, sometimes they aren't. When there is a significant disparity, however, it usually is because of GeniueneIntel checks."

So you are saying that your open-source BLAS/LAPACK shows performance differences (and worse performance compared to MKL) because of "something Intel". Seems like a lot of people here (including the ones not being able to compile numpy against another BLAS) are a little bit short on actual experience with, or knowledge of, the problem...

"scientific software geared toward applications is usually closed-source proprietary or too complicated to be modified" If it's geared towards applications, it's usually opaque engineering stuff and the results of people claiming to do science with this software are mediocre at best... In my domain (qunatum-chemistry) nearly all software is delivered as source-distribution. Because modifications of methods are part of science...


That's funny, because I was actually talking about your field! (whose math is borrowed from one of the subfields of physics) I haven't so far met even a single chemist or materials scientist who actually knows what they're running, even when they have access to the millions of lines of code they're using. And I don't blame them (or call them "mediocre" as you do), because they only have 24 hours in a day and only one life!

I met only one computational physicist so far doing DFT, who used to write his own code back in the 70s, but he admits he has no idea what VASP and the others are doing nowadays.

If you're claiming that you actually know how VASP or Quantum Espresso (or any other similar significant piece of software) works in depth and you can tweak/replace any part as you like (which I'd find very very hard to believe, millions of work hours go into the development of those), you'd nevertheless be the exception in chemistry not the norm.

The most common high-level tools theoretical physicists like me use (such as Mathematica) don't give access to source code, on the other hand, so we can't make Mathematica not use MKL and not suck on AMD.


> Then it truly is irrelevant! You're not even talking about CPUs vs CPUs.

Why does that matter? The bulk of the issues come from implementation defined behavior, of which there is plenty within x86 itself to cause issues.

In general, the IEEE-compliant parts of x86 are also IEEE-compliant on other processors, at least the ones I've dealt with. It's the operations that aren't specified by IEEE that cause problems.


How solid is that legal barrier? User agent strings, which are a similar case, are really weird to deal with because of sniffing.

Edge supposedly has the string "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 Edg/44.18362.449.0".


Until the opposite has been shown, the assumption should be that a CPU that indicates that it supports AVX2 does support AVX2.


Yup, the number of people in this thread buying into Intel's excuses is astounding to me.


There are over 10 different x86 CPU vendors, some with very dodgy implementations of basic x86, not to mention extensions. It could have been just a typical "safe" corporate decision to run the optimized code path only on specific Intel CPUs. They use different code paths for different Intel CPUs/generations as well. Now imagine they had to believe that Hygon/SiS/Vortex implement it properly... It's not completely black/white.


That is not the case; this is at the boundary of software and hardware, and real-world concerns dominate. It is quite possible that a library highly optimised for a specific hardware set could have ghastly bugs on hardware it isn't tested on, and testing has a cost.

Intel is being reasonable and in a sense prudent by only supporting hardware they understand very well. It isn't like they are sending the lawyers out to sue anyone writing high performance code on AMD chips - that would be unreasonable.


The usual approach (e.g. taken in the linux kernel, llvm, hotspot jvm) is to look at the cpuid capability flags and rely on them by default and only blacklist specific CPU models when they don't implement advertised features correctly, not the other way around.
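
A sketch of that dispatch pattern (the kernels and the blacklist entry here are hypothetical placeholders, not real errata):

  def avx2_kernel():    return "avx2 path"
  def sse2_kernel():    return "sse2 path"
  def generic_kernel(): return "generic path"

  # Hypothetical erratum list: (vendor, model) pairs known to misreport a feature.
  KNOWN_BAD_AVX2 = {("HypotheticalVendor", 0x42)}

  def pick_kernel(features, vendor, model):
      """Trust the advertised feature bits by default; blacklist only known-bad models."""
      if "avx2" in features and (vendor, model) not in KNOWN_BAD_AVX2:
          return avx2_kernel
      if "sse2" in features:
          return sse2_kernel
      return generic_kernel

  print(pick_kernel({"sse2", "avx2"}, "AuthenticAMD", 0x71)())   # -> avx2 path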


Then don't test it. It is perfectly ok (and even expected) for Intel to not test on AMD. Just don't intentionally cripple it.


>Intel doesn't test it either, which is why they disable it.

I think the software part of Intel should not sabotage the competition; we don't like it when Google devs sabotage Firefox, intentionally or by laziness. As a software dev you test for the feature; if it is present you use it, and if for some reason the hardware is broken then you implement workarounds if your customers demand it, or otherwise let AMD fix the broken feature. As a developer myself, the least I could do is add a flag to turn the crippling on/off, leave the fast path on by default, and then the users can decide whether precision/thermals or other things are worse on AMD and enable crippled mode.


The idea that an R&D department has cost restrictions doesn't really hold water. You can be sure that they already have rooms filled with every single AMD processor that's ever been released, and a bunch which haven't. The cost of buying retail products at the scale of a company, even a small startup, is peanuts compared with even a single engineer's salary. I was recently asked which test phones to buy for application development, and didn't get pushback when I said to buy every single one of the top sellers on Amazon, to begin with.


The question is why would they spend that much money to test it on a competing platform.


The problem is not testing per se. Ok, they tested and found out MKL is 50% slower. Then what? Write a proposal for the MKL team to spend time/money making MKL faster on AMD? I don't think this would fly with upper management...


Intel's pretty motivated to do so for competitive analysis reasons even if they don't care about doing QA to benefit AMD.


Well, I'm pretty sure Intel tests competitors' products, and maybe even tests MKL on them. So what? Suppose I (being an Intel engineer, which I'm not) tested linalg+MKL on a Threadripper and found it to be 50% slower. Then WHAT? Ask management for more money for the MKL team to make it faster on AMD? I don't think such a proposal would fly.


I don't think anybody expects Intel to do anything in that case. What is expected is not to cripple performance when running on competitors' CPUs.


What kind of reasoning is that? When I develop software I don't start by writing an if-statement that exits if the user's hardware is not identical to the hardware that I wrote and tested the software on.

Certain things should carry the expectation that they are compatible, not the other way around. Innocent until proven guilty! I would say an instruction set (the sole purpose of which is to avoid scenarios like the one you're describing) is one of those things.


Almost no software is tested on a proper suite of different architectures. If you don't want to test it, the proper answer is to emit valid code without any shenanigans.


The caveat at the end of the article is grounded in reality. I've personally experienced material precision bugs when running Intel-targeted code on AMD microarchitectures. These edge cases are very difficult to test for. The only reason we ever detected instances of these bugs is because we ran numerical codes at massive scales (finding the edge cases by brute force) and customers occasionally ran identical test workloads across both Intel and AMD clusters and compared numerical results. Every so often, something would surface.

It is not Intel's responsibility to ensure that there is no obscure edge case in AMD's microarchitecture that may cause the MKL algorithms to give incorrect results on AMD silicon. MKL's design is based in part on Intel knowing the internal details of their own silicon.


> MKL's design is based in part on Intel knowing the internal details of their own silicon.

This is just conjecture unless there’s a citation to support it.

Besides, the microarchitecture design is bound to change dramatically over time just as a matter of course. Relying on things that they don’t document publicly is probably not a wise strategy. How would the teams coordinate not breaking each-other? I can imagine such a position would be incredibly unpopular amongst those working on silicon designs (and maybe not popular amongst those working on software, either.)

Also, nobody has really presented a reasoning that justifies this behavior, since if it broke on AMD hardware, it still wouldn’t be Intel’s problem, and therefore perhaps they should just use the instruction sets as intended and documented instead.


This is literally discussed in the MKL documentation, which spends considerable time on numerical reproducibility across microarchitectures. You have a rather strong opinion for someone who can't be bothered to educate themselves on the subject matter.

I'm not an Intel fanboi but casually dismissing this specific issue is just ignorance. Intel's algorithms, for better and worse, are very aware of the limitations of their silicon implementations.


I am sitting here without a citation of any kind.

edit: In retrospect this comes off as unnecessarily snarky. Let me clarify:

- I am just asking for a specific citation about this issue.

- Reproducibility is one case where details about microarchitectures matter, but it is not actually the default behavior of MKL, and the documentation makes it clear that its limitations are largely unrelated to the normal operation of MKL; for example: https://software.intel.com/en-us/mkl-windows-developer-guide...

> Dispatching optimized code paths based on the capabilities of the processor on which the code is running is central to the optimization approach used by Intel MKL. So it is natural that consistent results require some performance trade-offs. If limited to a particular code path, performance of Intel MKL can in some circumstances degrade by more than a half. To understand this, note that matrix-multiply performance nearly doubled with the introduction of new processors supporting Intel AVX2 instructions. Even if the code branch is not restricted, performance can degrade by 10-20% because the new functionality restricts algorithms to maintain the order of operations.

This bit here is pretty important because it suggests that strict CNR mode can disable AVX2 codepaths even on Intel. It isn't doing so on AMD specifically for reproducibility; it just won't load AVX2 codepaths on non-Intel processors, period.
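
If I remember the MKL docs correctly, that strict CNR mode is opted into via the MKL_CBWR environment variable; treat the exact values below as an assumption to check against the docs, and note it has to be set before MKL is loaded:

  import os
  # Assumed CNR knob: pin MKL's dispatch to one code path so results are reproducible
  # across microarchitectures. Must be in the environment before MKL initializes.
  os.environ["MKL_CBWR"] = "AVX2"     # or "COMPATIBLE" for the most conservative path
  import numpy as np                  # an MKL-linked numpy picks this up when MKL initializes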


It's a shame this comment is downvoted, considering that it's the first one to clearly show that the reproducibility argument for MKL's behaviour is bogus:

MKL purposefully crippling AMD's performance does nothing for reproducibility between the microarchitectures, because to get proper reproducibility you have to cripple the performance on both AMD and Intel. And of course that's not what Intel's doing by default.


Correct, it's not their responsibility to avoid AMD bugs. So if switching from fast possibly-buggy instructions to slow possibly-buggy instructions is some lazy attempt at avoiding bugs, they should stop doing it.


> It is not Intel's responsibility to ensure that there is no obscure edge case in AMD's microarchitecture that may cause the MKL algorithms to give incorrect results on AMD silicon. MKL's design is based in part on Intel knowing the internal details of their own silicon.

So then why let MKL run on AMD CPUs at all?


Can you indicate how often the issues you detected were bugs in the Intel CPUs?


As I recall (and I am pulling from stale memory), we generally concluded that Intel had the more strictly correct behavior. Our numerical algorithms were designed in Julia and we had some pretty rigorous formal models for what they should be producing.

I had a similar experience compiling complex C++ code bases with both GCC and Clang. There were interesting edge cases that cropped up where the compilers apparently had a fundamental disagreement on how to interpret a bit of code. Every time we did a deep dive to understand the discrepancy, we came to the conclusion that Clang was probably correct, strictly speaking. This still caused us problems and we ended up designing workarounds to support GCC.


The MKL is, and always has been, a marketing tool for Intel. They develop it to make their CPUs look good. Why should they help the other team?

This really isn't news to anyone who cares about linear algebra library performance; it's just that it's passed from general consciousness in the decade since AMD was last relevant. For performance portability to non-Intel CPUs, OpenBLAS and its ilk have always been the way to go.

You might think it's a sharp practice, in which case, you're free to vote with your wallet.


At the very least, because they're legally obliged to. The anti-monopoly laws prohibit companies in Intel's position from doing what they're doing (and they also agreed not to in the settlement of the AMD lawsuit).


> At very least, because they're legally obliged to.

Are you sure that law also applies to Intel MKL? Intel MKL is proprietary software by Intel, and its system requirements clearly state that it is for Intel 64 and IA-32 only. Do you mean that all proprietary software made by a hardware manufacturer should support competitors' hardware as well?


That's false. Intel's MKL library has been repeatedly marketed as a tool for all processors, including non-Intel ones. Combined with this fact, Intel's choice to purposefully remove performance enhancements for AMD processors is very clearly in violation of antitrust law.


Except it's not against antitrust law. As long as they document that it may not run well on non-Intel processors, they are in the clear, at least in the US.

Is it scummy? Sure, but that was the ruling when this came up for their compiler.


Was that part of the settlement made public? I got subpoenaed by both sides (compiler vendor) but no one told me much about the settlement other than it was a small amount of money to Intel that was a big windfall for AMD.


No, they're really not. Here's the ruling, read it for yourself https://ftc.gov/sites/default/files/documents/cases/101102in...


Thanks for the correction; it was a long time ago and that was how I recalled it, off the top of my head.


Unfortunately the settlement only required Intel to document their deoptimization practices, not change them.


OpenBLAS still didn't have adequate avx512 performance the last time I tried it (a few months ago).

I think BLIS is the best library for performance portability.

MKL offers a lot more than just BLAS and LAPACK. Would be cool for a project like BLIS to expand its scope.


The real “big caveat” here is to make sure that this actually improves your real workload in real life with real inputs. Zen and Zen+ have a lot of edge cases, for example their pdep/pext support is just crazy slow and how slow it is depends on the input values (it gets slower if more bits are set, which is bananas). But this also goes for genuine intel hardware. MKL is not always optimal. Even -march=native does not always produce the best code. Your program may be faster with the latest instructions disabled.


If you go to the MKL product page, it specifically says it speeds up performance on non-Intel processors as well. That they would state this and still sabotage the performance of AMD processors by directly disabling performance enhancements for them shows me that this was a calculated decision that was willfully made.


Or it shows that the mkl developers are aware of some obscene edge case that kills the performance of a major customer’s application. This article tested ONE thing.


It's certainly not an "obscene edge case" - MKL cripples AMD on every single matrix operation tested here: https://github.com/flame/blis/blob/master/docs/graphs/large/...



