I feel it would have been acceptable to say that if you had a mutant x86 CPU which set the AES-NI bit in CPUID and then had a non-compliant implementation, that was on you.
How specific to Intel CPUs is MKL-DNN (since renamed oneDNN)? When I looked at it, the CPU code seemed fairly generic SIMD.
Second, there are already many ways for programs to get granular CPU feature information. It's entirely the software's fault if it decides to ignore that information and instead use a big "if" switch keyed solely on the manufacturer.
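For example, on Linux a program can read the advertised feature bits straight from /proc/cpuinfo; a minimal sketch (the dispatch itself is hypothetical):

# Key the dispatch on the feature flags the CPU actually advertises,
# not on the vendor string.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

if "avx2" in cpu_flags():
    print("dispatch to the AVX2 code path")  # any vendor that sets the bit
else:
    print("fall back to SSE2")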
Technically, it could, using virtualisation extensions to x86. CPUID can be trapped by a hypervisor, and the hypervisor can supply whatever result it likes. This is how Qemu/KVM simulates different CPU models.
Mainstream OSes don’t use virtualisation extensions for regular processes of course, and hypercalls would be even more expensive than regular syscall based context switches, so this is probably not something that is going to change in the short to medium term.
So wouldn't comparing against the latest Ice Lake CPU running Intel MKL, instead of the Xeon W-2175, yield an even larger performance gap for workloads taking advantage of the new instruction subsets?
i.e. AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES.
I ran into some compile issues, but I care more about numpy, so I gave this script a try: https://markus-beuckelmann.de/blog/boosting-numpy-blas.html Here are some quick-and-dirty results on my Ryzen 3700X (8C/16T) workstation (BLIS doesn't seem to perform very well on this small test). You can see performance basically double with MKL when MKL_DEBUG_CPU_TYPE=5 is set. I'll probably continue to stick with OpenBLAS for now; results below, with a rough sketch of the benchmark after them:
## blis 0.6.0 h516909a_0
# conda activate numpy-blis
# conda run python bench.py
Dotted two 4096x4096 matrices in 2.30 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.94 s.
Cholesky decomposition of a 2048x2048 matrix in 0.24 s.
Eigendecomposition of a 2048x2048 matrix in 6.36 s.
## libopenblas 0.3.7 h5ec1e0e_4
# conda activate numpy-openblas
# conda run python bench.py
Dotted two 4096x4096 matrices in 0.41 s.
Dotted two vectors of length 524288 in 0.02 ms.
SVD of a 2048x1024 matrix in 0.55 s.
Cholesky decomposition of a 2048x2048 matrix in 0.14 s.
Eigendecomposition of a 2048x2048 matrix in 5.53 s.
## mkl 2019.4 243
# conda activate numpy-mkl
# conda run python bench.py
Dotted two 4096x4096 matrices in 1.53 s.
Dotted two vectors of length 524288 in 0.02 ms.
SVD of a 2048x1024 matrix in 0.51 s.
Cholesky decomposition of a 2048x2048 matrix in 0.29 s.
Eigendecomposition of a 2048x2048 matrix in 4.79 s.
# export MKL_DEBUG_CPU_TYPE=5
# conda run python bench.py
Dotted two 4096x4096 matrices in 0.33 s.
Dotted two vectors of length 524288 in 0.02 ms.
SVD of a 2048x1024 matrix in 0.29 s.
Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
Eigendecomposition of a 2048x2048 matrix in 3.29 s.
## AMD BLIS 2.0 https://developer.amd.com/amd-aocl/blas-library/
# conda activate numpy-amdblis
# conda run python bench.py
Dotted two 4096x4096 matrices in 2.33 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.78 s.
Cholesky decomposition of a 2048x2048 matrix in 0.23 s.
Eigendecomposition of a 2048x2048 matrix in 5.84 s.
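For anyone who wants to reproduce this, here's a rough sketch of the kind of timings taken above (my own reconstruction, not the exact bench.py from the linked post):

import time
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4096, 4096))
B = rng.standard_normal((4096, 4096))

t0 = time.perf_counter()
A @ B
print(f"Dotted two 4096x4096 matrices in {time.perf_counter() - t0:.2f} s.")

# Build a symmetric positive-definite matrix so Cholesky is well defined.
M = rng.standard_normal((2048, 2048))
S = M @ M.T + 2048 * np.eye(2048)
t0 = time.perf_counter()
np.linalg.cholesky(S)
print(f"Cholesky decomposition of a 2048x2048 matrix in {time.perf_counter() - t0:.2f} s.")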
I'm sure there are theoretical and practical limits, but it's probably completely dependent on the implementation and each system architecture; that's beyond my pay grade. Maybe start here: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...
conda create -c conda-forge -n numpy-blis numpy "blas=*=blis"
That's basically all there is to it, but there are docs you can reference from the top of a search results page.
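One quick sanity check: numpy itself can tell you which BLAS a given environment actually linked:

import numpy as np
np.show_config()  # prints the BLAS/LAPACK build configuration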
If an experiment involves system packages it's a little trickier. You can learn about Docker (which does for the whole system what venv does for python packages), but in most cases you may get away with just taking note of what packages you have installed/uninstalled, so you can revert the process later if needed.
If yes, then there is apparently active engagement by Intel (therefore involving people and money), and I don't see anything wrong if they optimize their code for their CPUs and then ignore other manufacturers (AMD). Most probably they just don't have a reason to improve their competition's performance; besides, trying to do that anyway (without full insight into the "foreign" CPU architectures) might generate even worse performance, which in turn would backfire (e.g. "aaahh, Intel is sabotaging AMD CPUs").
So, personally I would have no problem if code using iMKL ("Intel MKL") ran faster on Intel CPUs by default while being conservative on other CPUs. But if I owned an AMD CPU, I would expect AMD to implement similar optimizations for their own CPUs (aMKL?).
Can a "CPU feature" (e.g. sse2, avx, aes, etc...) be implemented the same way between different architectures?
And the call/use of these features is the same for AMD or Intel
I don't think this is quite true.
While their compiler team does care a lot about benchmarks, ultimately they sell compilers. Some users won't use their compilers if they are too bad at running code on AMD's CPUs: oftentimes 95th-percentile performance is more important than the median or mean.
There are ways to patch ICC-produced binaries to disable this GenuineIntel check.
It's not about "saving face" for anyone, it's about the possibility of bug reports that need to be investigated, and possibly new code paths need to be added if a fix is necessary.
> and then suddenly more than halved the price of everything the moment there was competition.
This would have happened regardless of how much or little their products improved in the prior years. The price is what enough people are willing to pay, and that goes down if there's an alternative.
> It is also possible that the resulting code path has some precision loss or other problems on AMD hardware. I have not tested for that!
Intel doesn't test it either, which is why they disable it.
You could argue that they should test and enable on AMD CPUs, but this isn't clearly scummy behavior (to me), it's not a zero cost decision they made solely to screw over AMD. Buying the hardware and testing it has a cost. It may not even be possible to test it to the same degree they test their own hardware. There may be modes, subtle differences between architecture versions, etc. that are hard to know about to test for.
They don't need to test to see if it works on AMD. The processor advertises it. In their own words:
> It is important to understand that a new instruction is supported on a particular processor only if the corresponding CPUID feature flag is set. Applications must not assume support of any instruction set extension simply based on, for example, checking a CPU model or family and must instead always check for _all_ the feature CPUID bits of the instructions being used.
Indeed, my 3900X has the AVX2 flag set. So why would it not get used?
I don't think the precision loss argument really holds water. If you can't trust their AVX2 implementation, why can you trust any other instruction on the CPU to not have similar problems?
Unless I am misunderstanding something, it really looks like they're not following their own advice and artificially limiting the use of AVX2 to only Intel processors.
edit: Also, we really have no reason to trust Intel's honesty. Don't forget:
The docs you quoted describe the necessary conditions for using an instruction without crashes or illegal-instruction faults. That doesn't cover the kinds of more subtle differences that exist.
For example, I just made a plot of |1/x - rcpps(x)|: https://snipboard.io/Rei74t.jpg
Although an instruction that computes an approximation can of course vary across different micro-architectures and vendors, there has to be some definition of what the baseline is. Without it, the instruction may as well output junk.
Indeed, I just checked and Intel's "Intel 64 and IA-32 Architectures Software Developer’s Manual" contains a bound on the error of the approximation in rcpps:
The relative error for this approximation is:
|Relative Error| ≤ 1.5 * 2^-12
So I don't think this is really the issue. I don't think Intel MKL defines what the error is for many operations, so it is presumably not defined, and therefore it'd be hard for AMD's implementation to run afoul.
It's also not going to be reproducibility/consistency, since they cover that separately: https://software.intel.com/en-us/articles/consistency-of-flo...
Of course an issue stemming from something like this is rare/unlikely. But Intel also can't assume that the consequences for a related error are small.
It seems like most people commenting irately about this in this thread are hobbyists or are otherwise working on low consequence software. Yeah, for these uses, it would probably make sense for intel to just look at the CPUID and leave it at that. But they can't know that, and they aren't optimizing solely for these use cases.
But I suspect Intel does care, in the sense that they certainly care about how they perform and benchmark against AMD.
Let’s put it another way. What happens if Intel enables AVX2 for AMD and it doesn’t work?
1. Precision is not as good as intel?
1.a. Many developers care and it impacts AMD negatively. (Probably nothing happens since nothing is actually broken per se.)
1.b. Few developers care and it suggests Intel is not optimizing for the best performance trade-off.
2. Programs using MKL crash on AMD processors.
2.a. There is a bug in MKL: it probably gets fixed with little fanfare.
2.b. There is a bug on AMD processors, and it probably gets fixed in microcode.
3. AMD processors produce wrong results. It seems like there can only be one outcome here, and it’s probably worse for AMD than Intel. The Intel FDIV bug proves that people take these things very seriously. At best it could be fixed in microcode, and at absolute worst AVX2 could be disabled entirely in an update.
The best argument for why Intel would still rationally avoid AVX2 on AMD outside of the above is to ensure their customer’s code runs correctly on AMD processors. However, there’s still some problems:
- They have to contend with their own processors, too. If they write code they know depends on non-public details of the architecture, they could very well break themselves.
- What is the definition of “break”? Intel surely has test suites for their software, but just because the software’s precision may be worse (still pure conjecture) when running under AMD does not suggest it is broken. All we can say is that users may rely on the behavior of precision on their specific configuration, but the problem is that’s not specific to AMD. Intel alludes to the fact that precision can differ amongst Intel processors in several places in MKL’s documentation. Nothing stops someone from improperly relying on this.
With Intel playing fast and loose with specifying an architecture and then writing software that disregards it, I’m not sure how defensible a position they’re sitting in from pretty much any perspective.
> “otherwise working on low consequence software”
No offense but it is hard to respectfully respond to something like this, even though I know it wasn't directed at me personally. I hope you don’t think about coworkers with this mindset.
The "safe" thing to do is rather than work through all of these cases you've considered, is to just give up on it.
> No offense but it is hard to respectfully respond to something like this, even though I know it wasn't directed at me personally. I hope you don’t think about coworkers with this mindset.
I didn't intend it to be disrespectful. It's not a value judgement. Maybe these aren't the best terms, but I've worked on both what I would call "low consequence" software, and "high consequence" software. The tolerance for decision making about things like swapping out hardware is different. In some cases, this can be done with little hesitation. In other cases, it would cost millions of dollars to test the system to the point where it could be trusted.
Maybe the solution to the internet pitchfork wielders is for intel to simply stop trying to serve both of these sets of users with the same software. But I suspect they simply don't care about this problem enough to do anything.
They've been doing it since long before MKL.
That being said, IEEE floats are carefully and consistently defined, and are perfectly predictable. The unpredictability you claim is not due to stochastic errors or faulty implementations by CPU vendors; those behaviors are part of the IEEE definition and are deterministic.
In practice, using these kinds of instructions (which are not specified by IEEE) can give massive performance advantages.
So your logic is: given that IEEE-754 1/x and SSE rcpps yield different results on a single Intel CPU, then... AMD's AVX implementation cannot be the same as Intel's AVX implementation and therefore, Intel is perfectly correct in disabling AVX on AMD?
1. Some instructions definitely have implementation defined behavior that will vary across platforms. rcpps is just one example to illustrate the point.
2. Therefore, if you test something on an intel CPU, it might not behave the same on an AMD CPU (and vice versa). This doesn't mean AMD's implementation is bad. (For all you know, it might work on intel because you are accidentally exploiting a bug, and that bug won't exist on AMD, and break your software.)
3. Therefore, if you don't test something using these on an AMD CPU, it could be risky to enable it.
As an aside, I don't necessarily think this logic is the only factor that should go into this decision and I'm not sure it's the right one for intel to make. But there is some logic to it that isn't simply intel screwing over AMD, which is the only point I've been trying to make here.
I think what you can fairly say about intel here is they are being lazy and overly risk averse, but not anti-competitive (in this instance!).
Reciprocal approximation has been used in the overwhelming majority of floating-point division units for a long time, even before the SSE instruction came about, so I'm pretty dubious that either Intel or AMD is changing their answers. Perhaps you could prove it.
Over the complete range of floats, AMD is more precise on average, 0.000078 versus 0.000095 relative error. However, Intel has 0.000300 maximum relative error, AMD 0.000315. Good news is both are well within the spec. The documentation says “maximum relative error for this approximation is less than 1.5*2^-12”, in human language that would be 3.6621E-4.
Source code: https://gist.github.com/Const-me/a6d36f70a3a77de00c61cf4f6c1...
He's wrong, because 1) this discussion stems from his false claim that Intel is justified in disabling AMD's AVX/AVX2 implementation because it fails to meet the spec, and 2) his plot doesn't show any variance between AMD and Intel.
The point of my plot was to show that the behavior of instructions like this is super weird and random, it's not something that we should expect to be consistent across vendors. It's probably not consistent across all of intel's CPUs either, but they're probably aware of and test for all of these.
Since the LHC@Home project simulated protons circulating in the LHC, the small differences added up over the typically large number of time steps.
IIRC their solution was to pay a small price and use software implementations of those instructions, but I can't find the reference right now.
If they're getting different results on different CPUs which both implement the spec correctly, they need to fix their code anyways.
edit: They do run sims with slightly different initial conditions to get the physics rather than simulation artifacts. The issue here is that for a given set of initial conditions, results computed on different machines would not agree.
I understand that they don't agree when run on different CPUs. But they both work correctly, because IEEE floats and operations are defined up to a certain level of precision by specification.
What I'm saying is, if they obtain different results across CPUs within their accuracy target for results, their code is simply broken and they need to change their code to either use a numerically more stable algorithm or higher-precision floats. No scientific software should rely on unspecified numerical "bugs" that fall outside of specs and can change in the future even with the same vendor.
You also have to remember that there are all sorts of people at CERN, from undergrads learning to code on the go to software engineers with no physics background. Just because a piece of code made it into a CERN repository at some point doesn't mean it's a gold standard or the word of the gods (which appears to be your premise), or that CPU vendors are to blame for any problems.
The entire point here though is that implementations are allowed to produce different results, and that code that needs to produce the exact same result using different implementations need to take this into account.
The application (SixTrack) had, AFAIK, only been used in compute clusters before. When they started using it in LHC@Home, running on a large variety of user hardware, this issue was exposed.
> What I'm saying is, if they obtain different results across CPUs within their accuracy target for results, their code is simply broken and they need to change their code to either use a numerically stable algorithm or higher-precision floats.
Right, that's what they discovered and that's what their solution was as I mentioned in my original post.
... is broken. "The same result" is a complicated question when floating point computations are involved. Even in an IEEE 754 environment, the compiler can easily cause non-bit-exact differences over time on the same chip.
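A toy example (mine, not from the thread): even fully IEEE-compliant float32 arithmetic is not associative, so merely reordering a sum changes the bits of the result on the very same CPU:

import numpy as np

x = np.float32(1e8)
y = np.float32(1.0)
print((x + y) - x)  # 0.0: the 1.0 is absorbed by rounding in x + y
print((x - x) + y)  # 1.0: same operands, different order, different result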
But the reality is that a ton of scientific and high performance code does it, and simply declaring it broken and not dealing with it will not do you any favors if you are trying to work with such systems.
It's similar to how Microsoft has to maintain decades of bugs because a ton of software depends on them and would break if they were fixed. Yeah, Microsoft would be "correct" to fix them and declare the code buggy. But that would piss off a lot of people who don't care about this argument and would probably lose Microsoft money. It's the same situation here.
There is basically nothing which guarantees that the numerical errors that are within the error tolerance of the spec won't change in new iterations of Intel CPUs either.
If the correctness of your calculation depends on values that are unspecified within the spec, you need to change the code to a more stable algorithm or use higher-precision floats anyway.
You're absolutely right. The difference is that if a customer of a library like MKL comes to intel and reports a problem like this, intel is going to help troubleshoot and fix the problem, and they're going to have a lot of internal technical documentation to help understand and fix the differences.
> The differences that you're seeing fall within the spec. ... This is a flimsy argument, the FUD that Intel has been propagating for years now, and I'm not sure why you keep on pushing the Intel FUD so hard again and again.
The problem is that it's really hard to know if code depends only on the requirement in the spec, or depends on more specific behavior present in the machine(s) you tested on. Call it FUD if you want. A few people in this thread have reported spending significant effort on troubleshooting implementation defined behavior. Yeah, code that depends on this is bad, but bad code exists and sometimes people with the bags of money care about it.
As a physicist, I can tell you for sure that no, this is not why Intel disables it. Intel had quite a success by purposely crippling icc/ifort and MKL on AMD, and created the GenuineIntel check as a legal barrier to prevent AMD (and AMD users) from coming up with a workaround, which allowed Intel to essentially kill the competition in HPC.
It is a well known fact that Intel has been actively working to cripple AMD's performance across the board. They invented GenuineIntel exclusively for that purpose, which was a part of the bigger picture filled with false advertisements, bribes (the most famous Intel bribe cases involved Dell), smear campaigns, lawsuits, so on and so forth. Intel has a long history of playing dirty against competitors.
I have never actually seen Intel comment on this; I have just personally experienced headaches porting numerical code between two different architectures (not AMD <-> Intel, though), where on the surface the architectures appear to be the same and straightforward to swap between, but in practice they are not.
This doesn't mean one of the architectures had "problems" and the other did not. It's not about one architecture being inferior to the other; they're simply different.
In any case, I never saw any real reason that warrants disabling an entire instruction set, such as AVX. Which is why you don't see such artificial crippling in open-source implementations of LAPACK/BLAS/sundials/etc., and people (including me) have been using the same Fortran code for many decades across many architectures.
And in case this is what you're trying to imply, no, they don't really give different numerical results on different CPUs.
> Which is why you don't see such artificial crippling in open source implementations of LAPACK/BLAS/sundials/etc
Are they as fast as MKL? If so, just use them?
If not, why not? Maybe the reason is you can do better if you optimize for specific CPUs, with different latencies of various instructions?
> Aside from correctness differences, some optimization strategies that make things faster on one processor make them slower on another. This happens even within different generations of x86 hardware.
This is the one notably relevant part and, yeah, that's fine. Follow the CPUID features. Nobody expects it to be absolutely optimal on AMD. But let it use the code that was optimized for Intel chips with the same features.
The differences you are quoting come from differences between libraries (sin, exp, etc. will give different results depending on the libm implementation; that's normal and has nothing to do with CPU instructions!), not from the implementation of IEEE instructions (assuming you're talking about IEEE floats; otherwise you shouldn't expect them to behave the same in the first place!), though.
> Are they as fast as MKL? If so, just use them?
I (and a lot of other people) do use them, when I have a choice. Sometimes they are faster, sometimes they aren't. When there is a significant disparity, however, it is usually because of GenuineIntel checks.
> If not, why not?
Because scientific software geared toward applications is usually closed-source proprietary, or too complicated to modify to add new alternative backends (remember that users aren't interested in becoming software engineers on top of their own jobs as researchers), so you don't get to choose.
So you are saying that your open-source BLAS/LAPACK shows performance differences (and worse performance compared to MKL) because of "something Intel". Seems like a lot of people here (including the ones unable to compile numpy against another BLAS) are a little short on actual experience with, and knowledge of, the problem...
"scientific software geared toward applications is usually closed-source proprietary or too complicated to be modified"
If it's geared towards applications, it's usually opaque engineering stuff and the results of people claiming to do science with this software are mediocre at best...
In my domain (quantum chemistry) nearly all software is delivered as a source distribution, because modifications of methods are part of science...
I have met only one computational physicist so far doing DFT who used to write his own code back in the 70s, but he admits he has no idea what VASP and the others are doing nowadays.
If you're claiming that you actually know how VASP or Quantum Espresso (or any other similarly significant piece of software) works in depth and that you can tweak/replace any part as you like (which I'd find very, very hard to believe; millions of work hours have gone into their development), you'd nevertheless be the exception in chemistry, not the norm.
The most common high-level tools theoretical physicists like me use (such as Mathematica) don't give access to source code, on the other hand, so we can't make Mathematica not use MKL and not suck on AMD.
Why does that matter? The bulk of the issues come from implementation defined behavior, of which there is plenty within x86 itself to cause issues.
In general, the IEEE-compliant parts of x86 are also IEEE-compliant on other processors, at least the ones I've dealt with. It's the operations that aren't specified by IEEE that cause problems.
Edge supposedly has the string "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 Edg/44.18362.449.0".
Intel is being reasonable and in a sense prudent by only supporting hardware they understand very well. It isn't like they are sending the lawyers out to sue anyone writing high performance code on AMD chips - that would be unreasonable.
I think the software part of Intel should not sabotage the competition; we don't like it when Google devs sabotage Firefox, intentionally or out of laziness. As a software dev you test for the feature: if it is present you use it, and if for some reason the hardware is broken you implement workarounds if your customers demand it, or otherwise let AMD fix the broken feature. As a developer myself, the least I could do is add a flag to turn the crippling on/off but leave the feature on by default; then users can decide whether precision or thermals are worse on AMD and enable crippled mode.
Certain things should carry the expectation that they are compatible, not the other way around. Innocent until proven guilty! I would say an instruction set (the sole purpose of which is to avoid scenarios like the one you're describing) is one of those things.
It is not Intel's responsibility to ensure that there is no obscure edge case in AMD's microarchitecture that may cause the MKL algorithms to give incorrect results on AMD silicon. MKL's design is based in part on Intel knowing the internal details of their own silicon.
This is just conjecture unless there’s a citation to support it.
Besides, the microarchitecture design is bound to change dramatically over time just as a matter of course. Relying on things they don’t document publicly is probably not a wise strategy. How would the teams coordinate not breaking each other? I can imagine such a position would be incredibly unpopular amongst those working on silicon designs (and maybe not popular amongst those working on software, either).
Also, nobody has really presented a reasoning that justifies this behavior, since if it broke on AMD hardware, it still wouldn’t be Intel’s problem, and therefore perhaps they should just use the instruction sets as intended and documented instead.
I'm not an Intel fanboi but casually dismissing this specific issue is just ignorance. Intel's algorithms, for better and worse, are very aware of the limitations of their silicon implementations.
edit: In retrospect this comes off as unnecessarily snarky. Let me clarify:
- I am just asking for a specific citation about this issue.
- Reproducibility is one case where details about microarchitectures matter, but it is not actually the default behavior of MKL, and the documentation makes it clear that its limitations are largely unrelated to the normal operation of MKL; for example: https://software.intel.com/en-us/mkl-windows-developer-guide...
> Dispatching optimized code paths based on the capabilities of the processor on which the code is running is central to the optimization approach used by Intel MKL. So it is natural that consistent results require some performance trade-offs. If limited to a particular code path, performance of Intel MKL can in some circumstances degrade by more than a half. To understand this, note that matrix-multiply performance nearly doubled with the introduction of new processors supporting Intel AVX2 instructions. Even if the code branch is not restricted, performance can degrade by 10-20% because the new functionality restricts algorithms to maintain the order of operations.
This bit here is pretty important because it suggests that strict CNR mode can disable AVX2 codepaths even on Intel. It isn't doing so on AMD specifically for reproducibility; it just won't load AVX2 codepaths on non-Intel processors, period.
MKL purposefully crippling AMD's performance does nothing for reproducibility between the microarchitectures, because to get proper reproducibility you have to cripple the performance on both AMD and Intel. And of course that's not what Intel's doing by default.
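For what it's worth, opting into CNR with an MKL-backed numpy looks roughly like this (MKL_CBWR is MKL's documented CNR control; it has to be set before MKL loads):

import os
# Conditional Numerical Reproducibility: pin MKL to a single code path.
# "COMPATIBLE" is the slowest but most portable setting.
os.environ["MKL_CBWR"] = "AVX2"
import numpy as np  # an MKL-backed numpy picks this up when MKL initializes

And per the quote above, requesting the AVX2 branch on a non-Intel CPU wouldn't help anyway, since MKL won't load it there.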
So then why let MKL run on AMD CPUs at all?
I had a similar experience compiling complex C++ code bases with both GCC and Clang. There were interesting edge cases where the compilers apparently had a fundamental disagreement on how to interpret a bit of code. Every time we did a deep dive to understand the discrepancy, we came to the conclusion that Clang was probably correct, strictly speaking. This still caused us problems, and we ended up designing workarounds to support GCC.
This really isn't news to anyone who cares about linear algebra library performance; it's just that it has passed from general consciousness in the decade since AMD was last relevant. For performance portability to non-Intel CPUs, OpenBLAS and its ilk have always been the way to go.
You might think it's a sharp practice, in which case, you're free to vote with your wallet.
Are you sure that law also applies to Intel MKL? Intel MKL is proprietary software by Intel, and its system requirements clearly state that it supports Intel 64 and IA-32 only. Do you mean that all proprietary software made by a hardware manufacturer should support competitors' hardware as well?
Is it scummy? Sure, but that was the ruling when this came up for their compiler.
I think BLIS is the best library for performance portability.
MKL offers a lot more than just BLAS and LAPACK. Would be cool for a project like BLIS to expand its scope.