At a previous company I worked at, we had an issue with our software (Windows-based, written in a proprietary language) randomly crashing. After some debugging, we found that this happened whenever the user performed certain specific actions, but only if, in that session, the user had previously printed something or opened a file picker. The culprit was either a printer driver or a shell extension which, when loaded, changed the floating point control word to trap. That happened whenever the culprit DLL had been compiled by a specific compiler, which had the offending code in the startup routine it linked into every DLL it produced.
Our solution was the inverse of the one presented in this article: instead of wrapping our routines to temporarily set the floating point control word to sane values, we wrapped the calls to either printing or the file picker, and reset the floating point control word to its previous (and sane) value after these calls.
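For anyone curious what that wrapping looks like in practice, here's a minimal sketch using MSVC's _controlfp_s; ShowFilePicker is a placeholder for whatever call (printing, GetOpenFileName, etc.) ends up loading the offending DLL:

```cpp
#include <float.h>   // _controlfp_s, _MCW_EM / _MCW_PC / _MCW_RC (MSVC CRT)

void ShowFilePicker();   // placeholder for the real print / file-picker call

void ShowFilePickerSafely() {
    // Remember the current floating point control word (mask 0 = read only).
    unsigned int saved = 0;
    _controlfp_s(&saved, 0, 0);

    // This call may load a printer driver or shell extension whose DLL
    // startup code unmasks floating point exceptions.
    ShowFilePicker();

    // Restore the exception mask, precision and rounding fields we saved.
    unsigned int ignored = 0;
    _controlfp_s(&ignored, saved, _MCW_EM | _MCW_PC | _MCW_RC);
}
```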
Had to deal with this same issue when I had a program supporting plugins: DLLs compiled with Delphi would turn on all the floating point traps. Took a while to track down what was causing FP faults in comctl32.dll. It got so bad that I had to put in a popup dialog that would name and shame the offending DLL so the authors would fix their broken plugins. It's an ABI violation in Windows since the ABI specifically defines FPU exceptions as masked, so this was more egregious than just turning on FTZ/DAZ (which Intel-compiled DLLs did).
Many of these same DLLs would also hijack SetUnhandledExceptionFilter() for their custom exception support, which would also result in hard fastfail crashes when they failed to unhook properly. Ended up having to hotpatch SetUnhandledExceptionFilter() Detours-style to prevent my crash reporting filter from being overridden. Years later, Microsoft revealed that Office had done the same thing for the same reasons.
The new version of this problem is DLLs that use AVX instructions and then don't execute a VZEROALL/VZEROUPPER instruction before returning. This is more sinister as it doesn't cause a failure, it just causes SSE2 code to run up to four times slower in the thread.
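Compilers normally emit vzeroupper for you when you build with AVX enabled; where this bites is hand-written assembly or code built in unusual ways. A rough sketch of what the DLL should be doing before returning (hypothetical function, compiled with AVX support):

```cpp
#include <immintrin.h>

// Sums eight doubles with 256-bit AVX, then clears the upper YMM halves so
// the caller's SSE2 code doesn't hit the AVX->SSE transition penalty.
double sum8(const double* p) {
    __m256d lo = _mm256_loadu_pd(p);
    __m256d hi = _mm256_loadu_pd(p + 4);
    __m256d s  = _mm256_add_pd(lo, hi);

    double tmp[4];
    _mm256_storeu_pd(tmp, s);
    double result = tmp[0] + tmp[1] + tmp[2] + tmp[3];

    _mm256_zeroupper();   // the equivalent of a trailing vzeroupper
    return result;
}
```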
Yep, I've encountered floating point flag incompatibilities when dynamically loading Borland-compiled libraries into Visual Studio compiled applications, as well as when using C++ code via Java Native Interface.
It is nice that diverse vendor-specific calling conventions and ABIs are less common these days.
You could also get an issue with x87/MMX where floating point code wouldn't work if you wrote some MMX code and didn't do an `emms` instruction afterward.
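For reference, with MMX intrinsics that cleanup is the _mm_empty() intrinsic (the `emms` instruction); a minimal sketch:

```cpp
#include <mmintrin.h>

// Adds two pairs of 32-bit ints in an MMX register. Because MMX aliases the
// x87 register stack, _mm_empty() (emms) must run before any later x87
// floating point code, or that code sees a "full" register stack.
int add_low(int a, int b, int c, int d) {
    __m64 x = _mm_set_pi32(a, b);
    __m64 y = _mm_set_pi32(c, d);
    __m64 s = _mm_add_pi32(x, y);
    int lo  = _mm_cvtsi64_si32(s);   // extract the low 32-bit lane
    _mm_empty();                     // emms: give the registers back to x87
    return lo;
}
```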
This is basically the reason compiler autovectorization doesn't do MMX.
Direct3D used to flip the x87 FPU to single precision mode by default. This produced some amazing bugs when your other C libraries reasonably assumed that a double would be at least 64 bits. (The FPU mode settings affected the thread that called Direct3D, and most programs used to be single-threaded.)
It seems they changed this behavior in Direct3D 10:
I stumbled into this bug in a rather spectacular manner.
I was making a game using D3D, Lua and Chipmunk physics, and some of the game's behaviour was odd.
So I started printing random stuff with Lua, and eventually I just tried print(5+5). To my surprise, my console output "11".
I went into Lua's irc channel to talk about this, and everyone said I was nuts, that the number was too small to trigger precision issues, that I was a troll and so on.
After a lot of searching I found out about this D3D bug, so I switched the game to use OpenGL instead, and there it was: 5+5 = 10 again!
Now why fiddling with the FPU could make 5+5 become 11, I have no idea.
If it ends up as 10+epsilon after the loss of precision and for some reason the fp round mode is set to FE_UPWARD ... And some part of the Lua stack recognizes that 5+5 is an integer-yielding expression and casts it as one for the display...
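You can watch the global rounding mode change the result of the exact same expression with a few lines of <cfenv> (strictly you also need `#pragma STDC FENV_ACCESS ON` for the compiler to promise not to reorder around the mode change, but as a sketch):

```cpp
#include <cfenv>
#include <cstdio>

int main() {
    // volatile keeps the division from being folded at compile time
    volatile double one = 1.0, three = 3.0;

    std::fesetround(FE_TONEAREST);
    double a = one / three;

    std::fesetround(FE_UPWARD);
    double b = one / three;          // rounded up: one ULP larger than a

    std::fesetround(FE_TONEAREST);   // restore the default before leaving

    std::printf("%.17g\n%.17g\n", a, b);
}
```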
Had this exact same problem. It was a specific color inkjet driver doing this, my guess is to enable dithering or something similar. It’s one of those things that infects everything in the code base because the way you print with GDI is to progressively draw parts of the page - so you have to call in and out of code that talks to the printer DC. We also had to render one item using Direct3D retained mode and that added to the fp control word complexity. Things seemed to be more robust on NT based OSes.
I've heard so many stories akin to this one that I just shake my head. It's a self-inflicted wound that people who prioritize performance above other considerations keep inflicting on everyone else.
I hope we learned our lessons on this specific question in the design of Wasm. There are subnormals in Wasm and you can't turn them off for performance.
I think the problem is libraries implicitly affecting code outside the library. This time it was related to optimization of floating point operations; next time it will be something else. Why bother having lexically scoped languages if the real behavior is dynamic? Debugging this kind of error is very hard.
Agreed. Side-effects to global state is generally bad. It would have been not as bad to introduce the FTZ mode in a way that wasn't global state, but alas, the performance mode itself was the original sin. There is apparently zero overhead for subnormals on PPC and very little on arm. It's always been Intel pushing this crap because of their FPU designs' shortcomings.
The Julia package ecosystem has a lot of safeguards against silent incorrect behavior like this. For example, if you try to add a package binary build which would use fast math flags, it will throw an error and tell you to repent:
In user codes you can do `@fastmath`, but it's at the semantic level so it will change `sin` to `sin_fast` but not recurse down into other people's functions, because at that point you're just asking for trouble. There's also calls to rename it `@unsafemath` in Julia, just to make it explicit. In summary, "Fastmath" is overused and many times people actually want other optimizations (automatic FMA), and people really need to stop throwing global changes around willy-nilly, and programming languages need to force people to avoid such global issues both semantically and within its package ecosystems norms.
But if what you want is automatic FMA, then why carry along every other possible behavior with it? Just because you want FMA, suddenly NaNs are turned into Infs, subnormal numbers go to zero, handling of sin(x) at small values is inaccurate, etc? To me that's painting numerical handling in way too broad of strokes. FMA also only increases numerical accuracy, it doesn't decrease numerical accuracy, so bundling it with unsafe transformations makes one uncertain now whether it has improved or decreased accuracy.
For reference, to handle this well we use MuladdMacro.jl which is a semantic transformation that turns x*y+z into muladd expressions, and it does not recurse into functions so it does not change the definitions of the callers inside of the macro scope.
This is something that will always increase performance and accuracy (performance because muladd in Julia is an FMA that is only applied if hardware FMA exists, effectively never resorting to a software FMA emulation) because it's targeted to do only a transformation that has that property.
"FMA also only increases numerical accuracy, it doesn't decrease numerical accuracy". This strictly speaking isn't true. For example, xx-yy will sometimes be more accurate than fma(x,x, -y*y) (e.g. if x==y). That said, I agree that fma is different from the rest in that it will on average, usually, increase accuracy.
This isn't really as valid a comparison as you might think it is. The results of operations varying is not the problem with 'fast-math', the problem is that it can negatively impact accuracy in catastrophic ways (among other things).
Sure, automatic FMA can change the result, but to my knowledge it always gives a more accurate result, not a less accurate one, and the way in which the results may differ is bounded.
fma can give less accurate results, and can give very large differences. For example, 1.5*floatmax(Float64) - floatmax(Float64) is Inf, while the fma version will give 0.5*floatmax(Float64).
The problem here is that enabling FTZ/DAZ flags involves modifying global (technically thread-local) state that is relatively expensive to do. Ideally, you'd want to twiddle these flags only for code that wants to work in this mode, but given the relative expense of this operation, it's not entirely practicable to auto-add twiddling to every function call, and doing it manually is somewhat challenging because compilers tend to support accessing the floating-point status rather poorly. Also, FTZ/DAZ aren't IEEE 754, so there's no portable function for twiddling these bits as there is for other rounding mode or exception controls. I will note that icc's -fp-model=fast and MSVC's /fp:fast correctly do not link code with crtfastmath.
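For code that genuinely wants FTZ/DAZ, the scoped version looks roughly like this (x86-specific SSE intrinsics; the helper name is mine):

```cpp
#include <xmmintrin.h>   // _mm_getcsr/_mm_setcsr, _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

// Runs f() with FTZ and DAZ enabled, then restores the caller's MXCSR.
// Ignores exception safety for brevity.
template <class F>
auto with_ftz_daz(F&& f) {
    const unsigned int saved = _mm_getcsr();
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    auto result = f();
    _mm_setcsr(saved);   // put the thread's MXCSR back exactly as we found it
    return result;
}
```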
As a side note, this kind of thing is why I think a good title for fast-math would be "Fast math, or how I learned to start worrying and hate floating point."
I don't think flipping these flags is expensive. Can you provide a source for that? AFAICT modern microarchitectures are going to register-rename that into the u-ops issued to the functional units, rather than flush the entire ROB.
Keep in mind that twiddling these flags is going to require saving the MXCSR register to memory, or'ing or and'ing bits in memory, and then reading that memory back into MXCSR. And both saving and reading the MXCSR requires stalls, because floating point operations both read and write that register. So you require, minimum, 4 L1 cache hits and 2 partial pipeline flushes to twiddle a MXCSR bit.
(As far as I'm aware, modern microarchitectures generally don't register-rename the floating-point status register.)
Note that you wouldn't necessarily need to do a read-modify-write -- it'd suffice in most cases to just save the old value and then reset the whole MXCSR for the scope requiring special treatment.
Also worth noting that it's not the entire MXCSR that needs to be renamed, but just a handful of status bits, so the logic is likely even cheaper than renaming a GPR.
Global state is the root of so many evils! FPU rounding mode, FPU flush-to-zero mode, C locale, errno, and probably some other things should all be eliminated. The functionality should still exist but not as global flags.
OTOH MXCSR being thread-local means that automatically linking crtfastmath.o makes even less sense, as it only affects the initial thread that loaded the object, so you still need to set FTZ/DAZ manually if you really want them.
And implicit locales should just die. It's sad that even newer functionality like std::format relies on them. Sadder that if it didn't then you'd probably have compiler developers pulling shit like GCC/libstdc++ does for std::from_chars [0] which defeats the entire point of that function but hey, at least they can mark that as implemented. Not like there are suitably licensed implementations of the functionality [1] available that they could use instead if they don't want to implement float parsing themselves.
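For anyone who hasn't used it, the whole appeal of std::from_chars is exactly that locale independence: it parses the "C"-locale format no matter what the process's global locale is, e.g. (sketch, hypothetical helper name):

```cpp
#include <charconv>
#include <cstdio>
#include <string_view>
#include <system_error>

// Parses a double from text; the result does not depend on the global locale.
double parse_double(std::string_view s) {
    double value = 0.0;
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
    (void)ptr;
    return ec == std::errc{} ? value : 0.0;
}

int main() {
    std::printf("%g\n", parse_double("3.14159"));   // 3.14159 even under a comma-decimal locale
}
```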
Denormalized numbers are one reason why you really want to think carefully before you try to optimize code by rewriting expressions involving multiplication and division.
For example, if you got "x = (a / b) * (c / d)" one might think that rewriting it as "x = (a * c) / (b * d)" will save you a division and gain you speed. It will and it might, respectively.
However it will also potentially break an otherwise safe operation. If the numbers are very small, but still normal, then the product (b * d) might result in a denormalized number, and dividing by it will result in +/- infinity.
However, the code might guarantee that the ratios (a / b) and (c / d) are not too small or too large, so that multiplying them is guaranteed to lead to a useful result.
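A concrete instance with made-up numbers: each ratio is a perfectly ordinary value, but in the rewritten form b*d underflows all the way to zero, so the division blows up (and with FTZ/DAZ enabled you get 0/0 = NaN instead):

```cpp
#include <cstdio>

int main() {
    double a = 1e-160, b = 1e-162, c = 1e-160, d = 1e-162;

    double safe      = (a / b) * (c / d);   // 100 * 100 = 10000
    double rewritten = (a * c) / (b * d);   // 1e-320 / 0.0 = inf, since b*d underflowed

    std::printf("safe      = %g\nrewritten = %g\n", safe, rewritten);
}
```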
Anyway, since there aren't any dependencies between a, b, c, and d, I would expect the two divisions to end up basically in parallel in the pipeline. So the critical path is a division and a multiplication either way. Of course that is just a guess.
That assumes you can do multiple divisions in parallel. Back in the good old days, a single division unit was the norm, and it still is on most microcontrollers (assuming they even have hardware floating-point division[1]).
Anyone have any references on the current state of affairs on modern AMD/Intel CPUs?
I ran Gentoo back in the good old days. The biggest draw was that after about a week of compiling my system ran a lot faster because of all the compiler optimisations one could enable because it only had to work on your CPU.
I might be misremembering, but I think fastmath was one of the flags explicitly warned against in the Gentoo manual.
ChromeOS is sort of the successor to Gentoo. The images are built with profile-guided, link-time, and post-link optimization, and they are targeted to the specific CPU in a given Chromebook. Every other Linux leaves a large amount of performance on the table by targeting a common denominator CPU that's 20 years old and not having PGO.
It's not a successor, it's a derivative. And yes, if you're only targeting specific known hardware then you can and probably should optimize for it, but most distributions fully intend to be usable on very nearly any x86(_64) hardware, so they can't do that.
It's also a bit less relevant when everything is so fast. I used Gentoo on a cheap-for-the-time Pentium 133MHz. Gentoo was basically the difference between a modestly pleasant system and an unusably slow system if I tried to run a standard still-compiled-for-386 distro on it.
I've long since stopped worrying about it because on the systems I run, which are not top-of-the-line but aren't RPis either, it's not worth worrying about anymore for most programs. At most maybe you should target the one particular program you use that could use a boost.
Yeah, I don't know the breakdown between better hardware and better compiler optimizations (even in the default settings) and less differentiation between processors, but I've done some minor not-very-scientific tests of compiling packages with -O3/-march=native/-mtune=native and in my limited experience it wasn't particularly useful. Like, not just small benefits, but zero or below the noise floor benefits in my benchmarks. Obviously this is super dependent on your workload and maybe hardware; it's an area where if you care, you have to do your own testing.
Tune for native sometimes makes a difference but not always. Targeting a platform that is known to have AVX2, instead of detecting AVX2 at runtime and bouncing through the PLT, can make a large difference. PGO remains the largest opportunity.
PGO is not something Gentoo does (outside a few packages like Firefox that have it in their build system) because the profiling cannot be done automatically for arbitrary packages.
glibc also supports loading different libraries based on CPU capabilities and some binary distros make use of that.
Still, there are always (admittedly diminishing) returns if you can target the exact CPU you have - pretty sure no binary distro ever had variants for AMD's TBM instructions [0] but my Gentoo install made use of them (which was made absolutely clear when trying to run any of that on a Zen 2 machine lol).
Though compiling for the exact CPU was definitely not insignificant before the amd64 switch which gave a new baseline - many distros still targeted i586.
Wow, I am surprised that -ffast-math triggers a mode switch in the FPU, partly because of the library problem the author hit, but also because the documentation for clang at least[1] does not say it impacts the behaviour of denormals, and in fact has a separate mode switch for that, which is not explicitly called out as being implied by -ffast-math.
Does this only affect pypi, or should I now worry about shared libraries shipped with my distro as well? Debian is not crazy enough to ship shared libs compiled with -ffast-math, right? RIGHT?
All right fine, I give in. In current amd64 Debian unstable (main, contrib, and non-free), 48 of the 69,112 packages include a shared library built with -ffast-math, a total of 485 .so files.
Actually it wasn't so bad :) All the current debs add up to a mere 113GB, and the .so files are just 46GB once extracted. Quick work. Here's the list:
Various distributions have meddled with ffast-math over the years. At one stage early on, I think Clear Linux might have used it, which was part of the reason they kept demolishing distributions on benchmarks. I don't know any that still actively use it, but holy crap folks, please don't.
Ouch. Two flags that should reasonably stop this, and neither does. This feels a bit like the time I was told "No, -Wall does not in fact add all warnings".
-Wextra is also in Clang and is definitely not everything in either compiler.
-Weverything is everything but that is only really useful to discover new warning flags (in combination with lots of -Wno...) for which you are better off reading GCC/Clang's release notes and/or documentation.
Wait, that other Python package besides gevent that you linked: the corresponding issue/PR "fixed" it by adding that flag. So I assumed "ok, -fno-fast-math doesn't override -Ofast, but then -fno-unsafe-math-optimizations surely does if they're using it. No way they also just randomly added a flag and hoped for the best".
> Finally, I am legally obligated to inform you that I used GNU Parallel to scan all the wheels
FWIW, this is tangential to this awesome article, but if the author is here or anyone else who cares: that statement isn’t true, you are not legally obligated to mention GNU Parallel. It is nice to do though! The link even says this explicitly, and also separately mentions the citation notice is only asking for scientific paper citations, not blog citations. I love parallel, but boy has that citation notice thing caused a lot of confusion over the years.
The citation notice thing in parallel is not a contractual obligation either; the link explains that it is not required (nor allowed) by any version of the GPL.
Following the article’s links, I fail to find an actual example of anything failing to converge in flush-subnormals mode. I mean, I’m sure one could be squeezed out, but the justification given amounts to “Sterbenz’s lemma [the one that rephrases “catastrophic cancellation” as “exact differences”] fails, maybe something somewhere also will”. And my (shallow but not nonexistent) experience with numerical analysis is that proofs lump subnormals with underflow, and most of them don’t survive even intermediate underflows.
(AFAIU the original Intel justification for pushing subnormals into 754 was gradual underflow, i.e. to give people at least something to look at for debugging when they've run out of precision.)
So, yes, it’s not exactly polite to fiddle with floating-point flag bits that are not yours, and it’s better that this not happen for reproducibility if nothing else, but I doubt it actually breaks any interesting numerics.
I haven't examined the code of scipy.stats.skellam.sf so I can't say for sure that it's not converging, but it's clearly some kind of pathological behavior.
So somebody tried to calculate, for integer arguments from 0 to 99 inclusive, the CDF of the difference of two Poisson variables with means 4e-6 and 1e-6? I... don’t know if it is at all reasonable to expect an answer to that question. As in, genuinely don’t know—obviously it’s an utterly rotten thing to compute, but at the same time maybe somebody got really interested in that and figured out a way to make it work.
Anyhow, my spelunking was cut off by sleep, so the best I can tell that would end up in the CDFLIB[1] routine CUMCHN with X = 8e-6, PNONC = 2e-6, DF from 0 to 99. The insides don’t really look like the kind of magic that is held up by Sterbenz’s lemma and strategically arranged to take advantage of gradual underflow, so at first glance I wouldn’t trust anything subnormal-dependent that it would compute, but maybe it still is? Sleep.
Yeah, unfortunately I have no idea if that was their original goal (which seems unlikely?) or if this is just a minimal example they came up with after tripping over the actual problem in a more realistic setting.
I think it suffices to show that the behavior of FTZ/DAZ caused an actual problem for someone, though. I agree that the vast majority of numerical code won't care about FTZ/DAZ, but when it's enabled thread-wide you have no idea what kind of code you'll end up affecting.
My last bug report I wrote a small C++ program to put all the values between 0x000 .. 0xfff into a tree structure and then iterate over the tree printing out the values.
I’d have loved if the library author replied with “why don’t you just print out the values directly?”
My point wasn’t that the program in the bug report did not do anything meaningful (duh). My point was that most numerical algorithms, once they hit intermediate underflow in the form of subnormals for a set of inputs, do not proceed to compute a result that is in any way related to their stated purpose (with the notable exception of string conversions). Unless a hard limit on the number of iterations is imposed, most iterative algorithms can also be made to hang by giving them unfortunate inputs. In both cases it is the responsibility of the numerical analyst invoking them to not provide such inputs.
So it seems likely (if not certain) that the bug report in question says, essentially, that a function inside scipy, given certain arguments, returns one sequence of meaningless bits quickly when gevent is not loaded and a different sequence of meaningless bits very slowly when it is. While that is certainly not nice, and I’d accept it as a bug in gevent, it still does not count as a piece of code succeeding at a reasonable task with subnormals but failing without them.
(I do not have any ideological opposition to such code—it would certainly be interesting to see some!)
-Ofast isn't a good name for the option, but in GCC's defense the manual is pretty clear about all this, and there's no excuse for blindly turning on compiler options - they literally change the semantics of your code.
There's strict standards compliance and then there's the crazy grab bag of code changes that is `-ffast-math`. Further, I'd say gevent can defensibly say that -ffast-math is okay for them given what the manual says:
-ffast-math:
> ... it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
This is 100% on the compiler people. For the option name, the documentation, and the behavior.
Well, how would you improve the docs? Both documentation entries seem reasonable to me.
That said, I don't see why the -Ofast option even needs to exist, except backwards compatibility, as -ffast-math and the others can (and should IMO) be specified explicitly.
The fact that -ffast-math makes no mention that it will poison any other code executing in your process space is a huge missing piece of info. The way the docs are written, it sounds like anyone not doing scientific math should use that flag, but the reality is that most people have some code somewhere in their process that expects fairly sane floating point math behavior, even if it's just displaying progress bars or something.
> The fact that -ffast-math makes no mention that it will poison any other code executing in your process space
Untrue. The doc entry for -ffast-math says "can result in incorrect output for *programs* that depend on an exact implementation of IEEE or ISO rules/specifications for math functions". Emphasis mine.
So they clearly say that the entire program can turn invalid when -ffast-math is used.
You and some other people here act like the docs say "translation unit" or something like that, instead of "program", but this is simply not the case.
Furthermore, the entry for -ffast-math points to entries for suboptions that -ffast-math turns on (located right below in the man page), e.g. -funsafe-math-optimization. These also make clear how dangerous they can be even when turned on one at a time.
Consider the documentation for the similar compiler flag in the OpenCL specification:
> -cl-unsafe-math-optimizations
> Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid, (b) may violate IEEE 754 standard and (c) may violate the OpenCL numerical compliance requirements as defined in section 7.4 for single-precision floating-point, section 9.3.9 for double-precision floating-point, and edge case behavior in section 7.5. This option includes the -cl-no-signed-zeros and -cl-mad-enable options.
While it stops short of saying "this will likely break your code" (maybe because it doesn't have the nonlocal effects of -ffast-math), it makes it much more clear that this flag is generally unsafe and fragile, except under rather specific circumstances. Also, it is reasonably exact about what those circumstances are. I'm not sure -ffast-math is documented with enough precision for a programmer to even know whether it will break their code. Best you can do is try and see if the program still works.
The relevant GCC man page entries are even more clear than the OpenCL spec excerpt.
-ffast-math:
> This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions.
It also points to the -funsafe-math-optimizations sub-option, where it is said that:
> Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimizations. [...]
What's missing is that it also affects linking, and results in this strange action-at-a-distance. Maybe disabling the linker part with -shared would be a reasonable compromise.
You're wrong, both the doc entry for -Ofast and the one for -ffast-math say that they can result in incorrect programs. Programs are produced by linking, so I don't see what other way to interpret this is possible.
Why not simply replace all FP math with a constant zero? That’d be really fast and an equally valid strict interpretation of “can result in incorrect programs.”
Just because you’re shallowly dismissing my comment doesn’t make it wrong.
Linking in code with undefined (in this case, redefined) behavior doesn't automatically invalidate the entire program. But that's the language used because once the undefined behavior is hit at runtime, the spec no longer defines what the behavior is and what the program will do afterwards.
> It's hard to come up with a similar name that isn't long.
The suggestion given elsewhere in these comments to call it "unsafe math" instead of "fast math" sounds good. It's nearly as short, and properly conveys the "you must know what you're doing" aspect of these flags. It's even better if you're used to Rust.
I agree. I think -ffast-math should actually be called -finexact-math. One would also hope that explicitly disabling an option on the command line would, you know, explicitly disable the option, but maybe that's too much to ask.
I don't think it should exist at all. It's such a crazy grab bag of code changes disguised as "optimizations" that it's completely impossible to reason about, even for folks that "don't care" about the exact floating point arithmetic.
It has global effects like those in TFA, and even locally you no longer know if a line or two of arithmetic will become more precise (e.g., by using higher precision intermediate results), less precise, or become complete gibberish (e.g., because it thinks it can prove you're now dividing by zero and thus can just return whatever it wants).
If you are willing to accept the various caveats that come with -ffast-math for your own library, then it looks like it's okay to use -ffast-math during compilation, but not during linking (because linking pulls in crtfastmath.o, as the author points out).
Your own compiled library functions will use the optimizations, but won't force the weird FPU register modes on the rest of the process.
Pay close attention to your build system though. Many will pass all compiler flags to the linker driver as that is needed to make sure that they affect LTO.
That thread-local MXCSR register is particularly entertaining in a thread pool environment, such as OpenMP. OSes carefully preserve that piece of thread state across context switches.
I tend to avoid touching that value, even when it means extra instructions like roundpd for specific rounding mode, or shuffles to avoid division by 0 in the unused lanes.
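For example, instead of switching MXCSR's rounding field and switching it back, SSE4.1's roundpd takes the rounding mode as an immediate; a minimal sketch:

```cpp
#include <smmintrin.h>   // SSE4.1

// Round both lanes toward negative infinity without touching MXCSR.
__m128d round_down(__m128d x) {
    return _mm_round_pd(x, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}
```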
I've measured it on a real workload twice with LDC and each time "@fastMath" made things 1% slower. Of course this varies from backend to backend, but such woes would be avoided if the flag were named something other than "fast". It doesn't seem worth using.
I got a good speed up in [insert work project here] but that was more of an experiment since it isn't really performance constrained in the first place.
I was going to suggest another package that just resets the MXCSR when imported, but I guess... hypothetically... some function might actually want the FTZ behavior.
If you want that behavior, you should explicitly enable it/disable it at the borders of the region where you want that behavior, rather than screwing over everybody for your own benefit.
I thought the purpose of Python was to make development simple and predictable. Needing to track down the compilation and linker flags of every single shared library reveals the fallacy of this abstraction.
If a language wishes to reap the rewards of a pre-existing ecosystem, it must pay for the warts and misfeatures of that ecosystem. Python is deeply dependent on C libraries to achieve acceptable performance, and this is the price.
What is the alternative here? To provide a python.so file with all possible binary Python packages statically linked into it? You'd need to update it every hour to include all the bugfixes in every native library yanked in! To recompile Python itself every time you install a package? Even with a compiler cache you'd have the Gentoo experience of waiting for ages every time you try to use the package manager.
Dynamic linking solves a real problem, especially in this space. It comes with new problems of its own but so does the alternative.
The content of this post has nothing to do with the specifics of dynamic linking: it would be just as true if the wheels in question had static binaries instead.
It's fair to point out that shared objects surface the problem here, but I don't know if I would lay the blame with them: the underlying problem is that a FPU control register isn't (and can't be) meaningfully isolated. Python needs to use shared objects for loadable extensions, but the contaminating code might be statically linked into that shared object.
(I don't say this because I want to excuse dynamic linking, which I also generally dislike! Only that I think the problem is somewhere else in this particular case.)