std::clamp generates less efficient assembly than std::min(max, std::max(min, v)) (1f6042.blogspot.com)
174 points by x1f604 on Jan 16, 2024 | 142 comments



Depending on the order of the arguments to min/max you'll get an extra move instruction [1]:

std::min(max, std::max(min, v));

        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
std::min(std::max(v, min), max);

        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2
For min/max on x86 if any operand is NaN the instruction copies the second operand into the first. So the compiler can't reorder the second case to look like the first (to leave the result in xmm0 for the return value).

The reason for this NaN behavior is that minsd is implemented to look like `(a < b) ? a : b`, where if any of a or b is NaN the condition is false, and the expression evaluates to b.

Possibly std::clamp has the comparisons ordered like the second case?

[1]: https://godbolt.org/z/coes8Gdhz


I think the libstdc++ implementation does indeed have the comparisons ordered in the way that you describe. I stepped into the std::clamp() call in gdb and got this:

    ┌─/usr/include/c++/12/bits/stl_algo.h──────────────────────────────────────────────────────────────────────────────────────
    │     3617     *  @pre `_Tp` is LessThanComparable and `(__hi < __lo)` is false.
    │     3618     */
    │     3619    template<typename _Tp>
    │     3620      constexpr const _Tp&
    │     3621      clamp(const _Tp& __val, const _Tp& __lo, const _Tp& __hi)
    │     3622      {
    │     3623        __glibcxx_assert(!(__hi < __lo));
    │  >  3624        return std::min(std::max(__val, __lo), __hi);
    │     3625      }
    │     3626


Thanks for sharing. I don't know whether the C++ standard mandates one behavior or the other; it really depends on how you want clamp to behave if the value is NaN. std::clamp returns NaN, while the reverse order returns the min value.
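
A minimal demo of that difference (function names are mine; std::min and std::max return their first argument whenever the comparison involves a NaN, because that comparison is false):

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    // libstdc++ ordering: NaN in -> NaN out
    double clamp_std_order(double v, double lo, double hi) {
        return std::min(std::max(v, lo), hi);
    }

    // reversed ordering from the article: NaN in -> lo out
    double clamp_reversed(double v, double lo, double hi) {
        return std::min(hi, std::max(lo, v));
    }

    int main() {
        double nan = std::nan("");
        std::printf("%f\n", clamp_std_order(nan, 0.0, 1.0)); // nan
        std::printf("%f\n", clamp_reversed(nan, 0.0, 1.0));  // 0.000000
    }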


From §25.8.9 Bounded value [alg.clamp]:

> 2 Preconditions: `bool(comp(proj(hi), proj(lo)))` is false. For the first form, type `T` meets the Cpp17LessThanComparable requirements (Table 26).

> 3 Returns: `lo` if `bool(comp(proj(v), proj(lo)))` is true, `hi` if `bool(comp(proj(hi), proj(v)))` is true, otherwise `v`.

> 4 [Note: If NaN is avoided, `T` can be a floating-point type. — end note]

From Table 26:

> `<` is a strict weak ordering relation (25.8)


Does that mean NaN is undefined behavior for clamp?


My interpretation is that yes, passing NaN is undefined behavior. Strict weak ordering is defined in 25.8 Sorting and related operations [alg.sorting]:

> 4 The term strict refers to the requirement of an irreflexive relation (`!comp(x, x)` for all `x`), and the term weak to requirements that are not as strong as those for a total ordering, but stronger than those for a partial ordering. If we define `equiv(a, b)` as `!comp(a, b) && !comp(b, a)`, then the requirements are that `comp` and `equiv` both be transitive relations:

> 4.1 `comp(a, b) && comp(b, c)` implies `comp(a, c)`

> 4.2 `equiv(a, b) && equiv(b, c)` implies `equiv(a, c)`

NaN breaks these relations, because `equiv(42.0, NaN)` and `equiv(NaN, 3.14)` are both true, which would imply `equiv(42.0, 3.14)` is also true. But clearly that's not true, so floating point numbers do not satisfy the strict weak ordering requirement.

The standard doesn't explicitly say that NaN is undefined behavior. But it does not define the behavior for when NaN is used with `std::clamp()`, which I think by definition means it's undefined behavior.
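
A quick check of the equiv argument above (my own throwaway snippet):

    #include <cmath>
    #include <cstdio>

    // equiv as defined in [alg.sorting], specialized to operator< on double
    bool equiv(double a, double b) { return !(a < b) && !(b < a); }

    int main() {
        double nan = std::nan("");
        std::printf("%d %d %d\n",
                    equiv(42.0, nan),    // 1: every comparison with NaN is false
                    equiv(nan, 3.14),    // 1
                    equiv(42.0, 3.14));  // 0: transitivity of equiv is violated
    }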


Based on my reading of cppreference, std::clamp(-0.0f, +0.0f, +0.0f) is required to return negative zero: when v compares equal to both lo and hi, the function must return v itself. The official std::clamp does this, but my incorrect clamp doesn't.
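
A quick way to observe this (assuming std::signbit to tell the two zeros apart):

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    int main() {
        // v compares equal to lo and hi, so clamp must return v, sign bit and all
        float r = std::clamp(-0.0f, +0.0f, +0.0f);
        std::printf("%d\n", std::signbit(r)); // 1 for a conforming std::clamp
    }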


Sir cmovq, you have earned your username.


Yes, I arrived at the same conclusion.

The various code snippets in the article don't compute the same "function". The order between the min() and max() matters even when done "by hand". This is apparent when min is greater than max as the results differ in the choice of the boundaries.

Funny that for such simple functions the discussion can become quickly so difficult/interesting.

Some toying around with the various implementations in C [1]:

[1]: https://godbolt.org/z/d4Tcdojx3
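
A toy example of the boundary choice (values are mine): with a deliberately inverted interval, the nesting order decides which bound wins:

    #include <algorithm>
    #include <cstdio>

    int main() {
        int v = 5, lo = 10, hi = 0;             // deliberately lo > hi
        int a = std::min(std::max(v, lo), hi);  // 0: the outer min caps at hi
        int b = std::max(std::min(v, hi), lo);  // 10: the outer max floors at lo
        std::printf("%d %d\n", a, b);
    }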


Yes, you are correct, the faster clamp is incorrect because it does not return v when v is equal to lo and hi.


This seems to be the most likely reason. See also:

https://godbolt.org/z/q7e3MrE66


I did a double take on this because I wrote a blog post about this topic a few months ago and came to a very different conclusion: the results are effectively identical on clang, and gcc is just weird.

Then I realized that I was writing about compiling for ARM and this post is about x86. Which is extra weird! Why is the compiler better tuned for ARM than x86 in this case?

Never did figure out what gcc's problem was.

https://godbolt.org/z/Y75qnTGdr


Try switching to -Ofast; it produces different ASM


-Ofast is one of those dangerous flags that you should probably be careful with. It is “contagious” and it can mess up code elsewhere in the program, because it changes processor flags.

I would try a more specific flag like -ffinite-math-only.


finite-math-only is a footgun as well, as it allows the compiler to assume that NaNs do not exist. That means all `isnan()` calls are just reduced to `false`, so it's difficult to program defensively. And if a NaN does in fact occur, it's naturally a one-way ticket to UB land.
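
A sketch of the defensive-programming problem (my example; assumes gcc/clang semantics for -ffinite-math-only):

    #include <cmath>

    double safe_reciprocal(double x) {
        // Under -ffinite-math-only the compiler may assume x is never NaN,
        // so this whole check can be folded to false and compiled away.
        if (std::isnan(x)) return 0.0;
        return 1.0 / x;
    }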


If that’s a footgun, then -Ofast is an autocannon.

I like to think that the flag should be renamed “-Ofuck-my-shit-up”.


As 1 of ∞ examples of UB land, I once had to debug JS objects being misinterpreted as numbers when https://duktape.org/ was miscompiled with a fast-math equivalent (references to objects were encoded as NaNs.)


IIRC the changing global flags "feature" was removed recently from GCC and now you have to separately ask for it.


Is that changing of global processor flags a x86 feature or does it hold for arm as well?


https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/...

So, yes, when targeting VFP math. NEON already always works in this mode, though.


On gcc 13, the difference in assembly between the min(max()) version and std::clamp is eliminated when I add the -ffast-math flag. I suspect that the two implementations handle one of the arguments being NaN a bit differently.

https://gcc.godbolt.org/z/fGaP6roe9

I see the same behavior on clang 17 as well

https://gcc.godbolt.org/z/6jvnoxWhb


You (celegans25) probably know this but here is a PSA that -ffast-math is really -finaccurate-math. The knowledgeable developer will know when to use it (almost never) while the naive user will have bugs.


What you really should enable is the fun and safe math optimizations, with -funsafe-math-optimizations.


The problem is this causes the compiler to correctly solve your recreational math problems, which isn't actually as much fun as solving them yourself!


I know almost nothing about compiler flags but I got a laugh out of this even though I still don't know if you're joking or not. Edit: Just read it again and now I understand the joke. Haha


To others: `-f` is a common prefix for GCC flags. You can think of it as "enable feature". So -funsafe-math-optimizations should be read as (-f)(unsafe-math-optimizations), not (-)(funsafe-math-optimizations).


I kind of like the idea the flag is sarcastically calling them very fun and very safe.


don't forget libiberty which is linked in using -liberty (and freedom for all)


BTW that library was named by John Gilmore, who’s a pretty hardcore libertarian.


strangely I'm not aware of a libibre.


oh, that's called -lmojito


Why do you say almost never? Don’t let the name scare you; all floating point math is inaccurate. Fast math is only slightly less accurate, I think typically it’s a 1 or maybe 2 LSB difference. At least in CUDA it is, and I think many (most?) people & situations can tolerate 22 bits of mantissa compared to 23, and many (most?) people/situations aren’t paying attention to inf/nan/exception issues at all.

I deal with a lot of floating point professionally day to day, and I use fast math all the time, since the tradeoff for higher performance and the relatively small loss of accuracy are acceptable. Maybe the biggest issue I run into is lack of denorms with CUDA fast-math, and it’s pretty rare for me to care about numbers smaller than 10^-38. Heck, I’d say I can tolerate 8 or 16 bits of mantissa most of the time, and fast-math floats are way more accurate than that. And we know a lot of neural network training these days can tolerate less than 8 bits of mantissa.


Here are some of the problems with fast-math:

* It links in an object file that enables denormal flushing globally, so that it affects all libraries linked into your application, even if said library explicitly doesn't want fast-math. This is seriously one of the most user-hostile things a compiler can do.

* The results of your program will vary depending on the exact make of your compiler and other random attributes of your compile environment, which can wreak havoc if you have code that absolutely wants bit-identical results. This doesn't matter for everybody, but there are some domains where this can be a non-starter (e.g., multiplayer game code).

* Fast-math precludes you from being able to use NaN or infinities, and often even from being able to defensively test for NaN or infinity. Sure, there are times where this is useful, but an option you might prefer to suggest for an uninformed programmer would be a "floating-point code can't overflow" option rather than "infinity doesn't exist and it's UB if it does exist".

* Fast-math can cause hard range guarantees to fail (see the sketch below). Maybe you've got code where you can prove that, even with rounding error, the result will still be >= 0. With fast-math, the code might be adjusted so that the result is instead, say, -1e-10. And if you pass that to a function with a hard domain error at 0 (like sqrt), you now go from the result being 0 to the result being NaN. And see above about what happens when you get NaN.

Fast-math is a tradeoff, and if you're willing to accept the tradeoff it offers, it's a fine option to use. But most programmers don't even know what the tradeoffs are, and the failure mode can be absolutely catastrophic. It's definitely an option that is in the "you must be this knowledgeable to use" camp.
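
To make the range-guarantee point concrete, a sketch (numbers and names invented) of the sqrt hazard described in the list above:

    #include <cmath>

    // In exact arithmetic disc >= 0 whenever real roots exist; fast-math
    // re-association can nudge a provably-nonnegative disc to e.g. -1e-10,
    // and sqrt(-1e-10) is NaN -- which fast-math then treats as nonexistent.
    double larger_root(double a, double b, double c) {
        double disc = b * b - 4.0 * a * c;
        return (-b + std::sqrt(disc)) / (2.0 * a);
    }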


> The results of your program will vary depending on the exact make of your compiler and other random attributes of your compile environment, which can wreak havoc if you have code that absolutely wants bit-identical results. This doesn't matter for everybody, but there are some domains where this can be a non-starter (e.g., multiplayer game code).

This already shouldn't be assumed, because even the same code, compiler, and flags can produce different floating point results on different CPU targets. With the world increasingly split over x86_64 and aarch64, with more to come, it would be unwise to assume they produce the same exact numbers.

Often this comes down to acceptable implementation defined behavior, e.g. temporarily using an 80-bit floating register despite the result being coerced to 64 bits, or using an FMA instruction that loses less precision than separate multiply and add instructions.

Portable results should come from integers (even if used to simulate rationals and fixed point), not floats. I understand that's not easy with multiplayer games, but doing so with floats is simply impossible because of what is left as implementation-defined in our language standards.


This advice is out-of-date.

All CPU hardware nowadays conforms to IEEE 754 semantics for binary32 and binary64. (I think all the GPUs now have non-denormal-flushing modes, but my GPU knowledge is less deep). All compilers will have a floating-point mode that preserves IEEE 754 semantics assuming that FP exceptions are unobservable and rounding mode is the default, and this is usually the default (icc/icx is unusual in making fast-math the default).

Thus, you have portability of floating-point semantics, subject to caveats:

* The math library functions [1] are not the same between different implementations. If you want portability, you need to ensure that you're using the exact same math library on all platforms.

* NaN payloads are not consistent on different platforms, or necessarily within the same platform due to compiler optimizations. Note that not even IEEE 754 attempts to guarantee NaN payload stability.

* Long double is not the same type on different platforms. Don't use it. Seriously, don't.

* 32-bit x86 support for exact IEEE 754 equivalence is essentially a "known-WONTFIX" bug. (This is why the C standard implemented FLT_EVAL_METHOD). The x87 FPU evaluates everything in 80-bit precision, and while you can make this work for binary32 easily (double rounding isn't an issue), though with some performance cost (the solution involves reading/writing from memory after every operation), it's not so easy for binary64. However, the SSE registers do implement IEEE 754 exactly, and are present on every chip old enough to drink, so it's not really a problem anymore. There's a subsidiary issue that the x86-32 ABI requires floats be returned in x87 registers, which means you can't properly return an sNaN correctly, but sNaN and floating-point exceptions are firmly in the realm of nonportability anyways.

In short, if you don't need to care about 32-bit x86 support (or if you do care but can require SSE2 support), and you don't care about NaNs, and you bring your own libraries along, you can absolutely expect to have floating-point portability.

[1] It's actually not even all math library functions, just those that are like sin, pow, exp, etc., but specifically excluding things like sqrt. I'm still trying to come up with a good term to encompass these.


> It's actually not even all math library functions, just those that are like sin, pow, exp, etc., but specifically excluding things like sqrt. I'm still trying to come up with a good term to encompass these.

Transcendental functions. They're called that because computing an exactly rounded result might be unfeasible for some inputs. https://en.wikipedia.org/wiki/Table-maker%27s_dilemma So standards for numerical compute punt on the issue and allow for some error in the last digit.


Not all of the functions are transcendental--things like cbrt and rsqrt are in the list, and they're both algebraic.

(The main defining factor is if they're an IEEE 754 §5 operation or not, but IEEE 754 isn't a freely-available standard.)


While it is a difficult problem, it is not an infeasible problem nowadays, at least for trigonometric, logarithmic and exponential functions. (All possible arguments have been mapped to prove how many additional bits are needed for correct rounding.) Two-argument pow remains an unsolved problem to my knowledge, though.


My understanding is we have exhaustively enumerated the unary binary32 functions and proved the correctness of correct-rounding for them. For binary64, exhaustive enumeration is not a viable strategy, but we generally have a decent idea of what cases end up being hard-to-round, and in a few cases, we may have mechanical proofs of correctness.

There was a paper last year on binary64 pow (https://inria.hal.science/hal-04159652/document) which suggests that they have a correctly-rounded pow implementation, but I don't have enough technical knowledge to assess the validity of the claim.


For your information, binary64 has indeed been mapped exhaustively for several functions [1], so it is known that at most a triple-double representation is enough for correct rounding.

[1] https://inria.hal.science/inria-00072594/document

> There was a paper last year on binary64 pow (https://inria.hal.science/hal-04159652/document) which suggests that they have a correctly-rounded pow implementation, but I don't have enough technical knowledge to assess the validity of the claim.

Thank you for the pointer. These were written by the usual folks you'd expect on such papers (e.g. Paul Zimmermann), so I believe they did achieve a significant improvement. Unfortunately it is still not complete: the paper notes that the third and final phase may still fail, but it is unknown whether that case actually occurs. So we will have to wait...


Not sure if this is a spooky coincidence, but I happened to be reading the Rust 1.75.0 release notes today and fell into this 50-tab rabbit hole: https://github.com/rust-lang/rust/pull/113053/


> All CPU hardware nowadays conforms to IEEE 754 semantics for binary32 and binary64.

Is this out of date?

https://developer.arm.com/documentation/den0018/a/NEON-Instr...


> this is usually the default

No, it's not. gcc itself still defaults to fp-contract=fast. Or at least does in all versions I have ever tried.


> Often this comes down to acceptable implementation defined behavior,

I believe this is "always" rather than often when it comes to the actual operations defined by the FP standard. gcc does play it fast and loose (as -ffast-math is not yet enabled by default, and FMA on the other hand is), but this is technically illegal and at least can be easily configured to be in standards-compliant mode.

I think the bigger problem comes from what is _not_ documented by the standard. E.g. transcendental functions. A program calling plain old sqrt(x) can find itself behaving differently _even between different stepping of the same core_, not to mention that there are well-known differences between AMD vs Intel. This is all using the same binary.


I'm surprised by this, regarding sqrt. The standard stipulates correct rounding for simple arithmetic, including sqrt ever since 754 1985.

Unless of course we are talking about the 80 bit format.

If that's not the case, would be interested to know where they differ.

Unfortunately for the transcendental function the accuracy still hasn't been pinned down, especially since that's still an ongoing research problem.

There's been some great strides in figuring out the worst cases for binary floating point up to doubles so hopefully an upcoming standard will stipulate 0.5 ULP for transcendentals. But decimal floating point still has a long way to go.


Because compilers can and have implemented sqrt in terms of rsqrt, which is... fun to work with. This happens with SSE too.


I spent most of my career working with rsqrt haha. And had my fair share of non-754 architectures too!

Every 754 architecture (including SSE) I've worked on has an accurate sqrt().

I'm assuming you're talking about with "fast math" enabled? In which case all bets are off anyway!


No; compilers have done this even without fast-math. Gcc does not seem to do this anymore, but still does plenty of unsafe optimizations by default, like FMA.

Or maybe the library you use...


Argh, sounds really frustrating! It's hard enough to get accuracy when you can control operations never mind when the compiler is doing magic behind the scenes!

FMAs were difficult. The Visual Studio compiler in particular didn't support purposeful FMAs for SSE instructions so you had to rely on the compiler to recognise and replace multiply-additions. Generally I want FMAs because they're more accurate but I want to control where they go.


sqrt is a fundamental IEEE 754 operation, required to be correctly rounded, and many architectures implement a dedicated, correctly rounded sqrt instruction.

Now, there is also often an approximate rsqrt and approximate reciprocal, with varying degrees of accuracy, and that can be "fun."
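
For illustration, a common pattern with the approximate instruction (my sketch; rsqrtss gives roughly 12 bits of accuracy, so one Newton-Raphson step is typical):

    #include <xmmintrin.h>

    float fast_rsqrt(float x) {
        // ~12-bit estimate from the hardware approximation
        float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
        // one Newton-Raphson refinement: r' = r * (1.5 - 0.5 * x * r^2)
        return r * (1.5f - 0.5f * x * r * r);
    }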


Thank you, great points. I’d have to agree that disabling denorms globally is pretty bad, even if (or maybe especially if) caring about denorms is rare.

> Fast-math can cause hard range guarantees to fail. Maybe you’ve got code that you can prove that, even with rounding error, the result will still be >= 0.

Floats do this too, it’s pretty routine to bump into epsilon out-of-range issues without fast-math. Most people don’t prove things about their rounding error, and if they do, it’s easy for them to account for 3 ULPs of fast-math error compared to 1/2 ULP for the more accurate operations. Like, nobody who knows what they’re doing will call sqrt() on a number that is fresh out of a multiplier and might be anywhere near zero without testing for zero explicitly, right? I’m sure someone has done it, but I’ve never seen it, and it ranks high on the list of bad ideas even if you steer completely clear of fast-math, no?

I guess I just wanted to resist the unspecific parts of the FUD just a little bit. I like your list a lot because it’s specific. Fast-math does carry some additional risks for accuracy sensitive code, and clearly as you and others showed, can infect and impact your whole app, and it can sometimes lead to situations where things break that wouldn’t have happened otherwise. But I think in the grand scheme these situations are quite rare compared to how often people mess up regular floating point math. For a very wide swath of people doing casual arithmetic, fast-math is not likely to cause more problems than floats cause, but it’s fair to want to be careful and pay attention.


> I’d have to agree that disabling denorms globally is pretty bad,

and yet, for audio processing, this is an option that most DAWs either implement silently, or offer users the choice, because denormals are inevitable in reverb tails and on most Intel processors they slow things by orders of magnitude.


I would think for audio, there’s no audible difference between a denorm and a flushed zero. Are there cases where denorms are important to audio?


They are important in the negative sense: Intel processors are appalling at handling them, and they can break realtime code because of this.

My DAW uses both "denormals are zero" and "flush denormals to zero" to try to avoid them; it also offers a "DC Bias" option where extremely small values are added to samples to avoid denormals.
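
The DC-bias trick looks roughly like this (constants are illustrative, not from any particular DAW):

    // Add a constant far below audibility but large enough to keep decaying
    // reverb tails from ever reaching the subnormal range.
    void add_dc_bias(float* buf, int n) {
        const float bias = 1e-18f;  // float subnormals start near 1.2e-38
        for (int i = 0; i < n; ++i)
            buf[i] += bias;
    }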


For game development we had them off as well because of performance issues. Most stuff calculates around 0 so it was pretty common to trigger denorms.

The slowing down on Intel platforms has always frustrated me because denorms provide nice smoothing around 0.

At the same time it was nice only having to consider normal floating point when trying to get more accuracy out of calculations, etc.


The scary thing IMO is: your code might be fine with unsafe math optimisations, but maybe you're using a library which is written to do operations in a certain order to minimise numerical error, and unsafe math optimisations change the code in ways that are mathematically equivalent but result in many orders of magnitude more numerical error. It's probably fine most of the time, but it's kinda scary.


It shouldn’t be scary. Any library that is sensitive to order of operations will hopefully have a big fat warning on it. And it can be compiled separately with fast-math disabled. I don’t know of any such libraries off the top of my head, and it’s quite rare to find situations that result in orders of magnitude more error, though I grant you it can happen, and it can be contrived pretty easily.


You can't fully disable fast-math per-library, moreover a library compiled with fast-math might also introduce inaccuracies in a seemingly unrelated library or application code in the same executable. The reason is that fast-math enables some dynamic initialization of the library that changes the floating point environment in some ways.


> You can’t fully disable fast-math per library

Can you elaborate? What fast-math can sneak into a library that disabled fast-math at compile time?

> fast-math enables some dynamic initialization of the library that changes the floating point environment in some ways.

I wasn’t aware of this, I would love to see some documentation discussing exactly what happens, can you send a link?


> Can you elaborate? What fast-math can sneak into a library that disabled fast-math at compile time?

A lot of library code is in headers (especially in C++!). The code in headers is compiled by your compiler using your compile options.


Ah, of course, very good point. A header-only library doesn’t have separate compile options. This is a great reason for a float-sensitive library to not be header-only, right?


It's not just about being header-only, lots of libraries which aren't header-only still have code in headers. The library may choose to put certain functions in headers for performance reasons (to let compiler inline them), or, in C++, function templates and class templates generally have to be in headers.

But yeah, it's probably a good idea to not put code which breaks under -ffast-math in headers if possible.


https://github.com/llvm/llvm-project/issues/57589

Turn on fast-math, it flips the FTZ/DAZ bit for the entire application. Even if you turned it on for just a shared library!
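
A small probe for this (my example): the same arithmetic gives different answers depending on whether some earlier-loaded library flipped FTZ/DAZ:

    #include <cstdio>

    int main() {
        volatile double tiny = 1e-310;      // subnormal for binary64
        volatile double half = tiny * 0.5;  // flushed to 0.0 under FTZ/DAZ
        std::printf("%g\n", (double)half);  // ~5e-311 normally, 0 when flushed
    }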


That's only one small part of -ffast-math/-Ofast though and not a very scary one at that.


But it's an example of -ffast-math affecting separately compiled libraries.


you're gonna have to give us a concrete real-world example to convince most of us...


I don't typically thoroughly read through the documentation for all the dependencies which my dependencies are using.

But you're correct that it's probably usually fine in practice.


That’s fair. Ideally transitive dependencies should be completely hidden from you. Hopefully the author of the library you include directly has heeded the instructions of libraries they depend on.

Hey I grant and acknowledge that using fast-math carries a little risk of surprises, we don’t necessarily need to try to think of corner cases. I’m mostly pushing back a little because using floats at all carries almost as much risk. A lot of people seem to use floats without knowing how inaccurate floats are, and a lot of people aren’t doing precision analysis or handling the exceptional cases… and don’t really need to.


> A lot of people seem to use floats without knowing how inaccurate floats are,

Small nit, but floats aren't inaccurate; they have non-uniform precision. Some float operations can be inaccurate, but that's rather path dependent...

One problem with -ffast-math is that a) it sounds appealing and b) people don't understand floats, so lots of people turn it on without understanding what it does, and that can introduce subtle problems in code they didn't write.

Sometimes in computational code it makes sense e.g. to get rid of denorms, but a very small fraction of programmers understand this properly, or ever will.

I wish they had named it something scary sounding.


> Sometimes in computational code it makes sense e.g. to get rid of denorms, but a very small fraction of programmers understand this properly, or ever will.

"Some times" here being almost all the time. It is rare that your code will break without denormals if it doesn't already have precision problems with them.


I am talking about float operations, of course. And they’re all inaccurate, generally speaking, because they round. Fast math rounding error is not much larger than rounding error without fast math.


Nah, you don't deal with floats. You do machine learning which just happens to use floats. I do both numerical computing and machine learning. And oh boy are you wrong!

People who deal with actual numerical computing know that the statement "fast math is only slightly less accurate" is absurd. Fast math is unbounded in its inaccuracy! It can reorder your computations so that something that used to sum to 1 now sums to 0, it can cause catastrophic cancellation, etc.

Please stop giving people terrible advice on a topic you're totally unfamiliar with.


> It can reorder your computations so that something that used to sum to 1 now sums to 0, it can cause catastrophic cancellation, etc.

Yes, and it could very well be that the correct answer is actually 0 and not 1.

Unless you write your code to explicitly account for fp associativity effects, in which case you don't need generic forum advice about fast-math.


I only do numeric computation, I don’t work in machine learning. Sorry your assumptions are incorrect, maybe it’s best not to assume or attack. I didn’t exactly advise using fast math either, I asked for reasoning and pointed out that most casual uses of float aren’t highly sensitive to precision.

It’s easy to have wrong sums and catastrophic cancellation without fast math, and it’s relatively rare for fast math to cause those issues when an underlying issue didn’t already exist.

I’ve been working in some code that does a couple of quadratic solves and has high order intermediate terms, and I’ve tried using Kahan’s algorithm repeatedly to improve the precision of the discriminants, but it has never helped at all. On the other hand I’ve used a few other tricks that improve the precision enough that the fast math version is higher precision than the naive one without fast math. I get to have my cake and eat it too.

Fast math is a tradeoff. Of course it’s a good idea to know what it does and what the risks of using it are, but at least in terms of the accuracy of fast math in CUDA, it’s not an opinion whether the accuracy is relatively close to slow math, it’s reasonably well documented. You can see for yourself that most fast math ops are in the single digit ulps of rounding error. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....


+1. I'm years away from fp-analysis, but do the transcendental expansions even converge in the presence of fast-math? No `sin()`, no `cos()`, no `exp()`, ...


Well there are library implementations of fast-math trancendentals that offer bounded error, and a million different fast sine approximation algorithms, so, yes? This is why you shouldn’t listen to FUD. The corner cases are indeed frustrating for a few people, but most never hit them, and the world doesn’t suddenly break when fast math is enabled. I am paid to do some FP analysis, btw.


> Fast math is only slightly less accurate

'slightly'? Last I checked, -Ofast completely breaks std::isnan and std::isinf--they always return false.


Hopefully it was clear from the rest of my comment that I was talking about in-range floats there. I wouldn’t necessarily call inf & nan handling an accuracy issue, that’s more about exceptional cases, but to your point I would have to agree that losing std::isinf is kinda bad since divide by zero is probably near the very top of the list of things most people using floats casually might have to deal with.

Which compiler are you using where std::isinf breaks? Hopefully it was also clear that my experience leans toward CUDA, and I think the inf & nan support works there in the presence of NVCC’s fast-math.


My experience is with gcc and clang on x86. I generally agree with you regarding accuracy, which is why I was quite surprised when I first discovered that -Ofast breaks isnan/isinf.

Even if I don't care about the accuracy differences, I still need a way to check for invalid input data. The upshot is that I had to roll my own isnan and isinf to be able to use -Ofast (because it's actually the underlying __builtin_xxx intrinsics that are broken), which still seems wrong to me.


They are talking about -ffast-math, not -Ofast.


From the gcc manual:

-Ofast

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. It turns off -fsemantic-interposition.


If your code ventures into the domain where fast-math matters and you're not a mathematician trying to solve a Lyapunov-unstable problem with very tricky numeric methods, then most likely your code is already broken.


Another PSA is that dynamic libraries compiled with fast-math will also introduce inaccuracies in unrelated libraries in the same executable, as they introduce dynamic initialization that globally changes the floating point environment.


This would only affect code that uses the old-school x87 floating point instructions, though? The x87 FPU unit indeed has scary global state that can make your doubles behave like floats in secret and silence.

I would think practically all modern FPU code on x86-64 would be using the SIMD registers which have explicit widths.


So it was a bit more pervasive than this: the issue was that flushing subnormals (values very close to 0) to 0 is controlled by a register that gets set, so if a library built with the fast-math flags gets loaded, it sets the register, causing the whole process to flush its subnormals. i.e. https://github.com/llvm/llvm-project/issues/57589


> This would only affect code that uses the old-school x87 floating point instructions, though?

Actually, no, the x87 FPU instructions are the only ones that won't be affected.

It sets the FTZ/DAZ bits, which exist for SSE instructions but not x87 instructions.


You're mistaking something else for the rounding mode and subnormal handling flags.


Ehh, not so much inaccurate, more of a "floating point numbers are tricky, let's act like they aren't".

Compilers are pretty skittish about changing the order of floating point operations (for good reason) and ffast-math is the thing that lets them transform equations to try and generate faster code.

I.e., instead of doing "n / 10", doing "n * 0.1". The issue, of course, being that things like 0.1 can't be perfectly represented with floats, but 100 / 10 can be. So now you've introduced a tiny bit of error where it might not have existed.
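
A quick way to see this (my snippet): count how often dividing by 10 and multiplying by the nearest double to 0.1 disagree:

    #include <cstdio>

    int main() {
        int mismatches = 0;
        for (int i = 1; i <= 1000; ++i) {
            double n = i;
            if (n / 10.0 != n * 0.1)  // 0.1 is not exactly representable
                ++mismatches;
        }
        std::printf("%d of 1000 inputs differ\n", mismatches);
    }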


It isn't just that. -ffast-math also allows the compiler to ignore infinites. In fact for GCC with -ffast-math, isinf always returns false. Something similar happens for NaNs/isnan.


I lump this into "floating points are tricky". NaNs and inf are definitely legitimate floating point values. They are also things that a lot of applications will break on if they ever encounter them.


I've never understood why generating exceptions is preferable to just using higher precision.


Higher precision isn’t always available. IEEE 754 is an unusually well-thought-through standard (thanks to some smart people with a lot of painful experience) and is pretty good at justifying its decisions, some of which are surprising (far from obvious) to anyone not steeped in it.


The main problem with floats in general is they are designed primarily for scientific computing.

We are fortunately starting to see newer (well, not that new now) CPU instructions like FMA that make more accurate decimal representations not take such huge performance hits.


what does fma have to do with decimal numbers?


Oh shoot, nvm, I thought it was an optimization for integers.

Really it'll be the SIMD-style instructions that speed things up.


On a GPU, higher precision can cost between 2 and 64 times more than single precision, with typical ratios for consumer cards being 16 or 32. Even on the CPU, fp64 workloads tend to run at half the speed on real data due to the extra bandwidth needed for higher precision.


One of the things that you can do with D and as far as I know Julia is enable specific optimizations locally e.g. allow FMAs here and there, not globally.

fast-math is one of the dumbest things we have as an industry IMO.


Totally agreed. In Julia we use https://github.com/SciML/MuladdMacro.jl all over the place so that way it's contextual and does not bleed into other functions. fast-math changing everything is just... dangerous.


Clang generates the shortest of these if you target sandybridge, or x86-64-v3, or later. The real article that's buried in this article is that compilers target k8-generic unless you tell them otherwise, and the features and cost model of opteron are obsolete.

Always specify your target.


Yep. Adding "-C target-cpu=native" to rustc on my desktop computer consistently gets a ~10-15% performance boost compared to the default target. The default target is extremely conservative. As far as I can tell, it doesn't take advantage of any CPU features added in the last 20 years. (The k8 came out in 2003.)


Red Hat Enterprise Linux has upgraded their default target to x86-64-v2 and is considering switching to x86-64-v3 for RHEL 10 (which should release around 2026?). I'd take that as a sign that those might be reasonable choices for newly released software.

Some linux distros also give you the option to either get a version compatible with ancient hardware or the optimized x86-64-v3 version, which seems like a good compromise.


Those Gentoo people were onto something.


Funny that it stopped being the case for a while around 2006. AMD64 became widespread while also being very new, closing the gap between "default" and "native".


Of course, gentoo just started using prebuilt packages a few months ago…


Even with -march=x86-64-v4 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:

https://godbolt.org/z/hd44KjMMn


I'm a heavy std::clamp user, but I'm considering replacing it with min+max because of the uncertainty about what will happen when lo > hi. On windows it triggers an assertion, while other platforms just do a min+max in one or the other order. Of course, this should never happen but can be difficult to guarantee when the limits are derived from user inputs.


> Of course, this should never happen but can be difficult to guarantee when the limits are derived from user inputs.

Sounds to me like you are missing a validation step before calling your logic. When it comes to parsing, trusting user input is a recipe for disaster in the form of buffer overruns and potential exploits.

As they used to say in the Soviet Union: "trust, but verify".


The answer is of course

    clamp(v, min(a, b), max(a, b))
classic c++


That was what Reagan said about the Soviet Union, not what was said in the Soviet Union.

Correct me if I'm wrong.


https://en.wikipedia.org/wiki/Trust,_but_verify

> Trust, but verify (Russian: доверяй, но проверяй, tr. doveryay, no proveryay, IPA: [dəvʲɪˈrʲæj no prəvʲɪˈrʲæj]) is a Russian proverb, which is rhyming in Russian. The phrase became internationally known in English after Suzanne Massie, a scholar of Russian history, taught it to Ronald Reagan, then president of the United States, the latter of whom used it on several occasions in the context of nuclear disarmament discussions with the Soviet Union.

Memorably referenced in "Chernobyl": https://youtu.be/9Ebah_QdBnI?t=79


Also referenced in the Metro Exodus "Sam's Story" DLC because of the backstory of the two characters speaking, and nuclear weapons once again being part of the scenario.


When I hear the phrase, I rewrite it to "don't trust, verify."


Thanks for the clarification/explanation!


Russian here. We use that expression from time to time: https://ya.ru/search/?text="доверяй%2C+но+проверяй"


According to Wikipedia you're right (about Reagan), but it's also a Russian proverb.


Pretty sure that their behaviors on NaN arguments will also differ.


I hope they fix it. That's quite a basic functional unit for it to be a footgun all on its own.


Don't get your hopes up, the behavior when lo > hi is explicitly undefined.


Will min+max help you? What do you expect the answer to be when lo > hi? What certainty should std::clamp have? Using min+max when lo > hi will always return either lo or hi, and never your input value.


Sure, that was the point - min(max()) forces you to give explicit preference to lo or hi, whereas with clamp it's up to the std library. I trust my users to bend my software to their will, but I don't want different behavior on mac and windows (for example).


Yeah, seems reasonable. I think the outer call wins, so min(max()) will always return lo for empty intervals, right? I didn’t know std::clamp() was undefined for empty intervals. It does seem like a good idea to try to guarantee the interval is valid instead of worrying about clamp… even with a guarantee, the answer might still surprise someone, since technically the problem is mathematically undefined and the guaranteed answer is wrong.


Both recent GCC and Clang are able to generate the most optimal version for std::clamp() if you add something like -march=znver1, even at -O1 [0]. Interesting!

[0] https://godbolt.org/z/YsMMo7Kjz


But then it uses AVX instructions. (You can replace -march=znver1 with just -mavx.)

When AVX isn’t enabled, the std::min + std::max example still uses fewer instructions. Looks like a random register allocation failure.


The additional "movapd xmm0, xmm2" is mostly free as it is handled by renaming, but yes, it seems a quirk of the register allocator. It wouldn't be the first time I see GCC trying to move stuff around without obvious reasons.


I don't think it's a register allocation failure but is in fact necessitated by the ABI requirement (calling convention) for the first parameter to be in xmm0 and the return value to also be placed into xmm0.

So when you have an algorithm like clamp which requires v to be "preserved" throughout the computation you can't overwrite xmm0 with the first instruction, basically you need to "save" and "restore" it which means an extra instruction.

I'm not sure why this causes the extra assembly to be generated in the "realistic" code example though. See https://godbolt.org/z/hd44KjMMn


Even with -march=znver1 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:

https://godbolt.org/z/WMKbeq5TY


On a somewhat similar note, don't use std::lerp if you don't need its strong guarantees around rounding (monotonicity among other things).

https://godbolt.org/z/hzrG3s6T4


I see that the assembly instructions are different, but what's the performance difference? Personally, I don't care about the number of instructions used, as long as it's faster. With things like store forwarding and register files, a lot of those movs might be treated as noops.


The only times I worry about min/max/clamp performance is when I need to do thousands or millions of them. And in that case, I’d suggest intrinsics. You get to choose how NaN is handled, it’s branchless, and you can do multiple in parallel.

It feels backwards that you need to order your comparisons so as to generate optimal assembly.
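
For example, a branchless sketch with SSE2 intrinsics (function names are mine): minpd/maxpd return their second operand if either input is NaN, so the argument order picks the NaN policy explicitly:

    #include <emmintrin.h>

    // NaN lanes come out as lo:
    __m128d clamp_nan_to_lo(__m128d v, __m128d lo, __m128d hi) {
        return _mm_min_pd(_mm_max_pd(v, lo), hi);
    }

    // NaN lanes propagate (same observable result as libstdc++ std::clamp):
    __m128d clamp_nan_propagates(__m128d v, __m128d lo, __m128d hi) {
        return _mm_min_pd(hi, _mm_max_pd(lo, v));
    }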


https://bugs.llvm.org/show_bug.cgi?id=47271

This specific test (click the godbolt links) does not reproduce the issue.


If you benchmark these, you'll likely find the version with the jump edges out the one with the conditional instruction in practice.


FYI. https://quick-bench.com/q/sK9t9GoFDRkx9XxloUUbB8Q3ht4

Using this microbenchmark on an Intel Sapphire Rapids CPU, compiling with -march=k8 to get the older form takes ~980ns, while compiling with -march=native gives ~570ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.


With random test cases, branch prediction can't help.


Compilers often under-generate conditional instructions. They implicitly assume (correctly) that most branches you write are 90/10 (ie very predictable), not 50/50. The branches that actually are 50/50 suffer from being treated as being 90/10.


The branches in this example are not 50/50.

Given a few million calls of clamp, most would be no-ops in practice. Modern CPUs are very good at dynamically observing this.


Do you know that for a fact? For all calls of clamp? I have definitely used min and max when they are true 50/50s and I assume clamp also gets some similar use.


Modern compilers generate code assuming all branches are highly predictable.

If your use case does not follow that pattern and you really care about performance, you have to pull out something like inline assembly.

Consider software like ffmpeg which have to do this for the sake of performance.


It's hard to predict statically which branches will be dynamically unpredictable.

A seasoned hardware architect once told me that Intel went all-in on predication for Itanium, under the assumption that a Sufficiently Smart Compiler could figure it out, and then discovered to their horror that their compiler team's best efforts were not Sufficiently Smart. He implied that this was why Intel pushed to get a profile-guided optimization step added to the SPEC CPU benchmark, since profiling was the only way to get sufficiently accurate data.

I've never gone back to see whether the timeline checks out, but it's a good story.


The compiler doesn't do much of the predicting; it's done by the CPU at runtime.


Not prediction, predication: https://en.wikipedia.org/wiki/Predication_(computer_architec...

By avoiding conditional branches and essentially masking out some instructions, you can avoid stalls and mis-predictions and keep the pipeline full.

Actually I think @IainIreland mis-remembers what the seasoned architect told him about Itanium. While Itanium did support predicated instructions, the problematic static scheduling was actually because Itanium was a VLIW machine: https://en.wikipedia.org/wiki/VLIW .

TL;DR: dynamic scheduling on superscalar out-of-order processors with vector units works great and the transistor overhead got increasingly cheap, but static scheduling stayed really hard.


That must depend on the platform and the surrounding code, no?


Yes. On platform - most modern cpus are happier with predictable branches than exotic instructions.

On surrounding code - for sure.



