I remember being amazed in the final year of college, the PowerMac 7600 I'd spent my student loan on, was running my FORTRAN as quick as the DEC Alphas in the lab.
So, ironically, it can be orders of magnitude faster to compute a long-double expl() than a double-precision exp().
 Even the asm contains a lot of extra code to handle edge cases that Intel screwed up, like returning NaN for exp(+/- Inf).
Do you have actual measurements that reveal some actual irony, or is it only superficial irony in that expl() is computed using the (microcoded, slow) hardware instruction and exp() isn't?
Note: For most functions, you can actually make a pretty accurate double-precision implementation by calling the corresponding 387 instruction and then rounding the result to double-precision. Not a correctly rounded implementation (that would require a wider intermediate result than 80-bit) but pretty good nonetheless.
In our case, we only needed about 12 bits of accuracy anyway, so we wound up writing our own implementation using polynomial approximations (which have the benefit of predictable running times regardless of the quality of your C library). The fact that you need to do this in any context where worst-case performance is a concern might be surprising to some people. It was to me before I read glibc's code.
16:12:48 < gmaxwell> ha. So, completely eliminating all the doubles made celt slower.
16:13:23 < gmaxwell> I'm guessing that this is because I used the C99 float versions of the math functions and they suck.
16:13:57 < gmaxwell> looks like 3% slower or so.
16:14:05 < gmaxwell> <3 GLIBC.
16:26:02 < gmaxwell> with float approx the no-doubles is .9% faster. Whoohoo. ;)
17:24:06 < derf> Yes. Don't use the float versions of libc functions, at least not on x86-64.
17:24:24 < Dark_Shikari> what the heck did they do ?
17:24:53 < derf> They reset the rounding modes on entrance and exit of expf(), for example, regardless of whether or not they're already correctly set up.
17:25:07 < derf> Apparently this is really slow.
17:25:32 < gmaxwell> derf: I was hoping that they'd be fixed by now.
"Float approx" refers to our custom polynomial approximation of exp(). Given that, I think it's safe to infer that expl() would also have been slower than exp() in the average case. That was in 2010, though. This appears to have been fixed for expf() in 2012 (which now uses SSE/AVX instead of x87 asm on x86-64), but not for expl().
I fixed this around the same time I worked on the mp improvements:
Author: Siddhesh Poyarekar <firstname.lastname@example.org>
Date: Wed Jun 12 10:36:48 2013 +0530
Set/restore rounding mode only when needed
The most common use case of math functions is with default rounding
mode, i.e. rounding to nearest. Setting and restoring rounding mode
is an unnecessary overhead for this, so I've added support for a
context, which does the set/restore only if the FP status needs a
change. The code is written such that only x86 uses these. Other
architectures should be unaffected by it, but would definitely benefit
if the set/restore has as much overhead relative to the rest of the
code, as the x86 bits do.
Everyone's needs are different, but it does seem that it is a little bit early, and perhaps that it will always be too early, for a general-purpose libm to aim for correctly rounded transcendental function. Even the best implementations only work in one rounding mode, leading to the pleasantries discussed in a sibling of this comment, and the worst-case execution time can be a problem for some.
It is possible to aim for 0.51 ULP with implementations that still work(-ish) if the rounding mode is not round-to-nearest, and perhaps this is the sweet spot that libms that want to be useful should aim for.
(Second guess: ideology)
Also, crlibm functions are provably correctly rounded only for univariate transcendentals (thanks to the worst case paper I cited in the blog). There is no such proof available for bivariate functions, which is why computing at an arbitrary very large precision (the way glibc does) is the only way to increase the likelihood of correctly rounded results. IIRC crlibm does not even try for bivariates.
I believe some documentation on the crlibm website cites benchmark numbers where IIRC glibc performs better (although not by much) in the fast paths of most functions they mention but gets hit really badly in the slow path. Slow path inputs are usually quite rare, but if your application/benchmark hits them regularly enough (especially sice the slow path is a factor of 1000 times worse) then you'll obviously conclude that the performance sucks and that crlibm is better.
CRLibM does not have a provably correctly-rounded pow, but univariate functions can be merged.
The paper you cited is about decimal floating-point implementation, it is not relevant to the LibM functions discussed in the article.
I have a utility for benchmarking LibM libraries here: https://bitbucket.org/MDukhan/hysteria
Some benchmarks including CRLibM and GNU LibM are here:
When we started looking at actual timings, we found that it was mostly pretty consistent, except for huge slowdowns in some frames. The cause? glibc's exp() slow path.
git log --author=siddhesh release/2.18/master -- sysdeps/ieee754/dbl-64
Consider the comment in glibc's qsort that says something like "this works well on a Sun 4/260". That machine is about 25 years old.
When I have something to report of course HN will be the first to know.
My main focus is the reduction of energy consumption. Often but not always that can be done by making the code faster.
I mean, that should probably be an option. It's realloc all over, plain unusable because of the rare chance the function goes mad and copies data instead of failing.
Many users of floating-point are not using floating-point in hard real-time contexts. Even soft real-time contexts can be fine with Ziv-style successive approximations. Correctly rounded library functions have intangible benefits beyond precision (portability, for instance), but if you don't value these benefits, you can always use inaccurate and always fast functions. You don't even need to invent your own: just rip off the first stage of a multi-stage Ziv implementation.
I don't think any of this is the reason why “people don't like doing floating-point calculations”.