
Improving math performance in glibc - LaSombra
http://developerblog.redhat.com/2015/01/02/improving-math-performance-in-glibc/
======
fdej
Related: I recently published a preprint describing some techniques to speed
up evaluation of transcendental functions in the range of a few hundred bits
to a few thousand bits:
[http://arxiv.org/abs/1410.7176](http://arxiv.org/abs/1410.7176)

------
simonbyrne
The easiest way to make elementary functions faster is to use a different
libm: the glibc preoccupation with being correctly rounded (i.e. accurate to
within 0.5 ulps) imposes significant performance penalties. For example, the
glibc log function, which is often the dominant cost in sampling algorithms,
is about 2.5x slower than a straightforward implementation (i.e. no fancy
SIMD tricks) of Tang's 25-year-old table-lookup algorithm, which is accurate
to 0.57 ulps. Recent OS X libms are faster still, and appear to be accurate
to about 0.51 ulps, though they are unfortunately closed source.
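
For illustration, here is a minimal sketch of the table-lookup idea in C (the
128-entry table, the degree-3 polynomial, and filling the table with libm's
own log() are simplifications for readability, not Tang's actual parameters;
a real implementation hard-codes carefully rounded table values and a longer
polynomial to reach the accuracy quoted above):

    /* Sketch of a Tang-style table-lookup log(x) for x > 0; build with -lm.
       Not production quality: no handling of x <= 0, NaN, or Inf. */
    #include <math.h>
    #include <stdio.h>

    #define TBL_SIZE 128                /* table resolution: steps of 1/128 */

    static double log_tbl[TBL_SIZE];    /* log(1 + j/128) for j = 0..127 */

    static void init_tbl(void) {
        for (int j = 0; j < TBL_SIZE; j++)
            log_tbl[j] = log(1.0 + (double)j / TBL_SIZE);
    }

    static double table_log(double x) {
        /* Decompose x = 2^e * m with m in [1, 2). */
        int e;
        double m = 2.0 * frexp(x, &e);  /* frexp returns m in [0.5, 1) */
        e -= 1;

        /* Largest table point F = 1 + j/128 below m; the residual
           r = (m - F)/F then satisfies 0 <= r < 1/128. */
        int j = (int)((m - 1.0) * TBL_SIZE);
        double F = 1.0 + (double)j / TBL_SIZE;
        double r = (m - F) / F;

        /* log(1 + r) by a short polynomial; since r is tiny, degree 3
           already gives ~30 bits here (a real libm uses a longer one). */
        double p = r - r * r / 2.0 + r * r * r / 3.0;

        return e * M_LN2 + log_tbl[j] + p;
    }

    int main(void) {
        init_tbl();
        printf("%.17g\n%.17g\n", table_log(2.75), log(2.75));
        return 0;
    }

The point is structural: one table lookup, one division, and a short
polynomial, with no data-dependent slow path.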

~~~
gaius
Back in the good old days, Motorola provided a drop-in highly optimized
replacement maths library for Power Macs
[http://uk.mathworks.com/matlabcentral/newsreader/view_thread...](http://uk.mathworks.com/matlabcentral/newsreader/view_thread/1220)

I remember being amazed in my final year of college that the PowerMac 7600
I'd spent my student loan on was running my FORTRAN as fast as the DEC Alphas
in the lab.

------
MichaelCrawford
I am tinkering around with making some library code faster. I don't have
definite results yet but I expect to soon.

Consider the comment in glibc's qsort that says something like "this works
well on a Sun 4/260". That machine is about 25 years old.

When I have something to report of course HN will be the first to know.

My main focus is the reduction of energy consumption. Often but not always
that can be done by making the code faster.

------
csense
Aren't these implemented in hardware on most CPUs? Or is this floating-point
emulation for CPUs that don't have built-in floating point?

~~~
derf_
There's an x87 asm implementation for x86-32 in sysdeps/i386/fpu/e_exp.S [1].
For x86-64, it falls back to the generic C code for exp(), in
sysdeps/ieee754/dbl-64/e_exp.c. The sad part is that there _is_ x86-64 asm for
a few variants (exp10l, exp2l, expf, expl), but since these are not in C90,
people tend to avoid them in favor of more portable versions.

So, ironically, it can be orders of magnitude faster to compute a long-double
expl() than a double-precision exp().

[1] Even the asm contains a lot of extra code to handle edge cases that Intel
screwed up, like returning NaN for exp(+/- Inf).
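
A quick way to check on a given machine is a crude timing loop like the one
below (a rough sketch, not a rigorous benchmark: single run, no warm-up, and
inputs in [0, 1) mostly exercise the fast paths; the "orders of magnitude"
gap only shows up on the rare slow-path inputs):

    /* Crude timing comparison of exp() vs expl(); build with -lm. */
    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    #define N 10000000

    static double elapsed(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
    }

    int main(void) {
        struct timespec t0, t1;
        volatile double acc = 0.0;        /* volatile keeps the loops alive */
        volatile long double accl = 0.0L;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            acc += exp((double)i / N);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("exp():  %f s\n", elapsed(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            accl += expl((long double)i / N);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("expl(): %f s\n", elapsed(t0, t1));
        return 0;
    }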

~~~
pascal_cuoq
The 387 instructions for elementary functions do not have the reputation of
having a good accuracy/time ratio compared to, say, state-of-the-art
implementations using polynomial approximations.

Do you have actual measurements that reveal some actual irony, or is it only
superficial irony in that expl() is computed using the (microcoded, slow)
hardware instruction and exp() isn't?

Note: For most functions, you can actually make a pretty accurate double-
precision implementation by calling the corresponding 387 instruction and then
rounding the result to double-precision. Not a correctly rounded
implementation (that would require a wider intermediate result than 80-bit)
but pretty good nonetheless.
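
As a minimal sketch of that note, assuming an x86 glibc where expl() goes
through the x87 unit (as discussed above): evaluate with the ~64-bit extended
significand, then round once to double's 53 bits. Not correctly rounded, as
said, but the double rounding only rarely moves the result:

    #include <math.h>

    /* "Compute wide, then round": ~64-bit significand via expl(),
       then a single final rounding down to double precision. */
    double exp_via_extended(double x) {
        long double wide = expl((long double)x);
        return (double)wide;
    }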

~~~
derf_
We didn't measure average case performance, but for worst-case performance the
"orders of magnitude" comment is absolutely backed by measurement. See also my
comment at
[https://news.ycombinator.com/item?id=8830667](https://news.ycombinator.com/item?id=8830667)

In our case, we only needed about 12 bits of accuracy anyway, so we wound up
writing our own implementation using polynomial approximations (which have the
benefit of predictable running times regardless of the quality of your C
library). The fact that you need to do this in any context where worst-case
performance is a concern might be surprising to some people. It was to me
before I read glibc's code.
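
Something along these lines (a self-contained sketch, not CELT's actual code;
the cubic coefficients come from a plain interpolation of 2^f on [0, 1] and
give roughly 12-13 bits, and there is no overflow or NaN handling):

    /* Low-accuracy exp() with a data-independent running time:
       exp(x) = 2^(x*log2(e)) = 2^n * 2^f, n integer, f in [0, 1). */
    #include <math.h>   /* floorf(), ldexpf(); build with -lm */

    float fast_exp(float x) {
        float y = x * 1.44269504f;   /* x * log2(e) */
        int   n = (int)floorf(y);
        float f = y - (float)n;      /* fractional part, in [0, 1) */

        /* Cubic approximation of 2^f; max relative error ~1e-4. */
        float p = 1.0f + f * (0.6959844f
                      + f * (0.2249965f + f * 0.0790191f));

        return ldexpf(p, n);         /* exact scaling by 2^n */
    }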

~~~
derf_
I take it back: we did measure average-case performance of expf() (though not
expl()), and it was slower than exp():

16:12:48 < gmaxwell> ha. So, completely eliminating all the doubles made celt
slower.

16:13:23 < gmaxwell> I'm guessing that this is because I used the C99 float
versions of the math functions and they suck.

16:13:57 < gmaxwell> looks like 3% slower or so.

16:14:05 < gmaxwell> <3 GLIBC.

16:26:02 < gmaxwell> with float approx the no-doubles is .9% faster. Whoohoo.
;)

17:24:06 < derf> Yes. Don't use the float versions of libc functions, at least
not on x86-64.

17:24:24 < Dark_Shikari> what the heck did they do ?

17:24:53 < derf> They reset the rounding modes on entrance and exit of expf(),
for example, regardless of whether or not they're already correctly set up.

17:25:07 < derf> Apparently this is really slow.

17:25:32 < gmaxwell> derf: I was hoping that they'd be fixed by now.

"Float approx" refers to our custom polynomial approximation of exp(). Given
that, I think it's safe to infer that expl() would also have been slower than
exp() in the average case. That was in 2010, though. This appears to have been
fixed for expf() in 2012 (which now uses SSE/AVX instead of x87 asm on
x86-64), but not for expl().

~~~
siddhesh
> 17:24:53 < derf> They reset the rounding modes on entrance and exit of
> expf(), for example, regardless of whether or not they're already correctly
> set up.

I fixed this around the same time I worked on the mp improvements:

    commit 2506109403de69bd454de27835d42e6eb6ec3abc
    Author: Siddhesh Poyarekar <siddhesh@redhat.com>
    Date:   Wed Jun 12 10:36:48 2013 +0530

        Set/restore rounding mode only when needed
        
        The most common use case of math functions is with default rounding
        mode, i.e. rounding to nearest.  Setting and restoring rounding mode
        is an unnecessary overhead for this, so I've added support for a
        context, which does the set/restore only if the FP status needs a
        change.  The code is written such that only x86 uses these.  Other
        architectures should be unaffected by it, but would definitely benefit
        if the set/restore has as much overhead relative to the rest of the
        code, as the x86 bits do.
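
The idea in portable C99 terms, as a simplified sketch (the real code uses
glibc-internal macros and cheaper architecture-specific checks than
fegetround()):

    /* Set round-to-nearest only if needed; restore only if changed. */
    #include <fenv.h>

    typedef struct {
        int saved;     /* rounding mode on entry */
        int changed;   /* did we actually switch it? */
    } rounding_ctx;

    static void ctx_set_to_nearest(rounding_ctx *ctx) {
        ctx->saved = fegetround();
        ctx->changed = (ctx->saved != FE_TONEAREST);
        if (ctx->changed)
            fesetround(FE_TONEAREST);   /* pay the cost only when needed */
    }

    static void ctx_restore(const rounding_ctx *ctx) {
        if (ctx->changed)
            fesetround(ctx->saved);
    }

A function like expf() then brackets its body with these two calls, so
callers already in round-to-nearest (the overwhelmingly common case) skip
both fenv writes.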

~~~
derf_
That patch doesn't affect the x87 asm in sysdeps/x86_64/fpu/e_expl.S, though,
does it? Good to hear it's fixed for some cases, at least.

------
Marat_Dukhan
Why not just merge the CRLibM implementations?

~~~
GFK_of_xmaspast
Are they actual improvements? The project looks like it's been dead almost
five years.

(Second guess: ideology)

~~~
Marat_Dukhan
Implementations in CRLibM are correctly rounded (i.e. as accurate as possible)
and are, in my benchmarks, faster than GNU LibM. Thus, they are a Pareto
improvement over GNU LibM.

~~~
siddhesh
The crlibm code is terrible to read and hence maintain. I had considered
including crlibm too during my work but decided against it because it would be
a lot more work.

Also, crlibm functions are provably correctly rounded only for univariate
transcendentals (thanks to the worst-case paper I cited in the blog). There is
no such proof available for bivariate functions, which is why computing at an
arbitrarily high precision (the way glibc does) is the only way to
increase the likelihood of correctly rounded results. IIRC crlibm does not
even try for bivariates.

I believe some documentation on the crlibm website cites benchmark numbers
where IIRC glibc performs better (although not by much) in the fast paths of
most functions they mention but gets hit really badly in the slow path. Slow
path inputs are usually quite rare, but if your application/benchmark hits
them regularly enough (especially since the slow path is about 1000x slower)
then you'll obviously conclude that the performance sucks and that
crlibm is better.

~~~
Marat_Dukhan
Sorry, I don't buy the argument that "code is terrible to read and hence
maintain" from a GLibC developer. CRLibM does have a bunch or tricky macros to
work-around bugs in decade-old compilers for architectures no one cares about,
but GNU LibC is not any better in this aspect.

CRLibM does not have a provably correctly-rounded pow, but univariate
functions can be merged.

The paper you cited is about a decimal floating-point implementation; it is
not relevant to the LibM functions discussed in the article.

I have a utility for benchmarking LibM libraries here:
[https://bitbucket.org/MDukhan/hysteria](https://bitbucket.org/MDukhan/hysteria)

Some benchmarks including CRLibM and GNU LibM are here:
[http://www.yeppp.info/benchmarks.html](http://www.yeppp.info/benchmarks.html)

~~~
siddhesh
Yes, the link is incorrect. I am working on getting that fixed.

------
cschwan
Did anyone find the author's commits in glibc's repository? It would be nice
to know what he really changed, and when.

~~~
siddhesh
These are all the changes I made in math during 2.18. They're almost all mp
stuff with some code cleanup thrown in.

    git log --author=siddhesh release/2.18/master -- sysdeps/ieee754/dbl-64

------
revelation
No wonder people don't like doing floating-point calculations if there's a
chance that certain inputs make your pow call fall through into a 1000x
slower code path.

I mean, that should probably be an option. It's realloc all over again:
plainly unusable because of the rare chance the function goes mad and copies
data instead of failing.

~~~
pascal_cuoq
Do you have a reference for the claim that realloc() is unusable for the
reason you say?

Many users of floating-point are not using floating-point in hard real-time
contexts. Even soft real-time contexts can be fine with Ziv-style successive
approximations. Correctly rounded library functions have intangible benefits
beyond precision (portability, for instance), but if you don't value these
benefits, you can always use inaccurate and always fast functions. You don't
even need to invent your own: just rip off the first stage of a multi-stage
Ziv implementation.
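
The structure being described, as a sketch: fast_stage() below is just
expl() standing in for a cheap extended-precision first stage, slow_stage()
stands in for the multiple-precision fallback, and the 1e-18 relative error
bound is an assumption for illustration:

    /* Ziv-style two-stage exp(): accept the fast result only when its
       whole error interval rounds to a single double. Build with -lm. */
    #include <math.h>

    static long double fast_stage(double x) { return expl(x); }
    static double slow_stage(double x) { return (double)expl(x); }

    double ziv_exp(double x) {
        long double y   = fast_stage(x);
        long double err = fabsl(y) * 1e-18L;   /* assumed error bound */

        /* Classic rounding test: if y-err and y+err round to the same
           double, that double is the correctly rounded answer. */
        double lo = (double)(y - err);
        double hi = (double)(y + err);
        if (lo == hi)
            return lo;

        /* Rare ambiguous case: the slow path discussed above. */
        return slow_stage(x);
    }

Returning the fast result unconditionally gives exactly the "inaccurate and
always fast" variant: that is the first stage to rip off.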

I don't think any of this is the reason why “people don't like doing floating-
point calculations”.

