Hacker News new | comments | show | ask | jobs | submit login
Improving math performance in glibc (redhat.com)
123 points by LaSombra on Jan 2, 2015 | hide | past | web | favorite | 32 comments

Related: I recently published a preprint describing some techniques to speed up evaluation of transcendental functions in the range of a few hundred bits to a few thousand bits: http://arxiv.org/abs/1410.7176

The easiest way to make elementary functions faster is to use a different libm: the glibc preoccupation with being correctly rounded (ie accurate to within 0.5 ulps) imposes significant performance penalties. For example the glibc log function, which is often the dominant cost in a lot of sampling algorithms, is about 2.5x slower than a straightforward implementation (ie no fancy SIMD tricks) of Tang's 25 year old table lookup algorithm, which is accurate to 0.57 ulps. Recent OS X libms are faster still, and appear to be accurate to about 0.51 ulps, though unfortunately are closed source.

Back in the good old days, Motorola provided a drop-in highly optimized replacement maths library for Power Macs http://uk.mathworks.com/matlabcentral/newsreader/view_thread...

I remember being amazed in the final year of college, the PowerMac 7600 I'd spent my student loan on, was running my FORTRAN as quick as the DEC Alphas in the lab.

Maybe this is a place where GLIBC could provide additional variants with improved performance and reduced rounding accuracy.

Aren't these implemented in hardware on most CPU's? Or is this floating point emulation for CPU's that don't have built-in floating point?

The hardware implementations often have problems: http://randomascii.wordpress.com/2014/10/09/intel-underestim...

There's an x87 asm implementation for x86-32 in sysdeps/i386/fpu/e_exp.S [1]. For x86-64, it falls back to the generic C code for exp(), in sysdeps/ieee754/dbl-64/e_exp.c. The sad part is that there is x86-64 asm for a few variants (exp10l, exp2l, expf, expl), but since these are not C90, people tend to avoid them for more portable versions.

So, ironically, it can be orders of magnitude faster to compute a long-double expl() than a double-precision exp().

[1] Even the asm contains a lot of extra code to handle edge cases that Intel screwed up, like returning NaN for exp(+/- Inf).

The 387 instructions for elementary functions do not have the reputation of having a good accuracy/time ratio compared to, say, state-of-the-art implementations using polynomial approximations.

Do you have actual measurements that reveal some actual irony, or is it only superficial irony in that expl() is computed using the (microcoded, slow) hardware instruction and exp() isn't?

Note: For most functions, you can actually make a pretty accurate double-precision implementation by calling the corresponding 387 instruction and then rounding the result to double-precision. Not a correctly rounded implementation (that would require a wider intermediate result than 80-bit) but pretty good nonetheless.

We didn't measure average case performance, but for worst-case performance the "orders of magnitude" comment is absolutely backed by measurement. See also my comment at https://news.ycombinator.com/item?id=8830667

In our case, we only needed about 12 bits of accuracy anyway, so we wound up writing our own implementation using polynomial approximations (which have the benefit of predictable running times regardless of the quality of your C library). The fact that you need to do this in any context where worst-case performance is a concern might be surprising to some people. It was to me before I read glibc's code.

I take it back, we did measure average case performance of expf() (though not expl()), and it was slower than exp():

16:12:48 < gmaxwell> ha. So, completely eliminating all the doubles made celt slower.

16:13:23 < gmaxwell> I'm guessing that this is because I used the C99 float versions of the math functions and they suck.

16:13:57 < gmaxwell> looks like 3% slower or so.

16:14:05 < gmaxwell> <3 GLIBC.

16:26:02 < gmaxwell> with float approx the no-doubles is .9% faster. Whoohoo. ;)

17:24:06 < derf> Yes. Don't use the float versions of libc functions, at least not on x86-64.

17:24:24 < Dark_Shikari> what the heck did they do ?

17:24:53 < derf> They reset the rounding modes on entrance and exit of expf(), for example, regardless of whether or not they're already correctly set up.

17:25:07 < derf> Apparently this is really slow.

17:25:32 < gmaxwell> derf: I was hoping that they'd be fixed by now.

"Float approx" refers to our custom polynomial approximation of exp(). Given that, I think it's safe to infer that expl() would also have been slower than exp() in the average case. That was in 2010, though. This appears to have been fixed for expf() in 2012 (which now uses SSE/AVX instead of x87 asm on x86-64), but not for expl().

> 17:24:53 < derf> They reset the rounding modes on entrance and exit of expf(), for example, regardless of whether or not they're already correctly set up.

I fixed this around the same time I worked on the mp improvements:

commit 2506109403de69bd454de27835d42e6eb6ec3abc Author: Siddhesh Poyarekar <siddhesh@redhat.com> Date: Wed Jun 12 10:36:48 2013 +0530

    Set/restore rounding mode only when needed
    The most common use case of math functions is with default rounding
    mode, i.e. rounding to nearest.  Setting and restoring rounding mode
    is an unnecessary overhead for this, so I've added support for a
    context, which does the set/restore only if the FP status needs a
    change.  The code is written such that only x86 uses these.  Other
    architectures should be unaffected by it, but would definitely benefit
    if the set/restore has as much overhead relative to the rest of the
    code, as the x86 bits do.

That patch doesn't affect the x87 asm in sysdeps/x86_64/fpu/e_expl.S, though, does it? Good to hear it's fixed for some cases, though.

What I meant is that the assembly instruction is faster only because it does not provide the same level of accuracy as the generic implementation.

Everyone's needs are different, but it does seem that it is a little bit early, and perhaps that it will always be too early, for a general-purpose libm to aim for correctly rounded transcendental function. Even the best implementations only work in one rounding mode, leading to the pleasantries discussed in a sibling of this comment, and the worst-case execution time can be a problem for some.

It is possible to aim for 0.51 ULP with implementations that still work(-ish) if the rounding mode is not round-to-nearest, and perhaps this is the sweet spot that libms that want to be useful should aim for.

I believe on x86/x64 CPUs there's an advantage to doing these operations via SSE rather than using the old x87 opcodes. For instance, you don't need to move data into the x87 registers.

Right, you have to move it instead into the SSE registers.

Where it should be living anyways.

Why not just merge the CRLibM implementations?

Are they actual improvements? The project looks like it's been dead almost five years.

(Second guess: ideology)

Implementations in CRLibM are correctly rounded (i.e. as accurate as possible) and are, in my benchmarks, faster than GNU LibM. Thus, they are a Pareto improvement over GNU LibM.

The crlibm code is terrible to read and hence maintain. I had considered including crlibm too during my work but decided against it because it would be a lot more work.

Also, crlibm functions are provably correctly rounded only for univariate transcendentals (thanks to the worst case paper I cited in the blog). There is no such proof available for bivariate functions, which is why computing at an arbitrary very large precision (the way glibc does) is the only way to increase the likelihood of correctly rounded results. IIRC crlibm does not even try for bivariates.

I believe some documentation on the crlibm website cites benchmark numbers where IIRC glibc performs better (although not by much) in the fast paths of most functions they mention but gets hit really badly in the slow path. Slow path inputs are usually quite rare, but if your application/benchmark hits them regularly enough (especially sice the slow path is a factor of 1000 times worse) then you'll obviously conclude that the performance sucks and that crlibm is better.

Sorry, I don't buy the argument that "code is terrible to read and hence maintain" from a GLibC developer. CRLibM does have a bunch or tricky macros to work-around bugs in decade-old compilers for architectures no one cares about, but GNU LibC is not any better in this aspect.

CRLibM does not have a provably correctly-rounded pow, but univariate functions can be merged.

The paper you cited is about decimal floating-point implementation, it is not relevant to the LibM functions discussed in the article.

I have a utility for benchmarking LibM libraries here: https://bitbucket.org/MDukhan/hysteria

Some benchmarks including CRLibM and GNU LibM are here: http://www.yeppp.info/benchmarks.html

Yes the link is incorrect. I am working on getting that fixed.

In many contexts, the problem is not the absolute performance, but the variation in performance. For example, we want the Opus codec to be relatively insensitive to timing analysis, since Opus audio is often encrypted and transmitted over a network. Large variations represent an information leak that could be exploited by an external observer.

When we started looking at actual timings, we found that it was mostly pretty consistent, except for huge slowdowns in some frames. The cause? glibc's exp() slow path.

We have had this in our todo list for a while. The leading idea currently is to provide compiler selectable function variants that don't fall into the multiple precision abyss. The biggest challenge here is to come up with data about accuracy of results without the mp path.

For single precision single argument functions you can exhaustively test to obtain data about the accuracy of the results.

That's good to hear.

Did anyone find the authors commits in glibc's repository? Would be nice to know what he really changed and when.

These are all the changes I made in math during 2.18. They're almost all mp stuff with some code cleanup thrown in.

git log --author=siddhesh release/2.18/master -- sysdeps/ieee754/dbl-64

I am tinkering around with making some library code faster. I don't have definite results yet but I expect to soon.

Consider the comment in glibc's qsort that says something like "this works well on a Sun 4/260". That machine is about 25 years old.

When I have something to report of course HN will be the first to know.

My main focus is the reduction of energy consumption. Often but not always that can be done by making the code faster.

No wonder people don't like doing floating-point calculations if theres a chance with certain input to have your pow-call fallthrough into a 1000x slower code path.

I mean, that should probably be an option. It's realloc all over, plain unusable because of the rare chance the function goes mad and copies data instead of failing.

Do you have a reference for the claim that realloc() is unusable for the reason you say?

Many users of floating-point are not using floating-point in hard real-time contexts. Even soft real-time contexts can be fine with Ziv-style successive approximations. Correctly rounded library functions have intangible benefits beyond precision (portability, for instance), but if you don't value these benefits, you can always use inaccurate and always fast functions. You don't even need to invent your own: just rip off the first stage of a multi-stage Ziv implementation.

I don't think any of this is the reason why “people don't like doing floating-point calculations”.

Why are you blaming this on ~~~floating point~~~ instead of glibc?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact