

Fast Approximate Logarithms, Part III: The Formulas - kilimchoi
http://www.ebaytechblog.com/2015/06/25/fast-approximate-logarithms-part-iii-the-formulas/

======
eutectate
Using Horner's method to evaluate the polynomials is probably suboptimal
because it introduces dependencies and eliminates instruction-level
parallelism. It might be worth trying Estrin's method instead (if the compiler
does not rewrite them already).
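
For illustration, here's what the two schemes look like for a degree-4 polynomial in C; the coefficients are placeholders, not the article's:

```c
/* Degree-4 polynomial two ways; C0..C4 are placeholder coefficients, not
   the article's. */
static const float C0 = 1.0f, C1 = 0.5f, C2 = 0.25f,
                   C3 = 0.125f, C4 = 0.0625f;

/* Horner: every multiply-add depends on the previous one, so latency is a
   single serial chain. */
static float poly_horner(float x) {
    return C0 + x * (C1 + x * (C2 + x * (C3 + x * C4)));
}

/* Estrin: group terms into pairs so two shorter chains are independent
   and can issue in parallel, then combine with x^2. */
static float poly_estrin(float x) {
    float x2 = x * x;
    float lo = C0 + C1 * x;              /* independent of hi */
    float hi = C2 + C3 * x;
    return lo + x2 * (hi + x2 * C4);
}
```

Both forms compute the same polynomial; Estrin trades one extra multiply (x^2) for a shorter critical path.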

~~~
brandmeyer
Actually, there is quite a bit of available ILP in the forms that use division
- as high as four insns. The numerator and denominator can be evaluated in
parallel, and the coefficient loads can be interleaved with the FMAs to hide
some of their latency.

That said, with such a low variation in the timings I wonder if some other
overhead was dominating the costs. For example, floating-point division is
typically much slower than either addition or multiplication. According to
Agner Fog, div is 2-4x slower in latency than add and mul while completely
hogging the pipeline.
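
A sketch of that structure in C (the coefficients are the ones from pkhuong's SBCL snippet further down, which approximate log2(x) near x = 1 with o = x - 1):

```c
/* Rational form r(o) = (a2*o^2 + a1*o) / (o + b0), with the coefficients
   from pkhuong's SBCL snippet below (o = x - 1, approximating log2(x)). */
static float rational_form(float o) {
    /* num and den share no data dependency, so an out-of-order core can
       run both chains in parallel; only the final divide serializes. */
    float num = o * (2.198599f + o * 0.338953f);
    float den = o + 1.523692f;
    return num / den;
}
```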

------
opcvx
I have tried the code posted at the end of the article and it gives wildly
incorrect results for input values in the range [0.5, 1.5). Also, starting at
multiples of 4, output values suddenly diverge and then converge back toward
the correct value until the next multiple.

Did I forget to do something with the result or is there a bug?

~~~
pkhuong
For values far from 1, log_2 will be dominated by the integer portion, which
IEEE FP representation gives you for free, so it's not surprising that outputs
eventually look right. I think the problem is that the approximation works
with (x - 1), but only the "significand >= 1.5" branch subtracts that offset.
Really, if you need speed, I'd probably start with an approximation over [1,
2)...

Re-implementation in SBCL:

    
    
      CL-USER> (defun approx (x)
                 (let ((o (1- x)))
                   (values (/ (+ (* o o 0.338953) (* o 2.198599)) (+ o 1.523692))
                           (log x 2.0))))
    
      CL-USER> (loop for x from 0.70 upto 1.51 by 0.1
                     do (format t "~,3F: ~{~F ~F~}~%" x (multiple-value-list (approx x))))
      0.700: -0.51407874 -0.5145732
      0.800: -0.32194924 -0.32192805
      0.900: -0.15204856 -0.15200303
      1.000: 0.0 0.0
      1.100: 0.13749498 0.13750356
      1.200: 0.26296926 0.26303446
      1.300: 0.3784003 0.37851173
      1.400: 0.48535433 0.48542693
      1.500: 0.585088 0.5849626
    

That's in the right ballpark for [0.7, 1.5]

    
    
      CL-USER> (defun almost-log2 (x)
                 (let* ((bits (sb-kernel:single-float-bits x))
                        (exp (ash bits -23)))
                   (values 
                    (if (logbitp 22 bits)
                        (+ (- exp 126) (approx (sb-kernel:make-single-float
                                                (dpb 126 (byte 9 23) bits))))
                        (+ (- exp 127) (approx (sb-kernel:make-single-float
                                                (dpb 127 (byte 9 23) bits)))))
                    (log x 2.0))))
    
      CL-USER> (loop for x from 0.1 upto 10.1 by 0.1
                     do (format t "~,3F: ~{~,8F ~,8F~}~%" x (multiple-value-list (almost-log2 x))))
      0.100: -3.32194920 -3.32192800
      0.200: -2.32194920 -2.32192800
      0.300: -1.73703070 -1.73696570
      0.400: -1.32194920 -1.32192800
      0.500: -1.00000000 -1.00000000
      0.600: -0.73703074 -0.73696554
      0.700: -0.51464570 -0.51457310
      0.800: -0.32194915 -0.32192796
      0.900: -0.15204845 -0.15200295
      1.000: 0.00000017 0.00000017
      1.100: 0.13749513 0.13750371
      1.200: 0.26296943 0.26303460
      1.300: 0.37840047 0.37851185
      1.400: 0.48535448 0.48542705
      1.500: 0.58509207 0.58496270
      1.600: 0.67805094 0.67807215
      1.700: 0.76547647 0.76553500
      1.800: 0.84795165 0.84799720
      1.900: 0.92598030 0.92599964
      2.000: 1.00000010 1.00000010
      2.100: 1.07039330 1.07038940
      2.200: 1.13749500 1.13750360
      [...]
      2.800: 1.48535400 1.48542650
      2.900: 1.53605470 1.53605260
      3.000: 1.58508750 1.58496210
      3.100: 1.63230250 1.63226800
      3.200: 1.67805030 1.67807150
      [...]
      3.800: 1.92597950 1.92599880
      3.900: 1.96346550 1.96347340
      4.000: 1.99999940 1.99999930
      4.100: 2.03562740 2.03562330
      4.200: 2.07039260 2.07038880
      [...]
      5.000: 2.32183340 2.32192750
      [...]
      6.000: 2.58508730 2.58496170
      [...]
      9.900: 3.30734100 3.30742860
      10.000: 3.32183430 3.32192850
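
A rough C port of the same scheme, assuming IEEE-754 single-precision layout and x > 0:

```c
#include <stdint.h>
#include <string.h>

/* The rational fit around x = 1 (o = x - 1), ported from the SBCL approx
   above. */
static float approx_c(float x) {
    float o = x - 1.0f;
    return (o * o * 0.338953f + o * 2.198599f) / (o + 1.523692f);
}

/* Same exponent split as almost-log2 above: test the top significand bit,
   rescale the significand into [0.75, 1.5) so the fit is always evaluated
   near 1, then add the exponent back in. Assumes x > 0. */
static float almost_log2(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);              /* type-pun without UB */
    int exp = (int)(bits >> 23);                 /* exponent field (x > 0) */
    uint32_t sig = bits & 0x007fffffu;
    if (bits & (1u << 22)) {                     /* significand >= 1.5 */
        float m;                                 /* rescaled into [0.75, 1) */
        uint32_t rebased = sig | (126u << 23);
        memcpy(&m, &rebased, sizeof m);
        return (float)(exp - 126) + approx_c(m);
    } else {                                     /* significand in [1, 1.5) */
        float m;
        uint32_t rebased = sig | (127u << 23);
        memcpy(&m, &rebased, sizeof m);
        return (float)(exp - 127) + approx_c(m);
    }
}
```

This tracks the table above: exact at powers of two, worst around the middle of each octave.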

~~~
jason_s
in [part I](http://www.ebaytechblog.com/2015/05/01/fast-approximate-logarithms-part-i-the-basics/)
the author argues against using [1, 2) because it causes
large relative error near x=1 (since log x = 0 there). I would disagree (why
would you ever care about relative error in a logarithm -- IMHO absolute error
makes much more sense... unless for some crazy reason you're dividing by log x
near x=1) but that's his argument.

~~~
pkhuong
It's an interesting trade off; you get a nice bump in precision for a small
increase in code complexity.

------
stagger87
Just for fun I benchmarked the function found at the end of the post,
disregarding accuracy for the moment.

If you call this function on a single value at a time, the MSFT log function
takes about twice as long to produce a full-accuracy, properly handled result.

If you call this function over an array of data, for instance iterating over
an array of floats to apply the log function to each value (typically done
because it lets you take advantage of SIMD CPU instructions), Visual Studio
will actually replace your for loop (with the standard log function) with a
vectorized version of the log function once the array is around 16 floats
long. That gives you a full-accuracy log function at about half the speed of
this function at array length 16, rising to upwards of 4.5 times faster at
lengths of 64k. Intel's comparable 11-bit precision vectorized log function
performs about 6 times faster at lengths of 64k or longer.

Simply reorganizing your data could lead to incredible improvements.

*Tested on an Intel 4770 CPU, forcing single-threaded execution, using CPU clock cycles as a timer, averaged over ~1 minute of execution time. Intel library cost is a couple hours of developer time. Input was uniformly distributed over the range 1.0e-6 to 1.0e6.

~~~
pkhuong
> Intel's comparable 11 bit precision vectorized log function performs about 6
> times faster at lengths of 64k or longer.

The issue is probably the division in the approximation. Agner Fog says even
Sandy Bridge doesn't pipeline FP (or integer) divisions, which take from 10 to
14 cycles (less than two add/mul pairs). If the goal is to minimise latency
for a small number of values, a lower order rational approximation might be
preferable to a higher order polynomial; however, when we want throughput,
add/mul pipeline better.

------
pervycreeper
Anyone care to speculate why they need to use logs so much in their codebase?

~~~
nkurz
Log functions can be called heavily in the machine learning algorithms used
for making recommendations and automatically categorizing items. I'd presume
this is Ebay's use case.

