
Chebyshev Approximation - alexcasalboni
http://www.embeddedrelated.com/showarticle/152.php
======
pietjepuk88
I was actually looking into Chebyshev polynomials and minimax optimization to
make a fast vectorized implementation of the logarithm function a few days
ago. In the process of looking around the internet for information I ran into
this: [https://github.com/pkhuong/polynomial-approximation-
catalogu...](https://github.com/pkhuong/polynomial-approximation-catalogue)
(also see link to blog at bottom)

He has nice sets of coefficients, derived by taking into account the discrete
values of floats/doubles and the performance gain from setting some
coefficients to 0, 1, or 2.

The only thing he did not account for is precision loss when multiplying and
adding values while actually evaluating the polynomial, which unfortunately
results in a large relative error around log(1). Eigen's vectorized
implementation does some weird tricks, reducing the range to [sqrt(0.5),
sqrt(2)], and does not exhibit this large relative error. The unvectorized
MSVC implementation does not exhibit it either.
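
For reference, the range-reduction idea looks roughly like this (a minimal
Python sketch of the technique, not Eigen's actual code; math.log stands in
for the fitted polynomial):

    import math

    def log_approx(x):
        # Split x = m * 2**e with m in [0.5, 1).
        m, e = math.frexp(x)
        # Shift m into [sqrt(0.5), sqrt(2)) so that near x = 1 the result
        # comes entirely from the polynomial term, not from e*log(2);
        # this is what keeps the relative error small around log(1) = 0.
        if m < math.sqrt(0.5):
            m *= 2.0
            e -= 1
        # A real implementation evaluates a polynomial in (m - 1) here.
        return math.log(m) + e * math.log(2.0)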

Does anyone know how Microsoft, Intel, and the like derive the coefficients
for their approximations of the elementary functions in math.h?

~~~
eliteraspberrie
The reference collection of polynomial approximations is _Approximations for
digital computers_ by Cecil Hastings Jr. The method to derive the polynomials
is described there, along with coefficients for many common functions. It
isn't as complicated as it seems. Intel has probably automated the procedure.

------
JacobiX
If you want a very simple introduction to Chebyshev approximation, see
[http://sidewindiegames.com/?p=436](http://sidewindiegames.com/?p=436); I skip
many useful details there. If you are really interested, I highly recommend
this article by Sam Hocevar: [http://lolengine.net/blog/2011/12/21/better-function-
approxi...](http://lolengine.net/blog/2011/12/21/better-function-
approximations)

~~~
jacobolus
Inre the first blog post link, the barycentric interpolation formula is a
pretty awesome way to efficiently evaluate polynomial interpolants,
[https://people.maths.ox.ac.uk/trefethen/publication/PDF/2004...](https://people.maths.ox.ac.uk/trefethen/publication/PDF/2004_106.pdf)

The standard Lagrange interpolation formula implemented in that blog post
works too, but is O(n^2) vs. O(n) for each evaluation, and not numerically
stable.
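
For a concrete picture, here's a minimal numpy sketch of the barycentric
formula at Chebyshev points of the second kind (my own illustration, not
code from the paper):

    import numpy as np

    def barycentric_cheb(fvals, x):
        # Evaluate the polynomial interpolant through (x_j, fvals[j]),
        # x_j = cos(pi*j/n), at a scalar x, in O(n) time.
        n = len(fvals) - 1
        xj = np.cos(np.pi * np.arange(n + 1) / n)
        w = (-1.0) ** np.arange(n + 1)  # weights, up to a common scale
        w[0] *= 0.5
        w[-1] *= 0.5
        d = x - xj
        if np.any(d == 0):              # x hit a node exactly
            return fvals[int(np.argmax(d == 0))]
        t = w / d
        return np.dot(t, fvals) / np.sum(t)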

Also see
[http://www.maths.manchester.ac.uk/~higham/narep/narep440.pdf](http://www.maths.manchester.ac.uk/~higham/narep/narep440.pdf)
[https://people.maths.ox.ac.uk/trefethen/publication/PDF/2011...](https://people.maths.ox.ac.uk/trefethen/publication/PDF/2011_143.pdf)

Inre the second blog post link, it’s generally a bad idea to evaluate
polynomials on an interval in terms of the monomial basis.

But anyhow, on the subject of the Remez algorithm, see
[http://www.chebfun.org/publications/remez.pdf](http://www.chebfun.org/publications/remez.pdf)
[http://www.chebfun.org/examples/approx/BestApprox.html](http://www.chebfun.org/examples/approx/BestApprox.html)

Note that while a standard Chebyshev polynomial approximant isn’t quite
“optimal” for any particular function in an L∞ norm minimax sense, it’s much
easier to find, and is often better for most values of x. See Trefethen’s book
for details, and also the Chebfun guide,
[http://www.chebfun.org/docs/guide/guide04.html](http://www.chebfun.org/docs/guide/guide04.html)
or this paper has a nice example picture under “myth 3”
[https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf](https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf)

~~~
JacobiX
The paper on Barycentric Lagrange Interpolation does a pretty good job of
explaining clearly how to evaluate the polynomial efficiently.

------
drallison
It's always interesting to review numerical mathematics at this level. If this
interests you, you may want to read Cornelius Lanczos's book _Applied
Analysis_ and Hank Warren's book _Hacker's Delight_, as well as the classic
MIT HAKMEM report available on the Web. On machines with highly asymmetric
architectures it's sometimes worthwhile to use a superoptimizer approach
([https://en.wikipedia.org/wiki/Superoptimization](https://en.wikipedia.org/wiki/Superoptimization)).
See the classic paper by Alexia Massalin and the GNU implementation.

------
mempko
I love how the context is embedded systems, yet python is used to demonstrate
the technique. Are people really that bad at reading C or C++?

~~~
jason_s
The author (me) frankly hates C and C++. I haven't touched them for use on a
PC since 2009. However, I use them every day in my job working on dsPIC motor
control, because I have no better options.

When I am trying to teach someone about algorithms, I turn to Python because
it is more readable, quicker for me to write, and platform-independent, and
because I can incorporate it into an IPython (now Jupyter) Notebook and
postprocess that into my blog articles with a minimum of hiccups.

~~~
3pt14159
Have you checked out Nim? I've been using it in production for a couple of
small programs and I've had no issues. I've also been able to bind it to
higher-level languages using the FFI:

[http://nim.community/answers/create_a_shared_library/](http://nim.community/answers/create_a_shared_library/)

Although I haven't figured out the garbage collection part of the shared
object story yet. It is a lovely language and it's reaching 1.0 sometime this
year.

~~~
jason_s
I've heard of it. Rust is probably the most promising to me.

Unfortunately, just because a language exists doesn't mean it is usable on a
particular embedded architecture. The processor core and system libraries
need support.

~~~
mindslight
Rust works fine with Cortex-M3/M4 (a bit larger than dsPIC, and their
documentation isn't as nice as Microchip's). Library support isn't great
though - try zinc.rs or roll your own. An STM32F4 Discovery board is $15. Just
in case you didn't have enough to play with...

~~~
mokus
Shameless self-plug coming here, but it's pretty directly relevant: I've
recently started tinkering with Rust on a handful of STM32 "discovery" boards.
I'm currently focusing on building a low-level environment to use as a
foundation for higher-level HAL projects like zinc. The code is still pretty
rough around the edges and evolving fast, but I'm open to collaboration and
willing to help deal with breakage due to the current pace of change, as time
permits, if anyone just wants to build something on it.

[https://github.com/mokus0/stm32.rs](https://github.com/mokus0/stm32.rs)

~~~
mindslight
I was/am doing something similar (haven't touched it in a bit though).

I avoided putting anything in linker scripts that didn't _have_ to be there
(IIRC, just memory blocks and VTOR, for an assembly startup routine that can
handle flash or RAM). Peripheral addresses stayed in Rust so that the
optimizer could see them (e.g. when writing to RA and then RB, the high u16
stays the same).

BTW F103C8 "dev boards" are available from China for $5.

~~~
mokus
zinc's "ioregs" macro system takes care of that sort of optimization quite
nicely these days, actually. For each peripheral register, one of the things
it generates is an "update" structure that the optimizer can eliminate, which
allows you to chain updates to multiple fields into a single write. For
example, in my stm32f4-discovery "blinky" example it combines the bsrr writes
that flip two LEDs into a single write (I've inspected the generated asm to
verify this).

EDIT: never mind, I realize I misunderstood; you're referring to the optimizer
being able to see that both register addresses have the same high 16 bits. I
thought you were referring to the actual interactions with the register
contents. Yeah, that makes sense.

EDIT 2: for what it's worth, I did a quick test just now and found that moving
the addresses into the source actually made the generated code significantly
worse. I'm just tinkering around on a lunch break and I need to get back to
real work, so I can't investigate too deeply, but the two biggest factors I
see are that (1) it simply doesn't perform the optimization you describe, and
(2) it makes things even worse because, rather than addressing relative to the
peripheral base address, it loads a complete new constant address for every
operation.

~~~
mindslight
I just remembered, hardcoded addresses are actually even more important for
utilizing "bit banding" for atomic writes to peripheral registers. Not that
you couldn't find a way to do this with addresses in the linker script, but
it's more straightforward to handle efficiently with compiler optimization.
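
The address arithmetic is simple enough to show (Python just for illustration;
the addresses are the standard Cortex-M3/M4 peripheral bit-band mapping):

    # Each bit of the 1 MB peripheral region at 0x4000_0000 has a word
    # alias in the 32 MB region at 0x4200_0000; a write to the alias
    # word atomically sets or clears just that one bit.
    def bitband_alias(reg_addr, bit):
        return 0x42000000 + (reg_addr - 0x40000000) * 32 + bit * 4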

(And if you aren't atomically writing to control registers when interrupts are
enabled, you are most likely building on a foundation of nondeterministic
bugs. Rust's linear types kind of inform you of this, but to do anything
useful the temptation is to unsafely share across threads.)

re edit 2: Really? I'd obviously looked at my generated code and saw the
expected improvement, so we must be doing something different.

~~~
mokus
Re edit 2: I'm pretty new at Rust, so it's entirely possible I did something
in an unusual way that caused the compiler to emit worse code than it would
otherwise. My quick hack to test was to declare the registers as &mut, and
assign them by casting a u32 literal to *mut, dereferencing, and taking &mut
of that. It was just the first thing that actually compiled ;). I also tried
directly transmuting the u32 to &mut and got the same output.

It's also possible something has changed in rustc since you were playing with
it.

~~~
mindslight
It's very possible rustc has changed, but I hope not because the generated
code was looking pretty good.

My basic register write is volatile_store(($base as *mut u8).offset(idx as
isize) as *mut u32, v); $base is just a hex literal address. There is an
extra cast because I defined 'idx' in terms of bytes.

I might not have been concerned about the absolute smallest code when I
accessed $base+4 and then $base+8 (use of a small offset vs. loading a second
constant), but I definitely observed elimination of redundant loads of the
high half between different registers.

I was compiling with "-O -Z no-landing-pads -C lto".

------
jacobolus
Folks interested in this topic should look at Chebfun,
[http://www.chebfun.org](http://www.chebfun.org)

Also at Lloyd Trefethen’s book Approximation Theory and Approximation
Practice, of which the first 6 chapters are available online,
[http://www.chebfun.org/ATAP/](http://www.chebfun.org/ATAP/)

The author of this blog post might as well have used Clenshaw’s algorithm,
which he mentioned was suggested by Numerical Recipes. It’s like 4–5 lines of
code, fast, and has been proven numerically stable.

If bk1 and bk2 are the previous two terms of the recurrence, and c is the
coefficient array (lowest degree first), a function to compute y = f(x) for x
in [-1, 1] looks like:

    
    
      def clenshaw(c, x):
          # c[k] is the coefficient of T_k(x); x must lie in [-1, 1]
          x2 = 2*x
          bk2 = bk1 = 0
          for ck in c[:0:-1]:          # c[n] down to c[1]
              bk2, bk1 = bk1, ck + x2*bk1 - bk2
          return c[0] + x*bk1 - bk2
    

(Note: There might be some typos there, I didn’t try running that. Here’s a
vectorized Matlab version
[https://github.com/chebfun/chebfun/blob/development/%40chebt...](https://github.com/chebfun/chebfun/blob/development/%40chebtech/clenshaw.m)
with coefficients in that version ordered from highest to lowest)

Or if he wanted an easy Python tool, he could have just used this (partial)
Python port of Chebfun (though I guess it dates from around the same time as
his post):
[https://github.com/olivierverdier/pychebfun](https://github.com/olivierverdier/pychebfun)

I’m not convinced by his method of going from measurements on an arbitrary
grid to Chebyshev coefficients via weighted least squares. If his measurement
points are reasonably clustered near the endpoints of his interval, he could
just use an exactly interpolating polynomial and use the barycentric
interpolation formula to get the values at the Chebyshev points (see
Trefethen’s book). His approach might also be fine; I haven’t thought deeply
about it. Also see
[http://www.chebfun.org/examples/approx/RationalInterp.html](http://www.chebfun.org/examples/approx/RationalInterp.html)
[http://www.chebfun.org/examples/approx/EquispacedData.html](http://www.chebfun.org/examples/approx/EquispacedData.html)
[http://www.chebfun.org/examples/approx/BestL2Approximation.h...](http://www.chebfun.org/examples/approx/BestL2Approximation.html)

The neat thing about Chebyshev polynomials is that it’s possible to convert
from values at the Chebyshev points (either extrema or roots of the Chebyshev
polynomial) to Chebyshev coefficients by using an FFT, which takes O(n log n)
time. Along with various other efficient algorithms for differentiation,
integration, root finding, solving differential equations, etc., Chebfun makes
it practical to deal with polynomials of very high degree, and do lots of neat
stuff with them.
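
As a sketch of that values-to-coefficients step: mirroring the samples turns
the cosine transform into an ordinary real FFT (coefficients ordered lowest
degree first; my illustration, not Chebfun’s code):

    import numpy as np

    def cheb_coeffs(fvals):
        # fvals[j] = f(cos(pi*j/n)), j = 0..n (the Chebyshev extrema).
        n = len(fvals) - 1
        # Even, periodic extension of length 2n; its real FFT is the
        # discrete cosine transform we need.
        ext = np.concatenate([fvals, fvals[-2:0:-1]])
        c = np.fft.rfft(ext).real / n
        c[0] /= 2
        c[-1] /= 2
        return c        # c[k] multiplies T_k(x)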

This blog post’s statement that “higher degree polynomials can have problems
with numerical accuracy” turns out to be a myth:
[https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf](https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf)

~~~
jason_s
>This blog post’s statement that “higher degree polynomials can have problems
with numerical accuracy” turns out to be a myth

(author here) I should have qualified or corrected that statement in my
article. There are a few reasons I rarely use n>5:

\- when polynomial approximation is appropriate, high-degree polynomials are
usually unnecessary on embedded systems (0.1% accuracy is usually sufficient)

\- needing one is usually a sign that polynomial approximation is being
misapplied and that there are better ways (range reduction, transformations
like 1/f(x), etc.)

\- with least squares, computation of the coefficients can have problems with
numerical accuracy. When I've used least-squares fitting of polynomials with
n>10, I often get a complaint from my numerical tool (MATLAB or Python) that
the matrix in question is ill-conditioned. That may be because I used the
Vandermonde matrix rather than Chebyshev polynomials, or because I used a
large number of points. Chebyshev approximation doesn't have the same problem,
I don't think, and I should not have mapped my experience with least-squares
fitting onto Chebyshev approximation (see the quick check below).

Evaluation of the polynomials themselves should be fine (after all, we can
easily compute a 20th-degree Chebyshev polynomial).
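
To make the conditioning point concrete, here's a quick numpy check (my
numbers, not from the article):

    import numpy as np

    x = np.linspace(-1, 1, 200)
    V = np.vander(x, 21)                           # monomial basis, degree 20
    T = np.polynomial.chebyshev.chebvander(x, 20)  # Chebyshev basis, degree 20

    print(np.linalg.cond(V))  # huge: monomial-basis least squares is ill-conditioned
    print(np.linalg.cond(T))  # modest: the Chebyshev design matrix behaves well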

~~~
jacobolus
For fitting arbitrary spaced data, you might also try using a rational
interpolant, e.g. Floater & Hormann’s
[http://www.inf.usi.ch/hormann/papers/Floater.2007.BRI.pdf](http://www.inf.usi.ch/hormann/papers/Floater.2007.BRI.pdf)
and then afterward approximating that by a Chebyshev polynomial of somewhat
higher degree for evaluation or other operations like taking derivatives or
finding roots etc. (IIRC Trefethen’s book explains this idea, and the last few
links in my comment above are also related).
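
The d = 0 member of that family is Berrut’s rational interpolant, which is
just the barycentric formula with weights (-1)^k and has no poles on the real
line for any choice of nodes; a minimal sketch:

    import numpy as np

    def berrut(xk, fk, x):
        # Floater-Hormann with d = 0: rational interpolant through
        # (xk[j], fk[j]) at arbitrary real nodes xk, evaluated at x.
        w = (-1.0) ** np.arange(len(xk))
        d = x - xk
        if np.any(d == 0):
            return fk[int(np.argmax(d == 0))]
        t = w / d
        return np.dot(t, fk) / np.sum(t)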

------
lexcorvus
Incidentally, Chebyshev is pronounced "cheh-bee-shoff", not (as is commonly
heard) "cheh-bee-shev".

~~~
jason_s
Чебышёв!

------
daniel-levin
Shameless plug: here is my GPL'd MATLAB implementation of Chebyshev
approximation [1]

[1] [https://github.com/daniel-levin/numerical-
methods-3/blob/mas...](https://github.com/daniel-levin/numerical-
methods-3/blob/master/function-approximation/chebyshevapproximate.m)

------
srean
If this interests you, may I suggest Pade approximation as another good tool
to have in your tool chest.

~~~
daniel-levin
You can use both at the same time for extremely nice results. Compute a
truncated Chebyshev polynomial approximation (i.e. the first few terms) and
then apply Pade approximation [1]. The cool part about this is that you can
prescribe your maximum allowed error (between the exact solution and the
approximation) ahead of time, and compute the order of the Pade approximation
_that under all circumstances guarantees_ that your error will be bounded!

[1]
[http://siber.cankaya.edu.tr/NumericalComputations/Fall2004/c...](http://siber.cankaya.edu.tr/NumericalComputations/Fall2004/ceng375/node69.html)
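
As a toy illustration of the Pade step alone (from a truncated Taylor series
rather than a Chebyshev expansion, using scipy's pade helper):

    from math import factorial
    from scipy.interpolate import pade

    # Taylor coefficients of exp(x) about 0, then a [3/2] Pade approximant.
    an = [1 / factorial(k) for k in range(6)]
    p, q = pade(an, 2)        # q has degree 2, p has degree 3

    print(p(0.5) / q(0.5))    # ~1.64872, close to exp(0.5) = 1.64872...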

~~~
jason_s
Is there a way to do Pade approximation but with continued fractions, so you
can just truncate the series as necessary? I tried looking into this a few
months ago, but my head just started to hurt.

------
weirdtunguska
I used Chebyshev approximation in my thesis, connecting a boundary value
problem to a way of solving it using the Fourier transform. Very, very useful
when solving a system of equations that can diverge if an iterative solver is
used.

------
lifeisstillgood
Oh God. It's like I thought I knew how to read and suddenly the comic books
were replaced with Shakespeare and I realise there is a huge mountain to
climb.

Somehow my children must not grow up as mathematically stunted as I am.

~~~
jason_s
Heh. Two immediate comments:

\- It's not as bad as that. Chebyshev approximation + other techniques for
evaluating mathematical functions are pretty easy to learn.

\- You're absolutely right. Welcome to mathematics! The mountain goes
infinitely high. There are tons of technical papers I'll never be able to
digest simply because they've compressed lots of thoughts into cryptic-looking
but standard notation.

~~~
lifeisstillgood
Thanks :-) I'll get back to you once I have mastered long division!

~~~
jason_s
Ironically, that's one of the few useless parts of mathematics, and the least
favorite part of my elementary school education.

Suggest group theory or linear algebra instead. ;-)

------
juskrey
How to overfit a function, and lose all money, friends and influence at once
(after a happy while).

