
Outperforming LAPACK with C++ metaprogramming - okaleniuk
http://wordsandbuttons.online/vastly_outperforming_lapack_with_cpp_metaprogramming.html
======
costrouc
For those wondering, the LAPACK guys are working on a new version of LAPACK to
run on heterogeneous architectures, including GPUs. See their work at
[http://www.icl.utk.edu/research](http://www.icl.utk.edu/research)

I have worked with these guys and all I can say is good luck outperforming the
highly optimized routines they have written. My bet is that the guy writing
this blog used a non-optimized version of LAPACK.

~~~
stephencanon
There's no mystery here: the author is only solving a 5x5 system. LAPACK will
have barely finished checking function call parameters by the time a fully
specialized and inlined solver is finished. Also, from a cursory inspection,
the author does no pivoting [edit: worse, it just uses Cramer's rule, which is
insane], so their solver will be much less numerically robust (but this is
admittedly less of a problem for the 5x5 case).

The performance is completely expected--in fact, I would expect an optimized
5x5 solver to be _faster_ than this.
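
For reference, here is what Cramer's rule looks like on a 2x2 system - a
minimal sketch, not the article's code, and with no pivoting, which is exactly
the robustness issue mentioned above:

    #include <array>

    // Determinant of a 2x2 matrix.
    double det2(const std::array<std::array<double, 2>, 2>& m) {
        return m[0][0] * m[1][1] - m[0][1] * m[1][0];
    }

    // Solve A x = b by Cramer's rule: x_i = det(A_i) / det(A), where A_i is
    // A with column i replaced by b. It divides by det(A) with no pivoting,
    // so near-singular systems go badly wrong.
    std::array<double, 2> solve2(const std::array<std::array<double, 2>, 2>& a,
                                 const std::array<double, 2>& b) {
        const double d = det2(a);
        return {(b[0] * a[1][1] - a[0][1] * b[1]) / d,
                (a[0][0] * b[1] - b[0] * a[1][0]) / d};
    }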

~~~
electricslpnsld
A better comparison might be a C++ library like Eigen that is intended for use
on these small systems, that should optimize out function calls in a similar
way, and that uses intrinsics.

~~~
stochastic_monk
If your matrices are of arbitrary size, Blaze [0] would be my pick. It has the
best performance on their benchmarks, Baptiste Wicht's benchmarks, and my own.

Though if the matrices are guaranteed to be small (like in this example),
libxsmm [1] is specialized and highly optimized for this use case and beats
its competitors above.

And yes, it's absolutely essential to make sure you didn't just move
computations to compile-time.

[0]: [https://bitbucket.org/blaze-lib/blaze](https://bitbucket.org/blaze-lib/blaze)

[1]: [https://github.com/hfp/libxsmm](https://github.com/hfp/libxsmm)

------
barrkel
Please check with the array initialized from a file or another dynamic source.
With maximal inlining, there's a risk of constant propagation resulting in a
lot of computation happening at compile time rather than at run time.
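
One way to do that - a minimal benchmark-harness sketch, where solve5 is a
stand-in name for the routine under test, and the inputs come from a source
the compiler cannot see through:

    #include <cstdio>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        double a[5][5], b[5];
        // Read the system from a file named on the command line, so the
        // values are unknown at compile time and nothing can be folded.
        std::FILE* f = std::fopen(argv[1], "r");
        if (!f) return 1;
        for (auto& row : a)
            for (double& x : row)
                if (std::fscanf(f, "%lf", &x) != 1) return 1;
        for (double& x : b)
            if (std::fscanf(f, "%lf", &x) != 1) return 1;
        std::fclose(f);
        // ... time solve5(a, b) here, and print the result so the call
        // can't be dead-code-eliminated ...
        return 0;
    }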

~~~
dpwm
As an anecdote of just how powerful this can be, I once wrote a prototype of a
simulation (written in C) that used a lot of integer divisions and needed to
be run many times with different parameters.

As a quick and dirty hack that took about half an hour to implement, I decided
to just have the program as a template string, with the parameters substituted
in by a Python script that called GCC. To the amazement of a very sceptical
friend, the result was a 5x speedup.

A lot of this speedup was consistent with replacing the integer divisions with
multiplicative inverses, but also merging constants in a way that would have
made things quite unreadable. I was quite impressed with how far GCC had gone.
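
That division speedup is easy to reproduce: once a divisor becomes a
compile-time constant, GCC replaces the hardware divide with a multiply by a
precomputed "magic" reciprocal plus shifts. A sketch of the two cases:

    // Divisor known only at run time: compiles to an actual div instruction.
    unsigned div_runtime(unsigned x, unsigned d) { return x / d; }

    // Divisor baked in at compile time (what the template-string hack did):
    // compiles to a multiply by a magic constant plus shifts, no div at all.
    template <unsigned D>
    unsigned div_const(unsigned x) { return x / D; }

    // Force an instantiation to inspect, e.g. on godbolt.
    template unsigned div_const<7>(unsigned);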

Considering each instance of this simulation ran for tens of minutes, it was
well worth the compilation time of a few seconds. It kept the code really
simple, too.

~~~
pkaye
You should check out this presentation by Matt Godbolt about compiler
internals. He talks about what kinds of optimizations are done by GCC and
Clang. Pretty amazing stuff. For example, the compiler will figure out that
you are trying to write a popcount in high-level code and replace it with a
single popcount instruction.

[https://www.youtube.com/watch?v=bSkpMdDe4g4](https://www.youtube.com/watch?v=bSkpMdDe4g4)
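
The popcount example from the talk is a loop like this one. Clang (with
optimizations on and a target that has the instruction, e.g. -march=haswell)
recognizes the idiom and emits a single popcnt for the whole function:

    // Kernighan's bit-counting loop: each iteration clears the lowest set bit.
    int count_set_bits(unsigned v) {
        int count = 0;
        while (v) {
            v &= v - 1;
            ++count;
        }
        return count;
    }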

~~~
mpweiher
Amazing, yes. Scary, also.

If you're going to figure this out, please output a warning and ask me to call
the optimized intrinsic or something.

~~~
bigcheesegs
Why? The intrinsics aren't portable, the other code is.

~~~
dpwm
Even with intrinsics that do have a degree of portability, I prefer the more
explicit form of the bit counting just for readability. Nobody has to learn
what popcnt means.

These optimizations really mean that as software writers we can declare our
intent and have the compiler do safe optimizations.

It seems like these Clang optimizations are also quite insensitive to the many
different ways of writing the same thing that C++ has earned notoriety for.

~~~
mpweiher
> I prefer the more explicit form

Huh? It's not _explicit_, the purpose is _implicit_. You are writing down an
algorithm that _implements_ a popcount. The compiler then has to _infer_ the
fact that you actually wanted popcount (maybe you just wanted to waste a bit
of time in a timing loop?) and then substitute something based on that
inference.

I am all for it being _explicit_, i.e. some library function that has a name
that somewhere mentions "popcount" or some such. Then have the compiler
generate whatever code it needs for that function.
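
For what it's worth, that explicit spelling exists: __builtin_popcount as a
GCC/Clang extension, and, as of C++20, std::popcount in <bit>:

    #include <bit>
    #include <cstdint>

    int count_bits(std::uint32_t v) {
        // The intent is in the name; the compiler picks the best instruction.
        return std::popcount(v);
    }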

------
sdeer
It is somewhat unclear from his writing, but did he compare this
metaprogramming solution against OpenBLAS or ATLAS, or just the reference
BLAS/LAPACK implementation? If I were trying to do any significant linear
algebra computations, I would definitely try an optimized BLAS/LAPACK
solution first.

~~~
pletnes
Also, it seems the author solved a triangular system. LAPACK has special
routines for that. Were they used?
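
For reference, the dedicated routine is DTRTRS. A minimal sketch of calling it
through the LAPACKE C interface for an upper-triangular 5x5 system
(illustrative only, not the article's code):

    #include <lapacke.h>

    // Solve U * x = b where U is upper triangular; b is overwritten with x.
    // Returns 0 on success.
    int solve_upper_triangular_5x5(const double* u, double* b) {
        return LAPACKE_dtrtrs(LAPACK_ROW_MAJOR,
                              'U',    // upper triangular
                              'N',    // no transpose
                              'N',    // non-unit diagonal
                              5,      // n: order of U
                              1,      // nrhs: one right-hand side
                              u, 5,   // A and its leading dimension
                              b, 1);  // B and its leading dimension
    }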

LAPACK is typically optimized for larger systems, not 5 unknowns. Also, 5 is
not a great number for vectorized operations - it might even be beneficial to
zero/one pad the matrix.

Optimized LAPACK is often 5-10x faster than «basic LAPACK».

BLAS/LAPACK were originally written in the 70s/80s and sometimes unroll loops
to the tune of 5 or 7 - it made sense then. Not so much these days.

I didn’t read the article in detail, but there appear to be a lot of holes in
it.

~~~
sevensor
Personally, my biggest reason to prefer LAPACK in general is that its authors
have already put a great deal of effort into correctness and numerical
stability, so I don't have to. Even basic LAPACK is pretty fast, let alone the
optimized libraries. Hand-optimizing my own special case is an absolute last
resort.

~~~
pletnes
Yes, Cramer’s rule (referred to in the article) does not give numerically
stable results AFAIK - which is yet another can of worms. Good catch.

~~~
thanatropism
I don't think I have heard Cramer's rule mentioned without the caveat "but
don't use this for real problems".

~~~
pletnes
It’s useful for certain theoretical proofs in analytic math - or so I’ve
heard. Not all math is well suited to be coded into reliable software.

------
bluescarni
In his first factorial example, he could've avoided templates altogether with
constexpr:

[https://godbolt.org/g/ZaoMNa](https://godbolt.org/g/ZaoMNa)

Or just use a template specialisation:

[https://godbolt.org/g/TVkHMd](https://godbolt.org/g/TVkHMd)
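
For those who don't want to click through, the two versions are presumably
along these lines:

    // constexpr: no templates needed, evaluated at compile time in
    // constant expressions.
    constexpr unsigned long factorial(unsigned n) {
        return n <= 1 ? 1 : n * factorial(n - 1);
    }

    // Template specialisation: the classic metaprogramming formulation.
    template <unsigned N>
    struct Factorial {
        static constexpr unsigned long value = N * Factorial<N - 1>::value;
    };

    template <>
    struct Factorial<0> {
        static constexpr unsigned long value = 1;
    };

    static_assert(factorial(5) == 120, "");
    static_assert(Factorial<5>::value == 120, "");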

~~~
ZenoArrow
Thanks for sharing the godbolt site; I can see myself using it for debugging.

~~~
andrewjw
The author of the site (Matt Godbolt) is in general pretty great.

------
stabbles
This article makes no sense at all. It compares a numerically unstable O(n!)
algorithm (Cramer's rule) to a numerically stable O(n^3) algorithm (Gaussian
elimination).

~~~
sannee
What bothers me more is the fact that the article seems to compare LAPACK's
function for solving any-sized systems of equations with a specialized
function for solving 5x5 systems.

I mean, it's a neat trick and all, but the comparison with LAPACK does not
really stand on equal ground.

------
drwells
The proposed C++ tricks are useful, and I think it's reasonably well-known (at
least to people who do numerics) that a C++ compiler, with the right flags,
can beautifully unroll loops when array sizes are known at compile time.
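
For instance, something as plain as this gets fully unrolled (and usually
vectorized) at -O2, since the trip count is a compile-time constant - a
sketch, not code from the article:

    #include <array>
    #include <cstddef>

    double dot5(const std::array<double, 5>& a, const std::array<double, 5>& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)  // trip count known: 5
            s += a[i] * b[i];
        return s;
    }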

> brute force Cramer's rule solution

This isn't really a fair comparison: trusty old DGESV will yield far more
accurate results for badly conditioned problems since it does row swaps
(partial pivoting). It might also rescale values; I don't remember the
details. It would be much more reasonable to compare against a hypothetical
DGEC (double-precision general-matrix Cramer's rule) routine.

------
jordigh
D's Mir does the same sort of thing, and metaprogramming in D is much nicer,
because it was designed from the ground up rather than being the bug elevated
to a feature that metaprogramming is in C++.

[http://docs.mir.dlang.io/latest/index.html](http://docs.mir.dlang.io/latest/index.html)

[http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/glas-gemm-benchmark.html](http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/glas-gemm-benchmark.html)

~~~
tzahola
>and metaprogramming in D is much nicer, because it was designed from the
ground up rather than being the bug elevated to a feature that metaprogramming
is in C++

Yet neither of them is as elegant as Lisp macros.

~~~
bachmeier
I'm not sure how you're measuring elegance, but writing Lisp macros and
getting them to work correctly has never been something I'd call elegant. You
have the Common Lisp approach, where you work in a complicated sublanguage
like this example from PCL:

    
    
      (defmacro when (condition &rest body)
        `(if ,condition (progn ,@body)))
    

and then you think hard about where the leaks are. Or else you take the Scheme
approach of hygienic macros, which are even less elegant. Macros in Lisp are
powerful, but I'm not sure metaprogramming in Lisp is better than in a
language like D.

~~~
sedachv
> You have the Common Lisp approach, where you work in a complicated
> sublanguage like this example from PCL:

I don't think you actually understand anything about the example code you
posted. There is no special "complicated sublanguage" - the backquote[1]
an extension of the quote shorthand notation[2] for list literals. It is for
building data structures and has nothing to do with "sublanguages" or macros.
There is no special syntax for macros in Common Lisp. Common Lisp macros do
not produce source code - they produce any type of Common Lisp objects
directly. That is why they are simpler and strictly more powerful than D
templates.

When it comes to templates, I wouldn't call anything from this list of D
examples[3] elegant.

[1] [http://www.lispworks.com/documentation/HyperSpec/Body/02_df.htm](http://www.lispworks.com/documentation/HyperSpec/Body/02_df.htm)
[2] [http://www.lispworks.com/documentation/HyperSpec/Body/s_quote.htm](http://www.lispworks.com/documentation/HyperSpec/Body/s_quote.htm)
[3] [https://dlang.org/templates-revisited.html](https://dlang.org/templates-revisited.html)

~~~
bachmeier
> I don't think you actually understand anything about the example code you
> posted.

This doesn't add anything to the conversation. If you want to discuss my post,
I'm happy to discuss. You're clearly upset that I didn't praise Common Lisp,
but things like that don't lead to useful discussion.

> There is no special "complicated sublanguage" - the backquote[1] is an
> extension of the quote shorthand notation[2] for list literals. It is for
> building data structures and has nothing to do with "sublanguages" or
> macros. There is no special syntax for macros in Common Lisp.

I've never seen anyone write macros using only s-expressions. True, you don't
need to learn any special syntax to do it, but that's not how anyone writes
macros.

> When it comes to templates, I wouldn't call anything from this list of D
> examples[3] elegant.

How is that relevant to my comment?

~~~
sedachv
> You're clearly upset that I didn't praise Common Lisp, but things like that
> don't lead to useful discussion.

> How is that relevant to my comment?

No, I am upset that you are posting misinformed comments about Common Lisp
macros vs D templating while representing yourself as being knowledgeable
about the subject. You clearly do not understand how the former works.

> I've never seen anyone write macros using only s-expressions. True, you
> don't need to learn any special syntax to do it, but that's not how anyone
> writes macros.

Please stop pretending that you have enough experience with Common Lisp to say
things like "I've never seen anyone do X." Writing macros that call functions
that return code is a basic technique - Graham's _On Lisp_ is full of
examples. Saying that quoting as a way of specifying list literals is a
"complicated sublanguage" is criticism on the level of "C++ templates have too
many pointy brackets."

------
vog
I wonder whether similar tricks work in other languages with similarly
optimizing compilers but "nicer" metaprogramming, such as Common Lisp, Rust,
and perhaps OCaml (with camlp4)?

Even more interesting would be whether this leads to the same huge performance
gains in those languages as well.

~~~
ChrisRackauckas
It does lead to performance gains. This kind of stuff is done all the time in
Julia's numerical codes. Flexible macros definitely make this easier, and
multiple dispatch + automatic function specialization on types + inlining +
interprocedural optimizations makes a lot of what's discussed here automatic.
If it's not automatic, then libraries like StaticArrays.jl make functions that
use small arrays do these kinds of tricks automatically (they overload and
unroll operations to match the size of the array, which is compile-time
information). I am sure other languages can do this too, but
having it semi-automatic is pretty cool if what you do all day is write
numerical codes.

~~~
uadjet
Julia's flexibility in the numerical space is why it's my daily driver
language for prototyping and implementation nowadays. Truly a fantastic
language.

------
1024core
Template metaprogramming just shifts the burden of computation to compile
time. Sure, I could pre-compute the first 10000 factorials, put them in an
array, compile, and then print the answer in constant time. But it'll work for
the first 10000 numbers only.
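
A sketch of that tradeoff, with the count trimmed to 21 entries since 21!
already overflows 64 bits:

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Build the whole lookup table at compile time (C++17).
    constexpr std::array<std::uint64_t, 21> make_factorials() {
        std::array<std::uint64_t, 21> t{};
        t[0] = 1;
        for (std::size_t i = 1; i < t.size(); ++i)
            t[i] = t[i - 1] * i;
        return t;
    }

    constexpr auto factorials = make_factorials();
    static_assert(factorials[20] == 2432902008176640000ULL, "");
    // factorials[21] doesn't exist: exactly the limitation described above.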

~~~
ska
It's not nearly that simple. You can use template metaprogramming to make
expressions both easy to read and efficient, by avoiding the creation of
temporary objects, and through fusion, lifting, etc.

In some domains (e.g. linear algebra) this can be very powerful, because you
don't have to write non-idiomatic expressions just to feed the compiler things
it can understand.
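
A bare-bones sketch of the idea, in the spirit of (but vastly simpler than)
what Eigen or Blaze do: a + b + c builds a lightweight expression object, and
a single fused loop runs only at evaluation time, with no temporary vectors.

    #include <array>
    #include <cstddef>
    #include <type_traits>

    using Vec = std::array<double, 5>;

    // A lazy node representing "l + r"; no arithmetic on construction.
    template <typename L, typename R>
    struct Sum {
        const L& l;
        const R& r;
        double operator[](std::size_t i) const { return l[i] + r[i]; }
    };

    // Restrict operator+ to our own types so it doesn't swallow everything.
    template <typename T> struct is_expr : std::false_type {};
    template <> struct is_expr<Vec> : std::true_type {};
    template <typename L, typename R> struct is_expr<Sum<L, R>> : std::true_type {};

    template <typename L, typename R,
              typename = std::enable_if_t<is_expr<L>::value && is_expr<R>::value>>
    Sum<L, R> operator+(const L& l, const R& r) { return {l, r}; }

    // Only here does any work happen: one fused loop, no temporary Vecs.
    template <typename Expr>
    Vec eval(const Expr& e) {
        Vec out{};
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] = e[i];
        return out;
    }

    // usage: Vec x = eval(a + b + c);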

------
xvilka
Modern Fortran (especially the upcoming Fortran 2018 standard) is already good
enough for those kinds of libraries. Moreover, big chunks of the computations
in BLAS/LAPACK are implemented in assembly. What surprises me is that the
authors of those libraries are still stuck in Fortran 90.

------
virgilp
[edit] I had a bug in my code; it does seem that the compiler is not picking
up constants at compile time. That is to say, it's not precomputing the final
value at compile time based on the values from those matrices.

Here's how I modified the code to make sure:
[https://pastebin.com/iG21pTiP](https://pastebin.com/iG21pTiP)

~~~
gmueckl
There needs to be a j in the exit condition of the inner loop of the clobber
function.

------
bwasti
In C++17 you can get generic loop-unrolling templates in 8 lines of code:
[https://godbolt.org/g/2Di4PM](https://godbolt.org/g/2Di4PM)
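
Probably something close to this (a guess at the linked snippet, not a copy):
a C++17 fold expression over an index sequence expands the body once per
index, with no runtime loop.

    #include <cstddef>
    #include <utility>

    template <typename F, std::size_t... Is>
    void unroll_impl(F&& f, std::index_sequence<Is...>) {
        (f(Is), ...);  // fold: the body is instantiated once per index
    }

    // Calls f(0), f(1), ..., f(N-1), fully unrolled at compile time.
    template <std::size_t N, typename F>
    void unroll(F&& f) {
        unroll_impl(std::forward<F>(f), std::make_index_sequence<N>{});
    }

    // usage: unroll<5>([&](std::size_t i) { out[i] = a[i] * b[i]; });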

------
skierscott
There’s also a closed form for the Fibonacci numbers. It runs in O(1).

[http://stsievert.com/blog/2015/01/31/the-mysterious-eigenvalue/](http://stsievert.com/blog/2015/01/31/the-mysterious-eigenvalue/)
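
The closed form in question is Binet's formula: F(n) = (phi^n - psi^n) /
sqrt(5), with phi = (1 + sqrt(5))/2 and psi = (1 - sqrt(5))/2. A sketch in
C++ - a constant number of floating-point ops, though only exact up to around
n = 70 in double precision:

    #include <cmath>
    #include <cstdint>

    std::uint64_t fib(unsigned n) {
        const double sqrt5 = std::sqrt(5.0);
        const double phi = (1.0 + sqrt5) / 2.0;
        const double psi = (1.0 - sqrt5) / 2.0;
        // Round to the nearest integer to absorb floating-point error.
        return static_cast<std::uint64_t>(
            std::round((std::pow(phi, n) - std::pow(psi, n)) / sqrt5));
    }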

------
blt
Others have already mentioned Cramer's rule, but even if this code were
switched to Gaussian elimination, it might not beat LAPACK on big problems.
LAPACK implementations do extra tricks to optimize cache usage on big
matrices. It would be interesting to include Eigen in the benchmark too.

~~~
bachmeier
I've read your comment several times and can't follow it. I think you're
saying that even if the author continued to use Cramer's rule, he would lose
to LAPACK on big problems.

~~~
AnimalMuppet
No. If he continued to use Cramer's rule, he would definitely lose to LAPACK
on big problems, _purely because of using Cramer's rule_. But even if he
didn't continue to use Cramer's rule (if he switched to Gaussian elimination),
he still might lose to LAPACK, because LAPACK is pretty well optimized.

At least, that's how I interpreted the comment...

~~~
sannee
> he would definitely lose to LAPACK on big problems, purely because of using
> Cramer's rule.

More importantly though, the code would probably take ages (as in, "age of the
universe" ages) to compile, as Cramer's rule is O(n!) with a naive
implementation. I think one could get to O(n^4) if the determinants were
computed using LU factorization, but then what's the point of Cramer's
rule...

