
The Linear Algebra Mapping Problem - furcyd
https://arxiv.org/abs/1911.09421
======
StefanKarpinski
Whether you want a language to do this or not is a fundamental matter of
design philosophy. I would be quite unhappy if a language changed
multiplication by the inverse of a matrix into backslash without my
permission: that is not the computation I told you to do. In a language that
did that automatically, how would you check the difference in the result
between using the inverse and using backslash? You would try to compare them
and conclude that they produce identical results, because the system had
surreptitiously changed both computations to do the same thing behind your
back. Then you’d get user reports about this and have to add a way to opt out
of the optimization when it wasn’t wanted. And of course that’s not the only
time you want to suppress an optimization like this—sometimes it will give
incorrect results. Optimizations should either be truly indistinguishable from
the correct answer or be opt-in.
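
The difference is easy to demonstrate in any numerical language. Here is a small NumPy sketch (my own illustration, not from the comment), using a Hilbert matrix as a classic ill-conditioned example:

```python
import numpy as np

n = 12
# Hilbert matrix: a classic, badly conditioned test problem.
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
b = A @ np.ones(n)

x_inv = np.linalg.inv(A) @ b     # "multiply by the inverse"
x_solve = np.linalg.solve(A, b)  # "backslash": a factorization-based solve

res_inv = np.linalg.norm(A @ x_inv - b)
res_solve = np.linalg.norm(A @ x_solve - b)
# The two residuals generally differ; the solve is the backward-stable one.
print(res_inv, res_solve)
```

If the language silently rewrote the first line into the second, this comparison would be impossible to make.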

In Julia we take a very different approach: make it as easy and convenient as
possible to ask for the efficient version, but not automatic. For example, you
can wrap a matrix that you know to be symmetric in the `Symmetric` wrapper
type and leave the rest of the code the same and dispatch will call
specialized BLAS and LAPACK routines for symmetric matrices. This approach was
designed by people who live and breathe numerical linear algebra, and it
allows them to write the naive version of an algorithm and then tweak it to
use the most efficient available kernels with minimal effort, without
fundamentally changing the shape of their code.
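
A rough analog of this explicit opt-in exists even in NumPy, where you encode your knowledge of the matrix by choosing the specialized routine yourself, e.g. `eigvalsh` instead of the generic `eigvals` for a matrix you know is symmetric (an illustrative sketch, not Julia's dispatch mechanism):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((300, 300))
S = M + M.T                          # symmetric by construction

# Generic path: makes no structural assumption about S.
w_generic = np.linalg.eigvals(S)

# Explicit opt-in: eigvalsh *assumes* symmetry and dispatches to a
# specialized LAPACK routine -- the caller asks for the fast path.
w_sym = np.linalg.eigvalsh(S)

# Same spectrum, but only because we promised symmetry.
same = np.allclose(np.sort(w_generic.real), np.sort(w_sym))
print(same)
```

The Julia `Symmetric` wrapper carries the same promise through the whole program via dispatch, rather than at a single call site.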

What that means for the evaluation in this paper is that while the naive code
may not do the fastest, cleverest thing, a very small variation on the naive
code will, using optimized kernels. You ask for this explicitly, but it’s easy
and convenient to ask for. We’ve found this to be a very pragmatic and
effective approach: it keeps optimization explicit but easy.

It would be an easy layer on top of that to add a code analyzer to look for
patterns that might be possible to optimize further, but that seems like a
problem that is better suited for tooling like a linter than being baked into
the language itself.

~~~
fyp
The paper also talks about reordering matrix chain multiplications by
automatically adding parentheses (despite the fact that matrix multiplication
in floating point is non-associative, so reordering can change the result).

Is there a way to achieve this in Julia? If not what syntax might you
introduce to do it? You can't really annotate just the matrices or
multiplication operators since you are optimizing a whole expression.

~~~
photon-torpedo
Matrix chain multiplication can be seen as a special case of tensor
contraction, and the
[https://github.com/Jutho/TensorOperations.jl](https://github.com/Jutho/TensorOperations.jl)
package has an implementation which tries to optimize the contraction order
(the @tensoropt macro).
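
For comparison, NumPy already ships one design answer to this: `numpy.linalg.multi_dot` takes the whole chain as a list, so the opt-in applies to the entire expression rather than to individual matrices or operators (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 1000))
B = rng.standard_normal((1000, 10))
C = rng.standard_normal((10, 1000))

# Left-to-right is cheap here: (A @ B) is only a 10x10 intermediate,
# while A @ (B @ C) would form a 1000x1000 intermediate.
left_to_right = (A @ B) @ C

# multi_dot sees the whole chain and picks the parenthesization itself,
# via the classic matrix-chain dynamic program on the dimensions.
optimal = np.linalg.multi_dot([A, B, C])

agree = np.allclose(left_to_right, optimal)
print(agree)
```

Passing the chain as a single call is what makes whole-expression optimization expressible without annotating the operands themselves.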

------
mratsim
The follow-up paper should include the deep learning and linear algebra
compilers:

- Halide

- Tiramisu

- nGraph

- Weld

- PyTorch JIT & PyTorch Glow and Tensor Comprehensions

- TVM

- Futhark

- A toy MLIR language built on the linalg dialect. I'm sure Google would love
to help them.

to name the main ones.

Everyone is moving towards a compiler approach.

I'm actively researching the domain and I think Halide is the most promising:
by separating the algorithm from the schedule you can leave the algorithm side
to the researchers and optimization is done by HPC devs or automated methods
without having to rewrite in C/Fortran.

Many more attempts in the past:

-
[https://github.com/mratsim/Arraymancer/issues/347#issuecomme...](https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890)

- My own domain-specific language and compiler:
[https://github.com/numforge/laser/tree/master/laser/lux_comp...](https://github.com/numforge/laser/tree/master/laser/lux_compiler/core)

    
    
There are also low_level_design_considerations.md and challenges.md documents
there that should be interesting.

Unfortunately I didn't go beyond the proof of concept as I need to revamp the
Nim multithreading runtime to improve composition of parallel linear algebra
kernels.

One thing for sure is that OpenMP is today something that needs to be worked
around due to its lack of composability, and the GCC implementation is a
contention point because it uses a single queue protected by a lock. While
traditional linear algebra algorithms can deal with static distribution and no
load balancing, some algorithms, for example beam search, would hugely benefit
from a more flexible task system.

~~~
Jhsto
How viable do you think SPIR-V is for computation? I am thinking of
researching (that is, creating my own) DSLs targeting SPIR-V.

~~~
mratsim
Good question.

I can't find a single Vulkan/SPIR-V AXPY (a*x + y) example, which is the
"Hello World" of linear algebra, so I'm not optimistic about ease of use.
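
For reference, AXPY is just y ← a·x + y, a one-liner in most numerical languages, which is what makes the absence of an equally simple SPIR-V example telling. A pure-NumPy sketch of what the BLAS routine computes:

```python
import numpy as np

def axpy(a, x, y):
    # BLAS-style AXPY: overwrite y with a * x + y and return it.
    y += a * x
    return y

x = np.arange(4, dtype=np.float64)   # [0, 1, 2, 3]
y = np.ones(4)
result = axpy(2.0, x, y)
print(result)                        # → [1. 3. 5. 7.]
```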

You can probably do texture hacks like what people did in 2006 when they
realized that GPUs were fast for linear algebra and OpenCL/CUDA did not exist
([https://igoro.com/archive/how-gpu-came-to-be-used-for-
genera...](https://igoro.com/archive/how-gpu-came-to-be-used-for-general-
computation/))

The best would probably be to follow this Halide issue and the corresponding
branches:
[https://github.com/halide/Halide/issues/2296](https://github.com/halide/Halide/issues/2296).
Note that Halide is a compiler, so it automates a lot of the boilerplate that
is probably needed to target Vulkan for compute compared to CUDA.

------
pfortuny
Looks like “you cannot automate every decision”, which is rather true and a
very relevant problem imho: I teach at a school of Engineering and it looks
like everything now boils down to “let the computer do it”.

~~~
commandlinefan
> you cannot automate every decision

Disturbingly too many people don’t seem to recognize the difference between
letting the computer do something they don’t have time to do and trusting the
computer to do something they don’t understand.

~~~
JadeNB
I think that people think that they are doing the former when they do the
latter: "I don't have time to understand this technique, so I'll let the
computer understand it for me."

------
vmchale
Numpy was quite difficult to get working efficiently. J actually outperformed
numpy pretty much any way I compiled it: [https://github.com/vmchale/matrix-
benchmarks](https://github.com/vmchale/matrix-benchmarks), even doing basic
stuff like multiplying two matrices.

------
ur-whale
Looks to me like the age-old standard problem of using higher abstractions and
expecting the machinery underneath to be smart enough to deploy it optimally
to the underlying hardware.

For similar instances of the same story, see: compilers, java, haskell, cuda,
etc...

News at 10: you always pay a performance price when you use high level
abstractions, aka you can't have your lunch and eat it too.

~~~
TheWizardofOdds
Theoretically you are correct, but taking the skills of the user into account,
I think the very opposite is true. In practice, hardly anyone has the skills
to actually implement anything high-performance anymore (especially in
numerics), and it is thus (in most cases) more performant to simply stick with
stuff like numpy.

In most cases, "working code, fast" is more important than "fast working
code".

~~~
commandlinefan
> "working code, fast" is more important than "fast working code"

That depends on who you ask. If you ask the people who are managing the
schedules, whose job reviews and bonuses are tied to meeting a specific date?
Yes, absolutely, performance is a "nice to have" as long as the dates don't
"slip". Now ask the users, and the number one complaint I hear most often is
"why is this thing so damned slow?"

~~~
TheWizardofOdds
> Now ask the users, and the number one complaint I hear most often is "why is
> this thing so damned slow?"

Good point. My viewpoint is rather focused on numerical linear algebra, since
I worked in that field and the article is about it. If you skim through the
paper mentioned in the article, you can e.g. compare numpy and Armadillo. You
will see that the speedups from using an (arguably) much more complex C++
framework instead of numpy are marginal and will not be visible to the user.
The increased production/maintenance costs due to the more complex code,
however, will be.

