In Julia we take a very different approach: make it as easy and convenient as possible to ask for the efficient version, but not automatic. For example, you can wrap a matrix that you know to be symmetric in the `Symmetric` wrapper type and leave the rest of the code the same; dispatch will then call specialized BLAS and LAPACK routines for symmetric matrices. This approach was designed by people who live and breathe numerical linear algebra, and it allows them to write the naive version of an algorithm and then tweak it to use the most efficient available kernels with minimal effort, without fundamentally changing the shape of their code.
What that means for the evaluation in this paper is that while the naive code may not do the fastest, cleverest thing, a very small variation on the naive code will do the fastest thing, using optimized kernels. You ask for this explicitly, but it’s easy and convenient to ask for. We’ve found this to be a very pragmatic and effective approach. It keeps optimization explicit but easy.
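The same "explicit but easy" opt-in exists in NumPy, for readers more familiar with Python: the general eigensolver works on any matrix, while the symmetric-specialized routine is one function name away. This is only an analogy, not the Julia mechanism itself (which works through dispatch on the `Symmetric` wrapper type):

```python
import numpy as np

# Build a symmetric matrix.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))
s = (a + a.T) / 2  # symmetric by construction

# General-purpose solver: works, but cannot exploit symmetry.
w_general = np.linalg.eigvals(s)

# Explicitly opting in to the symmetric LAPACK path is a tiny change,
# much like wrapping the matrix in Symmetric(...) in Julia.
w_sym = np.linalg.eigvalsh(s)

# Both agree on the (real) eigenvalues of a symmetric matrix.
assert np.allclose(np.sort(w_general.real), np.sort(w_sym))
```

The point in both languages is the same: the optimized path is requested explicitly, but the request is a small, local change to otherwise naive code.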
It would be easy to add, as a layer on top of that, a code analyzer that looks for patterns that might be possible to optimize further, but that seems like a problem better suited to tooling like a linter than to being baked into the language itself.
Is there a way to achieve this in Julia? If not, what syntax might you introduce to do it? You can't really annotate just the matrices or multiplication operators, since you are optimizing a whole expression.
- PyTorch JIT, PyTorch Glow, and Tensor Comprehensions
- A toy MLIR language built on the linalg dialect. I'm sure Google would love to help them.
to name the main ones.
Everyone is moving towards a compiler approach.
I'm actively researching the domain and I think Halide is the most promising: by separating the algorithm from the schedule, you can leave the algorithm side to the researchers while optimization is done by HPC devs or automated methods, without having to rewrite in C/Fortran.
Many more attempts in the past:
- My own domain specific language and compiler: https://github.com/numforge/laser/tree/master/laser/lux_comp...
There are low_level_design_considerations.md and challenges.md files there that should be interesting as well.
One thing is for sure: OpenMP today is something that needs to be worked around due to its lack of composability, and the GCC implementation is a contention point because it uses a single queue protected by a lock. While traditional linear algebra algorithms can deal with static distribution and no load balancing, some algorithms would hugely benefit from a more flexible task system, for example beam search.
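A toy cost model makes the load-balancing point concrete. The branch costs below are made up for illustration (think beam-search branches of very different depths): a static contiguous split can leave one worker idle while a dynamic scheduler, where each free worker pulls the next task, finishes much sooner.

```python
# Hypothetical costs of 8 uneven tasks (arbitrary time units),
# e.g. beam-search branches of very different depths.
costs = [1, 1, 1, 1, 10, 10, 10, 10]

# Static distribution over 2 workers: contiguous halves, decided up front.
# One worker gets all the cheap tasks and then sits idle.
static_makespan = max(sum(costs[:4]), sum(costs[4:]))

# Dynamic task system: the next free worker takes the next task.
# This greedy simulation mimics what a work-pulling runtime does.
worker_load = [0, 0]
for c in costs:
    idx = worker_load.index(min(worker_load))
    worker_load[idx] += c
dynamic_makespan = max(worker_load)

# The dynamic schedule finishes well before the static one.
assert dynamic_makespan < static_makespan
print(static_makespan, dynamic_makespan)  # 40 vs 22 for these costs
```

With evenly sized tasks (the traditional dense linear algebra case) the two schedules tie, which is why static OpenMP loops have been good enough there for so long.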
Does one need to understand compiler technology fully to get into this field? I have been trying to get a grip on the MLIR and Glow source code but have failed miserably so far. Could you please give me some pointers that would help me in this regard? Thanks
I can't find a single SPIR-V AXPY (y ← a*x + y) example, which is the "Hello World" of linear algebra, in Vulkan/SPIR-V, so I'm not optimistic on ease of use.
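For reference, the kernel itself is a one-liner, which is exactly what makes the absence of even a minimal SPIR-V version striking. In NumPy terms:

```python
import numpy as np

# AXPY: y <- a*x + y, the BLAS level-1 "hello world" kernel.
a = 2.0
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

y = a * x + y
assert np.allclose(y, [12.0, 24.0, 36.0])
```

Any GPU compute API should be able to express this in a screenful of code; the boilerplate around the kernel, not the kernel itself, is the ease-of-use question.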
You can probably do texture hacks like what people did in 2006, when they realized that GPUs were fast for linear algebra and OpenCL/CUDA did not exist (https://igoro.com/archive/how-gpu-came-to-be-used-for-genera...)
The best would probably be to follow this Halide issue and the corresponding branches: https://github.com/halide/Halide/issues/2296. Note that Halide is a compiler, so it automates a lot of the boilerplate that is probably needed to use Vulkan for compute, compared to CUDA.
moving towards a compiler
Furthermore, a compiler approach lends itself to automatic differentiation, as Julia, Swift for TensorFlow, and Halide are demonstrating.
Disturbingly too many people don’t seem to recognize the difference between letting the computer do something they don’t have time to do and trusting the computer to do something they don’t understand.
Imagine the nihilism if you could automate every decision.
For similar instances of the same story, see: compilers, java, haskell, cuda, etc...
News at 10: you always pay a performance price when you use high-level abstractions, aka you can't have your cake and eat it too.
In most cases, "working code, fast" is more important than "fast working code".
That depends on who you ask. If you ask the people who are managing the schedules, whose job reviews and bonuses are tied to meeting a specific date? Yes, absolutely, performance is a "nice to have" as long as the dates don't "slip". Now ask the users, and the number one complaint I hear most often is "why is this thing so damned slow?"
Good point. My viewpoint is rather focused on numerical linear algebra, since I worked in that area and the article is about it. If you skim through the paper mentioned in the article, you can e.g. compare NumPy and Armadillo. You will see that the speedups from using an (arguably) much more complex C++ framework instead of NumPy are marginal and will not be visible to the user. The increased production/maintenance costs due to the more complex code, however, will be visible to the user.