And the "how" behind Octavian.jl is basically LoopVectorization.jl [1], which helps make optimal use of your CPU's SIMD instructions.
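For anyone who hasn't tried it, a minimal usage sketch of Octavian's matrix-multiply API (assumes Octavian.jl is installed; `matmul`/`matmul!` are its exported entry points):

```julia
using Octavian

# Pure-Julia matrix multiply, built on LoopVectorization.jl under the hood
A = rand(200, 300)
B = rand(300, 100)
C = Octavian.matmul(A, B)        # allocates and returns A * B

# In-place variant, writing into a preallocated buffer
Cbuf = similar(A, 200, 100)
Octavian.matmul!(Cbuf, A, B)
```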
Currently there can be some nontrivial compilation latency with this approach, but since LV ultimately emits custom LLVM IR, it's actually perfectly compatible with StaticCompiler.jl [2] following Mason's rewrite, so stay tuned on that front.
Thanks. But how does LoopVectorization.jl help here, say, compared to C/Fortran optimized for the CPU? Is this mentioned anywhere in their docs?
The basic answer is that LLVM doesn't do as good a job with some types of vectorization because it is working on a lower-level representation. There are several causes of this. One is that LoopVectorization has permission to replace elementary functions (e.g. `exp`, `log`) with hand-written vectorized equivalents; another is that it does a better job of using gather/scatter instructions.
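As a concrete illustration of the first point, here is a sketch of the kind of loop LoopVectorization's `@turbo` macro vectorizes, where the scalar `exp` call gets swapped for a hand-written SIMD implementation rather than being left to LLVM's autovectorizer (assumes LoopVectorization.jl is installed):

```julia
using LoopVectorization

# Sum of exp over a vector; @turbo rewrites the loop body,
# replacing `exp` with a vectorized equivalent and emitting SIMD code.
function vexp_sum(x::AbstractVector{Float64})
    s = 0.0
    @turbo for i in eachindex(x)
        s += exp(x[i])
    end
    return s
end
```

A plain `for` loop calling `exp` typically won't vectorize under LLVM alone, because LLVM sees only an opaque scalar function call by the time it runs its loop vectorizer.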
[1] https://github.com/JuliaSIMD/LoopVectorization.jl
[2] https://github.com/tshort/StaticCompiler.jl