Ah very interesting, this is great to know! Sorry I didn't properly read your initial comment. I'm now quite interested to see what we can do to match or beat Pythran here.
What CPU are you using? I know my CPU, a Zen+ Ryzen 5 2600, is one of the worst modern CPUs for this sort of thing, and there's much better scaling on Intel CPUs or the Zen 3 CPUs, but those things are probably also affecting the Pythran code.
Is Pythran any good at producing BLAS microkernels?
It would be good to confirm the number of threads being used by both, but it could also be that Pythran is using a better `sincos` implementation.
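For reference, on the Julia side the thread count can be checked as below (just a quick sketch; if Pythran is parallelising via OpenMP, `OMP_NUM_THREADS` would be the corresponding knob on that side):

```julia
# Number of threads Julia was started with; this is the pool
# that @tturbo's threading can draw from.
Threads.nthreads()
```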
On very old hardware (which it would have to be, if it's multithreaded and still that slow), the `exp` implementation might not be great either. That would especially be the case if it has no AVX2, as the `exp` implementation relies on a table lookup (which requires gather instructions).
For 2000x2000, their time was 14ms, vs 6ms for the 2600.
Just an additional comment: using `sincos` in the comparison is not really fair, because it is a change to the algorithm (it assumes that `x` is real), since in general

exp(x) = exp(x.real()) * sincos(x.imag())

(for brevity, sincos(x) here denotes cos(x) + j*sin(x), but you get the point).
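A quick numerical check of that identity in Julia, where the built-in `cis(y)` is exactly the cos(y) + j*sin(y) above:

```julia
z = 0.3 + 1.7im
exp(z) ≈ exp(real(z)) * cis(imag(z))  # true: magnitude from the real part, phase from the imaginary part
```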
The other code can't easily optimise for that (although it would be a nice optimisation). Interestingly, and I can't explain why (maybe someone who knows more about Julia can), if I change the code of that third method using `@tturbo` to use the `cis` function (or `exp`), it becomes a factor of 20 slower and is actually the slowest of all the different algorithms.
Anyway, it is the same algorithm as long as `x` is real.
The reason for manually inlining `exp(::Complex)` and manually decomposing the complex number into its real and imaginary parts is that `@tturbo` currently only supports real inputs.
Given types that it does not support, it will silently run the original loop single-threaded and optimized simply with `@inbounds @fastmath`.
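For illustration, here is a minimal sketch of that pattern (the function and array names are made up for this example): computing exp(a + im*b[i]) for a real scalar `a` and a real vector `b`, writing the real and imaginary parts into separate real arrays so `@tturbo` can handle them.

```julia
using LoopVectorization

function cexp!(outre, outim, a, b)
    ea = exp(a)  # magnitude from the real part, hoisted out of the loop
    @tturbo for i in eachindex(b)
        s, c = sincos(b[i])   # sincos is supported inside @tturbo loops
        outre[i] = ea * c     # real part:      exp(a) * cos(b[i])
        outim[i] = ea * s     # imaginary part: exp(a) * sin(b[i])
    end
    return outre, outim
end
```

If a complex array is needed afterwards, `outre .+ im .* outim` reassembles it outside of the hot loop.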
I imagine that in changing to `cis` or `exp`, you also switched to using complex numbers? If so, that would explain the dramatic slowdown.
If not, mind sharing your example, here or in an issue at LoopVectorization.jl?
It's planned that "some day" LoopVectorization will support complex numbers through the AbstractInterpreter interface, but this is probably still a ways off.
I agree that it's the same algorithm as long as `x` is real. My point was that with this optimisation we give more information to this function than to the comparable Python function (or the other Julia implementations), so it gains an advantage when optimising.
Yes, I was using complex inputs when switching to `cis`/`exp`. LoopVectorization not supporting complex numbers does explain this.
As a side note, I generally dislike silent fall-backs in optimisation libraries/modules. That's why I always pass `nopython=True` when using numba (and I wish it were the default). I'm using the library/module because I want a speed-up, and the problem with the fall-backs is that they are often slower than the default code (e.g. unwrapped loops in numba, and in this example). I would like to know when it doesn't speed up the code, because then I'd rather not use it at all; a hard failure would save me from benchmarking the slow-down.
Hopefully after the LoopVectorization.jl rewrite we can make it understand complex numbers automatically. We could probably put in special-case complex support today, but we'd really like to support arbitrary structs of bits.
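In the meantime, `LoopVectorization.check_args` can be called up front to see whether a given set of arrays will take the fast path; a sketch, assuming it keeps its current behaviour:

```julia
using LoopVectorization

x = rand(100)              # real eltype: supported
z = rand(ComplexF64, 100)  # complex eltype: triggers the silent fallback

LoopVectorization.check_args(x)  # true
LoopVectorization.check_args(z)  # false -> @tturbo would fall back to @inbounds @fastmath
```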