Ah very interesting, this is great to know! Sorry I didn't properly read your initial comment. I'm now quite interested to see what we can do to match or beat Pythran here.
What CPU are you using? I know my CPU, a Zen+ Ryzen 5 2600, is one of the worst modern CPUs for this sort of thing, and there's much better scaling on Intel CPUs or the Zen 3 CPUs, but those things are probably also affecting the Pythran code.
Is Pythran any good at producing BLAS microkernels?
It would be good to confirm the number of threads being used by both, but it could also be that Pythran is using a better `sincos` implementation.
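For reference, on the Julia side the thread count can be checked as below (just a quick sketch; if Pythran is parallelising via OpenMP, `OMP_NUM_THREADS` would be the corresponding knob on that side):

```julia
# Number of threads Julia was started with; this is the pool
# that @tturbo's threading can draw from.
Threads.nthreads()
```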
On very old hardware (which it would have to be, if it's multithreaded and still that slow), the `exp` implementation might not be great either. That would especially be the case if it has no AVX2, as the `exp` implementation relies on a table lookup (which requires gather instructions).
For 2000x2000, their time was 14ms, vs 6ms for the 2600.
Just an additional comment: using `sincos` in the comparison is not really fair, because it is a change to the algorithm (it assumes that `x` is real), since in general

exp(x) = exp(x.real()) * sincos(x.imag())

(for brevity, sincos(x) here denotes cos(x) + j*sin(x), but you get the point).
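A quick numerical check of that identity in Julia, where the built-in `cis(y)` is exactly the cos(y) + j*sin(y) above:

```julia
z = 0.3 + 1.7im
exp(z) ≈ exp(real(z)) * cis(imag(z))  # true: magnitude from the real part, phase from the imaginary part
```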
The other code can't easily optimise for that (although it would be a nice optimisation). Interestingly, and I can't explain why (maybe someone who knows more about Julia can), if I change the code of that third method using `@tturbo` to use the `cis` function (or `exp`), it becomes a factor of 20 slower and is actually the slowest of all the different algorithms.
Anyway, it is the same algorithm as long as `x` is real.
The reason for manually inlining `exp(::Complex)` and manually decomposing the complex number into its real and imaginary parts is that `@tturbo` currently only supports real inputs.
Given types that it does not support, it will silently run the original loop single-threaded and optimized simply with `@inbounds @fastmath`.
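For illustration, here is a minimal sketch of that pattern (the function and array names are made up for this example): computing exp(a + im*b[i]) for a real scalar `a` and a real vector `b`, writing the real and imaginary parts into separate real arrays so `@tturbo` can handle them.

```julia
using LoopVectorization

function cexp!(outre, outim, a, b)
    ea = exp(a)  # magnitude from the real part, hoisted out of the loop
    @tturbo for i in eachindex(b)
        s, c = sincos(b[i])   # sincos is supported inside @tturbo loops
        outre[i] = ea * c     # real part:      exp(a) * cos(b[i])
        outim[i] = ea * s     # imaginary part: exp(a) * sin(b[i])
    end
    return outre, outim
end
```

If a complex array is needed afterwards, `outre .+ im .* outim` reassembles it outside of the hot loop.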
I imagine that in changing to `cis` or `exp`, you also switched to using complex numbers? If so, that would explain the dramatic slowdown.
If not, mind sharing your example, here or in an issue at LoopVectorization.jl?
It's planned that "some day" LoopVectorization will support complex numbers through the AbstractInterpreter interface, but this is probably still a ways off.
I agree that it's the same algorithm as long as `x` is real. My point was that with this optimisation we give more information to this function than to the comparable Python function (or the other Julia implementations), so it gains an advantage when optimising.
Yes, I was using complex inputs when switching to `cis`/`exp`. LoopVectorization not supporting complex numbers does explain this.
As a side note, I generally dislike silent fall-backs in optimisation libraries/modules. That's why I always pass `nopython=True` when using numba (and I wish it were the default). I'm using the library/module because I want a speed-up, and the problem with the fall-backs is that they are often slower than the default code (e.g. unwrapped loops in numba, and in this example). I would like to know when it doesn't speed up the code, because then I'd rather not use it at all; a hard failure would save me from benchmarking the slow-down.
Hopefully after the LoopVectorization.jl rewrite we can make it understand complex numbers automatically. We could probably put in special-case complex support today, but we'd really like to support arbitrary structs of bits.
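In the meantime, `LoopVectorization.check_args` can be called up front to see whether a given set of arrays will take the fast path; a sketch, assuming it keeps its current behaviour:

```julia
using LoopVectorization

x = rand(100)              # real eltype: supported
z = rand(ComplexF64, 100)  # complex eltype: triggers the silent fallback

LoopVectorization.check_args(x)  # true
LoopVectorization.check_args(z)  # false -> @tturbo would fall back to @inbounds @fastmath
```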