// Compiler doesn't make independent sum* accumulators, so unroll manually.
// We cannot use an array because V might be a sizeless type. For reasonable
// code size, we unroll 4x, but 8x might help (2 FMA ports * 4 cycle latency).
That code needs 2 loads per FMA. So a CPU with 2 FMA ports would need at least 4 load ports to be able to feed the 2 FMA ports. Given that most CPUs with 2 FMA ports have just 2 load ports, unrolling by 4 should be more or less ideal.
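To make the load/FMA ratio concrete, here is a minimal scalar sketch of the pattern under discussion. It is not Highway's code: `DotUnrolled4` is a hypothetical name, and plain `float` stands in for the vector type `V`. Each multiply-add reads one element of `a` and one of `b`, hence the 2 loads per FMA, and the four `sum*` variables have no dependency on each other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Highway): dot product with four
// independent accumulators, mirroring the manual 4x unroll described above.
float DotUnrolled4(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();

  // Separate named variables rather than an array, as in the quoted comment.
  float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;

  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    // Each line needs two loads (a[i + k], b[i + k]) for one multiply-add,
    // and the four chains can be in flight at the same time.
    sum0 += a[i + 0] * b[i + 0];
    sum1 += a[i + 1] * b[i + 1];
    sum2 += a[i + 2] * b[i + 2];
    sum3 += a[i + 3] * b[i + 3];
  }

  // Remainder loop for lengths that are not a multiple of 4.
  for (; i < n; ++i) {
    sum0 += a[i] * b[i];
  }

  // Reduce the partial sums pairwise at the end.
  return (sum0 + sum1) + (sum2 + sum3);
}
```

Going to 8x would just mean eight such accumulators. With only 2 load ports, the loads rather than the FMA latency are likely to be the limit well before that.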
But, ideally, the compiler could make the decision based on the target architecture.
Without enabling associative math, it isn't legal to duplicate floating-point accumulators and change the order of the accumulation. Perhaps compiling under `-funsafe-math-optimizations` would help.
If you're using GCC, you'll probably need `-fvariable-expansion-in-unroller`, too.
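A tiny illustration of why the compiler isn't allowed to reorder the accumulation on its own: floating-point addition is not associative, so splitting one accumulator into several can change the result. This is only a sketch with made-up function names, and it assumes IEEE-754 single precision and a build without fast-math flags.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helpers: the same three values summed in two different orders.
float SumLeftFirst(float a, float b, float c) { return (a + b) + c; }
float SumRightFirst(float a, float b, float c) { return a + (b + c); }

void PrintReorderingDemo() {
  const float a = 1e20f, b = -1e20f, c = 1.0f;

  // (a + b) cancels exactly, so adding c afterwards gives 1.
  const float left = SumLeftFirst(a, b, c);
  // (b + c) rounds back to b because 1 is far below the spacing of floats
  // near 1e20, so adding a afterwards gives 0.
  const float right = SumRightFirst(a, b, c);

  assert(left != right);
  std::printf("(a + b) + c = %g, a + (b + c) = %g\n", left, right);
}
```

A dot product with split accumulators regroups the additions in the same way, which is why it needs either a manual unroll or a flag that permits reassociation.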
I think highway looks great. I'm sure I'll procrastinate on something important to play with it reasonably soon.
Thanks :) I'd be interested to hear how it goes for you.
Agree that 4x unrolling is getting most of the low-hanging fruit without excessive code size. I saw only very slightly better performance on SKX with 8x.
You're right that it's nicer when the compiler can decide about the unrolling, for example knowing whether we have 16 or 32 regs. The unsafe/fast-math flags are pretty dangerous, though :/ https://simonbyrne.github.io/notes/fastmath/
Especially when they enable flush-to-zero, which would be unacceptable for a library loaded into some other application.