// Compiler doesn't make independent sum* accumulators, so unroll manually. // W...

janwas · on June 5, 2022

Thanks :) I'd be interested to hear how it goes for you.

Agreed that 4x unrolling gets most of the low-hanging fruit without excessive code size. On SKX (Skylake-X), 8x was only very slightly faster.
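For readers following along, here is a minimal scalar sketch of what 4x unrolling with independent accumulators looks like. It assumes a dot-product-style reduction; `DotUnrolled4` is a hypothetical name for illustration, not the code under discussion, and the real thing would use SIMD vectors rather than scalar floats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: dot product with four independent accumulators.
// Each sumN only depends on its own previous value, so the four add chains
// can overlap in the pipeline instead of waiting on a single running sum.
float DotUnrolled4(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();

  float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
  std::size_t i = 0;
  // Main loop: four elements per iteration, one per accumulator.
  for (; i + 4 <= n; i += 4) {
    sum0 += a[i + 0] * b[i + 0];
    sum1 += a[i + 1] * b[i + 1];
    sum2 += a[i + 2] * b[i + 2];
    sum3 += a[i + 3] * b[i + 3];
  }
  // Remainder loop: fewer than four elements left.
  for (; i < n; ++i) {
    sum0 += a[i] * b[i];
  }
  // Combine the partial sums pairwise at the end.
  return (sum0 + sum1) + (sum2 + sum3);
}
```

Note that this changes the order of the additions relative to a plain loop, which is exactly why a compiler will not do it for floating point on its own without relaxed-math flags.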

You're right that it's nicer when the compiler can decide about the unrolling, for example with knowledge of whether we have 16 or 32 regs. The unsafe/fast-math flags are pretty dangerous, though :/ (see https://simonbyrne.github.io/notes/fastmath/), especially when they enable flush-to-zero, which would be unacceptable for a library loaded into some other application.
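A tiny sketch of why the compiler needs those flags before it may reorder a float sum: addition is not associative in floating point. The function names are hypothetical, and the results in the comments assume plain IEEE-754 single-precision evaluation (typical on x86-64 and AArch64) with no fast-math flags.

```cpp
// Hypothetical helpers: the same three values, summed in two different orders.
float AddLeftFirst(float a, float b, float c) { return (a + b) + c; }
float AddRightFirst(float a, float b, float c) { return a + (b + c); }

// With a = 1e8f, b = -1e8f, c = 1.0f:
//   AddLeftFirst:  (1e8 - 1e8) + 1 = 1
//   AddRightFirst: -1e8 + 1 rounds back to -1e8 (float spacing there is 8),
//                  so 1e8 + (-1e8) = 0
```

Splitting one running sum into several accumulators is this kind of regrouping, so doing it by hand makes the change explicit and local instead of granting the compiler permission to do it everywhere.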