This isn't really an apples-to-apples comparison with FFTW:
1. It's been my experience that distros don't configure AVX properly for it, and
2. PhastFT takes its inputs de-interleaved in separate real/imaginary arrays, which is generally not how complex data is provided, so the cost of converting to that layout doesn't appear in PhastFT's numbers.
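To make the layout point concrete, here is a minimal sketch of the conversion a caller would have to do when the data arrives interleaved. The function names `deinterleave` and `interleave` are made up for illustration and are not part of PhastFT or FFTW; the sketch assumes `f64` samples stored as `[re0, im0, re1, im1, ...]`.

```rust
/// Split interleaved complex samples `[re0, im0, re1, im1, ...]` into
/// separate real and imaginary vectors (the "split" layout).
/// Hypothetical helper, not a PhastFT or FFTW API.
fn deinterleave(interleaved: &[f64]) -> (Vec<f64>, Vec<f64>) {
    assert!(interleaved.len() % 2 == 0, "expected an even number of values");
    let n = interleaved.len() / 2;
    let mut reals = Vec::with_capacity(n);
    let mut imags = Vec::with_capacity(n);
    // Each chunk of two values is one complex sample: (real, imaginary).
    for pair in interleaved.chunks_exact(2) {
        reals.push(pair[0]);
        imags.push(pair[1]);
    }
    (reals, imags)
}

/// Inverse of `deinterleave`: merge split arrays back into one
/// interleaved buffer. Also a hypothetical helper.
fn interleave(reals: &[f64], imags: &[f64]) -> Vec<f64> {
    assert_eq!(reals.len(), imags.len(), "arrays must have equal length");
    reals
        .iter()
        .zip(imags)
        .flat_map(|(&re, &im)| [re, im])
        .collect()
}
```

Both directions are an extra pass over the data plus an allocation, which is the overhead a benchmark on pre-split inputs never sees.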
One of the authors of PhastFT here. Thank you for your interest.
We went out of our way to configure FFTW for AVX-512. The Rust bindings don't do it, but the FFTW build used in the benchmark does.
It's worth noting that with FFTW you have to choose between building it for your CPU, which makes it non-portable, and targeting the lowest common denominator of CPU features, which runs everywhere but much slower. Meanwhile, PhastFT detects the available CPU features at runtime and uses the fastest ones without sacrificing portability.
Lastly, we are currently working on support for interleaved format [1]. That should ship in the next release.
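For readers unfamiliar with runtime feature detection in Rust, a minimal sketch using only the standard library's `is_x86_feature_detected!` macro is below. This is not PhastFT's actual dispatch code; the function name `best_simd_level` and the returned labels are assumptions chosen for illustration.

```rust
/// Report the widest SIMD feature level the running CPU supports.
/// Hypothetical helper: a real library would dispatch to a specialized
/// kernel here instead of returning a label.
fn best_simd_level() -> &'static str {
    // The detection macro only exists on x86/x86_64, so other targets
    // skip this block and fall through to the portable path.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512f") {
            return "avx512f";
        }
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }
    "scalar"
}
```

The check happens when the program runs, not when it is compiled, so one binary can take the AVX-512 path on a machine that has it and a narrower path elsewhere.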
FFTW will definitely query cpuid at runtime too. Since it's piecing together kernels anyway, it's not much more work for it to choose to ignore AVX, etc. If you use the [guru interface](https://www.fftw.org/fftw3_doc/Guru-vector-and-transform-siz...) to configure it to work with split arrays (and maybe use FFTW_MEASURE when planning), I think the benchmarks will be a lot more 1:1.