I recently spent a good deal of time looking at the assembly that compilers emit for SIMD code written in C with intrinsics. On GCC and Clang, pass-by-value and force-inlined functions (`__attribute__((always_inline))`) gave the best results (at least until link time optimization becomes more mainstream). This was even the case with 4x4 matrix structs, not just SIMD vectors.
The speed does not come from making individual functions fast, but from letting the compiler inline and combine several function calls and keep live values in registers from one function to the next.
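To make that concrete, here is a minimal sketch of the pass-by-value, force-inline style. It is not code from my lib: the names (`FORCE_INLINE`, `vec4_lerp`, `mat4_mul_vec4`) are made up for illustration, and it uses the GCC/Clang vector extension instead of platform intrinsics so it is not tied to one instruction set.

```c
/* Hypothetical names throughout; assumes GCC or Clang. */
typedef float vec4 __attribute__((vector_size(16)));

/* A 4x4 matrix as four column vectors, also passed by value. */
typedef struct { vec4 col[4]; } mat4;

#define FORCE_INLINE static inline __attribute__((always_inline))

FORCE_INLINE vec4 vec4_splat(float s) { return (vec4){ s, s, s, s }; }
FORCE_INLINE vec4 vec4_add(vec4 a, vec4 b) { return a + b; }
FORCE_INLINE vec4 vec4_sub(vec4 a, vec4 b) { return a - b; }
FORCE_INLINE vec4 vec4_mul(vec4 a, vec4 b) { return a * b; }

/* a + t * (b - a): three small calls that the compiler can fuse
 * into one sequence once everything is inlined. */
FORCE_INLINE vec4 vec4_lerp(vec4 a, vec4 b, float t)
{
    return vec4_add(a, vec4_mul(vec4_splat(t), vec4_sub(b, a)));
}

/* Column-major matrix * vector; the struct is taken by value. */
FORCE_INLINE vec4 mat4_mul_vec4(mat4 m, vec4 v)
{
    vec4 r = vec4_mul(m.col[0], vec4_splat(v[0]));
    r = vec4_add(r, vec4_mul(m.col[1], vec4_splat(v[1])));
    r = vec4_add(r, vec4_mul(m.col[2], vec4_splat(v[2])));
    r = vec4_add(r, vec4_mul(m.col[3], vec4_splat(v[3])));
    return r;
}
```

With everything inlined, a call like `mat4_mul_vec4(m, vec4_lerp(a, b, t))` gives the compiler one straight-line block to work on, so the intermediate vectors have a chance to stay in registers instead of going through memory. Check the emitted assembly on your own compiler and flags to see what you actually get.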
Here's my SIMD math lib: https://github.com/rikusalminen/threedee-simd
Your math lib looks nifty! Which license would you share it under? Also, why in particular std=gnu99?
Thanks! zlib license.
> Also, why in particular std=gnu99?
Because I use some C99 features, and -std=c99 disables some POSIX/GNU extensions. I think it was clock_gettime from time.h, which I was using for benchmarking.
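For illustration, a rough sketch of that kind of benchmark timer. The helper names (`timer_now`, `elapsed_seconds`) are hypothetical and not from the lib. The assumption here is glibc on Linux, where -std=gnu99 exposes clock_gettime by default, while strict -std=c99 hides it unless a feature-test macro such as `_POSIX_C_SOURCE` is defined before the first include.

```c
/* Only needed with strict -std=c99; with -std=gnu99 the POSIX
 * declarations are visible by default (assumption: glibc). */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif

#include <time.h>

/* Read a monotonic timestamp; unaffected by wall-clock changes. */
static struct timespec timer_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts;
}

/* Difference between two timestamps, in seconds. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Usage is just `struct timespec t0 = timer_now();`, then the code under test, then `elapsed_seconds(t0, timer_now())`. On older glibc versions you may also need to link with -lrt.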