(I guess DarkShikari's comment is nested too deeply for me to reply directly.)
In my (admittedly limited) experience [1], the compiler has done a pretty decent job of register allocation in intrinsic-heavy loops. I wrote out the assembly loop in [2] with manual allocation into all 16 XMM registers and then noticed that the compiler had managed to eliminate one of them.
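To make "intrinsic-heavy loop" concrete, here is a rough sketch of the general shape I mean. It is a made-up dot product, not code from IRMSD; the name dot_sse and the multiple-of-16 length restriction are assumptions for illustration, and it only builds on x86 with SSE. The point is that the __m128 variables are just names, and the compiler decides which XMM register (if any) each one lives in.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86-64 only */

/* Hypothetical example: dot product of two float arrays using four
   independent accumulators. Each __m128 below is a candidate for an XMM
   register, but the actual assignment is left to the compiler. */
static float dot_sse(const float *a, const float *b, size_t n)
{
    assert(n % 16 == 0);  /* assumption: keeps the sketch free of a tail loop */

    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();

    for (size_t i = 0; i < n; i += 16) {
        /* Unaligned loads, so the caller's arrays need no special alignment. */
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a + i),      _mm_loadu_ps(b + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),  _mm_loadu_ps(b + i + 4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(a + i + 8),  _mm_loadu_ps(b + i + 8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(a + i + 12), _mm_loadu_ps(b + i + 12)));
    }

    /* Combine the four accumulators, then sum the four lanes. */
    __m128 sum = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float lanes[4];
    _mm_storeu_ps(lanes, sum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

If you want to see what the allocator actually did with something like this, compiling with gcc -O2 -S and reading the generated assembly shows which XMM registers ended up in use.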
[1] https://github.com/simtk/IRMSD
[2] https://github.com/SimTk/IRMSD/blob/master/python/IRMSD/theo...