Great, I'll take a look in a bit, although it might take me until tomorrow to have time to do much with it.
In the meantime, I'll mention that my first quick discovery is that clang seems to be significantly faster than gcc on the standard C code. The ratio changes with different versions and compilation options, but on Skylake with "-Ofast -march=native" I find clang-6.0 to be almost twice as fast as gcc-8. So if you have clang installed, check and see if it might be a better baseline.
Also, what system are you running? This shouldn't make a difference with execution speed, but will make it easier to make tool suggestions. If you are running some sort of Linux, now would be a good time to get familiar with 'perf record'!
Edit: > Intel(R) Pentium(R) CPU N3700 @ 1.60GHz
Hmm, that's a "Braswell" part, which unfortunately isn't covered in Agner's standard guide to instruction timings (https://www.agner.org/optimize/microarchitecture.pdf), and I'm not familiar with its characteristics. This might make profiling a little more approximate.
The preliminary results I get disagree with what you are seeing. I'm not sure if this is my error, your error, or just genuine differences between processors. Specifically, on Skylake, I get your assembly to be much faster than GCC, although slightly slower than Clang. And that's without trying to use options to limit the compiler:
Not having understood the assembly yet, my guess from skimming is that the compiler isn't able to vectorize this code well, and thus the SSE/AVX distinction isn't going to matter much. "-Ofast" should be comparable to "-O3" here, although I don't recall the exact differences. I didn't use "-no-pie" with your assembly, but don't think this matters.
Are you able to do a similar comparison and report results with "perf stat"? Cycles and "instructions per cycle" are going to be better metrics to compare than clock time. No hurry. I'm East Coast US, and done for the night.
Edit: A quick glance at "perf record" and "perf report" suggests that clang is slightly faster than you (on Skylake, using these options) because it's making use of fused multiply-adds. Which slightly contradicts what I said about the SSE/AVX distinction not mattering, although it's only a minor effect. For both routines, the majority of the time is spent in the chain starting with the division. I'm wondering if there is some major difference in architecture with Braswell --- perhaps it only has a single floating point multiplication unit or something? Or one of the operations is relatively _much_ slower.
Edit 3: I hadn't actually looked at your code yet. So you are trying to vectorize, it's just that you are limited in doing so because you can only do 2 doubles at a time. But since you are working in 3D, this doesn't fit evenly, so you do [x,y] then [z,null]. Thus you expected it to be something like 1.5x faster on your processor, and instead it comes out 2x slower.

I think the issue might be that your processor doesn't really have full vector units --- instead it's doing some sort of emulation. Look closely at the timings on Page 338 of the instruction table I linked in Edit 2. Note that DIVPD takes just about twice as long as DIVSD. Then check Skylake on Page 252 --- same time for packed and single XMM. While this isn't the full answer, I think it's a strong hint at the issue.

The quickest fix (if you actually want to make this faster on your processor) is to use the single instruction for the [z,null] case. This isn't going to help much overall, since the packed divide still takes twice as long, but it might at least get you back to parity with the compiler! If you actually want higher speed from your vectorization, you may have to switch to a processor that has better vectorized speeds.
I just want to say: wow! You are showing me something I bet many others are really not aware of: degraded SSE performance in these types of processors! Your DIVSD vs DIVPD comment makes a lot of sense too. Man I feel this HN thread has been just a gold mine of this kind of information.
What's the speed when restricted to SSE3? That would round out the tests.
Do you mind if I directly quote you in a follow up post? This is really good stuff.
It would be interesting to analyze the difference between GCC and Clang here. Just glancing, without fully understanding it yet, it looks like one big difference is that Clang might be calling out to a different square root routine rather than using the assembly builtin. Hmm, although it makes me wonder if maybe that library routine is using some more advanced instruction set?
> Do you mind if I directly quote you in a follow up post?
Sure, but realize that I'm speculating here. Essentially, I'm an expert (or at least was a couple years ago) on integer vector operations on modern Intel server processors. But I'm not familiar with their consumer models, and I'm much less fluent in floating point. The result is that I know where to look for answers, but don't actually know them off hand. So quote the primary sources instead when you can.
Please do write a followup, and send me an email when you post it. My address is in my HN profile (click on username). Also, I might be able to provide you remote access to testing machines if it would help you test.
I haven't thought about it much, but more info here: https://stackoverflow.com/questions/37117809/why-cant-gcc-op.... While the question is about C++, it hints that the spec might require sqrt() to set errno if given a negative number. Clang special-cases this by adding a never-taken branch, but gcc does not (unless told it doesn't need to care).