Great, I'll take a look in a bit, although it might take me until tomorrow to have time to do much with it.
In the meantime, I'll mention that my first quick discovery is that clang seems to be significantly faster than gcc on the standard C code. The ratio changes with different versions and compilation options, but on Skylake with "-Ofast -march=native" I find clang-6.0 to be almost twice as fast as gcc-8. So if you have clang installed, check and see if it might be a better baseline.
Also, what system are you running? This shouldn't make a difference with execution speed, but will make it easier to make tool suggestions. If you are running some sort of Linux, now would be a good time to get familiar with 'perf record'!
Edit: > Intel(R) Pentium(R) CPU N3700 @ 1.60GHz
Hmm, that's a "Braswell" part, which unfortunately isn't covered in Agner's standard guide to instruction timings (https://www.agner.org/optimize/microarchitecture.pdf), and I'm not familiar with its characteristics. This might make profiling a little more approximate.
The preliminary results I get disagree with what you are seeing. I'm not sure if this is my error, your error, or just genuine differences between processors. Specifically, on Skylake, I get your assembly to be much faster than GCC, although slightly slower than Clang. And that's without trying to use options to limit the compiler:
Not having understood the assembly yet, my guess from skimming is that the compiler isn't able to vectorize this code well, and thus the SSE/AVX distinction isn't going to matter much. "-Ofast" should be comparable to "-O3" here, although I don't recall the exact differences. I didn't use "-no-pie" with your assembly, but don't think this matters.
Are you able to do a similar comparison and report results with "perf stat"? Cycles and "instructions per cycle" are going to be better metrics to compare than clock time. No hurry. I'm East Coast US, and done for the night.
Edit: A quick glance at "perf record" and "perf report" suggests that clang is slightly faster than you (on Skylake, using these options) because it's making use of fused multiply-adds. Which slightly contradicts what I said about the SSE/AVX distinction not mattering, although it's only a minor effect. For both routines, the majority of the time is spent in the chain starting with the division. I'm wondering if there is some major difference in architecture with Braswell --- perhaps it only has a single floating point multiplication unit or something? Or one of the operations is relatively _much_ slower.
Edit 3: I hadn't actually looked at your code yet. So you are trying to vectorize, it's just that you are limited in doing so because you can only do 2 doubles at a time. But since you are working in 3D, this doesn't fit evenly, so you do [x,y] then [z,null]. Thus you expected it to be something like 1.5x faster on your processor, and instead it comes out 2x slower.

I think the issue might be that your processor doesn't really have full vector units --- instead it's doing some sort of emulation. Look closely at the timings on Page 338 of the instruction table I linked in Edit 2. Note that DIVPD takes just about twice as long as DIVSD. Then check Skylake on Page 252 --- same time for packed and single XMM. While this isn't the full answer, I think it's a strong hint at the issue.

The quickest fix (if you actually want to make this faster on your processor) is to use the single instruction for the [z,null] case. This isn't going to help much overall, since the packed divide still takes twice as long, but it might at least get you back to parity with the compiler! If you actually want higher speed from your vectorization, you may have to switch to a processor that has better vectorized speeds.
I just want to say: wow! You are showing me something I bet many others are really not aware of: degraded SSE performance in these types of processors! Your DIVSD vs DIVPD comment makes a lot of sense too. Man I feel this HN thread has been just a gold mine of this kind of information.
What's the speed when restricted to SSE3? That would round out the tests.
Do you mind if I directly quote you in a follow up post? This is really good stuff.
It would be interesting to analyze the difference between GCC and Clang here. Just glancing, without fully understanding it yet, it looks like one big difference is that Clang might be calling out to a different square root routine rather than using the assembly builtin. Hmm, although it makes me wonder if maybe that library routine is using some more advanced instruction set?
> Do you mind if I directly quote you in a follow up post?
Sure, but realize that I'm speculating here. Essentially, I'm an expert (or at least was a couple years ago) on integer vector operations on modern Intel server processors. But I'm not familiar with their consumer models, and I'm much less fluent in floating point. The result is that I know where to look for answers, but don't actually know them off hand. So quote the primary sources instead when you can.
Please do write a followup, and send me an email when you post it. My address is in my HN profile (click on username). Also, I might be able to provide you remote access to testing machines if it would help you test.
I haven't thought about it much, but more info here: https://stackoverflow.com/questions/37117809/why-cant-gcc-op.... While the question is about C++, it hints that the spec might require sqrt() to set errno if given a negative number. Clang special-cases this by adding a never-taken branch, but gcc does not (unless told it doesn't need to care).