Hacker News new | past | comments | ask | show | jobs | submit login

Here [1] is the comparison repository, for anyone interested.

[1] https://github.com/davidad/8queens/tree/%2Bc_comparison

EDIT: I don't seem to have -Ofast, but I added a note to the readme to ask people to try it. I also added the recommended flags for Sandy Bridge, i.e. -march=core2 -msse4.1 -msse4.2.




You can go at least twice as fast if you use symmetries to prune the space. I messed with the C version:

  $ make
  gcc -O3 -march=core2 -msse4.1 -msse4.2 8q_C_bluecalm.c -o 8q_C_bluecalm
  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.376s
  user	0m0.373s
  sys	0m0.001s
  92
  time ./8q_x64_davidad ; echo $?
  
  real	0m0.129s
  user	0m0.128s
  sys	0m0.001s
  92
Of course this doesn't matter if the point is to compare the best asm dfs to the best C dfs of the same space, but if the point is to make the best asm 8queens solver, you can get down below 6μs like this :)

Edit: Using C++ templates to inline the DFS also helps

  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.260s
  user	0m0.258s
  sys	0m0.001s
  92
Unfortunately this sort of technique doesn't really apply to the asm >_>

Edit again: I tried to go a bit further by replacing the board struct with your method of passing 3 bytes between levels of the dfs, but at that point the compiler was able to optimize out the whole program.

Removing the templates causes the compiler to actually emit a dfs, which by now is faster than the asm (!). Removing the 2x speedup from just not looking at half of the tree slows us back down to ~1.1x the runtime of the asm:

  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.140s
  user	0m0.138s
  sys	0m0.001s
  92
Here's this last solution, the one that's ~1.1x slower than the asm on my machine: http://pastie.org/8784768

And with the free 2x speedup: http://pastie.org/8784770

One last time now: I tried getting rid of the x arg and using a global. This was much slower, so I tried getting rid of the SOLUTIONS global and using a return value. This caused the compiler to optimize out the whole program again :(


Wow! I like your C solution. I'm glad that at least the speedup is still measurable, but it's a close shave.

I implemented your "free 2x speedup" in asm [1]. Sure enough, solution time improved to 4.6us on my machine [2].

[1] https://github.com/davidad/8queens/commit/61119a7f0019e4a85e... [2] https://github.com/davidad/8queens/tree/free-2x-speedup


It seems to me that using a bitboard instead of the struct is the biggest gain. I like your version, very elegant and terse.


Thanks for putting this together!

My immediate question was whether another compiler might be more competitive with the assembly. On my Sandy Bridge processor, I found that although GCC was the slowest, the assembly was still the clear winner:

  gcc 4.8.0: 0.801596113 seconds
  icc 14.0.1: 0.739297534 seconds
  clang 3.2: 0.706446818 seconds
  assembly: 0.104038212 seconds
I was surprised by Clang here. It's an older version than the other two, yet fastest. A quick glance at 'perf stat' (which I used for the timing) says that although it's executing fewer instructions per cycle, it's managing to use fewer instructions than the other two.

Although it doesn't seem to make much of a difference here, those are odd flags for Sandy Bridge. Core2 is a previous generation, and it supports AVX which came out after SSE4.2. If you did to specify SB, you'd want the unwieldy "-march=corei7-avx". But probably better just to use '-mavx' or 'march=native', along with -Ofast or -O3.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: