Hacker News new | comments | show | ask | jobs | submit login

My old code: http://codetidy.com/8255/

windows.h is for QueryPerformanceCounter used only in main(); compile with gcc -Ofast

It iterates/counts all solutions + prints first of them (as 01 board) in 0.23ms on my i7 (if you redirect output to file). You can also run it for N queens changing #define Qs to something bigger. The code as you see is silly (not needed conditional in main loop among other things) and not optimized, still it's pretty fast :) I am not sure what hardware they used but it takes 10x less time on my computer vs theirs.




Hi bluecalm! I tweaked both your code and mine to uselessly solve the problem 10000 times, and not pretty print anything, so the difference in raw 8-queens-solving speed is readily apparent. Here are the results:

    bash-3.2$ make
    gcc -O3 8q_C_bluecalm.c -o 8q_C_bluecalm
    nasm 8q_x64_davidad.asm -DLOOPED=10000 -f macho64 -o 8q_x64_davidad.o
    ld -o 8q_x64_davidad 8q_x64_davidad.o

    time ./8q_C_bluecalm  ; echo $?

    real	0m0.802s
    user	0m0.801s
    sys 	0m0.001s
    92

    time ./8q_x64_davidad ; echo $?

    real	0m0.113s
    user	0m0.112s
    sys 	0m0.000s
    92
If you'd like, I'm happy to post this as a branch of my repository so others can replicate. (I only hesitate because it's a derivative work of your code.)


Hi, I am happy to see that asm magic gains over naive code :) Feel free to host/use the code in any way you like!

It will be nice addition to see how hand tuned code beats naive C implementation. (I hope your code iterates over all solutions as well, the easiest way to change mine to stop after 1st is a goto from solve to main :))

EDIT: Yeah, I meant if it counts all of them and I've just noticed 92 exit code :) I didn't expect that big a speed-up, impressive!

EDIT2: to answer children post: That's great, please change -O3 to -Ofast though if you are using new GCC. It shouldn't matter much but it sometimes does; I believe -march=native won't help with generated assembly though I seems that your code from original article spent all its time in pretty printing. I am happy to see some real comparison now; I am for sure learning a lot here once I understand everything in the article.


It counts up every solution, thus the "92" exit code. However, the non-pretty-printing version does take some shortcuts - it can't reconstruct the entire current board state in O(1) time from its actual register state. So I'm not sure exactly what you mean by "iterate over all solutions". (It certainly doesn't bail out when it finds the first solution, if that's your question.)


Here [1] is the comparison repository, for anyone interested.

[1] https://github.com/davidad/8queens/tree/%2Bc_comparison

EDIT: I don't seem to have -Ofast, but I added a note to the readme to ask people to try it. I also added the recommended flags for Sandy Bridge, i.e. -march=core2 -msse4.1 -msse4.2.


You can go at least twice as fast if you use symmetries to prune the space. I messed with the C version:

  $ make
  gcc -O3 -march=core2 -msse4.1 -msse4.2 8q_C_bluecalm.c -o 8q_C_bluecalm
  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.376s
  user	0m0.373s
  sys	0m0.001s
  92
  time ./8q_x64_davidad ; echo $?
  
  real	0m0.129s
  user	0m0.128s
  sys	0m0.001s
  92
Of course this doesn't matter if the point is to compare the best asm dfs to the best C dfs of the same space, but if the point is to make the best asm 8queens solver, you can get down below 6μs like this :)

Edit: Using C++ templates to inline the DFS also helps

  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.260s
  user	0m0.258s
  sys	0m0.001s
  92
Unfortunately this sort of technique doesn't really apply to the asm >_>

Edit again: I tried to go a bit further by replacing the board struct with your method of passing 3 bytes between levels of the dfs, but at that point the compiler was able to optimize out the whole program.

Removing the templates causes the compiler to actually emit a dfs, which by now is faster than the asm (!). Removing the 2x speedup from just not looking at half of the tree slows us back down to ~1.1x the runtime of the asm:

  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.140s
  user	0m0.138s
  sys	0m0.001s
  92
Here's this last solution, the one that's ~1.1x slower than the asm on my machine: http://pastie.org/8784768

And with the free 2x speedup: http://pastie.org/8784770

One last time now: I tried getting rid of the x arg and using a global. This was much slower, so I tried getting rid of the SOLUTIONS global and using a return value. This caused the compiler to optimize out the whole program again :(


Wow! I like your C solution. I'm glad that at least the speedup is still measurable, but it's a close shave.

I implemented your "free 2x speedup" in asm [1]. Sure enough, solution time improved to 4.6us on my machine [2].

[1] https://github.com/davidad/8queens/commit/61119a7f0019e4a85e... [2] https://github.com/davidad/8queens/tree/free-2x-speedup


It seems to me that using a bitboard instead of the struct is the biggest gain. I like your version, very elegant and terse.


Thanks for putting this together!

My immediate question was whether another compiler might be more competitive with the assembly. On my Sandy Bridge processor, I found that although GCC was the slowest, the assembly was still the clear winner:

  gcc 4.8.0: 0.801596113 seconds
  icc 14.0.1: 0.739297534 seconds
  clang 3.2: 0.706446818 seconds
  assembly: 0.104038212 seconds
I was surprised by Clang here. It's an older version than the other two, yet fastest. A quick glance at 'perf stat' (which I used for the timing) says that although it's executing fewer instructions per cycle, it's managing to use fewer instructions than the other two.

Although it doesn't seem to make much of a difference here, those are odd flags for Sandy Bridge. Core2 is a previous generation, and it supports AVX which came out after SSE4.2. If you did to specify SB, you'd want the unwieldy "-march=corei7-avx". But probably better just to use '-mavx' or 'march=native', along with -Ofast or -O3.


I couldn't resist the challenge, so I wrote a non-recursive solution [0] in C++. It takes about 0.73s on an E5-4650 to loop 10000 times. Since writing it I've tried testing it against bluecalm's solution and how they compare seems to be fairly sensitive to the particular compiler and hardware combination. But it's at least a slightly different take.

[0] http://pastebin.com/EciqQBg5




Applications are open for YC Winter 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: