windows.h is for QueryPerformanceCounter used only in main(); compile with gcc -Ofast
It iterates/counts all solutions + prints first of them (as 01 board) in 0.23ms on my i7 (if you redirect output to file). You can also run it for N queens changing #define Qs to something bigger.
The code as you see is silly (not needed conditional in main loop among other things) and not optimized, still it's pretty fast :) I am not sure what hardware they used but it takes 10x less time on my computer vs theirs.
Hi bluecalm! I tweaked both your code and mine to uselessly solve the problem 10000 times, and not pretty print anything, so the difference in raw 8-queens-solving speed is readily apparent. Here are the results:
bash-3.2$ make
gcc -O3 8q_C_bluecalm.c -o 8q_C_bluecalm
nasm 8q_x64_davidad.asm -DLOOPED=10000 -f macho64 -o 8q_x64_davidad.o
ld -o 8q_x64_davidad 8q_x64_davidad.o
time ./8q_C_bluecalm ; echo $?
real 0m0.802s
user 0m0.801s
sys 0m0.001s
92
time ./8q_x64_davidad ; echo $?
real 0m0.113s
user 0m0.112s
sys 0m0.000s
92
If you'd like, I'm happy to post this as a branch of my repository so others can replicate. (I only hesitate because it's a derivative work of your code.)
Hi,
I am happy to see that asm magic gains over naive code :)
Feel free to host/use the code in any way you like!
It will be nice addition to see how hand tuned code beats naive C implementation.
(I hope your code iterates over all solutions as well, the easiest way to change mine to stop after 1st is a goto from solve to main :))
EDIT:
Yeah, I meant if it counts all of them and I've just noticed 92 exit code :)
I didn't expect that big a speed-up, impressive!
EDIT2:
to answer children post:
That's great, please change -O3 to -Ofast though if you are using new GCC. It shouldn't matter much but it sometimes does; I believe -march=native won't help with generated assembly though
I seems that your code from original article spent all its time in pretty printing. I am happy to see some real comparison now; I am for sure learning a lot here once I understand everything in the article.
It counts up every solution, thus the "92" exit code. However, the non-pretty-printing version does take some shortcuts - it can't reconstruct the entire current board state in O(1) time from its actual register state. So I'm not sure exactly what you mean by "iterate over all solutions". (It certainly doesn't bail out when it finds the first solution, if that's your question.)
EDIT: I don't seem to have -Ofast, but I added a note to the readme to ask people to try it. I also added the recommended flags for Sandy Bridge, i.e. -march=core2 -msse4.1 -msse4.2.
You can go at least twice as fast if you use symmetries to prune the space. I messed with the C version:
$ make
gcc -O3 -march=core2 -msse4.1 -msse4.2 8q_C_bluecalm.c -o 8q_C_bluecalm
time ./8q_C_bluecalm ; echo $?
real 0m0.376s
user 0m0.373s
sys 0m0.001s
92
time ./8q_x64_davidad ; echo $?
real 0m0.129s
user 0m0.128s
sys 0m0.001s
92
Of course this doesn't matter if the point is to compare the best asm dfs to the best C dfs of the same space, but if the point is to make the best asm 8queens solver, you can get down below 6μs like this :)
Edit: Using C++ templates to inline the DFS also helps
time ./8q_C_bluecalm ; echo $?
real 0m0.260s
user 0m0.258s
sys 0m0.001s
92
Unfortunately this sort of technique doesn't really apply to the asm >_>
Edit again: I tried to go a bit further by replacing the board struct with your method of passing 3 bytes between levels of the dfs, but at that point the compiler was able to optimize out the whole program.
Removing the templates causes the compiler to actually emit a dfs, which by now is faster than the asm (!). Removing the 2x speedup from just not looking at half of the tree slows us back down to ~1.1x the runtime of the asm:
time ./8q_C_bluecalm ; echo $?
real 0m0.140s
user 0m0.138s
sys 0m0.001s
92
Here's this last solution, the one that's ~1.1x slower than the asm on my machine: http://pastie.org/8784768
One last time now: I tried getting rid of the x arg and using a global. This was much slower, so I tried getting rid of the SOLUTIONS global and using a return value. This caused the compiler to optimize out the whole program again :(
My immediate question was whether another compiler might be more competitive with the assembly. On my Sandy Bridge processor, I found that although GCC was the slowest, the assembly was still the clear winner:
I was surprised by Clang here. It's an older version than the other two, yet fastest. A quick glance at 'perf stat' (which I used for the timing) says that although it's executing fewer instructions per cycle, it's managing to use fewer instructions than the other two.
Although it doesn't seem to make much of a difference here, those are odd flags for Sandy Bridge. Core2 is a previous generation, and it supports AVX which came out after SSE4.2.
If you did to specify SB, you'd want the unwieldy "-march=corei7-avx". But probably better just to use '-mavx' or 'march=native', along with -Ofast or -O3.
I couldn't resist the challenge, so I wrote a non-recursive solution [0] in C++. It takes about 0.73s on an E5-4650 to loop 10000 times. Since writing it I've tried testing it against bluecalm's solution and how they compare seems to be fairly sensitive to the particular compiler and hardware combination. But it's at least a slightly different take.
windows.h is for QueryPerformanceCounter used only in main(); compile with gcc -Ofast
It iterates/counts all solutions + prints first of them (as 01 board) in 0.23ms on my i7 (if you redirect output to file). You can also run it for N queens changing #define Qs to something bigger. The code as you see is silly (not needed conditional in main loop among other things) and not optimized, still it's pretty fast :) I am not sure what hardware they used but it takes 10x less time on my computer vs theirs.