
Overkilling the 8-queens Problem - AndyKelley
http://davidad.github.io/blog/2014/02/25/overkilling-the-8-queens-problem/
======
colmmacc
People often think that the N-queens problem is just an abstract exercise,
good sample work that stretches the mind. But when we were building Route 53,
we actually used the N-queens problem; it's part of how we assign Shuffle
Shards:

    
    
         https://github.com/awslabs/route53-infima#shuffleshard
    

The solution we included in Infima is a recursive search with backtracking
because of some other needs;

    
    
         http://git.io/332VLA
    

but it's the N-queens problem, in a real practical setting.

~~~
applecore
Clickable:

[https://github.com/awslabs/route53-infima#shuffleshard](https://github.com/awslabs/route53-infima#shuffleshard)

[http://git.io/332VLA](http://git.io/332VLA)

------
mjn
As an asm tutorial / case study, I found this neat. As for the conclusion, I
found it potentially neat, but found myself wondering: cool, but how much of
a win is all this special-cased effort versus some kind of off-the-shelf
constraint solver written in portable code? It'd be interesting to see a
benchmark so we could have an idea whether we're talking about a 50% win, an
orders-of-magnitude win, etc.

To get a rough baseline for what a generic solution could take, I took nqueens
in Answer-Set Programming [1], set n=8, and ran it through clingo [2] on an
Amazon micro instance, which gets me 15ms.

Another interesting question might be whether the two solutions have different
scaling properties as n increases. The ASP solution scales like this
(averaging a few runs and rounding):

    
    
       n   time
       8     15ms
      10     19
      15     50
      20    130
      25    290
      30    600
      35   1050ms [1.05s]
    

[1]
[http://www.hakank.org/answer_set_programming/nqueens.lp](http://www.hakank.org/answer_set_programming/nqueens.lp).
This encoding can be optimized a bit as well, but is a decent starting point.

[2] [http://potassco.sourceforge.net](http://potassco.sourceforge.net), a
menagerie of logic-programming tools, written in C++. 'gringo' takes a high-
level encoding and turns it into something more SAT-like, 'clasp' solves the
SAT-like (variable-free) programs, and 'clingo' staples the two steps together
into one binary. I used the version in Debian stable package 'gringo', which
reports itself as "clingo 3.0.4 (clasp 1.3.10)"

~~~
mzl
N-queens, while quite a cute problem, is typically not that interesting as a
constraint programming benchmark. The way to make it fast is to have a bit-set
representation of the domains, which is not the best representation for many
other important cases.

I tested the standard Gecode ([http://www.gecode.org](http://www.gecode.org))
implementation from version 4.2.1 without any special options on a MacBook Pro
with a 2.6GHz Intel Core i7. Note that Gecode does not have bit-set
representations for domains, so it is not optimized for this special case.
Tests were run for 500 iterations and repeated 5 times to get deviations below
2.5%. I get the following timings for your sizes plus 50 and 100:

    
    
        n   time
        8     0.0368 ms
       10     0.0243 ms
       15     0.0221 ms
       20     0.1263 ms 
       25     0.2677 ms
       30     0.1621 ms 
       35     0.2433 ms 
       50     2.3234 ms 
      100     0.5509 ms
    

The timings are of course all strongly dependent on the heuristic used for
finding the solution, which is shown by the varying times above.

Constraint programming is typically also very dependent on the pruning
power of the constraints. Adding the option "-icl dom", which makes the
pruning for the all-different constraints domain consistent, gives the
following timings.

    
    
        n   time
        8     0.1138 ms
       10     0.0603 ms
       15     0.0879 ms
       20     0.2801 ms 
       25     0.6612 ms
       30     0.4657 ms 
       35     0.7553 ms  
       50     2.7687 ms 
      100     8.7687 ms
    

As can be seen, domain consistency did not actually help us in this case time-
wise. The number of search-nodes visited range between 17 and 1066 for the
normal consistency level, while the number of nodes visited for the domain
consistent version range between 14 and 286. The reduction in search tree size
was however offset by the increased complexity of the propagator.

~~~
mzl
As an update I saw that the original post was about finding all the solutions
to the problem, compared to the single first solution searched for above.
Running for finding all solutions, I got the following timings (same computer
as above, same settings, default consistency).

    
    
        n   time           search tree  solutions
        8      0.8752 ms       767          92
       10     14.2759 ms     11431         724
       12    306.7616 ms    232163       14200
       14   8782.4620 ms   6391931      365596
    

Downloading the c-versions from
[https://github.com/davidad/8queens/tree/%2Bc_comparison](https://github.com/davidad/8queens/tree/%2Bc_comparison)
gives a reference time, using the `time` built-in, of around 750 ms for 10k
iterations. This means the C solver takes around 0.075 ms per iteration,
which is a factor of ten faster than the Gecode version. That is not
unreasonable given that the Gecode version is a high-level model in a general
framework.

I did not manage to build the assembly-version unfortunately.

EDIT: Updated the timings and comments for the C solver to reflect that it
actually solved the problem 10k times. Lesson: never do benchmarks quickly :-)

------
adwhit
I am said Hacker Schooler who issued the challenge. I thought I'd done well
with a Rust solution that ran in 42ms [0].

I was well and truly pwned, as they say (but at least I got some sleep last
night).

[0]
[http://rosettacode.org/wiki/N-queens_problem#Rust](http://rosettacode.org/wiki/N-queens_problem#Rust)

~~~
dbaupp
Looks cool!

BTW, the convention would be `static SIZE: i8 = 8;` (i.e. upper case), and
I would strongly recommend you don't use `i8` for that, because SIZE * SIZE =
0 (but not in const-exprs, it seems). `uint` is the smallest type that is
guaranteed to work for all array indexing.

Also, you can use `bool` instead of i8 for your array elements. (A Rust bool
is a u8/i8 that is guaranteed to be either 0 or 1, i.e. exactly what your code
has.)

Lastly, one conventionally matches on just `None` rather than `None()`. (This
actually led me to file
[https://github.com/mozilla/rust/issues/12560](https://github.com/mozilla/rust/issues/12560)
to make it require that there be no `()`, so thank you for prompting that.)

------
JoeAltmaier
Beautiful! I derived this solution back in '81 when Byte Magazine had a
contest. It took 20 minutes to run in BASIC on an HP 2000.

------
deutronium
That's a really fascinating article! I'd love to see how a C implementation
would match up to that.

As an aside, there seem to be lots of really interesting concepts involved
in implementing chess engines.

From representing the chessboard in a highly compact manner through the use of
bitboards
([http://www.frayn.net/beowulf/theory.html#bitboards](http://www.frayn.net/beowulf/theory.html#bitboards))
to searching for 'best' possible moves.

~~~
bluecalm
My old code: [http://codetidy.com/8255/](http://codetidy.com/8255/)

windows.h is for QueryPerformanceCounter used only in main(); compile with gcc
-Ofast

It iterates/counts all solutions + prints the first of them (as a 01 board) in
0.23ms on my i7 (if you redirect output to a file). You can also run it for N
queens by changing #define Qs to something bigger. The code, as you can see,
is silly (an unneeded conditional in the main loop, among other things) and
not optimized, but it's still pretty fast :) I am not sure what hardware they
used, but it takes 10x less time on my computer vs theirs.

~~~
davidad_
Hi bluecalm! I tweaked both your code and mine to uselessly solve the problem
10000 times, and not pretty print anything, so the difference in raw 8-queens-
solving speed is readily apparent. Here are the results:

    
    
        bash-3.2$ make
        gcc -O3 8q_C_bluecalm.c -o 8q_C_bluecalm
        nasm 8q_x64_davidad.asm -DLOOPED=10000 -f macho64 -o 8q_x64_davidad.o
        ld -o 8q_x64_davidad 8q_x64_davidad.o
    
        time ./8q_C_bluecalm  ; echo $?
    
        real	0m0.802s
        user	0m0.801s
        sys 	0m0.001s
        92
    
        time ./8q_x64_davidad ; echo $?
    
        real	0m0.113s
        user	0m0.112s
        sys 	0m0.000s
        92
    

If you'd like, I'm happy to post this as a branch of my repository so others
can replicate. (I only hesitate because it's a derivative work of your code.)

~~~
bluecalm
Hi, I am happy to see that asm magic gains over naive code :) Feel free to
host/use the code in any way you like!

It will be a nice addition to see how hand-tuned code beats a naive C
implementation. (I hope your code iterates over all solutions as well; the
easiest way to change mine to stop after the 1st is a goto from solve to main :))

EDIT: Yeah, I meant to ask whether it counts all of them, and I've just
noticed the 92 exit code :) I didn't expect that big a speed-up, impressive!

EDIT2: to answer the child post: That's great, please change -O3 to -Ofast
though if you are using a new GCC. It shouldn't matter much but it sometimes
does; I believe -march=native won't help with the generated assembly. It
seems that your code from the original article spent all its time in pretty
printing. I am happy to see some real comparison now; I am sure I will learn
a lot here once I understand everything in the article.

~~~
davidad_
Here [1] is the comparison repository, for anyone interested.

[1]
[https://github.com/davidad/8queens/tree/%2Bc_comparison](https://github.com/davidad/8queens/tree/%2Bc_comparison)

EDIT: I don't seem to have -Ofast, but I added a note to the readme to ask
people to try it. I also added the recommended flags for Sandy Bridge, i.e.
-march=core2 -msse4.1 -msse4.2.

~~~
anonymoushn
You can go at least twice as fast if you use symmetries to prune the space. I
messed with the C version:

    
    
      $ make
      gcc -O3 -march=core2 -msse4.1 -msse4.2 8q_C_bluecalm.c -o 8q_C_bluecalm
      time ./8q_C_bluecalm  ; echo $?
      
      real	0m0.376s
      user	0m0.373s
      sys	0m0.001s
      92
      time ./8q_x64_davidad ; echo $?
      
      real	0m0.129s
      user	0m0.128s
      sys	0m0.001s
      92
    

Of course this doesn't matter if the point is to compare the best asm DFS to
the best C DFS of the same space, but if the point is to make the best asm
8-queens solver, you can get down below 6μs like this :)

Edit: Using C++ templates to inline the DFS also helps

    
    
      time ./8q_C_bluecalm  ; echo $?
      
      real	0m0.260s
      user	0m0.258s
      sys	0m0.001s
      92
    

Unfortunately this sort of technique doesn't really apply to the asm >_>

Edit again: I tried to go a bit further by replacing the board struct with
your method of passing 3 bytes between levels of the dfs, but at that point
the compiler was able to optimize out the whole program.

Removing the templates causes the compiler to actually emit a dfs, which by
now is faster than the asm (!). Removing the 2x speedup from just not looking
at half of the tree slows us back down to ~1.1x the runtime of the asm:

    
    
      time ./8q_C_bluecalm  ; echo $?
      
      real	0m0.140s
      user	0m0.138s
      sys	0m0.001s
      92
    

Here's this last solution, the one that's ~1.1x slower than the asm on my
machine: [http://pastie.org/8784768](http://pastie.org/8784768)

And with the free 2x speedup:
[http://pastie.org/8784770](http://pastie.org/8784770)

One last time now: I tried getting rid of the x arg and using a global. This
was much slower, so I tried getting rid of the SOLUTIONS global and using a
return value. This caused the compiler to optimize out the whole program again
:(

~~~
davidad_
Wow! I like your C solution. I'm glad that at least the speedup is still
measurable, but it's a close shave.

I implemented your "free 2x speedup" in asm [1]. Sure enough, solution time
improved to 4.6us on my machine [2].

[1]
[https://github.com/davidad/8queens/commit/61119a7f0019e4a85e...](https://github.com/davidad/8queens/commit/61119a7f0019e4a85ee7e9454d6a4b4503fbaeb2)
[2]
[https://github.com/davidad/8queens/tree/free-2x-speedup](https://github.com/davidad/8queens/tree/free-2x-speedup)

------
userbinator
This is the sort of thing that really shows why someone who knows Asm well can
beat even the best compiler - because high-level languages, in hiding the
details of the machine under layers of abstraction, are also hiding those
details that make it possible to write efficient code. Examples include using
the flags and the stack in "unusual" (to HLLs) ways, no requirement to conform
to any calling conventions, etc.

That said, the size optimiser in me sees a noticeable absence of rbx, rsi,
rdi, and rbp; those registers should be used before going into r8~r15,
since the "extended" registers require a prefix byte on the instruction every
time they're used. In fact, since only r8, r10, r13, r14, and r15 are used,
this code could be rewritten to use only one extended register, and you
wouldn't need the "mov rdi, r8" at the end.

~~~
nkurz
_those registers should be used before going into the r8~r15, since the
"extended" registers require a prefix byte to the instruction every time
they're used._

It's hard to argue against "shorter", but since the modern processors this
code targets all have a decoded instruction micro-op cache, there is no front-
end bottleneck and the instruction length doesn't actually slow anything down.

Also, the four registers you mention are all callee-save on Windows (just RBX
and RBP for Mac and Linux), so avoiding them means you don't have to save and
restore them. Given a choice, use a scratch register. That said, R12-R15 are
callee-save too, so this may not apply here.

Personally, I dislike the messiness of the historical names when using them
for non-historical purposes. I think if it were more acceptable to alias them
to R1, R2, etc., I'd like them more than I do.

~~~
userbinator
_but since the modern processors this code targets all have a decoded
instruction micro-op cache, there is no front-end bottleneck and the
instruction length doesn't actually slow anything down._

Right, this code is running entirely in the core and that's part of what makes
it so fast.

_Also, the four registers you mention are all callee-save on Windows (just
RBX and RBP for Mac and Linux), so avoiding them means you don't have to
save and restore them._

Callee-save and caller-save are just a calling convention; and one of the
advantages of using Asm is you don't have to care about calling conventions
except if you're interfacing with some other language, which isn't the case
here - it's pure Asm. There's not even a single function call in it.

Besides, saving and restoring those registers (if you really need to) only
takes 4 bytes each (2 in 32-bit mode) - a push and a pop. This tradeoff pays
off if you're going to use them in more than 4 instructions.

------
curiousDog
Would it have been easier to code it in C, compile with -O3 and then further
tweak the assembly?

~~~
davidad_
The semantics of C map poorly onto this kind of code. The types of decisions
that need to be made here, like combining certain state values into single
registers so they can be operated on in parallel, and the particular structure
of branching, would be difficult to make post-hoc to a C-style program.

If you're excited about it, though, I'd encourage you to give it a shot and
write a blog post about your experience! However hard it turns out to be, I
bet we'd all learn something new. (:

------
dfc
I do not think I have ever seen so many stylistic/rare/historic ligatures used
on a web page. Does anyone know if something changed in Iceweasel Aurora? Is
this something that can be turned on in CSS?

UPDATE: If I allow JavaScript, the web page is rendered using some Google
fonts which probably do not provide any ligatures. However, if I do not allow
JS and Iceweasel uses my default serif font (Minion Pro), the page is littered
with all sorts of contextual/historic/stylistic ligatures.

~~~
TheLoneWolfling
text-rendering: optimizelegibility

~~~
dfc
Thanks. I think I might have to rethink my choice of default fonts. It seems
that Iceweasel uses all available ligatures. I have to use fontspec to
explicitly turn on the rare/obscure ligatures in LaTeX. Hopefully Mozilla will
give a little thought to enabling all ligatures by default.

Examples: [https://imgur.com/a/7Fcdm](https://imgur.com/a/7Fcdm)

~~~
bzbarsky
I think you misunderstand. By default, Gecko enables just the common
ligatures. But this page has the following styles:

    
    
      div {
        text-rendering: optimizeLegibility;
        font-variant-ligatures: common-ligatures;
        -webkit-font-variant-ligatures: common-ligatures;
        -moz-font-feature-settings: "liga", "dlig";
        -ms-font-feature-settings: "liga", "dlig";
        -webkit-font-feature-settings: "liga", "dlig";
        -o-font-feature-settings: "liga", "dlig";
        font-feature-settings: "liga", "dlig";
      }
    

The text-rendering style enables optional ligatures. The font-variant-
ligatures style enables the use of OpenType "liga" and "clig" ligatures, but
then the font-feature-settings style overrides that with "liga" and "dlig"
(discretionary ligatures). What that means depends on how your font designer
flagged their various ligatures, of course, but the page is explicitly opting
into more than just common ligatures.

~~~
davidad_
I may have been overzealous here. The page also provides Garamond in various
OpenType and webfont styles, and I wanted to be sure that the Garamond is
displayed with ligatures, because Garamond's ligatures are really nice and not
particularly noticeable (whereas I notice the _lack_ of ligatures right away).
I didn't really give thought to what happens if the page is being rendered in
a different OpenType font for some reason. Perhaps I should replace "dlig"
with "clig".

~~~
dfc
I am not sure why you think you should just replace dlig with clig. Did you
want the discretionary ligatures? Which of the contextual ligatures do you
want? In my opinion, it is the discretionary and the historic ligatures that
are distracting and gaudy. If you cannot think of the specific ligatures you
want from the discretionary/contextual/historic sets, why not just stick with
the basics provided in liga?

This is going to sound crazy: does anyone know the name of the js/css
development website that has three or four panes and lets you mock things up
on screen? There is probably more than one. I cannot think of the name right
now for the life of me. I always see it linked when people are doing
demonstrations here on HN.

I put up an image here of your defaults with your fonts and mine:

[https://imgur.com/a/x44Wq](https://imgur.com/a/x44Wq)

My default font is Minion Pro and you can see the extra discretionaries in
words like "_sp_urred, dire_ct_ed, te_st_ing, etc." The one goofy CSS test
that I found did not let me choose any web fonts, so I could not check your
Garamond, but I did check my local Garamond Pro, and all of the same
discretionary ligatures are present in it.

------
djvu9
I feel that nowadays, to meaningfully talk about the performance of the
N-queens problem, N should start at 18 instead of 8. Here is a C++ solution
in 40 lines of code that can solve 18-queens in less than a minute on a laptop:

[https://gist.github.com/jdeng/7915749](https://gist.github.com/jdeng/7915749)

~~~
IanCal
Are you calculating the total number of solutions, or just finding a solution?
The latter has a closed-form solution:
[https://gist.github.com/IanCal/1858601](https://gist.github.com/IanCal/1858601)

That'll do ~N=1e6 in a second (I'm sure there are lots of optimisations that
could be done in it though), or solutions for N=4 to N=1000 in less than 10.

The more complex problem is finding out how many solutions there are in total.

~~~
djvu9
Counting all the solutions.
[http://en.wikipedia.org/wiki/Eight_queens_puzzle#Counting_so...](http://en.wikipedia.org/wiki/Eight_queens_puzzle#Counting_solutions)

------
IbJacked
Hard to believe I wrote anything useful back in the day, having just an A, X,
and Y.

~~~
wglb
These new instruction sets look like a serious luxury, don't they?

~~~
davidad_
They truly, truly are.

------
zurn
It would be interesting to hear from a compiler guy why it's hard to produce
this sort of code on x86 when compiling from C or OpenCL. Bonus points for a
GPU version :)

------
mar77i
Reminds me of how I taught the C64 one of Wolfram's 1D automata (as in
[http://www.mathpuzzle.com/MAA/42-From%20Keyfobs%20to%20Ringt...](http://www.mathpuzzle.com/MAA/42-From%20Keyfobs%20to%20Ringtones/CARule600116.gif)
) by relying heavily on lookup tables and bit shifting, which got the
calculation time down to about 30s for the whole 320x200 hires screen. Fun times.

------
Patient0
This looks like a great blog. The previous entry on virtual memory is also
pretty informative.

------
plumeria
Fellow wikipedians: Curiously, there is no English Wikipedia entry for MCPL.

~~~
muglug
Yes – I don't think it ever served much purpose beyond academic papers. The
referenced paper's author, Martin Richards, never mentioned it when he
supervised me. Another language of his, BCPL, is here:
[http://en.wikipedia.org/wiki/BCPL](http://en.wikipedia.org/wiki/BCPL)

Edit: Turns out he created it after I graduated. See here:
[http://www.cl.cam.ac.uk/~mr10/MCPL.html](http://www.cl.cam.ac.uk/~mr10/MCPL.html)
for more info

