I suggest an alternate approach that would use a training set consisting of real-world data of sudoku puzzles that are in the process of being solved: before a box is filled in, and then after it's filled in. This would teach the network to solve the puzzle one step at a time, instead of all at once.
If you insist on only using the as-received puzzle and the solution for training, my intuition is that you'll need more than 10 layers in the NN, and much, much more data.
A taxonomy of the complex solving strategies that would need to be learned already exists: see e.g. http://www.sudokuwiki.org/y_wing_strategy and the categories "basic", "tough", "diabolical", and "extreme". My guess is that the NN (using the original author's strategy, but with more layers and lots more data) will eventually do well on "basic" and "tough", but will start to miss on some of the harder categories depending on the training set and how the network is set up.
C.f. AlphaGo, which invented new strategies that proved interesting to human players.
Of course, if you're training your NN to solve a real-world problem, you wouldn't choose to train it in this way, you would do as you suggested and feed it as much relevant training data as possible, to give it a head-start.
(Your points re: sizing and quantity of training data sound reasonable and are outside my expertise, so nothing to add there.)
Pick one open square, try a number that is possible for that square, and backtrack if you get stuck.
For better results, pick the open square with the fewest possible choices.
Human solvers might try something more efficient, but the above strategy is not at all bad for a computer implementation.
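The strategy described above fits in a few lines of Python. This is just a minimal sketch (the grid representation and names are my own; 0 marks an empty cell):

```python
def candidates(grid, r, c):
    """Digits that can legally go in cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[br + i][bc + j] for i in range(3) for j in range(3)}
    return [d for d in range(1, 10) if d not in used]

def solve(grid):
    # Pick the open square with the fewest possible choices.
    best = None
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                cands = candidates(grid, r, c)
                if best is None or len(cands) < len(best[2]):
                    best = (r, c, cands)
    if best is None:
        return True              # no open squares left: solved
    r, c, cands = best
    for d in cands:              # try each number, backtrack if stuck
        grid[r][c] = d
        if solve(grid):
            return True
    grid[r][c] = 0
    return False
```

On typical newspaper puzzles, the fewest-choices heuristic keeps the amount of actual backtracking very small.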
It's very much a solved problem with existing technology; adding a neural net to it doesn't help in any way.
I still see current AI as basically a filter: image to text, speech to text, etc.
A smart AI would be able to figure out the rules of sudoku from a very small sample of games, and would then figure out a rudimentary backtracking algorithm. It would then sleep on it, figure out more patterns like naked singles, and build a really fast sudoku solver with 100% accuracy. All of this happening in a matter of seconds.
I don't think optimizing a rat's nest of functions with SGD is that impressive. If a model lacks explanatory power, what use is it to people?
In this case, when the solution from the NN is incorrect, what recourse do you have? The actual solution might be an arbitrary permutation of what the NN gave you, and there is no way to tell which rows and columns would have to be reshuffled to get the actual solution, or even whether a solution exists. The constraints might be inconsistent and you would never know.
I imagine strategically removing numbers from solved puzzles so as to reinforce the neural connections for the solution and filter out possible 'noise'.
You could generate massive datasets that way; it would be pretty easy to generate a few billion pre-images of a solution. I mean, a solution has 81 numbers and a puzzle about 20, and 81 choose 20 is about 5 x 10^18: a few billion billion.
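For a sanity check on that estimate (assuming a completed grid of 81 cells and about 20 givens kept per puzzle):

```python
import math

# Each way of choosing 20 cells to keep from a completed 81-cell grid
# is one pre-image of that solution.
n = math.comb(81, 20)
print(n)  # roughly 4.7 * 10^18
```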
"Can NNs be applied to X class of problems with no human-added domain-specific knowledge?"
I'm not convinced that applying them to areas like this one, where there are clearly much better approaches, teaches us anything useful for more interesting cases, but maybe it does.
NB: it's not immediately clear that the OP's problem is in that class.
Maybe the useful thing here is that Sudoku is a problem space a lot of people understand pretty well, even though it's not trivial. That makes it a nice domain for thinking about and understanding NNs. I'd guess that people who are familiar with solving Sudoku and similar puzzles but don't know much about NNs will find this pretty interesting, while people who already understand NNs well (or aren't familiar with puzzles like Sudoku) won't.
It's not obvious.
The problem that remains is how effective the learning is.
Too late to add a comment to the original.
I don't think it teaches us much about sudoku, but any formerly 'hard' problem that we've solved using traditional means is perfect for rapidly generating arbitrary amounts of training data, which makes it very useful for learning about neural nets.
(I still think it would fail spectacularly on "super-human" sudokus.)
But why would you care about solving a class of problems with NNs when that class of problems already has much better solutions? Too many new programmers are running to NNs out of pure laziness. NNs are like magic. They solve the problem for you so you don't have to learn how to solve it yourself. Or at least that's the enticing promise.
> NNs are like magic. They solve the problem for you so you don't have to learn how to solve it yourself.
Sounds like you answered your own question; am I missing something?
For cases where that's not possible, iterative approaches can be very effective: with not particularly optimized Python, the worst solution time I ever saw was 1 second on a first-gen Eee PC netbook. That run went through about 6 levels of iterative backtracking before it came up with the solution.
Edit: I decided to publish my implementation on GitHub. https://github.com/jcoffland/fsudoku
This is it, uploaded a few years later in a repo for Sudoku over SMS:
    sudoku(Rows) :-    % clause head restored; predicate name assumed
        length(Rows, 9), maplist(same_length(Rows), Rows),
        maplist(row_booleans, Rows, BRows),
        BRows = [As,Bs,Cs,Ds,Es,Fs,Gs,Hs,Is],
        blocks(As, Bs, Cs), blocks(Ds, Es, Fs), blocks(Gs, Hs, Is).

    blocks([], [], []).
    blocks([N1,N2,N3|Ns1], [N4,N5,N6|Ns2], [N7,N8,N9|Ns3]) :-
        % reconstructed: each digit occurs exactly once per 3x3 block
        transpose([N1,N2,N3,N4,N5,N6,N7,N8,N9], Ds),
        maplist(card1, Ds),
        blocks(Ns1, Ns2, Ns3).

    card1(Bs) :- sat(card([1],Bs)).

    row_booleans(Row, Bs) :-
        maplist(cell_boolean, Row, Bs).

    cell_boolean(Num, Bs) :-
        length(Bs, 9),    % one boolean per digit
        element(Num, Bs, 1).
Here is an example Sudoku problem, taken from Donald Knuth's The Art of Computer Programming:
?- knuth(b, Rows),
[1, 2, 3, 4, 5, 6, 7, 8, 9].
[5, 9, 7, 3, 8, 2, 6, 1, 4].
[6, 8, 4, 1, 7, 9, 3, 2, 5].
[9, 5, 6, 2, 3, 1, 8, 4, 7].
[7, 3, 1, 5, 4, 8, 9, 6, 2].
[2, 4, 8, 9, 6, 7, 1, 5, 3].
[8, 7, 9, 6, 2, 5, 4, 3, 1].
[4, 6, 5, 7, 1, 3, 2, 9, 8].
[3, 1, 2, 8, 9, 4, 5, 7, 6].
Acronyms are always a pet peeve of mine. I would know what SAT (Boolean satisfiability) is, since I've spent more than 50% of my life in computer science, but not many people would understand the acronym. So, in general, it would be great to mention what SAT refers to rather than just using the acronym: spell out the long form the first time, and then use the short form afterwards.
Sorry if this came across as too confrontational; it's just a personal observation. I saw the same thing a couple of days ago with GRUs on another thread too.
Also, I set writing a SAT solver as an intro-to-C practical; it can easily solve any real-world Sudoku instance.
Now, neither will be that good at the job it's made for. But that is far from unique to this field. Consider: a car/bike/house/whatever is not exactly hard to build for backyard use. Getting to world class, on the other hand, escalates to difficult very quickly.
Also, saying that NNs are easy with a library like TensorFlow is akin to saying that Sudoku solvers are easy with Prolog. Of course they are! The library/language does the vast majority of the heavy lifting.
Also, in my personal experience, SAT is currently applicable to more problems than Neural Networks (although that is changing as Neural Networks get applied to more things). You can use SAT solvers to solve any problem in NP (well, other things as well but it rapidly gets painful). That's a lot of problems!
A neural network isn't simpler in any sense of the word. You might as well throw a simulated annealer at the problem.
Great idea! I'm sure D-WAVE would be happy to work with you on a quantum sudoku solver.
That is, the "exact cover" nature of Sudoku is not immediately obvious to everyone. At least, it wasn't obvious to me. Seeing how quickly you can map it to that and then get a solution was a lot of fun and ridiculously educational.
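For anyone who wants to see that mapping, here is a rough Python sketch of the exact-cover formulation (the names and representation are my own): each of the 729 candidate placements (row, column, digit) covers exactly four constraint columns, and a solved grid is a set of candidates covering all 324 columns exactly once. Actually solving the cover problem, e.g. with Knuth's dancing links, is left out.

```python
def sudoku_cover():
    """Map each candidate (row, col, digit) to the constraints it covers."""
    rows = {}
    for r in range(9):
        for c in range(9):
            for d in range(9):
                b = (r // 3) * 3 + c // 3
                rows[(r, c, d + 1)] = [
                    ("cell", r, c),   # each cell holds exactly one digit
                    ("row",  r, d),   # each digit once per row
                    ("col",  c, d),   # each digit once per column
                    ("box",  b, d),   # each digit once per 3x3 box
                ]
    return rows

cover = sudoku_cover()
print(len(cover))                                      # 729 candidates
print(len({col for cols in cover.values() for col in cols}))  # 324 constraints
```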
That is, yes, SAT solvers have a lot of heuristics to speed them up. But, almost by definition, we understand those heuristics and so can reason about the answers they give. NNs, however, are notorious for being incredibly opaque. They give us probabilistic answers, but my understanding is that we can only trust them probabilistically and can't explain their answers.
http://ravimohan.blogspot.se/2007/04/learning-from-sudoku-so... - enjoy :D
As a concrete example, say you misplace a '2' somewhere within a given puzzle. Obviously, this cell is incorrect. But depending on the nature of what the NN has learned, it may believe the row (resp. column, 3x3 box) already has its '2', and so try to fill the '2''s correct spot with another number. That in turn leads the column and/or 3x3 box of that cell to take an incorrect value, starting the process over again.
This same phenomenon can be seen in the game Kenken; depending on the strategies you use at any given point in the game, one mistake can propagate outward pretty quickly and spoil large sections of the puzzle.
I thought it was interesting that the "hard" category seemed to have more wrong answers than the 2 harder categories. Maybe there just aren't enough samples, but in the data shown in the table, it seems like a big difference.
No need to do that. It would suffice to just include the random seed in the script so that its results are reproducible.
There was some interesting work a while ago that solved sudoku using spiking neural networks, too (Fig. 5), with some nice associated theory.
The paper which showed that the general case of Sudoku is NP-complete was apparently this one (linked from Wikipedia).
But you don't. Even what you might call "brute force" Sudoku solvers start from the given hints in the puzzle, which reduce the search space considerably. Given a set of Sudoku hints with a unique solution, you won't need "years or eons" with even a very simple solver.
A normal "very simple" solver of the kind you're talking about will solve the puzzle very quickly, in seconds or less, and it is also not a constant-time algorithm.
Generating all possible boards is a little more complex, but I can hardly see how it could take "eons": with a proper implementation (Apache Spark :P) and reasonably powerful hardware (a 1000-core CPU cluster, ha! ha! ha!) it should run in under a day and take just a little over 13 TB of disk/RAM space :).
The paper linked to in the OP's article says: "Due to the sheer number of sudoku solution grids a brute force search would have been infeasible, but we found a better approach to make this project possible. Our software for exhaustively searching through a completed sudoku grid, named checker, was originally released in 2006. However, this first version was rather slow. Indeed, the paper  estimates that our original checker of late 2006 would take over 300,000 processor-years in order to search every sudoku grid."
The estimate in the comment (9!*9!) seems to be implying a simple enumeration, not a complex strategy of symmetry-folding. But even if you do reduce the enumeration, the authors of that paper say their software requires 800 CPU years. I'm not making any claims about whether getting that down to a day might be possible, but I wish you good luck. By all means, show everyone how to do it with a proper implementation and a large cluster! ;)
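For scale, a quick check of that 9!*9! figure against the actual number of completed 9x9 grids reported by Felgenhauer and Jarvis (about 6.67 x 10^21):

```python
import math

estimate = math.factorial(9) ** 2   # the 9!*9! figure: 131681894400
actual = 6670903752021072936960     # completed 9x9 sudoku grids (Felgenhauer & Jarvis)
print(actual / estimate)            # ~5e10: the simple estimate is far too low
```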
Anyway, we are in agreement that enumerating all boards would be quite inefficient.
I'm also not saying that this is true for all Sudoku solvers, since you could write one that sometimes just decides to uselessly burn cycles for a few centuries before returning a correct solution.
Here's that simple argument again: Sudoku can be solved in time at most exponential in the input size. The input size of 9x9 Sudoku is constant. The exponential of that constant is a constant. Not a very interesting argument? I agree, so I'm dropping this now.
And apparently quantum computers can solve them using sheer brute force alone, but these articles don't go into enough detail (i.e., can they solve the 'evil' puzzles?).
Some of these are the hardest possible puzzles, with a solution, for the size.
It needed only 150 tries to solve the hard puzzle.
Maybe an NN for Sudoku should use AlphaGo type of architecture.