
Deep learning for chess - mlla
http://erikbern.com/?p=841
======
thomasahle
As the author of sunfish, I'd like to point out something about learning in
chess, which is a topic that interests me a great deal.

When sunfish (with just 111 Python lines of chess logic) 'evaluates' a
position, it uses perhaps the simplest known effective method: a piece-square
table. The method ignores any interplay between pieces on the board and
calculates the sum of _Table[coord][type]_ for each piece on the board. E.g. a
white knight on a1 may be worth 73 'units', and on f3 it may be worth 98
'units'. That's all there is to it. _Any program with greater precision than
this, and equally precise searching, should be able to beat sunfish_.
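As a concrete illustration, a piece-square evaluation is just a table lookup and a sum. A minimal sketch in Python (the values here are invented for illustration, not sunfish's actual tables):

```python
# Sketch of piece-square-table evaluation. Values are made up;
# sunfish's real tables differ.
PST = {
    'N': {'a1': 73, 'f3': 98},    # hypothetical knight 'units' by square
    'P': {'e2': 100, 'e4': 105},  # hypothetical pawn 'units' by square
}

def evaluate(white_pieces):
    """Sum Table[coord][type] over each (type, coord) pair.
    A full engine would subtract the same sum for the opponent's pieces."""
    return sum(PST[ptype][coord] for ptype, coord in white_pieces)

print(evaluate([('N', 'f3'), ('P', 'e4')]))  # 98 + 105 = 203
```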

The above may sound naive - and it is - but most of the advanced ideas used in
chess evaluation functions can actually be generalized from this method. "Rook
connection" is just a measure that includes two pieces instead of one, and
"pawn shield" is the generalization to three pieces. Experiments with
grandmasters reveal that they "recall" positions in 'chunks' of connected
pieces, and this memory is what they use to guide their search. (See papers
like 'Perception in chess' and lots of newer research.)

So, the role of machine learning in modern engines is to tune the parameters
for evaluation and search pruning (deciding what positions are worth examining
deeper). For the actual decision of which piece to move to where, you still
need search algorithms to crunch millions of positions per second.

~~~
erikbern
Sunfish is really impressive work. From my (brief) understanding of Sunfish,
the evaluation function is essentially equivalent to a hardcoded 1 layer
network in Deep Pink.

You're right that, everything else being equal, a better evaluation function
should lead to a better chess engine. However, in practice I think a _better_
evaluation function means a _slower_ evaluation function, so there's a really
interesting trade-off there. I doubt humans evaluate more than a few thousand
positions, so it seems like a slow but more accurate evaluation function could
play chess pretty well.

~~~
thomasahle
One interesting line of research, I think, is using 1 or 2 layered networks to
'simulate' more complex evaluation functions. If you could train such a
network to get within a 10% error of Stockfish's evaluation, then you might be
able to distil that network as a faster evaluator to plug back into Stockfish
for an even stronger engine. As you say, one hard problem is probably finding
actually interesting positions to sample for the training.
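A toy sketch of that distillation idea: fit a small, fast 'student' to reproduce the scores of an expensive 'teacher' evaluator. Here the teacher is just a fixed random linear function standing in for Stockfish's eval, and positions are random vectors, so the student recovers it exactly; with a real, nonlinear teacher the student would only approximate it, and choosing which positions to sample becomes the hard part:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an expensive teacher evaluation (e.g. Stockfish's score):
# a fixed linear function of a 768-dim (12 x 64) board vector.
w_teacher = rng.normal(size=768)

X = rng.normal(size=(4096, 768))   # stand-in for sampled positions
y = X @ w_teacher                  # teacher scores to imitate

# 'Student': a single linear layer, fit by least squares.
w_student, *_ = np.linalg.lstsq(X, y, rcond=None)

err = np.abs(X @ w_student - y).mean()
print(err)  # tiny: the student matches the teacher on these samples
```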

Anyhow, it's fun to see how engines like these battle it out. It may also be
that your approach can yield a more 'fun to play' engine for us mortals.

~~~
erikbern
I think that's a pretty useful approach. It's kind of similar to Hinton's
latest work on model compression:
[http://www.ttic.edu/dl/dark14.pdf](http://www.ttic.edu/dl/dark14.pdf)

The problem with deep models is that once you have more than 1 hidden layer,
you have a big matrix multiplication to get between the layers. If your hidden
layers are a few thousand units, that's still pretty slow. Doing things in
minibatches or on the GPU speeds it up significantly, but I'm guessing it's
still orders of magnitude slower than whatever Stockfish uses.

~~~
thomasahle
Sure, the second layer would have to be very sparse. That makes sense since
most multi-piece 'chunks' are not really that interesting.

------
halfcat
This has been tried many times before, with better-but-still-lackluster
results. Sunfish is impressive because it's written in Python and in a tiny
number of lines while still being readable. I LOVE Sunfish, but it is among
the weakest chess engines in existence. That deep learning could not break
even against Sunfish seems rather unimpressive.

The author seems to have a not-very-deep understanding of computer chess. Some
examples:

> _Better search algorithm. I’m currently using Negamax with alpha-beta
> pruning, whereas Sunfish uses MTD-f_

MTD-F is not _better_, just a different way to accomplish more-or-less the
same thing. MTD-F is a binary-search equivalent of the alpha-beta family of
searches. In fact, naively switching to MTD-F will probably result in worse
playing ability. It takes some time to get it tuned right, and even then it is
not objectively better.

> _Better evaluation function...By generating “harder” training examples
> (ideally fed from mistakes it made) it should learn a better model_

This is what every beginning chess programmer on the Computer Chess Club
message boards and rec.games.chess.computer has wanted to try for the last 20+
years. It has been empirically demonstrated that for best results, the
evaluation function should remain simple and fast. Improving evaluation rarely
fixes "dumb mistakes". That's what search is for. Efficient search makes up
for a multitude of evaluation mistakes.

> _Faster evaluation function: It might be possible to train a smaller (but
> maybe deeper) version of the same neural network_

If the evaluation function were reduced to literally taking zero time to
execute, it would not help significantly. It's a linear improvement being
thrown at an exponential problem.

I would LOVE if there was a new approach to computer chess, but the current
"smart brute force" approach is so far advanced and successful, it is hard to
imagine another approach being competitive.

~~~
erikbern
Thanks for the comment. I'm definitely happy to admit I have very limited
understanding of computer chess and I think the likelihood of stumbling upon a
new approach to chess engines is close to zero.

That being said, I'm not convinced there's anything "magic" about the fast
deep search. There's a broad trend of big, complex, hand-tuned models being
replaced by simple models with more parameters and many, many orders of
magnitude more training data (e.g. machine translation, computer vision).
There are probably a lot of domains where this approach doesn't work, though.
Maybe chess is one of those applications. We don't really know until we try :)

~~~
Alansaccount
Hi Erik,

I've been associated with computer chess for a while, and was even given a
shout-out by Vas Rajlich for optimization discussion prior to the release of
Rybka 3.

I think this critic's view on the primacy of fast evaluation, at the expense
of accuracy, is off base. Looking at the two top engines today, Stockfish and
Komodo: Stockfish has a great search function but a poor eval, while Komodo's
eval is better but its search isn't as good. It's pretty clear that bad
evaluation is limiting, even at the leaves of a deep search.

Anyway, I have more basic questions about why you did what you did. First,
your use of 12 x 64 sets of values to represent state information seems
suboptimal and not even sufficient. You are introducing a lot of unnecessary
and undesirable additional degrees of freedom by duplicating the piece-square
state inputs for white and black. When I've done this in the past, I used
6 x 64 sets, with positive ones for squares with a white piece and negative
ones for squares with a black piece. Do you see any advantage in not taking
advantage of this symmetry?

Second, you really need side to move, castling state for each side, and the
pesky en passant indicator (just as is required in a FEN representation).
Luckily these don't significantly increase the number of inputs.
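A minimal sketch of that input encoding, going from a FEN string to 6 x 64 signed piece-square inputs plus the extra state. The exact representation of the en passant input here is one invented choice among several:

```python
def encode(fen):
    """6*64 inputs: +1 for a white piece, -1 for black; then side to move,
    four castling flags, and a crude en passant input (0 = none)."""
    board, side, castling, ep = fen.split()[:4]
    index = {p: i for i, p in enumerate('pnbrqk')}
    vec = [0.0] * (6 * 64)
    sq = 0                      # square 0 = a8, scanning FEN rank by rank
    for ch in board.replace('/', ''):
        if ch.isdigit():
            sq += int(ch)       # run of empty squares
        else:
            sign = 1.0 if ch.isupper() else -1.0
            vec[index[ch.lower()] * 64 + sq] = sign
            sq += 1
    vec.append(1.0 if side == 'w' else -1.0)
    vec.extend(1.0 if c in castling else 0.0 for c in 'KQkq')
    vec.append(0.0 if ep == '-' else 1.0 + 'abcdefgh'.index(ep[0]))
    return vec

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
# 6*64 board inputs + 1 side + 4 castling + 1 en passant = 390 inputs
print(len(encode(start)))
```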

I worked on a similar project with a friend at the University of Warsaw a
number of years back. We generated static evals using FANN based on the
complete input state, and trained with the value of the search plus eval after
30 seconds per move. We used Rybka 2.2n for this, which was the strongest
engine available at the time.

We both moved on after some preliminary success, mainly because there wasn't
any way to conceivably make a top engine due to the closed nature of the top
engines at that time. This is no longer an issue, and if someone had a black
box evaluation function that produced a better eval than the current Stockfish
static eval, it would be trivial to do the substitution.

Best Regards, Alan

------
maaaats
> _Still, even an amateur player probably makes near-optimal moves for most
> time._

This is far from true. An AI that only looks one or two moves ahead to make
sure it doesn't hang a piece or allow mate in one would beat many amateur
players (at least with ~5min time control). That's essentially what Sunfish,
which he's comparing against, does. Note that Sunfish isn't a particularly
"good" AI; it would be more interesting to compare against a "proper" chess AI.

~~~
maaaats
But with that said, I think the concept here is cool. The fact that it doesn't
really know the rules of chess but can still play is interesting. I just think
that maybe it should have selected a different database for its games: a
master database instead of one filled with amateur games. Of course, there are
far more amateur games, so such a master database is much smaller, which may
be a problem.

------
bainsfather
It would be nice to know how strong the program was. What was its grade?

Saying it can beat a terrible player doesn't mean much.

Saying it can sometimes beat Sunfish (a Python program with a grade of maybe
ELO 1100? (i.e. not at all strong)), when it has a time advantage, is not
impressive.

I'd really like to know how much better (if at all) the evaluation function is
- e.g. can the program beat itself, if one side uses a 'standard' evaluation
function?

Machine Learning is big on measuring outcomes. It is odd that the one outcome
that is important here is not measured!

Some caveats: I realise this is someone's hobby project - I do not mean to
rubbish it. I'm just saying that the work & write-up could have been much
improved by adding this information.

~~~
wodenokoto
The problem here is very much effectiveness per unit of time, as the author
writes. It would probably need to be re-implemented in optimized C code in
order to truly test it against other optimized engines.

I think the takeaway here is how relatively easy it is to make a decent AI
for a complex game using neural networks.

~~~
ogrisel
It's using Theano, which generates CUDA code. Before reimplementing, I would
start by profiling the generated code to spot the main computational
bottlenecks.

~~~
ogrisel
Actually:

> Faster evaluation function: I didn’t use the GPU for playing, only for
> training.

So indeed it could probably be optimized. But first I would profile, as it may
be that the bottleneck is matrix-matrix multiplication in numpy, which already
delegates computation to an optimized third-party library (e.g. OpenBLAS).
Maybe it's worth using the GPU at play time as well.

~~~
bainsfather
"Maybe it's worth using the GPU at play time as well."

The difficulty there is that sending data to and from the GPU is slow. You
need to avoid data transfers, which might mean trying to do everything on the
GPU. But the SIMD cores on the GPU are likely to perform poorly due to all the
branch statements in chess code.

~~~
Karunamon
This may be naive, but board state can be encoded very efficiently using
FEN[1] - as an example, the starting board state only takes up 57 bytes.

[1]:[https://chessprogramming.wikispaces.com/Forsyth-
Edwards+Nota...](https://chessprogramming.wikispaces.com/Forsyth-
Edwards+Notation)

~~~
bainsfather
There is a delay from sending _any_ data to/from the GPU.

Generally you aim to give the GPU a 'large' task (or many small tasks), then
ask for the answer(s) in 1 batch and wait while it is transferred across.

If you have many small tasks where each answer is sent back separately, and
you need that answer before requesting the next task, then you will be very
slow, even if the data sent/received is small. A naive implementation of the
chess algorithm here (with alpha-beta pruning (which has many if-then
branches)) would be like this :/

------
no_gravity
A little nitpicking: every time the author writes "infinite", a more accurate
word would be "enough". For example:

    
    
        if you had infinite computing capacity,
        you could actually solve chess.
    

The statement is correct. But you do not necessarily need infinite capacity to
solve chess. Just enough capacity.

Would be interesting to estimate how much capacity.

~~~
maaku
More than you would have if the entire universe were transformed into
computronium.

So effectively infinite.

~~~
no_gravity
Not sure the universe is incapable of doing this.

The number of states a single chess square can have is 13. It's either empty
or has one of the 6 different pieces in black or white on it.

So 13^64 is an upper limit for the number of positions.

We could solve chess if we could put these 13^64 positions into a tree, right?

13^64 = 10^71.

The number of atoms in the observable universe is estimated to be 10^80.

So the observable universe might be big enough to form this tree, even if we
use big clunky objects like atoms.
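The arithmetic checks out, if you want to verify it:

```python
import math

positions = 13 ** 64    # 13 states per square, 64 squares
atoms = 10 ** 80        # common estimate for the observable universe

print(f"13^64 ~ 10^{math.log10(positions):.1f}")  # 13^64 ~ 10^71.3
print(positions < atoms)                          # True
```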

We do not have any idea of the size of the unobservable universe.

And I don't know how many states an atom can have. Who knows, maybe a single
atom can solve chess if it's programmed correctly? According to quantum
theory, pretty small objects like electrons can store and process an amazing
(infinite?) amount of information in certain ways.

~~~
Someone
_" So 13^64 is an upper limit for the number of positions."_

Your conclusion is correct by a huge margin, but the argument is flawed. There
is state that is not in the board position:

    
    
      - white or black to move?
      - is an en passant capture possible?
      - is castling still possible?
  - how many moves since the last pawn move or capture?

~~~
no_gravity
You are right. If I count correctly, that's another 17 bits of information, if
we make no effort to encode it efficiently. Let's say 10^6 additional states.

That brings us to an upper limit of 10^77 states the board can be in.

Still enough atoms in the observable universe to assign one to each state.

~~~
Someone
As I said "your conclusion is correct".

I think you need fewer than 17 bits: 1 for "who is to move", 4 for castling,
6.6 for the 50 moves (100 ply) rule would leave less than 6 for en passant.
There potentially are 7 possible en passant moves (white pawns at a5, c5, e5,
g5 and black ones at b5, d5, f5, and h5 or vice versa), but all you need to
store is "the n-th pawn just moved" with n in 0..8. That is just over 3 bits.

There is a way more significant flaw in your logic, though: the "three times
the same board position with the same player to move is a draw" rule. That
explodes your estimate, as there could be (if we do not call in additional
chess rules) up to 10^71 positions that have been visited earlier.

With that in mind, I am not sure your limit is correct. There are at most 96
pawn moves in a game (16 pawns walk in 6 moves to the other side of the board)
and 30 captures (7 pieces each plus 8 promoted pawns each). With the 50-move
rule, that gives at most 126 x 50 = 6300 moves in a game. Each move can be
encoded in 12 bits (starting and ending square for white's and black's move),
so we need 75600 bits. Taking 2^10 = 1000, that gives me 10^22680 possible
games as an upper limit.
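The arithmetic behind that bound, for anyone checking along (note that the 2^10 = 1000 shortcut slightly understates the exponent; computed exactly it comes out near 10^22758):

```python
import math

pawn_moves = 16 * 6            # 96: each pawn makes at most 6 forward moves
captures = 2 * (7 + 8)         # 30: 7 pieces + 8 promoted pawns per side
max_moves = (pawn_moves + captures) * 50   # 50-move rule between each
bits = max_moves * 12          # 12 bits per move

print(max_moves, bits)                              # 6300 75600
print(f"2^{bits} ~ 10^{bits * math.log10(2):.0f}")  # 2^75600 ~ 10^22758
```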

That is different from what [http://www.chess.com/article/view/the-open-file
---is-chess-i...](http://www.chess.com/article/view/the-open-file---is-chess-
infinite) or
[http://members.iinet.net.au/~ray/Chessgames.htm](http://members.iinet.net.au/~ray/Chessgames.htm)
(several estimates) claim, but (if we assume that zero is a misprint) close to
Hardy's upper limit, so at least I am in good company.

~~~
no_gravity
Thinking about it again, maybe we don't need to encode off-the-board
information in the tree at all.

The tree would start from the starting position and then branch out to all
legal moves at each step. So each node in the tree would have the previous
moves encoded in its position in the tree.

So my initial upper limit of 10^71 nodes might hold true. No need to encode
information about castling, black or white to move, en passant etc.

Repeat positions are another issue though. Do we have to encode them? My 10^71
tree would not contain them. At first I thought we can leave them out. Now I'm
not so sure anymore.

A move that leads to a repeat position is certainly not necessary to win a
game. You could play the winning moves right away.

But it can be necessary to force a draw. Hmm...

You might be right. Maybe the "repeat position" issue kills the upper limit of
10^71 nodes in the tree of chess moves. So we have to resort to your upper
limit of 10^22680.

------
sytelus
It's surprising that it is able to win 1/3rd of the time. The problem here is
that the input does not lie in any continuous space. I mean, you may have 1
billion board states in your training set, but is it possible to interpolate
the values of other states from those? For one vector representing a certain
board state, even a slight change may have a completely different outcome. I
would think most learning methods, including deep learning, excel when there
is some sort of interpolatable continuity in the inputs on which prediction is
desired. Therefore the challenge would be to transform the discontinuity from
one board state to the next into a more continuous space.

~~~
V-2
My thoughts exactly. As an exercise it's surely interesting, but the neural
network approach is inherently unsuitable for chess.

Chess requires 100% accuracy, and just because positions are similar doesn't
mean that the best moves in those positions have to be in any way similar too.

On the other hand, it sort of mimics the way a human player thinks, in terms
of recognizing certain patterns. After all, even grandmasters do not
bruteforce their way through all possible combinations. We use a hybrid
approach: recognize certain strategic patterns first (to drastically narrow
down the search tree), and perform calculations on top of that.

Chess engines can wipe the floor with any player where tactics are involved;
the trick to beating a computer is to close the game and take advantage of the
fact that it's not able to formulate a long-term PLAN (whose consequences are
beyond its horizon).

See how Nakamura repeatedly beat Rybka in blitz games a few years ago, eg.:
[http://www.chessgames.com/perl/chessgame?gid=1497429](http://www.chessgames.com/perl/chessgame?gid=1497429)
\- very instructive :)

~~~
sushirain
> just because positions are similar, doesn't mean that best moves in these
> positions have to be in any way similar too.

To alleviate this, one can add more abstract/heuristic information about the
position to the input (indicators for complex relations between several
pieces). This kind of high-dimensional vector would be more robust to small
changes, and make the objective function more smooth. Perhaps the non-
linearities introduced by the three layers cannot do this as effectively.

------
leeber
I wrote a program to solve chess once. After I realized that it would take a
massive amount of computing resources to finish in my lifetime, I abandoned
the project.

Most interesting to me is that it really isn't that hard to create a program
to solve chess (i.e. the logic behind it), it just would take too much
time/money to actually do it.

It's much more difficult to create AIs and approximations like this.

Kinda weird once you realize that fact...approximating a solution to chess is
much more difficult, logic wise, than actually solving chess.

Though I wouldn't be surprised if chess is solved in the next couple decades
or so.

~~~
thomasahle
Actually it just might be possible to do, at least in a probabilistic sense.
The author of Rybka at least managed to 'prove' (in a weak sense) that a
certain opening is unplayable. Quite fascinating:
[http://en.chessbase.com/post/rajlich-busting-the-king-s-
gamb...](http://en.chessbase.com/post/rajlich-busting-the-king-s-gambit-this-
time-for-sure)

~~~
ajkjk
That article is an April Fool's joke. See [http://en.chessbase.com/post/the-
chebase-april-fool-s-prank](http://en.chessbase.com/post/the-chebase-april-
fool-s-prank).

~~~
thomasahle
An April 2nd joke. Cheeky. The follow-up comments from Vas Rajlich are still
interesting though.

~~~
slm_HN
Vas Rajlich could write a program that solved chess. He would simply have to
find an existing program that solved chess and copy it.

------
tmalsburg2
I really like this work, but the performance of the network against Sunfish is
not particularly informative. What I'd like to know is whether this evaluation
function captures any non-trivial properties of the board. If it only captures
simple heuristics such as "more pieces are better," that's not very
interesting. I think it would be worth trying to find out what is actually
captured in the network.

If the evaluation function is really smart, i.e. capturing non-trivial
properties of the position, it could guide a much more focused and thus more
efficient search. This is basically what humans do. That, however, would
require a modified version of the network that has a continuous output value
telling how promising a position is compared to the alternatives.

If the evaluation function doesn't play well, that may be really interesting,
too, if the mistakes are psychologically plausible. It seems at least possible
that this is the case because the network was trained on data sets containing
human errors.

In general, I think the value of this approach lies in the potential for
investigating and replicating human performance rather than in developing a
stronger chess engine. The problem of playing strongly is pretty much solved.
What's more interesting now is to develop chess engines that play badly but in
psychologically plausible ways.

~~~
sehugg
Well, the problem of playing strong is pretty well solved if you have lots of
computing resources. The problem of playing very strongly on a battery-powered
mobile device (for example) is not yet solved. This is where an insanely-
accurate evaluation function would come in handy.

~~~
tmalsburg2
Ok, agreed. However, I'd guess that engines like Stockfish and Shredder beat
most of the chess playing population even when running on an iPhone. Advanced
players may not be impressed as they are familiar with methods specifically
developed to beat engines but that's not relevant for the majority of players.
These people want engines that play at their level without making ridiculous
artificial mistakes. Playing against current engines is completely pointless
for beginning and intermediate players. Much progress could be made there.

------
jeremysalwen
It seems like you could get some improvements by simply training the
evaluation function on the output of the entire Deep Pink system including the
negamax search.

This would be a very easy way of getting more training data, and is actually
very nice theoretically. Assuming enough training time and a complex enough
evaluation function, etc, you'd eventually solve chess.

I may check out the code and try this myself...
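A toy sketch of that bootstrapping idea, using Nim as a stand-in game (the real thing would use chess positions and Deep Pink's network): label positions with the value of a search that calls the current eval at its leaves, then retrain the eval toward those labels and iterate.

```python
def moves(n):                      # Nim: take 1-3 stones, last taker wins
    return [m for m in (1, 2, 3) if m <= n]

def negamax(n, depth, evaluate):
    if n == 0:
        return -1.0                # side to move has already lost
    if depth == 0:
        return evaluate(n)         # fall back to the static eval
    return max(-negamax(n - m, depth - 1, evaluate) for m in moves(n))

def bootstrap_targets(positions, evaluate, depth):
    """Training labels come from search-over-eval, not from eval alone."""
    return {n: negamax(n, depth, evaluate) for n in positions}

static = lambda n: 0.0             # an uninformed eval to start from
targets = bootstrap_targets(range(1, 9), static, depth=8)
print(targets[4], targets[5])      # -1.0 1.0: multiples of 4 are lost
```

In chess the search would be the full negamax with pruning, and the "retrain" step would fit the network to the targets before the next round.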

------
svantana
Interesting work; however, training on data seems unnecessary. Chess would be
perfect for unsupervised learning: initially it could be trained against an
existing chess program, but as the models improve, they could start competing
against each other. Although one would probably need some way of scoring any
given board position (compare with DeepMind's Atari playing).

~~~
kylebrown
If you input a score ("{-1,0,1} on final positions"), it's effectively a
label, which makes the training supervised rather than unsupervised. See [1]
for good reasons to be skeptical of unsupervised learning in general.

See [2] for a twist on the DeepMind Atari player. They use Monte Carlo Tree
Search (MCTS, of automated Go-playing fame) to generate training data. By
feeding that more carefully generated gameplay data into the deep Q-learning
net, they exceed DeepMind's (non-MCTS-coupled) performance.

1\. [http://karpathy.github.io/2014/07/03/feature-learning-
escapa...](http://karpathy.github.io/2014/07/03/feature-learning-escapades/)

2\. [http://www-personal.umich.edu/~rickl/pubs/guo-singh-lee-
lewi...](http://www-personal.umich.edu/~rickl/pubs/guo-singh-lee-lewis-
wang-2014-nips.pdf)

------
toolslive
The Houdini (and I think Rybka too) evaluation function is tweaked by letting
the engine play zillions of micro games against itself in a tournament. One
such micro game lasts a few seconds, and each of the players has a different
setting of the parameters for the evaluation function (material, position,
...). You could apply the same meta strategy here.
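A stripped-down sketch of that meta-strategy: propose a perturbed parameter setting, let it play a micro-match against the incumbent, and keep whichever wins. The "match" here is simulated by distance to an invented hidden optimum; a real tuner would play actual games, perturb many parameters at once, and handle the noise of real results:

```python
import random

random.seed(1)
HIDDEN_OPTIMUM = 3.0   # stand-in for the unknown best value of a parameter

def wins_match(a, b):
    """Stand-in for a micro-game match: closer to the optimum wins."""
    return abs(a - HIDDEN_OPTIMUM) < abs(b - HIDDEN_OPTIMUM)

best = 0.0
for _ in range(100):
    challenger = best + random.gauss(0, 0.5)   # perturb the parameters
    if wins_match(challenger, best):
        best = challenger                      # challenger takes the crown

print(round(best, 2))   # converges close to 3.0
```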

~~~
tkirby
Stockfish has a distributed network to allow anyone to donate computer time to
test new patches. Currently running nearly 400 games/minute.

[http://tests.stockfishchess.org/tests](http://tests.stockfishchess.org/tests)

------
mattxxx
Super cool; training game players is super appealing to me as a math-software-
engineer-person.

My babble: if DeepPink can gauge its uncertainty on a move, it'd be cool to
see a hybrid system in action. Plus, "DeepFish" has a cool name.

Either way - nice! And thanks for putting the source on GitHub; I will have a
goof with it!

------
jfoster
A naive brute force of Chess wouldn't end. Consider the case where both
players make moves that perpetuate the game rather than working toward an
ending.

~~~
jefffoster
Chess has a fifty-move draw rule ([http://en.wikipedia.org/wiki/Fifty-
move_rule](http://en.wikipedia.org/wiki/Fifty-move_rule)) that prevents this
from happening.

------
skc
Silly question, why does the author only approximate the number of possible
positions in chess?

~~~
tromp
It's not like Go where you can actually count the number of reachable
positions.

Deciding whether a single chess position is reachable can be really hard.
There is a whole class of so called "retrograde" chess problems focussed on
that.

For example, is the following position reachable?

White: Kc3 Ba4 Black: Kd1 Rb5 Bd5

------
mooneater
Loved it, and very impressed with how concise the source code is.

------
yangzx
A neat and elegant neural architecture for learning chess.

------
heroku
Chess AI is a myth, bruteforce is the only way to win.

~~~
kylebrown
How do you know that humans don't use intelligent bruteforce approximations to
win?

------
anonfunction
Site is down. I highly recommend using CloudFlare or another caching solution
to avoid server overload in times of high traffic.

