Hacker News new | past | comments | ask | show | jobs | submit login
Deep learning for chess (erikbern.com)
157 points by mlla on Dec 2, 2014 | hide | past | favorite | 73 comments

As the author of sunfish, I'd like to point out something about learning in chess, which is a topic that interests me a great deal.

When sunfish (with its just 111 python lines of chess logic) 'evaluates' a position, it uses the perhaps simplest known effective method: A piece-square table. The method ignores any interplay between pieces on the board and calculates the sum of Table[coord][type] for each piece on the table. E.g. a white knight on a1 may be worth 73 'units', and on f3 it may be worth 98 'units'. That's all there is to it. Any program which has greater than this level of precision, and equally precise searching, should be able to beat sunfish.

The above may sound naive - and it is - but actually most of the advanced ideas used in chess evaluation functions, can be generalized from this method. "Rook connection" is just a measure that includes two pieces instead of one, and "pawn shield" is the generalization to three pieces. Experiments with grandmasters reveal they "recall" positions in 'chunks' of connected pieces. And this memory is what they use to guide their search. (Papers like 'Perception in chess' and lots of newer research).

So, the role of machine learning in modern engines is to tune the parameters for evaluation and search pruning (deciding what positions are worth examining deeper). For the actual decision of which piece to move to where, you still need search algorithms to crunch millions of positions per second.

Sunfish is really impressive work. From my (brief) understanding of Sunfish, the evaluation function is essentially equivalent to a hardcoded 1 layer network in Deep Pink.

You're right that everything else equal, a better evaluation function should lead to a better chess engine. However in practice I think better evaluation functions means slower evaluation function. So there's some really interesting trade-off there. I doubt humans evaluate more than a few thousand positions, so it seems like a slow but more accurate evaluation function could play chess pretty well

One interesting line of research, I think, is using 1 or 2 layered networks to 'simulate' more complex evaluation functions. If you could train such a network to get within a 10% error of Stockfish's evaluation, then you might be able to distil that network as a faster evaluator to plug back into Stockfish for an even stronger engine. As you say, one hard problem is probably finding actually interesting positions to sample for the training.

Anyhow, it's fun to see how engines like these battle it out. It may also be that your approach can yield a more 'fun to play' engine for us mortals.

I think that's pretty useful approach. It's kind of similar to Hinton's latest work on model compression: http://www.ttic.edu/dl/dark14.pdf

The problem with deep models is when you end up having more than 1 hidden layers, you have a big matrix multiplication to get between the layers. If your hidden layers are a few thousand units, that's still pretty slow. Doing things in minibatches or on the GPU speeds it up significantly, but I'm guessing it's still orders of magnitudes slower than whatever Stockfish uses

Sure, the second layer would have to be very sparse. That makes sense since most multi-piece 'chunks' are not really that interesting.

Thanks for sunfish, great work!

This has been tried many times before, with better-but-still-lackluster results. Sunfish is impressive because it's written in Python and in a tiny number of lines while still being readable. I LOVE Sunfish, but it is among the weakest chess engines in existence. That deep learning could not break even against Sunfish seems rather unimpressive.

The author seems to have a not-very-deep understanding of computer chess. Some examples:

>Better search algorithm. I’m currently using Negamax with alpha-beta pruning, whereas Sunfish uses MTD-f

MTD-F is not better, just a different way to accomplish more-or-less the same thing. MTD-F is a binary-search equivalent of the alpha-beta family of search. In fact, naively switching to MTD-F will probably result in worse playing ability. It takes some time to get it tuned right, and even then it is not objectively better.

>Better evaluation function...By generating “harder” training examples (ideally fed from mistakes it made) it should learn a better model

This is what every beginning chess programmer on the Computer Chess Club message boards and rec.games.chess.computer has wanted to try for the last 20+ years. It has been empirically demonstrated that for best results, the evaluation function should remain simple and fast. Improving evaluation rarely fixes "dumb mistakes". That's what search is for. Efficient search makes up for a multitude of evaluation mistakes.

>Faster evaluation function: It might be possible to train a smaller (but maybe deeper) version of the same neural network

If the evaluation function was reduced to literally take zero time to execute, it would not help significantly. It's a linear improvement being thrown at an exponential problem.

I would LOVE if there was a new approach to computer chess, but the current "smart brute force" approach is so far advanced and successful, it is hard to imagine another approach being competitive.

Thanks for the comment. I'm definitely happy to admit I have very limited understanding of computer chess and I think the likelihood of stumbling upon a new approach to chess engines is close to zero.

That being said, I'm not convinced there's anything "magic" about the fast deep search. There's a broad trend of big complex hand-tuned models being replaced by simple models with more parameters and many many orders of magnitude more training data (eg. machine translation, computer vision). There's probably a lot of domains where this approach doesn't work though. Maybe chess is one of those applications. We don't really know until we try :)

Hi Erik,

I've been associated with computer chess for a while, and was even given a shout-out by Vas Rajlich for optimization discussion prior to the release of Rybka 3.

I think this critic's view on the primacy of fast evaluation, at the expense in accuracy is off base. Looking at the two top engines today, Stockfish and Komodo, Stockfish has a great search function, but a poor eval, while Komodo's eval is better, but it's search isn't as good. It's pretty clear that bad evaluation is limiting, even at the leaves of a deep search.

Anyway, I have more basic questions about why you did what you did. First, it seems that your use of 12 x 64 sets of values to represent state information is non-optimal and even not sufficient. It seems that you are introducing a lot of unnecessary and undesirable additional degrees of freedom by duplicating the piece-square state inputs for white and black. When I've done this in the past, I used 6 x 64 sets with positive ones for squares with a white piece and negative ones for squares with a black piece. Do you see any advantage in not taking advantage of this symmetry?

Second, you really need side to move, castling state for each side, and the pesky en passant indicator (just as is required in a FEN representation). Luckily these don't significantly increase the number of inputs.

I worked on a similar project with a friend at University of Warsaw a number of years back. We generated static evals using FANN based on the complete input state, and trained with the value of the search plus eval after 30 seconds per move We used Rybka 2.2n for this, which was the strongest engine available at the time.

We both moved on after some preliminary success, mainly because there wasn't any way to conceivably make a top engine due to the closed nature of the top engines at that time. This is no longer an issue, and if someone had a black box evaluation function that produced a better eval than the current Stockfish static eval, it would be trivial to do the substitution.

Best Regards, Alan

> Still, even an amateur player probably makes near-optimal moves for most time.

This is far from true. An AI that only looks one-two moves ahead to make sure it doesn't hang a piece or allow mate in one would beat many amateur players (at least with ~5min time control). That's essentially what Sunfish, what he's comparing against, does. Note that Sunfish isn't a particular "good" AI, it would be more interesting comparing it to a "proper" chess AI.

But with that said, I think the concept here is cool. The fact that it doesn't really know the rules of chess but still can play is interesting. I just think that it maybe should have selected a different database for its games; a master-database instead of one filled with amateur games. Of course, there are far many more amateur games, so such a masterbase is much smaller which may be a problem.

It would be nice to know how strong the program was. What was its grade?

Saying it can beat a terrible player doesn't mean much.

Saying it can beat Sunfish (a python program with grade of ?maybe ELO 1100? (i.e. not at all strong)) sometimes, when it has a time advantage, is not impressive.

I'd really like to know how much better (if at all) the evaluation function is - e.g. can the program beat itself, if one side uses a 'standard' evaluation function?

Machine Learning is big on measuring outcomes. It is odd that the one outcome that is important here is not measured!

Some caveats: I realise this is someone's hobby project - I do not mean to rubbish it. I'm just saying that the work&writeup could have been much improved by adding this information.

The problem here is very much effectiveness per time unit, as the author writes. It would probably need to be re-implemented in optimized C code in order to truly test it against other optimized engines.

I think the take away here is how relatively easy it is to make a decent AI for a complex game using neural networks.

I do not think that is correct. With chess, you face exponential branching that quickly overwhelms your language-dependent speedup.

Is your python code e.g. 10x times slower than C code? With a branching factor of e.g. 10, that would mean you only search to depth 9 ply whilst C would search to 10 ply [a]. You are not losing that much! A ballpark ELO grade would not vary qualitatively between python and C.

[a] With alpha-beta search, maybe it is 8 ply vs 10 ply. Also the branching factor is larger than 10. But the point remains valid.

It's using theano that generates CUDA code. Before reimplementing I would start with profiling the generated code to spot the main computational bottlenecks.


> Faster evaluation function: I didn’t use the GPU for playing, only for training.

So indeed it could probably be optimized. But first I would profile as it can be the case that the bottleneck is Matrix Matrix multiplication in numpy which already delegates computation to an optimized third party library (e.g. OpenBLAS). Maybe it's worth using the GPU at play time as well.

"Maybe it's worth using the GPU at play time as well."

The difficulty there is that sending data to and from the GPU is slow. You need to avoid data transfers, which might mean trying to do everything on the GPU. But the SIMD cores on the GPU are likely to perform poorly due to all the branch statements in chess code.

This may be naive, but board state can be encoded very efficiently using FEN[1] - as an example, the starting board state only takes up 57 bytes.


There is a delay from sending any data to/from the GPU.

Generally you aim to give the GPU a 'large' task (or many small tasks), then ask for the answer(s) in 1 batch and wait while it is transferred across.

If you have many small tasks where each answer is sent back separately, and you need that answer before requesting the next task, then you will be very slow, even if the data sent/received is small. A naive implementation of the chess algorithm here (with alpha-beta pruning (which has many if-then branches)) would be like this :/

There might still be ways to speed up at play time by evaluating batches of candidate points in the same evaluation function call in the GPU.

CUDA is used to learn the evaluation function, at play time it uses a Python implementation of Negamax that uses this evaluation function to estimate how good are the leaf positions.

A little nitpicking: Everytime the author writes "infinite", a more accurate word would be "enough". For example:

    if you had infinite computing capacity,
    you could actually solve chess.
The statement is correct. But you do not necessarily need infinite capacity to solve chess. Just enough capacity.

Would be interesting to estimate how much capacity.

I think "infinite" was used in the engineering-sense of the word and not in the stricter mathematical-sense. For me, as an electrical engineer, infinity is often somewhere beyond 1 ms.

People write that chess has ~ 10^42 board positions. If you read the paper about solving checkers (which you should because it's excellent), they evaluated about sqrt(N) positions, and the final solution took cuberoot(N) board positions. IF that extrapolates to chess, it means that the weak solution would take 10^21 evaluations and 10^14 storage. That's big, but in the realm of feasibility.

More than you would have if the entire universe were transformed into computonium.

So effectively infinite.

Not sure the universe is incapable of doing this.

The number of states a chess field can have is 13. It's either empty or has one of the 6 different pieces in black or white on it.

So 13^64 is an upper limit for the number of positions.

We could solve chess if we could put these 13^64 positions into a tree, right?

13^64 = 10^71.

The number of atoms in the observable universe is estimated to be 10^80.

So even the observable universe might be big enough to form this tree. Even if we use big clunky objects like atoms.

We do not have any idea of the size of the unobservable universe.

And I don't know how many states an atom can have. Who knows, maybe a single atom can solve chess if it's programmed correctly? According to quantum theory, pretty small objects like electrons can store and process an amazing (infinite?) amount of information in certain ways.

Hi no_gravity - you've used the word "estaminated" a couple of times. Just in case it's not a simple typo, the word you're after is "estimated".

Thank you! Fixed it.

"So 13^64 is an upper limit for the number of positions."

Your conclusion is correct by a huge margin but the argument is flawed. There is state that is not in the board position:

  - white or black to move?
  - is an en passant capture possible?
  - is castling still possible?
  - how many moves since the last lawn move or capture?

You are right. If I count correctly, that's another 17 bit of information. If we make no effort to encode it efficiently. Let's say 10^6 additional states.

That brings us to an upper limit of 10^77 states the board can be in.

Still enough atoms in the observable universe to assign one to each state.

As I said "your conclusion is correct".

I think you need fewer than 17 bits: 1 for "who is to move", 4 for castling, 6.6 for the 50 moves (100 ply) rule would leave less than 6 for en passant. There potentially are 7 possible en passant moves (white pawns at a5, c5, e5, g5 and black ones at b5, d5, f5, and h5 or vice versa), but all you need to store is "the n-th pawn just moved" with n in 0..8. That is just over 3 bits.

There is a way more significant flaw in your logic, though: the "three times the same board position with the same player to move is a draw" rule. That explodes your estimate, as there could be (if we do not call in additional chess rules) up to 10^71 positions hat have been visited earlier.

With that in mind, I am not sure your limit is correct. There are at most 96 pawn moves in a game (16 pawns walk in 6 moves to the other side of the board) and 30 captures (7 pieces each plus 8 promoted pawns each). With the 50-move rule, that gives at most 126 x 50 = 6300 moves in a game. Each move can be encoded in 12 bits (starting and ending square for white's and black's move), so we need 75600 bits. Taking 2^10 = 1000, that gives me 10^22680 possible games as an upper limit.

That is different from what http://www.chess.com/article/view/the-open-file---is-chess-i... or http://members.iinet.net.au/~ray/Chessgames.htm (several estimates) claim, but (if we assume that zero is a misprint) close to Hardy's upper limit, so at least I am in good company.

Thinking about it again, maybe we don't need to encode off-the-board information in the tree at all.

The tree would start from the starting position and then branch out to all legal moves in each step. So each node in the tree would have the previous moves encoded in it's position in the tree.

So my initial upper limit of 10^71 nodes might hold true. No need to encode information about castling, black or white to move, en passant etc.

Repeat positions are another issue though. Do we have to encode them? My 10^71 tree would not contain them. At first I thought we can leave them out. Now I'm not so sure anymore.

A move that leads to a repeat position is certainly not necessary to win a game. You could play the winning moves right away.

But it can be necessary to force a draw. Hmm...

You might be right. Maybe the "repeat position" issue kills the upper limit of 10^71 nodes in the tree of chess moves. So we have to resort to your upper limit of 10^22680.

All the non-board-state info of a chess game could be easily captured in, like, two bytes.

I think you overestimated the possible chess positions by a couple orders of magnitude. You can't have a chess configuration with an arbitrary number of pieces.

Also there are restrictions on which pieces can be on which squares (for example: pawns can't move backwards)

That's why I gave an upper bound, not an estimation.

It's surprising that it is able to win 1/3rd of the time. The problem here is that input does not lie in any continuous space. I mean, you may have 1 billion board states in your training but is it possible to interpolate values of other states using this? For example, for one vector representing certain board state, even a slight change may have completely different outcome. I would think most learning methods, including deep learning, would excel when there is some sort of interpolatable continuity in inputs on which prediction is desired. Therefore the challenge would be transform discontinuity in one board state to another to more continuous space.

Doesn't this hold true for other domains where deep learning seems to be successful, for example natural language processing? My intuition would be that initial layers of a deep network actually learn smooth-ish representations of the input space.

My thoughts exactly. As an exercise it's surely interesting, but neural network approach is inherently unsuitable for chess.

Chess requires 100% accuracy and in which just because positions are similar, doesn't mean that best moves in these positions have to be in any way similar too.

On the other hand, it sort of mimics the way human player thinks, in terms of recognizing certain patterns. After all, even grandmasters do not bruteforce their way through all possible combinations. We use a hybrid approach: recognize certain strategic patterns first (to drastically narrow down the search tree), and perform calculations on the top of that.

Chess engines can wipe the floor with any player where tactics is involved; the trick of beating a computer is to close the game and take advantage of the fact that it's not able to formulate a long-term PLAN (whose consequences are beyond its horizon).

See how Nakamura repeatedly beat Rybka in blitz games a few years ago, eg.: http://www.chessgames.com/perl/chessgame?gid=1497429 - very instructive :)

> just because positions are similar, doesn't mean that best moves in these positions have to be in any way similar too.

To alleviate this, one can add more abstract/heuristic information about the position to the input (indicators for complex relations between several pieces). This kind of high-dimensional vector would be more robust to small changes, and make the objective function more smooth. Perhaps the non-linearities introduced by the three layers cannot do this as effectively.

I wrote a program to solve chess once. After I realized that it would take a massive amount of computing resources to finish in my lifetime, I abandoned the project.

Most interesting to me is that it really isn't that hard to create a program to solve chess (i.e. the logic behind it), it just would take too much time/money to actually do it.

It's much more difficult to create AIs and approximations like this.

Kinda weird once you realize that fact...approximating a solution to chess is much more difficult, logic wise, than actually solving chess.

Though I wouldn't be surprised if chess is solved in the next couple decades or so.

Well, theoretically a brute force universally solves any set of constraints, just takes too long. Intelligence is really only about efficiency and timescales, i.e. the dumbest algorithm would look insanely smart to us if it ran fast enough.

Actually it just might be possible to do, at least in a probabilistic sense. The author of Rybka at least managed to 'prove' (in a weak sense) that a certain opening is unplayable. Quite facinating: http://en.chessbase.com/post/rajlich-busting-the-king-s-gamb...

That article is an April Fool's joke. See http://en.chessbase.com/post/the-chebase-april-fool-s-prank.

A second of April joke. Cheeky. The follow up comments from Vas Rajlich are still interesting though.

Vas Rajlich could write a program that solved chess. He would simply have to find an existing program that solved chess and copy it.

From a quick Google there appear to be about 10^120 possible chess games. So if you could store each game as 1 bit then:

Landauer's principle says we need at least 10^-21 Joules to change a single bit. That means we need 10^99 Joules to run through all our games. The mass-energy of the observable universe is 10^69 Joules.

Therefore, if we could convert one million trillion trillion observable universes entirely into energy, then we'd have enough to run through all our chess games on our theoretically perfect computing machine.

I'm not sure we'll get that done in the next couple of decades.

[of course, there may well be shortcuts that we can take to cut down that number a bit!]

There's weak and strong level of solving chess.

You don't need to know the best move in every position of every possible chess game in order to play perfectly.

Eg. playing with White it is enough for you to always know what your best move is. And once you stick to it, then 99.9999... of these possible games will NEVER occur, because that would require White making at least one non-perfect move on the way.

You might be interested in learning about AIXI. It is in fact trivially easy to create a general intelligence effective in arbitrary domains ... if you assume infinite computational capability. It goes to show that the problem of artificial intelligence is really approximations and heuristics.

Its trivially easy to write some formulas down on paper ;) I was looking up AIXI the other day, but couldn't figure out what the acronym stands for.

Actual implementations of AIXI are also quite easy. I've written one before. Not that it's very interesting to run -- it is basically indistinguishable from an infinite loop.

As for what the acronym stands for, heck if I know.

Last time I looked we still hadn't created the computing resources needed to solve chess before the heat death of the universe.

I'm curious to see how the solution to chess will play out (hoping it happens in my lifetime).

Yes, it would be interesting to see if the solution is like that of checkers where the solved variant is always a draw, or would it be a win for white. (Or much less likely from evidence so far, a win for black).

I really like this work but the performance of the network against Sunfish is not particularly informative. What I'd like to know is whether this evaluation function captures any non-trivial properties of the board. If it only captures simple heuristics such as "more pieces are better," that's not very interesting. I think it would be worth trying to find out what is actually captured in the network. If the evaluation function is really smart, i.e. capturing non-trivial properties of the position, it could guide a much more focused and thus more efficient search. This is basically what humans do. That, however, would require a modified version of the network that has a continuous output value telling how promising a position is compared to the alternatives. If the evaluation function doesn't play well, that may be really interesting, too, if the mistakes are psychologically plausible. It seems at least possible that this is the case because the network was trained on data sets containing human errors. In general, I think the value of this approach lies in the potential for investigating and replicating human performance rather than developing a stronger chess engine. The problem of playing strong is pretty much solved. What's more interesting now is to develop chess engines that play bad but in psychologically plausible ways.

Well, the problem of playing strong is pretty well solved if you have lots of computing resources. The problem of playing very strongly on a battery-powered mobile device (for example) is not yet solved. This is where an insanely-accurate evaluation function would come in handy.

Ok, agreed. However, I'd guess that engines like Stockfish and Shredder beat most of the chess playing population even when running on an iPhone. Advanced players may not be impressed as they are familiar with methods specifically developed to beat engines but that's not relevant for the majority of players. These people want engines that play at their level without making ridiculous artificial mistakes. Playing against current engines is completely pointless for beginning and intermediate players. Much progress could be made there.

It seems like you could get some improvements by simply training the evaluation function on the output of the entire Deep Pink system including the negamax search.

This would be a very easy way of getting more training data, and is actually very nice theoretically. Assuming enough training time and a complex enough evaluation function, etc, you'd eventually solve chess.

I may check out the code and try this myself...

Interesting work, however training on data seems unnecessary; chess would be perfect for unsupervised learning - initially it could be trained against an existing chess program, but as the models improve, they could start competing against eachother. Although one would probably need some way of scoring any given board position (compare with DeepMind's Atari playing).

If you input a score ("{-1,0,1} on final positions") its effectively a label, that makes the training supervised rather than unsupervised. See [1] for good reasons to be skeptical of unsupervised learning in general.

See [2] for a twist on the DeepMind Atari player. They use Monte Carlo Tree Search (MCTS of automated Go playing fame) to generate training data. By feeding that more carefully generated gameplay data into the deep q-learning net, they exceed DeepMind's (non-MCTS-coupled) performance.

1. http://karpathy.github.io/2014/07/03/feature-learning-escapa...

2. http://www-personal.umich.edu/~rickl/pubs/guo-singh-lee-lewi...

Chess's too big for this.

the Houdini (and I think Rybka too) evaluation function is tweaked by letting the engine play zillions of micro games against itself in a tournament. One such micro game lasts a few seconds, and each of the players has a different setting of the parameters for the evaluation function (material, position, ...). You could apply the same meta strategy here.

Stockfish has a distributed network to allow anyone to donate computer time to test new patches. Currently running nearly 400 games/minute.


Super cool; training game players is super appealing to me as a math-software-engineer-person.

My babble: if DeepPink can gage its uncertainty on a move, it'd be cool to see a hybrid system in-action. Plus, "DeepFish" has a cool name.

Either way - nice! And thanks for putting the source on GitHub; I will have a goof with it!

A naive brute force of Chess wouldn't end. Consider the case where both players make moves that perpetuate the game rather than working toward an ending.

Chess has a 50 move stalemate condition (http://en.wikipedia.org/wiki/Fifty-move_rule) that prevents this happening.

Silly question, why does the author only approximate the number of possible positions in chess?

It's not like Go where you can actually count the number of reachable positions.

Deciding whether a single chess position is reachable can be really hard. There is a whole class of so called "retrograde" chess problems focussed on that.

For example, is the following position reachable?

White: Kc3 Ba4 Black: Kd1 Rb5 Bd5

Because the real number is unknown?

Loved it, and very impressed with how concise the source code is.

A neat and elegant neural architecture for learning chess.

Chess AI is a myth, bruteforce is the only way to win.

How do you know that humans don't use intelligent bruteforce approximations to win?

Site is down. I highly recommend using CloudFlare or another cacheing solution to avoid server overload in times of high traffic.

Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact