
Training GPT-2 to Play Chess - simulate
https://slatestarcodex.com/2020/01/06/a-very-unlikely-chess-game/
======
dwohnitmok
An amusing point from the comments

> It’s not even trying to be competitive, it’s just guessing how the game will
> continue. If you blunder, it might guess that this must be a game between
> two blundering fools, and play accordingly.

In a certain sense, GPT-2 is optimized to "look good to people interested in
AI." Above all else it tries to generate plausibly-human-looking things, while
completely oblivious of any other goal. This makes it an interesting fit for
scenarios with objective scoring criteria. It may never be "good" at the
scenario, only entertaining to human observers.

~~~
maxander
It’s a pity that the common annotation for a “surprising” move is to follow
the move with an exclamation mark (or two) rather than to precede it;
otherwise we would have a simple way of making the model generate surprising
moves on command. :)

~~~
drusepth
You could always train another model off reversed move notation and wrap it
with something that reverses the output again. ;)
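
For what it's worth, here's a minimal sketch of that wrapper idea. The helper
names and the character-level reversal are just one interpretation of
"reversed move notation", not an existing tool:

    # Hypothetical helpers: train on character-reversed moves so the "!"
    # annotation comes first, then un-reverse whatever the model emits.
    def reverse_moves(pgn_line: str) -> str:
        """Reverse each space-separated token, e.g. 'Nf3!' -> '!3fN'."""
        return " ".join(token[::-1] for token in pgn_line.split())

    def unreverse_output(model_output: str) -> str:
        """Undo the reversal so generated text reads as normal SAN again."""
        return reverse_moves(model_output)  # reversing twice restores the original

    line = "1.e4 e5 2.Nf3 Nc6 3.Bb5!"
    reversed_line = reverse_moves(line)
    print(reversed_line)                     # '4e.1 5e 3fN.2 6cN !5bB.3'
    print(unreverse_output(reversed_line))   # back to the original line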

~~~
lowdose
Then you'd probably make the rookie mistake of assuming every opponent
counters your moves with a fully functioning System 2, in Kahneman-speak.

------
epenson
Reminds me of an old project of mine: n-gram chess. Similarly ok in openings,
awful in endgames, and generally bad at chess.

[https://github.com/ElliotPenson/n-gram-chess](https://github.com/ElliotPenson/n-gram-chess)

~~~
ginger_beer_m
I bet if you benchmark your n-gram vs GPT-2, it will perform the same.

~~~
ginger_beer_m
For the chess application, I mean.

------
thomasahle
I made a similar chess engine using fastText:
[https://github.com/thomasahle/fastchess](https://github.com/thomasahle/fastchess)

It is surprising to me that you can predict optimal/strong engine moves with
27% accuracy using a completely trivial linear model, that is, a single
matrix multiplication.
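
To make the "single matrix multiplication" point concrete, here is a toy
sketch; the board encoding and sizes are illustrative, not the actual
fastchess features:

    # Toy linear move predictor: one weight matrix, one matrix multiply.
    import numpy as np

    N_FEATURES = 64 * 12   # one-hot: 12 piece types on 64 squares (illustrative)
    N_MOVES = 64 * 64      # from-square x to-square move labels

    rng = np.random.default_rng(0)
    W = rng.normal(size=(N_FEATURES, N_MOVES))  # the entire "model"

    def predict_move(board_features: np.ndarray) -> int:
        """Score every move with one matrix multiply and return the argmax."""
        logits = board_features @ W
        return int(np.argmax(logits))

    board = np.zeros(N_FEATURES)
    board[0] = 1.0                 # pretend a white rook sits on a1
    print(predict_move(board))     # index of the highest-scoring move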

I wonder how well it would compete with this GPT-2 engine.

------
veselin
GPT-2 is byte pair encoding plus a transformer. Is there any indication that
BPE plays any role here, given that the vocabulary is fixed? If not, then only
the transformer is interesting, and this post is just using the name of the
model because it sounds cool. And actually, feeding moves directly to the
transformer as tokens might improve the results.

~~~
sillysaurusx
It's unknown what role, if any, BPE plays. I was surprised to discover that
the final probability of a move is just the product of the probabilities of
its tokens from the root prompt, i.e. even though "1.Nf3 e5" is encoded as
['1', '.', 'N', 'f', '3', ' e', '5'], the probability of e5 seems unaffected
by the fact that Nf3 is three tokens as opposed to one.
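
In other words, the move probability factors into its token probabilities.
A tiny sketch of how a multi-token move gets scored (the numbers are made up):

    import math

    # Combine per-token probabilities into one move probability; log space
    # avoids underflow for long sequences.
    def move_log_prob(token_probs):
        return sum(math.log(p) for p in token_probs)

    # e.g. "Nf3" encoded as ['N', 'f', '3'] with model-assigned probabilities:
    probs_for_Nf3 = [0.20, 0.85, 0.95]
    print(math.exp(move_log_prob(probs_for_Nf3)))  # ~0.16, the move's probability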

You're right that coming up with a token mapping could help things. It's a bit
tricky to do that right now. Your main option for fitting a custom vocab seems
to be to use sentencepiece to fit a vocab, then modify the gpt-2 codebase to
use the sentencepiece library for decoding.

I am honestly not sure if the output of sentencepiece is compatible with
traditional encoders. What I mean is, it doesn't seem to generate an
encoder.json + vocab.bpe file. It seemed to be some other kind of format. So
I'm not sure if the tooling that has evolved around OpenAI's encoder format
would be applicable there. I really don't know, though.
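
If anyone wants to try, fitting a chess-specific vocab with sentencepiece
looks roughly like this; the training-text filename is the dataset file
mentioned further down the thread, and wiring the result into the gpt-2
codebase is the part that's not shown:

    # Fit a small BPE vocab directly on the PGN-style training text.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="kingbase-ftfy.txt",   # one game per line
        model_prefix="chess_bpe",
        vocab_size=2000,
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="chess_bpe.model")
    print(sp.encode("1.Nf3 e5", out_type=str))  # chess-aware pieces instead of byte-level BPE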

According to this slatestarcodex comment, someone got superior results on
solely algebraic notation (which looks like g1f3 instead of Nf3):
[https://www.reddit.com/r/slatestarcodex/comments/el87vo/a_ve...](https://www.reddit.com/r/slatestarcodex/comments/el87vo/a_very_unlikely_chess_game/fdh0vqd/)

Another extension that might help is to periodically inject the full FEN board
state. This was the format we were going to try next, which injects the full
FEN after every move:
[https://gist.github.com/shawwn/318606c112774ad070f94de9c8288...](https://gist.github.com/shawwn/318606c112774ad070f94de9c8288e0a)
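
For reference, a small sketch of that interleaved format using python-chess
(the exact layout in the linked gist may differ):

    # Interleave each SAN move with the resulting FEN board state.
    import chess

    def interleave_fen(san_moves):
        board = chess.Board()
        out = []
        for san in san_moves:
            board.push_san(san)
            out.append(f"{san} [{board.fen()}]")
        return " ".join(out)

    print(interleave_fen(["e4", "e5", "Nf3"]))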

I'm so happy to get to work with GPT-2 1.5B. It's been a lot of fun to train.

By the way, if you like this kind of thing, you'll love Elo World.
[https://www.youtube.com/watch?v=DpXy041BIlA](https://www.youtube.com/watch?v=DpXy041BIlA)

------
TehShrike
This is the chess version of all those "type these two words into your phone
and keep clicking the next word" memes

it's not going to generate anything meaningful, it's meant to get close enough
to realistic to be either funny or interesting

I was very tickled.

~~~
cjbillington
Similar, but GPT-2 is better at text prediction than the Markov chains used on
your phone.

~~~
jefftk
Do phones use Markov chains at this point? It feels like they've gotten better
recently, and I wonder if maybe they're using something fancier?

~~~
juped
Markov chains are pretty smart - you've probably just trained yours more

------
amasad
This is amusing but doesn't really prove anything special about GPT-2 or
general intelligence. You can probably get similar results with an n-gram
model.

~~~
inimino
Though this is not particularly strong, I don't think you would get similar
strength from an n-gram model. You need longer-term correlations, which is
generally where transformers do well.

~~~
sillysaurusx
Someone apparently did it with n-grams in 2015, and it reaches move 13 or so:
[https://twitter.com/kcimc/status/1214713412963291136](https://twitter.com/kcimc/status/1214713412963291136)

Someone else tried this with GPT-2 a few months ago on algebraic notation and
their engine seems to get to move 40 without blundering:
[https://www.reddit.com/r/slatestarcodex/comments/el87vo/a_ve...](https://www.reddit.com/r/slatestarcodex/comments/el87vo/a_very_unlikely_chess_game/fdh0vqd/)

Board state + algebraic notation might be the trick to make a strong engine.

------
YeGoblynQueenne
>> How impressed should we be that the same AI can write poems, compose music,
and play chess, without having been designed for any of those tasks? I still
don’t know.

For the record, you can do the same things with a Hidden Markov Model (or
hand-crafted rules) and the results won't be very different. Except that they
won't elicit breathless articles about being a "step towards general
intelligence".

~~~
sabalaba
The text generated by GPT-2 is far superior to that of HMMs. GPT-2 was able to
perform unsupervised machine translation and answered more than 5x as many
questions correctly on the SQuAD Q&A dataset as the previous best pure neural
model.

Not to mention that the text generated by GPT-2 can often fool an online
reader whereas HMMs have the problem of being long-term incoherent and don’t
reference back to subjects of the sentence like GPT-2 often does.

I’m not saying you should believe the AI hype in news media. But the paper
does contain a lot of thorough analysis and comparison to the previous state
of the art.

[https://cdn.openai.com/better-language-models/language_model...](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

~~~
YeGoblynQueenne
Leaving the question of machine translation etc. aside for the moment, this is
about playing chess from textual examples of play. There is no reason to
assume that, even if GPT-2 were really any good at machine translation, it
would be any good at chess.

I guess people think "it's a powerful model so it should do well on any task",
but that's typically not the case for neural nets. I know what OpenAI claims
about how it can do a little bit of everything, but machine translation
benchmarks are borked and I bet the question answering ones are too (which I
confess I don't know much about).

------
sillysaurusx
Hello everybody. I made this notebook. If you like this kind of thing, please
subscribe to gwern's patreon.
[https://patreon.com/gwern](https://patreon.com/gwern)

It's a GPT-2 1.5B model trained on the KingBase 2019 dataset (>3M games
between players rated >2000 Elo). It was trained for 400k steps with batch
size 6 on 140 TPUs over 24h, using a technique known as swarm training. Here's
an incomplete whitepaper on swarm training:
[https://www.docdroid.net/faDq8Bu/swarm-training-v01a.pdf](https://www.docdroid.net/faDq8Bu/swarm-training-v01a.pdf)

The dataset is available here:

    gsutil cp gs://gpt-2-poetry/data/kingbase-ftfy.txt .

Each line is of the form [Result "0-1"] [WhiteElo "2715"] [BlackElo "2793"] 1.
e4 ...

Result 0-1 means black won; 1-0 means white won; 1/2-1/2 means a draw.

At runtime I prompt it with [Result "0-1"] and a high Elo for white and black
to make it more likely to generate higher-level moves.
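
For illustration, the conditioning prompt looks roughly like this (the exact
Elo values are arbitrary):

    # Build the conditioning prompt described above; the header values are
    # illustrative, the point is just to bias generation toward strong play.
    def make_prompt(result="0-1", white_elo=2700, black_elo=2700):
        return f'[Result "{result}"] [WhiteElo "{white_elo}"] [BlackElo "{black_elo}"] 1.'

    print(make_prompt())  # feed this to the model and sample the game from "1." onward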

Our next project will be a GPT-2 IRC bot where you can talk with simulated
people. We currently have one that wasn't trained for very long, yet the
preliminary results are interesting enough to warrant a more serious time
investment.
[https://twitter.com/theshawwn/status/1208667331230089216](https://twitter.com/theshawwn/status/1208667331230089216)

Many people have asked for a thorough technical writeup which I hope to make
available soon. In the meantime, you can read some of our GPT-2 1.5B adventures
here:
[https://www.gwern.net/GPT-2#gpt-2-1.5b](https://www.gwern.net/GPT-2#gpt-2-1.5b)

Lastly, someone on /r/slatestarcodex apparently did this exact same thing a
few months ago. They trained on algebraic notation instead of PGN format,
which is basically x1y1x2y2 coordinate form with no mention of the type of
piece. It was also trained on 1B moves. The engine is superior to ours and can
apparently reach move 40 without blundering, according to the replay.
[https://www.reddit.com/r/slatestarcodex/comments/el87vo/a_ve...](https://www.reddit.com/r/slatestarcodex/comments/el87vo/a_very_unlikely_chess_game/fdh0vqd/)

I have also been porting the stylegan2 codebase to TPUs to facilitate swarm
training. We hope to train on a very large dataset like the entirety of
danbooru2018. No promises, but results are interesting so far.
[https://twitter.com/theshawwn/status/1214245145664802817](https://twitter.com/theshawwn/status/1214245145664802817)

I hope you all found this enjoyable. The GCE bill is currently $50, which I'm
keeping an eye on. (Go subscribe to gwern's patreon to see more projects like
this!)

------
empath75
Seems like it just memorized openings.

~~~
p1esk
Like all good human players do

~~~
lacker
It memorizes openings like an expert level player, then plays the rest of the
game like a six year old who just learned the rules.

~~~
antupis
changing softmax to something else might fix that when there is a limited
number of good moves softmax is far from optimal.
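
One concrete version of "something else" is to mask the distribution to legal
moves and renormalize, rather than sampling from the raw softmax over all
continuations. A sketch (the probabilities and legal-move set are made up, and
this isn't necessarily what the notebook does):

    def legal_move_distribution(move_probs, legal_moves):
        """Keep only legal moves and renormalize their probabilities."""
        kept = {m: p for m, p in move_probs.items() if m in legal_moves}
        total = sum(kept.values())
        return {m: p / total for m, p in kept.items()}

    probs = {"e5": 0.4, "Ke7": 0.1, "O-O": 0.2}    # model scores for candidate moves
    legal = {"e5", "Ke7"}                          # from a chess rules engine
    print(legal_move_distribution(probs, legal))   # {'e5': 0.8, 'Ke7': 0.2}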

------
asdfefasdfeb
1.e4 e5 2.Ke2 Nc6 3.Kf3 g6 4.Kg4 Bg7 5.Nf3 h6 6.Nxe5 Bxe5 7.d4 Bg7 8.e5. At
this point the notebook started allocating more memory, and after that it
became unresponsive.

~~~
sillysaurusx
Odd. If you click the stop button on the cell, it'll turn into a play button.
If you click that, it should resume where it left off.

If you happen to reproduce this, let me know.

------
AlexCoventry
> GPT2 Chess update: I wrote some code to calculate the probability of all
> valid chess moves. It can reach endgame now.[0]

Shocking. Our AI overlords will soon stumble into power, if we only point out
where they're slipping up.

[0]
[https://twitter.com/theshawwn/status/1213559429293060099](https://twitter.com/theshawwn/status/1213559429293060099)

------
macherm
Funny attempt! Challenge: win by minimizing the number of moves. My record so
far is mate in 8 moves:
[https://lichess.org/pG4S7RcF](https://lichess.org/pG4S7RcF)

------
wwarner
This is hilarious and also a great idea. I don't see any reason why you
couldn't have it play a few million games against itself and other engines and
see where that takes you. Less efficient than AlphaZero, probably, but how
much so?

~~~
sillysaurusx
It should be similarly efficient. AlphaZero used 1,000 TPUv1s to generate
self-play games, and a much smaller number of TPUs to train the model on the
previous self-play results. Whenever a newly trained model won >= 55% of its
evaluation games against the current best, it became the new model.

The same algorithm could be applied here.
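
For concreteness, here's a rough sketch of the loop being described; the
callables are placeholders supplied by the caller, not an existing codebase:

    def self_play_training_loop(best_model, play_games, train, win_rate,
                                n_iterations=10, games_per_iter=1000):
        # Generic AlphaZero-style loop: self-play game generation, training,
        # and head-to-head evaluation are injected as functions; nothing here
        # is specific to GPT-2 or to chess.
        for _ in range(n_iterations):
            games = play_games(best_model, games_per_iter)  # generate self-play data
            candidate = train(best_model, games)            # fit on the new games
            if win_rate(candidate, best_model) >= 0.55:     # gating evaluation
                best_model = candidate                      # promote the candidate
        return best_model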

~~~
jeffshek
It would not be close to similarly efficient. They have completely different
loss functions.

~~~
sillysaurusx
You're right, "efficient" should be substituted with "possible". We're
certainly not claiming that this is a smart way to do it, just that you can.

Still, I think there's a chance it could work well. Each move could be
prefixed with the final outcome of the game, which is similar to the technique
AlphaZero and MuZero use.
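
A sketch of what that per-move outcome prefix could look like (the bracketed
marker format is made up for illustration):

    # Prefix every move with the game's final result so the model can
    # condition on it at each step. The "<1-0>" marker format is hypothetical.
    def outcome_prefixed(result, san_moves):
        return " ".join(f"<{result}> {move}" for move in san_moves)

    print(outcome_prefixed("1-0", ["e4", "e5", "Nf3"]))
    # -> '<1-0> e4 <1-0> e5 <1-0> Nf3'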

