
Building an AlphaZero AI using Python and Keras - davidfoster
https://applied-data.science/blog/how-to-build-your-own-alphazero-ai-using-python-and-keras/
======
glinscott
There is a public distributed effort happening for Go right now:
[http://zero.sjeng.org/](http://zero.sjeng.org/). They've been doing a
fantastic job, and just recently fixed a big training bug that has resulted in
a large strength increase.

I ported GCP's Go implementation over to chess:
[https://github.com/glinscott/leela-chess](https://github.com/glinscott/leela-chess). The distributed part isn't ready yet; we are still working the bugs out using supervised training, but will be launching soon!

~~~
yazr
From the leela-zero page :

> If your CPU is not very recent (Haswell or newer, Ryzen or newer),
> performance will be outright bad,

Obviously TPU >> GPU >> CPU

But are there special vector instructions added in Haswell? Or is this just a
general preference for a newish multi-core CPU?

~~~
cbuq
Not sure if this is their reasoning, but Haswell introduced AVX2 instructions
([https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Adv...](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2))

------
superbatfish
In 1988, Victor Allis "solved" the game of Connect 4, proving (apparently)
that the first player can always force a win, even if the second player plays
perfectly.

In 1996, Giuliano Bertoletti implemented Victor Allis's strategy in a program
named Velena:

[http://www.ce.unipr.it/~gbe/velena.html](http://www.ce.unipr.it/~gbe/velena.html)

It's written in C. If someone can get it to compile on a modern system, it
would be interesting to see how well the AlphaZero approach fares against a
supposedly perfect AI.

~~~
gwern
To benchmark A0 further, you can take whatever NN you're training and instead
train it against the 'ground truth' of Velena's moves when going first (the NN
predicts, for each possible move, whether Velena would take it or not, with
0/1 labels). Then one can see how closely the A0 NN trained via expert
iteration approaches the Velena-supervised NN, and how much compute is spent
on the need to train from scratch.
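
Something like this minimal Keras sketch, say; the `positions` and
`velena_labels` arrays are placeholders you'd have to generate from Velena
yourself, and the architecture is arbitrary:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense, Flatten

    # Connect 4: 6x7 board, one plane per player; 7 candidate moves (columns).
    positions = np.zeros((10000, 6, 7, 2))    # placeholder training positions
    velena_labels = np.zeros((10000, 7))      # 1 where Velena would play

    model = Sequential([
        Flatten(input_shape=(6, 7, 2)),
        Dense(256, activation='relu'),
        Dense(7, activation='sigmoid'),       # per-move 0/1 prediction
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(positions, velena_labels, epochs=10, batch_size=64)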

------
chrisfosterelli
Can someone share some intuition of the tradeoffs between monte-carlo tree
search compared to vanilla policy gradient reinforcement learning?

MCTS has gotten really popular since AlphaZero, but it's not clear to me how
it compares to simpler reinforcement learning techniques that just have a
softmax output over the possible moves the agent can make. My intuition is
that MCTS is better for planning, but takes longer to train/evaluate. Is that
true? Are there some games where one will work better than the other?

~~~
ericjang
In vanilla policy gradient, one plays the game to the end and then bumps the
probability of _all_ actions taken by the agent up (if AlphaGo won) or down
(if it lost). This is very slow because there are ~150 moves in an expert game,
and we do not know which moves caused decisive victory or loss - i.e. the
problem of "long term credit assignment". Also, it is actually preferable to
compute the policy gradient with respect to the _advantage_ of every action,
so we encourage actions that were better than average and punish actions that
were worse than average - otherwise the policy gradient estimator has high
variance.
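
As a toy sketch (not AlphaGo's actual implementation), here is that update
for a linear softmax policy in NumPy; the thing to notice is that every move
in the game receives the same (reward - baseline) weight:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def reinforce_update(theta, trajectory, final_reward, baseline, lr=0.01):
        """theta: (n_features, n_moves) weights of a linear softmax policy.
        trajectory: list of (state, action) pairs from one finished game."""
        advantage = final_reward - baseline   # better/worse than average
        for state, action in trajectory:
            probs = softmax(state @ theta)
            grad_logits = -probs              # d log pi(a|s) / d logits
            grad_logits[action] += 1.0
            # every move gets the same credit - the assignment problem above
            theta += lr * advantage * np.outer(state, grad_logits)
        return theta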

I think about MCTS in the following way: suppose you have a perfect
"simulator" for some reinforcement learning task you are trying to accomplish
(i.e. real-world robot grasping for a cup). Then instead of trying to grasp
the cup over and over again, you can just try/"plan" in simulation until you
arrive at a motion plan that picks up the cup.

MCTS is exactly a "planning" module, and it works so well in Go because the
simulator fidelity is perfect. AlphaGo can't model adversary behavior
perfectly, but MCTS and the policy network complement each other because the
policy reduces the search space of MCTS. As long as the best adversary is not
far away from the space that MCTS + policy is able to consider, AlphaGo can
match or beat the adversary. Then, we train the value network to amortize the
computation of the MCTS operator (via the Bellman equation). Finally, self-play is
an elegant solution for keeping adversary + policy close to each other.
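
Concretely, the policy narrows the search through the PUCT selection rule
applied at each MCTS node; a rough sketch (the constant and the +1 smoothing
are illustrative, not the paper's exact values):

    import numpy as np

    def select_move(Q, N, prior, c_puct=1.5):
        """Q: mean value per move, N: visit counts, prior: policy net probs.
        Moves the policy likes get a larger exploration bonus, so MCTS
        spends its simulations in the subtree the policy finds plausible."""
        U = c_puct * prior * np.sqrt(N.sum() + 1) / (1 + N)
        return int(np.argmax(Q + U))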

For more rigorous mathematical intuition, Ferenc Huszar has a nice blog post
on MCTS as a "policy improvement operator": [http://www.inference.vc/alphago-zero-policy-improvement-and-...](http://www.inference.vc/alphago-zero-policy-improvement-and-vector-fields/)

~~~
chrisfosterelli
Thanks for the response! I'm familiar with how AlphaZero works; I'm mostly
curious how performance/speed compare in situations with perfect information,
where it's possible to simulate perfectly (Pong/Go/chess/etc.).

I did not realize that MCTS helps with the credit assignment problem, that's
really interesting!

------
bluetwo
FYI: The AlphaGo documentary is now on Netflix.

~~~
dmix
Direct link:
[https://www.netflix.com/title/80190844](https://www.netflix.com/title/80190844)

~~~
alanfalcon
Thank you (and GP)! I know what I’ll be doing for the next 90 minutes.

------
frenchie4111
Shameless self-plug: I spent a Saturday morning recently doing a similar
thing (no Monte Carlo, no AI library) with tic-tac-toe. I based it mostly on
intuition; would love any feedback.

[https://github.com/frenchie4111/genetic-algorithm-playground...](https://github.com/frenchie4111/genetic-algorithm-playground/blob/master/tictactoe.ipynb)

~~~
anfractuosity
I've only skimmed it a little bit so far, but I'm wondering: if both players
use the optimal strategy, shouldn't the game always result in a draw?

Which is why I was a bit confused by the target '80% Win/Tie rate going
second', but I could well be missing something.

Edit: I'm an idiot; I see now that the opponent takes random moves. Seems a
fun project :). A while ago I built a very simple rule-based tic-tac-toe
thing in Lisp, but the rules were all hardcoded, alas.

------
Avery3R
Use this to get rid of the obnoxiously large sticky header:
[https://alisdair.mcdiarmid.org/kill-sticky-headers/](https://alisdair.mcdiarmid.org/kill-sticky-headers/)

------
Will_Parker
> Not quite as complex as Go, but there are still 4,531,985,219,092 game
> positions in total, so not trivial for a laptop to learn how to play well
> with zero human input.

That's a small enough state space that it is indeed trivial to brute force it
on a laptop.

Putting that aside, though, it would be interesting to compare it against a
standard alpha-beta pruning minimax algorithm running at various depth levels.
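
For reference, a plain negamax alpha-beta searcher is only a few lines; the
game interface here (legal_moves/apply/undo/evaluate/is_over) is hypothetical
and would need a Connect 4 implementation behind it:

    def alphabeta(game, depth, alpha=float('-inf'), beta=float('inf')):
        """Negamax alpha-beta: the score is always from the side to move."""
        if depth == 0 or game.is_over():
            return game.evaluate()
        best = float('-inf')
        for move in game.legal_moves():
            game.apply(move)
            score = -alphabeta(game, depth - 1, -beta, -alpha)
            game.undo(move)
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:   # opponent would never allow this line
                break
        return best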

------
smortaz
Thanks for the great demo! Uploaded to Azure Notebooks in case anyone wants to
run/play/edit...

[https://notebooks.azure.com/smortaz/libraries/Demo-DeepReinf...](https://notebooks.azure.com/smortaz/libraries/Demo-DeepReinforcementLearning)

Click Clone to get your own copy, then Run the run.ipynb file.

------
thrw3249845
As an aside, does anybody know the monospace font that we see in the
screenshots? Here, for instance: [https://cdn-images-1.medium.com/max/1200/1*8zfDGlLuXfiLGnWlz...](https://cdn-images-1.medium.com/max/1200/1*8zfDGlLuXfiLGnWlzvZwmQ.png)

~~~
fileeditview
I don't think it's Office Code Pro but it looks very similar to me. Maybe that
helps you with the search!

------
datascientist
RISE Lab's Ray platform (which now includes RLlib) is another option:
[https://www.oreilly.com/ideas/introducing-rllib-a-composable...](https://www.oreilly.com/ideas/introducing-rllib-a-composable-and-scalable-reinforcement-learning-library)

------
wyattk
Does anyone have a different link to this? It's insecure and Cisco keeps
blocking it, so I can't just proceed from Chrome.

~~~
chrisfosterelli
The link looks https to me, but this is where it redirects, if that's of any
use: [https://medium.com/applied-data-science/how-to-build-your-ow...](https://medium.com/applied-data-science/how-to-build-your-own-alphazero-ai-using-python-and-keras-7f664945c188)

------
m3kw9
I’d like the title more if it were “Roll your own AlphaZero using Keras and
Python”.

------
poopchute
Is there a magic incantation I have to say to get this to run? Jupyter says
I'm missing things when I try to run it (despite installing the things with
pip).

------
make3
AI trained in a perfect simulation is not AI. It is exactly the one scenario
where AI is easy.

~~~
red75prime
Too vague. Do you predict that AlphaZero would not be able to beat humans in
a game like Go, but where there's a 1 in 100 chance for a stone to land in an
unintended place?

------
_pdp_
Very nice find! I loved it!

------
X6S1x6Okd1st
If you actually want to contribute towards an open-source AlphaZero
implementation, you may want to check out [https://github.com/gcp/leela-zero](https://github.com/gcp/leela-zero)

