
Alpha Go Zero: How and Why It Works - Mageek
http://tim.hibal.org/blog/alpha-zero-how-and-why-it-works/
======
shghs
The main reason AlphaGo Zero learns so much faster than its predecessors is
that it uses temporal-difference learning.[1] This effectively removes a
_huge_ amount of the value network's state space for the learning algorithm to
search through, since it bakes in the assumption that a move's value ought to
equal that of the best available move in the following board position, which
is exactly what you'd expect for a game like Go.
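
For readers who haven't seen it, the core of TD learning fits in a few lines.
Here is a minimal tabular TD(0) sketch (illustrative only - AlphaGo Zero
applies the idea through a neural network, not a table):

    from collections import defaultdict

    # Tabular TD(0): nudge a state's value toward the bootstrapped
    # target r + gamma * V(s'), instead of waiting for the episode's
    # final outcome.
    V = defaultdict(float)       # state -> estimated value
    alpha, gamma = 0.1, 1.0      # learning rate, discount factor

    def td0_update(s, r, s_next):
        target = r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])

    # e.g. a position whose value should match its best follow-up:
    td0_update("position_a", 0.0, "position_b")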

A secondary reason for AlphaGo Zero's performance is that it combines both
value and policy networks into a single network, since it's redundant to have
two networks for move selection.
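
Concretely, the combined network is a shared trunk with two output heads. A
toy PyTorch sketch (layer sizes invented; the real network is a much deeper
residual tower):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualHeadNet(nn.Module):
        """One shared trunk feeding a policy head and a value head."""
        def __init__(self, board=19, ch=32):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            )
            flat = ch * board * board
            self.policy = nn.Linear(flat, board * board + 1)  # moves + pass
            self.value = nn.Linear(flat, 1)

        def forward(self, x):
            h = self.trunk(x).flatten(1)
            p = F.log_softmax(self.policy(h), dim=1)  # where to play
            v = torch.tanh(self.value(h))             # who is winning, in [-1, 1]
            return p, v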

These are the two biggest distinguishing characteristics of AlphaGo Zero
compared to previous AlphaGos, and the OP doesn't discuss either of them.

[1]
[https://en.wikipedia.org/wiki/Temporal_difference_learning](https://en.wikipedia.org/wiki/Temporal_difference_learning)

~~~
smallnamespace
Interestingly, the idea behind temporal difference learning is more or less
the intuition behind how people price derivatives in finance.

The expected value of a contract at time T, estimated at some time t < T, is
assumed to be equal (up to discounting) for all t -- e.g. if today we think
the contract will be worth $100 a year later, then we also think that the
_expected estimate, made n months from now, of the value [12-n] months later,
will also be $100_. This allows you to shrink the state space considerably.
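
A toy simulation of that consistency condition, which is just the tower
property of conditional expectation (numbers invented):

    import numpy as np

    rng = np.random.default_rng(0)
    # Two-period model: final contract value V = 100 + e1 + e2,
    # with independent zero-mean shocks e1 and e2.
    e1 = rng.normal(0.0, 10.0, 100_000)

    estimate_today = 100.0          # E[V], ignoring discounting
    estimates_later = 100.0 + e1    # re-estimates made after observing e1
    print(estimates_later.mean())   # ~100.0: the expected later estimate
                                    # equals today's estimate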

You can usually work out the payoff of a derivative in different scenarios
given rational exercise decisions by all contract participants. The
calculation assumes that every market participant makes the best possible
decision given the information available to them at the time; it does so by
either explicitly or implicitly building a tree and working backwards,
back-propagating the 'future' value to the root.

This closely resembles the modeling of a discrete adversarial game, except the
payoffs need to make reference to random market variables like the stock
price, so the tree nodes are not just indexed by participant action, but also
by variables.

There's actually a nice resemblance between the Longstaff-Schwartz method of
pricing American options and MCTS + AlphaGo, except that the former uses
kernel regressions instead of deep neural nets and samples from a continuous
space with an assumed probability distribution instead of a discrete space
guided by a policy network [1].

[1]
[https://people.math.ethz.ch/~hjfurrer/teaching/LongstaffSchw...](https://people.math.ethz.ch/~hjfurrer/teaching/LongstaffSchwartzAmericanOptionsLeastSquareMonteCarlo.pdf)
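
For the curious, here is a bare-bones least-squares Monte Carlo pricer for an
American put in the spirit of that paper (a sketch with invented parameters;
the original uses Laguerre basis functions where this uses a plain cubic
polyfit):

    import numpy as np

    rng = np.random.default_rng(0)
    S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
    n_steps, n_paths = 50, 20_000
    dt = T / n_steps
    disc = np.exp(-r * dt)

    # Simulate geometric Brownian motion paths.
    z = rng.standard_normal((n_paths, n_steps))
    S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt
                              + sigma * np.sqrt(dt) * z, axis=1))

    # Backward induction: start from the exercise value at maturity.
    cash = np.maximum(K - S[:, -1], 0.0)
    for t in range(n_steps - 2, -1, -1):
        cash *= disc                            # discount one step back
        payoff = np.maximum(K - S[:, t], 0.0)
        itm = payoff > 0                        # regress only in-the-money paths
        if itm.any():
            # Estimated continuation value as a function of the stock price.
            coef = np.polyfit(S[itm, t], cash[itm], 3)
            cont = np.polyval(coef, S[itm, t])
            ex = np.where(itm)[0][payoff[itm] > cont]
            cash[ex] = payoff[ex]               # exercise now on those paths
    print(disc * cash.mean())                   # ~6.1 for these parameters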

~~~
arnioxux
I think the Bellman equation (which is used extensively in reinforcement
learning) is also taught in stochastic calculus for finance (except in its
continuous form?).
[https://en.wikipedia.org/wiki/Hamilton%E2%80%93Jacobi%E2%80%...](https://en.wikipedia.org/wiki/Hamilton%E2%80%93Jacobi%E2%80%93Bellman_equation)

My memory is hazy so there might not be a real connection here.
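
For reference, the discrete-time counterpart used in RL is the Bellman
optimality equation, of which HJB is (roughly) the continuous-time limit:

    V(s) = max over a of [ r(s, a) + gamma * E[ V(s') ] ]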

~~~
jey
Yup, and a lot more! The Hamilton-Jacobi-Bellman equations come up in anything
that can be formulated as an optimal control problem.

I'm not an expert, but he is:
[http://www.athenasc.com/dpbook.html](http://www.athenasc.com/dpbook.html)

------
gwern
Some additional discussion:
[https://www.reddit.com/r/reinforcementlearning/comments/778v...](https://www.reddit.com/r/reinforcementlearning/comments/778vbk/mastering_the_game_of_go_without_human_knowledge/)
[https://www.reddit.com/r/MachineLearning/comments/7apvr4/d_i...](https://www.reddit.com/r/MachineLearning/comments/7apvr4/d_it_seems_like_alphago_zeros_biggest_successes/)

------
saagarjha
If the author's here: some of the math formulas don't render correctly. In
particular, 10^170 is parsed as 10^{1}70, and $5478$ shows up without TeX
applied to it.

~~~
Mageek
Thanks, fixed those

~~~
dmix
Another minor fix:

> A new paper was released a few days detailing a new neural net

I believe you mean "a few days _ago_"?

~~~
Mageek
kk

------
unpseudo
There are two things a human brain does when playing chess or Go: evaluating a
position and mentally playing out some positions (by searching a game tree).

The AlphaGo neural network is able to do the first part (evaluating positions),
but the tree search is still a hand-crafted algorithm. Do they have plans to
work on a version with a pure neural network? (i.e. a version which would be
able to learn how to do the tree search itself.)

~~~
nojvek
Who is to say that the human brain doesn't have a different cell type to do
this kind of search? The neocortex is shaped differently than other neural
tissue, so I imagine search and evaluation are different architectures.

But you're right. An NN way to do Monte Carlo search on GPUs would make things
even simpler.

------
nemo1618
Would be really cool to see a generic framework for this, where you can plug
in the rules of your discrete-deterministic-game-with-perfect-information and
get a superhuman bot. Does something like this already exist?

~~~
nojvek
If anyone who understands AlphaGo Zero in depth is interested: I would love to
start a GitHub project to implement a super bot that plays checkers,
tic-tac-toe, chess, Go, and related games in the browser via an OpenAI-like
interface.

------
fspeech
Given that AlphaGo Zero was trained on several million games of self-play,
each game involving hundreds of steps, and each step 1600 MCTS simulations,
the total number of board positions it has considered is on the order of
trillions (e.g. 5 million games x 200 steps x 1600 simulations = 1.6 x 10^12).
While impressive, it pales in comparison to the number of possible board
positions, 10^170
([https://en.m.wikipedia.org/wiki/Go_and_mathematics](https://en.m.wikipedia.org/wiki/Go_and_mathematics)).
So its amazing performance tells us that:

1. Possibly the elegant rules of the game cut down the search space so much
that there is a learnable function that gives us optimal MCTS supervision;

2. Or the CNN approximates human visual intuition so well that, while Zero has
evaluated only a tiny fraction of the possible positions, it has effectively
covered all the positions humans have ever considered - so it remains possible
that a different network could produce different strategies and be better than
Zero.

------
pmarreck
> It is interesting to see how quickly the field of AI is progressing. Those
> who claim we will be able to see the robot overlords coming in time should
> take heed - these AI's will only be human-level for a brief instant before
> blasting past us into superhuman territories, never to look back.

This final paragraph is just editorializing. A computer will never care about
anything (including games like Go and domination of other beings) that it is
not programmed to imitate caring about, and will thus remain perennially
unmotivated.

Also, my intuition says that gradient descent is an ugly hack and that there
HAS to be some better way (like a direct way) to get at the inverse of a
matrix (not just in specific cases but in the general case!). But I digress;
not being a mathematician, perhaps someone has already proved that a general
method to directly and efficiently invert all possible matrices is
impossible.

~~~
thomasahle
What do you mean by inverting a matrix? In the usual sense it is well studied
and understood:
[https://en.m.wikipedia.org/wiki/Invertible_matrix#Methods_of...](https://en.m.wikipedia.org/wiki/Invertible_matrix#Methods_of_matrix_inversion)
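
For the general dense case there are direct methods (Gauss-Jordan / LU
decomposition, O(n^3)); only singular matrices have no inverse. A trivial
numpy illustration:

    import numpy as np

    A = np.array([[4.0, 7.0],
                  [2.0, 6.0]])
    b = np.array([1.0, 0.0])

    A_inv = np.linalg.inv(A)    # explicit inverse via LU factorization
    x = np.linalg.solve(A, b)   # preferred: solve Ax = b without forming A_inv
    print(A_inv @ A)            # ~identity matrix, up to rounding error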

~~~
gugagore
They might be talking about an interpretation of gradient descent that this
article provides:
[http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html](http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html)

~~~
pmarreck
I elaborated here:
[https://news.ycombinator.com/item?id=15639863](https://news.ycombinator.com/item?id=15639863)

------
EGreg
I wonder how the STYLE of Alpha Go Zero is regarded by human experts. Is it
far different from AlphaGo? Why bother learning from AlphaGo if they can learn
from AlphaGo Zero?

Did they unleash a second "Master" program?

I am wondering if the "better" strategy moves are now super wacky and weird
and break all theory.

~~~
danielvf
At least initial reports are that AlphaGo Zero is more human-like than Master.
Zero packs even more of the inhuman ability to pick the most critical part of
the board for each move, but with less weird-looking stuff.

In fact, one of the obvious differences between AlphaGo Zero and top human
players is much more play on safe opening spots, which has been out of fashion
among human pros for a hundred years or so.

~~~
gizmo686
>At least initial reports are that alphaGo Zero is more human-like than
Master.

Empirically, this is not correct. The original AlphaGo achieved 57% accuracy
at predicting expert moves.

I can't find an exact number, but based on the graph in the Nature article,
AlphaGo Zero has less than 50% accuracy at predicting human moves. Eyeballing
the graph, it looks like the supervised learning variant of AlphaGo Zero
scored <55% at predicting human moves - better than reinforcement-learning
AlphaGo Zero, but still worse than the original AlphaGo.

Of course, it is not clear that the ability to predict _the_ move humans will
play is the best metric for measuring how human-like a computer plays. It is
just the only objective metric we have. Although, if this were an actual
research question, we could probably come up with better metrics.

[https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf](https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf)

[https://www.nature.com/nature/journal/v550/n7676/pdf/nature2...](https://www.nature.com/nature/journal/v550/n7676/pdf/nature24270.pdf)

~~~
taneq
I'd have thought the metric to watch was "% of moves made which seemed kooky
to human experts", not "% of human experts' moves with which the program
agrees."

------
naveen99
Someone posted an attempt at an open-source implementation of AlphaGo Zero:
[https://github.com/yhyu13/AlphaGOZero-python-tensorflow](https://github.com/yhyu13/AlphaGOZero-python-tensorflow)

Anyone try it yet?

~~~
xelxebar
There's also Leela Zero [1].

[1] [https://github.com/gcp/leela-zero](https://github.com/gcp/leela-zero)

------
bluetwo
Saw the AlphaGo movie at a festival recently.

Been following the AlphaGo Zero developments, which leap-frog what was going
on in the movie (although still very much worth seeing).

One thing I was curious about is whether Go would be considered solved, either
strongly or weakly solved, since AlphaGo Zero at this point doesn't seem to be
beatable by any living human. Wikipedia does not list it as solved in either
sense, and I was wondering if this was an oversight.

~~~
gizmo686
Go still has not been ultra-weakly solved (i.e. we do not know which player
wins under perfect play).

If AlphaGo has weakly solved go, then it should either have a 100% win rate
when playing against itself as white, or a 100% win rate when playing against
itself as black.

~~~
nojvek
Or a 100% draw rate, like tic-tac-toe.

~~~
gamegoblin
Due to komi, a Go game cannot be drawn. That is, the second player is awarded
some number of points for the disadvantage of moving second, and that number
of points typically has a fractional 0.5 in it to break ties.

The most common komi values these days are 6.5 or 7.5, depending on ruleset.

~~~
dvdkhlng
Depending on the rules, a game may be drawn due to an "eternal life" (an
endless forced repetition).

[https://senseis.xmp.net/?EternalLife](https://senseis.xmp.net/?EternalLife)

In Go this is very rare (as opposed to e.g. chess).

------
erikb
I don't get what is new in the set of attributes that this article describes.

Monte Carlo was already used in 2005 in AIs playing on KGS. Gradient descent
is a basic algorithm that I saw in an AI class in ~2008 as well. I bet both
are even a lot older and well known by all experts.

This is not what makes AlphaGo special or Zero successful. The curious thing
about Zero is that usually with gradient descent you run a huge risk of
running into a local maximum and then stop evolving, because every step leaves
you no better than where you currently are.

So one question is actually how they used these same old algorithms so much
more efficiently, and the second question is how they overcame the local
maximum problem. Additionally there may be other problems involved that
experts know better than me.

But an explanation of basic algorithms can't be the answer.

~~~
gwern
> The curious thing about Zero is that usually with Gradient Descent you run a
> huge risk of running into a local maximum and then stop evolving because
> every evolution makes you not better than the current step.

No. The curious thing is that you can train a godawful huge NN with 40 layers
via pure self-play with no checkpoints or baselines or pretraining or library
of hard problems or any kind of stabilization mechanism, and it _won't_
diverge but will learn incredibly rapidly and well and stably. As Silver says
in the AmA, _all_ their attempts at pure self-play ran into the usual
divergence problems where the training explodes and engages in catastrophic
forgetting, which is what the RL folklore predicts will happen if you try to
do that. Local maximums are not the problem - the problem is the self-play
can't even find a local maximum much less maintain it or improve it.

> So one question is actually how they used these same old algorithms so much
> more efficiently, and the second question is how did they overcame the local
> maximum problem.

Er, this is exactly what OP is all about: the Monte Carlo tree search
supervision. That's how they used them.

~~~
psb217
If you read the AGZ paper closely, they actually use checkpoints during
training. Specifically, during training they only perform updates to the
"stable" set of parameters when the current "learning" set of parameters
produces a policy which beats the stable set at least 55% of the time. The
current stable parameters are what they use for generating the self-play data
which they use to update the current "learning" parameters. I believe this is
only mentioned in the supplementary material...
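
Roughly, the gating logic is (a paraphrase of the supplementary material, not
DeepMind's code; self_play, train, and play_match are hypothetical helpers):

    GATE = 0.55  # promotion threshold from the paper

    def training_step(stable, learning):
        games = self_play(stable)           # data always comes from the stable net
        learning = train(learning, games)   # update only the learning net
        # play_match returns the learning net's win rate over the match
        if play_match(learning, stable, n_games=400) >= GATE:
            stable = learning               # promote the challenger
        return stable, learning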

~~~
gwern
I did, and I would point out that while they use checkpoints, the training
curves indicate this is not necessary. What I meant is that they do _not_ use
the usual self-play (and evolutionary) mechanism of checkpoints drawn from
throughout the training history, which is normally necessary to combat
catastrophic forgetting (and which apparently wasn't enough to stabilize Zero
on its own: it is the single most obvious thing to do, but Silver notes that
all the pre-Zero self-play runs diverged until they finally came up with the
MCTS supervision). The checkpoint mechanism there appears no more necessary
than checkpoints in training any NN - it's critical to avoid a random error or
bug wasting weeks of time, but it does not affect the training dynamics in any
important way.

~~~
gwern
(And Anthony et al 2017 don't use checkpoints at all, noting that it slows
things down a lot for no benefit in their Hex agent.)

------
zeristor
Are there any plans to do this for Chess?

I imagine that this is an iteration of the AlphaGo engine, and the people
working on this are very current with AlphaGo.

If Chess is similar, then wouldn't DeepMind be able to bootstrap game
knowledge? Perhaps this isn't a big goal, but Chess is Chess after all.

~~~
naveen99
There was a project for chess, called Giraffe
[https://bitbucket.org/waterreaction/giraffe](https://bitbucket.org/waterreaction/giraffe),
whose author shut it down after joining Google DeepMind:
[http://www.talkchess.com/forum/viewtopic.php?t=59003](http://www.talkchess.com/forum/viewtopic.php?t=59003)

He thinks it's only a matter of time until machine learning beats hand-crafted
systems like Stockfish even in chess.

[http://arxiv.org/abs/1509.01549](http://arxiv.org/abs/1509.01549)

~~~
nojvek
Does AlphaZero beat Stockfish purely by self-play? That would be huge.
Stockfish is the result of so many hand-crafted optimizations over a large
game-play dataset.

~~~
thom
Intuitively you'd expect deep learning to improve on Stockfish's evaluation of
positions, but perhaps not at the same level of throughput. I'm also intrigued
as to whether a purely self-taught AI system can compare in the endgame to an
engine with access to a 7-piece endgame tablebase, whose evaluation of those
positions is obviously _perfect_.

------
partycoder
Go has been studied for hundreds of years - in many cases by people who have
studied the game since childhood and work on it as a full-time occupation.

The consequence of AlphaGo Zero is that it can, in a matter of days, disregard
and surpass all human knowledge about the game.

Maximizing the score margin has long been equated with maximizing your
probability of winning. AlphaGo doesn't play like that... it's a substantial
paradigm shift. If you watch the early commentaries you will see that human
players initially called AlphaGo's moves mistakes because they were slow and
wasted opportunities to gain more territory, only to realize later that
AlphaGo was actually winning.

~~~
ImSkeptical
I think people have always understood the difference between maximising
probability of victory versus score. Even in amateur games you'll get the
feeling "I could fight here, and it would be complicated, but I might get a
huge advantage, or I could just play safe and keep my advantage."

------
yters
Are there adversarial examples for Alpha Go Zero?

~~~
danielvf
What do you mean?

AlphaGo Zero has played itself, and they have published 20 of those games.

~~~
Barrin92
The OP is referring to the use of a GAN, which is a different type of setup.

GANs were not used for AlphaGo; as the article points out, DeepMind uses
reinforcement learning and MCTS.

~~~
nschucher
They're most likely referring to adversarial attacks where degenerate inputs
are constructed that could cause AlphaGo Zero to perform sub-optimally or
catastrophically fail (see OpenAI Research [0]). This is distinct from
generative adversarial networks (GANs) or adversarial self-play (which I guess
AlphaGo Zero is an example of).

[0] [https://blog.openai.com/adversarial-example-research/](https://blog.openai.com/adversarial-example-research/)
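
For concreteness, the canonical construction is the fast gradient sign method:
perturb the input in the direction that most increases the model's loss. A
generic sketch (not something demonstrated against AlphaGo Zero specifically):

    import torch

    def fgsm(model, x, y, loss_fn, eps=0.01):
        """Return an adversarially perturbed copy of input x."""
        x = x.clone().detach().requires_grad_(True)
        loss_fn(model(x), y).backward()
        # Step along the sign of the input gradient to increase the loss.
        return (x + eps * x.grad.sign()).detach()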

------
javajosh
"On a long enough timeline, everything is a discrete game." (With apologies to
_Fight Club_)

Personally, I look forward to the day when the software I own works for me to
the extent of optimizing the decisions I make during the day, even many
mundane ones. Properly executed, such a system could make a big difference in
my quality of life. I believe that a big piece that is missing is a solid
life-model, a life-representation, that can be optimized. Once that is
defined, an RNN or MCS can optimize it and I can reap the benefits.

~~~
ben_w
If it is optimising all your decisions, even the mundane ones, is it really
_your_ life any more? If it plugged into your brainstem and took over, by
definition nobody else would notice the difference (you always did what it
said anyway, or you’d be suboptimal), so why not just flood the 20 watt
carbon-based neural network with bliss drugs and let the silicon-based neural
network be the person instead?

~~~
javajosh
I think sometimes it's hard for us to zoom out and solve bigger problems that
we might not see. Like, if I don't enjoy the chore I'm doing, I can make it
easier, or I can _eliminate_ it. (Paying a bill in person is a good example).
Or, maybe there is an object that is generating a lot of work for me, like a
house, and there is an alternative object (an apartment). Heck, sometimes it
would be nice to have a record of all the things I enjoy and just have an AI
path-find to maximize revisiting those things.

------
QML
What makes this different from a minimax algorithm with alpha-beta pruning?

~~~
gwern
Aside from MCTS being a different tree-search method, the key difference is
the 'closing of the loop', which plain minimax with alpha-beta lacks. In
regular MCTS, it is far from unheard of to replace the random playouts with
some 'heavier' heuristic to make the playouts slightly better estimators of
the node value, but the heavy playouts do not do any kind of learning: the
heuristic you start with is the heuristic you end with. What makes this
analogous to policy iteration (hence the names for the Zero algorithm of
'tree iteration' or 'expert iteration') is that the refined estimates from the
multiple heavy playouts are then used to improve the heavy playout heuristic
(i.e. a NN which can be optimized via backpropagation as supervised learning
of board position -> value). Then, in a self-play setting, the MCTS
continually refines its heavy heuristic (the NN) until it's so good that the
NN+MCTS is superhuman. At play time you can drop the MCTS entirely and just
use the heavy heuristic to do a very simple tree search to choose a move
(which I think might actually be a minimax with a fixed depth, but I forget).
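
In pseudocode, the closed loop looks something like this (a schematic sketch;
new_game, mcts_search, sample_move, and train are placeholders, and details
like exploration noise are omitted):

    def expert_iteration(net, n_iters, n_games):
        """MCTS refines the net's policy; the net then learns the refinement."""
        for _ in range(n_iters):
            dataset = []
            for _ in range(n_games):
                game, history = new_game(), []
                while not game.over():
                    # heavy playouts: tree search guided by the current net
                    counts = mcts_search(net, game.state())
                    history.append((game.state(), counts))
                    game.play(sample_move(counts))
                # label every position with the final outcome
                dataset += [(s, pi, game.outcome()) for s, pi in history]
            # supervised step: policy head -> MCTS visit distribution,
            # value head -> game outcome
            net = train(net, dataset)
        return net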

------
cgearhart
I wonder how AlphaGo Zero would fare against the others if they were all using
the same search algorithm, and I wonder how the search depth vs breadth
changes in Zero compared to earlier variants.

------
Blazespinnaker
Mageek, any reason why they haven't applied this to chess yet?

~~~
glinscott
MCTS has done very poorly on chess compared to alpha-beta. Chess has a high
number of forcing moves in capture sequences, and it's been very difficult to
evaluate positions that are not settled. Traditionally an algorithm called
quiescence search is used, but it relies on doing an evaluation at each node
of the search, which would be prohibitive given the latency of a network
evaluation.
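
For reference, a bare-bones quiescence search looks like this (schematic;
evaluate and capture generation are placeholders). The point is the evaluate()
call at every node, cheap for a hand-written function but costly for a neural
net:

    def quiescence(pos, alpha, beta):
        """Search only captures until the position goes 'quiet'."""
        stand_pat = evaluate(pos)          # static evaluation at *every* node
        if stand_pat >= beta:
            return beta
        alpha = max(alpha, stand_pat)
        for move in pos.capture_moves():   # consider only forcing moves
            score = -quiescence(pos.play(move), -beta, -alpha)
            if score >= beta:
                return beta
            alpha = max(alpha, score)
        return alpha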

One of the things that amazed me the most about AlphaGo Zero was that they
didn't do any tricks to minimize the latency of the network evaluation!

Still, it's certainly worth a try, I'd be extremely interested to see what
style of chess a self-trained MCTS chess version would have :).

------
jacobkg
Can this technique be used to write a strong chess engine?

~~~
Choco31415
Yes. As described, all that’s needed is a way to enumerate all possible moves
from a game state, and to check whether a game state corresponds to a
win/tie/loss. That is possible in Tic-Tac-Toe, Chess, and Go.
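
Something like this minimal interface is all the algorithm needs (an
illustrative sketch; the method names are invented):

    class Game:
        """The minimal game interface AlphaGo-Zero-style training needs."""

        def legal_moves(self, state):
            """All moves playable from this state."""
            raise NotImplementedError

        def next_state(self, state, move):
            """The state reached by playing a move."""
            raise NotImplementedError

        def outcome(self, state):
            """None while the game is ongoing, else +1/0/-1 (win/tie/loss)."""
            raise NotImplementedError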

~~~
Blazespinnaker
Surprisingly, they haven't done it though. Wonder why. Is it because of the
number of different state changes that can occur from a particular position?
Maybe Go is easier to solve than chess.

It would be very interesting to see a computer improve on chess openings.

~~~
evandijk70
First of all, Go is not 'solved'. While the AI plays better than the best
humans, it does not play perfectly (nor will any AI program in the foreseeable
future).

If by 'solving' you mean developing a strong AI program to play chess, I think
the reason they haven't done it is that chess is easier to solve than Go. The
best computer chess engines are already a lot better than the best humans.
Thus, there is not much for Google to gain by finding a new approach.

There have been attempts to use similar approaches for chess, but they all
seemed to perform worse than the traditional approach (alpha-beta search with
a hand-written evaluation function and null-move pruning).

The differences between Go and Chess that might explain this are:

- Chess has a smaller branching factor (the number of legal moves is around
40 in a typical chess position, whereas it's over 100 in a typical Go
position)

- A chess position is easier to evaluate. Simply counting the material already
gives a decent evaluation in a lot of cases. This, supplemented by
hand-written rules, gives a solid evaluation of a position (see the sketch
after this list)

- The average game of chess is shorter. A game of chess lasts between 20 and
80 moves, whereas a game of Go lasts around 200 moves. This makes it a lot
more feasible to extend the search far enough to foresee the consequences of
each move in chess when compared to Go
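
For instance, the material-count baseline mentioned above is nearly a
one-liner (a toy sketch using the classic piece values):

    # Classic values: pawn 1, knight 3, bishop 3, rook 5, queen 9.
    PIECE_VALUES = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9}

    def material_eval(board):
        """board: dict square -> piece; uppercase white, lowercase black."""
        score = 0
        for piece in board.values():
            v = PIECE_VALUES.get(piece.upper(), 0)  # king has no exchange value
            score += v if piece.isupper() else -v
        return score  # positive favors White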

------
falcor84
As someone who has played many a game of Tic-Tac-Toe, I found the numerical
examples really hard to follow. s(0,5) is obviously the winning move for the X
player, but for some reason all examples seem to favor s(0,1).

~~~
alkonaut
Winning move how? With which board?

with

    O__
    _X_
    O_X

Then X can't win on the next move and must choose (0,1), between the two O's
in the left column, in order not to lose.

~~~
falcor84
D'oh, thanks. I suppose I just misread the oh's.

