
Introduction to Monte Carlo Tree Search - wlkr
http://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
======
shoo
UCB bandit algorithms are quite simple to understand, and the generalisation
of them by the UCT algorithm to tree search spaces seems like a pretty clean
idea.
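(For readers following along, the UCB1 rule being discussed can be sketched in a few lines of Python; the statistics layout and the exploration constant here are illustrative assumptions, not anyone's reference implementation.)

```python
import math

def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    # UCB1: average payout plus an exploration bonus that shrinks as a
    # child is visited more often. c = sqrt(2) is the textbook constant;
    # in practice it is tuned per problem.
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select(stats):
    # stats: list of (wins, visits) pairs, one per child move.
    # Returns the index of the child maximizing the UCB1 score.
    total = sum(v for _, v in stats)
    return max(range(len(stats)), key=lambda i: ucb1(*stats[i], total))
```

Note that every child needs at least one visit before the formula even applies, which is exactly the cost described below: with no notion of similarity between states, each one needs its own sample.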

I had a go at using UCT to explore a search space earlier this year (not
games, but selecting a good sequence of decisions inside a simulation).

The problem I ran into with this approach was that the algorithm has no way to
generalise or lift experience from one explored state to other similar /
unexplored states, and my search space was large. So having to evaluate each
state at least once was prohibitive.

This isn't really a criticism of how well UCT can tackle the problem as
framed, but more a criticism of how much better one might be able to do with a
bit more structure -- e.g. perhaps with the addition of some kind of
similarity metric between states, along with some bounds or assumptions on how
the values of "similar" states should relate to each other.

It seems like combining some machine learning / statistical / approximation
techniques to approximate the true state->value function with a UCT-like
approach to guide exploration could be very effective for the right problem,
but getting it to work well might be more of an art than a science.

If anyone has any pointers to reading about this kind of thing, I'd be most
grateful! I have previously read about "approximate dynamic programming"
techniques, e.g. [1]

[1] [http://castlelab.princeton.edu/Papers/Powell-NRLWhat%20you%2...](http://castlelab.princeton.edu/Papers/Powell-NRLWhat%20you%20should%20know%20about%20approximate%20dynamic%20programming.pdf)

~~~
versteegen
Any sequence of one or more decisions with varying results _is_ a game, but
you probably know that.

Aside from estimating the value of a node from similar nodes, another option
is to reduce the search space by putting them in the same information set.
This means that to one player they appear indistinguishable, so the same
strategy is played in all of them, while the other players might still be able
to distinguish them. If your game has a huge state space it's likely that some
of the state just hardly matters some of the time to some of the opponents.
Other methods of fusing states have the disadvantage that all players have to
care about the same information.
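A minimal sketch of the keying idea, with a made-up card-game state (the field names and the observation function are assumptions for illustration): tree nodes are indexed by what a player can observe, so full states that differ only in hidden information share one node and one strategy.

```python
def information_set_key(state, player):
    # Hypothetical observation function: project the full state down to
    # what `player` can see. States with the same projection belong to
    # one information set and share a single tree node / strategy.
    own_hand = tuple(sorted(state["hands"][player]))
    public = tuple(state["board"])
    return (player, own_hand, public)

# Two full states differing only in the opponent's hidden hand:
s1 = {"hands": {0: [3, 7], 1: [2, 9]}, "board": [5]}
s2 = {"hands": {0: [3, 7], 1: [4, 8]}, "board": [5]}
```

Player 0 gets the same key for both states (they look identical from that seat), while player 1, who holds the differing cards, still distinguishes them; that asymmetry is what other state-fusing methods lack.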

To implement MCTS with information sets, use the Information Set MCTS
algorithm (ISMCTS) [1], which is a pretty obvious and intuitive extension of
MCTS. (Has nobody really described ISMCTS before?) Whitehouse's PhD thesis [2]
goes into more detail, and looks at using ISMCTS to reduce the search space in
games ("ICARUS"), though I didn't read that chapter. (Earlier this year I
mostly-implemented ISMCTS for Cordial Minuet, a game with simultaneous moves
in addition to imperfect information. Information sets are unavoidable in
simultaneous-move games.)

Also, you will find loads of papers specifically discussing speeding up MCTS
for Go, many of which generalise.

[1]
[http://eprints.whiterose.ac.uk/75048/1/CowlingPowleyWhitehou...](http://eprints.whiterose.ac.uk/75048/1/CowlingPowleyWhitehouse2012.pdf)
[2]
[http://etheses.whiterose.ac.uk/8117/](http://etheses.whiterose.ac.uk/8117/)

~~~
jeffbradberry
Excellent. I did want to eventually write something showing how to modify this
algorithm to handle games with randomness, and ISMCTS at a glance looks like
it does just that. Thanks!

~~~
versteegen
Actually, you don't need information sets to handle games with chance nodes.
Standard MCTS works just fine for those. (Don't use a bandit algorithm at the
chance nodes, just pick randomly.) Information sets are only needed if you
have uncertainty, either because players make simultaneous moves or a player
doesn't know (or forgets) the complete state of the game.
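That distinction is easy to sketch (illustrative node layout, assumed field names): selection samples chance nodes and applies UCB only at decision nodes.

```python
import math
import random

def pick_child(node):
    # Chance node (die roll, card draw): sample nature's move; a bandit
    # rule here would wrongly bias the event distribution.
    if node["chance"]:
        return random.choice(node["children"])
    # Decision node: ordinary UCB1 selection over the children.
    total = sum(c["visits"] for c in node["children"])
    return max(node["children"],
               key=lambda c: c["wins"] / c["visits"]
                   + math.sqrt(2 * math.log(total) / c["visits"]))
```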

------
jasisz
This algorithm (in the regular form often used in games and examples) has one
interesting "downside" I was exploring some time ago: selection is performed
using the UCB formula, so it basically tries to maximize the player's payout.
But in most games this assumption is impractical, because we end up tending to
expand branches that our opponent will most likely not choose. As in the
example (I assume gray moves are "our" moves), we are much more likely to
expand the 5/6 branch instead of the 2/4 branch, which is in fact the one our
opponent is more likely to choose.

~~~
panic
In a game state where your opponent will be choosing the next move, you should
select the next state to expand based on a UCB formula involving your
opponent's expected payout, not your own.
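A hedged sketch of what that looks like in code (assumed node layout: each child keeps a win count per player):

```python
import math

def ucb_select(children, player_to_move):
    # Each child stores win counts from every player's perspective.
    # Whoever moves at this node maximizes *their own* UCB score, so at
    # opponent nodes the tree is expanded toward the opponent's best
    # replies rather than ours.
    total = sum(c["visits"] for c in children)
    def score(c):
        return (c["wins"][player_to_move] / c["visits"]
                + math.sqrt(2 * math.log(total) / c["visits"]))
    return max(children, key=score)
```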

~~~
jasisz
But this requires storing and backpropagating this information for the other
players, something I really haven't seen in any examples (nor in this
article). We also cannot assume that the game is always zero-sum, in which
case this information would not be needed.

------
JD557
A quick note to the author (in case he reads this): I think the "4/8" node
should be "5/8" in all images.

~~~
jeffbradberry
Actually, I think it should be 3/8, since it's supposed to be the opposing
player's win count.

These diagrams were supposed to just be an update of the diagrams from the
MCTS Wikipedia article, adding the correct backpropagation values, but I see
there are more problems than that.

~~~
jasisz
This is exactly the problem I've written about: AFAIK the basic UCT algorithm
does not model your opponent in the selection phase (only in the simulation
phase, when some heavier-than-random logic is applied).

~~~
jeffbradberry
Since the win counts are only incremented for nodes "moved into" by the
winning player, UCB does correctly model the opponent's behavior here.

~~~
jasisz
In your example script, yes, but basic UCT does not do that, simply because
UCT was not originally meant for multi-player games. And this is an assumption
we make about our opponents (namely, that they want to maximize their own
payout, not e.g. to win or to minimize our score). Of course this is a very
straightforward "application" of UCT or MCTS to multiplayer games, which was
done in the work of Sturtevant and Cazenave in 2008.

But it is not so easy to know about this; e.g. this is a very recent change to
the Wikipedia page on the topic
[https://en.wikipedia.org/w/index.php?title=Monte_Carlo_tree_...](https://en.wikipedia.org/w/index.php?title=Monte_Carlo_tree_search&action=historysubmit&type=revision&diff=671256114&oldid=667621950)
and people writing on the topic very often do not point this out at all, which
I find very strange and misleading.

Also, in classic MCTS you should select the move with the most visits, not the
one with the highest percentage of wins.
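The two final-move policies can be contrasted in a couple of lines (illustrative field names):

```python
def best_move(children, robust=True):
    # "Robust child": the most-visited move, the classic MCTS choice;
    # "max child": the best observed win rate. A lightly sampled child
    # can have a flattering win rate, which is why visit count is often
    # considered the safer default.
    if robust:
        return max(children, key=lambda c: c["visits"])
    return max(children, key=lambda c: c["wins"] / c["visits"])
```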

~~~
jeffbradberry
In what appears to be the originating paper for UCT ("Bandit based Monte-Carlo
Planning", L. Kocsis and C. Szepesvári, 2006),

* The choice of action to be returned is the one "with the highest average observed long-term reward"

* For simplicity, the payout value used in the paper is 1=win and 0=loss, which will result in the agents maximizing their wins. Presumably one could choose other payout values (i.e. points for that player in games that have that concept) to adjust the priority of the agents. The mathematics does not seem to forbid it.

* This paper uses as their first experiment a multi-player game. They state, "...for P-games UCT is modified to a negamax-style: In MIN nodes the negative of estimated action-values is used in the action selection procedures." It is straightforward and more generalizable to games with N > 2 players to simply record values from that player's standpoint in the first place, instead of manipulating it after the fact in this manner.
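The negamax-style selection from the paper, as opposed to recording values per player, can be sketched as follows (assumed node layout; the sign flip is the only two-player-specific piece):

```python
import math

def negamax_ucb(children, is_min_node):
    # The two-player trick from Kocsis & Szepesvari: a single value is
    # stored per node, and at MIN nodes its sign is flipped before adding
    # the exploration bonus. Per-player win counts generalize this to
    # games with N > 2 players.
    total = sum(c["visits"] for c in children)
    sign = -1.0 if is_min_node else 1.0
    def score(c):
        return (sign * c["value"] / c["visits"]
                + math.sqrt(2 * math.log(total) / c["visits"]))
    return max(children, key=score)
```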

I hope this clarifies some things.

~~~
jasisz
From the "An analysis of UCT in multi-player games", Nathan Sturtevant, 2008:
"Multi-player UCT is nearly identical to regular UCT. At the highest level of
the algorithm, the tree is repeatedly sampled until it is time to make an
action. The sampling process is illustrated in Figure 2. The only difference
between this code and a two-player implementation is that in line 5 the
average score for player p is used instead of a single average payoff for the
state."

I think this is a fairly clear statement that the original paper (and,
following it, a lot of writing on the topic) may be lacking. Of course people
used this simple generalization before, and it is pretty straightforward, but
it is not that obvious at first glance. And I've seen quite a lot of code
examples, images explaining UCT for games, and articles that just do not say a
word about this. Or even worse, that just do it wrong for multiplayer games.

Choice of action is a different topic; if I remember correctly, there was also
a paper proving that the highest win rate and the most robust branch end up
performing the same ;)

Hope you will continue this series, because it is really good and the code
examples are really nice!

------
CyberDildonics
sounds a lot like gaussian kd-trees

[http://graphics.stanford.edu/papers/gkdtrees/gkdtrees.pdf](http://graphics.stanford.edu/papers/gkdtrees/gkdtrees.pdf)

Instead of doing a brute force nearest neighbor search, samples are propagated
through a tree.

This technique is exotic, but not new.

