
A Deep Dive into Monte Carlo Tree Search - brilee
http://www.moderndescartes.com/essays/deep_dive_mcts/
======
cgearhart
Nice write-up & demo. MCTS (UCT in particular) is a hidden gem of symbolic AI
that hasn't gotten its deserved recognition; I suspect that's because it's not
as flashy as the successes of deep learning. I think we'll be seeing a lot
more of it in the future, though, as it has proven to be a remarkably flexible
general search algorithm with simple, tunable mechanics for balancing
exploration and exploitation. It also works well in both classical search and
adversarial search with minor modifications. A lesser-known example of its use
in classical search: at the time DeepMind made waves with Deep Q-Learning on
Atari games, the state-of-the-art results (better than DQN on most games)
actually came from MCTS-UCB for planning/policy learning.

~~~
2bitencryption
it's wonderful in situations where the breadth of the search is impractically
enormous but the depth is limited. like complex games :)

~~~
throwawa123412
Is this some sort of "accepted wisdom"?

(i.e. does MCTS handle breadth better than depth?)

I understand that MCTS rollouts become difficult with 1000s of moves ahead.
But we can "learn" a leaf evaluator, or "learn" to limit the branching at
deeper levels, etc.

Are there key papers or conclusions on this issue?

~~~
2bitencryption
the reasoning behind that thinking is that MCTS is comparable to DFS, while
something like minimax is comparable to BFS.

Of course, because MCTS searches in a depth-first manner, the deeper the game
tree, the less effective it becomes: each rollout is more expensive, so you
have time for fewer of them.

But not many board games (if we are limiting ourselves to this example) will
have thousands of moves until end of game -- they may, however, have an
exponentially large breadth of moves.

For a game like Go, MCTS could simulate thousands of rollouts in the amount of
time it would take minimax to evaluate two plies ahead.

------
dragontamer
I dunno if we can call this a "deep dive". It's a good introductory blog post
on the subject, but without a discussion of multi-armed bandits, it's kind of
impossible to fully understand UCT, or the Monte Carlo Tree Search built on
top of it.

The Multi-Armed Bandit is a very simple, idealized model. You have many slot
machines, each with a different probability of paying out. Your ONLY way to
estimate your future earnings is through experimentation: trying the slot
machines one at a time.

The Multi-Armed Bandit problem then asks: what is the ideal strategy for
searching through all of these slot machines?

UCT, despite being a very, very simple algorithm (it's just a sqrt and a few
adds), is asymptotically optimal. There are better multi-armed bandit solvers
out there, but UCT is incredibly fast to compute and ideal for the "inner-most
loop" of AI problems. That's why it's used.

---------

Go (or really, any turn-based zero-sum game) can be seen as a multi-armed
bandit problem.

Which moves will most likely result in a win for me? Well, you try a move by
simulating it out, then record the result as a "win" or "loss" on that move's
slot machine in the multi-armed bandit model. Then you try again, and again,
and again.

The two factors of the UCT algorithm are exploration (trying new slot
machines), and exploitation (repeatedly trying the best slot machine).

UCT shows that "exploring" at a rate on the order of sqrt(total visits) is
asymptotically optimal. After 10,000 "slot-machine pulls", you want to make
sure you have ~100 visits across each of your choices (the exploration
factor).

Otherwise, you want to favor the slot machine that has won the most (the
exploitation factor).

That sqrt exploration bonus plus the exploitation term is UCT in a nutshell.

At the end of the day, UCT is an asymptotically optimal "multi-armed bandit"
solver composed of just a sqrt(), a few multiplies, and a few adds here and
there. It's outrageously cheap to calculate but (when configured properly)
explores the tree "asymptotically perfectly". (Like heapsort: it's not the
best sorting algorithm, but it's asymptotically optimal, and thus useful.)

---------

Anyway, this blog post is fine and all. But without explaining the 'Q' or 'U'
formula, or even saying what they mean (the exploitation and exploration
factors, respectively), it's certainly not a "deep dive".

And there's no discussion of why you'd expect UCT to do well. Even a simple
statement of why UCT is asymptotically optimal for the multi-armed bandit
problem would go a long way toward the reader's understanding, IMO.

