A Deep Dive into Monte Carlo Tree Search (moderndescartes.com)
150 points by brilee 8 months ago | 6 comments

Nice write-up & demo. MCTS, and UCT in particular, is a hidden gem of symbolic AI that hasn't found its deserved recognition; I suspect because it's not as flashy as the successes of deep learning. I think we'll be seeing a lot more of it in the future, though, as it has proven to be a remarkably flexible general search algorithm with simple, tunable mechanics for balancing exploration and exploitation. It also works well in both classical search and adversarial search with minor modifications. A lesser-known example of its use in classical search: at the time DeepMind made waves with Deep Q-Learning's performance on Atari games, the state-of-the-art results (better than deep-Q results on most games) actually came from MCTS-UCB used for planning/policy learning.

It is true that AI researchers are drawn to DNNs like moths to a flame, but AlphaGo garnered quite a bit of attention, and it uses MCTS in a very interesting way.

it's wonderful in situations where the breadth of the search is impractically enormous but the depth is limited. like complex games :)

Is this some sort of "accepted wisdom"?

(i.e., does MCTS handle breadth better than depth?)

I understand that MCTS rollouts become difficult thousands of moves ahead. But we can "learn" a leaf evaluator, or "learn" to limit the branching factor at deeper levels, etc.

Are there key papers or conclusions on this issue?

the reasoning behind that thinking is simply that MCTS is comparable to DFS, while something like minimax is comparable to BFS: MCTS selectively deepens promising lines, whereas fixed-depth minimax has to cover every move at each ply.

Of course, because MCTS searches in a depth-first manner, the greater the depth, the less effective it is: each rollout becomes more expensive, so you have time for fewer of them.

But not many board games (if we are limiting ourselves to this example) will run thousands of moves until the end of the game -- they may, however, have an exponentially large breadth of moves at each turn.

For a game like Go, MCTS could simulate thousands of rollouts in the amount of time it would take minimax to evaluate two plies ahead.
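Back-of-the-envelope arithmetic behind that comparison, assuming Go's commonly cited branching factor of roughly 250 (the exact number varies by position):

```python
# Rough cost of full-width 2-ply minimax in Go, before any pruning.
b = 250                  # approximate branching factor of Go (a common estimate)
two_ply_leaves = b ** 2  # leaf positions a full-width 2-ply search must evaluate
print(two_ply_leaves)    # 62500
```

62,500 leaf evaluations just to look two moves ahead -- plenty of budget for MCTS to run thousands of full random playouts instead.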

I dunno if we can call this a "deep dive". It's a good introductory blog post on the subject, but without a discussion of multi-armed bandits, it's kind of impossible to fully understand UCT, or the Monte Carlo Tree Search built on top of it.

The multi-armed bandit is a very simple, idealized model. You have many slot machines, each with a different probability of paying out. Your ONLY way to estimate your future earnings is experimentation: trying the slot machines one at a time.

The Multi-Armed Bandit problem then asks: what is the ideal strategy for searching through all of these slot machines?

UCT, despite being a very, very simple algorithm (it's just a sqrt and a few adds), is asymptotically optimal: its regret grows only logarithmically in the number of pulls. There are better multi-armed bandit solvers out there, but UCT is incredibly fast to compute and ideal for the innermost loop of AI problems. That's why it's used.
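A minimal sketch of that bandit solver -- the UCB1 rule that UCT is built on. The two payout probabilities and the exploration constant sqrt(2) are made-up values for the demo:

```python
import math
import random

def ucb1(wins, visits, total_pulls, c=math.sqrt(2)):
    """UCB1 score for one machine: average payout plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try an untouched machine first
    return wins / visits + c * math.sqrt(math.log(total_pulls) / visits)

random.seed(0)
payout = [0.8, 0.2]              # hidden win probabilities (assumed for the demo)
wins, visits = [0, 0], [0, 0]
for t in range(1, 1001):
    # Pull the machine with the highest UCB1 score.
    arm = max(range(2), key=lambda a: ucb1(wins[a], visits[a], t))
    visits[arm] += 1
    wins[arm] += random.random() < payout[arm]
# The better machine ends up with the vast majority of the pulls,
# but the worse one is still sampled occasionally.
```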


Go (or really, any turn-based zero-sum game) can be seen as a multi-armed bandit problem.

Which moves will most likely result in a win for me? Well, you try a move by simulating it out, then record the result as a "win" or "loss" for that move, treating each move as a slot machine in the multi-armed bandit model. Then you try again, and again, and again.

The two factors of the UCT algorithm are exploration (trying new slot machines), and exploitation (repeatedly trying the best slot machine).

UCT's exploration bonus shrinks like sqrt(ln(total visits) / visits), which guarantees every choice keeps getting occasional pulls: even an arm that looks bad is revisited on the order of log(total pulls) times, enough to catch an unlucky early estimate. (Exploration factor.)

Otherwise, you want to favor the slot machine that has won the most. (Exploitation factor.)

Exploitation plus a sqrt exploration bonus: that's UCT in a nutshell.

At the end of the day, UCT is an asymptotically optimal multi-armed bandit solver composed of just a sqrt(), a few multiplies, and a few adds here and there. It's outrageously cheap to calculate but (when configured properly) explores the tree asymptotically optimally. (Like heapsort: it's not the fastest sorting algorithm in practice, but it's asymptotically optimal, and thus useful.)


Anyway, this blog post is fine and all. But without explaining the 'Q' and 'U' terms of the formula, or even saying what they mean (Q is the exploitation term, U the exploration bonus), it's certainly not a "deep dive".

And there's no discussion of why you'd expect UCT to do well. Even a simple statement of why UCT is asymptotically optimal for the multi-armed bandit problem would go a long way toward the reader's understanding, IMO.
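For reference, a sketch of those Q and U terms in plain-UCT form. The constant c = sqrt(2) is an assumption here, and AlphaGo's actual U term additionally folds in a policy-network prior -- this is just the textbook version:

```python
import math

def uct_score(child_wins, child_visits, parent_visits, c=math.sqrt(2)):
    """Plain UCT score for picking which child node to descend into."""
    # Q: exploitation -- average result of simulations through this child.
    q = child_wins / child_visits if child_visits else 0.0
    # U: exploration -- large for rarely visited children of busy parents.
    u = c * math.sqrt(math.log(parent_visits) / (1 + child_visits))
    return q + u
```

Note how an unvisited child's U term can outrank a well-visited child's Q: that's exactly the exploration/exploitation tension described above.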
