(i.e. does MCTS handle breadth better than depth?)
I understand the MCTS rollout becomes difficult with 1000s of moves ahead. But we can "learn" a leaf evaluator, or "learn" to limit the branching factor at deeper levels, etc etc.
Are there key papers or conclusions on this issue?
Of course, because MCTS explores the tree in a depth-first manner, the deeper the tree the less effective it is: each rollout becomes more expensive, so you have time for fewer of them.
But not many board games (if we are limiting ourselves to this example) will have thousands of moves until the end of the game -- they may, however, have an exponentially large breadth of moves.
For a game like Go, MCTS could simulate thousands of rollouts in the amount of time it would take minimax to evaluate two plies ahead.
The Multi-Armed Bandit is a very simple, idealized model. You have many slot machines, each with a different probability of winning. Your ONLY way to estimate your future earnings is through experimentation: trying the slot machines one at a time.
The Multi-Armed Bandit problem then asks: what is the ideal strategy for searching through all of these slot machines?
UCT, despite being a very, very simple algorithm (it's just a sqrt and a few adds), is asymptotically optimal. There are better multi-armed bandit solvers out there, but UCT is incredibly fast to compute and ideal for the "inner-most loop" of AI problems. That's why it's used.
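To make "just a sqrt and a few adds" concrete, here is a minimal sketch of the UCB1 rule (the bandit policy underlying UCT) driving a toy three-machine bandit. The win probabilities, counts, and names here are made up for illustration, not from any particular paper or library:

```python
import math
import random

def ucb1_select(wins, visits, total_visits, c=math.sqrt(2)):
    """Pick the arm maximizing exploitation + exploration (the UCB1 score)."""
    best_arm, best_score = None, -1.0
    for arm in range(len(visits)):
        if visits[arm] == 0:
            return arm  # always try an untried arm first
        exploit = wins[arm] / visits[arm]  # observed win rate
        explore = c * math.sqrt(math.log(total_visits) / visits[arm])
        score = exploit + explore          # just a sqrt, a log, a few mults/adds
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm

# Toy bandit: three slot machines with hidden win probabilities.
random.seed(0)
probs = [0.3, 0.5, 0.7]
wins = [0.0] * 3
visits = [0] * 3
for t in range(10_000):
    arm = ucb1_select(wins, visits, t + 1)
    visits[arm] += 1
    wins[arm] += 1 if random.random() < probs[arm] else 0

print(visits)  # the 0.7 machine ends up with the vast majority of pulls
```

After enough pulls the best machine dominates the visit counts, while the worse machines still get the occasional exploratory pull -- exactly the behavior described below.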
Go (or really, any turn-based zero-sum game) can be seen as a multi-armed bandit problem.
Which moves will most likely result in a win for me? Well, you try a move by simulating it out, then record the result as a "win" or "loss" on that move's "slot machine" in the multi-armed bandit model. Then, you try again, and again, and again.
The two factors of the UCT algorithm are exploration (trying new slot machines), and exploitation (repeatedly trying the best slot machine).
The UCT analysis shows that "exploring" at a rate of approximately sqrt(total visits) is asymptotically optimal. After 10,000 "slot-machine pulls", you want to make sure that you have ~100 visits in each of your choices. (The exploration factor.)
Otherwise, you want to favor the slot machine that has won the most. (The exploitation factor.)
The sqrt(exploration) + exploitation factor is UCT in a nutshell.
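In MCTS terms, that sqrt(exploration) + exploitation sum is the tree policy used to pick which child node to descend into. A minimal sketch, assuming a hypothetical `Node` with `wins`/`visits`/`children` fields of my own naming (not any specific library's API):

```python
import math

class Node:
    """Bare-bones MCTS node: reward total, visit count, expanded children."""
    def __init__(self):
        self.wins = 0.0     # total reward backed up through this node
        self.visits = 0     # number of times this node was visited
        self.children = []  # filled in by the expansion step

def uct_child(node, c=math.sqrt(2)):
    """Tree policy: pick the child maximizing exploitation + exploration."""
    def score(child):
        if child.visits == 0:
            return float("inf")  # force at least one visit per child
        exploit = child.wins / child.visits                              # win rate
        explore = c * math.sqrt(math.log(node.visits) / child.visits)    # sqrt bonus
        return exploit + explore
    return max(node.children, key=score)
```

The exploration term shrinks as a child accumulates visits and grows (slowly) as its siblings are visited, which is what drives the "~sqrt of total visits everywhere" behavior described above.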
At the end of the day, UCT is an asymptotically optimal "multi-armed bandit" solver composed of just a sqrt(), a few multiplies, and a few adds here and there. It's outrageously cheap to calculate but (when configured properly) explores the tree "asymptotically perfectly". (Like Heapsort: it's not the best sorting algorithm in practice, but it's asymptotically optimal, and thus useful.)
Anyway, this blog post is fine and all. But without explaining the 'Q' or 'U' formula, or even saying what they mean (exploitation factor vs. exploration factor), it's certainly not a "deep dive".
And there's no discussion of why you'd expect UCT to do well. Just a simple statement of why UCT is asymptotically optimal for the multi-armed bandit problem would go a long way toward the reader's understanding, IMO.