Monte Carlo tree search

The Algorithm

Monte Carlo tree search is a non-deterministic algorithm that is guaranteed to converge to Minimax. This algorithm creates a tree of nodes, which look like

class Node {
    double wins = 0;
    uint count = 1;
    list<Node> children;

The general algorithm repeatedly

  1. Start at the root
  2. Recursively selects already explored children in a Magic way until it reaches an unexplored child. Add a node for the unexplored child. Update the count field in all these nodes
  3. Plays a sequence of random moves for both sides and returns the result back up the tree to the previously unexplored node. Do NOT create new nodes for this playthrough.
  4. Percolates the results up in a Magic way.
Each of these iterations is called a rollout. After performing a large number of these, the algorithm returns the action at the root that will lead to the child with the highest value-to-count ratio.

So, let's explain what these two pieces of Magic are. The first describes how you choose which child to go down. The procedure is

  1. If any of your children are unexpored, then explore them.
  2. Otherwise, randomly choose from your children, but make it so that the probably of choosing child $i$ is proportional to \begin{equation} \frac{w_i}{n_i} + c \cdot \sqrt{\frac{\ln{N}}{n_i}} \label{foo} \end{equation} where
    • $w_i$ is the wins of the child from the parent's point of view
    • $c$ is an empirically determined positive constant that indicates how important exploration is compared to explotation
    • $N$ is the number of times the parent has been visited
    • $n_i$ is the number of times the child has been visited - i.e. count
The second piece of Magic is how percolating the results back up the tree works. This is actually pretty straightforward:
  1. Simply pass the utility of the result (generaly 1.0 for a win, 0.5 for a tie, and 0 for a loss) back up the game until you get to the intial unexplored node.
  2. Simply add the result to this node's wins and all its ancestors.
Note, you should flip the value at every level, so that the utility is correct form the point-of-view of the parent.


Equation $\ref{foo}$ is based on an idea called upper-confidence bound, which basically says that you should explore the move with the highest plausible value.

While proving this is very difficult, it is a fact that using this function to choose your child will guarantee that this Monte Carlo approach will converge to Minimax as the number of rollouts approaches infinity.

To get a feel for while this is the case, consider a node with two children, $A$ and $B$, where $A$ always leads to a win and $B$ always leads to a loss. If we assume, for simplicity, that $c=0.2$, then the weight given to $A$ after a single exploration is 1.17, while $B$ has a weight of 0.17, so we'll choose $A$ 87% of the time.

If we explore $A$ 870 times and explore $B$ 130 times, these weights change to 1.02 and 0.05, meaning we'll choose $A$ 95% of the time. As we explore $A$ more and more (compared to $B$), the parent (which is essentially a weighted sum of its children) converges to $A$'s value.