Monte Carlo tree search
The Algorithm
Monte Carlo tree search is a non-deterministic algorithm that nevertheless converges to Minimax in the limit. The algorithm builds a tree of nodes that look like
```
class Node {
    double wins = 0;     // utility accumulated at this node, from the parent's point of view
    uint count = 1;      // number of times this node has been visited
    list<Node> children; // children explored so far
}
```
The general algorithm repeatedly
- Starts at the root.
- Recursively selects already-explored children in a Magic way until it reaches an unexplored child. Adds a node for the unexplored child, and updates the `count` field in all of these nodes.
- Plays a sequence of random moves for both sides and returns the result back up the tree to the previously unexplored node. Do NOT create new nodes for this playthrough.
- Percolates the result up the tree in a Magic way.
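The loop above can be sketched as follows (in Python, for concreteness). The two pieces of Magic are stubbed out with trivial placeholders here, and the fixed branching factor is purely an illustration assumption; they are not part of the real algorithm.

```python
import random

class Node:
    def __init__(self):
        self.wins = 0.0      # utility accumulated, from the parent's point of view
        self.count = 1       # number of visits (creation counts as the first)
        self.children = []

BRANCHING = 3  # hypothetical fixed branching factor, for illustration only

def iterate(root, select_child, rollout):
    """One iteration: select, expand, simulate, percolate."""
    # 1. Selection: descend through fully explored nodes.
    path = [root]
    node = root
    while len(node.children) == BRANCHING:   # all children explored
        node = select_child(node)
        path.append(node)
    # 2. Expansion: add a node for one unexplored child.
    child = Node()
    node.children.append(child)
    path.append(child)
    # 3. Simulation: a random playout; no nodes are created for it.
    result = rollout()
    # 4. Update the count field of every node reached during selection.
    #    (The real percolation rule also adds `result` to `wins`.)
    for n in path[:-1]:
        n.count += 1
    return result

# Trivial stand-ins for the two pieces of Magic:
def uniform_select(node):
    return random.choice(node.children)

def coin_flip_rollout():
    return random.choice([0.0, 0.5, 1.0])

root = Node()
for _ in range(100):
    iterate(root, uniform_select, coin_flip_rollout)
print(root.count)  # → 101: the root is on every selection path
```

The skeleton only shows the control flow; the interesting behavior comes entirely from how `select_child` and the percolation step are defined.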
So, let's explain what these two pieces of Magic are. The first describes how you choose which child to go down. The procedure is
- If any of your children are unexplored, then explore one of them.
- Otherwise, randomly choose from your children, but make the probability of choosing child $i$ proportional to
\begin{equation}
\frac{w_i}{n_i} + c \cdot \sqrt{\frac{\ln{N}}{n_i}}
\label{foo}
\end{equation}
where
- $w_i$ is the `wins` of the child from the parent's point of view
- $c$ is an empirically determined positive constant that indicates how important exploration is compared to exploitation
- $N$ is the number of times the parent has been visited
- $n_i$ is the number of times the child has been visited, i.e. its `count`
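This selection rule is short enough to write out directly (Python for concreteness). One assumption on my part: the parent's visit count $N$ is taken to be the sum of the children's counts, which matches the bookkeeping above when every visit to the parent descends into some child.

```python
import math
import random

def ucb_weight(w_i, n_i, N, c=0.2):
    """Selection weight: exploitation term w_i/n_i plus
    exploration term c * sqrt(ln N / n_i)."""
    return w_i / n_i + c * math.sqrt(math.log(N) / n_i)

def choose_child(children, c=0.2):
    """Randomly pick a child index with probability proportional
    to its weight. `children` is a list of (wins, count) pairs."""
    N = sum(n for _, n in children)  # assumed: parent visits = sum of child visits
    weights = [ucb_weight(w, n, N, c) for w, n in children]
    return random.choices(range(len(children)), weights=weights)[0]

# Child 0 won its single visit, child 1 lost its single visit;
# child 0 is chosen roughly 7 times out of 8.
i = choose_child([(1, 1), (0, 1)])
```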
The second piece of Magic describes how the result is percolated up:
- Simply pass the utility of the result (generally 1.0 for a win, 0.5 for a tie, and 0.0 for a loss) back up the game until you get to the initially unexplored node.
- Simply add the result to the `wins` field of this node and all of its ancestors.
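A minimal sketch of the percolation step, in Python. The flip $u \to 1 - u$ at each level is my own addition: it follows from `wins` being counted from the parent's point of view in a two-player win/tie/loss game, but the exact bookkeeping depends on the game.

```python
class Node:
    def __init__(self):
        self.wins = 0.0    # utility accumulated, from the parent's point of view
        self.count = 1
        self.children = []

def backpropagate(path, utility):
    """Add the rollout's utility to `wins` for the new node and all of
    its ancestors. `path` runs from the root down to the new node.

    Since `wins` is kept from the parent's point of view, a win (1.0)
    for the player to move at one level is a loss (0.0) one level up,
    so the utility flips at each step."""
    for node in reversed(path):
        node.wins += utility
        utility = 1.0 - utility

# A rollout from the new grandchild ends in a win (1.0) for its player:
path = [Node(), Node(), Node()]   # root -> child -> new grandchild
backpropagate(path, 1.0)
print([n.wins for n in path])     # → [1.0, 0.0, 1.0]
```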
Explanation
Equation $\ref{foo}$ is based on an idea called the upper confidence bound, which basically says that you should explore the move with the highest plausible value.
While proving this is very difficult, it is a fact that using this function to choose children guarantees that this Monte Carlo approach converges to Minimax as the number of rollouts approaches infinity.
To get a feel for why this is the case, consider a node with two children, $A$ and $B$, where $A$ always leads to a win and $B$ always leads to a loss. If we assume, for simplicity, that $c=0.2$, then the weight given to $A$ after a single exploration of each child is 1.17, while $B$ has a weight of 0.17, so we'll choose $A$ about 87% of the time.
If we explore $A$ 870 times and explore $B$ 130 times, these weights change to 1.02 and 0.05, meaning we'll choose $A$ about 96% of the time. As we explore $A$ more and more (compared to $B$), the parent (which is essentially a weighted average of its children) converges to $A$'s value.
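These numbers are easy to reproduce. The snippet below (Python) plugs the example's counts into the weight formula from the Algorithm section; the probabilities come out as 87.5% and 95.7% before rounding.

```python
import math

def weight(w, n, N, c=0.2):
    # exploitation term w/n plus exploration term c * sqrt(ln N / n)
    return w / n + c * math.sqrt(math.log(N) / n)

# After one visit to each child (N = 2):
a, b = weight(1, 1, 2), weight(0, 1, 2)
print(f"{a:.2f} {b:.2f} {100 * a / (a + b):.1f}%")  # → 1.17 0.17 87.5%

# After 870 visits to A (all wins) and 130 to B (all losses), N = 1000:
a, b = weight(870, 870, 1000), weight(0, 130, 1000)
print(f"{a:.2f} {b:.2f} {100 * a / (a + b):.1f}%")  # → 1.02 0.05 95.7%
```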