# Monte Carlo tree search

### The Algorithm

*Monte Carlo tree search* is a randomized algorithm that is guaranteed to converge to the Minimax result in the limit. The algorithm builds a tree of nodes, each of which looks like

```cpp
class Node {
    double wins = 0;            // total utility accumulated from rollouts through this node
    unsigned int count = 1;     // number of times this node has been visited
    std::list<Node> children;   // children explored so far
};
```

The general algorithm repeatedly

- starts at the root;
- recursively selects already-explored children in a Magic way until it reaches an unexplored child, adds a node for that child, and updates the `count` field in all of these nodes;
- plays a sequence of random moves for both sides (a *rollout*) and passes the result back up the tree to the previously unexplored node, creating NO new nodes for this playthrough;
- percolates the results up in a Magic way.

After performing a large number of rollouts, the algorithm returns the action at the root that leads to the child with the highest value-to-count ratio.
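The steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the `game` interface (`moves`, `play`, `utility`) and the `OneShot` demo game are assumptions of mine, and scoring every node from the root player's point of view is a simplification that only works because the demo game has a single decision.

```python
import math
import random

class Node:
    # Python analogue of the Node class above, plus the game state, the move
    # that produced it, and a parent pointer so results can percolate up.
    def __init__(self, state, move=None, parent=None):
        self.state = state
        self.move = move
        self.parent = parent
        self.wins = 0.0
        self.count = 0
        self.children = []

def ucb_weight(child, parent_count, c):
    # Selection weight; this is the equation explained in the next section.
    return child.wins / child.count + c * math.sqrt(
        math.log(parent_count) / child.count)

def mcts(game, root_state, rollouts, c=0.2):
    root = Node(root_state)
    for _ in range(rollouts):
        node = root
        # Selection: while every child is already explored, descend by
        # choosing a child with probability proportional to its weight.
        while node.children and len(node.children) == len(game.moves(node.state)):
            # Tiny floor keeps random.choices happy if every weight is zero.
            weights = [ucb_weight(ch, node.count, c) + 1e-9 for ch in node.children]
            node = random.choices(node.children, weights=weights)[0]
        # Expansion: add a node for one unexplored child, if any remain.
        tried = {ch.move for ch in node.children}
        untried = [m for m in game.moves(node.state) if m not in tried]
        if untried:
            move = random.choice(untried)
            child = Node(game.play(node.state, move), move, node)
            node.children.append(child)
            node = child
        # Rollout: play random moves to the end; NO new nodes are created.
        state = node.state
        while game.moves(state):
            state = game.play(state, random.choice(game.moves(state)))
        # Backpropagation: update `count` and `wins` in the new node and all
        # of its ancestors (from the root player's point of view, for brevity).
        result = game.utility(state)
        while node is not None:
            node.count += 1
            node.wins += result
            node = node.parent
    # Return the root action leading to the highest value-to-count ratio.
    return max(root.children, key=lambda ch: ch.wins / ch.count).move

# Tiny hypothetical game to exercise the sketch: move "a" always wins and
# move "b" always loses, so MCTS should settle on "a".
class OneShot:
    def moves(self, state): return ["a", "b"] if state == "start" else []
    def play(self, state, move): return move
    def utility(self, state): return 1.0 if state == "a" else 0.0

best = mcts(OneShot(), "start", 200)   # → "a"
```

A real two-player implementation would flip the rollout result at alternating levels, since each node's `wins` is recorded from its parent's point of view.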

So, let's explain what these two pieces of Magic are. The first describes how you choose which child to go down. The procedure is

- If any of your children are unexplored, then explore them.
- Otherwise, randomly choose from your children, but make it so that the probability of choosing child $i$ is proportional to
\begin{equation}
\frac{w_i}{n_i} + c \cdot \sqrt{\frac{\ln{N}}{n_i}}
\label{foo}
\end{equation}
where

- $w_i$ is the `wins` of the child from the parent's point of view,
- $c$ is an empirically determined positive constant that indicates how important exploration is compared to exploitation,
- $N$ is the number of times the parent has been visited, and
- $n_i$ is the number of times the child has been visited, i.e. its `count`.
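A small sketch of this selection rule (the `(wins, count)` pair encoding and the function names are my own, assuming utilities in $[0, 1]$ so the weights are non-negative):

```python
import math
import random

def ucb_weight(wins, count, parent_count, c):
    # The expression from the equation above.
    return wins / count + c * math.sqrt(math.log(parent_count) / count)

def pick_child(stats, parent_count, c=0.2):
    # stats holds one (wins, count) pair per explored child; draw an index
    # with probability proportional to its weight.
    weights = [ucb_weight(w, n, parent_count, c) for w, n in stats]
    return random.choices(range(len(stats)), weights=weights)[0]
```

For two children visited once each (one win, one loss) with $c = 0.2$, the weights come out to about 1.17 and 0.17, so the winning child is drawn about 87% of the time.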
The second piece of Magic describes how the results percolate up:

- Simply pass the utility of the result (generally 1.0 for a win, 0.5 for a tie, and 0 for a loss) back up the tree until you get to the initially unexplored node.
- Simply add the result to this node's `wins` and to the `wins` of all its ancestors.
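The "parent's point of view" detail from the previous section matters here: in a two-player game, a win for one side is a loss for the other, so the utility flips at each level on the way up. A minimal sketch, assuming utilities in $[0, 1]$ (the parent pointer is my addition for walking up the tree):

```python
class Node:
    # Just the fields needed to demonstrate backpropagation.
    def __init__(self, parent=None):
        self.parent = parent
        self.wins = 0.0
        self.count = 0

def backpropagate(leaf, result):
    # `result` is the rollout's utility from the point of view of the player
    # whose move led to `leaf`. Flipping 1.0 <-> 0.0 at each level keeps every
    # node's `wins` expressed from its parent's point of view; a tie (0.5)
    # is unchanged by the flip.
    node = leaf
    while node is not None:
        node.count += 1
        node.wins += result
        result = 1.0 - result
        node = node.parent

# A three-node chain: a rollout below `leaf` came out as a win (1.0).
root = Node()
mid = Node(parent=root)
leaf = Node(parent=mid)
backpropagate(leaf, 1.0)
# leaf.wins == 1.0, mid.wins == 0.0, root.wins == 1.0; every count is now 1.
```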

### Explanation

Equation $\ref{foo}$ is based on an idea called the *upper confidence bound*, which says that you should explore the move with the highest plausible value.

While the proof is quite involved, it is a fact that using this function to choose children guarantees that this Monte Carlo approach converges to Minimax as the number of rollouts approaches infinity.

To get a feel for why this is the case, consider a node with two children, $A$ and $B$, where $A$ always leads to a win and $B$ always leads to a loss. If we assume, for simplicity, that $c = 0.2$, then the weight given to $A$ after each child has been explored once is 1.17, while $B$ has a weight of 0.17, so we'll choose $A$ 87% of the time.

If we explore $A$ 870 times and explore $B$ 130 times, these weights change to 1.02 and 0.05, meaning we'll choose $A$ 95% of the time. As we explore $A$ more and more (compared to $B$), the parent (which is essentially a weighted sum of its children) converges to $A$'s value.
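As a sanity check on these numbers, a quick sketch ($c = 0.2$ and the visit counts come straight from the example; `weight` is my name for the expression in the equation above):

```python
import math

def weight(wins, count, parent_count, c=0.2):
    # The selection weight from the equation above.
    return wins / count + c * math.sqrt(math.log(parent_count) / count)

# Each child explored once, so the parent has been visited twice.
a1 = weight(1.0, 1, 2)   # ≈ 1.17 (A always wins)
b1 = weight(0.0, 1, 2)   # ≈ 0.17 (B always loses)

# 870 visits to A and 130 to B: the parent has been visited 1000 times.
a2 = weight(870.0, 870, 1000)   # ≈ 1.02
b2 = weight(0.0, 130, 1000)     # ≈ 0.05
```

With the unrounded weights, the selection probabilities are $a_1/(a_1 + b_1) \approx 0.875$ and $a_2/(a_2 + b_2) \approx 0.957$, in line with the quoted 87% and 95%.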