MCTS

AlphaZero.MCTS (Module)

A generic, standalone implementation of Monte Carlo Tree Search. It can be used on any game that implements GameInterface and with any external oracle.

Both a synchronous and an asynchronous version are implemented, and they share most of their code. When browsing the sources for the first time, we recommend that you study the synchronous version first.


Oracles

AlphaZero.MCTS.evaluate (Function)
MCTS.evaluate(oracle::Oracle, state)

Evaluate a single state from the current player's perspective.

Return a pair (P, V) where:

  • P is a probability vector on GI.available_actions(Game(state))
  • V is a scalar estimating the value or win probability for white.
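
As an illustration, a custom oracle can be defined by subtyping MCTS.Oracle and implementing this function. The sketch below is an assumption-laden example, not part of the library: `MyGame` stands for a hypothetical game type implementing GameInterface, and the oracle simply returns a uniform prior with a neutral value estimate.

```julia
# Sketch of a custom oracle. MyGame is a hypothetical game type implementing
# GameInterface; it is not part of this documentation.
struct UniformOracle <: MCTS.Oracle{MyGame} end

function MCTS.evaluate(::UniformOracle, state)
    actions = GI.available_actions(MyGame(state))
    P = fill(1 / length(actions), length(actions))  # uniform prior over legal actions
    V = 0.0                                         # neutral value estimate
    return P, V
end
```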
AlphaZero.MCTS.evaluate_batch (Function)
MCTS.evaluate_batch(oracle::Oracle, states)

Evaluate a batch of states.

Expect a vector of states and return a vector of (P, V) pairs.

A default implementation is provided that calls MCTS.evaluate sequentially on each position.
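
This default amounts to something like the following sketch (not the library's verbatim code):

```julia
# Sketch of the documented default behavior: evaluate each state sequentially.
evaluate_batch_default(oracle, states) = [MCTS.evaluate(oracle, state) for state in states]
```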

AlphaZero.MCTS.RolloutOracle (Type)
MCTS.RolloutOracle{Game}(γ=1.) <: MCTS.Oracle{Game}

This oracle estimates the value of a position by simulating a random game from it (a rollout). Moreover, it puts a uniform prior on available actions. Therefore, it can be used to implement the "vanilla" MCTS algorithm.
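
For example, such an oracle could be queried directly as sketched below. Here, `TicTacToe` stands for any game type implementing GameInterface and `state` for any valid state of that game; both are assumptions made for illustration only.

```julia
# Hypothetical usage sketch: query a rollout oracle on a game state.
oracle = MCTS.RolloutOracle{TicTacToe}()  # uniform prior, value estimated by a random rollout
P, V = MCTS.evaluate(oracle, state)       # P: prior over available actions, V: value estimate
```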


Environment

AlphaZero.MCTS.Env (Type)
MCTS.Env{Game}(oracle; <keyword args>) where Game

Create and initialize an MCTS environment with a given oracle.

Keyword Arguments

  • nworkers=1: number of asynchronous workers (see below)
  • fill_batches=false: if true, a constant batch size is enforced for evaluation requests, by completing batches with dummy entries if necessary
  • gamma=1.: the reward discount factor
  • cpuct=1.: exploration constant in the UCT formula
  • noise_ϵ=0., noise_α=1.: parameters for the Dirichlet exploration noise (see below)
  • prior_temperature=1.: temperature to apply to the oracle's output to get the prior probability vector used by MCTS.
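
For instance, a synchronous environment with a rollout oracle and a custom exploration constant might be created as sketched below (`TicTacToe` is again a stand-in for any game implementing GameInterface):

```julia
# Sketch: create a synchronous MCTS environment with custom parameters.
oracle = MCTS.RolloutOracle{TicTacToe}()
env = MCTS.Env{TicTacToe}(oracle; cpuct=2., noise_ϵ=0.25, noise_α=1.)
```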

Asynchronous MCTS

  • If nworkers == 1, MCTS is run in a synchronous fashion and the oracle is invoked through MCTS.evaluate.

  • If nworkers > 1, nworkers asynchronous workers are spawned, along with an additional task to serve state evaluation requests. Such requests are processed by batches of size nworkers using MCTS.evaluate_batch.
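
As a sketch, an asynchronous environment that batches evaluation requests could be created as follows; whether batching actually pays off depends on how the oracle implements MCTS.evaluate_batch. As before, `TicTacToe` and `oracle` are placeholders.

```julia
# Sketch: with nworkers > 1, state evaluations are served by batches of size
# nworkers through MCTS.evaluate_batch, so a batched oracle (e.g. a neural
# network) can amortize its evaluation cost.
env = MCTS.Env{TicTacToe}(oracle; nworkers=32, fill_batches=true)
```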

Dirichlet Noise

A naive way to ensure exploration during training is to adopt an ϵ-greedy policy: at every turn, with probability ϵ, a random move is played instead of the one prescribed by MCTS.policy. The problem with this naive strategy is that it may lead the player to make terrible moves at critical moments, thereby biasing the policy evaluation mechanism.

A superior alternative is to add a random bias to the neural prior for the root node during MCTS exploration: instead of considering the policy $p$ output by the neural network in the UCT formula, one uses $(1-ϵ)p + ϵη$ where $η$ is drawn once per call to MCTS.explore! from a Dirichlet distribution of parameter $α$.
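
Concretely, computing the noisy root prior amounts to the following sketch, which assumes the Distributions package for sampling and a prior vector `p` over the root's available actions:

```julia
using Distributions  # assumed here only to sample the Dirichlet noise

# Sketch: mix the root prior p with Dirichlet noise, as in the formula above.
function noisy_prior(p::Vector{Float64}, ϵ::Float64, α::Float64)
    η = rand(Dirichlet(length(p), α))  # one noise vector per call to MCTS.explore!
    return (1 - ϵ) .* p .+ ϵ .* η
end
```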


Profiling Utilities