Benchmark

AlphaZero.Benchmark (Module)

Utilities to evaluate players against one another.

Typically, between each training iteration, different players that possibly depend on the current neural network compete against a set of baselines.


Evaluations

AlphaZero.Benchmark.run (Function)
Benchmark.run(env::Env, duel::Benchmark.Evaluation, progress=nothing)

Run a benchmark duel and return a Report.Evaluation.

If a progress meter is provided, next!(progress) is called after each simulated game.
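For instance, the progress of a duel can be tracked with a ProgressMeter.jl meter, as in the sketch below. Here, env and duel stand for a training environment and a benchmark evaluation assumed to have been built elsewhere, and the game count is arbitrary:

  using AlphaZero
  using ProgressMeter

  # env and duel are placeholders for an existing training environment and
  # benchmark evaluation; 100 is an assumed number of simulated games.
  progress = Progress(100)
  report = Benchmark.run(env, duel, progress)  # returns a Report.Evaluation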


Players

AlphaZero.Benchmark.Player (Type)
Benchmark.Player

Abstract type to specify a player that can be featured in a benchmark duel.

Subtypes must implement the following functions:

  • Benchmark.instantiate(player, nn): instantiate the player specification into an AbstractPlayer given a neural network
  • Benchmark.name(player): return a String describing the player
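As an illustration, here is a minimal sketch of a custom benchmark player wrapping the MinMax baseline documented below. The MinMaxTS name and its fields are illustrative and not part of the documented API:

  using AlphaZero

  # Hypothetical benchmark player specification wrapping the MinMax baseline.
  struct MinMaxTS <: Benchmark.Player
    depth :: Int
    τ :: Float64
  end

  # Instantiate the specification into an AbstractPlayer.
  # The network argument is ignored since minmax does not use it.
  Benchmark.instantiate(p::MinMaxTS, nn) =
    MinMax.Player(depth=p.depth, amplify_rewards=true, τ=p.τ)

  Benchmark.name(p::MinMaxTS) = "MinMax (depth $(p.depth))"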
AlphaZero.Benchmark.NetworkOnly (Type)
Benchmark.NetworkOnly(;τ=1.0) <: Benchmark.Player

Player that uses the policy output by the learnt network directly, instead of relying on MCTS.
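For instance, a raw-policy player with some exploration noise can be declared as follows (the temperature value is arbitrary):

  using AlphaZero

  # Benchmark player that plays from the raw network policy, without MCTS,
  # sampling moves with temperature 0.5.
  netonly = Benchmark.NetworkOnly(τ=0.5)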


Minmax Baseline

AlphaZero.MinMax.Player (Type)
MinMax.Player <: AbstractPlayer

A stochastic minmax player, to be used as a baseline.

MinMax.Player(;depth, amplify_rewards, τ=0.)

The minmax player explores the game tree exhaustively up to the given depth to build an estimate of the Q-value of each available action. It then chooses an action as follows:

  • If there are winning moves (with value Inf), one of them is picked uniformly at random.
  • If all moves are losing (with value -Inf), one of them is picked uniformly at random.

Otherwise,

  • If the temperature τ is zero, a move is picked uniformly among those with maximal Q-value (there is usually only one choice).
  • If the temperature τ is nonzero, the probability of choosing action $a$ is proportional to $e^{\frac{q_a}{C\tau}}$, where $q_a$ is the Q-value of action $a$ and $C$ is the maximum absolute value of all finite Q-values. This makes the decision invariant to any rescaling of GameInterface.heuristic_value (see the sketch below).

If the amplify_rewards option is set to true, every received positive reward is converted to $\infty$ and every negative reward is converted to $-\infty$.
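As a small illustration of the nonzero-temperature rule above, the function below (illustrative only, not part of the package) computes the move distribution from a vector of finite Q-values:

  # Sketch of the rule above: probabilities proportional to exp(q_a / (C * τ)),
  # where C is the maximum absolute value of the (finite) Q-values.
  function minmax_probs(qs::Vector{Float64}, τ::Float64)
    C = maximum(abs, qs)       # rescaling constant (assumes some nonzero Q-value)
    ws = exp.(qs ./ (C * τ))   # unnormalized weights
    return ws ./ sum(ws)       # normalized probabilities
  end

  minmax_probs([1.0, 0.5, -1.0], 0.2)  # strongly favors the first move

For instance, a baseline could be declared as MinMax.Player(depth=5, amplify_rewards=true, τ=0.2); the parameter values here are arbitrary.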
