Benchmark

AlphaZero.Benchmark (Module)

Utilities to evaluate players against one another.

Typically, between training iterations, several players, some of which may depend on the current neural network, compete against a set of baselines.


Duels

AlphaZero.Benchmark.Duel (Type)
Benchmark.Duel(player, baseline; num_games)

Specify a duel that consists of num_games games between player and baseline, each of which has type Benchmark.Player.

Optional keyword arguments

  • reset_every: if set, the MCTS tree is reset every reset_every games to avoid running out of memory
  • color_policy: has type ColorPolicy and defaults to ALTERNATE_COLORS
AlphaZero.Benchmark.DuelOutcome (Type)
Benchmark.DuelOutcome

The outcome of a duel between two players.

Fields

  • player and baseline are String fields containing the names of both players involved in the duel
  • avgr is the average reward collected by player
  • rewards is the sequence of rewards collected by player (one per game)
  • redundancy is the ratio of duplicate positions encountered during the evaluation, not counting the initial position. If this number is too high, you may want to increase the move selection temperature.
  • time is the computing time spent running the duel, in seconds
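
As an illustration of how avgr and redundancy relate to the raw game data, here is a minimal sketch. The helper names `average_reward` and `redundancy` are hypothetical, and the exact formula is an assumption: redundancy is taken here to be one minus the fraction of distinct positions among all visited positions, with the first position of each game skipped. The library may compute it differently.

```julia
# Hypothetical helpers, not part of AlphaZero.jl.

# Average of the per-game rewards collected by `player`.
average_reward(rewards) = sum(rewards) / length(rewards)

# Assumed definition: 1 - (#distinct positions) / (#visited positions),
# where the first (initial) position of every game is ignored.
function redundancy(games)
    visited = [pos for g in games for pos in g[2:end]]
    isempty(visited) && return 0.0
    return 1 - length(unique(visited)) / length(visited)
end

# Two toy games, each listed as its sequence of positions.
games = [["start", "a", "b"], ["start", "a", "c"]]
rewards = [1.0, -1.0, 1.0, 1.0]

println("avgr = ", average_reward(rewards))
println("redundancy = ", redundancy(games))
```

In this toy data, position "a" is reached in both games, so one of the four non-initial positions is a duplicate.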

Players

AlphaZero.Benchmark.Player (Type)
Benchmark.Player

Abstract type to specify a player that can be featured in a benchmark duel.

Subtypes must implement the following functions:

  • Benchmark.instantiate(player, nn): instantiate the player specification into an AbstractPlayer given a neural network
  • Benchmark.name(player): return a String describing the player
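
The two-function interface above can be sketched as follows. To keep the snippet runnable on its own, it defines a stand-in module `BenchmarkSketch` that mirrors only the names listed above; `RandomSpec` and `RandomPlayer` are hypothetical. With AlphaZero.jl loaded, one would presumably subtype Benchmark.Player and return an AbstractPlayer instead.

```julia
# Stand-in for the interface described above (an assumption for illustration,
# not the library's module).
module BenchmarkSketch
    abstract type Player end
    function instantiate end
    function name end
end

# Hypothetical player specification that ignores the network entirely.
struct RandomSpec <: BenchmarkSketch.Player end

# Hypothetical concrete player; in AlphaZero.jl this would be an AbstractPlayer.
struct RandomPlayer end

# Turn the specification into a concrete player, given a neural network.
BenchmarkSketch.instantiate(::RandomSpec, nn) = RandomPlayer()

# Return a String describing the player.
BenchmarkSketch.name(::RandomSpec) = "Random"

spec = RandomSpec()
println(BenchmarkSketch.name(spec))
```

Separating the specification (`RandomSpec`) from the instantiated player lets the same specification be re-instantiated with whichever network is current.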
AlphaZero.Benchmark.NetworkOnly (Type)
Benchmark.NetworkOnly(;τ=1.0) <: Benchmark.Player

Player that uses the policy output by the learnt network directly, instead of relying on MCTS.


Minmax Baseline

AlphaZero.MinMax.Player (Type)
MinMax.Player{Game} <: AbstractPlayer{Game}

A stochastic minmax player, to be used as a baseline.

MinMax.Player{Game}(;depth, amplify_rewards, τ=0.)

The minmax player explores the game tree exhaustively up to depth depth to build an estimate of the Q-value of each available action. Then, it chooses an action as follows:

  • If there are winning moves (with value Inf), one of them is picked uniformly at random.
  • If all moves are losing (with value -Inf), one of them is picked uniformly at random.

Otherwise,

  • If the temperature τ is zero, a move is picked uniformly at random among those with maximal Q-value (there is usually only one choice).
  • If the temperature τ is nonzero, the probability of choosing action $a$ is proportional to $e^{\frac{q_a}{Cτ}}$, where $q_a$ is the Q-value of action $a$ and $C$ is the maximum absolute value of all finite Q-values. Dividing by $C$ makes the decision invariant to rescaling of GameInterface.heuristic_value.

If the amplify_rewards option is set to true, every received positive reward is converted to $∞$ and every negative reward is converted to $-∞$.
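
The selection rule described above can be sketched as a function that maps Q-values to move probabilities. This is a minimal illustration and not the library's implementation: the names `selection_probabilities` and `amplify` are hypothetical, τ is assumed to be nonnegative, and the fallback to $C = 1$ when all finite Q-values are zero is an assumption the text does not cover.

```julia
# Hypothetical sketch of the rule above. `qs` holds one Q-value per available
# action; entries may be Inf (winning) or -Inf (losing).
function selection_probabilities(qs::AbstractVector{<:Real}, τ::Real)
    n = length(qs)
    uniform_over(mask) = mask ./ count(mask)
    winning = qs .== Inf
    # Winning moves exist: choose uniformly among them.
    any(winning) && return uniform_over(winning)
    # Every move is losing: choose uniformly among all moves.
    all(qs .== -Inf) && return fill(1 / n, n)
    # Zero temperature: choose uniformly among moves with maximal Q-value.
    iszero(τ) && return uniform_over(qs .== maximum(qs))
    # C is the largest absolute value among the finite Q-values.
    C = maximum(abs, filter(isfinite, qs))
    C = iszero(C) ? 1.0 : C  # assumption: avoid dividing by zero
    z = qs ./ (C * τ)
    # Subtracting the maximum leaves the probabilities unchanged and avoids
    # overflow; exp(-Inf) is 0, so losing moves get zero probability.
    weights = exp.(z .- maximum(z))
    return weights ./ sum(weights)
end

# Reward amplification as described for amplify_rewards: positive rewards
# become Inf, negative rewards become -Inf, and zero is left unchanged.
amplify(r::Real) = r > 0 ? Inf : r < 0 ? -Inf : float(r)

# Rescaling all finite Q-values by a constant does not change the result.
p = selection_probabilities([2.0, -2.0, -Inf], 1.0)
q = selection_probabilities([20.0, -20.0, -Inf], 1.0)
println(isapprox(p, q))
```

Because every finite Q-value is divided by $C$, the exponents lie in $[-1/τ, 1/τ]$ whatever the scale of the heuristic, which is what makes the two calls above agree.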
