Benchmark

AlphaZero.Benchmark — Module

Utilities to evaluate players against one another.

Typically, between each training iteration, different players that possibly depend on the current neural network compete against a set of baselines.

source

AlphaZero.Benchmark.Report — Type

Benchmark.Report = Vector{Benchmark.DuelOutcome}

A benchmark report is a vector of Benchmark.DuelOutcome objects.

source

Duels

AlphaZero.Benchmark.Duel — Type

Benchmark.Duel(player, baseline; num_games)

Specify a duel that consists in num_games games between player and baseline, each of them of type Benchmark.Player.

Optional keyword arguments

reset_every: if set, the MCTS tree is reset every reset_mcts_every games to avoid running out of memory
color_policy has type ColorPolicy and is ALTERNATE_COLORS by default

source

AlphaZero.Benchmark.DuelOutcome — Type

Benchmark.DuelOutcome

The outcome of a duel between two players.

Fields

player and baseline are String fields containing the names of both players involved in the duel
avgz is the average reward collected by player
redundancy is the ratio of duplicate positions encountered during the evaluation, not counting the initial position. If this number is too high, you may want to increase the move selection temperature.
rewards is a vector containing all rewards collected by player (one per game played)
time is the computing time spent running the duel, in seconds

source

AlphaZero.Benchmark.run — Function

Benchmark.run(env::Env, duel::Benchmark.Duel, progress=nothing)

Run a benchmark duel and return a Benchmark.DuelOutcome.

If a progress is provided, next!(progress) is called after each simulated game.

source

Players

AlphaZero.Benchmark.Player — Type

Benchmark.Player

Abstract type to specify a player that can be featured in a benchmark duel.

Subtypes must implement the following functions:

Benchmark.instantiate(player, nn): instantiate the player specification into an AbstractPlayer given a neural network
Benchmark.name(player): return a String describing the player

source

AlphaZero.Benchmark.Full — Type

Benchmark.Full(params) <: Benchmark.Player

Full AlphaZero player that combines MCTS with the learnt network.

Argument params has type MctsParams.

source

AlphaZero.Benchmark.NetworkOnly — Type

Benchmark.NetworkOnly(;use_gpu=true, τ=1.0) <: Benchmark.Player

Player that uses the policy output by the learnt network directly, instead of relying on MCTS.

source

AlphaZero.Benchmark.MctsRollouts — Type

Benchmark.MctsRollouts(params) <: Benchmark.Player

Pure MCTS baseline that uses rollouts to evaluate new positions.

Argument params has type MctsParams.

source

AlphaZero.Benchmark.MinMaxTS — Type

Benchmark.MinMaxTS(;depth, τ=0.) <: Benchmark.Player

Minmax baseline, which relies on MinMax.Player.

source

AlphaZero.Benchmark.Solver — Type

Benchmark.Solver(;ϵ) <: Benchmark.Player

Perfect solver that plays randomly with probability ϵ.

source

Minmax Baseline

AlphaZero.MinMax — Module

A simple implementation of the minmax tree search algorithm, to be used as a baseline against AlphaZero. Heuristic board values are provided by the GameInterface.heuristic_value function.

source

AlphaZero.MinMax.Player — Type

MinMax.Player{Game} <: AbstractPlayer{Game}

A stochastic minmax player, to be used as a baseline.

MinMax.Player{Game}(;depth, amplify_rewards, τ=0.)

The minmax player explores the game tree exhaustively at depth depth to build an estimate of the Q-value of each available action. Then, it chooses an action as follows:

If there are winning moves (with value Inf), one of them is picked uniformly at random.
If all moves are losing (with value -Inf), one of them is picked uniformly at random.

Otherwise,

If the temperature τ is zero, a move is picked uniformly among those with maximal Q-value (there is usually only one choice).
If the temperature τ is nonzero, the probability of choosing action $a$ is proportional to $e^{\frac{q_a}{Cτ}}$ where $q_a$ is the Q value of action $a$ and $C$ is the maximum absolute value of all finite Q values, making the decision invariant to rescaling of GameInterface.heuristic_value.

If the amplify_rewards option is set to true, every received positive reward is converted to $∞$ and every negative reward is converted to $-∞$.

source