Benchmark
AlphaZero.Benchmark — Module
Utilities to evaluate players against one another.
Typically, between each training iteration, different players that possibly depend on the current neural network compete against a set of baselines.
Evaluations
AlphaZero.Benchmark.Evaluation — Type
Evaluation
Abstract type for a benchmark item specification.
AlphaZero.Benchmark.Single — Type
Single <: Evaluation
Evaluating a single player in a one-player game.
AlphaZero.Benchmark.Duel — Type
Duel <: Evaluation
Evaluating a player by pitting it against a baseline player in a two-player game.
AlphaZero.Benchmark.run — Function
Benchmark.run(env::Env, duel::Benchmark.Evaluation, progress=nothing)
Run a benchmark duel and return a Report.Evaluation.
If a progress is provided, next!(progress) is called after each simulated game.
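For illustration, a duel against a pure-MCTS baseline might be set up and run as in the following sketch. It is an assumption, not part of this section, that Benchmark.Duel accepts a num_games keyword argument and that MctsParams can be constructed this way; env is assumed to be an existing AlphaZero.Env, and the progress bar comes from ProgressMeter.

    using AlphaZero
    using ProgressMeter  # provides Progress and next!

    # Hypothetical MCTS parameters shared by both players (the exact fields of
    # MctsParams are an assumption here).
    mcts_params = MctsParams(num_iters_per_turn=400)

    # Pit the full AlphaZero player against a pure-MCTS baseline.
    # Assumption: Benchmark.Duel accepts a num_games keyword argument.
    duel = Benchmark.Duel(
      Benchmark.Full(mcts_params),
      Benchmark.MctsRollouts(mcts_params),
      num_games=100)

    # env is assumed to be an existing training environment (AlphaZero.Env).
    progress = Progress(100)
    report = Benchmark.run(env, duel, progress)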
Players
AlphaZero.Benchmark.Player — Type
Benchmark.Player
Abstract type to specify a player that can be featured in a benchmark duel.
Subtypes must implement the following functions:
- Benchmark.instantiate(player, nn): instantiate the player specification into an AbstractPlayer given a neural network
- Benchmark.name(player): return a String describing the player
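As a hedged example, a specification for a baseline that plays uniformly at random could look as follows. The struct name RandomBaseline is hypothetical, and it is an assumption that AlphaZero exposes a RandomPlayer <: AbstractPlayer with a zero-argument constructor.

    import AlphaZero
    import AlphaZero.Benchmark

    # Hypothetical specification: this player ignores the neural network.
    struct RandomBaseline <: Benchmark.Player end

    # Assumption: AlphaZero.RandomPlayer() builds a uniformly random AbstractPlayer.
    Benchmark.instantiate(::RandomBaseline, nn) = AlphaZero.RandomPlayer()
    Benchmark.name(::RandomBaseline) = "Random"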
AlphaZero.Benchmark.Full — Type
Benchmark.Full(params) <: Benchmark.Player
Full AlphaZero player that combines MCTS with the learnt network.
Argument params has type MctsParams.
AlphaZero.Benchmark.NetworkOnly — Type
Benchmark.NetworkOnly(;τ=1.0) <: Benchmark.Player
Player that uses the policy output by the learnt network directly, instead of relying on MCTS.
AlphaZero.Benchmark.MctsRollouts — Type
Benchmark.MctsRollouts(params) <: Benchmark.Player
Pure MCTS baseline that uses rollouts to evaluate new positions.
Argument params has type MctsParams.
AlphaZero.Benchmark.MinMaxTS — Type
Benchmark.MinMaxTS(;depth, τ=0.) <: Benchmark.Player
Minmax baseline, which relies on MinMax.Player.
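To make the four specifications above concrete, the sketch below builds one instance of each. Only constructors documented in this section are used; mcts_params is assumed to be a previously defined MctsParams value, and the specific argument values are illustrative.

    # One instance of each player specification documented above.
    # mcts_params::MctsParams is assumed to be defined elsewhere.
    players = Benchmark.Player[
      Benchmark.Full(mcts_params),          # MCTS guided by the learnt network
      Benchmark.NetworkOnly(τ=0.5),         # raw network policy, no tree search
      Benchmark.MctsRollouts(mcts_params),  # vanilla MCTS with random rollouts
      Benchmark.MinMaxTS(depth=5, τ=0.2)]   # stochastic minmax baseline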
Minmax Baseline
AlphaZero.MinMax — Module
A simple implementation of the minmax tree search algorithm, to be used as a baseline against AlphaZero. Heuristic board values are provided by the GameInterface.heuristic_value function.
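For a custom game, these heuristic values could be supplied by specializing GameInterface.heuristic_value, as in the sketch below. The MyGame type and its score field are hypothetical, and the convention that larger values favor the player to move is an assumption to be checked against the GameInterface documentation.

    import AlphaZero.GameInterface

    # Hypothetical game state carrying a precomputed evaluation.
    struct MyGame
      score::Float64
    end

    # Assumption: heuristic_value returns a finite score for nonterminal states,
    # with larger values better for the player to move.
    GameInterface.heuristic_value(g::MyGame) = g.score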
AlphaZero.MinMax.Player — Type
MinMax.Player <: AbstractPlayer
A stochastic minmax player, to be used as a baseline.
MinMax.Player(;depth, amplify_rewards, τ=0.)
The minmax player explores the game tree exhaustively at depth depth to build an estimate of the Q-value of each available action. Then, it chooses an action as follows:
- If there are winning moves (with value Inf), one of them is picked uniformly at random.
- If all moves are losing (with value -Inf), one of them is picked uniformly at random.
Otherwise,
- If the temperature τ is zero, a move is picked uniformly among those with maximal Q-value (there is usually only one choice).
- If the temperature τ is nonzero, the probability of choosing action $a$ is proportional to $e^{\frac{q_a}{Cτ}}$ where $q_a$ is the Q-value of action $a$ and $C$ is the maximum absolute value of all finite Q-values, making the decision invariant to rescaling of GameInterface.heuristic_value.
If the amplify_rewards option is set to true, every received positive reward is converted to $∞$ and every negative reward is converted to $-∞$.
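The nonzero-temperature rule above can be illustrated with a small standalone sketch. This is not the library's implementation: the function name minmax_probabilities is hypothetical and all Q-values are assumed finite.

    # p(a) ∝ exp(q_a / (C * τ)), where C is the maximum absolute finite Q-value.
    function minmax_probabilities(qs::Vector{Float64}, τ::Float64)
      C = maximum(abs.(qs))
      C == 0 && return fill(1 / length(qs), length(qs))  # all Q-values are zero
      ps = exp.(qs ./ (C * τ))
      return ps ./ sum(ps)
    end

    # Rescaling all Q-values by a constant leaves the distribution unchanged:
    qs = [0.2, -0.5, 0.1]
    minmax_probabilities(qs, 0.5) ≈ minmax_probabilities(10 .* qs, 0.5)  # true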