Benchmark
AlphaZero.Benchmark — Module
Utilities to evaluate players against one another.
Typically, between training iterations, several players (which may depend on the current neural network) compete against a set of baselines.
Evaluations
AlphaZero.Benchmark.Evaluation — Type
Evaluation
Abstract type for a benchmark item specification.
AlphaZero.Benchmark.Single — Type
Single <: Evaluation
Evaluating a single player in a one-player game.
AlphaZero.Benchmark.Duel — Type
Duel <: Evaluation
Evaluating a player by pitting it against a baseline player in a two-player game.
AlphaZero.Benchmark.run — Function
Benchmark.run(env::Env, duel::Benchmark.Evaluation, progress=nothing)
Run a benchmark duel and return a Report.Evaluation.
If a progress meter progress is provided, next!(progress) is called after each simulated game.
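For illustration, here is a minimal sketch of specifying and running a duel against a trained environment. The MctsParams field values and the Duel keyword arguments (such as num_games) are assumptions made for the example, not settings documented on this page.

    using AlphaZero

    # Illustrative MCTS parameters (values are assumptions, not recommendations).
    mcts = MctsParams(
      num_iters_per_turn=400,
      dirichlet_noise_ϵ=0.05,
      dirichlet_noise_α=1.0)

    # Pit the full AlphaZero player against a pure-MCTS baseline.
    duel = Benchmark.Duel(
      Benchmark.Full(mcts),
      Benchmark.MctsRollouts(mcts),
      num_games=100)  # assumed keyword argument

    # `env` is assumed to be a trained AlphaZero.Env.
    report = Benchmark.run(env, duel)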
Players
AlphaZero.Benchmark.Player — Type
Benchmark.Player
Abstract type to specify a player that can be featured in a benchmark duel.
Subtypes must implement the following functions:
- Benchmark.instantiate(player, nn): instantiate the player specification into an AbstractPlayer given a neural network
- Benchmark.name(player): return a String describing the player
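For example, a uniformly random baseline could be specified as in the following sketch, where RandomPlayer stands for any available AbstractPlayer implementation (its availability under this name is an assumption).

    using AlphaZero

    # Hypothetical specification for a baseline that plays uniformly at random.
    struct RandomBaseline <: Benchmark.Player end

    # The network argument is ignored since this baseline does not use it.
    Benchmark.instantiate(::RandomBaseline, nn) = RandomPlayer()
    Benchmark.name(::RandomBaseline) = "Random"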
AlphaZero.Benchmark.Full — Type
Benchmark.Full(params) <: Benchmark.Player
Full AlphaZero player that combines MCTS with the learnt network.
Argument params has type MctsParams.
AlphaZero.Benchmark.NetworkOnly — Type
Benchmark.NetworkOnly(;τ=1.0) <: Benchmark.Player
Player that uses the policy output by the learnt network directly, instead of relying on MCTS.
AlphaZero.Benchmark.MctsRollouts — Type
Benchmark.MctsRollouts(params) <: Benchmark.Player
Pure MCTS baseline that uses rollouts to evaluate new positions.
Argument params has type MctsParams.
AlphaZero.Benchmark.MinMaxTS — Type
Benchmark.MinMaxTS(;depth, τ=0.) <: Benchmark.Player
Minmax baseline, which relies on MinMax.Player.
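Putting these together, a benchmark suite might consist of a few duels against different baselines, as in this sketch (parameter values and the num_games keyword are illustrative assumptions).

    using AlphaZero

    mcts = MctsParams(
      num_iters_per_turn=600,
      dirichlet_noise_ϵ=0.05,
      dirichlet_noise_α=1.0)

    alphazero = Benchmark.Full(mcts)                # MCTS + learnt network
    netonly   = Benchmark.NetworkOnly(τ=0.5)        # raw network policy
    rollouts  = Benchmark.MctsRollouts(mcts)        # pure MCTS with rollouts
    minmax    = Benchmark.MinMaxTS(depth=5, τ=0.2)  # minmax tree search

    benchmark = [
      Benchmark.Duel(alphazero, rollouts, num_games=100),
      Benchmark.Duel(alphazero, minmax, num_games=100)]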
Minmax Baseline
AlphaZero.MinMax — Module
A simple implementation of the minmax tree search algorithm, to be used as a baseline against AlphaZero. Heuristic board values are provided by the GameInterface.heuristic_value function.
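Since heuristic board values come from GameInterface.heuristic_value, a game must overload this function for the minmax baseline to be meaningful. A sketch follows, with a hypothetical game type MyGame and score field:

    using AlphaZero: GameInterface
    const GI = GameInterface

    # Hypothetical game environment with a precomputed material score.
    struct MyGame
      score :: Float64  # positive when the current player is ahead
    end

    # Heuristic estimate of the position's value for the current player.
    GI.heuristic_value(g::MyGame) = g.score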
AlphaZero.MinMax.Player — Type
MinMax.Player <: AbstractPlayer
A stochastic minmax player, to be used as a baseline.
MinMax.Player(;depth, amplify_rewards, τ=0.)
The minmax player explores the game tree exhaustively at depth depth to build an estimate of the Q-value of each available action. Then, it chooses an action as follows:
- If there are winning moves (with value Inf), one of them is picked uniformly at random.
- If all moves are losing (with value -Inf), one of them is picked uniformly at random.
Otherwise,
- If the temperature τ is zero, a move is picked uniformly among those with maximal Q-value (there is usually only one choice).
- If the temperature τ is nonzero, the probability of choosing action $a$ is proportional to $e^{\frac{q_a}{Cτ}}$ where $q_a$ is the Q value of action $a$ and $C$ is the maximum absolute value of all finite Q values, making the decision invariant to rescaling of GameInterface.heuristic_value.
If the amplify_rewards option is set to true, every received positive reward is converted to $∞$ and every negative reward is converted to $-∞$.
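The temperature rule above amounts to a softmax over Q-values normalized by their maximum absolute value, which the following plain-Julia sketch makes explicit (it does not depend on the package API).

    # Action distribution for finite Q-values `qs` and temperature τ > 0:
    # p_a ∝ exp(q_a / (C * τ)) with C = maximum(abs, qs),
    # so rescaling all heuristic values leaves the distribution unchanged.
    function minmax_policy(qs::AbstractVector{<:Real}, τ::Real)
      C = maximum(abs, qs)
      C == 0 && return fill(1 / length(qs), length(qs))  # all values equal
      ps = exp.(qs ./ (C * τ))
      return ps ./ sum(ps)
    end

    minmax_policy([0.5, -0.2, 0.4], 0.5)  # puts most weight on the first action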