Training Parameters
General
AlphaZero.Params — Type
The AlphaZero training hyperparameters.
Parameter | Type | Default |
---|---|---|
self_play | SelfPlayParams | - |
learning | LearningParams | - |
arena | Union{Nothing, ArenaParams} | - |
memory_analysis | Union{Nothing, MemAnalysisParams} | nothing |
num_iters | Int | - |
use_symmetries | Bool | false |
ternary_rewards | Bool | false |
mem_buffer_size | PLSchedule{Int} | - |
Explanation
The AlphaZero training process consists of `num_iters` iterations. Each iteration can be decomposed into a self-play phase (see `SelfPlayParams`) and a learning phase (see `LearningParams`).
- `ternary_rewards`: set to `true` if the rewards issued by the game environment always belong to $\{-1, 0, 1\}$, so that the logging and profiling tools can take advantage of this property.
- `use_symmetries`: if set to `true`, board symmetries are used for data augmentation before learning.
- `mem_buffer_size`: size schedule of the memory buffer, in terms of number of samples. It is typical to start with a small memory buffer that is grown progressively so as to wash out the initial low-quality self-play data more quickly (see the example below).
- `memory_analysis`: parameters for the memory analysis step that is performed at each iteration (see `MemAnalysisParams`), or `nothing` if no analysis is to be performed.
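For instance, a growing memory buffer can be expressed with a `PLSchedule` (documented in the Utilities section below). The numbers here are purely illustrative, not recommendations:

```julia
using AlphaZero

# Start with 400K samples at iteration 0 and grow the buffer linearly
# to 1M samples at iteration 20 (it stays constant afterwards).
mem_buffer_size = PLSchedule([0, 20], [400_000, 1_000_000])
```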
AlphaGo Zero Parameters
In the original AlphaGo Zero paper:
- About 5 million games of self-play are played across 200 iterations.
- The memory buffer contains 500K games, which amounts to about 100M samples, as an average game of Go lasts about 200 turns.
Self-Play
AlphaZero.SelfPlayParams — Type
Parameters governing self-play.
Parameter | Type | Default |
---|---|---|
mcts | MctsParams | - |
sim | SimParams | - |
AlphaGo Zero Parameters
In the original AlphaGo Zero paper, `sim.num_games=25_000` (5 million games of self-play across 200 iterations).
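As an illustration, here is a hypothetical self-play configuration, assuming the keyword constructors suggested by the parameter tables in this document. All numeric values are made up and are not recommendations:

```julia
using AlphaZero

self_play = SelfPlayParams(
  sim=SimParams(
    num_games=4_000,   # games generated per iteration
    num_workers=128,   # simulation tasks per process
    batch_size=64,     # inference batch size (must not exceed num_workers)
    use_gpu=true),
  mcts=MctsParams(
    num_iters_per_turn=600,
    cpuct=2.0,
    dirichlet_noise_ϵ=0.25,
    dirichlet_noise_α=1.0))
```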
Learning
AlphaZero.LearningParams — Type
Parameters governing the learning phase of a training iteration, where the neural network is updated to fit the data in the memory buffer.
Parameter | Type | Default |
---|---|---|
use_gpu | Bool | false |
use_position_averaging | Bool | true |
samples_weighing_policy | SamplesWeighingPolicy | - |
optimiser | OptimiserSpec | - |
l2_regularization | Float32 | - |
rewards_renormalization | Float32 | 1f0 |
nonvalidity_penalty | Float32 | 1f0 |
batch_size | Int | - |
loss_computation_batch_size | Int | - |
min_checkpoints_per_epoch | Float64 | - |
max_batches_per_checkpoint | Int | - |
num_checkpoints | Int | - |
Description
The neural network goes through `num_checkpoints` series of `n` updates using batches of size `batch_size` drawn from memory, where `n` is defined as follows:

`n = min(max_batches_per_checkpoint, ntotal ÷ min_checkpoints_per_epoch)`

with `ntotal` the total number of batches in memory (a worked example is given after the list below). Between each series, the current network is evaluated against the best network so far (see `ArenaParams`).
- `nonvalidity_penalty` is the multiplicative constant of a loss term that corresponds to the average probability weight that the network puts on invalid actions.
- `batch_size` is the batch size used for gradient descent.
- `loss_computation_batch_size` is the batch size used to compute the loss between epochs.
- All rewards are divided by `rewards_renormalization` before the MSE loss is computed.
- If `use_position_averaging` is set to `true`, samples in memory that correspond to the same board position are averaged together. The merged sample is reweighted according to `samples_weighing_policy`.
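To make the checkpointing arithmetic concrete, here is the formula above evaluated on hypothetical numbers (not taken from the source):

```julia
# Hypothetical values: 5000 batches in memory, at least 4 checkpoints per
# epoch, at most 2000 batches between two checkpoints.
ntotal = 5_000
min_checkpoints_per_epoch = 4.0
max_batches_per_checkpoint = 2_000

n = min(max_batches_per_checkpoint, ntotal ÷ min_checkpoints_per_epoch)
# n == 1250.0: each of the `num_checkpoints` series performs 1250 updates
# before the network is evaluated against the current best one.
```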
AlphaGo Zero Parameters
In the original AlphaGo Zero paper:
- The batch size for gradient updates is $2048$.
- The L2 regularization parameter is set to $10^{-4}$.
- Checkpoints are produced every 1000 training steps, which corresponds to seeing about 20% of the samples in the memory buffer: $(1000 × 2048) / 10^7 ≈ 0.2$.
- It is unclear how many checkpoints are taken or how many training steps are performed in total.
AlphaZero.SamplesWeighingPolicy — Type
During self-play, early board positions are possibly encountered many times across several games. The corresponding samples can be merged together and given a weight $W$ that is a nondecreasing function of the number $n$ of merged samples:
- `CONSTANT_WEIGHT`: $W(n) = 1$
- `LOG_WEIGHT`: $W(n) = \log_2(n) + 1$
- `LINEAR_WEIGHT`: $W(n) = n$
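The three policies correspond to the following weight functions. This is an illustrative sketch of the stated formulas, not the library's implementation:

```julia
# W(n) for each policy, where n is the number of merged samples.
constant_weight(n) = 1.0
log_weight(n)      = log2(n) + 1
linear_weight(n)   = float(n)

log_weight(8)  # == 4.0: a position seen 8 times weighs as much as 4 fresh ones
```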
Arena
AlphaZero.ArenaParams — Type
Parameters governing the evaluation process that compares the current neural network with the best one seen so far, the latter being the one used to generate data.
Parameter | Type | Default |
---|---|---|
mcts | MctsParams | - |
sim | SimParams | - |
update_threshold | Float64 | - |
Explanation (two-player games)
- The two competing networks are instantiated into two MCTS players of parameter `mcts` and then play `sim.num_games` games.
- The evaluated network replaces the current best one if its average collected reward is greater than or equal to `update_threshold`.
Explanation (single-player games)
- The two competing networks play `sim.num_games` games each.
- The evaluated network replaces the current best one if its average collected reward exceeds that of the old one by at least `update_threshold` (see the sketch below).
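The two decision rules can be summarized as follows. This is an illustrative sketch of the stated semantics, not the library's code:

```julia
# Two-player games: compare the evaluated network's average reward
# against an absolute threshold.
replace_best_two_player(avg_reward, update_threshold) =
  avg_reward >= update_threshold

# Single-player games: compare the improvement over the current best
# network against the threshold.
replace_best_single_player(avg_new, avg_old, update_threshold) =
  avg_new - avg_old >= update_threshold
```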
Remarks
- See `necessary_samples` to make an informed choice for `sim.num_games`.
AlphaGo Zero Parameters
In the original AlphaGo Zero paper, 400 games are played to evaluate a network, and the `update_threshold` parameter is set to a value that corresponds to a 55% win rate.
Memory Analysis
AlphaZero.MemAnalysisParams — Type
Parameters governing the analysis of the memory buffer (for debugging and profiling purposes).
Parameter | Type | Default |
---|---|---|
num_game_stages | Int | - |
Explanation
The memory analysis consists of partitioning the memory buffer into `num_game_stages` parts of equal size, according to the number of remaining moves until the end of the game for each sample. Then, the quality of the predictions of the current neural network is evaluated on each subset (see `Report.Memory`).
This is useful to get an idea of how the neural network performance varies depending on the game stage (typically, good value estimates for endgame board positions are available earlier in the training process than good values for middlegame positions).
MCTS
AlphaZero.MctsParams — Type
Parameters of an MCTS player.
Parameter | Type | Default |
---|---|---|
num_iters_per_turn | Int | - |
gamma | Float64 | 1. |
cpuct | Float64 | 1. |
temperature | AbstractSchedule{Float64} | ConstSchedule(1.) |
dirichlet_noise_ϵ | Float64 | - |
dirichlet_noise_α | Float64 | - |
prior_temperature | Float64 | 1. |
Explanation
An MCTS player picks an action as follows. Given a game state, it launches `num_iters_per_turn` MCTS iterations, with UCT exploration constant `cpuct`. Rewards are discounted using the `gamma` factor.
Then, an action is picked according to the distribution $π$ where $π_i ∝ n_i^{1/τ}$, with $n_i$ the number of times the $i^{\text{th}}$ action was visited and $τ$ the `temperature` parameter.
It is typical to use a high value of the temperature parameter $τ$ during the first moves of a game to increase exploration, and then switch to a small value. Therefore, `temperature` is an `AbstractSchedule`.
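For example, an AlphaGo-Zero-style temperature schedule can be expressed with a `StepSchedule` (documented in the Utilities section below). The values here are illustrative:

```julia
using AlphaZero

# τ = 1 for the first 30 moves, then a small value for near-greedy play.
temperature = StepSchedule(start=1.0, change_at=[31], values=[0.2])
```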
For information on the parameters `cpuct`, `dirichlet_noise_ϵ`, `dirichlet_noise_α` and `prior_temperature`, see `MCTS.Env`.
AlphaGo Zero Parameters
In the original AlphaGo Zero paper:
- The discount factor `gamma` is set to 1.
- The number of MCTS iterations per move is 1600, which corresponds to 0.4s of computation time.
- The temperature is set to 1 for the first 30 moves and then to an infinitesimal value.
- The $ϵ$ parameter for the Dirichlet noise is set to $0.25$ and the $α$ parameter to $0.03$, which is consistent with the heuristic of using $α = 10/n$ with $n$ the maximum number of possible moves, which is $19 × 19 + 1 = 362$ in the case of Go.
Simulations
AlphaZero.SimParams — Type
Parameters for parallel game simulations.
These parameters are common to self-play data generation, neural network evaluation and benchmarking.
Parameter | Type | Default |
---|---|---|
num_games | Int | - |
num_workers | Int | - |
batch_size | Int | - |
use_gpu | Bool | false |
fill_batches | Bool | true |
flip_probability | Float64 | 0. |
reset_every | Union{Nothing, Int} | 1 |
alternate_colors | Bool | false |
Explanations
- On each machine (process), `num_workers` simulation tasks are spawned. Inference requests are processed by an inference server in batches of size `batch_size`. Note that we must have `batch_size <= num_workers`.
- If `fill_batches` is set to `true`, we make sure that batches sent to the neural network for inference have a constant size.
- Both players are reset (e.g. their MCTS trees are emptied) every `reset_every` games (or never, if `nothing` is passed).
- To add randomization, before every game turn, the game board is "flipped" according to a symmetric transformation with probability `flip_probability`.
- In the case of (symmetric) two-player games, if `alternate_colors` is set to `true`, then the colors of both players are swapped between each simulated game (an illustrative configuration follows).
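As an illustration, here is a hypothetical evaluation setup, assuming the keyword constructor suggested by the table above (all values are made up, not recommendations):

```julia
using AlphaZero

sim = SimParams(
  num_games=400,
  num_workers=64,
  batch_size=32,          # batch_size <= num_workers, as required above
  use_gpu=true,
  reset_every=1,          # empty both MCTS trees after every game
  flip_probability=0.,    # no random board flipping
  alternate_colors=true)  # swap colors between games (two-player games)
```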
Utilities
AlphaZero.necessary_samples — Function

necessary_samples(ϵ, β) = log(1 / β) / (2 * ϵ^2)
Compute the number of times $N$ that a random variable $X \sim \text{Ber}(p)$ has to be sampled so that, if the empirical average of $X$ is greater than $1/2 + ϵ$, then $p > 1/2$ with probability at least $1-β$.
This bound is based on Hoeffding's inequality.
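For example, to conclude with 95% confidence that a network winning 55% of its evaluation games is genuinely stronger than chance (ϵ = 0.05, β = 0.05), plugging into the formula above gives:

```julia
using AlphaZero

AlphaZero.necessary_samples(0.05, 0.05)  # == log(20) / 0.005 ≈ 599 games
```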
AlphaZero.AbstractSchedule — Type

AbstractSchedule{R}
Abstract type for a parameter schedule, which represents a function from nonnegative integers to numbers of type `R`. Subtypes must implement the `getindex(s::AbstractSchedule, i::Int)` operator.
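Here is a minimal custom schedule, assuming the interface stated above. The exponential-decay type is hypothetical, not part of the library:

```julia
using AlphaZero

# A schedule whose value decays geometrically with the step number.
struct ExpDecaySchedule <: AlphaZero.AbstractSchedule{Float64}
  initial :: Float64
  rate    :: Float64
end

Base.getindex(s::ExpDecaySchedule, i::Int) = s.initial * s.rate^i

s = ExpDecaySchedule(1.0, 0.9)
s[0], s[10]  # (1.0, ≈0.349)
```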
AlphaZero.StepSchedule — Type

StepSchedule{R} <: AbstractSchedule{R}
Type for step function schedules.
Constructor
StepSchedule(;start, change_at, values)
Return a schedule with initial value `start`. For all `i`, the schedule takes value `values[i]` at step `change_at[i]`.
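For instance, with illustrative values:

```julia
using AlphaZero

s = StepSchedule(start=1.0, change_at=[10, 20], values=[0.5, 0.1])
s[0], s[10], s[25]  # (1.0, 0.5, 0.1)
```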
AlphaZero.PLSchedule — Type

PLSchedule{R} <: AbstractSchedule{R}
Type for piecewise linear schedules.
Constructors
PLSchedule(cst)
Return a schedule with constant value `cst`.
PLSchedule(xs, ys)
Return a piecewise linear schedule such that:
- For all `i`, `(xs[i], ys[i])` belongs to the schedule's graph.
- Before `xs[1]`, the schedule has value `ys[1]`.
- After `xs[end]`, the schedule has value `ys[end]`.
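For instance, the following schedule interpolates linearly between the given points (illustrative values):

```julia
using AlphaZero

s = PLSchedule([0, 10], [1.0, 0.0])
s[0], s[5], s[20]  # (1.0, 0.5, 0.0): linear in between, constant after xs[end]
```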
AlphaZero.CyclicSchedule — Function

CyclicSchedule(base, mid, term; n, xmid=0.45, xback=0.90)
Return the `PLSchedule` that is typically used for cyclic learning-rate scheduling.
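A hypothetical use for cycling the learning rate over `n` training steps; the exact shape of the returned `PLSchedule` is governed by `xmid` and `xback`, and the values here are illustrative:

```julia
using AlphaZero

# Start at `base`, peak at `mid`, and finish at `term` over n = 100 steps.
lr = CyclicSchedule(1e-3, 1e-2, 1e-4, n=100)
```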