# Training Parameters

## General

`AlphaZero.Params` (Type)

`Params`

The AlphaZero training hyperparameters.

Parameter | Type | Default |
---|---|---|
`self_play` | `SelfPlayParams` | - |
`learning` | `LearningParams` | - |
`arena` | `Union{Nothing, ArenaParams}` | - |
`memory_analysis` | `Union{Nothing, MemAnalysisParams}` | `nothing` |
`num_iters` | `Int` | - |
`use_symmetries` | `Bool` | `false` |
`ternary_rewards` | `Bool` | `false` |
`mem_buffer_size` | `PLSchedule{Int}` | - |

**Explanation**

The AlphaZero training process consists of `num_iters` iterations. Each iteration can be decomposed into a self-play phase (see `SelfPlayParams`) and a learning phase (see `LearningParams`).

- `ternary_rewards`: set to `true` if the rewards issued by the game environment always belong to $\{-1, 0, 1\}$ so that the logging and profiling tools can take advantage of this property.
- `use_symmetries`: if set to `true`, board symmetries are used for data augmentation before learning.
- `mem_buffer_size`: size schedule of the memory buffer, in terms of number of samples. It is typical to start with a small memory buffer that is grown progressively so as to wash out the initial low-quality self-play data more quickly.
- `memory_analysis`: parameters for the memory analysis step that is performed at each iteration (see `MemAnalysisParams`), or `nothing` if no analysis is to be performed.

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper:

- About 5 million games of self-play are played across 200 iterations.
- The memory buffer contains 500K games, which amounts to about 100M samples since an average game of Go lasts about 200 turns.
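The two figures above follow from simple arithmetic. The snippet below is only a sanity check of those numbers; the variable names are illustrative and not part of AlphaZero.jl.

```julia
# Figures quoted above from the AlphaGo Zero paper.
total_games = 5_000_000      # games of self-play over the whole run
iterations = 200             # training iterations
buffer_games = 500_000       # games kept in the memory buffer
turns_per_game = 200         # rough length of a game of Go

# Games played per iteration (the role of `num_games` in `SelfPlayParams`).
games_per_iter = total_games ÷ iterations

# Approximate number of samples in the buffer, assuming one sample per turn.
buffer_samples = buffer_games * turns_per_game

println("games per iteration: ", games_per_iter)
println("buffer samples: ", buffer_samples)
```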

## Self-Play

`AlphaZero.SelfPlayParams` (Type)

`SelfPlayParams`

Parameters governing self-play.

Parameter | Type | Default |
---|---|---|
`mcts` | `MctsParams` | - |
`num_games` | `Int` | - |
`num_workers` | `Int` | - |
`use_gpu` | `Bool` | `false` |
`reset_mcts_every` | `Union{Int, Nothing}` | `1` |

**Explanation**

- The MCTS tree is reset every `reset_mcts_every` games (or never if `nothing` is passed).

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper, `num_games=25_000` (5 million games of self-play across 200 iterations).

## Learning

`AlphaZero.LearningParams` (Type)

`LearningParams`

Parameters governing the learning phase of a training iteration, where the neural network is updated to fit the data in the memory buffer.

Parameter | Type | Default |
---|---|---|
`use_gpu` | `Bool` | `false` |
`use_position_averaging` | `Bool` | `true` |
`samples_weighing_policy` | `SamplesWeighingPolicy` | - |
`optimiser` | `OptimiserSpec` | - |
`l2_regularization` | `Float32` | - |
`nonvalidity_penalty` | `Float32` | `1f0` |
`batch_size` | `Int` | - |
`loss_computation_batch_size` | `Int` | - |
`min_checkpoints_per_epoch` | `Float64` | - |
`max_batches_per_checkpoint` | `Int` | - |
`num_checkpoints` | `Int` | - |

**Description**

The neural network goes through `num_checkpoints` series of `n` updates using batches of size `batch_size` drawn from memory, where `n` is defined as follows:

`n = min(max_batches_per_checkpoint, ntotal ÷ min_checkpoints_per_epoch)`

with `ntotal` the total number of batches in memory. Between each series, the current network is evaluated against the best network so far (see `ArenaParams`).

- `nonvalidity_penalty` is the multiplicative constant of a loss term that corresponds to the average probability weight that the network puts on invalid actions.
- `batch_size` is the batch size used for gradient descent.
- `loss_computation_batch_size` is the batch size that is used to compute the loss between epochs.
- If `use_position_averaging` is set to `true`, samples in memory that correspond to the same board position are averaged together. The merged sample is reweighted according to `samples_weighing_policy`.
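The formula for `n` can be sketched as a small standalone function. The helper name `updates_per_checkpoint` and the way `ntotal` is derived from a sample count are assumptions made for illustration; this is not the library's implementation.

```julia
# Hypothetical helper (not part of AlphaZero.jl): number of updates `n`
# performed in each series, following the formula above.
function updates_per_checkpoint(; num_samples, batch_size,
                                  max_batches_per_checkpoint,
                                  min_checkpoints_per_epoch)
    # Assumption: `ntotal` counts full batches only.
    ntotal = num_samples ÷ batch_size
    # `÷` with a Float64 divisor returns an integral Float64, so convert to Int.
    per_epoch = Int(ntotal ÷ min_checkpoints_per_epoch)
    return min(max_batches_per_checkpoint, per_epoch)
end

# Example with made-up numbers: 100_000 samples in memory, batches of 32.
n = updates_per_checkpoint(num_samples=100_000, batch_size=32,
                           max_batches_per_checkpoint=2_000,
                           min_checkpoints_per_epoch=5.0)
println("updates per checkpoint: ", n)
```

With these values the cap `max_batches_per_checkpoint` is not reached, so `n` is set by `min_checkpoints_per_epoch`. Lowering that parameter to `1.0` makes the cap take over.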

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper:

- The batch size for gradient updates is $2048$.
- The L2 regularization parameter is set to $10^{-4}$.
- Checkpoints are produced every 1000 training steps, which corresponds to seeing about 20% of the samples in the memory buffer: $(1000 × 2048) / 10^7 ≈ 0.2$.
- It is unclear how many checkpoints are taken or how many training steps are performed in total.

`AlphaZero.SamplesWeighingPolicy` (Type)

`SamplesWeighingPolicy`

During self-play, early board positions are possibly encountered many times across several games. The corresponding samples can be merged together and given a weight $W$ that is a nondecreasing function of the number $n$ of merged samples:

- `CONSTANT_WEIGHT`: $W(n) = 1$
- `LOG_WEIGHT`: $W(n) = \log_2(n) + 1$
- `LINEAR_WEIGHT`: $W(n) = n$
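To compare how the three policies scale, the weights can be written as plain functions. The function names below are hypothetical stand-ins for illustration; in AlphaZero.jl the policies are values of `SamplesWeighingPolicy`.

```julia
# Hypothetical stand-ins for the three weighing policies listed above.
constant_weight(n) = 1.0          # W(n) = 1
log_weight(n) = log2(n) + 1       # W(n) = log2(n) + 1
linear_weight(n) = float(n)       # W(n) = n

# Print the weight given to a position merged from n samples.
for n in (1, 2, 8)
    println("n = ", n, ": constant = ", constant_weight(n),
            ", log = ", log_weight(n), ", linear = ", linear_weight(n))
end
```

All three agree at $n = 1$. They differ in how strongly frequently visited positions dominate the loss.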

## Arena

`AlphaZero.ArenaParams` (Type)

`ArenaParams`

Parameters that govern the evaluation process that compares the current neural network with the best one seen so far (which is used to generate data).

Parameter | Type | Default |
---|---|---|
`mcts` | `MctsParams` | - |
`num_games` | `Int` | - |
`num_workers` | `Int` | - |
`flip_probability` | `Float64` | `0.` |
`reset_mcts_every` | `Union{Nothing, Int}` | `1` |
`update_threshold` | `Float64` | - |

**Explanation**

- The two competing networks are instantiated into two MCTS players of parameter `mcts` and then play `num_games` games, switching color after each game.
- The evaluated network replaces the current best if its average collected reward is greater than or equal to `update_threshold`.
- The MCTS trees of both players are reset every `reset_mcts_every` games (or never if `nothing` is passed).
- To add randomization, before every game turn the game board is "flipped" according to a symmetric transformation with probability `flip_probability`.

**Remarks**

- See `necessary_samples` to make an informed choice for `num_games`.

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper, 400 games are played to evaluate a network and the `update_threshold` parameter is set to a value that corresponds to a 55% win rate.
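Since `update_threshold` is compared with an average reward rather than a win rate, a conversion is needed. The sketch below assumes that every game ends in a win (reward $+1$) or a loss (reward $-1$), with no draws; the helper name is hypothetical.

```julia
# Assumption: rewards are +1 for a win and -1 for a loss, and there are no draws.
# A win rate p then gives an average reward of p * 1 + (1 - p) * (-1) = 2p - 1.
win_rate_to_avg_reward(p) = 2p - 1

threshold = win_rate_to_avg_reward(0.55)
println("update_threshold for a 55% win rate: ", round(threshold, digits=3))
```

If draws are possible, the mapping between win rate and average reward is no longer one-to-one, so the threshold has to be chosen directly in terms of average reward.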

## Memory Analysis

`AlphaZero.MemAnalysisParams` (Type)

`MemAnalysisParams`

Parameters governing the analysis of the memory buffer (for debugging and profiling purposes).

Parameter | Type | Default |
---|---|---|
`num_game_stages` | `Int` | - |

**Explanation**

The memory analysis consists of partitioning the memory buffer into `num_game_stages` parts of equal size, according to the number of remaining moves until the end of the game for each sample. Then, the quality of the predictions of the current neural network is evaluated on each subset (see `Report.Memory`).

This is useful to get an idea of how the neural network performance varies depending on the game stage (typically, good value estimates for endgame board positions are available earlier in the training process than good values for middlegame positions).

## MCTS

`AlphaZero.MctsParams` (Type)

Parameters of an MCTS player.

Parameter | Type | Default |
---|---|---|
`num_iters_per_turn` | `Int` | - |
`gamma` | `Float64` | `1.` |
`cpuct` | `Float64` | `1.` |
`temperature` | `AbstractSchedule{Float64}` | `ConstSchedule(1.)` |
`dirichlet_noise_ϵ` | `Float64` | - |
`dirichlet_noise_α` | `Float64` | - |
`prior_temperature` | `Float64` | `1.` |

**Explanation**

An MCTS player picks an action as follows. Given a game state, it launches `num_iters_per_turn` MCTS iterations, with UCT exploration constant `cpuct`. Rewards are discounted using the `gamma` factor.

Then, an action is picked according to the distribution $π$ where $π_i ∝ n_i^{1/τ}$ with $n_i$ the number of times that the $i^{\text{th}}$ action was visited and $τ$ the `temperature` parameter.

It is typical to use a high value of the temperature parameter $τ$ during the first moves of a game to increase exploration and then switch to a small value. Therefore, `temperature` is an `AbstractSchedule`.

For information on parameters `cpuct`, `dirichlet_noise_ϵ`, `dirichlet_noise_α` and `prior_temperature`, see `MCTS.Env`.
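A minimal sketch of temperature-based action selection, assuming the AlphaGo Zero convention $π_i ∝ n_i^{1/τ}$, under which a high temperature flattens the distribution and a temperature near zero concentrates it on the most visited action. The function `visit_distribution` is hypothetical and is not the library's implementation.

```julia
# Hypothetical helper: turn MCTS visit counts into a distribution over actions
# using π_i ∝ n_i^(1/τ).
function visit_distribution(counts::AbstractVector{<:Real}, τ::Real)
    if τ == 0
        # Limit τ → 0: all mass goes to the most visited action(s).
        best = maximum(counts)
        probs = [c == best ? 1.0 : 0.0 for c in counts]
    else
        probs = Float64.(counts) .^ (1 / τ)
    end
    return probs ./ sum(probs)
end

counts = [1, 1, 2]
println("τ = 1:   ", visit_distribution(counts, 1.0))
println("τ = 0.5: ", visit_distribution(counts, 0.5))
println("τ = 0:   ", visit_distribution(counts, 0))
```

The special case for `τ == 0` avoids a division by zero and stands in for the "infinitesimal" temperature used late in a game.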

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper:

- The discount factor `gamma` is set to 1.
- The number of MCTS iterations per move is 1600, which corresponds to 0.4s of computation time.
- The temperature is set to 1 for the first 30 moves and then to an infinitesimal value.
- The $ϵ$ parameter for the Dirichlet noise is set to $0.25$ and the $α$ parameter to $0.03$, which is consistent with the heuristic of using $α = 10/n$ with $n$ the maximum number of possible moves, which is $19 × 19 + 1 = 362$ in the case of Go.
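The $α = 10/n$ heuristic from the last item can be checked numerically. The helper name `dirichlet_alpha` is an illustrative assumption, not a library function.

```julia
# Hypothetical helper implementing the heuristic α = 10 / n quoted above.
dirichlet_alpha(num_moves) = 10 / num_moves

# Go on a 19×19 board: one move per intersection, plus the pass move.
go_moves = 19 * 19 + 1
go_alpha = dirichlet_alpha(go_moves)
println("n = ", go_moves, ", α ≈ ", round(go_alpha, digits=3))
```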

## Utilities

`AlphaZero.necessary_samples` (Function)

`necessary_samples(ϵ, β) = log(1 / β) / (2 * ϵ^2)`

Compute the number of times $N$ that a random variable $X \sim \text{Ber}(p)$ has to be sampled so that if the empirical average of $X$ is greater than $1/2 + ϵ$, then $p > 1/2$ with probability at least $1-β$.

This bound is based on Hoeffding's inequality.
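The bound follows from requiring that the one-sided Hoeffding tail $\exp(-2Nϵ^2)$ be at most $β$, which gives $N ≥ \log(1/β) / (2ϵ^2)$. The snippet below redefines the one-line formula from the docstring so that it runs without loading AlphaZero.jl; the values of `ϵ` and `β` are chosen only for illustration.

```julia
# Same formula as in the docstring, redefined locally so the snippet is
# self-contained.
necessary_samples(ϵ, β) = log(1 / β) / (2 * ϵ^2)

# Illustrative values: empirical win rate above 55% (ϵ = 0.05),
# with 95% confidence (β = 0.05).
N = necessary_samples(0.05, 0.05)
println("games needed: ", ceil(Int, N))

# Halving ϵ multiplies the required number of samples by four.
println("ratio: ", necessary_samples(0.025, 0.05) / N)
```

Because $N$ grows as $1/ϵ^2$, telling apart two networks of nearly equal strength requires many more arena games than the tolerance on `β` alone would suggest.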

`AlphaZero.AbstractSchedule` (Type)

`AbstractSchedule{R}`

Abstract type for a parameter schedule, which represents a function from nonnegative integers to numbers of type `R`. Subtypes must implement the `getindex(s::AbstractSchedule, i::Int)` operator.

`AlphaZero.StepSchedule` (Type)

`StepSchedule{R} <: AbstractSchedule{R}`

Type for step function schedules.

**Constructor**

`StepSchedule(;start, change_at, values)`

Return a schedule that has initial value `start`. For all `i`, the schedule takes value `values[i]` at step `change_at[i]`.
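The step semantics can be illustrated with a standalone re-implementation. The type `MyStepSchedule` is hypothetical and is not AlphaZero.jl's code. It assumes that `change_at` is sorted in increasing order and that each value stays in effect until the next change point.

```julia
# Hypothetical re-implementation for illustration only.
struct MyStepSchedule{R}
    start::R
    change_at::Vector{Int}
    values::Vector{R}
end

# Assumption: `change_at` is sorted, and a value holds until the next change.
function Base.getindex(s::MyStepSchedule, i::Int)
    k = searchsortedlast(s.change_at, i)  # number of change points ≤ i
    return k == 0 ? s.start : s.values[k]
end

# Illustrative temperature schedule: 1.0 before step 30, then 0.0.
temp = MyStepSchedule(1.0, [30], [0.0])
println(temp[0], " ", temp[29], " ", temp[30])
```

Indexing with `temp[i]` mirrors the `getindex` interface required of `AbstractSchedule` subtypes.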

`AlphaZero.PLSchedule` (Type)

`PLSchedule{R} <: AbstractSchedule{R}`

Type for piecewise linear schedules.

**Constructors**

`PLSchedule(cst)`

Return a schedule with a constant value `cst`.

`PLSchedule(xs, ys)`

Return a piecewise linear schedule such that:

- For all `i`, `(xs[i], ys[i])` belongs to the schedule's graph.
- Before `xs[1]`, the schedule has value `ys[1]`.
- After `xs[end]`, the schedule has value `ys[end]`.
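The three properties above describe linear interpolation between the given points, clamped at both ends. The sketch below uses a hypothetical type `MyPLSchedule`, which is not AlphaZero.jl's code. It assumes that `xs` is strictly increasing and returns `Float64` values, whereas a `PLSchedule{Int}` would have to produce integers.

```julia
# Hypothetical re-implementation for illustration only.
struct MyPLSchedule
    xs::Vector{Int}
    ys::Vector{Float64}
end

# Assumption: `xs` is strictly increasing and has the same length as `ys`.
function Base.getindex(s::MyPLSchedule, i::Int)
    i <= s.xs[1] && return s.ys[1]        # constant before the first point
    i >= s.xs[end] && return s.ys[end]    # constant after the last point
    k = searchsortedlast(s.xs, i)         # segment with xs[k] ≤ i < xs[k+1]
    t = (i - s.xs[k]) / (s.xs[k+1] - s.xs[k])
    return (1 - t) * s.ys[k] + t * s.ys[k+1]
end

# Illustrative buffer size schedule: grow from 1000 to 5000 over 10 iterations.
buffer_size = MyPLSchedule([0, 10], [1000.0, 5000.0])
println(buffer_size[0], " ", buffer_size[5], " ", buffer_size[20])
```

A growing schedule of this shape matches the use described for `mem_buffer_size`, where the buffer starts small and is enlarged progressively.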

`AlphaZero.CyclicSchedule` (Function)

`CyclicSchedule(base, mid, term; n, xmid=0.45, xback=0.90)`

Return the `PLSchedule` that is typically used for cyclic learning rate scheduling.