Training Reports

AlphaZero.Report - Module

Analytical reports generated during training, for debugging and hyperparameter tuning.

AlphaZero.Report.Perfs - Type
Report.Perfs

Performances report for a subroutine.

  • time: total time spent, in seconds
  • allocated: amount of memory allocated, in bytes
  • gc_time: total amount of time spent in the garbage collector
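For illustration, a performances report can be summarized by reading these three fields directly. The helper below is a minimal sketch: its name summarize_perfs is hypothetical (not part of the AlphaZero API), and perfs is assumed to be a Report.Perfs value produced during training.

    using Printf

    # Print a one-line summary of a Report.Perfs value (hypothetical helper).
    function summarize_perfs(perfs)
        @printf("time: %.2f s, allocated: %.1f MB, gc time: %.2f s\n",
                perfs.time, perfs.allocated / 1e6, perfs.gc_time)
    end
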

Self-Play Phase

AlphaZero.Report.SelfPlay - Type
Report.SelfPlay

Report generated after the self-play phase of an iteration.

  • samples_gen_speed: average number of samples generated per second
  • average_exploration_depth: see MCTS.average_exploration_depth
  • mcts_memory_footprint: estimation of the maximal memory footprint of the MCTS tree during self-play, as computed by MCTS.approximate_memory_footprint
  • memory_size: number of samples in the memory buffer at the end of the self-play phase
  • memory_num_distinct_boards: number of distinct board positions in the memory buffer at the end of the self-play phase
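As a minimal sketch (the helper name is hypothetical and sp is assumed to be a Report.SelfPlay value from one training iteration), the fields above can be inspected as follows:

    # Print the main self-play statistics from a Report.SelfPlay value.
    function summarize_self_play(sp)
        println("samples generated per second: ", round(sp.samples_gen_speed, digits=1))
        println("average exploration depth: ", round(sp.average_exploration_depth, digits=2))
        println("MCTS memory footprint (bytes): ", sp.mcts_memory_footprint)
        println("memory buffer: ", sp.memory_size, " samples, ",
                sp.memory_num_distinct_boards, " distinct boards")
    end
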

Memory Analysis Phase

AlphaZero.Report.Memory - Type
Report.Memory

Report generated by the memory analysis phase of an iteration. It features statistics for

  • the whole memory buffer (all_samples::Report.Samples)
  • the samples collected during the last self-play iteration (latest_batch::Report.Samples)
  • the subsets of the memory buffer corresponding to different game stages (per_game_stage::Vector{Report.StageSamples})

See MemAnalysisParams.

AlphaZero.Report.Samples - Type
Report.Samples

Statistics about a set of samples, as collected during memory analysis.

  • num_samples: total number of samples
  • num_boards: number of distinct board positions
  • Wtot: total weight of the samples
  • status: statistics of the current network on the samples, as an object of type Report.LearningStatus
AlphaZero.Report.StageSamples - Type
Report.StageSamples

Statistics for the samples corresponding to a particular game stage, as collected during memory analysis.

The samples whose statistics are collected in the samples_stats field correspond to historical positions where the number of remaining moves until the end of the game was in the range defined by the min_remaining_length and max_remaining_length fields.

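Putting the three types above together, here is a minimal sketch of how a memory report might be inspected. The helper name is hypothetical and mem is assumed to be a Report.Memory value; only the documented field names are used.

    # Walk through a Report.Memory value: global statistics first, then
    # per-stage statistics, where each stage covers a range of remaining moves.
    function summarize_memory(mem)
        println("whole buffer: ", mem.all_samples.num_samples, " samples, ",
                mem.all_samples.num_boards, " distinct boards")
        println("latest batch: ", mem.latest_batch.num_samples, " samples")
        for stage in mem.per_game_stage
            stats = stage.samples_stats
            println(stage.min_remaining_length, "-", stage.max_remaining_length,
                    " moves left: ", stats.num_samples, " samples, total weight ",
                    stats.Wtot, ", loss ", stats.status.loss.L)
        end
    end
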

Learning Phase

AlphaZero.Report.Learning - Type
Report.Learning

Report generated at the end of the learning phase of an iteration.

  • time_convert, time_loss, time_train and time_eval are the amounts of time (in seconds) spent converting the samples, computing losses, performing gradient updates and evaluating checkpoints, respectively
  • initial_status: status before the learning phase, as an object of type Report.LearningStatus
  • losses: loss value on each minibatch
  • checkpoints: vector of Report.Checkpoint reports
  • nn_replaced: true if the best neural network was replaced
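As a minimal sketch (the helper name is hypothetical and lrep is assumed to be a Report.Learning value), the learning phase can be summarized from these fields:

    # Summarize a Report.Learning value: timing breakdown, loss trajectory,
    # number of checkpoints, and whether the best network was replaced.
    function summarize_learning(lrep)
        total = lrep.time_convert + lrep.time_loss + lrep.time_train + lrep.time_eval
        println("learning phase time: ", round(total, digits=1), " s")
        println("initial loss: ", lrep.initial_status.loss.L)
        println("last minibatch loss: ", lrep.losses[end], " over ",
                length(lrep.losses), " minibatches")
        println("checkpoints: ", length(lrep.checkpoints),
                ", network replaced: ", lrep.nn_replaced)
    end
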
AlphaZero.Report.Checkpoint - Type
Report.Checkpoint

Report generated after a checkpoint evaluation.

  • batch_id: number of batches after which the checkpoint was computed
  • evaluation: evaluation report from the arena, of type Report.Evaluation
  • status_after: learning status at the checkpoint, as an object of type Report.LearningStatus
  • nn_replaced: true if the current best neural network was updated after the checkpoint
AlphaZero.Report.LearningStatus - Type
Report.LearningStatus

Statistics about the performance of the neural network on a subset of the memory buffer.

  • loss: detailed loss on the samples, as an object of type Report.Loss
  • Hp: average entropy of the π component of samples (MCTS policy); this quantity is independent of the network and therefore constant during a learning iteration
  • Hpnet: average entropy of the network's prescribed policy on the samples
AlphaZero.Report.Loss - Type
Report.Loss

Decomposition of the loss in a sum of terms (all have type Float32).

  • L is the total loss: L == Lp + Lv + Lreg + Linv
  • Lp is the policy cross-entropy loss term
  • Lv is the average value mean square error
  • Lreg is the L2 regularization loss term
  • Linv is the loss term penalizing the average weight put by the network on invalid actions
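For example, a Report.LearningStatus value can be checked against the decomposition above. The helper name is hypothetical and st is assumed to be such a value:

    # Inspect a Report.LearningStatus value: check that the total loss matches
    # the sum of its terms and compare the MCTS and network policy entropies.
    function inspect_status(st)
        l = st.loss
        println("decomposition holds: ", l.L ≈ l.Lp + l.Lv + l.Lreg + l.Linv)
        println("total loss: ", l.L, " (policy ", l.Lp, ", value ", l.Lv,
                ", reg ", l.Lreg, ", invalid ", l.Linv, ")")
        println("entropy gap (Hpnet - Hp): ", st.Hpnet - st.Hp)
    end
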

Evaluations and Benchmarks

AlphaZero.Report.Evaluation - Type
Report.Evaluation

The outcome of evaluating a player against a baseline player.

Two-player Games

  • rewards is the sequence of rewards collected by the evaluated player
  • avgr is the average reward collected by the evaluated player
  • baseline_rewards is nothing

Single-player Games

  • rewards is the sequence of rewards collected by the evaluated player
  • baseline_rewards is the sequence of rewards collected by the baseline player
  • avgr is equal to mean(rewards) - mean(baseline_rewards)

Common Fields

  • legend is a string describing the evaluation
  • redundancy is the ratio of duplicate positions encountered during the evaluation, not counting the initial position. If this number is too high, you may want to increase the move selection temperature.
  • time is the computing time spent running the evaluation, in seconds
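As a minimal sketch (the helper name is hypothetical and ev is assumed to be a Report.Evaluation value), the two-player and single-player conventions above translate into the following:

    using Statistics

    # Summarize a Report.Evaluation value, handling both the two-player case
    # (baseline_rewards is nothing) and the single-player case.
    function summarize_evaluation(ev)
        println(ev.legend)
        if isnothing(ev.baseline_rewards)
            # Two-player game: avgr is the average reward of the evaluated player.
            println("average reward: ", ev.avgr)
        else
            # Single-player game: avgr == mean(rewards) - mean(baseline_rewards).
            println("reward gap vs baseline: ",
                    mean(ev.rewards) - mean(ev.baseline_rewards))
        end
        println("redundancy: ", round(ev.redundancy, digits=3),
                ", time: ", round(ev.time, digits=1), " s")
    end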