In detail: normally, KataGo would record each position that was given a "full" search instead of a "fast" search (see "Playout Cap Randomization" here [KataGo's paper](https://arxiv.org/abs/1902.10565) for understanding "full" vs "fast" searches) with a "frequency weight" of 1 in the training data, i.e. writing it into the training data once. With Policy Surprise Weighting, instead, among all full-searched moves in the game, KataGo redistributes their frequency weights so that about half of the total frequency weight is assigned uniformly, giving a baseline frequency weight of 0.5, and the other half of the frequency weight is distributed proportionally to the KL-divergence from the policy prior to the policy training target for that move. Then, each position is written down in the training data `floor(frequency_weight)` many times, as well as an additional time with a probability of `frequency_weight - floor(frequency_weight)`. This results in "surprising" positions being written down much more often.
0 commit comments