
Commit 554c56e

Update KataGoMethods.md
1 parent cd0ed6c commit 554c56e


docs/KataGoMethods.md

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ KataGo overweights the frequency of training samples where the policy training t
 
 In detail: normally, KataGo would record each position that was given a "full" search instead of a "fast" search (see "Playout Cap Randomization" here [KataGo's paper](https://arxiv.org/abs/1902.10565) for understanding "full" vs "fast" searches) with a "frequency weight" of 1 in the training data, i.e. writing it into the training data once. With Policy Surprise Weighting, instead, among all full-searched moves in the game, KataGo redistributes their frequency weights so that about half of the total frequency weight is assigned uniformly, giving a baseline frequency weight of 0.5, and the other half of the frequency weight is distributed proportionally to the KL-divergence from the policy prior to the policy training target for that move. Then, each position is written down in the training data `floor(frequency_weight)` many times, as well as an additional time with a probability of `frequency_weight - floor(frequency_weight)`. This results in "surprising" positions being written down much more often.
 
-In KataGo, the method used is *not* like importance sampling where the position is seen more often but the gradient of the sample is scaled down proportionally to the increased frequency, to avoid bias. We simply sample the position more frequently, using full weight for the sample. The purpose of the policy is simply to suggest moves for MCTS exploration, and unlike a predictive classifier or stochastically sampling a distribution or other similar methods where having an unbiased output is good, biasing rare good moves upward and having them learned a bit more quickly seems fairly innocuous (and in the theoretical limit of optimal play, *any* policy distribution supported on the set of optimal moves is equally optimal).
+In KataGo, the method used is *not* like importance sampling where the position is seen more often and the gradient of the sample is scaled down proportionally to the increased frequency to avoid bias. We simply sample the position more frequently, using full weight for the sample without scaling. The purpose of the policy is simply to suggest moves for MCTS exploration, and unlike a predictive classifier or stochastically sampling a distribution or other similar methods where having an unbiased output is good, biasing rare good moves upward and having them learned a bit more quickly seems fairly innocuous (and in the theoretical limit of optimal play, *any* policy distribution supported on the set of optimal moves is equally optimal).
 
 Some additional minor notes:

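To make the scheme described in the changed passage concrete, here is a minimal Python sketch of how the frequency weights could be redistributed and the training rows written out. The function and field names (`surprise_frequency_weights`, `policy_target`, `policy_prior`, etc.) and the exact normalization are illustrative assumptions, not taken from KataGo's source; the sketch only mirrors the prose: about half the total weight spread uniformly, the other half proportional to KL-divergence, then stochastic rounding of each position's weight.

```python
import math
import random

def kl_divergence(target, prior, eps=1e-30):
    """KL(target || prior) between two move distributions over the same move set."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, prior) if t > 0)

def surprise_frequency_weights(positions, baseline=0.5):
    """Give every full-searched position a baseline weight of ~0.5 (about half the
    total weight, spread uniformly) and distribute the other half proportionally to
    KL(policy training target || policy prior) for that position."""
    n = len(positions)
    kls = [kl_divergence(p["policy_target"], p["policy_prior"]) for p in positions]
    total_kl = sum(kls)
    surprise_budget = (1.0 - baseline) * n  # the other ~half of the total weight
    return [
        baseline + (surprise_budget * kl / total_kl if total_kl > 0 else 1.0 - baseline)
        for kl in kls
    ]

def write_training_rows(positions, weights, rng=random):
    """Write each position floor(w) times, plus one extra time with probability
    w - floor(w). Rows are written at full weight; no importance-sampling style
    gradient rescaling is applied downstream."""
    rows = []
    for pos, w in zip(positions, weights):
        copies = int(math.floor(w))
        if rng.random() < w - copies:
            copies += 1
        rows.extend([pos] * copies)
    return rows

# Hypothetical usage: the second position's search result diverges sharply from the
# prior, so it receives most of the "surprise" weight and is written out more often.
positions = [
    {"policy_prior": [0.7, 0.2, 0.1], "policy_target": [0.68, 0.22, 0.10]},  # unsurprising
    {"policy_prior": [0.7, 0.2, 0.1], "policy_target": [0.05, 0.05, 0.90]},  # surprising
]
weights = surprise_frequency_weights(positions)
rows = write_training_rows(positions, weights)
```

As the reworded sentence in this diff emphasizes, the oversampled rows carry full weight: no importance-sampling correction scales their gradients back down.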