
Commit 554c56e

Update KataGoMethods.md
1 parent cd0ed6c commit 554c56e


docs/KataGoMethods.md

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ KataGo overweights the frequency of training samples where the policy training t
 
 In detail: normally, KataGo would record each position that was given a "full" search instead of a "fast" search (see "Playout Cap Randomization" here [KataGo's paper](https://arxiv.org/abs/1902.10565) for understanding "full" vs "fast" searches) with a "frequency weight" of 1 in the training data, i.e. writing it into the training data once. With Policy Surprise Weighting, instead, among all full-searched moves in the game, KataGo redistributes their frequency weights so that about half of the total frequency weight is assigned uniformly, giving a baseline frequency weight of 0.5, and the other half of the frequency weight is distributed proportionally to the KL-divergence from the policy prior to the policy training target for that move. Then, each position is written down in the training data `floor(frequency_weight)` many times, as well as an additional time with a probability of `frequency_weight - floor(frequency_weight)`. This results in "surprising" positions being written down much more often.
 
-In KataGo, the method used is *not* like importance sampling where the position is seen more often but the gradient of the sample is scaled down proportionally to the increased frequency, to avoid bias. We simply sample the position more frequently, using full weight for the sample. The purpose of the policy is simply to suggest moves for MCTS exploration, and unlike a predictive classifier or stochastically sampling a distribution or other similar methods where having an unbiased output is good, biasing rare good moves upward and having them learned a bit more quickly seems fairly innocuous (and in the theoretical limit of optimal play, *any* policy distribution supported on the set of optimal moves is equally optimal).
+In KataGo, the method used is *not* like importance sampling where the position is seen more often and the gradient of the sample is scaled down proportionally to the increased frequency to avoid bias. We simply sample the position more frequently, using full weight for the sample without scaling. The purpose of the policy is simply to suggest moves for MCTS exploration, and unlike a predictive classifier or stochastically sampling a distribution or other similar methods where having an unbiased output is good, biasing rare good moves upward and having them learned a bit more quickly seems fairly innocuous (and in the theoretical limit of optimal play, *any* policy distribution supported on the set of optimal moves is equally optimal).
 
 Some additional minor notes:

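To make the scheme described in the changed passage concrete, here is a minimal Python sketch of how the frequency weights could be redistributed and the training rows written out. The function and field names (`surprise_frequency_weights`, `policy_target`, `policy_prior`, etc.) and the exact normalization are illustrative assumptions, not taken from KataGo's source; the sketch only mirrors the prose: about half the total weight spread uniformly, the other half proportional to KL-divergence, then stochastic rounding of each position's weight.

```python
import math
import random

def kl_divergence(target, prior, eps=1e-30):
    """KL(target || prior) between two move distributions over the same move set."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, prior) if t > 0)

def surprise_frequency_weights(positions, baseline=0.5):
    """Give every full-searched position a baseline weight of ~0.5 (about half the
    total weight, spread uniformly) and distribute the other half proportionally to
    KL(policy training target || policy prior) for that position."""
    n = len(positions)
    kls = [kl_divergence(p["policy_target"], p["policy_prior"]) for p in positions]
    total_kl = sum(kls)
    surprise_budget = (1.0 - baseline) * n  # the other ~half of the total weight
    return [
        baseline + (surprise_budget * kl / total_kl if total_kl > 0 else 1.0 - baseline)
        for kl in kls
    ]

def write_training_rows(positions, weights, rng=random):
    """Write each position floor(w) times, plus one extra time with probability
    w - floor(w). Rows are written at full weight; no importance-sampling style
    gradient rescaling is applied downstream."""
    rows = []
    for pos, w in zip(positions, weights):
        copies = int(math.floor(w))
        if rng.random() < w - copies:
            copies += 1
        rows.extend([pos] * copies)
    return rows

# Hypothetical usage: the second position's search result diverges sharply from the
# prior, so it receives most of the "surprise" weight and is written out more often.
positions = [
    {"policy_prior": [0.7, 0.2, 0.1], "policy_target": [0.68, 0.22, 0.10]},  # unsurprising
    {"policy_prior": [0.7, 0.2, 0.1], "policy_target": [0.05, 0.05, 0.90]},  # surprising
]
weights = surprise_frequency_weights(positions)
rows = write_training_rows(positions, weights)
```

As the reworded sentence in this diff emphasizes, the oversampled rows carry full weight: no importance-sampling correction scales their gradients back down.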