<p>A: As stated in the README's Dataset section, the test set is constructed from out-of-distribution data drawn from another dataset (the DetectRL benchmark, NeurIPS 2024). It is no longer limited to the news and academic domains or to the four generation models covered by the training set: the test data may introduce texts from unseen domains and unseen generation models, including different data-generation schemes, to stress-test the detector along multiple dimensions and comprehensively evaluate its real-world performance. LGT text in the test set may not follow the training set's scheme of continuing from the first 25% of tokens; even where continuation is used, the prefix tokens are cleaned, so overall quality is higher. The test set undergoes strict preprocessing: redundant formatting symbols such as "\n" are removed, and abrupt text truncation is avoided. However, texts are not forced to end with a period; natural forms such as byline signatures are preserved to test real-world robustness.</p>
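<p>The preprocessing rules described above (strip redundant "\n" markers, but do not force a trailing period or drop natural endings like bylines) might look roughly like this minimal sketch. The function name and the sample text are hypothetical, not taken from the actual pipeline:</p>

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical cleanup mirroring the stated rules: remove
    redundant formatting symbols such as "\\n", but leave the
    ending as-is (no period is appended, bylines survive)."""
    # Turn escaped "\n" markers left over from serialization into real newlines.
    text = text.replace("\\n", "\n")
    # Collapse all runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Breaking news:\\n\\nMarkets rallied today.\n\n-- Jane Doe, Reuters"))
# → Breaking news: Markets rallied today. -- Jane Doe, Reuters
```

<p>Note that the byline ending is kept verbatim and no period is appended, matching the robustness goal described above.</p>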