Commit c215848

Integrate DataProto into the GRPO, Resolve Conflict
2 parents ff65924 + ddcb722 commit c215848

87 files changed: +6330, -1212 lines


README.md

Lines changed: 7 additions & 3 deletions
@@ -33,17 +33,20 @@
 
 ## News 📢
 
-* **2025.03.17 "Hands-on with single-node deployment of the full-strength DeepSeek-R1"** 🔥🔥🔥 PaddlePaddle framework 3.0 fully upgrades large-model inference deployment, supporting many mainstream LLMs; the full-strength DeepSeek-R1 now deploys on a single node with doubled throughput! Everyone is welcome to try it out of the box. A prize campaign is now open: complete the DeepSeek-R1-MTP single-node deployment task and submit a high-quality review blog to win a cash reward! 💰💰💰
-Sign-up [link](https://www.wjx.top/vm/OlzzmbG.aspx#), event details: https://github.com/PaddlePaddle/PaddleNLP/issues/10166 , reference docs: https://github.com/PaddlePaddle/PaddleNLP/issues/10157
+* **2025.04.29 PaddleNLP now supports the Qwen3 series**: Qwen3 models support two thinking modes and are pretrained on roughly 36 trillion tokens covering 119 languages and dialects. The series includes six dense models (Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B and Qwen3-0.6B) and weights for two MoE models (Qwen3-235B-A22B and Qwen3-30B-A3B).
 
 * **2025.03.12 [PaddleNLP v3.0 Beta4](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v3.0.0-beta4)**: full support for popular reasoning models such as DeepSeek V3/R1/R1-Distill and QwQ-32B. **The complete DeepSeek V3/R1 supports FP8, INT8 and 4-bit quantized inference as well as MTP speculative decoding.** Single-node FP8 inference outputs over **1000 tokens/s**; 4-bit inference exceeds **2100 tokens/s**! New inference-deployment images are released, with [one-click deployment](https://paddlenlp.readthedocs.io/zh/latest/llm/server/docs/general_model_inference.html) of popular models, and the inference deployment [documentation](https://paddlenlp.readthedocs.io/zh/latest/llm/docs/predict/index.html) has been fully refreshed. Our next-generation universal information extraction model PP-UIE is [newly released](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/application/information_extraction), supporting extraction over 8K-length inputs. Added LLM embedding training with INF-CL for very large batch sizes, and the [MergeKit](https://paddlenlp.readthedocs.io/zh/latest/llm/docs/mergekit.html) model-merging tool to mitigate the alignment tax. Low-resource training is fully optimized and runs smoothly on 16 GB GPUs.
 
-* **2025.03.06 PaddleNLP now supports the Qwen/QwQ-32B model**: with only 32B parameters, its mathematical reasoning, coding and general abilities rival DeepSeek-R1 with 671B parameters (37B activated). With the PaddleNLP 3.0 toolkit, multiple parallelism strategies are available for [fine-tuning](./llm/README.md), [high-performance inference and low-bit quantization](./llm/docs/predict/qwen.md) and [service deployment](./llm/server/README.md).
 
 * **2025.02.10 PaddleNLP now supports the DeepSeek-R1 series, [try it online](https://aistudio.baidu.com/projectdetail/8775758)**: built on the new PaddleNLP 3.0 toolkit, the DeepSeek-R1 series is fully supported. With advanced distributed training capabilities such as data parallelism, sharded data parallelism, tensor parallelism, pipeline parallelism and expert parallelism, combined with Paddle's unique column-sparse attention-mask representation, FlashMask, the DeepSeek-R1 series trains with significantly lower memory consumption while achieving excellent training performance.
 
 <details><summary> <b>Click to expand</b> </summary><div>
 
+* **2025.03.17 "Hands-on with single-node deployment of the full-strength DeepSeek-R1"** 🔥🔥🔥 PaddlePaddle framework 3.0 fully upgrades large-model inference deployment, supporting many mainstream LLMs; the full-strength DeepSeek-R1 now deploys on a single node with doubled throughput! Everyone is welcome to try it out of the box. A prize campaign is now open: complete the DeepSeek-R1-MTP single-node deployment task and submit a high-quality review blog to win a cash reward! 💰💰💰
+Sign-up [link](https://www.wjx.top/vm/OlzzmbG.aspx#), event details: https://github.com/PaddlePaddle/PaddleNLP/issues/10166 , reference docs: https://github.com/PaddlePaddle/PaddleNLP/issues/10157
+
+* **2025.03.06 PaddleNLP now supports the Qwen/QwQ-32B model**: with only 32B parameters, its mathematical reasoning, coding and general abilities rival DeepSeek-R1 with 671B parameters (37B activated). With the PaddleNLP 3.0 toolkit, multiple parallelism strategies are available for [fine-tuning](./llm/README.md), [high-performance inference and low-bit quantization](./llm/docs/predict/qwen.md) and [service deployment](./llm/server/README.md).
+
 * **2025.02.20 🔥🔥 "PP-UIE information extraction engine fully upgraded"**: strengthened zero-shot learning enables efficient cold start and transfer learning with few or even zero labeled samples, sharply reducing annotation cost; long-text support handles documents up to 8192 tokens, recognizing key information across paragraphs for a complete understanding; a fully customizable training and inference pipeline is provided, with training efficiency 1.8x that of LLaMA-Factory.
 On Feb 26 (Wednesday) at 19:00 we will walk through the new PP-UIE technical approach and its deployment features, advantages and tips. Registration link: https://www.wjx.top/vm/mBKC6pb.aspx?udsid=606418

@@ -119,6 +122,7 @@
 | [Qwen2.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-0.5B, Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-7B-Instruct-1M, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-14B-Instruct-1M, Qwen/Qwen2.5-32B, Qwen/Qwen2.5-32B-Instruct, Qwen/Qwen2.5-72B, Qwen/Qwen2.5-72B-Instruct |
 | [Qwen2.5-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-Math-1.5B, Qwen/Qwen2.5-Math-1.5B-Instruct, Qwen/Qwen2.5-Math-7B, Qwen/Qwen2.5-Math-7B-Instruct, Qwen/Qwen2.5-Math-72B, Qwen/Qwen2.5-Math-72B-Instruct, Qwen/Qwen2.5-Math-RM-72B |
 | [Qwen2.5-Coder](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-Coder-1.5B, Qwen/Qwen2.5-Coder-1.5B-Instruct, Qwen/Qwen2.5-Coder-7B, Qwen/Qwen2.5-Coder-7B-Instruct |
+| [Qwen3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen3-0.6B, Qwen/Qwen3-1.7B, Qwen/Qwen3-4B, Qwen/Qwen3-8B, Qwen/Qwen3-14B, Qwen/Qwen3-32B, Qwen/Qwen3-30B-A3B, Qwen/Qwen3-235B-A22B, Qwen/Qwen3-0.6B-Base, Qwen/Qwen3-1.7B-Base, Qwen/Qwen3-4B-Base, Qwen/Qwen3-8B-Base, Qwen/Qwen3-14B-Base, Qwen/Qwen3-30B-A3B-Base |
 | [QwQ](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/QwQ-32B, Qwen/QwQ-32B-Preview |
 | [Yuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/yuan/) | IEITYuan/Yuan2-2B, IEITYuan/Yuan2-51B, IEITYuan/Yuan2-102B |

csrc/gpu/all_reduce.cu

Lines changed: 2 additions & 0 deletions
@@ -72,12 +72,14 @@ void all_reduce(fptr_t _fa, paddle::Tensor& inp, paddle::Tensor& out,
         reinterpret_cast<half*>(out.data()), out.numel());
     break;
   }
+#if (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)
   case phi::DataType::BFLOAT16: {
     fa->allreduce<nv_bfloat16>(
         stream, reinterpret_cast<nv_bfloat16*>(reg_buffer),
         reinterpret_cast<nv_bfloat16*>(out.data()), out.numel());
     break;
   }
+#endif
   default:
     throw std::runtime_error(
         "custom allreduce only supports float32, float16 and bfloat16");

csrc/gpu/all_reduce.cuh

Lines changed: 1 addition & 1 deletion
@@ -98,7 +98,7 @@ DINLINE half& assign_add(half& a, half b) {
 }
 DINLINE float& assign_add(float& a, float b) { return a += b; }
 
-#if (__CUDA_ARCH__ >= 800 || !defined(__CUDA_ARCH__))
+#if (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)
 DINLINE float upcast_s(nv_bfloat16 val) { return __bfloat162float(val); }
 template <>
 DINLINE nv_bfloat16 downcast_s(float val) {

csrc/gpu/cpp_extensions.cu

Lines changed: 0 additions & 2 deletions
@@ -238,11 +238,9 @@ std::vector<paddle::Tensor> GetPaddingOffsetV2(const paddle::Tensor& input_ids,
 
 void SaveOutMmsg(const paddle::Tensor& x,
                  const paddle::Tensor& not_need_stop, // cpu
-                 const paddle::Tensor& msg_queue_id, // cpu
                  int64_t rank_id);
 
 void GetOutput(const paddle::Tensor& x,
-               const paddle::Tensor& msg_queue_id, // cpu
               int64_t rank_id,
               bool wait_flag);

csrc/gpu/get_output.cc

Lines changed: 16 additions & 4 deletions
@@ -20,21 +20,33 @@
 #include "paddle/extension.h"
 
 #define MAX_BSZ 512
+// #define GET_OUTPUT_DEBUG
 
 struct msgdata {
   long mtype;
   int mtext[MAX_BSZ + 2];  // stop_flag, bsz, tokens
 };
 
 void GetOutput(const paddle::Tensor& x,
-               const paddle::Tensor& msg_queue_id,
                int64_t rank_id,
                bool wait_flag) {
   if (rank_id > 0) return;
 
   static struct msgdata msg_rcv;
-  int queue_id_val = msg_queue_id.data<int>()[0];
-  static key_t key = ftok("./", queue_id_val);
+  int msg_queue_id = 1;
+  if (const char* inference_msg_queue_id_env_p =
+          std::getenv("INFERENCE_MSG_QUEUE_ID")) {
+    std::string inference_msg_queue_id_env_str(
+        inference_msg_queue_id_env_p);
+    int inference_msg_queue_id_from_env =
+        std::stoi(inference_msg_queue_id_env_str);
+#ifdef GET_OUTPUT_DEBUG
+    std::cout << "Your INFERENCE_MSG_QUEUE_ID is: "
+              << inference_msg_queue_id_from_env << std::endl;
+#endif
+    msg_queue_id = inference_msg_queue_id_from_env;
+  }
+  static key_t key = ftok("./", msg_queue_id);
 
   static int msgid = msgget(key, IPC_CREAT | 0666);
 
@@ -62,7 +74,7 @@ void GetOutput(const paddle::Tensor& x,
 }
 
 PD_BUILD_OP(get_output)
-    .Inputs({"x", "msg_queue_id"})
+    .Inputs({"x"})
     .Attrs({"rank_id: int64_t",
             "wait_flag: bool"})
     .Outputs({"x_out"})

csrc/gpu/multi_head_latent_attention.cu

Lines changed: 2 additions & 0 deletions
@@ -205,6 +205,7 @@ std::vector<paddle::Tensor> MultiHeadLatentAttention(
   meta_data.batch_size = cum_offsets.dims()[0];
 
   switch (query.dtype()) {
+#if (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)
     case paddle::DataType::BFLOAT16: {
       return MultiHeadLatentAttentionKernel<paddle::DataType::BFLOAT16>(
           meta_data,
@@ -253,6 +254,7 @@ std::vector<paddle::Tensor> MultiHeadLatentAttention(
           causal,
           speculate_decoder);
     }
+#endif
     case paddle::DataType::FLOAT16: {
       return MultiHeadLatentAttentionKernel<paddle::DataType::FLOAT16>(
           meta_data,

csrc/gpu/save_with_output_msg.cc

Lines changed: 21 additions & 4 deletions
@@ -20,6 +20,7 @@
 #include "paddle/extension.h"
 
 #define MAX_BSZ 512
+// #define SAVE_WITH_OUTPUT_DEBUG
 
 struct msgdata {
   long mtype;
@@ -28,16 +29,32 @@ struct msgdata {
 
 void SaveOutMmsg(const paddle::Tensor& x,
                  const paddle::Tensor& not_need_stop, // cpu
-                 const paddle::Tensor& msg_queue_id, // cpu
                  int64_t rank_id) {
   if (rank_id > 0) return;
   auto x_cpu = x.copy_to(paddle::CPUPlace(), false);
   int64_t *x_data = x_cpu.data<int64_t>();
   auto not_need_stop_data = not_need_stop.data<bool>()[0];
 
   static struct msgdata msg_sed;
-  int queue_id_val = msg_queue_id.data<int>()[0];
-  static key_t key = ftok("./", queue_id_val);
+  int msg_queue_id = 1;
+  if (const char* inference_msg_queue_id_env_p =
+          std::getenv("INFERENCE_MSG_QUEUE_ID")) {
+    std::string inference_msg_queue_id_env_str(
+        inference_msg_queue_id_env_p);
+    int inference_msg_queue_id_from_env =
+        std::stoi(inference_msg_queue_id_env_str);
+    msg_queue_id = inference_msg_queue_id_from_env;
+#ifdef SAVE_WITH_OUTPUT_DEBUG
+    std::cout << "Your INFERENCE_MSG_QUEUE_ID is: "
+              << inference_msg_queue_id_from_env << std::endl;
+#endif
+  } else {
+#ifdef SAVE_WITH_OUTPUT_DEBUG
+    std::cout << "Failed to get INFERENCE_MSG_QUEUE_ID from env, use default."
+              << std::endl;
+#endif
+  }
+  static key_t key = ftok("./", msg_queue_id);
   static int msgid = msgget(key, IPC_CREAT | 0666);
 
   msg_sed.mtype = 1;
@@ -54,7 +71,7 @@ void SaveOutMmsg(const paddle::Tensor& x,
 }
 
 PD_BUILD_OP(save_output)
-    .Inputs({"x", "not_need_stop", "msg_queue_id"})
+    .Inputs({"x", "not_need_stop"})
     .Attrs({"rank_id: int64_t"})
     .Outputs({"x_out"})
     .SetInplaceMap({{"x", "x_out"}})
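The sending side in SaveOutMmsg follows the same pattern. The sketch below is again a standalone illustration with placeholder token values, not the Paddle op itself; it shows how a message carrying the stop flag, batch size, and token ids would be pushed onto the queue that the receiver above reads:

```cpp
// Minimal sender sketch (assumptions: Linux, standalone program, placeholder values).
#include <cstdio>
#include <cstdlib>
#include <sys/ipc.h>
#include <sys/msg.h>

#define MAX_BSZ 512

struct msgdata {
  long mtype;
  int mtext[MAX_BSZ + 2];  // stop_flag, bsz, tokens
};

int main() {
  // Both sides must agree on the queue id so ftok("./", id) yields the same key.
  int msg_queue_id = 1;
  if (const char* env = std::getenv("INFERENCE_MSG_QUEUE_ID")) {
    msg_queue_id = std::atoi(env);
  }
  key_t key = ftok("./", msg_queue_id);
  int msgid = msgget(key, IPC_CREAT | 0666);
  if (msgid == -1) {
    std::perror("msgget");
    return 1;
  }

  msgdata msg_sed{};
  msg_sed.mtype = 1;
  msg_sed.mtext[0] = 1;    // not_need_stop flag (placeholder)
  msg_sed.mtext[1] = 2;    // batch size (placeholder)
  msg_sed.mtext[2] = 100;  // token id for sample 0 (placeholder)
  msg_sed.mtext[3] = 200;  // token id for sample 1 (placeholder)

  if (msgsnd(msgid, &msg_sed, sizeof(msg_sed.mtext), 0) == -1) {
    std::perror("msgsnd");
    return 1;
  }
  return 0;
}
```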

csrc/setup_cuda.py

Lines changed: 3 additions & 3 deletions
@@ -130,14 +130,11 @@ def get_gencode_flags():
         "./gpu/speculate_decoding_kernels/speculate_save_output.cc",
         "./gpu/speculate_decoding_kernels/speculate_get_output.cc",
         "./gpu/save_output_dygraph.cu",
-        "./gpu/cpp_extensions.cu",
         "./gpu/all_reduce.cu",
         "./gpu/quantization/per_token_group_quant.cu",
         "./gpu/quantization/per_tensor_quant_fp8.cu",
     ]
     sources += find_end_files("./gpu/speculate_decoding_kernels", ".cu")
-    sources += find_end_files("./gpu/moe/fused_moe/cutlass_kernels/moe_gemm/", ".cu")
-    sources += find_end_files("./gpu/moe/fused_moe/", ".cu")
 
     nvcc_compile_args = gencode_flags
     update_git_submodule()
@@ -174,6 +171,9 @@ def get_gencode_flags():
 
     sources += find_end_files("./gpu/append_attn", ".cu")
     sources += find_end_files("./gpu/append_attn/template_instantiation", ".cu")
+    sources += find_end_files("./gpu/moe/fused_moe/cutlass_kernels/moe_gemm/", ".cu")
+    sources += find_end_files("./gpu/moe/fused_moe/", ".cu")
+    sources += "./gpu/cpp_extensions.cu",
 
 
 fp8_auto_gen_directory = "gpu/cutlass_kernels/fp8_gemm_fused/autogen"

csrc/tools/build_wheel.sh

Lines changed: 27 additions & 23 deletions
@@ -61,7 +61,7 @@ function generate_sm_version(){
         sm_versions=($SM_VERSION )
     elif [ "$ARCHITECTURE" = "all" ]; then
         if awk -v version="$cuda_version" 'BEGIN { exit !(version >= 12.0) }'; then
-            sm_versions=(70 75 80 80 86 89 90 )
+            sm_versions=(70 75 80 86 89 90 )
         else
             sm_versions=(70 75 80 86 89 )
         fi
@@ -72,10 +72,12 @@ function generate_sm_version(){
 }
 
 function create_directories(){
-    mkdir -p $OPS_SRC_DIR/tmp/paddlenlp_ops
-    touch $OPS_SRC_DIR/tmp/setup.py
-    touch $OPS_SRC_DIR/tmp/paddlenlp_ops/__init__.py
-    echo '# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+    for sm_version in "${sm_versions[@]}"; do
+        echo "create sm$sm_version"
+        mkdir -p $OPS_SRC_DIR/tmp/paddlenlp_ops
+        touch $OPS_SRC_DIR/tmp/setup.py
+        touch $OPS_SRC_DIR/tmp/paddlenlp_ops/__init__.py
+        echo '# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -93,6 +95,7 @@ function create_directories(){
 
 import os
 from datetime import datetime
+import paddle
 
 from setuptools import find_packages, setup
 
@@ -109,14 +112,19 @@ def read(file: str):
         content = f.read().strip()
     return content
 
-
 def read_version():
     """
     read version and return content
     """
     __version__ = "3.0.0b4.post"
+
     formatted_date = datetime.now().date().strftime("%Y%m%d")
-    __version__ = __version__.replace(".post", ".post{}".format(formatted_date))
+    cuda_version = float(paddle.version.cuda())
+    sm_version=80
+    paddle_commit = paddle.__git_commit__[:7]
+    build_tag = "{}+cuda{}sm{}paddle{}".format(formatted_date, cuda_version, sm_version, paddle_commit)
+
+    __version__ = __version__.replace(".post", ".post{}".format(build_tag))
 
     return __version__
 
@@ -184,12 +192,9 @@ try:
 except ImportError:
     logger.WARNING(f"No {module_name} ")
 ' > $OPS_SRC_DIR/tmp/paddlenlp_ops/__init__.py
-
-    for sm_version in "${sm_versions[@]}"; do
-        echo "create sm$sm_version"
-        mkdir -p $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}
-        touch $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}/__init__.py
-        echo '# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+        mkdir -p $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}
+        touch $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}/__init__.py
+        echo '# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -210,6 +215,7 @@ try:
 except ImportError:
     logger.WARNING("No paddlenlp_ops_'${sm_version}' ops")
 ' > $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}/__init__.py
+        build_ops
     done
 }
 
@@ -228,11 +234,11 @@ function init() {
 }
 
 function build_ops() {
-    for sm_version in "${sm_versions[@]}"; do
-        echo "Building and installing for sm_version: $sm_version"
-        build_and_install_ops $sm_version
-    done
-    return
+    echo "Building and installing for sm_version: $sm_version"
+    build_and_install_ops $sm_version
+    build_and_install_whl
+    unittest
+    cleanup
 }
 
 function copy_ops(){
@@ -269,6 +275,7 @@ function build_and_install_whl() {
     echo -e "${BLUE}[build]${NONE} building paddlenlp_ops wheel..."
     rm -rf ./dist
     cd ${TMP_DIR}
+    sed -i "s/sm_version=80/sm_version=${sm_version}/g" setup.py
     ${python} setup.py bdist_wheel --dist-dir ./$DIST_DIR
     if [ $? -ne 0 ]; then
         echo -e "${RED}[FAIL]${NONE} build paddlenlp_ops wheel failed !"
@@ -286,7 +293,8 @@ function build_and_install_whl() {
     fi
     echo -e "${BLUE}[install]${NONE} ${GREEN}paddlenlp_ops install success\n"
     cd ..
-    mv $DIST_DIR ../
+    mkdir -p ../$DIST_DIR
+    mv $DIST_DIR/* ../$DIST_DIR/
     cd ..
 }
 
@@ -321,10 +329,6 @@ trap 'abort' 0
 set -e
 
 init
-build_ops
-build_and_install_whl
-unittest
-cleanup
 
 # get Paddle version
 PADDLE_VERSION=`${python} -c "import paddle; print(paddle.version.full_version)"`

llm/alignment/rl/README.md

Lines changed: 9 additions & 1 deletion
@@ -13,11 +13,16 @@ REINFORCE++ is an improved version of the classic REINFORCE algorithm that fuses PPO's key ...
 ```shell
 git clone https://github.com/PaddlePaddle/PaddleNLP.git
 ```
-3. Install paddlenlp_ops; see PaddleNLP/csrc for installation instructions (required)
+3. Install the paddlenlp_ops inference operators; see PaddleNLP/csrc for installation instructions (required)
 ```shell
 cd your_PaddleNLP_path/csrc
 python setup_cuda.py install
 ```
+4. Install the fused_ln and fast_ln training operators; see PaddleNLP/slm/model_zoo/gpt-3/external_ops (required)
+```shell
+cd your_PaddleNLP_path/slm/model_zoo/gpt-3/external_ops
+python setup.py install
+```
 
 ## Supported models
 
@@ -165,6 +170,9 @@ export FLAGS_mla_use_tensorcore=0
 export FLAGS_cascade_attention_max_partition_size=2048
 
 python -u -m paddle.distributed.launch --devices "0,1,2,3" run_rl.py ../../config/qwen/grpo_argument.yaml
+
+# Training command for QWEN32B with 2k prompt + 30k response on 9 machines with 8x80G GPUs each:
+# python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_rl.py ../../config/qwen/grpo_32b_argument.yaml
 ```
 We provide a reproducible [wandb log](https://api.wandb.ai/links/junyu/5jiulhem) generated with the script above.
