Commit c215848

Integrate DataProto into the GRPO, Resolve Conflict
2 parents ff65924 + ddcb722 commit c215848

87 files changed: +6330, -1212 lines


README.md

Lines changed: 7 additions & 3 deletions
@@ -33,17 +33,20 @@
 
 ## News 📢
 
-* **2025.03.17 "Hands-on with single-node deployment of the full-strength DeepSeek-R1"** 🔥🔥🔥 PaddlePaddle framework 3.0 fully upgrades large-model inference deployment, supporting many mainstream LLMs; the full-strength DeepSeek-R1 now deploys on a single node with doubled throughput! Everyone is welcome to try it out of the box. A prize campaign is now open: complete the DeepSeek-R1-MTP single-node deployment task and submit a high-quality review blog to win a cash reward! 💰💰💰
-Sign-up [link](https://www.wjx.top/vm/OlzzmbG.aspx#), event details: https://github.com/PaddlePaddle/PaddleNLP/issues/10166 , reference docs: https://github.com/PaddlePaddle/PaddleNLP/issues/10157
+* **2025.04.29 PaddleNLP now supports the Qwen3 series**: Qwen3 models support two thinking modes and are pretrained on roughly 36 trillion tokens covering 119 languages and dialects. The series includes six dense models (Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B and Qwen3-0.6B) and weights for two MoE models (Qwen3-235B-A22B and Qwen3-30B-A3B).
 
 * **2025.03.12 [PaddleNLP v3.0 Beta4](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v3.0.0-beta4)**: full support for popular reasoning models such as DeepSeek V3/R1/R1-Distill and QwQ-32B. **The complete DeepSeek V3/R1 supports FP8, INT8 and 4-bit quantized inference as well as MTP speculative decoding.** Single-node FP8 inference outputs over **1000 tokens/s**; 4-bit inference exceeds **2100 tokens/s**! New inference-deployment images are released, with [one-click deployment](https://paddlenlp.readthedocs.io/zh/latest/llm/server/docs/general_model_inference.html) of popular models, and the inference deployment [documentation](https://paddlenlp.readthedocs.io/zh/latest/llm/docs/predict/index.html) has been fully refreshed. Our next-generation universal information extraction model PP-UIE is [newly released](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/application/information_extraction), supporting extraction over 8K-length inputs. Added LLM embedding training with INF-CL for very large batch sizes, and the [MergeKit](https://paddlenlp.readthedocs.io/zh/latest/llm/docs/mergekit.html) model-merging tool to mitigate the alignment tax. Low-resource training is fully optimized and runs smoothly on 16 GB GPUs.
 
-* **2025.03.06 PaddleNLP now supports the Qwen/QwQ-32B model**: with only 32B parameters, its mathematical reasoning, coding and general abilities rival DeepSeek-R1 with 671B parameters (37B activated). With the PaddleNLP 3.0 toolkit, multiple parallelism strategies are available for [fine-tuning](./llm/README.md), [high-performance inference and low-bit quantization](./llm/docs/predict/qwen.md) and [service deployment](./llm/server/README.md).
 
 * **2025.02.10 PaddleNLP now supports the DeepSeek-R1 series, [try it online](https://aistudio.baidu.com/projectdetail/8775758)**: built on the new PaddleNLP 3.0 toolkit, the DeepSeek-R1 series is fully supported. With advanced distributed training capabilities such as data parallelism, sharded data parallelism, tensor parallelism, pipeline parallelism and expert parallelism, combined with Paddle's unique column-sparse attention-mask representation, FlashMask, the DeepSeek-R1 series trains with significantly lower memory consumption while achieving excellent training performance.
 
 <details><summary> <b>Click to expand</b> </summary><div>
 
+* **2025.03.17 "Hands-on with single-node deployment of the full-strength DeepSeek-R1"** 🔥🔥🔥 PaddlePaddle framework 3.0 fully upgrades large-model inference deployment, supporting many mainstream LLMs; the full-strength DeepSeek-R1 now deploys on a single node with doubled throughput! Everyone is welcome to try it out of the box. A prize campaign is now open: complete the DeepSeek-R1-MTP single-node deployment task and submit a high-quality review blog to win a cash reward! 💰💰💰
+Sign-up [link](https://www.wjx.top/vm/OlzzmbG.aspx#), event details: https://github.com/PaddlePaddle/PaddleNLP/issues/10166 , reference docs: https://github.com/PaddlePaddle/PaddleNLP/issues/10157
+
+* **2025.03.06 PaddleNLP now supports the Qwen/QwQ-32B model**: with only 32B parameters, its mathematical reasoning, coding and general abilities rival DeepSeek-R1 with 671B parameters (37B activated). With the PaddleNLP 3.0 toolkit, multiple parallelism strategies are available for [fine-tuning](./llm/README.md), [high-performance inference and low-bit quantization](./llm/docs/predict/qwen.md) and [service deployment](./llm/server/README.md).
+
 * **2025.02.20 🔥🔥 "PP-UIE information extraction engine fully upgraded"**: strengthened zero-shot learning enables efficient cold start and transfer learning with few or even zero labeled samples, sharply reducing annotation cost; long-text support handles documents up to 8192 tokens, recognizing key information across paragraphs for a complete understanding; a fully customizable training and inference pipeline is provided, with training efficiency 1.8x that of LLaMA-Factory.
 On Feb 26 (Wednesday) at 19:00 we will walk through the new PP-UIE technical approach and its deployment features, advantages and tips. Registration link: https://www.wjx.top/vm/mBKC6pb.aspx?udsid=606418

@@ -119,6 +122,7 @@
 | [Qwen2.5](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-0.5B, Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-7B-Instruct-1M, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-14B-Instruct-1M, Qwen/Qwen2.5-32B, Qwen/Qwen2.5-32B-Instruct, Qwen/Qwen2.5-72B, Qwen/Qwen2.5-72B-Instruct |
 | [Qwen2.5-Math](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-Math-1.5B, Qwen/Qwen2.5-Math-1.5B-Instruct, Qwen/Qwen2.5-Math-7B, Qwen/Qwen2.5-Math-7B-Instruct, Qwen/Qwen2.5-Math-72B, Qwen/Qwen2.5-Math-72B-Instruct, Qwen/Qwen2.5-Math-RM-72B |
 | [Qwen2.5-Coder](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen2.5-Coder-1.5B, Qwen/Qwen2.5-Coder-1.5B-Instruct, Qwen/Qwen2.5-Coder-7B, Qwen/Qwen2.5-Coder-7B-Instruct |
+| [Qwen3](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/Qwen3-0.6B, Qwen/Qwen3-1.7B, Qwen/Qwen3-4B, Qwen/Qwen3-8B, Qwen/Qwen3-14B, Qwen/Qwen3-32B, Qwen/Qwen3-30B-A3B, Qwen/Qwen3-235B-A22B, Qwen/Qwen3-0.6B-Base, Qwen/Qwen3-1.7B-Base, Qwen/Qwen3-4B-Base, Qwen/Qwen3-8B-Base, Qwen/Qwen3-14B-Base, Qwen/Qwen3-30B-A3B-Base |
 | [QwQ](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen/) | Qwen/QwQ-32B, Qwen/QwQ-32B-Preview |
 | [Yuan2](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/yuan/) | IEITYuan/Yuan2-2B, IEITYuan/Yuan2-51B, IEITYuan/Yuan2-102B |

csrc/gpu/all_reduce.cu

Lines changed: 2 additions & 0 deletions
@@ -72,12 +72,14 @@ void all_reduce(fptr_t _fa, paddle::Tensor& inp, paddle::Tensor& out,
         reinterpret_cast<half*>(out.data()), out.numel());
     break;
   }
+#if (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)
   case phi::DataType::BFLOAT16: {
     fa->allreduce<nv_bfloat16>(
         stream, reinterpret_cast<nv_bfloat16*>(reg_buffer),
         reinterpret_cast<nv_bfloat16*>(out.data()), out.numel());
     break;
   }
+#endif
   default:
     throw std::runtime_error(
         "custom allreduce only supports float32, float16 and bfloat16");

csrc/gpu/all_reduce.cuh

Lines changed: 1 addition & 1 deletion
@@ -98,7 +98,7 @@ DINLINE half& assign_add(half& a, half b) {
 }
 DINLINE float& assign_add(float& a, float b) { return a += b; }
 
-#if (__CUDA_ARCH__ >= 800 || !defined(__CUDA_ARCH__))
+#if (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)
 DINLINE float upcast_s(nv_bfloat16 val) { return __bfloat162float(val); }
 template <>
 DINLINE nv_bfloat16 downcast_s(float val) {

csrc/gpu/cpp_extensions.cu

Lines changed: 0 additions & 2 deletions
@@ -238,11 +238,9 @@ std::vector<paddle::Tensor> GetPaddingOffsetV2(const paddle::Tensor& input_ids,
 
 void SaveOutMmsg(const paddle::Tensor& x,
                  const paddle::Tensor& not_need_stop, // cpu
-                 const paddle::Tensor& msg_queue_id, // cpu
                  int64_t rank_id);
 
 void GetOutput(const paddle::Tensor& x,
-               const paddle::Tensor& msg_queue_id, // cpu
               int64_t rank_id,
               bool wait_flag);

csrc/gpu/get_output.cc

Lines changed: 16 additions & 4 deletions
@@ -20,21 +20,33 @@
 #include "paddle/extension.h"
 
 #define MAX_BSZ 512
+// #define GET_OUTPUT_DEBUG
 
 struct msgdata {
   long mtype;
   int mtext[MAX_BSZ + 2];  // stop_flag, bsz, tokens
 };
 
 void GetOutput(const paddle::Tensor& x,
-               const paddle::Tensor& msg_queue_id,
                int64_t rank_id,
                bool wait_flag) {
   if (rank_id > 0) return;
 
   static struct msgdata msg_rcv;
-  int queue_id_val = msg_queue_id.data<int>()[0];
-  static key_t key = ftok("./", queue_id_val);
+  int msg_queue_id = 1;
+  if (const char* inference_msg_queue_id_env_p =
+          std::getenv("INFERENCE_MSG_QUEUE_ID")) {
+    std::string inference_msg_queue_id_env_str(
+        inference_msg_queue_id_env_p);
+    int inference_msg_queue_id_from_env =
+        std::stoi(inference_msg_queue_id_env_str);
+#ifdef GET_OUTPUT_DEBUG
+    std::cout << "Your INFERENCE_MSG_QUEUE_ID is: "
+              << inference_msg_queue_id_from_env << std::endl;
+#endif
+    msg_queue_id = inference_msg_queue_id_from_env;
+  }
+  static key_t key = ftok("./", msg_queue_id);
 
   static int msgid = msgget(key, IPC_CREAT | 0666);
 
@@ -62,7 +74,7 @@ void GetOutput(const paddle::Tensor& x,
 }
 
 PD_BUILD_OP(get_output)
-    .Inputs({"x", "msg_queue_id"})
+    .Inputs({"x"})
     .Attrs({"rank_id: int64_t",
             "wait_flag: bool"})
     .Outputs({"x_out"})

csrc/gpu/multi_head_latent_attention.cu

Lines changed: 2 additions & 0 deletions
@@ -205,6 +205,7 @@ std::vector<paddle::Tensor> MultiHeadLatentAttention(
   meta_data.batch_size = cum_offsets.dims()[0];
 
   switch (query.dtype()) {
+#if (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)
     case paddle::DataType::BFLOAT16: {
       return MultiHeadLatentAttentionKernel<paddle::DataType::BFLOAT16>(
           meta_data,
@@ -253,6 +254,7 @@ std::vector<paddle::Tensor> MultiHeadLatentAttention(
           causal,
           speculate_decoder);
     }
+#endif
     case paddle::DataType::FLOAT16: {
       return MultiHeadLatentAttentionKernel<paddle::DataType::FLOAT16>(
           meta_data,

csrc/gpu/save_with_output_msg.cc

Lines changed: 21 additions & 4 deletions
@@ -20,6 +20,7 @@
 #include "paddle/extension.h"
 
 #define MAX_BSZ 512
+// #define SAVE_WITH_OUTPUT_DEBUG
 
 struct msgdata {
   long mtype;
@@ -28,16 +29,32 @@ struct msgdata {
 
 void SaveOutMmsg(const paddle::Tensor& x,
                  const paddle::Tensor& not_need_stop, // cpu
-                 const paddle::Tensor& msg_queue_id, // cpu
                  int64_t rank_id) {
   if (rank_id > 0) return;
   auto x_cpu = x.copy_to(paddle::CPUPlace(), false);
   int64_t *x_data = x_cpu.data<int64_t>();
   auto not_need_stop_data = not_need_stop.data<bool>()[0];
 
   static struct msgdata msg_sed;
-  int queue_id_val = msg_queue_id.data<int>()[0];
-  static key_t key = ftok("./", queue_id_val);
+  int msg_queue_id = 1;
+  if (const char* inference_msg_queue_id_env_p =
+          std::getenv("INFERENCE_MSG_QUEUE_ID")) {
+    std::string inference_msg_queue_id_env_str(
+        inference_msg_queue_id_env_p);
+    int inference_msg_queue_id_from_env =
+        std::stoi(inference_msg_queue_id_env_str);
+    msg_queue_id = inference_msg_queue_id_from_env;
+#ifdef SAVE_WITH_OUTPUT_DEBUG
+    std::cout << "Your INFERENCE_MSG_QUEUE_ID is: "
+              << inference_msg_queue_id_from_env << std::endl;
+#endif
+  } else {
+#ifdef SAVE_WITH_OUTPUT_DEBUG
+    std::cout << "Failed to get INFERENCE_MSG_QUEUE_ID from env, use default."
+              << std::endl;
+#endif
+  }
+  static key_t key = ftok("./", msg_queue_id);
   static int msgid = msgget(key, IPC_CREAT | 0666);
 
   msg_sed.mtype = 1;
@@ -54,7 +71,7 @@ void SaveOutMmsg(const paddle::Tensor& x,
 }
 
 PD_BUILD_OP(save_output)
-    .Inputs({"x", "not_need_stop", "msg_queue_id"})
+    .Inputs({"x", "not_need_stop"})
     .Attrs({"rank_id: int64_t"})
     .Outputs({"x_out"})
     .SetInplaceMap({{"x", "x_out"}})
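The sending side in SaveOutMmsg follows the same pattern. The sketch below is again a standalone illustration with placeholder token values, not the Paddle op itself; it shows how a message carrying the stop flag, batch size, and token ids would be pushed onto the queue that the receiver above reads:

```cpp
// Minimal sender sketch (assumptions: Linux, standalone program, placeholder values).
#include <cstdio>
#include <cstdlib>
#include <sys/ipc.h>
#include <sys/msg.h>

#define MAX_BSZ 512

struct msgdata {
  long mtype;
  int mtext[MAX_BSZ + 2];  // stop_flag, bsz, tokens
};

int main() {
  // Both sides must agree on the queue id so ftok("./", id) yields the same key.
  int msg_queue_id = 1;
  if (const char* env = std::getenv("INFERENCE_MSG_QUEUE_ID")) {
    msg_queue_id = std::atoi(env);
  }
  key_t key = ftok("./", msg_queue_id);
  int msgid = msgget(key, IPC_CREAT | 0666);
  if (msgid == -1) {
    std::perror("msgget");
    return 1;
  }

  msgdata msg_sed{};
  msg_sed.mtype = 1;
  msg_sed.mtext[0] = 1;    // not_need_stop flag (placeholder)
  msg_sed.mtext[1] = 2;    // batch size (placeholder)
  msg_sed.mtext[2] = 100;  // token id for sample 0 (placeholder)
  msg_sed.mtext[3] = 200;  // token id for sample 1 (placeholder)

  if (msgsnd(msgid, &msg_sed, sizeof(msg_sed.mtext), 0) == -1) {
    std::perror("msgsnd");
    return 1;
  }
  return 0;
}
```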

csrc/setup_cuda.py

Lines changed: 3 additions & 3 deletions
@@ -130,14 +130,11 @@ def get_gencode_flags():
         "./gpu/speculate_decoding_kernels/speculate_save_output.cc",
         "./gpu/speculate_decoding_kernels/speculate_get_output.cc",
         "./gpu/save_output_dygraph.cu",
-        "./gpu/cpp_extensions.cu",
         "./gpu/all_reduce.cu",
         "./gpu/quantization/per_token_group_quant.cu",
         "./gpu/quantization/per_tensor_quant_fp8.cu",
     ]
     sources += find_end_files("./gpu/speculate_decoding_kernels", ".cu")
-    sources += find_end_files("./gpu/moe/fused_moe/cutlass_kernels/moe_gemm/", ".cu")
-    sources += find_end_files("./gpu/moe/fused_moe/", ".cu")
 
     nvcc_compile_args = gencode_flags
     update_git_submodule()
@@ -174,6 +171,9 @@ def get_gencode_flags():
 
     sources += find_end_files("./gpu/append_attn", ".cu")
     sources += find_end_files("./gpu/append_attn/template_instantiation", ".cu")
+    sources += find_end_files("./gpu/moe/fused_moe/cutlass_kernels/moe_gemm/", ".cu")
+    sources += find_end_files("./gpu/moe/fused_moe/", ".cu")
+    sources += "./gpu/cpp_extensions.cu",
 
 
 fp8_auto_gen_directory = "gpu/cutlass_kernels/fp8_gemm_fused/autogen"

csrc/tools/build_wheel.sh

Lines changed: 27 additions & 23 deletions
@@ -61,7 +61,7 @@ function generate_sm_version(){
         sm_versions=($SM_VERSION )
     elif [ "$ARCHITECTURE" = "all" ]; then
         if awk -v version="$cuda_version" 'BEGIN { exit !(version >= 12.0) }'; then
-            sm_versions=(70 75 80 80 86 89 90 )
+            sm_versions=(70 75 80 86 89 90 )
         else
             sm_versions=(70 75 80 86 89 )
         fi
@@ -72,10 +72,12 @@ function generate_sm_version(){
 }
 
 function create_directories(){
-    mkdir -p $OPS_SRC_DIR/tmp/paddlenlp_ops
-    touch $OPS_SRC_DIR/tmp/setup.py
-    touch $OPS_SRC_DIR/tmp/paddlenlp_ops/__init__.py
-    echo '# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+    for sm_version in "${sm_versions[@]}"; do
+        echo "create sm$sm_version"
+        mkdir -p $OPS_SRC_DIR/tmp/paddlenlp_ops
+        touch $OPS_SRC_DIR/tmp/setup.py
+        touch $OPS_SRC_DIR/tmp/paddlenlp_ops/__init__.py
+        echo '# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -93,6 +95,7 @@ function create_directories(){
 
 import os
 from datetime import datetime
+import paddle
 
 from setuptools import find_packages, setup
 
@@ -109,14 +112,19 @@ def read(file: str):
         content = f.read().strip()
     return content
 
-
 def read_version():
     """
     read version and return content
     """
     __version__ = "3.0.0b4.post"
+
     formatted_date = datetime.now().date().strftime("%Y%m%d")
-    __version__ = __version__.replace(".post", ".post{}".format(formatted_date))
+    cuda_version = float(paddle.version.cuda())
+    sm_version=80
+    paddle_commit = paddle.__git_commit__[:7]
+    build_tag = "{}+cuda{}sm{}paddle{}".format(formatted_date, cuda_version, sm_version, paddle_commit)
+
+    __version__ = __version__.replace(".post", ".post{}".format(build_tag))
 
     return __version__
 
@@ -184,12 +192,9 @@ try:
 except ImportError:
     logger.WARNING(f"No {module_name} ")
 ' > $OPS_SRC_DIR/tmp/paddlenlp_ops/__init__.py
-
-    for sm_version in "${sm_versions[@]}"; do
-        echo "create sm$sm_version"
-        mkdir -p $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}
-        touch $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}/__init__.py
-        echo '# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+        mkdir -p $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}
+        touch $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}/__init__.py
+        echo '# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -210,6 +215,7 @@ try:
 except ImportError:
     logger.WARNING("No paddlenlp_ops_'${sm_version}' ops")
 ' > $OPS_SRC_DIR/tmp/paddlenlp_ops/sm${sm_version}/__init__.py
+        build_ops
     done
 }
 
@@ -228,11 +234,11 @@ function init() {
 }
 
 function build_ops() {
-    for sm_version in "${sm_versions[@]}"; do
-        echo "Building and installing for sm_version: $sm_version"
-        build_and_install_ops $sm_version
-    done
-    return
+    echo "Building and installing for sm_version: $sm_version"
+    build_and_install_ops $sm_version
+    build_and_install_whl
+    unittest
+    cleanup
 }
 
 function copy_ops(){
@@ -269,6 +275,7 @@ function build_and_install_whl() {
     echo -e "${BLUE}[build]${NONE} building paddlenlp_ops wheel..."
     rm -rf ./dist
     cd ${TMP_DIR}
+    sed -i "s/sm_version=80/sm_version=${sm_version}/g" setup.py
     ${python} setup.py bdist_wheel --dist-dir ./$DIST_DIR
     if [ $? -ne 0 ]; then
         echo -e "${RED}[FAIL]${NONE} build paddlenlp_ops wheel failed !"
@@ -286,7 +293,8 @@ function build_and_install_whl() {
     fi
     echo -e "${BLUE}[install]${NONE} ${GREEN}paddlenlp_ops install success\n"
     cd ..
-    mv $DIST_DIR ../
+    mkdir -p ../$DIST_DIR
+    mv $DIST_DIR/* ../$DIST_DIR/
     cd ..
 }
 
@@ -321,10 +329,6 @@ trap 'abort' 0
 set -e
 
 init
-build_ops
-build_and_install_whl
-unittest
-cleanup
 
 # get Paddle version
 PADDLE_VERSION=`${python} -c "import paddle; print(paddle.version.full_version)"`

llm/alignment/rl/README.md

Lines changed: 9 additions & 1 deletion
@@ -13,11 +13,16 @@ REINFORCE++ is an improved version of the classic REINFORCE algorithm that fuses PPO's key ...
 ```shell
 git clone https://github.com/PaddlePaddle/PaddleNLP.git
 ```
-3. Install paddlenlp_ops; see PaddleNLP/csrc for installation instructions (required)
+3. Install the paddlenlp_ops inference operators; see PaddleNLP/csrc for installation instructions (required)
 ```shell
 cd your_PaddleNLP_path/csrc
 python setup_cuda.py install
 ```
+4. Install the fused_ln and fast_ln training operators; see PaddleNLP/slm/model_zoo/gpt-3/external_ops (required)
+```shell
+cd your_PaddleNLP_path/slm/model_zoo/gpt-3/external_ops
+python setup.py install
+```
 
 ## Supported models
 
@@ -165,6 +170,9 @@ export FLAGS_mla_use_tensorcore=0
 export FLAGS_cascade_attention_max_partition_size=2048
 
 python -u -m paddle.distributed.launch --devices "0,1,2,3" run_rl.py ../../config/qwen/grpo_argument.yaml
+
+# Training command for QWEN32B with 2k prompt + 30k response on 9 machines with 8x80G GPUs each:
+# python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_rl.py ../../config/qwen/grpo_32b_argument.yaml
 ```
 We provide a reproducible [wandb log](https://api.wandb.ai/links/junyu/5jiulhem) generated with the script above.
