Commit 5cd8fd5

add tokencls triton inference server for ernie-3.0 (#2502)
* add triton service for ernie-3.0
* modify triton code
* Optimize the triton server code
* Optimize the triton server code
* triton server add tokencls
1 parent 016db3c commit 5cd8fd5

File tree

9 files changed: +465 -25 lines

model_zoo/ernie-3.0/deploy/serving/token_cls_service.py

+2
@@ -125,6 +125,8 @@ def postprocess(self, input_dicts, fetch_dict, data_id, log_id):
                     "pos": [start, len(token_label) - 1],
                     "entity":
                     input_data[batch][start:len(token_label) - 1],
+                    "label":
+                    label_name,
                 })
             value.append(items)
         out_dict = {
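The diff above adds a `label` field to each entity emitted by the serving post-processing, alongside the existing `pos` and `entity` fields. For orientation only, below is a minimal, self-contained sketch of this kind of BIO decoding; the helper name and label list are illustrative assumptions, not the actual service code.

```python
# Minimal sketch of BIO-style entity extraction that emits {"pos", "entity", "label"}
# items like the postprocess() change above. Label list and helper name are hypothetical.
LABEL_NAMES = ["B-LOC", "I-LOC", "O"]  # illustrative subset of a real label set


def extract_entities(text, token_label_ids):
    """Group consecutive B-*/I-* tokens into {"pos", "entity", "label"} items."""
    items = []
    start, label_name = None, None
    for idx, label_id in enumerate(token_label_ids):
        tag = LABEL_NAMES[label_id]
        if tag.startswith("B-"):
            start, label_name = idx, tag[2:]
        elif tag.startswith("I-") and start is not None:
            continue  # still inside the current entity
        else:
            if start is not None:
                items.append({
                    "pos": [start, idx - 1],
                    "entity": text[start:idx],
                    "label": label_name,
                })
            start, label_name = None, None
    if start is not None:  # entity running to the end of the sequence
        items.append({
            "pos": [start, len(token_label_ids) - 1],
            "entity": text[start:],
            "label": label_name,
        })
    return items


if __name__ == "__main__":
    # "北京" tagged B-LOC / I-LOC, everything else O
    print(extract_entities("北京的涮肉", [0, 1, 2, 2, 2]))
    # -> [{'pos': [0, 1], 'entity': '北京', 'label': 'LOC'}]
```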

model_zoo/ernie-3.0/deploy/triton/README.md

+101-19
@@ -6,6 +6,7 @@
 - [Environment Setup](#environment-setup)
 - [Model Conversion](#model-conversion)
 - [Model Deployment](#model-deployment)
+- [Client Requests](#client-requests)

 ## Environment Setup
 You need to [set up the PaddleNLP runtime environment](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst) as well as the Triton Server runtime environment.
@@ -43,79 +44,142 @@ python3 -m pip install faster_tokenizers

 When deploying a service with Triton and the ONNX Runtime backend, the models must first be converted to ONNX format.

-Download the ERNIE 3.0 news classification model (skip this step if you already have a trained model):
+Download the ERNIE 3.0 news classification model and sequence labeling model (skip this step if you already have trained models):
 ```bash
 # Download and unzip the news classification model
 wget https://paddlenlp.bj.bcebos.com/models/transformers/ernie_3.0/tnews_pruned_infer_model.zip
 unzip tnews_pruned_infer_model.zip
+
+# Download and unzip the sequence labeling model
+wget https://paddlenlp.bj.bcebos.com/models/transformers/ernie_3.0/msra_ner_pruned_infer_model.zip
+unzip msra_ner_pruned_infer_model.zip
 ```

 Use Paddle2ONNX to convert the Paddle static graph models to ONNX format with the commands below. After each command finishes successfully, a model.onnx file is generated in the current directory.
 ```bash
 # Fill in the model path according to your actual setup
 # Convert the news classification model
-paddle2onnx --model_dir tnews_pruned_infer_model/ --model_filename float32.pdmodel --params_filename float32.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True
+paddle2onnx --model_dir tnews_pruned_infer_model --model_filename float32.pdmodel --params_filename float32.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True

-# Move the converted ONNX model into the model repository directory
+# Move the converted ONNX model into the classification task's model repository directory
 mv model.onnx /models/ernie_seqcls_model/1
+
+# Convert the sequence labeling model
+paddle2onnx --model_dir msra_ner_pruned_infer_model --model_filename float32.pdmodel --params_filename float32.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True
+
+# Move the converted ONNX model into the sequence labeling task's model repository directory
+mv model.onnx /models/ernie_tokencls_model/1
 ```
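As an optional sanity check (not part of the README's documented steps), the exported model.onnx can be loaded with onnxruntime before being moved into the model repository. A minimal sketch is below; it assumes onnxruntime is installed in the conversion environment, and the input names input_ids / token_type_ids are taken from the ernie_*_model config.pbtxt files later in this commit.

```python
# Optional sanity check of the exported ONNX model (assumes `pip install onnxruntime`).
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Input/output names and types should match the ernie_*_model config.pbtxt files below.
print([(i.name, i.type, i.shape) for i in sess.get_inputs()])
print([(o.name, o.type, o.shape) for o in sess.get_outputs()])

# Run a tiny dummy batch just to confirm the graph executes.
dummy = {
    "input_ids": np.array([[1, 2]], dtype=np.int64),
    "token_type_ids": np.zeros((1, 2), dtype=np.int64),
}
print(sess.run(None, dummy)[0].shape)
```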
 See [Paddle2ONNX command line parameters](https://github.com/PaddlePaddle/Paddle2ONNX#%E5%8F%82%E6%95%B0%E9%80%89%E9%A1%B9) for a description of the Paddle2ONNX command line options.

-After the model is downloaded and converted, the models directory looks like this:
+After the models are downloaded and converted, the models directory for the classification task looks like this:
 ```
 models
-├── ernie_seqcls
+├── ernie_seqcls               # Pipeline for the classification task
 │   ├── 1
-│   └── config.pbtxt
-├── ernie_seqcls_model
+│   └── config.pbtxt           # Chains pre-processing, model inference and post-processing together
+├── ernie_seqcls_model         # Model inference for the classification task
 │   ├── 1
 │   │   └── model.onnx
 │   └── config.pbtxt
-├── ernie_seqcls_postprocess
+├── ernie_seqcls_postprocess   # Post-processing for the classification task
 │   ├── 1
 │   │   └── model.py
 │   └── config.pbtxt
-└── ernie_tokenizer
+└── ernie_tokenizer            # Pre-processing tokenization
     ├── 1
     │   └── model.py
     └── config.pbtxt
 ```
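The `ernie_tokenizer` and `*_postprocess` entries above are Triton python-backend models, each implemented as a `model.py`. For orientation only, here is a hypothetical skeleton of such a tokenizer node. The tensor names (INPUT_0 -> OUTPUT_0 / OUTPUT_1) follow the ensemble configs in this commit, but the tokenizer call and padding logic are assumptions, and the module only runs inside Triton's python backend; treat it as a sketch rather than the repository's actual `model.py`.

```python
# Hypothetical sketch of a python-backend tokenizer node (e.g. models/ernie_tokenizer/1/model.py).
import numpy as np
import triton_python_backend_utils as pb_utils  # available only inside Triton's python backend
from paddlenlp.transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Assumed checkpoint name; the deployed service may use a different tokenizer.
        self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")

    def execute(self, requests):
        responses = []
        for request in requests:
            # INPUT_0 is a TYPE_STRING tensor with one text per batch element.
            raw = pb_utils.get_input_tensor_by_name(request, "INPUT_0").as_numpy()
            texts = [t[0].decode("utf-8") if isinstance(t[0], bytes) else str(t[0]) for t in raw]

            # Encode each text, then pad to the longest sequence in the batch (pad id 0).
            encoded = [self.tokenizer(t) for t in texts]
            max_len = max(len(e["input_ids"]) for e in encoded)
            input_ids = np.zeros((len(texts), max_len), dtype=np.int64)
            token_type_ids = np.zeros((len(texts), max_len), dtype=np.int64)
            for i, e in enumerate(encoded):
                input_ids[i, :len(e["input_ids"])] = e["input_ids"]
                token_type_ids[i, :len(e["token_type_ids"])] = e["token_type_ids"]

            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT_0", input_ids),
                pb_utils.Tensor("OUTPUT_1", token_type_ids),
            ]))
        return responses
```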

 ## Model Deployment
-
 The triton directory contains the configuration for starting the pipeline service and the code for sending prediction requests, including:

 ```
 models                    # Model repository required by Triton, containing the models and service configuration files
 seq_cls_rpc_client.py     # Script that sends pipeline prediction requests for the news classification task
+token_cls_rpc_client.py   # Script that sends pipeline prediction requests for the sequence labeling task
 ```

-### Starting the service
+*Note*: when the service starts, each Triton Server Python backend process requests `64M` of memory by default, so a Docker container started with default settings cannot run multiple Python backend nodes. There are two solutions:
+- 1. Set the `shm-size` parameter when starting the container, for example: `docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash`
+- 2. Set the Python backend's `shm-default-byte-size` parameter when starting the service, e.g. set the Python backend's default memory to 10M: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760`

+### Classification task
 Run the following command inside the container to start the service:
 ```
+# By default, all models under models are started
 tritonserver --model-repository=/models
+
+# Alternatively, start only the classification task via these parameters
+tritonserver --model-repository=/models --model-control-mode=explicit --load-model=ernie_seqcls
 ```
 The output looks like this:
 ```
 I0601 08:08:27.951220 8697 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f5c1c000000' with size 268435456
 I0601 08:08:27.953774 8697 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
-I0601 08:08:27.958255 8697 model_repository_manager.cc:1022] loading: ernie_seqcls_postprocess:1
-I0601 08:08:28.058467 8697 model_repository_manager.cc:1022] loading: ernie_seqcls_model:1
-I0601 08:08:28.062170 8697 python.cc:1875] TRITONBACKEND_ModelInstanceInitialize: ernie_seqcls_postprocess_0 (CPU device 0)
-I0601 08:08:28.158848 8697 model_repository_manager.cc:1022] loading: ernie_tokenizer:1
+I0601 08:08:27.958255 8697 model_repository_manager.
+...
+I0613 08:59:20.577820 10021 server.cc:592]
++----------------------------+---------+--------+
+| Model                      | Version | Status |
++----------------------------+---------+--------+
+| ernie_seqcls               | 1       | READY  |
+| ernie_seqcls_model         | 1       | READY  |
+| ernie_seqcls_postprocess   | 1       | READY  |
+| ernie_tokenizer            | 1       | READY  |
++----------------------------+---------+--------+
 ...
 I0601 07:15:15.923270 8059 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
 I0601 07:15:15.923604 8059 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
 I0601 07:15:15.964984 8059 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
 ```

-*Note:* when the service starts, each Triton Server Python backend process requests `64M` of memory by default, so a Docker container started with default settings cannot run multiple Python backend nodes. Two solutions:
-- 1. Set the `shm-size` parameter when starting the container, for example: `docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash`
-- 2. Set the Python backend's `shm-default-byte-size` parameter when starting the service, e.g. set the Python backend's default memory to 10M: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760`
+### Sequence labeling task
+Run the following command inside the container to start the sequence labeling service:
+```
+tritonserver --model-repository=/models --model-control-mode=explicit --load-model=ernie_tokencls --backend-config=python,shm-default-byte-size=10485760
+```
+The output looks like this:
+```
+I0601 08:08:27.951220 8697 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f5c1c000000' with size 268435456
+I0601 08:08:27.953774 8697 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
+I0601 08:08:27.958255 8697 model_repository_manager.
+...
+I0613 08:59:20.577820 10021 server.cc:592]
++----------------------------+---------+--------+
+| Model                      | Version | Status |
++----------------------------+---------+--------+
+| ernie_tokencls             | 1       | READY  |
+| ernie_tokencls_model       | 1       | READY  |
+| ernie_tokencls_postprocess | 1       | READY  |
+| ernie_tokenizer            | 1       | READY  |
++----------------------------+---------+--------+
+...
+I0601 07:15:15.923270 8059 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
+I0601 07:15:15.923604 8059 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
+I0601 07:15:15.964984 8059 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
+```
+
+## Client Requests
+Client requests can be sent by running the scripts locally, or by pulling the official client image and running them inside a container.

+To run the scripts locally, first install the dependencies:
+```
+pip install grpcio
+pip install tritonclient==2.10.0
+```
+
+Pull the official image and start a container:
+```
+# Pull the image
+docker pull nvcr.io/nvidia/tritonserver:21.10-py3-sdk

-#### Starting the client test
+# Start the container
+docker run -it --net=host --name triton_client -v /path/to/triton:/triton_code nvcr.io/nvidia/tritonserver:21.10-py3-sdk bash
+```
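Optionally, before running the task-specific client scripts below, you can confirm over gRPC that the server and the pipelines are ready. A minimal sketch, assuming the dependencies installed above and the default gRPC port 8001 from the server log:

```python
# Optional readiness check before sending inference requests
# (assumes `pip install grpcio tritonclient==2.10.0` as above).
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

client = grpcclient.InferenceServerClient(url="127.0.0.1:8001")  # adjust to the server machine
print("server ready:", client.is_server_ready())

for name in ["ernie_seqcls", "ernie_tokencls"]:
    try:
        # Pipelines not loaded with --load-model in explicit mode will report not ready.
        print(name, "ready:", client.is_model_ready(name))
    except InferenceServerException as err:
        print(name, "not available:", err)
```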
+
+### Classification task
 Note: turn off any proxy before sending client requests, and change the IP address in the main function to the machine where the service is running.
 ```
 python seq_cls_grpc_client.py
@@ -126,3 +190,21 @@ python seq_cls_grpc_client.py
 {'label': array([4]), 'confidence': array([0.53198355], dtype=float32)}
 acc: 0.5731
 ```
+
+### Sequence labeling task
+Note: turn off any proxy before sending client requests, and change the IP address in the main function to the machine where the service is running.
+```
+python token_cls_grpc_client.py
+```
+The output looks like this:
+```
+input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。
+The model detects all entities:
+entity: 北京 label: LOC pos: [0, 1]
+entity: 重庆 label: LOC pos: [6, 7]
+entity: 成都 label: LOC pos: [12, 13]
+input data: 原产玛雅故国的玉米,早已成为华夏大地主要粮食作物之一。
+The model detects all entities:
+entity: 玛雅 label: LOC pos: [2, 3]
+entity: 华夏 label: LOC pos: [14, 15]
+```

model_zoo/ernie-3.0/deploy/triton/models/ernie_seqcls_model/config.pbtxt

+8-1
@@ -1,3 +1,4 @@
+# onnxruntime backend
 platform: "onnxruntime_onnx"
 max_batch_size: 64
 input [
@@ -22,15 +23,21 @@ output [

 instance_group [
     {
+      # Create one instance
       count: 1
+      # Run inference on GPU (KIND_CPU, KIND_GPU)
       kind: KIND_GPU
     }
 ]

 optimization {
-  graph: {level: -1}
+  # Graph optimization level: all optimizations are enabled by default; -1 enables basic optimizations, 1 enables additional extended optimizations (e.g. fusion)
+  graph: {level: 1}
 }

+# Number of threads for intra-op parallelism; 0 means use the default, i.e. the number of CPU cores
 parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
+# Whether the graph is executed sequentially or in parallel; 0 means sequential, 1 means parallel (suited to models with many branches)
 parameters { key: "execution_mode" value: { string_value: "0" } }
+# Number of threads for parallel graph execution; only takes effect when execution_mode is set to 1
 parameters { key: "inter_op_thread_count" value: { string_value: "0" } }

model_zoo/ernie-3.0/deploy/triton/models/ernie_tokencls/config.pbtxt (new file)

+66
@@ -0,0 +1,66 @@
name: "ernie_tokencls"
platform: "ensemble"
max_batch_size: 64
input [
  {
    name: "INPUT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "ernie_tokenizer"
      model_version: 1
      input_map {
        key: "INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "OUTPUT_0"
        value: "tokenizer_input_ids"
      }
      output_map {
        key: "OUTPUT_1"
        value: "tokenizer_token_type_ids"
      }
    },
    {
      model_name: "ernie_tokencls_model"
      model_version: 1
      input_map {
        key: "input_ids"
        value: "tokenizer_input_ids"
      }
      input_map {
        key: "token_type_ids"
        value: "tokenizer_token_type_ids"
      }
      output_map {
        key: "linear_113.tmp_1"
        value: "OUTPUT_2"
      }
    },
    {
      model_name: "ernie_tokencls_postprocess"
      model_version: 1
      input_map {
        key: "POST_INPUT"
        value: "OUTPUT_2"
      }
      output_map {
        key: "POST_OUTPUT"
        value: "OUTPUT"
      }
    }
  ]
}
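Given the interface defined above (a single TYPE_STRING element named INPUT per request, batched up to 64, and a TYPE_STRING OUTPUT produced by the postprocess step), a gRPC request to the ernie_tokencls ensemble could look like the minimal sketch below. It uses the tritonclient package from the README's client section; the exact request construction and output handling in token_cls_grpc_client.py may differ.

```python
# Minimal sketch of a gRPC request to the ernie_tokencls ensemble defined above.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="127.0.0.1:8001")  # adjust to the server machine

# One string per request element; shape [batch, 1] matches max_batch_size with dims: [ 1 ].
texts = np.array([["北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。"]], dtype=np.object_)

infer_input = grpcclient.InferInput("INPUT", list(texts.shape), "BYTES")  # TYPE_STRING maps to BYTES
infer_input.set_data_from_numpy(texts)

result = client.infer(
    model_name="ernie_tokencls",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("OUTPUT")],
)
# The postprocess step returns its result as a string tensor named OUTPUT.
print(result.as_numpy("OUTPUT"))
```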

model_zoo/ernie-3.0/deploy/triton/models/ernie_tokencls_model/config.pbtxt (new file)

+37
@@ -0,0 +1,37 @@
platform: "onnxruntime_onnx"
max_batch_size: 64
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "linear_113.tmp_1"
    data_type: TYPE_FP32
    dims: [ -1, 7 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

optimization {
  # Graph optimization level: all optimizations are enabled by default; -1 enables basic optimizations, 1 enables additional extended optimizations (e.g. fusion)
  graph: {level: -1}
}

parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
parameters { key: "execution_mode" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }
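The `[ -1, 7 ]` output above is a matrix of per-token logits over 7 token-classification labels, which the postprocess node turns into entities. A small sketch of how such logits map to label ids and BIO tags follows; the concrete label order is defined by the MSRA NER training data and is an assumption here.

```python
# Sketch: turn per-token [-1, 7] logits into label ids / BIO tags.
# The label order below is an illustrative assumption, not taken from this commit.
import numpy as np

LABEL_NAMES = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"]


def logits_to_tags(logits):
    """logits: float32 array of shape [seq_len, 7] -> list of BIO tag strings."""
    label_ids = np.argmax(logits, axis=-1)
    return [LABEL_NAMES[i] for i in label_ids]


# Dummy logits for a 3-token sequence, just to show the shape handling.
dummy = np.random.rand(3, 7).astype(np.float32)
print(logits_to_tags(dummy))
```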
