
Commit b507eeb

add readme file
1 parent 2c139d4 commit b507eeb

File tree: 3 files changed (+187 −15 lines)


llm/benchmark/rl/README.md

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
# Large Language Model Throughput Testing Framework

This framework is designed to test the **throughput performance** of Large Language Models (LLMs) across different deployment methods: **offline inference** (using PaddlePaddle and PyTorch) and **online inference** (via API). It specifically focuses on evaluating the model's ability to handle **batch queries**, measuring throughput in tokens per second under various configurations and batch sizes.

## Features

* **Diverse Deployment Method Support:** Tests LLMs deployed via online API, and offline inference with PaddlePaddle and PyTorch.
* **Batch Query Throughput Calculation:** Accurately measures throughput (tokens/s) for concurrent queries, providing insights into the model's performance under load.
* **Detailed Time Logging:** Records the total time for each batch processing operation.

## How It Works

This framework operates by sending batched requests to specified endpoints (API or local inference scripts) and collecting performance data on how the model generates responses for multiple queries simultaneously.

1. **Input Queries:** You provide a set of questions as test input, typically from a `.parquet` or text file.
2. **Batch Processing:** The framework groups these questions into batches of a specified `rollout_input_batch_size`.
3. **Generate Responses:** For each query within a batch, the framework requests the model to generate `rollout_n` responses.
4. **Time Measurement:** The total time from sending a batch of questions to receiving all corresponding responses is recorded.
5. **Token Statistics:** The total number of tokens generated across all responses in a batch is summed.
6. **Throughput Calculation:** Throughput (tokens/s) is calculated by dividing the total tokens generated by the total time taken for the batch to complete (see the sketch below).
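
The per-batch measurement reduces to a few lines of bookkeeping. Below is a minimal, illustrative Python sketch of steps 2 through 6; `generate_fn` and the other names are placeholders standing in for whichever backend is under test, not the framework's actual API.

```python
import time


def chunk(items, size):
    """Group the input questions into batches of `rollout_input_batch_size` (step 2)."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def measure_throughput(generate_fn, questions, batch_size, rollout_n):
    """Return per-batch throughput in tokens/s.

    `generate_fn(question, n)` is a placeholder for the backend call (API,
    PaddlePaddle, or PyTorch inference); it is assumed to return `n`
    token-id lists for one question.
    """
    results = []
    for batch in chunk(questions, batch_size):                              # step 2
        start = time.perf_counter()
        responses = [generate_fn(q, rollout_n) for q in batch]              # step 3
        elapsed = time.perf_counter() - start                               # step 4
        total_tokens = sum(len(ids) for resp in responses for ids in resp)  # step 5
        results.append(total_tokens / elapsed)                              # step 6
    return results
```
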
## Usage

This section details how to run throughput tests for each deployment method using the provided shell scripts.

### Data Preparation

To run the tests, first download and extract the `rl_data.tar.gz` archive, which contains the GSM8K dataset in a format suitable for testing:

```bash
cd llm/benchmark/rl
wget https://paddle-qa.bj.bcebos.com/paddlenlp/rl_data.tar.gz
tar -zxvf rl_data.tar.gz
```

Extracting the archive creates a `data` folder containing the GSM8K dataset.
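
If you want to sanity-check the extracted data before running a benchmark, the parquet file can be inspected with pandas. This is an optional check; the exact columns depend on the packaged dataset and are not guaranteed here.

```python
import pandas as pd

# Peek at the GSM8K split shipped in rl_data.tar.gz.
df = pd.read_parquet("./data/gsm8k/instruct/train.parquet")
print(df.shape)           # number of rows and columns
print(list(df.columns))   # column names vary by dataset; inspect before relying on them
print(df.head(2))         # first two records
```
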
### Online API Inference

This script tests the throughput of a remote LLM API.

**Configuration (`api_serve.sh`):**

```bash
output_dir="api_serve_results"

python api_serve.py \
    --openai_urls "your_url1" "your_url2" \
    --api_keys "key1" "key2" \
    --model "Qwen2.5-7B-Instruct-1M" \
    --tokenizer "Qwen/Qwen2.5-7B-Instruct-1M" \
    --input_file ./data/gsm8k/instruct/train.parquet \
    --output_dir ${output_dir} \
    --rollout_input_batch_size 8 \
    --rollout_n 8 \
    --top_p 1.0 \
    --temperature 0.7 \
    --max_dec_len 8192 \
    --limit_rows 512
```

* **`--openai_urls`**: URLs of the API endpoints to test.
* **`--api_keys`**: API keys for authentication (if required).
* **`--model`**: Name of the model being tested.
* **`--tokenizer`**: Path or name of the tokenizer.
* **`--input_file`**: Path to the input dataset file.
* **`--output_dir`**: Directory to save output results.
* **`--rollout_input_batch_size`**: The batch size for API requests.
* **`--rollout_n`**: Number of responses to generate for each input query.
* **`--max_dec_len`**: Maximum decoding length for responses.
* **`--limit_rows`**: Limit on the number of input rows processed.

**Run command:**

```bash
bash scripts/api_serve.sh
```
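
For context, `api_serve.py` builds on the `AsyncOpenAI` client, so a single batch maps naturally onto concurrent chat-completion requests with `n = rollout_n` samples each. The sketch below only illustrates that pattern with placeholder values (URL, key, prompts); it is not the script's actual implementation.

```python
import asyncio

from openai import AsyncOpenAI


async def run_one_batch():
    # Placeholders: substitute your endpoint, key, and questions.
    client = AsyncOpenAI(base_url="your_url1", api_key="key1")
    questions = ["Question 1 from the parquet file", "Question 2 from the parquet file"]

    # One request per question, issued concurrently; each asks for rollout_n completions.
    tasks = [
        client.chat.completions.create(
            model="Qwen2.5-7B-Instruct-1M",
            messages=[{"role": "user", "content": q}],
            n=8,              # rollout_n
            temperature=0.7,
            top_p=1.0,
            max_tokens=8192,  # max_dec_len
        )
        for q in questions
    ]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(f"received {len(r.choices)} completions")


asyncio.run(run_one_batch())
```
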
### Offline PaddlePaddle Inference

This script tests the throughput of an LLM using offline PaddlePaddle inference, optionally with distributed (multi-GPU) processing.

**Configuration (`paddle_infer.sh`):**

```bash
unset PADDLE_TRAINERS_NUM
unset PADDLE_ELASTIC_JOB_ID
unset PADDLE_TRAINER_ENDPOINTS
unset DISTRIBUTED_TRAINER_ENDPOINTS
unset FLAGS_START_PORT
unset PADDLE_ELASTIC_TIMEOUT

export PYTHONPATH="your_paddlenlp_path/PaddleNLP":$PYTHONPATH
export PYTHONPATH="your_paddlenlp_path/PaddleNLP/llm":$PYTHONPATH

export FLAGS_set_to_1d=False
export NVIDIA_TF32_OVERRIDE=0
export FLAGS_dataloader_use_file_descriptor=False
export HF_DATASETS_DOWNLOAD_TIMEOUT=1
export FLAGS_gemm_use_half_precision_compute_type=False
export FLAGS_force_cublaslt_no_reduced_precision_reduction=True

export FLAGS_custom_allreduce=0
export FLAGS_mla_use_tensorcore=0
export FLAGS_cascade_attention_max_partition_size=2048

export CUDA_VISIBLE_DEVICES=4,5,6,7
output_dir="pdpd_bf16_offline"

python -u -m paddle.distributed.launch --log_dir ${output_dir}/logs --gpus ${CUDA_VISIBLE_DEVICES} paddle_infer.py \
    --actor_model_name_or_path your_model_name \
    --max_src_len 2048 \
    --min_dec_len 32 \
    --max_dec_len 30720 \
    --top_p 1.0 \
    --temperature 1.0 \
    --rollout_input_batch_size 4 \
    --rollout_n 8 \
    --rollout_max_num_seqs 24 \
    --rollout_quant_type "" \
    --tensor_parallel_degree 4 \
    --limit_rows 640 \
    --input_file file.parquet \
    --output_dir ${output_dir} > ./paddleinfer.log 2>&1
```

* **`CUDA_VISIBLE_DEVICES`**: Specifies the GPUs to be used.
* **`paddle.distributed.launch`**: Launches a distributed PaddlePaddle job; here it runs inference across the listed GPUs.
* **`--actor_model_name_or_path`**: Path to the pre-trained model.
* **`--max_src_len`**: Maximum source sequence length.
* **`--rollout_input_batch_size`**: The batch size for inference.
* **`--rollout_n`**: Number of responses to generate for each input query.
* **`--tensor_parallel_degree`**: Degree of tensor parallelism for distributed inference; set to 4 here to match the four visible GPUs.
* **`--input_file`**: Path to the input dataset file.
* **`--output_dir`**: Directory to save output results and logs.
### Offline PyTorch Inference

This script tests the throughput of an LLM using offline PyTorch inference.

**Configuration (`torch_infer.sh`):**

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7

output_dir="vllm_bf16_offline_flashattn"

python torch_infer.py \
    --actor_model_name_or_path Qwen/Qwen2.5-7B-Instruct-1M \
    --max_src_len 2048 \
    --min_dec_len 32 \
    --max_dec_len 30720 \
    --top_p 1.0 \
    --temperature 1.0 \
    --rollout_input_batch_size 4 \
    --rollout_n 8 \
    --tensor_parallel_degree 4 \
    --limit_rows 640 \
    --input_file ./data/gsm8k/instruct/train.parquet \
    --output_dir ${output_dir} \
    --gpu_memory_utilization 0.8 > ./torchinferflashattn.log 2>&1
```

* **`--gpu_memory_utilization`**: Fraction of GPU memory to be reserved for the model.
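
The output directory name and the `gpu_memory_utilization` flag suggest this path drives a vLLM engine, though that is an inference from the configuration rather than something stated here. If you want a standalone point of comparison, a generic offline vLLM generation pass can look like the sketch below; it is not `torch_infer.py`'s code, and the prompt is a placeholder.

```python
# Generic offline vLLM pass for comparison only; NOT torch_infer.py's implementation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-1M",
    tensor_parallel_size=4,        # mirrors --tensor_parallel_degree 4
    gpu_memory_utilization=0.8,    # mirrors --gpu_memory_utilization 0.8
)
sampling = SamplingParams(
    n=8,                           # mirrors --rollout_n 8
    temperature=1.0,
    top_p=1.0,
    max_tokens=30720,              # mirrors --max_dec_len 30720
)
outputs = llm.generate(["A placeholder GSM8K-style question"], sampling)
total_tokens = sum(len(c.token_ids) for out in outputs for c in out.outputs)
print(f"generated {total_tokens} tokens")
```
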
-----

## Output Results

The `output_dir` contains the following files:

**1. Statistics Files**

`dispersed_stats.csv`

Per-batch request length and throughput statistics. Fields:
`batch_index, rollout_lengths, min_length, max_length, avg_length, completion_time, throughput_tokens_per_sec`

`global_stats.csv`

Aggregated global metrics. Fields:
`batch_index, min_response_tokens, max_response_tokens, avg_response_tokens, total_response_tokens, completion_time, throughput_tokens_per_sec`

**2. Detailed Records**

`rollout_details.jsonl`

Raw per-request outputs (JSON Lines format), including input and output text.
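
After a run, the two CSV files can be loaded with pandas to summarize or compare runs; the column names follow the field lists above, and the directory is whatever `output_dir` your script used (the `api_serve_results` example here).

```python
import pandas as pd

# Per-batch statistics: one row per processed batch.
per_batch = pd.read_csv("api_serve_results/dispersed_stats.csv")
print(per_batch["throughput_tokens_per_sec"].describe())  # spread of per-batch throughput

# Global statistics: aggregated metrics for the whole run.
overall = pd.read_csv("api_serve_results/global_stats.csv")
print(overall[["total_response_tokens", "completion_time", "throughput_tokens_per_sec"]])
```
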

llm/benchmark/rl/api_serve.py

Lines changed: 1 addition & 4 deletions

```diff
@@ -26,13 +26,10 @@
 import pandas as pd
 from openai import AsyncOpenAI
 from tqdm import tqdm
-from transformers import logging
 from utils import RangeSet

 from paddlenlp.transformers import AutoTokenizer
-
-logging.set_verbosity_info()
-logger = logging.get_logger(__name__)
+from paddlenlp.utils.log import logger


 @dataclass
```
llm/benchmark/rl/torch_infer.py

Lines changed: 0 additions & 11 deletions

```diff
@@ -70,17 +70,6 @@ def chunk(all_input_ids, size):
     return [all_input_ids[i : i + size] for i in range(0, len(all_input_ids), size)]


-@contextmanager
-def switch_level_context(level="ERROR"):
-    original_level = logger.logLevel
-    logger.set_level(level)
-
-    try:
-        yield
-    finally:
-        logger.set_level(original_level)
-
-
 class DumpyInferenceTask:
     def __init__(self, args):
         self.args = args
```
