
Commit b507eeb

add readme file
1 parent 2c139d4 commit b507eeb

File tree: 3 files changed (+187 −15 lines)


llm/benchmark/rl/README.md

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
# Large Language Model Throughput Testing Framework

This framework is designed to test the **throughput performance** of Large Language Models (LLMs) across different deployment methods: **offline inference** (using PaddlePaddle and PyTorch) and **online inference** (via API). It specifically focuses on evaluating the model's ability to handle **batch queries**, measuring throughput in tokens per second under various configurations and batch sizes.

## Features

* **Diverse Deployment Method Support:** Tests LLMs deployed via online API, and offline inference with PaddlePaddle and PyTorch.
* **Batch Query Throughput Calculation:** Accurately measures throughput (tokens/s) for concurrent queries, providing insights into the model's performance under load.
* **Detailed Time Logging:** Records the total time for each batch processing operation.

## How It Works

This framework operates by sending batched requests to specified endpoints (API or local inference scripts) and collecting performance data on how the model generates responses for multiple queries simultaneously.

1. **Input Queries:** You provide a set of questions as test input, typically from a `.parquet` or text file.
2. **Batch Processing:** The framework groups these questions into batches of a specified `rollout_input_batch_size`.
3. **Generate Responses:** For each query within a batch, the framework requests the model to generate `rollout_n` responses.
4. **Time Measurement:** The total time from sending a batch of questions to receiving all corresponding responses is recorded.
5. **Token Statistics:** The total number of tokens generated across all responses in a batch is summed.
6. **Throughput Calculation:** Throughput (tokens/s) is calculated by dividing the total tokens generated by the total time taken for the batch to complete (see the sketch below).
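
The per-batch measurement reduces to a few lines of bookkeeping. Below is a minimal, illustrative Python sketch of steps 2 through 6; `generate_fn` and the other names are placeholders standing in for whichever backend is under test, not the framework's actual API.

```python
import time


def chunk(items, size):
    """Group the input questions into batches of `rollout_input_batch_size` (step 2)."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def measure_throughput(generate_fn, questions, batch_size, rollout_n):
    """Return per-batch throughput in tokens/s.

    `generate_fn(question, n)` is a placeholder for the backend call (API,
    PaddlePaddle, or PyTorch inference); it is assumed to return `n`
    token-id lists for one question.
    """
    results = []
    for batch in chunk(questions, batch_size):                              # step 2
        start = time.perf_counter()
        responses = [generate_fn(q, rollout_n) for q in batch]              # step 3
        elapsed = time.perf_counter() - start                               # step 4
        total_tokens = sum(len(ids) for resp in responses for ids in resp)  # step 5
        results.append(total_tokens / elapsed)                              # step 6
    return results
```
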
## Usage

This section details how to run throughput tests for each deployment method using the provided shell scripts.

### Data Preparation

To run the tests, first download and extract the `rl_data.tar.gz` archive, which contains the GSM8K dataset in a format suitable for testing:

```bash
cd llm/benchmark/rl
wget https://paddle-qa.bj.bcebos.com/paddlenlp/rl_data.tar.gz
tar -zxvf rl_data.tar.gz
```

Extracting the archive creates a `data` folder containing the GSM8K dataset.
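
If you want to sanity-check the extracted data before running a benchmark, the parquet file can be inspected with pandas. This is an optional check; the exact columns depend on the packaged dataset and are not guaranteed here.

```python
import pandas as pd

# Peek at the GSM8K split shipped in rl_data.tar.gz.
df = pd.read_parquet("./data/gsm8k/instruct/train.parquet")
print(df.shape)           # number of rows and columns
print(list(df.columns))   # column names vary by dataset; inspect before relying on them
print(df.head(2))         # first two records
```
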
### Online API Inference

This script tests the throughput of a remote LLM API.

**Configuration (`api_serve.sh`):**

```bash
output_dir="api_serve_results"

python api_serve.py \
    --openai_urls "your_url1" "your_url2" \
    --api_keys "key1" "key2" \
    --model "Qwen2.5-7B-Instruct-1M" \
    --tokenizer "Qwen/Qwen2.5-7B-Instruct-1M" \
    --input_file ./data/gsm8k/instruct/train.parquet \
    --output_dir ${output_dir} \
    --rollout_input_batch_size 8 \
    --rollout_n 8 \
    --top_p 1.0 \
    --temperature 0.7 \
    --max_dec_len 8192 \
    --limit_rows 512
```

* **`--openai_urls`**: URLs of the API endpoints to test.
* **`--api_keys`**: API keys for authentication (if required).
* **`--model`**: Name of the model being tested.
* **`--tokenizer`**: Path or name of the tokenizer.
* **`--input_file`**: Path to the input dataset file.
* **`--output_dir`**: Directory to save output results.
* **`--rollout_input_batch_size`**: The batch size for API requests.
* **`--rollout_n`**: Number of responses to generate for each input query.
* **`--max_dec_len`**: Maximum decoding length for responses.
* **`--limit_rows`**: Limit on the number of input rows processed.

**Run command:**

```bash
bash scripts/api_serve.sh
```
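
For context, `api_serve.py` builds on the `AsyncOpenAI` client, so a single batch maps naturally onto concurrent chat-completion requests with `n = rollout_n` samples each. The sketch below only illustrates that pattern with placeholder values (URL, key, prompts); it is not the script's actual implementation.

```python
import asyncio

from openai import AsyncOpenAI


async def run_one_batch():
    # Placeholders: substitute your endpoint, key, and questions.
    client = AsyncOpenAI(base_url="your_url1", api_key="key1")
    questions = ["Question 1 from the parquet file", "Question 2 from the parquet file"]

    # One request per question, issued concurrently; each asks for rollout_n completions.
    tasks = [
        client.chat.completions.create(
            model="Qwen2.5-7B-Instruct-1M",
            messages=[{"role": "user", "content": q}],
            n=8,              # rollout_n
            temperature=0.7,
            top_p=1.0,
            max_tokens=8192,  # max_dec_len
        )
        for q in questions
    ]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(f"received {len(r.choices)} completions")


asyncio.run(run_one_batch())
```
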
### Offline PaddlePaddle Inference

This script tests the throughput of an LLM using offline PaddlePaddle inference, optionally with distributed (multi-GPU) processing.

**Configuration (`paddle_infer.sh`):**

```bash
unset PADDLE_TRAINERS_NUM
unset PADDLE_ELASTIC_JOB_ID
unset PADDLE_TRAINER_ENDPOINTS
unset DISTRIBUTED_TRAINER_ENDPOINTS
unset FLAGS_START_PORT
unset PADDLE_ELASTIC_TIMEOUT

export PYTHONPATH="your_paddlenlp_path/PaddleNLP":$PYTHONPATH
export PYTHONPATH="your_paddlenlp_path/PaddleNLP/llm":$PYTHONPATH

export FLAGS_set_to_1d=False
export NVIDIA_TF32_OVERRIDE=0
export FLAGS_dataloader_use_file_descriptor=False
export HF_DATASETS_DOWNLOAD_TIMEOUT=1
export FLAGS_gemm_use_half_precision_compute_type=False
export FLAGS_force_cublaslt_no_reduced_precision_reduction=True

export FLAGS_custom_allreduce=0
export FLAGS_mla_use_tensorcore=0
export FLAGS_cascade_attention_max_partition_size=2048

export CUDA_VISIBLE_DEVICES=4,5,6,7
output_dir="pdpd_bf16_offline"

python -u -m paddle.distributed.launch --log_dir ${output_dir}/logs --gpus ${CUDA_VISIBLE_DEVICES} paddle_infer.py \
    --actor_model_name_or_path your_model_name \
    --max_src_len 2048 \
    --min_dec_len 32 \
    --max_dec_len 30720 \
    --top_p 1.0 \
    --temperature 1.0 \
    --rollout_input_batch_size 4 \
    --rollout_n 8 \
    --rollout_max_num_seqs 24 \
    --rollout_quant_type "" \
    --tensor_parallel_degree 4 \
    --limit_rows 640 \
    --input_file file.parquet \
    --output_dir ${output_dir} > ./paddleinfer.log 2>&1
```

* **`CUDA_VISIBLE_DEVICES`**: Specifies the GPUs to be used.
* **`paddle.distributed.launch`**: Launches a distributed PaddlePaddle job; here it runs inference across the listed GPUs.
* **`--actor_model_name_or_path`**: Path to the pre-trained model.
* **`--max_src_len`**: Maximum source sequence length.
* **`--rollout_input_batch_size`**: The batch size for inference.
* **`--rollout_n`**: Number of responses to generate for each input query.
* **`--tensor_parallel_degree`**: Degree of tensor parallelism for distributed inference; set to 4 here to match the four visible GPUs.
* **`--input_file`**: Path to the input dataset file.
* **`--output_dir`**: Directory to save output results and logs.
### Offline PyTorch Inference

This script tests the throughput of an LLM using offline PyTorch inference.

**Configuration (`torch_infer.sh`):**

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7

output_dir="vllm_bf16_offline_flashattn"

python torch_infer.py \
    --actor_model_name_or_path Qwen/Qwen2.5-7B-Instruct-1M \
    --max_src_len 2048 \
    --min_dec_len 32 \
    --max_dec_len 30720 \
    --top_p 1.0 \
    --temperature 1.0 \
    --rollout_input_batch_size 4 \
    --rollout_n 8 \
    --tensor_parallel_degree 4 \
    --limit_rows 640 \
    --input_file ./data/gsm8k/instruct/train.parquet \
    --output_dir ${output_dir} \
    --gpu_memory_utilization 0.8 > ./torchinferflashattn.log 2>&1
```

* **`--gpu_memory_utilization`**: Fraction of GPU memory to be reserved for the model.
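
The output directory name and the `gpu_memory_utilization` flag suggest this path drives a vLLM engine, though that is an inference from the configuration rather than something stated here. If you want a standalone point of comparison, a generic offline vLLM generation pass can look like the sketch below; it is not `torch_infer.py`'s code, and the prompt is a placeholder.

```python
# Generic offline vLLM pass for comparison only; NOT torch_infer.py's implementation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-1M",
    tensor_parallel_size=4,        # mirrors --tensor_parallel_degree 4
    gpu_memory_utilization=0.8,    # mirrors --gpu_memory_utilization 0.8
)
sampling = SamplingParams(
    n=8,                           # mirrors --rollout_n 8
    temperature=1.0,
    top_p=1.0,
    max_tokens=30720,              # mirrors --max_dec_len 30720
)
outputs = llm.generate(["A placeholder GSM8K-style question"], sampling)
total_tokens = sum(len(c.token_ids) for out in outputs for c in out.outputs)
print(f"generated {total_tokens} tokens")
```
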
-----

## Output Results

The `output_dir` contains the following files:

**1. Statistics Files**

`dispersed_stats.csv`

Per-batch request length and throughput statistics. Fields:
`batch_index, rollout_lengths, min_length, max_length, avg_length, completion_time, throughput_tokens_per_sec`

`global_stats.csv`

Aggregated global metrics. Fields:
`batch_index, min_response_tokens, max_response_tokens, avg_response_tokens, total_response_tokens, completion_time, throughput_tokens_per_sec`

**2. Detailed Records**

`rollout_details.jsonl`

Raw per-request outputs (JSON Lines format), including input and output text.
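
After a run, the two CSV files can be loaded with pandas to summarize or compare runs; the column names follow the field lists above, and the directory is whatever `output_dir` your script used (the `api_serve_results` example here).

```python
import pandas as pd

# Per-batch statistics: one row per processed batch.
per_batch = pd.read_csv("api_serve_results/dispersed_stats.csv")
print(per_batch["throughput_tokens_per_sec"].describe())  # spread of per-batch throughput

# Global statistics: aggregated metrics for the whole run.
overall = pd.read_csv("api_serve_results/global_stats.csv")
print(overall[["total_response_tokens", "completion_time", "throughput_tokens_per_sec"]])
```
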

llm/benchmark/rl/api_serve.py

Lines changed: 1 addition & 4 deletions

```diff
@@ -26,13 +26,10 @@
 import pandas as pd
 from openai import AsyncOpenAI
 from tqdm import tqdm
-from transformers import logging
 from utils import RangeSet

 from paddlenlp.transformers import AutoTokenizer
-
-logging.set_verbosity_info()
-logger = logging.get_logger(__name__)
+from paddlenlp.utils.log import logger


 @dataclass
```
llm/benchmark/rl/torch_infer.py

Lines changed: 0 additions & 11 deletions

```diff
@@ -70,17 +70,6 @@ def chunk(all_input_ids, size):
     return [all_input_ids[i : i + size] for i in range(0, len(all_input_ids), size)]


-@contextmanager
-def switch_level_context(level="ERROR"):
-    original_level = logger.logLevel
-    logger.set_level(level)
-
-    try:
-        yield
-    finally:
-        logger.set_level(original_level)
-
-
 class DumpyInferenceTask:
     def __init__(self, args):
         self.args = args
```
