From 9b4d419d0ab0790c36a1e406fc151636c8934ed2 Mon Sep 17 00:00:00 2001 From: Rob Mulla Date: Fri, 10 Oct 2025 11:58:15 -0400 Subject: [PATCH 1/2] Docs: Address feedback from bug bash summary - Update Qwen2.5-32B recipe with correct benchmark commands and step numbering. - Add a TPU/model sizing guide to the main vLLM README. - Clarify the purpose of 'docker exec' in all recipes. - Standardize the example log output format. --- inference/trillium/vLLM/Llama3.1/README.md | 15 ++++++------- inference/trillium/vLLM/Qwen2.5-32B/README.md | 21 +++++++------------ inference/trillium/vLLM/Qwen3/README.md | 14 +++++++------ inference/trillium/vLLM/README.md | 21 +++++++++++++++++++ 4 files changed, 44 insertions(+), 27 deletions(-) diff --git a/inference/trillium/vLLM/Llama3.1/README.md b/inference/trillium/vLLM/Llama3.1/README.md index dbf4be8..05594d2 100644 --- a/inference/trillium/vLLM/Llama3.1/README.md +++ b/inference/trillium/vLLM/Llama3.1/README.md @@ -129,14 +129,11 @@ vllm serve meta-llama/Llama-3.1-70B-Instruct \ For the 8B model on a v6e-1 (1-chip) instance, we recommend `--max-num-batched-tokens 1024 --max-num-seqs 128`. It takes a few minutes depending on the model size to prepare the server. -Once you see the below snippet in the logs, it means that the server is ready -to serve requests or run benchmarks: +Once you see the `Application startup complete.` message in the logs, it means that the server is ready to serve requests. ```bash -INFO: Started server process [7] -INFO: Waiting for application startup. -INFO: Application startup complete. -INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) +(APIServer pid=7) INFO: Waiting for application startup. +(APIServer pid=7) INFO: Application startup complete. ``` ## Step 7: Prepare the test environment @@ -153,12 +150,16 @@ export PROJECT=your-tpu-project gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE ``` -## Step 8: access the running container +## Step 8: Access the running container + +To run the benchmark and install dependencies, you first need to enter the running container. ```bash sudo docker exec -it $USER-vllm bash ``` +The following steps for testing and benchmarking should be executed from within this container shell. + ## Step 9: Test the server Let's submit a test request to the server. This helps us to see if the server is launched properly and we can see legitimate response from the model. diff --git a/inference/trillium/vLLM/Qwen2.5-32B/README.md b/inference/trillium/vLLM/Qwen2.5-32B/README.md index fc23f6e..73eccbc 100644 --- a/inference/trillium/vLLM/Qwen2.5-32B/README.md +++ b/inference/trillium/vLLM/Qwen2.5-32B/README.md @@ -63,19 +63,15 @@ Now we serve the vllm server. 
Make sure you keep this terminal open for the enti ```bash export MAX_MODEL_LEN=4096 export TP=4 # number of chips -# export RATIO=0.8 -# export PREFIX_LEN=0 -VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-32B --seed 42 --disable-log-requests --gpu-memory-utilization 0.98 --max-num-batched-tokens 2048 --max-num-seqs 128 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN +vllm serve Qwen/Qwen2.5-32B --seed 42 --disable-log-requests --gpu-memory-utilization 0.98 --max-num-batched-tokens 2048 --max-num-seqs 128 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN ``` -It takes a few minutes depending on the model size to prepare the server - once you see the below snippet in the logs, it means that the server is ready to serve requests or run benchmarks: +It takes a few minutes depending on the model size to prepare the server - once you see the `Application startup complete.` message in the logs, it means that the server is ready to serve requests or run benchmarks: ```bash -INFO: Started server process [7] -INFO: Waiting for application startup. -INFO: Application startup complete. -INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) +(APIServer pid=7) INFO: Waiting for application startup. +(APIServer pid=7) INFO: Application startup complete. ``` ## Step 7: Prepare the test environment @@ -113,7 +109,7 @@ curl http://localhost:8000/v1/completions \ }' ``` -## Step 9: Preparing the test image +## Step 10: Install Benchmark Dependencies You might need to install datasets as it's not available in the base vllm image. @@ -121,7 +117,7 @@ You might need to install datasets as it's not available in the base vllm image. pip install datasets ``` -## Step 10: Run the benchmarking +## Step 11: Run the benchmarking Finally, we are ready to run the benchmark: @@ -132,16 +128,13 @@ export HF_TOKEN= cd /workspace/vllm -python benchmarks/benchmark_serving.py \ - --backend vllm \ +vllm bench serve \ --model "Qwen/Qwen2.5-32B" \ --dataset-name random \ --num-prompts 1000 \ --random-input-len=$MAX_INPUT_LEN \ --random-output-len=$MAX_OUTPUT_LEN \ --seed 100 - # --random-range-ratio=$RATIO \ - # --random-prefix-len=$PREFIX_LEN ``` The snippet below is what you’d expect to see - the numbers vary based on the vllm version, the model size and the TPU instance type/size. diff --git a/inference/trillium/vLLM/Qwen3/README.md b/inference/trillium/vLLM/Qwen3/README.md index cbab256..7e98743 100644 --- a/inference/trillium/vLLM/Qwen3/README.md +++ b/inference/trillium/vLLM/Qwen3/README.md @@ -113,13 +113,11 @@ vllm serve Qwen/Qwen3-32B \ For the 4B model, we recommend `--max-num-batched-tokens 1024 --max-num-seqs 128`. -It takes a few minutes depending on the model size to prepare the server - once you see the below snippet in the logs, it means that the server is ready to serve requests or run benchmarks: +It takes a few minutes depending on the model size to prepare the server - once you see the `Application startup complete.` message in the logs, it means that the server is ready to serve requests or run benchmarks: ```bash -INFO: Started server process [7] -INFO: Waiting for application startup. -INFO: Application startup complete. -INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) +(APIServer pid=7) INFO: Waiting for application startup. +(APIServer pid=7) INFO: Application startup complete. 
``` ## Step 7: Prepare the test environment @@ -136,12 +134,16 @@ export PROJECT=your-tpu-project gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE ``` -## Step 8: access the running container +## Step 8: Access the running container + +To run the benchmark and install dependencies, you first need to enter the running container. ```bash sudo docker exec -it $USER-vllm bash ``` +The following steps for testing and benchmarking should be executed from within this container shell. + ## Step 9: Test the server Let's submit a test request to the server. This helps us to see if the server is launched properly and we can see legitimate response from the model. diff --git a/inference/trillium/vLLM/README.md b/inference/trillium/vLLM/README.md index 4ff5f7d..0076822 100644 --- a/inference/trillium/vLLM/README.md +++ b/inference/trillium/vLLM/README.md @@ -10,3 +10,24 @@ This repository provides examples demonstrating how to deploy and serve vLLM on These models were chosen for demonstration purposes only. You can serve any model from this list: [vLLM Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html) If you are looking for GKE-based deployment, please refer to this documentation: [Serve an LLM using TPU Trillium on GKE with vLLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-vllm-tpu) + +## Choosing the Right TPU Configuration + +Selecting the appropriate TPU size is critical for performance and cost-effectiveness. The goal is to use the smallest TPU configuration that can accommodate the model's memory requirements. These recommendations assume the model is running in a standard 16-bit precision format like bfloat16 or float16. + +* **✅ Recommended:** The most cost-effective configuration. +* **⚠️ Overkill:** The model will run, but the TPU is larger and more expensive than necessary. +* **❌ Insufficient Memory:** The model will not fit in the TPU's memory. + +| Model | v6e-1 (32 GB) | v6e-4 (128 GB) | v6e-8 (256 GB) | +| :---- | :---: | :---: | :---: | +| **Qwen3-4B** | ✅ | ⚠️ | ⚠️ | +| **Qwen2.5-VL-7B**| ✅ | ⚠️ | ⚠️ | +| **Llama3.1-8B** | ✅ | ⚠️ | ⚠️ | +| **Qwen2.5-32B** | ❌ | ✅ | ⚠️ | +| **Qwen3-32B** | ❌ | ✅ | ⚠️ | +| **Llama3.1-70B**| ❌ | ❌ | ✅ | + +**Note on Topology:** The topology (e.g., `2x2` for 4 chips, `2x4` for 8 chips) describes the physical arrangement of the TPU chips. This layout affects the communication speed between chips. While any valid topology with the correct number of chips will work, a more compact topology (like `2x2` vs. `1x4`) can reduce latency and improve performance for communication-heavy models. For general use, the default topology is usually sufficient, but performance-critical applications may benefit from tuning this setting. + +**Note on Availability:** Acquiring on-demand TPUs can be challenging due to high demand. If you encounter capacity limits in one zone, we recommend trying a different zone or using [Queued Resources](https://cloud.google.com/tpu/docs/queued-resources) to ensure you get the required capacity. From eee1fa6c6aca2e69f15ebb3300f38ca2d7f9b319 Mon Sep 17 00:00:00 2001 From: Rob Mulla Date: Wed, 15 Oct 2025 11:10:17 -0400 Subject: [PATCH 2/2] chore: re-trigger CI
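
As a rough, back-of-the-envelope companion to the sizing table added above: bfloat16 weights take about 2 bytes per parameter, and each v6e chip provides 32 GB of HBM, so you can sanity-check a configuration before creating a TPU. The snippet below is only a sketch; the variable names (`PARAMS_B`, `NUM_CHIPS`) and the ~25% of HBM reserved for the KV cache and activations are illustrative assumptions, not official guidance.

```bash
# Back-of-the-envelope check: do the bf16 weights fit in the slice's HBM?
# Assumptions: ~2 bytes per parameter (bfloat16), 32 GB HBM per v6e chip,
# and ~25% of HBM kept free for the KV cache and activations.
PARAMS_B=32          # model size in billions of parameters, e.g. Qwen2.5-32B
NUM_CHIPS=4          # chips in the slice, e.g. v6e-4
HBM_PER_CHIP_GB=32

WEIGHTS_GB=$(( PARAMS_B * 2 ))                        # bf16 weight footprint
USABLE_GB=$(( NUM_CHIPS * HBM_PER_CHIP_GB * 3 / 4 ))  # leave ~25% headroom

if [ "$WEIGHTS_GB" -le "$USABLE_GB" ]; then
  echo "Likely fits: ~${WEIGHTS_GB} GB of weights vs ~${USABLE_GB} GB of usable HBM"
else
  echo "Likely too large: ~${WEIGHTS_GB} GB of weights vs ~${USABLE_GB} GB of usable HBM"
fi
```

Under these assumptions the arithmetic roughly reproduces the table above: Qwen2.5-32B (~64 GB of weights) needs at least a v6e-4, while Llama3.1-70B (~140 GB) only fits on a v6e-8.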