From 9b4d419d0ab0790c36a1e406fc151636c8934ed2 Mon Sep 17 00:00:00 2001 From: Rob Mulla Date: Fri, 10 Oct 2025 11:58:15 -0400 Subject: [PATCH 1/2] Docs: Address feedback from bug bash summary - Update Qwen2.5-32B recipe with correct benchmark commands and step numbering. - Add a TPU/model sizing guide to the main vLLM README. - Clarify the purpose of 'docker exec' in all recipes. - Standardize the example log output format. --- inference/trillium/vLLM/Llama3.1/README.md | 15 ++++++------- inference/trillium/vLLM/Qwen2.5-32B/README.md | 21 +++++++------------ inference/trillium/vLLM/Qwen3/README.md | 14 +++++++------ inference/trillium/vLLM/README.md | 21 +++++++++++++++++++ 4 files changed, 44 insertions(+), 27 deletions(-) diff --git a/inference/trillium/vLLM/Llama3.1/README.md b/inference/trillium/vLLM/Llama3.1/README.md index dbf4be8..05594d2 100644 --- a/inference/trillium/vLLM/Llama3.1/README.md +++ b/inference/trillium/vLLM/Llama3.1/README.md @@ -129,14 +129,11 @@ vllm serve meta-llama/Llama-3.1-70B-Instruct \ For the 8B model on a v6e-1 (1-chip) instance, we recommend `--max-num-batched-tokens 1024 --max-num-seqs 128`. It takes a few minutes depending on the model size to prepare the server. -Once you see the below snippet in the logs, it means that the server is ready -to serve requests or run benchmarks: +Once you see the `Application startup complete.` message in the logs, it means that the server is ready to serve requests. ```bash -INFO: Started server process [7] -INFO: Waiting for application startup. -INFO: Application startup complete. -INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) +(APIServer pid=7) INFO: Waiting for application startup. +(APIServer pid=7) INFO: Application startup complete. ``` ## Step 7: Prepare the test environment @@ -153,12 +150,16 @@ export PROJECT=your-tpu-project gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE ``` -## Step 8: access the running container +## Step 8: Access the running container + +To run the benchmark and install dependencies, you first need to enter the running container. ```bash sudo docker exec -it $USER-vllm bash ``` +The following steps for testing and benchmarking should be executed from within this container shell. + ## Step 9: Test the server Let's submit a test request to the server. This helps us to see if the server is launched properly and we can see legitimate response from the model. diff --git a/inference/trillium/vLLM/Qwen2.5-32B/README.md b/inference/trillium/vLLM/Qwen2.5-32B/README.md index fc23f6e..73eccbc 100644 --- a/inference/trillium/vLLM/Qwen2.5-32B/README.md +++ b/inference/trillium/vLLM/Qwen2.5-32B/README.md @@ -63,19 +63,15 @@ Now we serve the vllm server. 
Make sure you keep this terminal open for the enti ```bash export MAX_MODEL_LEN=4096 export TP=4 # number of chips -# export RATIO=0.8 -# export PREFIX_LEN=0 -VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-32B --seed 42 --disable-log-requests --gpu-memory-utilization 0.98 --max-num-batched-tokens 2048 --max-num-seqs 128 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN +vllm serve Qwen/Qwen2.5-32B --seed 42 --disable-log-requests --gpu-memory-utilization 0.98 --max-num-batched-tokens 2048 --max-num-seqs 128 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN ``` -It takes a few minutes depending on the model size to prepare the server - once you see the below snippet in the logs, it means that the server is ready to serve requests or run benchmarks: +It takes a few minutes depending on the model size to prepare the server - once you see the `Application startup complete.` message in the logs, it means that the server is ready to serve requests or run benchmarks: ```bash -INFO: Started server process [7] -INFO: Waiting for application startup. -INFO: Application startup complete. -INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) +(APIServer pid=7) INFO: Waiting for application startup. +(APIServer pid=7) INFO: Application startup complete. ``` ## Step 7: Prepare the test environment @@ -113,7 +109,7 @@ curl http://localhost:8000/v1/completions \ }' ``` -## Step 9: Preparing the test image +## Step 10: Install Benchmark Dependencies You might need to install datasets as it's not available in the base vllm image. @@ -121,7 +117,7 @@ You might need to install datasets as it's not available in the base vllm image. pip install datasets ``` -## Step 10: Run the benchmarking +## Step 11: Run the benchmarking Finally, we are ready to run the benchmark: @@ -132,16 +128,13 @@ export HF_TOKEN= cd /workspace/vllm -python benchmarks/benchmark_serving.py \ - --backend vllm \ +vllm bench serve \ --model "Qwen/Qwen2.5-32B" \ --dataset-name random \ --num-prompts 1000 \ --random-input-len=$MAX_INPUT_LEN \ --random-output-len=$MAX_OUTPUT_LEN \ --seed 100 - # --random-range-ratio=$RATIO \ - # --random-prefix-len=$PREFIX_LEN ``` The snippet below is what you’d expect to see - the numbers vary based on the vllm version, the model size and the TPU instance type/size. diff --git a/inference/trillium/vLLM/Qwen3/README.md b/inference/trillium/vLLM/Qwen3/README.md index cbab256..7e98743 100644 --- a/inference/trillium/vLLM/Qwen3/README.md +++ b/inference/trillium/vLLM/Qwen3/README.md @@ -113,13 +113,11 @@ vllm serve Qwen/Qwen3-32B \ For the 4B model, we recommend `--max-num-batched-tokens 1024 --max-num-seqs 128`. -It takes a few minutes depending on the model size to prepare the server - once you see the below snippet in the logs, it means that the server is ready to serve requests or run benchmarks: +It takes a few minutes depending on the model size to prepare the server - once you see the `Application startup complete.` message in the logs, it means that the server is ready to serve requests or run benchmarks: ```bash -INFO: Started server process [7] -INFO: Waiting for application startup. -INFO: Application startup complete. -INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) +(APIServer pid=7) INFO: Waiting for application startup. +(APIServer pid=7) INFO: Application startup complete. 
``` ## Step 7: Prepare the test environment @@ -136,12 +134,16 @@ export PROJECT=your-tpu-project gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE ``` -## Step 8: access the running container +## Step 8: Access the running container + +To run the benchmark and install dependencies, you first need to enter the running container. ```bash sudo docker exec -it $USER-vllm bash ``` +The following steps for testing and benchmarking should be executed from within this container shell. + ## Step 9: Test the server Let's submit a test request to the server. This helps us to see if the server is launched properly and we can see legitimate response from the model. diff --git a/inference/trillium/vLLM/README.md b/inference/trillium/vLLM/README.md index 4ff5f7d..0076822 100644 --- a/inference/trillium/vLLM/README.md +++ b/inference/trillium/vLLM/README.md @@ -10,3 +10,24 @@ This repository provides examples demonstrating how to deploy and serve vLLM on These models were chosen for demonstration purposes only. You can serve any model from this list: [vLLM Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html) If you are looking for GKE-based deployment, please refer to this documentation: [Serve an LLM using TPU Trillium on GKE with vLLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-vllm-tpu) + +## Choosing the Right TPU Configuration + +Selecting the appropriate TPU size is critical for performance and cost-effectiveness. The goal is to use the smallest TPU configuration that can accommodate the model's memory requirements. These recommendations assume the model is running in a standard 16-bit precision format like bfloat16 or float16. + +* **✅ Recommended:** The most cost-effective configuration. +* **⚠️ Overkill:** The model will run, but the TPU is larger and more expensive than necessary. +* **❌ Insufficient Memory:** The model will not fit in the TPU's memory. + +| Model | v6e-1 (32 GB) | v6e-4 (128 GB) | v6e-8 (256 GB) | +| :---- | :---: | :---: | :---: | +| **Qwen3-4B** | ✅ | ⚠️ | ⚠️ | +| **Qwen2.5-VL-7B**| ✅ | ⚠️ | ⚠️ | +| **Llama3.1-8B** | ✅ | ⚠️ | ⚠️ | +| **Qwen2.5-32B** | ❌ | ✅ | ⚠️ | +| **Qwen3-32B** | ❌ | ✅ | ⚠️ | +| **Llama3.1-70B**| ❌ | ❌ | ✅ | + +**Note on Topology:** The topology (e.g., `2x2` for 4 chips, `2x4` for 8 chips) describes the physical arrangement of the TPU chips. This layout affects the communication speed between chips. While any valid topology with the correct number of chips will work, a more compact topology (like `2x2` vs. `1x4`) can reduce latency and improve performance for communication-heavy models. For general use, the default topology is usually sufficient, but performance-critical applications may benefit from tuning this setting. + +**Note on Availability:** Acquiring on-demand TPUs can be challenging due to high demand. If you encounter capacity limits in one zone, we recommend trying a different zone or using [Queued Resources](https://cloud.google.com/tpu/docs/queued-resources) to ensure you get the required capacity. From eee1fa6c6aca2e69f15ebb3300f38ca2d7f9b319 Mon Sep 17 00:00:00 2001 From: Rob Mulla Date: Wed, 15 Oct 2025 11:10:17 -0400 Subject: [PATCH 2/2] chore: re-trigger CI
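
As a rough, back-of-the-envelope companion to the sizing table added above: bfloat16 weights take about 2 bytes per parameter, and each v6e chip provides 32 GB of HBM, so you can sanity-check a configuration before creating a TPU. The snippet below is only a sketch; the variable names (`PARAMS_B`, `NUM_CHIPS`) and the ~25% of HBM reserved for the KV cache and activations are illustrative assumptions, not official guidance.

```bash
# Back-of-the-envelope check: do the bf16 weights fit in the slice's HBM?
# Assumptions: ~2 bytes per parameter (bfloat16), 32 GB HBM per v6e chip,
# and ~25% of HBM kept free for the KV cache and activations.
PARAMS_B=32          # model size in billions of parameters, e.g. Qwen2.5-32B
NUM_CHIPS=4          # chips in the slice, e.g. v6e-4
HBM_PER_CHIP_GB=32

WEIGHTS_GB=$(( PARAMS_B * 2 ))                        # bf16 weight footprint
USABLE_GB=$(( NUM_CHIPS * HBM_PER_CHIP_GB * 3 / 4 ))  # leave ~25% headroom

if [ "$WEIGHTS_GB" -le "$USABLE_GB" ]; then
  echo "Likely fits: ~${WEIGHTS_GB} GB of weights vs ~${USABLE_GB} GB of usable HBM"
else
  echo "Likely too large: ~${WEIGHTS_GB} GB of weights vs ~${USABLE_GB} GB of usable HBM"
fi
```

Under these assumptions the arithmetic roughly reproduces the table above: Qwen2.5-32B (~64 GB of weights) needs at least a v6e-4, while Llama3.1-70B (~140 GB) only fits on a v6e-8.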