15 changes: 8 additions & 7 deletions inference/trillium/vLLM/Llama3.1/README.md
@@ -129,14 +129,11 @@ vllm serve meta-llama/Llama-3.1-70B-Instruct \
For the 8B model on a v6e-1 (1-chip) instance, we recommend `--max-num-batched-tokens 1024 --max-num-seqs 128`.
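
A minimal sketch of how those flags slot into the serve command for the 8B model (the model name and the remaining flag values here are assumptions; mirror the full 70B command above for your setup):

```bash
# Sketch only: the recommended flags in the context of a full serve command.
# The 8B model name and the other flag values are assumptions; adjust to match
# the 70B command shown above for your actual setup.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-batched-tokens 1024 \
  --max-num-seqs 128 \
  --tensor-parallel-size 1 \
  --max-model-len 4096
```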

It takes a few minutes depending on the model size to prepare the server.
Once you see the below snippet in the logs, it means that the server is ready
to serve requests or run benchmarks:
Once you see the `Application startup complete.` message in the logs, it means that the server is ready to serve requests.

```bash
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
(APIServer pid=7) INFO: Waiting for application startup.
(APIServer pid=7) INFO: Application startup complete.
```
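
If you would rather script the wait than watch the logs, a minimal polling sketch against the server's health endpoint (assuming the default port 8000) looks like this:

```bash
# Poll the health endpoint until the server reports ready (default port 8000 assumed).
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for the vLLM server..."
  sleep 10
done
echo "server is ready"
```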

## Step 7: Prepare the test environment
@@ -153,12 +150,16 @@ export PROJECT=your-tpu-project
gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 8: access the running container
## Step 8: Access the running container

To install dependencies and run the benchmark, you first need to enter the running container.

```bash
sudo docker exec -it $USER-vllm bash
```

The following steps for testing and benchmarking should be executed from within this container shell.
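
As a quick sanity check (not part of the original steps), you can confirm that vLLM is importable once you are inside the container:

```bash
# Inside the container: confirm the vLLM installation is importable.
python -c "import vllm; print(vllm.__version__)"
```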

## Step 9: Test the server

Let's submit a test request to the server. This confirms that the server launched properly and that the model returns a legitimate response.
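
As a rough sketch (the model name, prompt, and sampling parameters below are illustrative placeholders, assuming the default port 8000), a completion request against the OpenAI-compatible endpoint looks like this:

```bash
# Illustrative request only: adjust the model name, prompt, and sampling
# parameters to your setup (default port 8000 assumed).
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 32,
    "temperature": 0
  }'
```
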
21 changes: 7 additions & 14 deletions inference/trillium/vLLM/Qwen2.5-32B/README.md
@@ -63,19 +63,15 @@ Now we serve the vllm server. Make sure you keep this terminal open for the enti
```bash
export MAX_MODEL_LEN=4096
export TP=4 # number of chips
# export RATIO=0.8
# export PREFIX_LEN=0

VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-32B --seed 42 --disable-log-requests --gpu-memory-utilization 0.98 --max-num-batched-tokens 2048 --max-num-seqs 128 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN
vllm serve Qwen/Qwen2.5-32B --seed 42 --disable-log-requests --gpu-memory-utilization 0.98 --max-num-batched-tokens 2048 --max-num-seqs 128 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN
```

It takes a few minutes depending on the model size to prepare the server - once you see the below snippet in the logs, it means that the server is ready to serve requests or run benchmarks:
Preparing the server takes a few minutes depending on the model size. Once you see the `Application startup complete.` message in the logs, the server is ready to serve requests or run benchmarks:

```bash
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
(APIServer pid=7) INFO: Waiting for application startup.
(APIServer pid=7) INFO: Application startup complete.
```
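
As an optional check (assuming the default port 8000), you can query the models endpoint to confirm that `Qwen/Qwen2.5-32B` is loaded:

```bash
# List the models the server is currently serving; Qwen/Qwen2.5-32B should appear.
curl http://localhost:8000/v1/models
```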

## Step 7: Prepare the test environment
@@ -113,15 +109,15 @@ curl http://localhost:8000/v1/completions \
}'
```

## Step 9: Preparing the test image
## Step 10: Install benchmark dependencies

You might need to install the `datasets` package, as it's not included in the base vLLM image.

```bash
pip install datasets
```
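
An optional sanity check that the package is importable before launching the benchmark:

```bash
# Optional: confirm the package is importable before running the benchmark.
python -c "import datasets; print(datasets.__version__)"
```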

## Step 10: Run the benchmarking
## Step 11: Run the benchmark

Finally, we are ready to run the benchmark:

@@ -132,16 +128,13 @@ export HF_TOKEN=<your HF token>

cd /workspace/vllm

python benchmarks/benchmark_serving.py \
--backend vllm \
vllm bench serve \
--model "Qwen/Qwen2.5-32B" \
--dataset-name random \
--num-prompts 1000 \
--random-input-len=$MAX_INPUT_LEN \
--random-output-len=$MAX_OUTPUT_LEN \
--seed 100
# --random-range-ratio=$RATIO \
# --random-prefix-len=$PREFIX_LEN
```

The snippet below shows what you'd expect to see; the exact numbers vary with the vLLM version, the model size, and the TPU instance type/size.
14 changes: 8 additions & 6 deletions inference/trillium/vLLM/Qwen3/README.md
@@ -113,13 +113,11 @@ vllm serve Qwen/Qwen3-32B \

For the 4B model, we recommend `--max-num-batched-tokens 1024 --max-num-seqs 128`.

It takes a few minutes depending on the model size to prepare the server - once you see the below snippet in the logs, it means that the server is ready to serve requests or run benchmarks:
Preparing the server takes a few minutes depending on the model size. Once you see the `Application startup complete.` message in the logs, the server is ready to serve requests or run benchmarks:

```bash
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
(APIServer pid=7) INFO: Waiting for application startup.
(APIServer pid=7) INFO: Application startup complete.
```

## Step 7: Prepare the test environment
@@ -136,12 +134,16 @@ export PROJECT=your-tpu-project
gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 8: access the running container
## Step 8: Access the running container

To install dependencies and run the benchmark, you first need to enter the running container.

```bash
sudo docker exec -it $USER-vllm bash
```

The following steps for testing and benchmarking should be executed from within this container shell.

## Step 9: Test the server

Let's submit a test request to the server. This confirms that the server launched properly and that the model returns a legitimate response.
21 changes: 21 additions & 0 deletions inference/trillium/vLLM/README.md
@@ -10,3 +10,24 @@ This repository provides examples demonstrating how to deploy and serve vLLM on
These models were chosen for demonstration purposes only. You can serve any model from this list: [vLLM Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)

If you are looking for GKE-based deployment, please refer to this documentation: [Serve an LLM using TPU Trillium on GKE with vLLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-vllm-tpu)

## Choosing the Right TPU Configuration

Selecting the appropriate TPU size is critical for performance and cost-effectiveness. The goal is to use the smallest TPU configuration that can accommodate the model's memory requirements. These recommendations assume the model is running in a standard 16-bit precision format like bfloat16 or float16.

* **✅ Recommended:** The most cost-effective configuration.
* **⚠️ Overkill:** The model will run, but the TPU is larger and more expensive than necessary.
* **❌ Insufficient Memory:** The model will not fit in the TPU's memory.

| Model | v6e-1 (32 GB) | v6e-4 (128 GB) | v6e-8 (256 GB) |
| :---- | :---: | :---: | :---: |
| **Qwen3-4B** | ✅ | ⚠️ | ⚠️ |
| **Qwen2.5-VL-7B**| ✅ | ⚠️ | ⚠️ |
| **Llama3.1-8B** | ✅ | ⚠️ | ⚠️ |
| **Qwen2.5-32B** | ❌ | ✅ | ⚠️ |
| **Qwen3-32B** | ❌ | ✅ | ⚠️ |
| **Llama3.1-70B**| ❌ | ❌ | ✅ |
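
The recommendations follow from a back-of-the-envelope estimate: in 16-bit precision the weights alone take roughly 2 bytes per parameter, and the TPU also needs headroom for the KV cache and activations. A rough sketch of that arithmetic:

```bash
# Rough weight-memory estimate in 16-bit precision (~2 bytes per parameter),
# before adding headroom for the KV cache and activations:
#   Qwen3-4B:      ~4B params  * 2 bytes ~   8 GB  -> fits on v6e-1 (32 GB)
#   Qwen2.5-32B:  ~32B params  * 2 bytes ~  64 GB  -> needs v6e-4 (128 GB)
#   Llama3.1-70B: ~70B params  * 2 bytes ~ 140 GB  -> needs v6e-8 (256 GB)
echo "$((32 * 2)) GB of weights for a 32B-parameter model, before KV cache"
```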

**Note on Topology:** The topology (e.g., `2x2` for 4 chips, `2x4` for 8 chips) describes the physical arrangement of the TPU chips. This layout affects the communication speed between chips. While any valid topology with the correct number of chips will work, a more compact topology (like `2x2` vs. `1x4`) can reduce latency and improve performance for communication-heavy models. For general use, the default topology is usually sufficient, but performance-critical applications may benefit from tuning this setting.

**Note on Availability:** Acquiring on-demand TPUs can be challenging due to high demand. If you encounter capacity limits in one zone, we recommend trying a different zone or using [Queued Resources](https://cloud.google.com/tpu/docs/queued-resources) to ensure you get the required capacity.
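
For reference, a Queued Resources request for a v6e-4 slice looks roughly like the sketch below; the zone, resource names, and runtime version string are placeholders to verify against the linked documentation for your project:

```bash
# Placeholder values throughout: verify the zone, accelerator type, and runtime
# version against the Cloud TPU documentation before submitting the request.
gcloud compute tpus queued-resources create my-v6e-request \
  --node-id my-v6e-node \
  --project your-tpu-project \
  --zone us-east5-b \
  --accelerator-type v6e-4 \
  --runtime-version v2-alpha-tpuv6e
```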