
Commit a1b15de

Qwen3 235b example (#5425)
* Add qwen3 example
* Add port
* Add example output
* Add host 0.0.0.0
* Add news
1 parent 2b5774c commit a1b15de

3 files changed: +68 −5 lines


README.md

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@

----

:fire: *News* :fire:

- [Apr 2025] Spin up **Qwen3** on your cluster/cloud: [**example**](./llm/qwen/)
- [Mar 2025] Run and serve **Google Gemma 3** using SkyPilot [**example**](./llm/gemma3/)
- [Feb 2025] Prepare and serve **Retrieval Augmented Generation (RAG) with DeepSeek-R1**: [**blog post**](https://blog.skypilot.co/deepseek-rag), [**example**](./llm/rag/)
- [Feb 2025] Run and serve **DeepSeek-R1 671B** using SkyPilot and SGLang with high throughput: [**example**](./llm/deepseek-r1/)

llm/qwen/README.md

Lines changed: 32 additions & 5 deletions
@@ -1,14 +1,22 @@

# Serving Qwen3/Qwen2 on Your Own Kubernetes or Cloud

[Qwen2](https://github.com/QwenLM/Qwen2) is one of the top open LLMs.
As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard).

📰 **Update (Apr 28, 2025) -** SkyPilot now supports the [**Qwen3**](https://qwenlm.github.io/blog/qwen3/) model!

📰 **Update (Sep 18, 2024) -** SkyPilot now supports the [**Qwen2.5**](https://qwenlm.github.io/blog/qwen2.5/) model!

📰 **Update (Jun 6, 2024) -** SkyPilot now also supports the [**Qwen2**](https://qwenlm.github.io/blog/qwen2/) model! It further improves on the already competitive Qwen1.5.

📰 **Update (April 26, 2024) -** SkyPilot now also supports the [**Qwen1.5-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) model! It performs competitively with Llama-3-70B across a [series of evaluations](https://qwenlm.github.io/blog/qwen1.5-110b/#model-quality). Use [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml) to serve the 110B model.

## One command to start Qwen3

```bash
sky launch -c qwen qwen3-235b.yaml
```
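The 235B-parameter FP8 checkpoint is a few hundred gigabytes, so the first launch can spend a while downloading weights before the server is ready (hence the generous `initial_delay_seconds` in the service YAML later in this diff). You can follow progress with `sky logs qwen`.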
<p align="center">
<img src="https://i.imgur.com/d7tEhAl.gif" alt="qwen" width="600"/>
</p>
@@ -32,7 +40,7 @@ After [installing SkyPilot](https://docs.skypilot.co/en/latest/getting-started/i

1. Start serving Qwen3 235B on a single instance with any available GPU in the list specified in [qwen3-235b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen3-235b.yaml) with an OpenAI-compatible endpoint (you can also switch to [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml), [qwen25-72b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen25-72b.yaml), or [qwen25-7b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen25-7b.yaml) for a smaller model):

```console
- sky launch -c qwen qwen15-110b.yaml
+ sky launch -c qwen qwen3-235b.yaml
```

2. Send a request to the endpoint for completion:
@@ -41,7 +49,7 @@ ENDPOINT=$(sky status --endpoint 8000 qwen)

```bash
curl http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
-   "model": "Qwen/Qwen1.5-110B-Chat",
+   "model": "Qwen/Qwen3-235B-A22B-FP8",
    "prompt": "My favorite food is",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
```
@@ -52,7 +60,7 @@ curl http://$ENDPOINT/v1/chat/completions \

```bash
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-   "model": "Qwen/Qwen1.5-110B-Chat",
+   "model": "Qwen/Qwen3-235B-A22B-FP8",
    "messages": [
      {
        "role": "system",
```
@@ -66,6 +74,25 @@ curl http://$ENDPOINT/v1/chat/completions \

```bash
    "max_tokens": 512
  }' | jq -r '.choices[0].message.content'
```

<details>
<summary>Qwen3 output</summary>

```
The concept of "the best food" is highly subjective and depends on personal preferences, cultural background, dietary needs, and even mood! For example:

- **Some crave comfort foods** like macaroni and cheese, ramen, or dumplings.
- **Others prioritize health** and might highlight dishes like quinoa bowls, grilled salmon, or fresh salads.
- **Global favorites** often include pizza, sushi, tacos, or curry.
- **Unique or adventurous eaters** might argue for dishes like insects, fermented foods, or molecular gastronomy creations.

Could you clarify what you mean by "best"? For instance:
- Are you asking about taste, health benefits, cultural significance, or something else?
- Are you looking for a specific dish, ingredient, or cuisine?

This helps me tailor a more meaningful answer! 😊
```

</details>
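The same OpenAI-compatible endpoint can also be called programmatically. Below is a minimal sketch, not part of this commit: it assumes the `openai` Python package (`pip install openai`) and a cluster that is already up, with the endpoint address taken from `sky status --endpoint 8000 qwen`.

```python
# Minimal sketch (assumption, not from this commit): call the
# OpenAI-compatible endpoint served by SGLang from Python.
from openai import OpenAI

# Hypothetical address; substitute the output of
# `sky status --endpoint 8000 qwen`.
ENDPOINT = "1.2.3.4:8000"

# The server does not check API keys, so any placeholder works.
client = OpenAI(base_url=f"http://{ENDPOINT}/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-FP8",
    messages=[
        {"role": "user", "content": "My favorite food is"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```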
## Running Multimodal Qwen2-VL

llm/qwen/qwen3-235b.yaml

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
```yaml
envs:
  MODEL_NAME: Qwen/Qwen3-235B-A22B-FP8

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2

resources:
  accelerators: {A100:8, A100-80GB:4, A100-80GB:8, H100:8, H200:8}
  disk_size: 1024
  disk_tier: best
  memory: 32+
  ports: 8000

setup: |
  uv pip install "sglang>=0.4.6"

run: |
  export PATH=$PATH:/sbin
  export SGL_ENABLE_JIT_DEEPGEMM=1
  # --tp 4 is required even with 8 GPUs, as the output size
  # of qwen3 is not divisible by quantization block_n=128
  python3 -m sglang.launch_server --model $MODEL_NAME \
    --tp 4 --reasoning-parser qwen3 --port 8000 --host 0.0.0.0
```
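Note that the `service:` section (readiness probe, `replicas: 2`) only takes effect when the YAML is deployed through SkyServe; presumably `sky serve up llm/qwen/qwen3-235b.yaml` would bring up two load-balanced replicas, while the plain `sky launch` shown above starts a single cluster and leaves the service fields unused.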
