
Commit 09b1495

Merge branch 'upstream-main' into tms/add_mamba
2 parents fb846ce + 7508a3d commit 09b1495


46 files changed: +928 lines, −469 lines

.buildkite/download-images.sh

Lines changed: 0 additions & 14 deletions
This file was deleted.

.buildkite/run-tpu-test.sh

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+set -e
+
+# Build the docker image.
+docker build -f Dockerfile.tpu -t vllm-tpu .
+
+# Set up cleanup.
+remove_docker_container() { docker rm -f tpu-test || true; }
+trap remove_docker_container EXIT
+# Remove the container that might not be cleaned up in the previous run.
+remove_docker_container
+
+# For HF_TOKEN.
+source /etc/environment
+# Run a simple end-to-end example.
+docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu \
+    python3 /workspace/vllm/examples/offline_inference_tpu.py

.buildkite/test-pipeline.yaml

Lines changed: 0 additions & 4 deletions
@@ -12,7 +12,6 @@ steps:
   fast_check_only: true
   commands:
   - pytest -v -s async_engine # Async Engine
-  - bash ../.buildkite/download-images.sh # Inputs
   - pytest -v -s test_inputs.py
   - pytest -v -s multimodal
   - pytest -v -s test_utils.py # Utils
@@ -82,7 +81,6 @@ steps:
   working_dir: "/vllm-workspace/tests"
   num_gpus: 2
   commands:
-  - bash ../.buildkite/download-images.sh
   - VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py
   - TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
   - TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
@@ -155,7 +153,6 @@ steps:
 - label: Inputs Test
   #mirror_hardwares: [amd]
   commands:
-  - bash ../.buildkite/download-images.sh
   - pytest -v -s test_inputs.py
   - pytest -v -s multimodal

@@ -175,7 +172,6 @@ steps:
 - label: Vision Language Models Test
   mirror_hardwares: [amd]
   commands:
-  - bash ../.buildkite/download-images.sh
   - pytest -v -s models -m vlm

 - label: Prefix Caching Test

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -172,7 +172,7 @@ RUN --mount=type=bind,from=mamba-builder,src=/usr/src/mamba,target=/usr/src/mamb
     python3 -m pip install /usr/src/mamba/*.whl --no-cache-dir

 RUN --mount=type=cache,target=/root/.cache/pip \
-    python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.0.8/flashinfer-0.0.8+cu121torch2.3-cp310-cp310-linux_x86_64.whl
+    python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.0.9/flashinfer-0.0.9+cu121torch2.3-cp310-cp310-linux_x86_64.whl
 #################### vLLM installation IMAGE ####################


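If you want to confirm which FlashInfer wheel actually landed in the built image, a quick check with the standard library works. A minimal sketch; the distribution name `flashinfer` is assumed to match the wheel's name:

```python
# Run inside the built image, e.g.:
#   docker run --rm <image> python3 -c "from importlib.metadata import version; print(version('flashinfer'))"
from importlib.metadata import version

# Expected to print 0.0.9 after this change.
print(version("flashinfer"))
```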
Dockerfile.tpu

Lines changed: 0 additions & 5 deletions
@@ -15,9 +15,4 @@ COPY . /workspace/vllm
 ENV VLLM_TARGET_DEVICE="tpu"
 RUN cd /workspace/vllm && python setup.py develop

-# Re-install outlines to avoid dependency errors.
-# The outlines version must follow requirements-common.txt.
-RUN pip uninstall outlines -y
-RUN pip install "outlines>=0.0.43"
-
 CMD ["/bin/bash"]

README.md

Lines changed: 10 additions & 0 deletions
@@ -16,6 +16,15 @@ Easy, fast, and cheap LLM serving for everyone

 ---

+**The Fifth vLLM Bay Area Meetup (July 24th 5pm-8pm PT)**
+
+We are excited to announce our fifth vLLM Meetup!
+Join us to hear about vLLM's recent updates and the upcoming roadmap.
+Additionally, our collaborators from AWS will be presenting their insights and experiences in deploying vLLM.
+Register now [here](https://lu.ma/lp0gyjqr) and be part of the event!
+
+---
+
 *Latest News* 🔥
 - [2024/06] We hosted [the fourth vLLM meetup](https://lu.ma/agivllm) with Cloudflare and BentoML! Please find the meetup slides [here](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing).
 - [2024/04] We hosted [the third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/) with Roblox! Please find the meetup slides [here](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing).
@@ -90,6 +99,7 @@ vLLM is a community project. Our compute resources for development and testing a
 - Databricks
 - DeepInfra
 - Dropbox
+- Google Cloud
 - Lambda Lab
 - NVIDIA
 - Replicate

docs/requirements-docs.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ sphinx==6.2.1
 sphinx-book-theme==1.0.1
 sphinx-copybutton==0.5.2
 myst-parser==2.0.0
-sphinx-argparse
+sphinx-argparse==0.4.0

 # packages to install to build the documentation
 pydantic

docs/source/community/sponsors.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ vLLM is a community project. Our compute resources for development and testing a
 - Databricks
 - DeepInfra
 - Dropbox
+- Google Cloud
 - Lambda Lab
 - NVIDIA
 - Replicate

docs/source/serving/distributed_serving.rst

Lines changed: 16 additions & 0 deletions
@@ -1,5 +1,21 @@
 .. _distributed_serving:

+How to decide the distributed inference strategy?
+=================================================
+
+Before going into the details of distributed inference and serving, let's first clarify when to use distributed inference and what strategies are available. The common practice is:
+
+- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need distributed inference. Just run inference on that GPU.
+- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU but fits in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, set the tensor parallel size to 4.
+- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallelism together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes. For example, if you have 16 GPUs across 2 nodes (8 GPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2.
+
+In short, increase the number of GPUs and nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
+
+After adding enough GPUs and nodes to hold the model, run vLLM first; it will print logs like ``# GPU blocks: 790``. Multiply that number by ``16`` (the block size) to get roughly the maximum number of tokens that can be served with the current configuration. If this number is not sufficient, e.g. you want higher throughput, keep increasing the number of GPUs or nodes until the number of blocks is enough.
+
+.. note::
+  There is one edge case: if the model fits in a single node with multiple GPUs, but the model cannot be split evenly across that number of GPUs, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
+
 Distributed Inference and Serving
 =================================

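The sizing advice added above maps directly onto vLLM's engine arguments. A minimal sketch of the single-node case; the model name and GPU count are illustrative, the ``# GPU blocks: 790`` figure is the example value from the docs rather than a measured one, and 16 is vLLM's default block size:

```python
from vllm import LLM, SamplingParams

# Single-node, multi-GPU: shard the model across 4 GPUs with tensor parallelism.
# (Model name and GPU count are illustrative.)
llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=4)

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, max_tokens=32))
print(outputs[0].outputs[0].text)

# Rough capacity estimate from the startup log described in the docs:
# a line such as "# GPU blocks: 790" with a block size of 16 tokens/block
# means roughly 790 * 16 = 12,640 tokens of KV-cache capacity.
gpu_blocks = 790   # example value from the log line in the docs
block_size = 16    # vLLM's default block size
print(f"~{gpu_blocks * block_size} tokens can be cached at once")
```

For the multi-node case, the same sizing applies to the server entrypoint, e.g. ``--tensor-parallel-size 8 --pipeline-parallel-size 2`` on the OpenAI-compatible server, assuming the pipeline-parallel flag is available in the vLLM version this commit targets.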
examples/llava_example.py

Lines changed: 3 additions & 30 deletions
@@ -1,20 +1,13 @@
-import os
-import subprocess
-
-from PIL import Image
-
 from vllm import LLM
-
-# The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`.
-# You can use `.buildkite/download-images.sh` to download them
+from vllm.assets.image import ImageAsset


 def run_llava():
     llm = LLM(model="llava-hf/llava-1.5-7b-hf")

     prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

-    image = Image.open("images/stop_sign.jpg")
+    image = ImageAsset("stop_sign").pil_image

     outputs = llm.generate({
         "prompt": prompt,
@@ -28,25 +21,5 @@ def run_llava():
         print(generated_text)


-def main():
-    run_llava()
-
-
 if __name__ == "__main__":
-    # Download from s3
-    s3_bucket_path = "s3://air-example-data-2/vllm_opensource_llava/"
-    local_directory = "images"
-
-    # Make sure the local directory exists or create it
-    os.makedirs(local_directory, exist_ok=True)
-
-    # Use AWS CLI to sync the directory, assume anonymous access
-    subprocess.check_call([
-        "aws",
-        "s3",
-        "sync",
-        s3_bucket_path,
-        local_directory,
-        "--no-sign-request",
-    ])
-    main()
+    run_llava()
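The substance of this change is the switch from a manually synced local file to a bundled test asset. A minimal sketch of the new access pattern, assuming vLLM is installed with the vllm.assets module introduced alongside this commit; the on-demand fetching behavior is an inference from the removal of the S3 sync step, not something the diff states:

```python
from vllm.assets.image import ImageAsset

# Loads the "stop_sign" test image via the asset helper instead of relying on a
# prior `aws s3 sync` into a local `images/` directory.
image = ImageAsset("stop_sign").pil_image  # a PIL.Image.Image
print(image.size)
```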
