Fused moe tuning ep #20863
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @robertgshaw2-redhat, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the benchmarking capabilities for Mixture-of-Experts (MoE) models by introducing explicit support for expert parallelism. It includes updates to the benchmarking script to configure and validate expert distribution, provides new infrastructure for easier execution of these benchmarks, and incorporates a crucial type casting fix within the fused MoE kernel for robust all-to-all communication.
Highlights
- Expert Parallelism (EP) Benchmarking: Implemented support for expert parallelism in Mixture-of-Experts (MoE) benchmarks, allowing configuration via a new `--ep-size` argument and adding validation for expert distribution across parallel processes (see the first sketch after this list).
- MoE Benchmark Infrastructure: Introduced a new `Dockerfile` for setting up a dedicated environment and a `Justfile` with predefined commands to easily run MoE benchmarks across various models (e.g., Llama-Scout, Qwen, DeepSeek-R1) and configurations, including expert parallelism.
- Dependency Management Update: Updated the `install_python_libraries.sh` script to use `uv` for package installation, streamlining dependency management for expert parallelism kernels like `pplx-kernels` and `DeepEP`.
- Fused MoE Kernel Fix: Applied a type casting fix (`.view(dtype=torch.uint32)`) to the `indices` tensor in the fused MoE `pplx_prepare_finalize.py` module, ensuring correct data type handling for the all-to-all communication operations critical for expert dispatch and combine (see the second sketch below).
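The `--ep-size` flow described in the first highlight can be pictured with a small sketch. This is not the PR's actual code: the `--ep-size` argument name comes from the PR description, but the entry point, the `--num-experts` flag, and the `validate_expert_distribution` helper are illustrative assumptions about how such a divisibility check is typically wired up.

```python
# Hypothetical sketch of an --ep-size flag plus an expert-distribution check
# for a MoE tuning script; names other than --ep-size are illustrative.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Fused MoE tuning benchmark")
    parser.add_argument("--ep-size", type=int, default=1,
                        help="Number of expert-parallel ranks to shard experts across")
    parser.add_argument("--num-experts", type=int, default=128,
                        help="Total number of experts in the model (illustrative default)")
    return parser.parse_args()


def validate_expert_distribution(num_experts: int, ep_size: int) -> int:
    """Ensure experts divide evenly across expert-parallel ranks."""
    if ep_size < 1:
        raise ValueError(f"--ep-size must be >= 1, got {ep_size}")
    if num_experts % ep_size != 0:
        raise ValueError(
            f"num_experts ({num_experts}) must be divisible by ep_size ({ep_size})")
    # Each expert-parallel rank then tunes kernels for its local shard of experts.
    return num_experts // ep_size


if __name__ == "__main__":
    args = parse_args()
    local_experts = validate_expert_distribution(args.num_experts, args.ep_size)
    print(f"Tuning with {local_experts} experts per EP rank")
```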
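The kernel fix in the last highlight is a one-line dtype reinterpretation. The snippet below only illustrates that cast: everything around `.view(dtype=torch.uint32)` (tensor names, shapes, and what happens after the cast) is a placeholder, and it assumes a PyTorch build recent enough to expose `torch.uint32`.

```python
# Illustrative sketch of the dtype fix: reinterpreting int32 expert indices
# as uint32 before handing them to an all-to-all dispatch kernel.
# Only the .view(dtype=torch.uint32) cast itself comes from the PR description.
import torch

topk_ids = torch.randint(0, 128, (4, 8), dtype=torch.int32)  # [tokens, top_k] expert ids

# .view(dtype=...) reinterprets the same buffer without copying, which works
# here because int32 and uint32 have the same element size.
indices = topk_ids.view(dtype=torch.uint32)

assert indices.shape == topk_ids.shape and indices.dtype == torch.uint32
```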
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for your GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request introduces support for expert parallelism (EP) tuning in the Fused MoE benchmark. It adds a new Dockerfile for environment setup, a Justfile for running benchmarks, and updates the benchmark script to handle the `ep-size` parameter.
My review focuses on the new Dockerfile, where I've identified a couple of critical issues related to the build process that will cause it to fail or produce an incorrect environment. I've also included a suggestion for improving the Docker image size and build efficiency. The other changes in the benchmark scripts and tooling appear correct and well-implemented for the stated purpose.
```
@@ -0,0 +1,20 @@
ARG CUDA_VERSION=12.8.1
```
```dockerfile
# Install vllm.
WORKDIR /workspace/vllm
RUN uv venv .vllm --python 3.12
RUN . .vllm/bin/activate && VLLM_USE_PRECOMPILED=1 uv pip install -e .

# Checkout a specific commit.
ENV VLLM_SHA=550f8a052cae03c7e14a46767f689ab09c1cc28d
RUN git fetch && git checkout ${VLLM_SHA}
```
The order of operations for installing vLLM is incorrect. The current Dockerfile installs vLLM from the default branch and then checks out the specific commit defined by `VLLM_SHA`. This means the installed version of vLLM is not the one specified by the SHA. The `git checkout` command must be executed before installing the package to ensure the correct version is built and installed.
```dockerfile
# Checkout a specific commit.
WORKDIR /workspace/vllm
ENV VLLM_SHA=550f8a052cae03c7e14a46767f689ab09c1cc28d
RUN git fetch && git checkout ${VLLM_SHA}

# Install vllm.
RUN uv venv .vllm --python 3.12
RUN . .vllm/bin/activate && VLLM_USE_PRECOMPILED=1 uv pip install -e .
```
```dockerfile
ARG CUDA_VERSION=12.8.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04

RUN apt update && apt install git -y && apt install curl -y
```
To optimize the Docker image size and improve build caching, it's a best practice to combine `apt-get update` and `apt-get install` into a single `RUN` layer. Additionally, you should clean up the apt cache in the same layer to reduce the final image size. You can also install multiple packages in a single `apt-get install` command.
```dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends git curl && rm -rf /var/lib/apt/lists/*
```
Essential Elements of an Effective PR Description Checklist
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update