feat: Add support for speculators Eagle checkpoints #20436
Conversation
Summary of Changes
Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces compatibility for Eagle models that have been converted with the speculators library. It establishes a translation layer within vLLM that handles the differences in configuration and weight-naming conventions between the speculators ecosystem and vLLM's native Eagle implementation, enabling speculative decoding with these models.
Highlights
- Support for Speculators Eagle Models: This pull request enables vLLM to load and run Eagle models that have been converted with the speculators library, bridging the compatibility gap between the speculators ecosystem and vLLM's native Eagle implementation and allowing speculative decoding with these models.
- Configuration Translation Layer: A new SpeculatorsEagleConfig class automatically detects speculators-specific configuration fields (e.g., speculators_model_type, transformer_layer_config, fusion_bias, layernorms) and translates them into vLLM's expected Eagle configuration format during model loading.
- Automated Weight Remapping: Logic added to EAGLE.load_weights() transparently remaps weight names, handling the differences between speculators' naming conventions (e.g., fusion_fc.weight, the transformer. prefix) and vLLM's internal naming scheme so that weights load correctly without manual intervention (see the sketch after this list).
- Seamless Integration and Backward Compatibility: The get_config() utility now detects speculators Eagle models from their configuration and routes them through the new translation layer. The integration is additive, preserving full backward compatibility with existing vLLM Eagle models.
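The mapping table itself does not appear in this thread; the following is a minimal sketch of what the module-level map and the remap_speculators_weight_name helper named in the diff below could look like. The speculators-side names (fusion_fc.weight, the transformer. prefix) come from the summary above; the vLLM-side targets are assumptions.

from typing import Optional

# Hypothetical speculators-to-vLLM weight-name map; only the speculators-side
# names mentioned in this PR are used, the right-hand sides are illustrative.
SPECULATORS_WEIGHT_MAP = {
    "fusion_fc.weight": "fc.weight",
    "fusion_fc.bias": "fc.bias",
}

def remap_speculators_weight_name(name: str) -> Optional[str]:
    """Translate a speculators checkpoint weight name into vLLM's scheme.

    The real helper returns Optional[str] (presumably None means "skip this
    weight"); this sketch always returns a name.
    """
    if name in SPECULATORS_WEIGHT_MAP:
        return SPECULATORS_WEIGHT_MAP[name]
    # speculators prefixes the draft transformer layers with "transformer.";
    # assume vLLM addresses the same layers without that prefix.
    if name.startswith("transformer."):
        return name[len("transformer."):]
    return name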
Code Review
This pull request adds support for loading Eagle models from the speculators library by introducing a configuration adapter and weight remapping. The overall approach is sound. The review focuses on the new SpeculatorsEagleConfig implementation and identifies critical issues in the handling of remote models from the Hugging Face Hub, which would prevent the feature from working in a common use case. Detailed suggestions are provided to fix these issues by properly fetching remote configurations, along with a minor suggestion to improve code maintainability.
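For context, here is a minimal sketch of a detection helper that also works for remote Hub repositories, using transformers' config resolution. The helper name (is_speculators_eagle_config) and the speculators_model_type key appear in this PR; everything else is an assumption.

from transformers import PretrainedConfig

def is_speculators_eagle_config(model: str) -> bool:
    """Return True if `model` (a local path or Hugging Face Hub ID) looks like
    a speculators-format Eagle checkpoint."""
    try:
        # get_config_dict resolves both local directories and remote Hub
        # repositories, so no manual download of config.json is needed.
        config_dict, _ = PretrainedConfig.get_config_dict(model)
    except (OSError, ValueError):
        return False
    return config_dict.get("speculators_model_type") in ("eagle", "eagle3")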
Can you post steps to run?
I wasn't able to run the verification script on this branch with speculators main.
@@ -334,6 +336,17 @@ def get_config(
raise ValueError(error_message) from e

if config_format == ConfigFormat.HF:
    # Check if this is a speculators Eagle model
    if is_speculators_eagle_config(model):
        config = SpeculatorsEagleConfig.from_pretrained(
Are all existing supported models just going through the PretrainedConfig pathway?
Yes!
We don't need speculators to run the models; here are the steps:

Step 1: Convert an existing Eagle or HASS checkpoint with the speculators convert utility from this branch (yet to land on main): neuralmagic/speculators#39
There is a doc explaining how to use the convert utility here: https://github.com/neuralmagic/speculators/blob/efab1758d803e03f42c85cc67425cefa80c5344f/docs/convert.md
For example, convert an existing Eagle checkpoint using:
speculators convert --eagle yuhuili/EAGLE-LLaMA3.1-Instruct-8B ./converted/eagle meta-llama/Llama-3.1-8B-Instruct
The converted checkpoint will be saved in the specified output directory (./converted/eagle above).

Step 2: Check out the current branch in vLLM and run the model:
from vllm import LLM, SamplingParams
# UPDATE THIS PATH
eagle_model_path = "/home/rahul/speculators/converted/eagle"
print("Loading models...")
# Create LLM with Eagle speculative decoding
llm = LLM(
model="meta-llama/Meta-Llama-3.1-8B-Instruct", # target/verifier model
speculative_config={
"model": eagle_model_path, # Your Eagle model path
"num_speculative_tokens": 5, # Number of tokens to predict ahead
},
trust_remote_code=True,
gpu_memory_utilization=0.4,
max_model_len=1024,
)
print("Models loaded! Generating text...")
# Your prompt
prompt = "The benefits of open source software include"
sampling_params = SamplingParams(
temperature=0.0,
max_tokens=100,
)
# Generate text
output = llm.generate([prompt], sampling_params)[0]
generated_text = output.outputs[0].text
print(f"\nPrompt: {prompt}")
print(f"Generated: {generated_text}") Output: ...(truncated for brevity)...
Models loaded! Generating text...
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 74.57it/s]
Processed prompts: 100%|██████████████████████████| 1/1 [00:02<00:00, 2.42s/it, est. speed input: 3.31 toks/s, output: 41.41 toks/s]
Prompt: The benefits of open source software include
Generated: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.
2. Customization: Open source software can be modified and customized to meet specific needs, which can be particularly useful for businesses or organizations with unique requirements.
3. Community support: Open source software often has a large and active community of developers and users who contribute to its development and provide support.
4. Security: Open source software can be more secure
[rank0]:[W704 17:23:14.420996530 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
The CLI does not seem to be working on that branch.
Could you paste the command you used, and the traceback, on that PR?
I'll touch base offline. It just didn't recognize the speculators command.
transformer_config["architectures"] = [arch]

# Build vLLM config
vllm_config = {
Why don't we need to add the verifier model as part of the config? How are the two differentiated in the vllm_config object?
question
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/config.py (Outdated)
@@ -2767,9 +2767,15 @@ def __post_init__(self):
# Automatically detect the method
if self.method in ('eagle', 'eagle3'):
    pass
elif hasattr(self.draft_model_config.hf_config,
             "speculators_model_type") and \
        self.draft_model_config.hf_config.speculators_model_type in ("eagle", "eagle3"):
why do we need this
vllm/config.py (Outdated)
elif "eagle-" in self.draft_model_config.model.lower() or \
        "eagle3-" in self.draft_model_config.model.lower():
    self.method = "eagle"
elif self.draft_model_config.hf_config.model_type == "eagle":
same as above
@@ -22,6 +24,27 @@

logger = init_logger(__name__)

# Map speculators weight names to vLLM names
Should probably live in the speculators config file
}


def remap_speculators_weight_name(name: str) -> Optional[str]:
same as above
return num_lookahead_tokens

@classmethod
def _apply_eagle_v1_config(
These are fine for now, but I'm wondering if we can combine the eagle1/eagle3 _apply methods defined here.
@@ -0,0 +1 @@
VLLM_USE_V1=1 vllm serve nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717 >output_speculators_llama.txt
Should we add a readme for eagle3 as well?
Overall LGTM. I think we can potentially clean up further by moving some of the logic into SpeculatorsEagleConfig/SpeculatorsConfig.
- Add SpeculatorsEAGLEConfig for handling speculators format Eagle models - Support both Eagle-1 and Eagle-3 variants with proper config translation - Add detection function to identify speculators format models - Handle weight remapping between speculators and vLLM naming conventions Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Add weight remapping function for speculators format in llama_eagle.py - Support HASS variant with additional layernorms - Add IdentityNorm for unified architecture to fix torch.compile compatibility - Propagate norm_before_residual config for Eagle3 models Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Enable Eagle3 speculative decoding for Qwen2 and Qwen3 - Import and register Eagle3 in supported draft model types Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Auto-detect speculators format models and configure speculative decoding - Allow 'vllm serve <speculators-model>' without explicit speculative config - Add draft_tensor_parallel_size CLI argument - Update tokenizer to use target model instead of draft model - Check speculators model type in draft config for proper Eagle detection Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Add Eagle1 example scripts for Llama models - Add Eagle3 example scripts for Llama and Qwen models - Include both regular and float16 serving examples Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
…utils/configs/speculators_eagle.py Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Add support for Eagle models in speculators format
This PR adds support for Eagle models distributed in the "speculators" format, enabling seamless speculative decoding with simple vllm serve commands.

What is speculators format?
Speculators format is an alternative packaging for Eagle models that includes both the draft model and configuration in a single repository. This format simplifies deployment by bundling everything needed for speculative decoding.
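For illustration only, the speculators-specific fields that the new adapter looks for might appear in a checkpoint's config roughly as below (shown as a Python dict; the field names come from this PR, the values and nesting are assumptions, and the authoritative schema lives in the speculators library):

# Hypothetical excerpt of a speculators-format config.
speculators_config_excerpt = {
    "speculators_model_type": "eagle",   # what vLLM's detection keys on
    "transformer_layer_config": {        # config of the draft transformer layer
        "hidden_size": 4096,
        "num_attention_heads": 32,
    },
    "fusion_bias": False,                # whether the fusion (fc) layer has a bias
    "layernorms": False,                 # True for the HASS variant with extra layernorms
}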
Key Features
1. Auto-detection of speculators format
Models in speculators format are automatically detected and configured for speculative decoding:
# Automatically detects speculators format and configures Eagle speculative decoding
vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717
No need for complex --speculative-config arguments!

2. Support for multiple Eagle variants
Standard Eagle-1 models:
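For example (assuming the Eagle-1 test model listed under Testing below):
vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717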
HASS variant (Eagle with additional layernorms):
# Note: Use VLLM_DISABLE_COMPILE_CACHE=1 to avoid torch.compile cache conflicts
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve nm-testing/hass-llama3.1-8b-layernorms
Eagle-3 models:
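For example (assuming the Eagle-3 test model listed under Testing below):
vllm serve nm-testing/eagle3-llama3.1-8b-instruct-speculators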
3. Qwen model support with Eagle-3
# Eagle-3 now supports Qwen models
vllm serve nm-testing/Speculator-Qwen3-8B-Eagle3-converted-0717
Example Commands
Basic usage (auto-detection):
With custom tensor parallelism for draft model:
Manual configuration (if needed):
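A plausible sketch of each scenario (the --speculative-config JSON form is standard vLLM CLI usage; the draft tensor-parallel flag name is an assumption based on the commit above that adds a draft_tensor_parallel_size argument, and the model IDs come from the Testing section below):

# Basic usage (auto-detection)
vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717

# With custom tensor parallelism for the draft model (flag name assumed)
vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717 --draft-tensor-parallel-size 1

# Manual configuration (if needed)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --speculative-config '{"model": "nm-testing/eagle-llama3.1-8b-instruct-converted-0717", "num_speculative_tokens": 5}'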
Technical Details
Important Notes
- CUDA Graph Caching: When switching between different Eagle variants (standard vs HASS), use VLLM_DISABLE_COMPILE_CACHE=1 to avoid torch.compile cache conflicts.
- Draft Tensor Parallelism: Currently, draft_tensor_parallel_size > 1 is not supported.
- V1 Engine: All examples assume V1 engine usage with VLLM_USE_V1=1.

Testing
Example models for testing:
nm-testing/eagle-llama3.1-8b-instruct-converted-0717
nm-testing/hass-llama3.1-8b-layernorms
nm-testing/eagle3-llama3.1-8b-instruct-speculators
nm-testing/Speculator-Qwen3-8B-Eagle3-converted-0717