feat: Add support for speculators Eagle checkpoints #20436

Draft

rahul-tuli wants to merge 13 commits into main from feat/speculators-eagle-support

Conversation

Contributor
@rahul-tuli rahul-tuli commented Jul 3, 2025

Add support for Eagle models in speculators format

This PR adds support for Eagle models distributed in the "speculators" format, enabling seamless speculative decoding with simple vllm serve commands.

What is speculators format?

Speculators format is an alternative packaging for Eagle models that includes both the draft model and configuration in a single repository. This format simplifies deployment by bundling everything needed for speculative decoding.
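
For illustration only, a converted checkpoint's config.json carries speculators-specific fields alongside the draft transformer's settings. The field names below are taken from this PR's description and the converter output later in this thread; the exact schema is owned by the speculators library and may differ:

# Illustrative sketch of speculators-format config fields (not the exact schema)
illustrative_config = {
    "speculators_model_type": "eagle",  # or "eagle3"
    "fusion_bias": False,               # whether the fusion layer carries a bias
    "layernorms": False,                # True for the HASS variant's extra layernorms
    "transformer_layer_config": {
        # draft transformer settings (hidden size, heads, vocab size, ...)
    },
}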

Key Features

1. Auto-detection of speculators format

Models in speculators format are automatically detected and configured for speculative decoding:

# Automatically detects speculators format and configures Eagle speculative decoding
vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717

No need for complex --speculative-config arguments!

2. Support for multiple Eagle variants

Standard Eagle-1 models:

vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717

HASS variant (Eagle with additional layernorms):

# Note: Use VLLM_DISABLE_COMPILE_CACHE=1 to avoid torch.compile cache conflicts
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve nm-testing/hass-llama3.1-8b-layernorms

Eagle-3 models:

vllm serve nm-testing/eagle3-llama3.1-8b-instruct-speculators

3. Qwen model support with Eagle-3

# Eagle-3 now supports Qwen models
vllm serve nm-testing/Speculator-Qwen3-8B-Eagle3-converted-0717

Example Commands

Basic usage (auto-detection):

# The model is automatically detected as speculators format
# Target model, draft model, and speculative config are all configured automatically
vllm serve <speculators-format-model>

With custom tensor parallelism for draft model:

vllm serve <speculators-format-model> --draft-tensor-parallel-size 1

Manual configuration (if needed):

# You can still use manual configuration if preferred
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle", "model": "nm-testing/eagle-llama3.1-8b-instruct-converted-0717", "num_speculative_tokens": 5}'

Technical Details

  • Automatic detection of speculators format via config structure (see the sketch after this list)
  • Proper weight remapping between speculators and vLLM naming conventions
  • Support for both Eagle-1 and Eagle-3 architectures
  • Backwards compatibility with existing Eagle models
  • Unified architecture for HASS variants to ensure torch.compile compatibility
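
As a rough sketch only (the helper names below match those referenced in this PR, but the bodies and the exact vLLM-side weight names are illustrative assumptions, not the actual implementation), detection keys off the speculators-specific config fields, and weight names are remapped from speculators conventions such as fusion_fc.weight and the transformer. prefix back to vLLM's Eagle names:

import json
from typing import Optional

# Illustrative mapping; the real table lives in the PR and may differ.
SPECULATORS_TO_VLLM = {
    "fusion_fc.weight": "fc.weight",   # assumed vLLM name for the fusion layer
    "fusion_fc.bias": "fc.bias",
}
TRANSFORMER_PREFIX = "transformer."    # speculators prefix for the single decoder layer
VLLM_LAYER_PREFIX = "model.layers.0."  # assumed vLLM prefix; may differ

def is_speculators_eagle_config(config_path: str) -> bool:
    # Detect the speculators format from config.json contents (simplified to a local file).
    with open(config_path) as f:
        cfg = json.load(f)
    return cfg.get("speculators_model_type") in ("eagle", "eagle3")

def remap_speculators_weight_name(name: str) -> Optional[str]:
    # Map a speculators weight name to the corresponding vLLM name.
    if name in SPECULATORS_TO_VLLM:
        return SPECULATORS_TO_VLLM[name]
    if name.startswith(TRANSFORMER_PREFIX):
        return VLLM_LAYER_PREFIX + name[len(TRANSFORMER_PREFIX):]
    return name  # names that already match pass through unchanged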

Important Notes

  1. CUDA Graph Caching: When switching between different Eagle variants (standard vs HASS), use VLLM_DISABLE_COMPILE_CACHE=1 to avoid torch.compile cache conflicts.

  2. Draft Tensor Parallelism: Currently, draft_tensor_parallel_size > 1 is not supported.

  3. V1 Engine: All examples assume V1 engine usage with VLLM_USE_V1=1 (see the combined command after these notes).
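
Putting notes 1 and 3 together, a typical invocation looks like the following (the model name is a placeholder):

VLLM_USE_V1=1 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve <speculators-format-model>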

Testing

Example models for testing:

  • Eagle-1: nm-testing/eagle-llama3.1-8b-instruct-converted-0717
  • HASS: nm-testing/hass-llama3.1-8b-layernorms
  • Eagle-3: nm-testing/eagle3-llama3.1-8b-instruct-speculators
  • Qwen + Eagle-3: nm-testing/Speculator-Qwen3-8B-Eagle3-converted-0717


github-actions bot commented Jul 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist bot left a comment

Summary of Changes

Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces compatibility for Eagle models that have been converted using the speculators library. It establishes a robust translation layer within vLLM to seamlessly handle differences in configuration and weight naming conventions between the speculators ecosystem and vLLM's native Eagle implementation, enabling speculative decoding with these models.

Highlights

  • Support for Speculators Eagle Models: This pull request enables vLLM to load and utilize Eagle models that have been converted using the speculators library. This bridges the compatibility gap between the speculators ecosystem and vLLM's native Eagle implementation, allowing for speculative decoding with these models.
  • Configuration Translation Layer: A new SpeculatorsEagleConfig class has been introduced. This class automatically detects speculators-specific configuration formats (e.g., speculators_model_type, transformer_layer_config, fusion_bias, layernorms) and translates them into vLLM's expected Eagle configuration format during model loading.
  • Automated Weight Remapping: Logic has been added to the EAGLE.load_weights() method to transparently remap weight names. This handles the differences between speculators' naming conventions (e.g., fusion_fc.weight, transformer.) and vLLM's internal naming scheme, ensuring that weights are loaded correctly without manual intervention.
  • Seamless Integration and Backward Compatibility: The get_config() utility has been updated to automatically detect speculators Eagle models based on their configuration. These models are then routed through the new translation layer. This integration is designed to be additive, ensuring full backward compatibility with existing vLLM Eagle models.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for loading Eagle models from the speculators library by introducing a configuration adapter and weight remapping. The overall approach is sound. The review focuses on the new SpeculatorsEagleConfig implementation and identifies critical issues related to handling remote models from the Hugging Face Hub, which would prevent the feature from working in a common use case. Detailed suggestions are provided to fix these issues by properly fetching remote configurations, along with a minor suggestion to improve code maintainability.

Contributor

@dsikka dsikka left a comment

Can you post steps to run?

I wasn't able to run the verification script on this branch with speculators main.

@@ -334,6 +336,17 @@ def get_config(
            raise ValueError(error_message) from e

    if config_format == ConfigFormat.HF:
        # Check if this is a speculators Eagle model
        if is_speculators_eagle_config(model):
            config = SpeculatorsEagleConfig.from_pretrained(
Contributor
Are all existing supported models just going through the PretrainedConfig pathway?

Contributor Author
Yes!

@rahul-tuli
Contributor Author

rahul-tuli commented Jul 4, 2025

Can you post steps to run?

I wasn't able to run the verification script on this branch with speculators main.

We don't need speculators to run the models; here are the steps:

Step 1: convert an existing Eagle or HASS checkpoint with the speculators convert utility from this branch (yet to land on main): neuralmagic/speculators#39

There is a doc explaining how to use the convert utility here: https://github.com/neuralmagic/speculators/blob/efab1758d803e03f42c85cc67425cefa80c5344f/docs/convert.md

For example, convert an existing Eagle checkpoint using:

speculators convert --eagle yuhuili/EAGLE-LLaMA3.1-Instruct-8B ./converted/eagle meta-llama/Llama-3.1-8B-Instruct

Output:

2025-07-04 17:14:28.581 | INFO     | speculators.convert.eagle.eagle_converter:convert:50 - Converting Eagle checkpoint: yuhuili/EAGLE-LLaMA3.1-Instruct-8B
2025-07-04 17:14:28.581 | INFO     | speculators.convert.eagle.eagle_converter:_ensure_local:93 - Downloading checkpoint from HuggingFace: yuhuili/EAGLE-LLaMA3.1-Instruct-8B
Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 31775.03it/s]
2025-07-04 17:14:28.671 | DEBUG    | speculators.convert.eagle.eagle_converter:_ensure_local:100 - Downloaded to: /home/rahul/.cache/huggingface/hub/models--yuhuili--EAGLE-LLaMA3.1-Instruct-8B/snapshots/89073acba22a03994aee0c76774a10ca941e4706
2025-07-04 17:14:28.672 | DEBUG    | speculators.convert.eagle.eagle_converter:_load_checkpoint:118 - Loading config from: /home/rahul/.cache/huggingface/hub/models--yuhuili--EAGLE-LLaMA3.1-Instruct-8B/snapshots/89073acba22a03994aee0c76774a10ca941e4706/config.json
2025-07-04 17:14:28.672 | DEBUG    | speculators.convert.eagle.eagle_converter:_load_checkpoint:133 - Loading PyTorch weights from: /home/rahul/.cache/huggingface/hub/models--yuhuili--EAGLE-LLaMA3.1-Instruct-8B/snapshots/89073acba22a03994aee0c76774a10ca941e4706/pytorch_model.bin
2025-07-04 17:14:29.640 | INFO     | speculators.convert.eagle.eagle_converter:convert:55 - Loaded 10 weights
2025-07-04 17:14:29.640 | DEBUG    | speculators.convert.eagle.eagle_converter:_build_config:169 - Building EagleSpeculatorConfig
2025-07-04 17:14:29.641 | DEBUG    | speculators.convert.eagle.eagle_converter:_build_config:219 - Config built with fusion_bias=False, layernorms=False
2025-07-04 17:14:29.644 | DEBUG    | speculators.convert.eagle.eagle_converter:_process_weights:242 - Processing 10 weights
2025-07-04 17:14:29.644 | DEBUG    | speculators.convert.eagle.eagle_converter:_process_single_weight:277 - Skipping embed_tokens.weight (tied to lm_head)
2025-07-04 17:14:29.644 | DEBUG    | speculators.convert.eagle.eagle_converter:_process_weights:259 - Skipped weights: ['embed_tokens.weight']
2025-07-04 17:14:29.644 | DEBUG    | speculators.convert.eagle.eagle_converter:_process_weights:261 - Remapped weights: ['layers.0.self_attn.q_proj.weight -> transformer.self_attn.q_proj.weight', 'layers.0.self_attn.k_proj.weight -> transformer.self_attn.k_proj.weight', 'layers.0.self_attn.v_proj.weight -> transformer.self_attn.v_proj.weight', 'layers.0.self_attn.o_proj.weight -> transformer.self_attn.o_proj.weight', 'layers.0.mlp.gate_proj.weight -> transformer.mlp.gate_proj.weight', 'layers.0.mlp.up_proj.weight -> transformer.mlp.up_proj.weight', 'layers.0.mlp.down_proj.weight -> transformer.mlp.down_proj.weight', 'layers.0.post_attention_layernorm.weight -> transformer.post_attention_layernorm.weight', 'fc.weight -> fusion_fc.weight']
2025-07-04 17:14:29.700 | DEBUG    | speculators.convert.eagle.eagle_converter:_save_checkpoint:319 - Saving config to: converted/eagle/config.json
2025-07-04 17:14:29.701 | DEBUG    | speculators.convert.eagle.eagle_converter:_save_checkpoint:325 - Saving weights to: converted/eagle/model.safetensors
2025-07-04 17:14:30.157 | SUCCESS  | speculators.convert.eagle.eagle_converter:convert:72 - Saved to: converted/eagle

The converted checkpoint will be saved in converted/eagle (after this point speculators is not needed)

Step 2: Check out the current branch in vLLM and run the model:

from vllm import LLM, SamplingParams

# UPDATE THIS PATH
eagle_model_path = "/home/rahul/speculators/converted/eagle"

print("Loading models...")

# Create LLM with Eagle speculative decoding
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # target/verifier model
    speculative_config={
        "model": eagle_model_path,  # Your Eagle model path
        "num_speculative_tokens": 5,  # Number of tokens to predict ahead
    },
    trust_remote_code=True,
    gpu_memory_utilization=0.4,
    max_model_len=1024, 
)

print("Models loaded! Generating text...")

# Your prompt
prompt = "The benefits of open source software include"


sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=100,
)

# Generate text
output = llm.generate([prompt], sampling_params)[0]
generated_text = output.outputs[0].text

print(f"\nPrompt: {prompt}")
print(f"Generated: {generated_text}")

Output:

...(truncated for brevity)...
Models loaded! Generating text...
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 74.57it/s]
Processed prompts: 100%|██████████████████████████| 1/1 [00:02<00:00,  2.42s/it, est. speed input: 3.31 toks/s, output: 41.41 toks/s]

Prompt: The benefits of open source software include
Generated: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.
2. Customization: Open source software can be modified and customized to meet specific needs, which can be particularly useful for businesses or organizations with unique requirements.
3. Community support: Open source software often has a large and active community of developers and users who contribute to its development and provide support.
4. Security: Open source software can be more secure
[rank0]:[W704 17:23:14.420996530 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

@dsikka
Contributor

dsikka commented Jul 5, 2025

...(quoted rahul-tuli's steps above, truncated for brevity)...

cli does not seem to be working on that branch.

@rahul-tuli
Contributor Author

...(quoted exchange above, truncated for brevity)...

cli does not seem to be working on that branch.

Could you paste the command you used, and the traceback, on that PR?

@dsikka
Contributor

dsikka commented Jul 7, 2025

...(quoted exchange above, truncated for brevity)...

Could you paste the command you used, and the traceback, on that PR?

I’ll touch base offline. It just didn’t recognize the speculators command

transformer_config["architectures"] = [arch]

# Build vLLM config
vllm_config = {
Contributor

@dsikka dsikka Jul 7, 2025
Why don't we need to add the verifier model as part of the config? How are the two differentiated in the vllm_config object?

Contributor

@dsikka dsikka left a comment

question


mergify bot commented Jul 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rahul-tuli.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 8, 2025
@rahul-tuli rahul-tuli force-pushed the feat/speculators-eagle-support branch from e9fecc1 to 8e1183c Compare July 9, 2025 13:38
@mergify mergify bot removed the needs-rebase label Jul 9, 2025
@rahul-tuli rahul-tuli force-pushed the feat/speculators-eagle-support branch 4 times, most recently from 85a11c7 to 81c9904 Compare July 9, 2025 14:30
@mergify mergify bot added the llama (Related to Llama models) and documentation (Improvements or additions to documentation) labels Jul 9, 2025
@rahul-tuli rahul-tuli force-pushed the feat/speculators-eagle-support branch 3 times, most recently from 9f730f4 to 875b786 Compare July 15, 2025 13:22
@aarnphm aarnphm self-assigned this Jul 15, 2025
@mergify mergify bot added the performance Performance-related issues label Jul 15, 2025
@rahul-tuli rahul-tuli force-pushed the feat/speculators-eagle-support branch from 36b8502 to 00da923 Compare July 15, 2025 18:14
vllm/config.py Outdated
@@ -2767,9 +2767,15 @@ def __post_init__(self):
        # Automatically detect the method
        if self.method in ('eagle', 'eagle3'):
            pass
        elif hasattr(self.draft_model_config.hf_config,
                     "speculators_model_type") and \
                self.draft_model_config.hf_config.speculators_model_type in ("eagle", "eagle3"):
Contributor
why do we need this

vllm/config.py Outdated
elif "eagle-" in self.draft_model_config.model.lower() or \
"eagle3-" in self.draft_model_config.model.lower():
self.method = "eagle"
elif self.draft_model_config.hf_config.model_type == "eagle":
Contributor
same as above

@@ -22,6 +24,27 @@

logger = init_logger(__name__)

# Map speculators weight names to vLLM names
Contributor
Should probably live in the speculators config file

}


def remap_speculators_weight_name(name: str) -> Optional[str]:
Contributor
same as above

        return num_lookahead_tokens

    @classmethod
    def _apply_eagle_v1_config(
Contributor
these are fine for now but wondering if we can combine the eagle1/eagle3 _apply methods defined here

@mergify mergify bot added the qwen (Related to Qwen models) label Jul 17, 2025
@rahul-tuli rahul-tuli force-pushed the feat/speculators-eagle-support branch 2 times, most recently from 35dbf03 to 8082d60 Compare July 17, 2025 19:14
@mergify mergify bot added the ci/build label Jul 17, 2025
@rahul-tuli rahul-tuli force-pushed the feat/speculators-eagle-support branch from 74815d6 to 213b5a4 Compare July 17, 2025 21:39
@@ -0,0 +1 @@
VLLM_USE_V1=1 vllm serve nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717 >output_speculators_llama.txt
Contributor
Should we add a readme for eagle3 as well?

Contributor

@dsikka dsikka left a comment

Overall LGTM. I think we can potentially clean up further by moving some of the logic into the SpeculatorsEagleConfig/SpeculatorsConfig.

@rahul-tuli rahul-tuli force-pushed the feat/speculators-eagle-support branch from 830c3cf to d7591ee Compare July 17, 2025 23:39
- Add SpeculatorsEAGLEConfig for handling speculators format Eagle models
- Support both Eagle-1 and Eagle-3 variants with proper config translation
- Add detection function to identify speculators format models
- Handle weight remapping between speculators and vLLM naming conventions

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Add weight remapping function for speculators format in llama_eagle.py
- Support HASS variant with additional layernorms
- Add IdentityNorm for unified architecture to fix torch.compile compatibility
- Propagate norm_before_residual config for Eagle3 models

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Enable Eagle3 speculative decoding for Qwen2 and Qwen3
- Import and register Eagle3 in supported draft model types

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Auto-detect speculators format models and configure speculative decoding
- Allow 'vllm serve <speculators-model>' without explicit speculative config
- Add draft_tensor_parallel_size CLI argument
- Update tokenizer to use target model instead of draft model
- Check speculators model type in draft config for proper Eagle detection

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Add Eagle1 example scripts for Llama models
- Add Eagle3 example scripts for Llama and Qwen models
- Include both regular and float16 serving examples

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
…utils/configs/speculators_eagle.py

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@rahul-tuli rahul-tuli force-pushed the feat/speculators-eagle-support branch from 25d7544 to 65ee8f4 Compare July 18, 2025 18:14

mergify bot commented Jul 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rahul-tuli.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot removed the needs-rebase label Jul 22, 2025
Labels
ci/build · documentation (Improvements or additions to documentation) · llama (Related to Llama models) · performance (Performance-related issues) · qwen (Related to Qwen models) · speculative-decoding