feat: Add support for speculators Eagle checkpoints #20436
Conversation
Summary of Changes
Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces compatibility for Eagle models that have been converted with the speculators library. It establishes a translation layer within vLLM that handles the differences in configuration and weight-naming conventions between the speculators ecosystem and vLLM's native Eagle implementation, enabling speculative decoding with these models.
Highlights
- Support for Speculators Eagle Models: This pull request enables vLLM to load and run Eagle models that have been converted with the speculators library, bridging the compatibility gap between the speculators ecosystem and vLLM's native Eagle implementation and allowing speculative decoding with these models.
- Configuration Translation Layer: A new SpeculatorsEagleConfig class automatically detects speculators-specific configuration fields (e.g., speculators_model_type, transformer_layer_config, fusion_bias, layernorms) and translates them into vLLM's expected Eagle configuration format during model loading.
- Automated Weight Remapping: Logic added to EAGLE.load_weights() transparently remaps weight names, handling the differences between speculators' naming conventions (e.g., fusion_fc.weight, the transformer. prefix) and vLLM's internal naming scheme so that weights load correctly without manual intervention (see the sketch after this list).
- Seamless Integration and Backward Compatibility: The get_config() utility now detects speculators Eagle models from their configuration and routes them through the new translation layer. The integration is additive, preserving full backward compatibility with existing vLLM Eagle models.
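The mapping table itself does not appear in this thread; the following is a minimal sketch of what the module-level map and the remap_speculators_weight_name helper named in the diff below could look like. The speculators-side names (fusion_fc.weight, the transformer. prefix) come from the summary above; the vLLM-side targets are assumptions.

from typing import Optional

# Hypothetical speculators-to-vLLM weight-name map; only the speculators-side
# names mentioned in this PR are used, the right-hand sides are illustrative.
SPECULATORS_WEIGHT_MAP = {
    "fusion_fc.weight": "fc.weight",
    "fusion_fc.bias": "fc.bias",
}

def remap_speculators_weight_name(name: str) -> Optional[str]:
    """Translate a speculators checkpoint weight name into vLLM's scheme.

    The real helper returns Optional[str] (presumably None means "skip this
    weight"); this sketch always returns a name.
    """
    if name in SPECULATORS_WEIGHT_MAP:
        return SPECULATORS_WEIGHT_MAP[name]
    # speculators prefixes the draft transformer layers with "transformer.";
    # assume vLLM addresses the same layers without that prefix.
    if name.startswith("transformer."):
        return name[len("transformer."):]
    return name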
Code Review
This pull request adds support for loading Eagle models from the speculators library by introducing a configuration adapter and weight remapping. The overall approach is sound. The review focuses on the new SpeculatorsEagleConfig implementation and identifies critical issues in the handling of remote models from the Hugging Face Hub, which would prevent the feature from working in a common use case. Detailed suggestions are provided to fix these issues by properly fetching remote configurations, along with a minor suggestion to improve code maintainability.
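For context, here is a minimal sketch of a detection helper that also works for remote Hub repositories, using transformers' config resolution. The helper name (is_speculators_eagle_config) and the speculators_model_type key appear in this PR; everything else is an assumption.

from transformers import PretrainedConfig

def is_speculators_eagle_config(model: str) -> bool:
    """Return True if `model` (a local path or Hugging Face Hub ID) looks like
    a speculators-format Eagle checkpoint."""
    try:
        # get_config_dict resolves both local directories and remote Hub
        # repositories, so no manual download of config.json is needed.
        config_dict, _ = PretrainedConfig.get_config_dict(model)
    except (OSError, ValueError):
        return False
    return config_dict.get("speculators_model_type") in ("eagle", "eagle3")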
Can you post steps to run?
I wasn't able to run the verification script on this branch with speculators main.
@@ -334,6 +336,17 @@ def get_config(
raise ValueError(error_message) from e

if config_format == ConfigFormat.HF:
    # Check if this is a speculators Eagle model
    if is_speculators_eagle_config(model):
        config = SpeculatorsEagleConfig.from_pretrained(
Are all existing supported models just going through the PretrainedConfig pathway?
Yes!
We don't need speculators to run the models; here are the steps:

Step 1: Convert an existing Eagle or HASS checkpoint with the speculators convert utility from this branch (yet to land on main): neuralmagic/speculators#39
There is a doc explaining how to use the convert utility here: https://github.com/neuralmagic/speculators/blob/efab1758d803e03f42c85cc67425cefa80c5344f/docs/convert.md
For example, convert an existing Eagle checkpoint using:
speculators convert --eagle yuhuili/EAGLE-LLaMA3.1-Instruct-8B ./converted/eagle meta-llama/Llama-3.1-8B-Instruct
The converted checkpoint will be saved in the specified output directory (./converted/eagle above).

Step 2: Check out the current branch in vLLM and run the model:
from vllm import LLM, SamplingParams
# UPDATE THIS PATH
eagle_model_path = "/home/rahul/speculators/converted/eagle"
print("Loading models...")
# Create LLM with Eagle speculative decoding
llm = LLM(
model="meta-llama/Meta-Llama-3.1-8B-Instruct", # target/verifier model
speculative_config={
"model": eagle_model_path, # Your Eagle model path
"num_speculative_tokens": 5, # Number of tokens to predict ahead
},
trust_remote_code=True,
gpu_memory_utilization=0.4,
max_model_len=1024,
)
print("Models loaded! Generating text...")
# Your prompt
prompt = "The benefits of open source software include"
sampling_params = SamplingParams(
temperature=0.0,
max_tokens=100,
)
# Generate text
output = llm.generate([prompt], sampling_params)[0]
generated_text = output.outputs[0].text
print(f"\nPrompt: {prompt}")
print(f"Generated: {generated_text}") Output: ...(truncated for brevity)...
Models loaded! Generating text...
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 74.57it/s]
Processed prompts: 100%|██████████████████████████| 1/1 [00:02<00:00, 2.42s/it, est. speed input: 3.31 toks/s, output: 41.41 toks/s]
Prompt: The benefits of open source software include
Generated: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.
2. Customization: Open source software can be modified and customized to meet specific needs, which can be particularly useful for businesses or organizations with unique requirements.
3. Community support: Open source software often has a large and active community of developers and users who contribute to its development and provide support.
4. Security: Open source software can be more secure
[rank0]:[W704 17:23:14.420996530 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
The CLI does not seem to be working on that branch.
Could you paste the command you used, and the traceback, on that PR?
I'll touch base offline. It just didn't recognize the speculators command.
transformer_config["architectures"] = [arch]

# Build vLLM config
vllm_config = {
Why don't we need to add the verifier model as part of the config? How are the two differentiated in the vllm_config object?
question
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/config.py (Outdated)
@@ -2767,9 +2767,15 @@ def __post_init__(self):
# Automatically detect the method
if self.method in ('eagle', 'eagle3'):
    pass
elif hasattr(self.draft_model_config.hf_config,
             "speculators_model_type") and \
        self.draft_model_config.hf_config.speculators_model_type in ("eagle", "eagle3"):
why do we need this
vllm/config.py (Outdated)
elif "eagle-" in self.draft_model_config.model.lower() or \
        "eagle3-" in self.draft_model_config.model.lower():
    self.method = "eagle"
elif self.draft_model_config.hf_config.model_type == "eagle":
same as above
@@ -22,6 +24,27 @@

logger = init_logger(__name__)

# Map speculators weight names to vLLM names
Should probably live in the speculators config file
}


def remap_speculators_weight_name(name: str) -> Optional[str]:
same as above
return num_lookahead_tokens

@classmethod
def _apply_eagle_v1_config(
These are fine for now, but I'm wondering if we can combine the eagle1/eagle3 _apply methods defined here.
@@ -0,0 +1 @@
VLLM_USE_V1=1 vllm serve nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717 >output_speculators_llama.txt
Should we add a readme for eagle3 as well?
Overall LGTM. I think we can potentially clean up further by moving some of the logic into SpeculatorsEagleConfig/SpeculatorsConfig.
- Add SpeculatorsEAGLEConfig for handling speculators format Eagle models - Support both Eagle-1 and Eagle-3 variants with proper config translation - Add detection function to identify speculators format models - Handle weight remapping between speculators and vLLM naming conventions Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Add weight remapping function for speculators format in llama_eagle.py - Support HASS variant with additional layernorms - Add IdentityNorm for unified architecture to fix torch.compile compatibility - Propagate norm_before_residual config for Eagle3 models Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Enable Eagle3 speculative decoding for Qwen2 and Qwen3 - Import and register Eagle3 in supported draft model types Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Auto-detect speculators format models and configure speculative decoding - Allow 'vllm serve <speculators-model>' without explicit speculative config - Add draft_tensor_parallel_size CLI argument - Update tokenizer to use target model instead of draft model - Check speculators model type in draft config for proper Eagle detection Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Add Eagle1 example scripts for Llama models - Add Eagle3 example scripts for Llama and Qwen models - Include both regular and float16 serving examples Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
…utils/configs/speculators_eagle.py Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Add support for Eagle models in speculators format
This PR adds support for Eagle models distributed in the "speculators" format, enabling seamless speculative decoding with simple vllm serve commands.

What is speculators format?
Speculators format is an alternative packaging for Eagle models that includes both the draft model and configuration in a single repository. This format simplifies deployment by bundling everything needed for speculative decoding.
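For illustration only, the speculators-specific fields that the new adapter looks for might appear in a checkpoint's config roughly as below (shown as a Python dict; the field names come from this PR, the values and nesting are assumptions, and the authoritative schema lives in the speculators library):

# Hypothetical excerpt of a speculators-format config.
speculators_config_excerpt = {
    "speculators_model_type": "eagle",   # what vLLM's detection keys on
    "transformer_layer_config": {        # config of the draft transformer layer
        "hidden_size": 4096,
        "num_attention_heads": 32,
    },
    "fusion_bias": False,                # whether the fusion (fc) layer has a bias
    "layernorms": False,                 # True for the HASS variant with extra layernorms
}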
Key Features
1. Auto-detection of speculators format
Models in speculators format are automatically detected and configured for speculative decoding:
# Automatically detects speculators format and configures Eagle speculative decoding
vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717
No need for complex --speculative-config arguments!

2. Support for multiple Eagle variants
Standard Eagle-1 models:
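For example (assuming the Eagle-1 test model listed under Testing below):
vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717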
HASS variant (Eagle with additional layernorms):
# Note: Use VLLM_DISABLE_COMPILE_CACHE=1 to avoid torch.compile cache conflicts
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve nm-testing/hass-llama3.1-8b-layernorms
Eagle-3 models:
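For example (assuming the Eagle-3 test model listed under Testing below):
vllm serve nm-testing/eagle3-llama3.1-8b-instruct-speculators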
3. Qwen model support with Eagle-3
# Eagle-3 now supports Qwen models
vllm serve nm-testing/Speculator-Qwen3-8B-Eagle3-converted-0717
Example Commands
Basic usage (auto-detection):
With custom tensor parallelism for draft model:
Manual configuration (if needed):
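A plausible sketch of each scenario (the --speculative-config JSON form is standard vLLM CLI usage; the draft tensor-parallel flag name is an assumption based on the commit above that adds a draft_tensor_parallel_size argument, and the model IDs come from the Testing section below):

# Basic usage (auto-detection)
vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717

# With custom tensor parallelism for the draft model (flag name assumed)
vllm serve nm-testing/eagle-llama3.1-8b-instruct-converted-0717 --draft-tensor-parallel-size 1

# Manual configuration (if needed)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --speculative-config '{"model": "nm-testing/eagle-llama3.1-8b-instruct-converted-0717", "num_speculative_tokens": 5}'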
Technical Details
Important Notes
- CUDA Graph Caching: When switching between different Eagle variants (standard vs HASS), use VLLM_DISABLE_COMPILE_CACHE=1 to avoid torch.compile cache conflicts.
- Draft Tensor Parallelism: Currently, draft_tensor_parallel_size > 1 is not supported.
- V1 Engine: All examples assume V1 engine usage with VLLM_USE_V1=1.

Testing
Example models for testing:
nm-testing/eagle-llama3.1-8b-instruct-converted-0717
nm-testing/hass-llama3.1-8b-layernorms
nm-testing/eagle3-llama3.1-8b-instruct-speculators
nm-testing/Speculator-Qwen3-8B-Eagle3-converted-0717