
[Feature Request] Improve Speculative Decoding feature during model compilation #3366

@kairos-yu

🚀 Feature

Improve Speculative Decoding feature during model compilation

Motivation

I have recently been trying to deploy a speculative decoding model on an edge device and have run into several issues:

Medusa Model Support:
The related issue can be found here: mlc-ai/mlc-llm#3173

Excessive Memory Usage with the Target Model:
During verification of the draft tokens, memory consumption roughly doubles, growing from 3.79 GB to 7.70 GB, as shown in the nsys memory report below. This growth is unexpectedly large and makes deployment on resource-constrained devices impractical.
[Screenshot: nsys memory report]
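
To make the second issue concrete, here is a hedged back-of-the-envelope sketch of one plausible cause: if the sequence dimension of the verification pass is dynamic, the runtime may size its activation workspace for a worst case (for example, the prefill chunk length) rather than for the spec_len + 1 tokens actually being verified. Every number, buffer count, and shape below is an illustrative assumption, not a measurement of mlc-llm.

```python
# Crude memory model: a few seq_len x hidden fp16 buffers per transformer
# layer. All parameters are hypothetical placeholders for illustration.

def activation_workspace_bytes(seq_len: int, hidden: int, n_layers: int,
                               bufs_per_layer: int = 4, dtype_bytes: int = 2) -> int:
    return seq_len * hidden * n_layers * bufs_per_layer * dtype_bytes

# Worst case sized for a dynamic sequence dimension (assumed 2048-token chunk).
worst_case = activation_workspace_bytes(seq_len=2048, hidden=4096, n_layers=32)
# What verifying 8 draft tokens plus 1 bonus token actually needs.
actual_need = activation_workspace_bytes(seq_len=9, hidden=4096, n_layers=32)
print(f"worst case: {worst_case / 2**30:.2f} GiB, actual: {actual_need / 2**20:.2f} MiB")
# -> worst case: 2.00 GiB, actual: 9.00 MiB
```

If something like this is happening, the gap between the worst-case allocation and the actual need would account for multi-gigabyte growth, which motivates the spec_len alternative below.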

Alternatives

  • Refactor the Medusa model.

  • Add spec_len as a compile-time argument during model compilation (see the sketch after this list).
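
For the second alternative, here is a minimal sketch of what a compile-time spec_len might look like, assuming a simple options object. The name spec_len, the dataclass, and the buffer shape are all hypothetical; mlc-llm exposes no such compile option today, which is exactly what this request asks for.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpecDecodeCompileOptions:
    spec_len: int = 8  # max draft tokens verified per decode step, fixed at compile time

def verify_hidden_shape(batch: int, hidden: int, opts: SpecDecodeCompileOptions) -> tuple:
    # With spec_len fixed at compile time this shape is fully static, so the
    # compiler can plan the buffer exactly instead of reserving a dynamic worst case.
    return (batch, opts.spec_len + 1, hidden)

print(verify_hidden_shape(1, 4096, SpecDecodeCompileOptions()))  # (1, 9, 4096)
```

A static spec_len trades runtime flexibility (the draft length can no longer vary per request beyond the compiled bound) for predictable, minimal memory planning, which is usually the right trade on edge devices.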
