🚀 Feature
Improve the speculative decoding feature during model compilation
Motivation
Recently, I have been trying to deploy a speculative decoding model on an edge device, and I ran into two issues:
Medusa Model Support:
The related issue can be found here: mlc-ai/mlc-llm#3173
Excessive Memory Usage with the Target Model:
During verification of the draft tokens, memory consumption increases sharply, from 3.79 GB to 7.70 GB, as shown in the nsys memory report below. This growth is unexpectedly large and makes deployment impractical on resource-constrained devices; a rough sense of the scale involved is sketched after this paragraph.
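For scale, here is a minimal back-of-the-envelope sketch. It assumes, purely hypothetically, that the jump comes from the verification path allocating a second full KV cache (or an equivalently sized workspace) for the target model instead of reusing the decode-time one; the model dimensions (`num_layers`, `num_kv_heads`, `head_dim`, `max_seq_len`) are illustrative placeholders, not values measured in the report above.

```python
# A minimal sketch, assuming (hypothetically) that the memory jump comes
# from the verification path allocating its own full KV cache for the
# target model. All dimensions below are illustrative placeholders.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   max_seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of one full K+V cache in bytes (fp16 by default)."""
    # 2 accounts for keys and values being cached separately.
    return (2 * num_layers * num_kv_heads * head_dim
            * max_seq_len * bytes_per_elem)

if __name__ == "__main__":
    # e.g. a 7B-class target model with a 4K context (illustrative only).
    gib = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                         head_dim=128, max_seq_len=4096) / 2**30
    print(f"one duplicated KV cache ≈ {gib:.1f} GiB")  # ≈ 2.0 GiB
```

A single duplicated allocation of this kind is already on the order of the observed growth, which is why I suspect the verification path is not sharing buffers with the normal decode path; I have not confirmed the actual cause.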

Alternatives