
[Feature request] Hybrid model cache: add --checkpoint-every-nb #2034

@aoleg

Description


Mainline llama.cpp added this feature in ggml-org#20087.

I suggest adding a UI option for hybrid models (such as Nemotron-H, Qwen 3.5, Jamba2-Mini) in the Context section to expose this feature. Notably, context shifting is not supposed to work with RNN models, so the two options should likely be mutually exclusive.

Essentially, this creates cache checkpoints after every n batches during prompt processing. Why: it is really helpful with apps such as SillyTavern, which can insert an "anchoring" prompt N messages above the user's latest message (a common practice to make the model adhere to system prompt guidelines). Without this feature, any edit in the middle of the context invalidates an RNN model's entire cache, and the whole prompt must be reprocessed. With checkpoints, only the portion of the prompt after the most recent checkpoint preceding the edit is invalidated and reprocessed, which speeds up usage significantly.
