Description
Mainline llama.cpp added ggml-org#20087, which introduces cache checkpoints for recurrent/hybrid models.
I suggest adding a UI option for hybrid models (such as Nemotron-H, Qwen 3.5, and Jamba2-Mini) in the Context section to expose this feature. Notably, context shifting is not supposed to work with RNN models, so the two options should likely be mutually exclusive.
Essentially, the feature creates a cache checkpoint after every n batches during prompt processing. Why it matters: apps such as SillyTavern can insert an "anchoring" prompt N messages above the user's latest message (a common practice to keep the model adhering to system-prompt guidelines). Without checkpoints, such an insertion invalidates the entire cache of an RNN model, and the whole prompt must be reprocessed from the start. With checkpoints, processing can resume from the nearest checkpoint before the edit, so only the trailing portion of the prompt is reprocessed, which speeds up usage significantly.
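To make the mechanism concrete, here is a minimal toy sketch of checkpoint-based rollback for a recurrent cache. All names (`RecurrentCache`, `process`, `rollback`) and the hash-based "state" are illustrative assumptions, not llama.cpp APIs; the real state is a fixed-size recurrent tensor, and the real checkpoint interval is configurable.

```python
BATCH = 4  # toy batch size in tokens

class RecurrentCache:
    """Illustrative only: snapshot the recurrent state every `checkpoint_every` batches."""

    def __init__(self, checkpoint_every: int):
        self.every = checkpoint_every
        self.checkpoints = [(0, 0)]  # (token_position, state) pairs; empty state at pos 0
        self.state = 0               # stand-in for the recurrent state tensor
        self.pos = 0                 # number of tokens processed so far

    def _step(self, token: int) -> None:
        # Toy recurrent update: each token irreversibly folds into the state,
        # which is why RNN caches cannot be partially invalidated without checkpoints.
        self.state = (self.state * 31 + token) % 100003
        self.pos += 1

    def process(self, tokens: list[int]) -> int:
        """Process tokens[self.pos:]; return how many tokens were (re)processed."""
        processed = 0
        batches = 0
        while self.pos < len(tokens):
            for tok in tokens[self.pos : self.pos + BATCH]:
                self._step(tok)
                processed += 1
            batches += 1
            if batches % self.every == 0:
                self.checkpoints.append((self.pos, self.state))
        return processed

    def rollback(self, edit_pos: int) -> int:
        """Restore the newest checkpoint at or before `edit_pos`; return its position."""
        while len(self.checkpoints) > 1 and self.checkpoints[-1][0] > edit_pos:
            self.checkpoints.pop()
        self.pos, self.state = self.checkpoints[-1]
        return self.pos

cache = RecurrentCache(checkpoint_every=2)     # checkpoint every 2 batches = 8 tokens
prompt = list(range(40))                       # 40-token prompt, 10 batches
full = cache.process(prompt)                   # first pass processes all 40 tokens

# An "anchoring" message is inserted at token 30: instead of reprocessing all
# 41 tokens, roll back to the nearest checkpoint (position 24) and resume there.
resume = cache.rollback(30)
edited = prompt[:30] + [999] + prompt[30:]
reprocessed = cache.process(edited)            # only 41 - 24 = 17 tokens redone
```

The point of the sketch: without the checkpoint list, `rollback` would have to return 0 and the whole edited prompt would be reprocessed; with it, the cost of a mid-prompt insertion shrinks to at most `checkpoint_every` batches plus the suffix.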