Description
Mainline llama.cpp added ggml-org#20087, which introduces cache checkpoints for recurrent/hybrid models.
I suggest adding a UI option for hybrid models (such as Nemotron-H, Qwen 3.5, and Jamba2-Mini) in the Context section to expose this feature. Notably, context shifting is not supposed to work with RNN models, so the two options should likely be mutually exclusive.
Essentially, the feature creates a cache checkpoint after every n batches during prompt processing. Why it matters: apps such as SillyTavern can insert an "anchoring" prompt N messages above the user's latest message (a common practice to keep the model adhering to system-prompt guidelines). Without checkpoints, such an insertion invalidates the entire cache of an RNN model, and the whole prompt must be reprocessed from the start. With checkpoints, processing can resume from the nearest checkpoint before the edit, so only the trailing portion of the prompt is reprocessed, which speeds up usage significantly.
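To make the mechanism concrete, here is a minimal toy sketch of checkpoint-based rollback for a recurrent cache. All names (`RecurrentCache`, `process`, `rollback`) and the hash-based "state" are illustrative assumptions, not llama.cpp APIs; the real state is a fixed-size recurrent tensor, and the real checkpoint interval is configurable.

```python
BATCH = 4  # toy batch size in tokens

class RecurrentCache:
    """Illustrative only: snapshot the recurrent state every `checkpoint_every` batches."""

    def __init__(self, checkpoint_every: int):
        self.every = checkpoint_every
        self.checkpoints = [(0, 0)]  # (token_position, state) pairs; empty state at pos 0
        self.state = 0               # stand-in for the recurrent state tensor
        self.pos = 0                 # number of tokens processed so far

    def _step(self, token: int) -> None:
        # Toy recurrent update: each token irreversibly folds into the state,
        # which is why RNN caches cannot be partially invalidated without checkpoints.
        self.state = (self.state * 31 + token) % 100003
        self.pos += 1

    def process(self, tokens: list[int]) -> int:
        """Process tokens[self.pos:]; return how many tokens were (re)processed."""
        processed = 0
        batches = 0
        while self.pos < len(tokens):
            for tok in tokens[self.pos : self.pos + BATCH]:
                self._step(tok)
                processed += 1
            batches += 1
            if batches % self.every == 0:
                self.checkpoints.append((self.pos, self.state))
        return processed

    def rollback(self, edit_pos: int) -> int:
        """Restore the newest checkpoint at or before `edit_pos`; return its position."""
        while len(self.checkpoints) > 1 and self.checkpoints[-1][0] > edit_pos:
            self.checkpoints.pop()
        self.pos, self.state = self.checkpoints[-1]
        return self.pos

cache = RecurrentCache(checkpoint_every=2)     # checkpoint every 2 batches = 8 tokens
prompt = list(range(40))                       # 40-token prompt, 10 batches
full = cache.process(prompt)                   # first pass processes all 40 tokens

# An "anchoring" message is inserted at token 30: instead of reprocessing all
# 41 tokens, roll back to the nearest checkpoint (position 24) and resume there.
resume = cache.rollback(30)
edited = prompt[:30] + [999] + prompt[30:]
reprocessed = cache.process(edited)            # only 41 - 24 = 17 tokens redone
```

The point of the sketch: without the checkpoint list, `rollback` would have to return 0 and the whole edited prompt would be reprocessed; with it, the cost of a mid-prompt insertion shrinks to at most `checkpoint_every` batches plus the suffix.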