KV-Cache in the paper

In your paper, the inference memory consumption is about O(N) and is directly proportional to w. However, I do not understand where 'w' comes from. I am also wondering why there is an additional block on the left of your Figure 3. Also, when moving from the blocks being generated to the blocks still to be generated, isn't some of the KV-cache discarded? I don't really understand why there is (w+n)*M_2 in your formula.
![image](https://github.com/user-attachments/assets/1b20cf42-74f0-41aa-985c-2d7f11304c93)
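In the paper, the KV for each block depends only on that block and its three adjacent blocks. So shouldn't the KV-cache for n*n blocks be n*n*M_2, or n*n*4*M_2? Where does the width come from? Any help would be appreciated.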
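
To make the mismatch concrete, here is a minimal sketch of the three counts I am comparing. It assumes M_2 is the KV-cache size of a single block, which is my reading and may not match the paper's definition. The function names and the values of n, w, and M_2 are made up purely for illustration.

```python
# Hypothetical helpers: each returns a KV-cache size in the same units as m2.
# Assumption: m2 (M_2) is the KV-cache size of one block.

def paper_formula(n: int, w: int, m2: float) -> float:
    """The term that appears in the paper: (w + n) * M_2."""
    return (w + n) * m2


def all_blocks(n: int, m2: float) -> float:
    """My first guess: one cached entry per block, n * n * M_2."""
    return n * n * m2


def all_blocks_with_neighbors(n: int, m2: float) -> float:
    """My second guess: each block plus its three adjacent blocks, n * n * 4 * M_2."""
    return 4 * n * n * m2


if __name__ == "__main__":
    n, w, m2 = 8, 8, 1.0  # illustrative values only
    print("(w + n) * M_2 :", paper_formula(n, w, m2))
    print("n * n * M_2   :", all_blocks(n, m2))
    print("n * n * 4 M_2 :", all_blocks_with_neighbors(n, m2))
```

With these made-up values, the paper's term grows linearly in n while both of my guesses grow quadratically, which is the gap I would like to understand.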
I would be delighted and thankful for your explanation.


KV-Cache in the paper #23
