KV-Cache in the paper #23

@URRealHero

In your paper, the inference memory consumption is O(N) and directly proportional to w. However, I do not understand where 'w' comes from. I'm also wondering why there is an additional block on the left of Figure 3. Also, when moving from the blocks being generated to the blocks to be generated, isn't some of the KV-cache discarded? I don't really understand why (w+n)M_2 appears in your formula.
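In the paper, the KV depends only on each block and its three neighboring blocks, so shouldn't the KV-cache for n blocks be n M_2 or 4n M_2? Where does the width come from? Please help me understand.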
I would be very grateful for an explanation.
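
The three expressions under discussion can be put side by side with a minimal sketch. It assumes M_2 is a per-block KV-cache cost and that n and w are the quantities from the paper's formula; their exact meaning is what this issue asks about, and the function name and the numeric values below are hypothetical, chosen only for illustration.

```python
def kv_cache_candidates(n: int, w: int, m2: float) -> dict[str, float]:
    """Return the three KV-cache estimates mentioned in the question.

    n, w and m2 mirror the symbols n, w and M_2 in the question;
    what they mean precisely in the paper is an assumption here.
    """
    return {
        "n * M_2": n * m2,              # one cached block per generated block
        "4n * M_2": 4 * n * m2,         # each block plus its three neighbors
        "(w + n) * M_2": (w + n) * m2,  # the expression quoted from the paper
    }


# Hypothetical values, for illustration only.
for name, value in kv_cache_candidates(n=16, w=8, m2=1.0).items():
    print(f"{name}: {value}")
```

All three grow linearly in n, so they differ only in the constant factor and in the additive w term, which is the part the question is about.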
