How to finetune after replacing the original full attention with the SLA operator? #23

@limou102

I plan to use the SLA operator to replace the attention operator in the original model (e.g., wan2.1). The paper says finetuning is required, and there is a linear layer called proj_l in the SLA operator whose weights and bias must also be finetuned. I would like to know what the finetuning workflow should be.

I have thought of 3 approaches:

  1. After replacing full attention with SLA, sample the attention layer’s inputs and outputs using real model input data, and finetune the attention layer alone with MSE loss.
  2. After replacing full attention with SLA, freeze all other parameters and train end-to-end using real model input–output data, updating only the proj_l linear layer in each SLA operator.
  3. After replacing full attention with SLA, perform end-to-end finetuning of the whole model using real model input–output data, so parameters in other layers will also change.
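To make the difference between approaches 2 and 3 concrete, here is a minimal PyTorch sketch of how the trainable parameters could be selected. `ToyAttention`, its placeholder forward pass, and the helper `select_trainable` are hypothetical stand-ins, not the real SLA operator or its API; only the name `proj_l` comes from the description above.

```python
import torch
import torch.nn as nn


class ToyAttention(nn.Module):
    """Hypothetical stand-in for an attention layer that owns a proj_l linear layer."""

    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj_l = nn.Linear(dim, dim)  # name taken from the issue text
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        # Placeholder computation; NOT the actual SLA algorithm.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ v + self.proj_l(v))


def select_trainable(model, only_proj_l):
    """Approach 2 (only_proj_l=True) trains proj_l only; approach 3 trains everything."""
    for name, param in model.named_parameters():
        param.requires_grad = (not only_proj_l) or ("proj_l" in name.split("."))
    return [p for p in model.parameters() if p.requires_grad]


torch.manual_seed(0)
model = nn.Sequential(ToyAttention(16), nn.Linear(16, 16))

# Approach 2: freeze everything except proj_l.
params = select_trainable(model, only_proj_l=True)
optimizer = torch.optim.AdamW(params, lr=1e-4)
trainable_names = [n for n, p in model.named_parameters() if p.requires_grad]

# One end-to-end step on random data standing in for real input/output pairs.
x = torch.randn(2, 8, 16)
target = torch.randn(2, 8, 16)
frozen_before = model[1].weight.detach().clone()
loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Calling `select_trainable(model, only_proj_l=False)` instead gives the approach 3 setup. Approach 1 would use the same parameter selection but compute the MSE loss on a single attention layer's output against the recorded full-attention output, rather than on the model's final output.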

Which approach would be better?

Also, after finetuning, when running unit tests comparing the SLA operator and the normal full attention operator, will the numerical results be almost identical?
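For the unit test, one way to quantify "almost identical" would be to report a relative L1 error and a cosine similarity between the two outputs instead of relying on exact equality. The sketch below is only an illustration: `compare_outputs` is a hypothetical helper, and the tensors are synthetic placeholders for the SLA and full attention outputs.

```python
import torch
import torch.nn.functional as F


def compare_outputs(out, ref, eps=1e-8):
    """Return relative L1 error and cosine similarity of `out` against `ref`."""
    out, ref = out.flatten().float(), ref.flatten().float()
    rel_l1 = ((out - ref).abs().sum() / (ref.abs().sum() + eps)).item()
    cos_sim = F.cosine_similarity(out, ref, dim=0).item()
    return {"rel_l1": rel_l1, "cos_sim": cos_sim}


# Synthetic example: `out` deviates from `ref` by 1% everywhere.
ref = torch.ones(4)
out = ref * 1.01
metrics = compare_outputs(out, ref)
```

What threshold counts as acceptable is a judgment call that depends on the model and the sparsity level, so the test would need a tolerance chosen empirically.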

Thanks.
