How to finetune after replacing the original full attention with the SLA operator? #23

I plan to replace the attention operator in an existing model (e.g., Wan2.1) with the SLA operator. The paper says finetuning is required, and the SLA operator contains a linear layer called proj_l whose weight and bias must also be finetuned. What should the finetuning workflow be?

I have thought of three approaches, each applied after replacing full attention with SLA:
1. Record each attention layer's inputs and outputs from the original full-attention model on real input data, then finetune the SLA attention layer alone with an MSE loss against those recorded outputs.
2. Train end-to-end on real model input-output data, but freeze every parameter except the SLA proj_l linear layer, so only proj_l is updated.
3. Finetune the whole model end-to-end on real model input-output data, so parameters in the other layers change as well.
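To make approach 2 concrete, here is a minimal sketch of the freezing step. It assumes only that the model exposes `named_parameters()` and that each parameter has a `requires_grad` flag, which is the interface `torch.nn.Module` provides. The helper `freeze_all_but`, the `ToyModel` stand-in, and its parameter names are hypothetical and exist only so the snippet runs without PyTorch installed; the real SLA module may name its parameters differently.

```python
from types import SimpleNamespace


def freeze_all_but(model, keyword="proj_l"):
    """Leave only parameters whose name contains `keyword` trainable.

    Works on any object exposing named_parameters() -> (name, param) pairs.
    Returns the names of the parameters that remain trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(name)
    if not trainable:
        # Guard against a silent no-op when the keyword matches nothing.
        raise ValueError(f"no parameter name contains {keyword!r}")
    return trainable


class ToyModel:
    """Hypothetical stand-in for a real model; the names are made up."""

    def __init__(self):
        names = [
            "blocks.0.attn.qkv.weight",
            "blocks.0.attn.proj_l.weight",
            "blocks.0.attn.proj_l.bias",
            "blocks.0.ffn.weight",
        ]
        self._params = {n: SimpleNamespace(requires_grad=True) for n in names}

    def named_parameters(self):
        return self._params.items()


toy = ToyModel()
trainable_names = freeze_all_but(toy)
print(trainable_names)
```

With a real PyTorch model, the optimizer would then be built only from the parameters that still have `requires_grad` set. Approach 3 corresponds to skipping this step and leaving every parameter trainable.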

Which approach would be better?

Also, after finetuning, should a unit test that compares the SLA operator against the standard full attention operator on the same inputs produce nearly identical numerical results?
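To pin down what "almost identical" would mean in such a test, here is a rough sketch of two metrics I could compute on flattened outputs: relative L1 error and cosine similarity. The function names and the sample values are illustrative assumptions, not measurements from SLA or from any real model.

```python
import math


def relative_l1(ref, out):
    """Sum of absolute differences divided by the sum of |ref|."""
    num = sum(abs(r - o) for r, o in zip(ref, out))
    den = sum(abs(r) for r in ref)
    return num / den


def cosine_similarity(ref, out):
    """Cosine of the angle between the two flattened outputs."""
    dot = sum(r * o for r, o in zip(ref, out))
    norm_ref = math.sqrt(sum(r * r for r in ref))
    norm_out = math.sqrt(sum(o * o for o in out))
    return dot / (norm_ref * norm_out)


# Made-up values standing in for flattened attention outputs.
full_attention_out = [1.0, 2.0, 3.0, 4.0]
sla_out = [1.0, 2.0, 3.0, 4.4]

rel_err = relative_l1(full_attention_out, sla_out)
cos_sim = cosine_similarity(full_attention_out, sla_out)
print(rel_err, cos_sim)
```

What I do not know is which thresholds on these metrics are reasonable to expect from a sparse approximation after finetuning.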

Thanks.
