
[Question] Is there any way to apply AdamW as optimizer for sparse features #2821

Open
@shan-jiang-faire

Description


Hi team, I'm wondering whether there is any way to apply AdamW for sparse features.

I wrapped torch.optim.AdamW in KeyedOptimizerWrapper for the dense features, and it works. However, it seems that AdamW is not supported for sparse features sharded in an EmbeddingBagCollection.
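
For reference, this is roughly the dense-side setup I mean (a sketch following the TorchRec example pattern, assuming KeyedOptimizerWrapper and in_backward_optimizer_filter from torchrec.optim; the model and hyperparameters below are just placeholders):

```python
import torch
from torchrec.optim.keyed import KeyedOptimizerWrapper
from torchrec.optim.optimizers import in_backward_optimizer_filter

# `model` stands in for the sharded TorchRec model; a plain module is used
# here only so the snippet is self-contained.
model = torch.nn.Linear(16, 1)

dense_optimizer = KeyedOptimizerWrapper(
    # keep only the parameters not already owned by a fused (in-backward)
    # optimizer, i.e. the dense ones
    dict(in_backward_optimizer_filter(model.named_parameters())),
    lambda params: torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2),
)
```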

Basically, if we set Adam's weight decay to 0 and then, after the Adam backward pass and update finish, manually decay the parameter θ by θ(t) = θ(t-1) - γ·λ·θ(t-1), the result is equivalent to AdamW (a minimal sketch of this scheme follows the list below). The problems here are:

  1. How do I get the embedding values before the update (θ(t-1)), especially when a single key in the KeyedJaggedTensor has a group of values?
  2. How do I modify the embeddings manually? Since they are automatically sharded, I assume it is more complicated than modifying a tensor directly.
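
For concreteness, here is a minimal, unsharded sketch of that scheme in plain PyTorch: run Adam with weight_decay=0, keep a copy of θ(t-1), and subtract γ·λ·θ(t-1) after the step. The table shape and hyperparameters are placeholders; doing the same against a sharded EmbeddingBagCollection is exactly what questions 1 and 2 are about.

```python
import torch

# Toy, unsharded embedding table standing in for one table of an
# EmbeddingBagCollection; lr / weight_decay values are placeholders.
emb = torch.nn.EmbeddingBag(num_embeddings=100, embedding_dim=8, mode="sum")
lr, weight_decay = 1e-3, 1e-2

# Adam with weight_decay=0; the decoupled decay is applied manually below.
opt = torch.optim.Adam(emb.parameters(), lr=lr, weight_decay=0.0)

ids = torch.tensor([1, 2, 4, 5, 4, 3])
offsets = torch.tensor([0, 2, 4])

opt.zero_grad()
loss = emb(ids, offsets).sum()
loss.backward()

# Keep theta(t-1) around, since the decay term uses the pre-update value.
with torch.no_grad():
    prev = [p.detach().clone() for p in emb.parameters()]

opt.step()  # plain Adam update, no weight decay

# Decoupled weight decay, matching the formula above:
#   theta(t) = theta_adam(t) - lr * weight_decay * theta(t-1)
with torch.no_grad():
    for p, p_prev in zip(emb.parameters(), prev):
        p.sub_(lr * weight_decay * p_prev)
```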

What is the best way to implement AdamW with available resources? Thank you very much!
