[Question] Is there any way to apply AdamW as optimizer for sparse features #2821

Open
shan-jiang-faire opened this issue Mar 14, 2025 · 0 comments

Hi team, I'm wondering whether there is any way to apply AdamW for sparse features.

I wrapped torch.optim.AdamW in KeyedOptimizerWrapper for the dense features and it works. However, it seems that AdamW is not supported for the sparse features sharded in an EmbeddingBagCollection.
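For reference, the dense-side setup looks roughly like the following. This is only a minimal sketch: the module, learning rate, and weight decay are placeholders, and `in_backward_optimizer_filter` is used to skip parameters whose fused optimizer already runs inside the backward pass.

```python
import torch
from torchrec.optim.keyed import KeyedOptimizerWrapper
from torchrec.optim.optimizers import in_backward_optimizer_filter

# Placeholder dense module; in the real setup this is the sharded TorchRec model,
# and the fused sparse parameters are filtered out because their optimizer
# runs inside the backward pass.
model = torch.nn.Linear(16, 1)

dense_optimizer = KeyedOptimizerWrapper(
    dict(in_backward_optimizer_filter(model.named_parameters())),
    lambda params: torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2),
)
```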

Basically, if we set Adam's weight decay to 0 and then, after Adam's backward pass and update, manually decay the parameter by θ(t) ← θ(t) − γ·λ·θ(t−1), the result is equivalent to AdamW (see the sketch after the list below). The problems here are:

  1. How can I get the embedding values before the update (θ(t−1))? Especially when I have a group of values for a single key in the KeyedJaggedTensor.
  2. How can I modify the embeddings manually? Given that they are automatically sharded, I guess it's more complicated than modifying a tensor directly.
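To make the equivalence concrete, here is a minimal sketch of that decoupled decay on an ordinary dense parameter (the learning rate γ and decay λ are illustrative values); the two questions above are about doing the same thing when θ(t−1) lives in a sharded embedding table:

```python
import torch

lr, weight_decay = 1e-3, 1e-2  # gamma and lambda; illustrative values

param = torch.nn.Parameter(torch.randn(8, 16))
# Weight decay is disabled inside Adam, as described above.
optimizer = torch.optim.Adam([param], lr=lr, weight_decay=0.0)

# One training step with a toy loss.
loss = param.pow(2).sum()
loss.backward()

theta_prev = param.detach().clone()  # theta(t-1), captured before the Adam update
optimizer.step()
optimizer.zero_grad()

# Decoupled weight decay: theta(t) <- theta(t) - gamma * lambda * theta(t-1)
with torch.no_grad():
    param.add_(theta_prev, alpha=-lr * weight_decay)
```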

What is the best way to implement AdamW for sparse features with the currently available tooling? Thank you very much!
