Hi team, I'm wondering whether there is any way to apply AdamW to sparse features.
I wrapped torch.optim.AdamW in KeyedOptimizerWrapper for the dense features and it works. However, it seems that AdamW is not supported for sparse features sharded in an EmbeddingBagCollection.
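For reference, this is roughly how the dense side is set up (a minimal sketch with a toy stand-in model and placeholder hyperparameters, not my actual code):

```python
import torch
import torch.nn as nn
from torchrec.optim.keyed import KeyedOptimizerWrapper

# Toy stand-in for the dense part of the model; the real model also
# contains an EmbeddingBagCollection, which is omitted here.
dense_arch = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))

# KeyedOptimizerWrapper takes a dict of named parameters plus a factory
# that builds the underlying optimizer -- wrapping AdamW this way works.
dense_optimizer = KeyedOptimizerWrapper(
    dict(dense_arch.named_parameters()),
    lambda params: torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2),
)
```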
Basically, if we set Adam's weight decay to 0 and then manually decay the parameter θ by

θ(t) = θ(t-1) - γ · λ · θ(t-1)

after the Adam backward pass and update step, it is equivalent to AdamW (see the dense sketch after the list below). The problems here are:
- How to get the embedding value θ(t-1) before the update, especially when I have a group of values for a single key in the KeyedJaggedTensor?
- How to modify the embedding manually? Given the tables are automatically sharded, I guess it's more complicated than modifying a tensor directly.
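To make the idea concrete, here is a dense-only sketch of the decoupled decay I have in mind, using a plain tensor and torch.optim.Adam with weight decay disabled (the table size, lr, and λ are placeholders). The question is how to do the same two steps against a table that lives inside a sharded EmbeddingBagCollection:

```python
import torch

lr, lam = 1e-3, 1e-2  # learning rate gamma and decay lambda (placeholders)

# Stand-in for one embedding table; in the real model this is a sharded
# table inside an EmbeddingBagCollection, which I can't touch directly.
weight = torch.nn.Parameter(torch.randn(100, 16))
adam = torch.optim.Adam([weight], lr=lr, weight_decay=0.0)  # decay disabled

# Fake forward/backward over a few looked-up rows.
loss = weight[torch.tensor([1, 5, 7])].sum()
loss.backward()

# Keep a copy of theta(t-1) before the Adam step ...
prev = weight.detach().clone()
adam.step()
adam.zero_grad()

# ... then apply the decoupled decay: theta(t) -= gamma * lambda * theta(t-1)
with torch.no_grad():
    weight.add_(prev, alpha=-lr * lam)
```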
What is the best way to implement AdamW for the sparse features with the available TorchRec APIs? Thank you very much!