Hi team, I'm wondering whether there is any way to apply AdamW to sparse features.
I wrapped torch.optim.AdamW in KeyedOptimizerWrapper for the dense features and it works. However, it seems that AdamW is not supported for sparse features sharded in an EmbeddingBagCollection.
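For reference, below is a minimal sketch of the kind of dense-side wrapping described above. It is not taken from the original post: the toy model, the filter on "embedding_bags" in the parameter names, and the lr/weight_decay values are all assumptions for illustration.

```python
import torch
from torch import nn
from torchrec.optim.keyed import KeyedOptimizerWrapper

# Toy model standing in for the user's (possibly DistributedModelParallel-wrapped) model.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))

# Keep only the dense parameters; EmbeddingBagCollection tables (assumed to contain
# "embedding_bags" in their names) would be handled by the sharded/fused optimizer.
dense_params = {
    name: param
    for name, param in model.named_parameters()
    if "embedding_bags" not in name
}

# KeyedOptimizerWrapper takes a dict of named parameters and a factory that builds
# the underlying optimizer from the parameter list.
dense_optimizer = KeyedOptimizerWrapper(
    dense_params,
    lambda params: torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2),
)
```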
Basically, if we set the weight decay in Adam to 0 and then manually decay the parameter θ by θ(t) = θ(t-1) - γ * λ * θ(t-1) after the Adam step has finished, it is equivalent to AdamW (see the sketch below the questions). The problems here are:
1. How to get the embedding value before back-propagation (θ(t-1))? Especially when I have a group of values for a single key in the keyed jagged tensor.
2. How to modify the embedding manually? Given that the embeddings are automatically sharded, I guess it's more complicated than modifying a tensor directly.
3. What is the best way to implement AdamW with the available resources? Thank you very much!
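As a concrete illustration of the decoupled decay described above, here is a minimal sketch on a plain, non-sharded toy table: snapshot θ(t-1) before the Adam step (question 1), run Adam with weight_decay=0, then apply θ(t) = θ_adam(t) - γ * λ * θ(t-1) in place. The table sizes, input ids, and lr/weight_decay values are placeholders, not from the original post.

```python
import torch

lr, weight_decay = 1e-3, 1e-2  # gamma and lambda in the formula above

# Toy embedding table standing in for one (non-sharded) EmbeddingBagCollection table.
embedding = torch.nn.EmbeddingBag(num_embeddings=100, embedding_dim=16, mode="sum")
optimizer = torch.optim.Adam(embedding.parameters(), lr=lr, weight_decay=0.0)

ids = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])
loss = embedding(ids, offsets).sum()

optimizer.zero_grad()
loss.backward()

# Snapshot theta(t-1) before the Adam step (question 1 above).
with torch.no_grad():
    prev_weight = embedding.weight.detach().clone()

optimizer.step()  # Adam update with weight_decay = 0

# Decoupled decay: theta(t) = theta_adam(t) - gamma * lambda * theta(t-1)
with torch.no_grad():
    embedding.weight.add_(prev_weight, alpha=-lr * weight_decay)
```

For a table sharded inside an EmbeddingBagCollection, the same in-place update would have to be applied to each local shard of the sharded weight rather than to a single tensor, which is exactly what makes question 2 harder.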