[Question] Is there any way to apply AdamW as optimizer for sparse features #2821

Open
shan-jiang-faire opened this issue Mar 14, 2025 · 0 comments

Hi team, I'm wondering whether there is any way to apply AdamW for sparse features.

I wrapped torch.optim.AdamW in KeyedOptimizerWrapper for the dense features and it works. However, it seems that AdamW is not supported for the sparse features sharded in an EmbeddingBagCollection.
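For reference, the dense-side setup looks roughly like the following. This is only a minimal sketch: the module, learning rate, and weight decay are placeholders, and `in_backward_optimizer_filter` is used to skip parameters whose fused optimizer already runs inside the backward pass.

```python
import torch
from torchrec.optim.keyed import KeyedOptimizerWrapper
from torchrec.optim.optimizers import in_backward_optimizer_filter

# Placeholder dense module; in the real setup this is the sharded TorchRec model,
# and the fused sparse parameters are filtered out because their optimizer
# runs inside the backward pass.
model = torch.nn.Linear(16, 1)

dense_optimizer = KeyedOptimizerWrapper(
    dict(in_backward_optimizer_filter(model.named_parameters())),
    lambda params: torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2),
)
```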

Basically, if we set Adam's weight decay to 0 and then, after Adam's backward pass and update, manually decay the parameter by θ(t) ← θ(t) − γ·λ·θ(t−1), the result is equivalent to AdamW (see the sketch after the list below). The problems here are:

  1. How can I get the embedding values before the update (θ(t−1))? Especially when I have a group of values for a single key in the KeyedJaggedTensor.
  2. How can I modify the embeddings manually? Given that they are automatically sharded, I guess it's more complicated than modifying a tensor directly.
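To make the equivalence concrete, here is a minimal sketch of that decoupled decay on an ordinary dense parameter (the learning rate γ and decay λ are illustrative values); the two questions above are about doing the same thing when θ(t−1) lives in a sharded embedding table:

```python
import torch

lr, weight_decay = 1e-3, 1e-2  # gamma and lambda; illustrative values

param = torch.nn.Parameter(torch.randn(8, 16))
# Weight decay is disabled inside Adam, as described above.
optimizer = torch.optim.Adam([param], lr=lr, weight_decay=0.0)

# One training step with a toy loss.
loss = param.pow(2).sum()
loss.backward()

theta_prev = param.detach().clone()  # theta(t-1), captured before the Adam update
optimizer.step()
optimizer.zero_grad()

# Decoupled weight decay: theta(t) <- theta(t) - gamma * lambda * theta(t-1)
with torch.no_grad():
    param.add_(theta_prev, alpha=-lr * weight_decay)
```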

What is the best way to implement AdamW for sparse features with the currently available tooling? Thank you very much!
