I'm experiencing an issue while managing a large-scale embedding vocabulary with TorchRec's `EmbeddingCollection`. Specifically, when I configure the `EmbeddingShardingPlanner` with the `row_wise` sharding type and fused compute kernels, the resulting `DistributedModelParallel` instance sets `requires_grad` to `False` on the embedding parameters, so the embedding weights receive no gradient updates during training.
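
For reference, here is a minimal sketch of roughly how I set things up (the table name, dimensions, and process-group handling below are placeholders, not my exact configuration):

```python
import torch
import torch.distributed as dist
from torchrec import EmbeddingCollection, EmbeddingConfig
from torchrec.distributed.embedding import EmbeddingCollectionSharder
from torchrec.distributed.embedding_types import EmbeddingComputeKernel
from torchrec.distributed.model_parallel import DistributedModelParallel
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

# Placeholder table config; my real vocabulary is much larger.
tables = [
    EmbeddingConfig(
        name="large_table",
        embedding_dim=128,
        num_embeddings=100_000_000,
        feature_names=["feature_0"],
    )
]
model = EmbeddingCollection(tables=tables, device=torch.device("meta"))

# Constrain the planner to row-wise sharding with the fused kernel.
constraints = {
    "large_table": ParameterConstraints(
        sharding_types=[ShardingType.ROW_WISE.value],
        compute_kernels=[EmbeddingComputeKernel.FUSED.value],
    )
}
planner = EmbeddingShardingPlanner(
    topology=Topology(
        world_size=dist.get_world_size(), compute_device="cuda"
    ),
    constraints=constraints,
)
plan = planner.collective_plan(
    model, [EmbeddingCollectionSharder()], dist.GroupMember.WORLD
)

dmp = DistributedModelParallel(
    module=model, plan=plan, device=torch.device("cuda")
)

# Every embedding parameter prints requires_grad == False here:
for name, param in dmp.named_parameters():
    print(name, param.requires_grad)
```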
