Hi
I am quite confused about how gradient accumulation, the loss passed to self.log, and the logged step relate to each other. First, here is my understanding:
For gradient accumulation, we need the gradients normalized by the number of accumulated batches before the optimizer step. In plain PyTorch, we need something like this:
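(A minimal sketch of manual accumulation in plain PyTorch, since the original snippet is not shown here; the toy model, data, and K = 4 are placeholders.)

```python
import torch
from torch import nn

K = 4  # hypothetical number of accumulation steps
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y)
    (loss / K).backward()          # divide by K so the accumulated gradient matches the big-batch average
    if (i + 1) % K == 0:
        optimizer.step()
        optimizer.zero_grad()
```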
For PL, the following is my code:
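(The original code block did not come through; below is a minimal reconstruction based on the names used in the questions, i.e. `self.criterion`, `logits`, and the `self.log` call, with a placeholder model.)

```python
import pytorch_lightning as pl
import torch
from torch import nn

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(10, 2)        # placeholder model
        self.criterion = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.model(x)
        loss = self.criterion(logits, y)     # per-batch (unnormalized) loss
        # reduce_fx="mean" averages the logged value across the logging window
        self.log("train_loss", loss, on_step=True, prog_bar=True, reduce_fx="mean")
        return loss                          # this returned value is what Lightning backpropagates

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```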
As discussed here, reduce_fx will average the loss that we pass to self.log, and the real loss we see (printed on the progress bar) is self.value / self.cumulated_batch_size. So which one is used for backward: loss = self.criterion(logits, y), or self.value / self.cumulated_batch_size?
The loss I get from loss = self.criterion(logits, y) gives unnormalized gradients. So does PL normalize the gradients automatically? Do we only need to compute the per-batch loss and pass it to self.log? That is, if I set trainer = Trainer(accumulate_grad_batches=K), will PL automatically normalize the gradients used for backward, even if I just return the per-batch loss?
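(Sketch of the Trainer setup this question refers to; `K = 4` is just an example value, and the fit call assumes the `LitClassifier` sketched above.)

```python
from pytorch_lightning import Trainer

K = 4  # example accumulation factor
trainer = Trainer(accumulate_grad_batches=K, max_epochs=1)
# trainer.fit(LitClassifier(), train_dataloaders=...)  # dataloader omitted here
```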
As discussed here, the step in W&B is the optimization step, which should equal self.global_step. The loss value corresponding to self.global_step should then be calculated as loss * K. Is it OK that we just pass the per-batch loss to self.log?
For LearningRateMonitor, is the step in logging_interval also self.global_step?
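(Sketch of the logging setup the last two questions refer to; the project name is made up, and both the WandbLogger and the LearningRateMonitor are attached to the same Trainer.)

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="grad-accum-demo")        # hypothetical project name
lr_monitor = LearningRateMonitor(logging_interval="step")    # log the learning rate per step

trainer = Trainer(
    logger=wandb_logger,
    callbacks=[lr_monitor],
    accumulate_grad_batches=4,
)
```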