False positive iterable dataset warning for LitData StreamingDataset #20166

Bug description

This warning appears when training with a LitData `StreamingDataset` via the `Trainer`, even though it does not apply here: `StreamingDataset` shards samples across workers itself, so its `__len__` stays accurate:

/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:122: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
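
For context, the check lives in `lightning/pytorch/utilities/data.py` (the path in the warning above) and only tests whether an `IterableDataset` defines `__len__`; it has no way to know that the dataset already handles worker sharding. A minimal sketch (the class name is illustrative, not from the report) that emits the same warning when its loader is passed to `Trainer.fit`:

from torch.utils.data import DataLoader, IterableDataset


class SizedIterable(IterableDataset):
    """Illustrative stand-in: defines __len__ just like litdata's StreamingDataset."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))

    def __len__(self):
        # Defining __len__ on an IterableDataset is all it takes for
        # Lightning's check to warn; it cannot tell whether the dataset
        # duplicates data across workers or shards it correctly.
        return self.n


# Passing this loader to Trainer.fit triggers the same warning.
loader = DataLoader(SizedIterable(100), batch_size=10, num_workers=2)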

What version are you seeing the problem on?

v2.4, master

How to reproduce the bug

import torch
import litgpt
from litgpt import GPT
from litgpt.pretrain import initialize_weights
from litdata.streaming import StreamingDataLoader, StreamingDataset, TokensLoader
import lightning as L


class LitLLM(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = GPT.from_name(name="micro-llama-300M")

    def on_train_start(self):
        initialize_weights(self.trainer, self.model, n_layer=self.model.config.n_layer, n_embd=self.model.config.n_embd)

    def training_step(self, batch):
        input_ids = batch.long()
        logits = self.model(input_ids)
        # Next-token prediction: compare logits at position t with tokens at t+1.
        loss = litgpt.utils.chunked_cross_entropy(logits[..., :-1, :], input_ids[..., 1:])
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        warmup_steps = 500
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=4e-4, weight_decay=0.1, betas=(0.9, 0.95))
        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: min(step / warmup_steps, 1.0))
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}


if __name__ == "__main__":
    # StreamingDataset defines __len__, which is what trips the warning once
    # the loader reaches Trainer.fit below.
    train_dataset = StreamingDataset("s3://tinyllama-template/slimpajama/train", item_loader=TokensLoader(block_size=128))
    train_dataloader = StreamingDataLoader(train_dataset, shuffle=True, batch_size=12, num_workers=1)

    trainer = L.Trainer(
        max_epochs=1,
        accumulate_grad_batches=4,
        precision="bf16-mixed",
    )
    with trainer.init_module(empty_init=True):
        model = LitLLM()

    trainer.fit(model, train_dataloader)
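
Until the check special-cases datasets that shard internally, the warning can be silenced by message before building the Trainer. A sketch, assuming the warning is routed through Python's standard warnings machinery (which rank_zero_warn uses) and matching the message text from the log above:

import warnings

# Filter by message text; the warning category is left unspecified on
# purpose, since the exact warning class is an implementation detail.
warnings.filterwarnings(
    "ignore",
    message=r"Your `IterableDataset` has `__len__` defined\.",
)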

Error messages and logs

None

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

cc @justusschock @awaelchli

Metadata

    Labels

    bug (Something isn't working), data handling (Generic data-related topic)
