False positive iterable dataset warning for LitData StreamingDataset #20166

Bug description

This warning appears when training with a LitData `StreamingDataset` via the `Trainer`, even though it does not apply here: `StreamingDataset` shards samples across workers itself, so its `__len__` stays accurate:

/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:122: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
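
For context, the check lives in `lightning/pytorch/utilities/data.py` (the path in the warning above) and only tests whether an `IterableDataset` defines `__len__`; it has no way to know that the dataset already handles worker sharding. A minimal sketch (the class name is illustrative, not from the report) that emits the same warning when its loader is passed to `Trainer.fit`:

from torch.utils.data import DataLoader, IterableDataset


class SizedIterable(IterableDataset):
    """Illustrative stand-in: defines __len__ just like litdata's StreamingDataset."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))

    def __len__(self):
        # Defining __len__ on an IterableDataset is all it takes for
        # Lightning's check to warn; it cannot tell whether the dataset
        # duplicates data across workers or shards it correctly.
        return self.n


# Passing this loader to Trainer.fit triggers the same warning.
loader = DataLoader(SizedIterable(100), batch_size=10, num_workers=2)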

What version are you seeing the problem on?

v2.4, master

How to reproduce the bug

import torch
import litgpt
from litgpt import GPT
from litgpt.pretrain import initialize_weights
from litdata.streaming import StreamingDataLoader, StreamingDataset, TokensLoader
import lightning as L


class LitLLM(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = GPT.from_name(name="micro-llama-300M")

    def on_train_start(self):
        initialize_weights(self.trainer, self.model, n_layer=self.model.config.n_layer, n_embd=self.model.config.n_embd)

    def training_step(self, batch):
        input_ids = batch.long()
        logits = self.model(input_ids)
        # Next-token prediction: compare logits at position t with tokens at t+1.
        loss = litgpt.utils.chunked_cross_entropy(logits[..., :-1, :], input_ids[..., 1:])
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        warmup_steps = 500
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=4e-4, weight_decay=0.1, betas=(0.9, 0.95))
        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: min(step / warmup_steps, 1.0))
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}


if __name__ == "__main__":
    # StreamingDataset defines __len__, which is what trips the warning once
    # the loader reaches Trainer.fit below.
    train_dataset = StreamingDataset("s3://tinyllama-template/slimpajama/train", item_loader=TokensLoader(block_size=128))
    train_dataloader = StreamingDataLoader(train_dataset, shuffle=True, batch_size=12, num_workers=1)

    trainer = L.Trainer(
        max_epochs=1,
        accumulate_grad_batches=4,
        precision="bf16-mixed",
    )
    with trainer.init_module(empty_init=True):
        model = LitLLM()

    trainer.fit(model, train_dataloader)
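
Until the check special-cases datasets that shard internally, the warning can be silenced by message before building the Trainer. A sketch, assuming the warning is routed through Python's standard warnings machinery (which rank_zero_warn uses) and matching the message text from the log above:

import warnings

# Filter by message text; the warning category is left unspecified on
# purpose, since the exact warning class is an implementation detail.
warnings.filterwarnings(
    "ignore",
    message=r"Your `IterableDataset` has `__len__` defined\.",
)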

Error messages and logs

None

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

cc @justusschock @awaelchli

Metadata

    Labels

    bug (Something isn't working), data handling (Generic data-related topic)
