[Activation-checkpointing] add tensor dedup and param offloading #4247

kashif · 2025-10-10T08:51:18Z

What does this PR do?

Prevents redundant offloading when multiple tensor views share the same storage
Tracks and filters out model parameters during offloading
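
To illustrate the first point, here is a minimal sketch (not the PR's implementation) of how several views can be recognized as sharing one storage. The helper name `count_unique_storages` is hypothetical, and `Tensor.untyped_storage()` assumes PyTorch 2.0 or later.

```python
import torch


def count_unique_storages(tensors):
    """Count distinct underlying storages among the given tensors (hypothetical helper)."""
    seen = set()
    for t in tensors:
        # The storage pointer is the base address of the allocation,
        # so it is identical for every view of the same tensor.
        seen.add(t.untyped_storage().data_ptr())
    return len(seen)


base = torch.randn(4, 8)
views = [base, base.view(8, 4), base[1:], base.t()]  # four views, one storage
other = torch.randn(4, 8)                            # a separate allocation

unique = count_unique_storages(views + [other])
print(f"{len(views) + 1} tensors, {unique} unique storages")
```

Offloading each of the five tensors separately would copy the same bytes several times; keying on the storage lets the shared data be copied once.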

kashif · 2025-10-10T08:53:29Z

@sywangyi would you be kind enough to test this on your hardware and give me some feedback? thank you!

HuggingFaceDocBuilderDev · 2025-10-10T08:53:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2025-10-10T16:07:44Z

trl/models/activation_offloading.py


+# Try to import DTensor for FSDP v2 support
+try:
+    from torch.distributed._tensor import DTensor


Do you know in which version DTensor was introduced? I'm wondering if this try/except is needed.

qgallouedec · 2025-10-10T16:08:56Z

trl/models/activation_offloading.py

+
+            # Check if tensor is a parameter or buffer
+            if isinstance(activation, torch.nn.Parameter) or (
+                hasattr(torch.nn, "Buffer") and isinstance(activation, torch.nn.Buffer)


Same question here: is Buffer a recent addition?

No, buffer has always been there from the start. I can clean this up.
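
One way to avoid depending on whether `torch.nn.Buffer` exists is to collect the storages of everything the module reports through `parameters()` and `buffers()`, and compare candidate tensors against that set. This is a sketch under that assumption, not the code in the PR; `collect_static_storage_ptrs` and `is_param_or_buffer` are hypothetical names.

```python
import torch
from torch import nn


def collect_static_storage_ptrs(model: nn.Module) -> set:
    """Storage pointers of all parameters and buffers (hypothetical helper)."""
    ptrs = set()
    for t in list(model.parameters()) + list(model.buffers()):
        ptrs.add(t.untyped_storage().data_ptr())
    return ptrs


def is_param_or_buffer(tensor: torch.Tensor, static_ptrs: set) -> bool:
    """True for parameters, buffers, and views that share their storage."""
    if isinstance(tensor, nn.Parameter):
        return True
    return tensor.untyped_storage().data_ptr() in static_ptrs


model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
static_ptrs = collect_static_storage_ptrs(model)

weight_view = model[0].weight.detach().t()  # a view, no longer an nn.Parameter
activation = model(torch.randn(2, 4))       # freshly allocated output

print(is_param_or_buffer(weight_view, static_ptrs))
print(is_param_or_buffer(activation, static_ptrs))
```

Comparing storages rather than types also covers views of a parameter (such as a transposed weight), which an `isinstance` check alone would miss.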

SunMarc

LGTM for the FSDP v2 part!

SunMarc · 2025-10-10T16:27:32Z

trl/models/activation_offloading.py


+# Try to import DTensor for FSDP v2 support
+try:
+    from torch.distributed._tensor import DTensor


For reference we have the following in accelerate:

if is_torch_version(">=", "2.5.0"):
    from torch.distributed.tensor import DTensor
else:
    # from torch 2.0.0 (oldest supported accelerate torch version), DTensor is in torch.distributed._tensor
    from torch.distributed._tensor import DTensor

We also need to check torch.distributed.is_available(); otherwise you might hit an import error.
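
Putting the two suggestions together, a guarded import might look like the sketch below. It is an illustration only: it falls back between the two import paths with `try/except ImportError` instead of the `is_torch_version` check quoted above, and the function name `resolve_dtensor` is hypothetical.

```python
import importlib.util


def resolve_dtensor():
    """Return the DTensor class, or None when it cannot be imported (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed at all

    import torch.distributed as dist

    if not dist.is_available():
        return None  # PyTorch was built without distributed support

    try:
        # Public location in newer PyTorch releases
        from torch.distributed.tensor import DTensor
    except ImportError:
        try:
            # Private location used by older releases
            from torch.distributed._tensor import DTensor
        except ImportError:
            return None
    return DTensor


DTensor = resolve_dtensor()
print("DTensor available:", DTensor is not None)
```

Callers can then write `DTensor is not None and isinstance(p, DTensor)`, as the diff in this PR does.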

S1ro1 · 2025-10-10T22:06:34Z

trl/models/activation_offloading.py

+    Returns:
+        A tuple of (storage_pointer, dtype) that uniquely identifies the tensor's storage
+    """
+    storage_ptr = tensor.untyped_storage().data_ptr() + tensor.storage_offset()


Using data_ptr() can be tricky with, for example, TorchAO quantized tensors, since those can return 0 for data_ptr(). I don't have a concrete example; it is just something to be aware of.

Edit: here is an example (float8linear, where this can sometimes happen): https://github.com/huggingface/accelerate/blob/f0313a64a2f3de359924c85a98ee010c47b846ec/src/accelerate/accelerator.py#L3842
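
A defensive version of the storage key could treat a zero or unavailable pointer as "identity unknown" and skip deduplication for that tensor. The sketch below is an assumption about how such a guard might look, not the PR's code; `storage_key` and `dedup` are hypothetical names, and the key keeps the element offset as a separate field rather than adding it to the byte pointer.

```python
from typing import Optional, Tuple

import torch


def storage_key(tensor: torch.Tensor) -> Optional[Tuple[int, int, torch.dtype]]:
    """Return a dedup key, or None when the storage pointer is not trustworthy."""
    try:
        ptr = tensor.untyped_storage().data_ptr()
    except (RuntimeError, NotImplementedError):
        return None  # some tensor subclasses do not expose a real storage
    if ptr == 0:
        return None  # a null pointer cannot identify a storage
    return (ptr, tensor.storage_offset(), tensor.dtype)


def dedup(tensors):
    """Keep the first tensor per key; tensors without a key are always kept."""
    seen, unique = set(), []
    for t in tensors:
        key = storage_key(t)
        if key is None or key not in seen:
            unique.append(t)
        if key is not None:
            seen.add(key)
    return unique


a = torch.randn(3, 3)
b = torch.randn(3, 3)
kept = dedup([a, a.view(9), a.t(), b])
print(len(kept))
```

Tensors whose key is `None` are never merged with anything, which trades a possible redundant copy for correctness.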

S1ro1 · 2025-10-10T22:10:53Z

trl/models/activation_offloading.py

+            # For FSDP v2: extract local tensor from DTensor
+            actual_tensor = p
+            if DTensor is not None and isinstance(p, DTensor) and hasattr(p, "_local_tensor"):
+                actual_tensor = p._local_tensor


Again, something to watch out for: if FP8 is used, data_ptr() can return 0. See: https://github.com/huggingface/accelerate/blob/f0313a64a2f3de359924c85a98ee010c47b846ec/src/accelerate/accelerator.py#L3842

sywangyi · 2025-10-11T06:35:41Z

@sywangyi would you be kind enough to test this on your hardware and give me some feedback? thank you!

pytest tests/test_activation_offloading.py::TestActivationOffloading::test_parameter_filtering
pytest tests/test_activation_offloading.py::TestActivationOffloading::test_tensor_deduplication

These two cases pass on Intel XPU.

kashif · 2025-10-11T20:08:46Z

@S1ro1 OK, I'll just skip FP8 activations.
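
Skipping FP8 activations can be done with a dtype check. The following is a sketch assuming the float8 dtypes that recent PyTorch versions expose; the names are looked up with `hasattr` so it also runs on builds that lack some of them. `is_fp8` and `select_offloadable` are hypothetical helpers, not functions from the PR.

```python
import torch

# Float8 dtype names; which ones exist depends on the PyTorch version.
_FP8_NAMES = ("float8_e4m3fn", "float8_e5m2", "float8_e4m3fnuz", "float8_e5m2fnuz")
FP8_DTYPES = frozenset(getattr(torch, name) for name in _FP8_NAMES if hasattr(torch, name))


def is_fp8(tensor: torch.Tensor) -> bool:
    """True when the tensor uses one of the known float8 dtypes."""
    return tensor.dtype in FP8_DTYPES


def select_offloadable(tensors):
    """Drop FP8 tensors so they stay on the device (hypothetical helper)."""
    return [t for t in tensors if not is_fp8(t)]


activations = [torch.randn(2, 2), torch.randn(2, 2).to(torch.bfloat16)]
if FP8_DTYPES:
    # torch.empty avoids relying on any float8 compute kernels
    activations.append(torch.empty(2, 2, dtype=next(iter(FP8_DTYPES))))

offloadable = select_offloadable(activations)
print(len(offloadable))
```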

kashif · 2025-10-17T20:42:08Z

@SunMarc I have added support for broadcast and non-contiguous tensors
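
For context, broadcast (stride 0) and non-contiguous tensors cannot be rebuilt from their shape alone; size, stride, and storage offset all have to be preserved. The sketch below shows one way to do that with `torch.as_strided`. It is an illustration under stated assumptions, not the PR's implementation, and `offload_with_layout` / `restore_with_layout` are hypothetical names.

```python
import torch


def offload_with_layout(t: torch.Tensor):
    """Copy the underlying storage to CPU and record the view's layout."""
    t = t.detach()
    n = t.untyped_storage().nbytes() // t.element_size()
    # Flat, contiguous view over the whole storage (absolute offset 0)
    flat = torch.as_strided(t, (n,), (1,), 0)
    cpu_copy = flat.to("cpu", copy=True)
    layout = (tuple(t.size()), t.stride(), t.storage_offset())
    return cpu_copy, layout


def restore_with_layout(cpu_copy: torch.Tensor, layout, device="cpu"):
    """Move the storage back and re-create the original view."""
    size, stride, offset = layout
    return torch.as_strided(cpu_copy.to(device), size, stride, offset)


broadcast = torch.arange(3.0).expand(4, 3)          # stride (0, 1), 3 stored elements
transposed = torch.arange(6.0).reshape(2, 3).t()    # non-contiguous view
sliced = torch.arange(10.0)[2:7:2]                  # non-zero offset, stride 2

for original in (broadcast, transposed, sliced):
    saved, layout = offload_with_layout(original)
    restored = restore_with_layout(saved, layout)
    print(torch.equal(restored, original), restored.stride() == original.stride())
```

Note that this sketch copies the entire underlying storage, so the broadcast case transfers only 3 elements instead of 12, while a small slice of a large storage would transfer more than the slice itself.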

sergiopaniego

lgtm!

kashif added 2 commits October 10, 2025 08:46

add tensor dedup and param offloading

c2f840d

fix formatting

d4bf577

kashif added 2 commits October 10, 2025 09:02

check if unique_storages_offloaded < total tensors

d86e825

fix for FSDP v2

f70a613

qgallouedec reviewed Oct 10, 2025

SunMarc approved these changes Oct 10, 2025

S1ro1 reviewed Oct 10, 2025

kashif added 2 commits October 11, 2025 17:16

Merge branch 'main' into activation-dedup

6dd3fc6

ignore fp8

3fab796

kashif added 6 commits October 15, 2025 09:50

Merge branch 'main' into activation-dedup

14bb4c5

Merge branch 'main' into activation-dedup

a463d40

checking if events exist before accessing them

3a29eb8

preserve stride information

254a34d

handle both broadcast and non-broadcast cases!

4660b1c

Merge branch 'main' into activation-dedup

e3f0d3c

sergiopaniego and others added 3 commits October 21, 2025 11:45

Merge branch 'main' into activation-dedup

fa35f8d

fix test

fcd0e12

Merge branch 'activation-dedup' of https://github.com/huggingface/trl …

ecadf3d

…into activation-dedup

sergiopaniego approved these changes Oct 21, 2025

kashif merged commit e2ab435 into main Oct 21, 2025
10 of 12 checks passed

kashif deleted the activation-dedup branch October 21, 2025 10:34
