
[NF4] Support nf4 tensor shard and gather #2449


Open · wants to merge 12 commits into main

Conversation

@mori360 commented Jun 26, 2025

Add nf4_all_gather_into_tensor and scatter_nf4tensor to enable dispatch of scatter and all_gather_into_tensor on NF4 tensors.
Add a unit test showing that an NF4 tensor stays the same after being distributed and gathered.
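
For context, a minimal sketch of how the two new handlers could plug into NF4Tensor's dispatch table; the op keys and handler bodies are illustrative assumptions, not the PR's exact code:

```python
import torch

c10d_functional = torch.ops.c10d_functional

def nf4_all_gather_into_tensor(func, *args, **kwargs):
    # Sketch: run the collective on each inner tensor of the NF4 subclass,
    # then rebuild an NF4Tensor from the gathered pieces (elided here).
    ...

def scatter_nf4tensor(func, *args, **kwargs):
    # Sketch: scatter each inner tensor shard to its destination rank (elided).
    ...

# Route the collectives through __torch_dispatch__; the scatter op key
# below is an assumption.
NF4_OPS_TABLE = {
    c10d_functional.all_gather_into_tensor.default: nf4_all_gather_into_tensor,
    torch.ops.c10d.scatter_.default: scatter_nf4tensor,
}
```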

@pytorch-bot bot commented Jun 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2449

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 2b2f768 with merge base 8b57afe:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Jun 26, 2025
@mori360 marked this pull request as ready for review June 26, 2025 19:30
@mori360 added the topic: for developers label Jun 26, 2025
@mori360 marked this pull request as draft June 26, 2025 21:06
@msaroufim (Member) commented Jun 26, 2025

Presumably we should observe a comms time reduction? Would be nice to see some profiles.

@mori360 (Author) commented Jun 26, 2025

> Presumably we should observe a comms time reduction? Would be nice to see some profiles.

Is there any baseline I could compare with?

@drisspg requested a review from weifengpy June 26, 2025 23:28
    )
    updated_attrs.update(
        {
            "stride": (nf4tensor.size()[1], 1),
Contributor: Why hardcode the stride?

Author: Fixed.
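
For reference, the contiguous row-major stride can be derived from the size instead of being hardcoded; a minimal sketch (the helper name is hypothetical):

```python
def contiguous_stride(size):
    # Row-major: the stride of dim i is the product of all later dim sizes,
    # so a (512, 512) tensor gets stride (512, 1).
    stride = [1] * len(size)
    for i in range(len(size) - 2, -1, -1):
        stride[i] = stride[i + 1] * size[i + 1]
    return tuple(stride)

assert contiguous_stride((512, 512)) == (512, 1)  # matches the hardcoded case
```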

        )
    else:
        updated_attrs = {}
    if nf4tensor.numel() != nf4tensor.size()[0] * nf4tensor.size()[1]:
Contributor: Is it possible to have an out-of-bounds access with nf4tensor.size()[1]? In the else branch, could len(size) == 0?
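
A sketch of the guard this question is asking for, assuming the intent is to skip the check for 0-d and 1-d tensors (illustrative, not the PR's code):

```python
import torch

def numel_mismatch(nf4tensor: torch.Tensor) -> bool:
    # Check the rank first so size(1) can never index out of bounds
    # for 0-d or 1-d tensors.
    if nf4tensor.dim() < 2:
        return False
    return nf4tensor.numel() != nf4tensor.size(0) * nf4tensor.size(1)
```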

if input_tensors:
    for input_tensor in input_tensors[0]:
        if hasattr(input_tensor, attr):
            input_attrs.append(getattr(input_tensor, attr))
@weifengpy (Contributor) commented Jun 27, 2025: What happens when the tensor is not evenly divisible? Or is there a possibility of uneven sharding?

Author: If sharded unevenly (i.e., it did not go through the split dispatch), we compare the input and output sizes here.
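
A rough sketch of that input/output size comparison; the attribute-based structure mirrors the snippet above, but the helper and error message are assumptions:

```python
def check_gather_sizes(input_tensors, output_tensor, attr):
    # For shards that bypassed the split dispatch (uneven sharding), verify
    # the gathered output's inner tensor holds exactly as many elements as
    # all input shards combined.
    input_numel = sum(
        getattr(t, attr).numel() for t in input_tensors if hasattr(t, attr)
    )
    output_numel = getattr(output_tensor, attr).numel()
    if input_numel != output_numel:
        raise ValueError(
            f"uneven shard for {attr!r}: inputs total {input_numel} elements, "
            f"but output has {output_numel}"
        )
```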

@@ -22,7 +22,44 @@
c10d_functional = torch.ops.c10d_functional


NF4_OPS_TABLE: Dict[Any, Any] = {}
def nf4_all_gather_into_tensor(func, *args, **kwargs):
    nf4tensor = args[0][0]
Contributor: Can we assert len(args) and len(args[0]) before accessing them?
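
The defensive check being requested might look like this (a sketch; the assertion message is illustrative):

```python
def nf4_all_gather_into_tensor(func, *args, **kwargs):
    # Fail fast with a readable message instead of an IndexError.
    assert len(args) > 0 and len(args[0]) > 0, (
        f"all_gather_into_tensor expected non-empty args, got {args!r}"
    )
    nf4tensor = args[0][0]
    ...
```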


@pytest.mark.skipif(
    version.parse(torch.__version__) < version.parse("2.4.0"),
    reason="torch >= 2.4 required",
Contributor: Which API is needed in 2.4? DTensor?

@@ -435,7 +435,7 @@ def test_tensor_view_valid(self, input_size: Union[Tuple[int], int]):
        inner_tensor = getattr(viewed_tensor, attr)
        self.assertEqual(inner_tensor.size(0), inner_tensor.numel())

-    @parametrize("input_size", [(512 * 512,), (512, 512)])
+    @parametrize("input_size", [(512 * 512,)])
Contributor: Why remove (512, 512)?

@mori360 (Author) commented Jun 27, 2025: With tensor = torch.randn(512, 512), tensor.view(512, 512) is now valid after the changes to nf4_view, so this case moved to test_tensor_2d_view_valid.
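
For illustration, the relocated 2-D case might look roughly like this; the test body is a sketch, not the PR's exact code:

```python
import torch
from torchao.dtypes.nf4tensor import to_nf4

def test_tensor_2d_view_valid(input_size=(512, 512)):
    # A same-shape 2-D view should now round-trip after the nf4_view changes.
    nf4_tensor = to_nf4(torch.randn(input_size))
    viewed = nf4_tensor.view(input_size)
    assert viewed.size() == torch.Size(input_size)
```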

@weifengpy (Contributor) left a review: Left some comments; the PR probably needs more polish.

@weifengpy (Contributor) commented:

> Presumably we should observe a comms time reduction? Would be nice to see some profiles.

Perf should be on par. This BE refactoring upstreams NF4-specific logic from torchtune to DTensor, and it creates an example for others to follow when handling tensor subclasses + DTensor state dicts.

@mori360 marked this pull request as ready for review June 29, 2025 00:16
@mori360 requested a review from weifengpy June 29, 2025 00:16
Labels: CLA Signed · topic: for developers