
Conversation

@t-vi (Collaborator) commented Oct 12, 2025

What does this PR do?

This implements inference-only FP8 linears using Transformer Engine.
It is modelled after the bitsandbytes transform.

One question I have is what a good scaling would be. I'm currently using the max range (fp8_max / tensor.absmax()) on the weight and 1.0 on the input, but I have no idea what a good input scale would be (it depends on the accumulation in fp8 matmuls).
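For reference, a minimal sketch of the "max range" weight scale described above; the E4M3 maximum of 448 and the helper name are illustrative assumptions, not code from this PR:

import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def weight_scale(weight: torch.Tensor) -> torch.Tensor:
    # "max range" scaling: map the weight's absolute maximum onto the FP8 E4M3 maximum.
    absmax = weight.abs().amax().to(torch.float32)
    return FP8_E4M3_MAX / absmax

# The input scale is simply left at 1.0 for now, pending a better heuristic.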

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

Comment on lines 35 to 64
def te_linear_fp8_impl(x, qweight, bias, absmax, scale):
    # Quantizer for the pre-quantized weight; scale/amax were computed at transform time.
    wq = transformer_engine.pytorch.Float8Quantizer(
        scale=scale,
        amax=absmax,
        fp8_dtype=transformer_engine_torch.DType.kFloat8E4M3,
        rowwise=True,
        columnwise=False,
    )

    w = wq.create_tensor_from_data(qweight, fake_dtype=x.dtype, requires_grad=False)

    # Per-tensor absmax of the input, used to build the input quantizer on the fly.
    minmax = x.aminmax()
    xmax = torch.maximum(minmax.min.abs(), minmax.max.abs()).to(torch.float32)
    xq = transformer_engine.pytorch.Float8Quantizer(
        scale=1.0 / xmax,  # this needs to be 1 (or even somewhat smaller for accumulation?)
        amax=xmax,
        fp8_dtype=transformer_engine_torch.DType.kFloat8E4M3,
        rowwise=True,
        columnwise=False,
    )

    out, *_ = transformer_engine.pytorch.ops.BasicLinear._functional_forward(
        x,
        w,
        input_quantizer=xq,
        with_quantized_compute=True,
        weight_requires_grad=False,
        input_requires_grad=False,
    )

@t-vi (Author) commented Oct 12, 2025


This is the core of how I compute linears. Does this make sense? What is a good scale for the input? I use 1/absmax for now so that x[j, k] * w[i, k] stays below fp8_max, but I'm not sure whether that is needed.
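To spell out the bound aimed at here (a hedged illustration, not part of the PR): with an input scale of 1 / absmax(x), the quantized activations lie in roughly [-1, 1], so each elementwise product with an FP8 weight stays within the E4M3 range; only the accumulation over k can grow beyond it.

import torch

x = torch.randn(8, 64)
amax = x.abs().amax().to(torch.float32)
x_scaled = x * (1.0 / amax)      # what the quantizer sees with scale = 1 / absmax
print(x_scaled.abs().max())      # ~1.0, so |x_q * w_q| <= |w_q| <= 448 (the E4M3 max) per element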


    minmax = x.aminmax()
    xmax = torch.maximum(minmax.min.abs(), minmax.max.abs()).to(torch.float32)
    xq = transformer_engine.pytorch.Float8Quantizer(
A collaborator commented on the quoted lines above:

Do we necessarily want to use Float8Quantizer, which ties in with the DelayedScaling recipe (but works on both Hopper and Blackwell archs)?

Instead we could also use MXFP8Quantizer (note: this is only supported on Blackwell). For that recipe, the quantization of the input does not depend on a scale from the previous iteration.

https://github.com/NVIDIA/TransformerEngine/blob/7ad130efd52c3aa4a386d25f1d42b28d5aa20090/transformer_engine/pytorch/tensor/mxfp8_tensor.py#L29
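For concreteness, a hedged sketch of what constructing an MXFP8 input quantizer might look like; the import path and constructor arguments are inferred from the linked mxfp8_tensor.py and are assumptions, not something exercised in this PR:

import transformer_engine.pytorch
import transformer_engine_torch

# MXFP8 computes block-wise scales at quantization time, so no amax/scale state
# from a previous iteration needs to be tracked (Blackwell only).
xq = transformer_engine.pytorch.MXFP8Quantizer(
    fp8_dtype=transformer_engine_torch.DType.kFloat8E4M3,
    rowwise=True,
    columnwise=False,
)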

@t-vi (Author) commented Oct 13, 2025

Sadly, I don't currently have easy access to Blackwell, so I'm keen to support Hopper, too.
Of course, I'd 100% love to have something flexible enough for fp8 + fp4 on Hopper / Blackwell to the extent that it is supported.

@t-vi t-vi merged commit a955b66 into main Oct 17, 2025
48 of 51 checks passed
@t-vi t-vi deleted the tom/te-inference branch October 17, 2025 13:15