jiawenliu64
Member

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1869

Enable CUTLASS grouped GEMM for the llama4x pretraining backward pass (grad) on GB200 and H100.

Next steps:

  1. Only dgrad is enabled so far; a new kernel for wgrad will follow.
  2. Further optimize performance on GB200.
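
For reference, a minimal pure-Python sketch of what a grouped GEMM computes in fprop and dgrad. This is not the CUTLASS kernel or the FBGEMM API; the function names, the `m_sizes` argument, and the `[G][N][K]` weight layout are assumptions made for illustration only.

```python
def matmul(a, b):
    """Plain-Python product of a [m x k] and b [k x n] (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def split_rows(x, m_sizes):
    """Split the rows of x into consecutive groups with the given row counts."""
    groups, start = [], 0
    for m in m_sizes:
        groups.append(x[start:start + m])
        start += m
    return groups


def grouped_gemm_fprop(x, w, m_sizes):
    """Forward: Y_g = X_g @ W_g^T per group, outputs stacked along rows.

    Assumed layout (hypothetical): x is [M][K], w is [G][N][K],
    and sum(m_sizes) == M.
    """
    y = []
    for x_g, w_g in zip(split_rows(x, m_sizes), w):
        y.extend(matmul(x_g, transpose(w_g)))
    return y


def grouped_gemm_dgrad(dy, w, m_sizes):
    """Data gradient: dX_g = dY_g @ W_g per group, stacked along rows.

    dy is [M][N]; the result has the same [M][K] shape as the forward input.
    """
    dx = []
    for dy_g, w_g in zip(split_rows(dy, m_sizes), w):
        dx.extend(matmul(dy_g, w_g))
    return dx


# Two groups (G=2), K=2, N=3; the first two rows use w[0], the last row w[1].
x = [[1, 2], [3, 4], [5, 6]]
w = [
    [[1, 0], [0, 1], [1, 1]],
    [[2, 0], [0, 2], [1, -1]],
]
y = grouped_gemm_fprop(x, w, m_sizes=[2, 1])
dx = grouped_gemm_dgrad([[1, 0, 0], [0, 1, 1], [1, 1, 1]], w, m_sizes=[2, 1])
```

Under the same assumed layout, wgrad would be `dW_g = dY_g^T @ X_g` per group, which is the kernel listed above as a follow-up.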

Reviewed By: jwfromm

Differential Revision: D81997154

jiawenliu64 and others added 2 commits September 10, 2025 13:09
Summary:
X-link: facebookresearch/FBGEMM#1868

Optimize the BF16 CUTLASS grouped GEMM (GMM) for a 1.1x to 1.3x speedup on llama4x pretraining fprop shapes. More results can be found in this [spreadsheet](https://docs.google.com/spreadsheets/d/172Nm0F9K6XJenNFoNFqC5Sp1Ll2KhLtfOJpIfkuHDzc/edit?usp=sharing).

Differential Revision: D81704026
… H100

Summary:
X-link: facebookresearch/FBGEMM#1869

Enable CUTLASS grouped GEMM for the llama4x pretraining backward pass (grad) on GB200 and H100.

Next steps:
1. Only dgrad is enabled so far; a new kernel for wgrad will follow.
2. Further optimize performance on GB200.

Reviewed By: jwfromm

Differential Revision: D81997154

netlify bot commented Sep 10, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 607e342
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68c1dff3dda5be0008929fe1
😎 Deploy Preview https://deploy-preview-4856--pytorch-fbgemm-docs.netlify.app

@meta-cla meta-cla bot added the cla signed label Sep 10, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D81997154

@facebook-github-bot
Contributor

This pull request has been merged in dc0ab6d.
