Support tuning cache for Cutlass FP8 GEMM #4301
Conversation
This pull request was exported from Phabricator. Differential Revision: D75541025
Summary:
Pull Request resolved: pytorch#4301
X-link: facebookresearch/FBGEMM#1377

This diff adds support for the tuning cache to the kernel. There should be no performance changes to the existing heuristics.

- Refactored the kernel dispatch logic to return the kernel function instead of invoking it directly, which removes some duplication of the kernel invoke (see the sketch below).
- The next diff in this stack, D75820688, will add the new kernels, to keep this review easier.
- Note that there are some issues with adding the new kernels: this kernel actually compiles 12 variants for each configuration (see D75820688 for more context). So for now the new kernels won't be added in D75820688, but the kernel can still be onboarded to auto-tuning in case someone wants to compile them locally. D75820688 will be revisited later.

Reviewed By: q10, jiawenliu64

Differential Revision: D75541025
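The dispatch refactor described above is easiest to see in code. Below is a minimal, self-contained C++ sketch of the pattern, assuming hypothetical names (`GemmArgs`, `dispatch_fp8_gemm`, and the stub tile variants); it is not FBGEMM's actual internal API, only an illustration of returning the kernel function so the invoke exists in one place.

```cpp
// Sketch only: illustrative names, not FBGEMM's real dispatch code.
#include <cstdint>
#include <cstdio>

struct GemmArgs {
  const void* XQ; // FP8 activations
  const void* WQ; // FP8 weights
  void* out;      // BF16 output
  int64_t M, N, K;
};

using KernelFn = void (*)(const GemmArgs&);

// Stub tile-configuration variants standing in for the compiled
// CUTLASS instantiations.
void gemm_64x128x128(const GemmArgs& /*args*/)  { std::printf("64x128 tile\n"); }
void gemm_128x128x128(const GemmArgs& /*args*/) { std::printf("128x128 tile\n"); }

// Before the refactor, each heuristic branch launched its kernel
// directly, duplicating the launch code; now dispatch only selects.
KernelFn dispatch_fp8_gemm(int64_t M, int64_t N, int64_t K) {
  if (M <= 64) {
    return gemm_64x128x128; // small-M heuristic (illustrative threshold)
  }
  return gemm_128x128x128; // default tile config
}

int main() {
  GemmArgs args{nullptr, nullptr, nullptr, /*M=*/32, /*N=*/4096, /*K=*/4096};
  dispatch_fp8_gemm(args.M, args.N, args.K)(args); // single invoke site
}
```

A side benefit of this shape is that a tuning cache can reuse the same selection point: on the tuned path, the dispatch function consults the cache instead of the shape heuristic, and the single invoke site stays unchanged.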
Differential Revision: D75540999
Differential Revision: D75541013
Differential Revision: D75806957
This pull request has been merged in 4c9313f.
Summary:
This diff adds support for the tuning cache to the kernel. There should be no performance changes to the existing heuristics.
Reviewed By: q10, jiawenliu64
Differential Revision: D75541025
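For readers unfamiliar with how a tuning cache differs from the fixed heuristics, the sketch below shows the generic lookup-or-benchmark pattern such a cache implements: on a miss for a given problem shape, time every registered candidate kernel and remember the fastest. This is a simplified CPU-side illustration with hypothetical names, not FBGEMM's actual tuning-cache implementation; a real version would synchronize the GPU and average several timed runs per candidate.

```cpp
// Sketch only: generic lookup-or-benchmark tuning cache.
#include <chrono>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

using KernelFn = void (*)(int64_t M, int64_t N, int64_t K);
using ShapeKey = std::tuple<int64_t, int64_t, int64_t>;

KernelFn tune(const std::vector<KernelFn>& candidates,
              int64_t M, int64_t N, int64_t K) {
  static std::map<ShapeKey, KernelFn> cache; // persists across calls
  ShapeKey key{M, N, K};
  if (auto it = cache.find(key); it != cache.end()) {
    return it->second; // hit: reuse the previously tuned kernel
  }
  KernelFn best = nullptr;
  auto best_time = std::chrono::duration<double>::max();
  for (KernelFn k : candidates) {
    auto t0 = std::chrono::steady_clock::now();
    k(M, N, K); // real code would sync the GPU and average several runs
    auto dt = std::chrono::steady_clock::now() - t0;
    if (dt < best_time) { best_time = dt; best = k; }
  }
  cache[key] = best; // miss: remember the fastest candidate
  return best;
}

// Stub candidates standing in for compiled CUTLASS variants.
void k1(int64_t, int64_t, int64_t) {}
void k2(int64_t, int64_t, int64_t) {}

int main() {
  KernelFn best = tune({k1, k2}, 32, 4096, 4096); // first call benchmarks
  best(32, 4096, 4096);                           // later calls hit the cache
}
```

This also explains why the 12-variants-per-configuration compile cost noted in the summary matters: every variant onboarded to auto-tuning is another candidate that must be compiled, even though only the cached winner runs after the first call.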