Conversation

jiawenliu64 (Member)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1399

Further tune FP8 grouped GEMM for Llama4 shapes. We found a 13%-30% perf gain for Llama4 memory-bound cases. For other shapes in future models, there is room to further improve perf by adding more heuristics like the ones in this diff; we will revisit this and load the tuned configs from an offline cache file instead of hard-coding them in the kernel file.

Differential Revision: D76460456
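
To make the idea concrete, here is a minimal Python sketch of the kind of shape-based heuristic this diff describes. The function name, thresholds, and tile sizes are hypothetical illustrations, not FBGEMM's actual tuning tables:

```python
# Hypothetical sketch of a shape-based kernel-config heuristic, in the
# spirit of this diff. Thresholds and tile sizes are illustrative only.

def select_fp8_grouped_gemm_config(total_m: int, n: int, k: int) -> dict:
    """Pick a tile config from the grouped-GEMM problem shape.

    Memory-bound shapes (small total M across groups) favor small tiles so
    enough thread blocks exist to saturate memory bandwidth; compute-bound
    shapes favor large tiles for more data reuse per tile.
    """
    if total_m <= 64:
        # Memory-bound: tiny M, prioritize occupancy.
        return {"tile_m": 64, "tile_n": 16, "tile_k": 128}
    if total_m <= 512 and n <= 2048:
        # Intermediate shapes: balance occupancy and reuse.
        return {"tile_m": 64, "tile_n": 64, "tile_k": 128}
    # Compute-bound: large M, prioritize data reuse per tile.
    return {"tile_m": 128, "tile_n": 128, "tile_k": 128}
```

And a minimal sketch of the planned "offline cache file loading", assuming tuned configs are stored keyed by shape in a JSON file; the file format, key scheme, and fallback behavior are assumptions for illustration, not the actual design:

```python
import json
from functools import lru_cache

@lru_cache(maxsize=1)
def _load_tuned_configs(path: str = "fp8_grouped_gemm_tuning.json") -> dict:
    """Load shape -> config entries produced by an offline tuning sweep."""
    try:
        with open(path) as f:
            # e.g. {"G=128,M=64,N=5120,K=1024": {"tile_m": 64, ...}, ...}
            return json.load(f)
    except FileNotFoundError:
        # No cache file shipped: fall back to the built-in heuristics.
        return {}

def lookup_tuned_config(g: int, m: int, n: int, k: int):
    """Return the offline-tuned config for this shape, or None."""
    return _load_tuned_configs().get(f"G={g},M={m},N={n},K={k}")
```

Loading configs from a file rather than adding them under the kernel file keeps new-shape tuning a data change instead of a source change, so no recompile is needed when new model shapes are tuned.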

netlify bot commented Jun 11, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name | Link
🔨 Latest commit | 65b29c4
🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/684e61ed7b695300085c4dd5
😎 Deploy Preview | https://deploy-preview-4326--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D76460456

jiawenliu64 added a commit to jiawenliu64/FBGEMM that referenced this pull request Jun 15, 2025
Summary:
Pull Request resolved: pytorch#4326

X-link: facebookresearch/FBGEMM#1399

Further tune FP8 grouped GEMM for Llama4 shapes. We found a 13%-30% perf gain for Llama4 memory-bound cases. For other shapes in future models, there is room to further improve perf by adding more heuristics like the ones in this diff; we will revisit this and load the tuned configs from an offline cache file instead of hard-coding them in the kernel file.

Differential Revision: D76460456

jiawenliu64 added a commit to jiawenliu64/FBGEMM that referenced this pull request Jun 15, 2025

jiawenliu64 added a commit to jiawenliu64/FBGEMM that referenced this pull request Jun 16, 2025

@facebook-github-bot (Contributor)

This pull request has been merged in 6ee9646.
