[DRAFT] Add Lds transpose load (ds.read_tr16) support on gfx950 for f16/bf16 #2029
Draft
stefankoncarevic wants to merge 11 commits into develop from dsreadtr16_lds_transpose_load
+1,555
−295
This PR adds support for direct-to-LDS loads, allowing ThreadwiseReadIntoOp to write to LDS directly. We change the LDS layout to use this functionality and add new direct-to-LDS scheduling options.
- Implemented rock.lds_transpose_load TD definition supporting f16 and bf16.
- Added verifier to ensure source memref is in workgroup (LDS) memory and indices match rank.
- Implemented lowering pattern to amdgpu.transpose_load.
- Created MLIR tests covering FP16 and BF16 loads with FileCheck patterns.
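The verifier checks listed above can be sketched roughly as follows. This is a minimal Python model, not the rocMLIR C++ verifier: `MemRefType`, `verify_lds_transpose_load`, and the string-valued memory space are hypothetical stand-ins for the corresponding MLIR types, and the element-type check is an assumption inferred from the "supporting f16 and bf16" wording.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for MLIR concepts; these are not rocMLIR APIs.
WORKGROUP = "workgroup"                    # LDS address space
SUPPORTED_ELEMENT_TYPES = {"f16", "bf16"}  # assumed from the commit message


@dataclass(frozen=True)
class MemRefType:
    shape: tuple
    element_type: str
    memory_space: str


def verify_lds_transpose_load(source: MemRefType, num_indices: int) -> list:
    """Return a list of diagnostics; an empty list means the op is valid."""
    errors = []
    # The source must live in workgroup (LDS) memory.
    if source.memory_space != WORKGROUP:
        errors.append("source memref must be in workgroup (LDS) memory")
    # One index is required per memref dimension.
    rank = len(source.shape)
    if num_indices != rank:
        errors.append(f"expected {rank} indices, got {num_indices}")
    # Only 16-bit float element types are covered by this op.
    if source.element_type not in SUPPORTED_ELEMENT_TYPES:
        errors.append(f"unsupported element type: {source.element_type}")
    return errors
```

For example, `verify_lds_transpose_load(MemRefType((64, 16), "f16", "workgroup"), 2)` returns an empty list, while a global-memory source or a mismatched index count produces one diagnostic per violated rule.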
buffering pipeline When DirectToLDS is enabled, the pipeline now computes per-operand transpose decisions (decisionA, decisionB) based on MFMA shape and layout info before invoking LDS transpose loads.
This commit introduces the full implementation of LDS transpose load handling used in threadwise read and in the single-buffering and double-buffering pipelines. It adds logic for computing per-lane base offsets, generating LdsTransposeLoadOp instructions, and managing vectorized fragment loading for MFMA operations. The implementation supports multiple layout kinds (e.g., L16x16, L32x16, L32x8) and dynamically expands offsets for multi-K fused cases. This enables more flexible data movement between LDS and registers for MFMA input tiles.
LDS transpose load decisions for both A and B operands. It adds architecture-aware selection of transpose configurations using hwtranspose::makeDecision, based on MFMA instruction shape and per-block tile sizes. Threadwise reads from LDS now attach transpose metadata when applicable, allowing the backend to emit LDSTransposeLoadOp for efficient wave-level data rearrangement.
Add a safeguard to skip configurations where mPerBlock or nPerBlock exceeds 32, since larger tile sizes are not yet supported by the current LDS transpose load implementation.
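The safeguard described above amounts to a simple predicate on the per-block tile sizes. The sketch below is illustrative only: `lds_transpose_supported` and `MAX_TRANSPOSE_TILE` are hypothetical names, and the real check lives in the C++ code of this PR, where it may consider additional conditions.

```python
# Limit taken from the commit message: tiles larger than 32 are skipped.
MAX_TRANSPOSE_TILE = 32


def lds_transpose_supported(m_per_block: int, n_per_block: int) -> bool:
    """Return True when both tile sizes are within the supported limit."""
    return (m_per_block <= MAX_TRANSPOSE_TILE
            and n_per_block <= MAX_TRANSPOSE_TILE)


# Usage: keep only the (mPerBlock, nPerBlock) candidates that qualify.
candidates = [(16, 16), (32, 32), (64, 32), (32, 128)]
usable = [c for c in candidates if lds_transpose_supported(*c)]
```

A tile size of exactly 32 still qualifies, since the commit message only excludes sizes that exceed 32.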
f16 and bf16 data types, with multiple K-dimension configurations and schedule versions. Add CFG file to restrict execution to gfx950 architecture only, ensuring tests run exclusively on supported hardware. All test cases have passed validation under gfx950.
If this is based on the direct-to-LDS branch, can you please change the PR so that it merges into that branch? That would make it easier to review. We won't merge it into that branch, of course.
a709ab4 to fe5315c
fe5315c to 4a0b2ce
Motivation
DO NOT MERGE UNTIL #1906 IS MERGED
close: https://github.com/ROCm/rocMLIR-internal/issues/1858
Technical Details
This change adds full support for LDS transpose load integration within both the single-buffering and double-buffering pipelines.
The implementation enables transpose-aware LDS loading for operands A and B, provided that both matrices use compatible memory layouts.
Currently, the logic iterates only over the K dimension; iteration over the M and N dimensions is still under development and will be refined in the next update.
Future work will focus on performance evaluation and on optimizing bank conflict patterns during LDS access.
git diff direct to lds vs transpose load:
dsreadtr16_vs_direct_to_lds2.txt
Test Plan
Basic functionality was verified using the existing MFMA pipeline tests for both single and double buffering.
Next, I will extend the tests to cover various matrix layout configurations and measure execution performance.
A detailed performance table and LDS bank conflict statistics will be added in a later comment to quantify the improvements.
Test Result
Submission Checklist