
Add block ptr test for dot product with transpose #4510


Merged: 4 commits merged on Jun 17, 2025

Conversation

@alexbaden (Contributor)

Adds end-to-end dot-product-on-block-ptr tests to the block load unit test (maybe it should be renamed test_block_ptr.py?). Adds additional shapes, A transpose, and B transpose. The cold runtime (no cache) is approximately 1 minute on PVC 1100 in my environment. I picked the block shapes somewhat randomly, trying to balance breadth against runtime.

This somewhat duplicates tutorial 10, but allows us to run many more combinations in a shorter time. I added this because #4463 is passing CI but has a few bugs that are not caught by existing unit tests, including the tutorials.
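To give a sense of the coverage this buys, the shape/transpose matrix for such a test could be enumerated as below. This is a hypothetical sketch, not the PR's actual code: the shape list is illustrative (the PR picked its block shapes "somewhat randomly"), and a NumPy matmul stands in as the reference that the Triton kernel's output would be checked against.

```python
import itertools
import numpy as np

# Illustrative (M, N, K) shapes -- NOT the shapes chosen in the PR.
SHAPES = [(64, 64, 32), (128, 64, 64), (32, 128, 64)]
# All four combinations of (transpose A?, transpose B?).
TRANSPOSES = list(itertools.product([False, True], repeat=2))

def reference_dot(m, n, k, transpose_a, transpose_b, seed=0):
    """Build A (m x k) and B (k x n), optionally stored transposed in
    memory, and return the reference product C = A @ B."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((k, m) if transpose_a else (m, k))
    b = rng.standard_normal((n, k) if transpose_b else (k, n))
    a_mat = a.T if transpose_a else a
    b_mat = b.T if transpose_b else b
    return a_mat @ b_mat

cases = list(itertools.product(SHAPES, TRANSPOSES))
print(len(cases))  # 3 shapes x 4 transpose combos = 12 cases
c = reference_dot(64, 64, 32, transpose_a=True, transpose_b=False)
print(c.shape)  # (64, 64)
```

Each case would then run the block-ptr kernel and compare against the reference, so one parametrized test sweeps far more layouts than a single tutorial configuration.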

@whitneywhtsang (Contributor)

> I added this because #4463 is passing CI but has a few bugs that are not being caught by existing unit tests, including tutorials.

Is it because some configs/shapes are missing from tutorial 10 or is it because autotune would hide failures?

@alexbaden (Contributor, Author)

> Is it because some configs/shapes are missing from tutorial 10 or is it because autotune would hide failures?

Both reasons, plus tutorial 10 is more expensive because the overall shape is bigger. Here we make the overall shapes smaller, which dramatically improves runtime. That is fine because we only need shapes large enough to exercise different configurations within a single block, whereas tutorial 10 is meant to show performance and needs bigger shapes to really get good throughput.
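The point about small shapes can be illustrated with a plain NumPy blocked matmul: correctness of the block decomposition (including the boundary and accumulation logic) is fully exercised by matrices only a few blocks wide, so large "throughput" shapes are unnecessary for a functional test. This is an illustrative sketch, not the backend's lowering; the block sizes here are arbitrary.

```python
import numpy as np

def blocked_matmul(a, b, block_m=16, block_n=16, block_k=8):
    """Reference blocked GEMM: accumulate C tile-by-tile, the same
    decomposition a block-ptr kernel performs on hardware."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n))
    for i in range(0, m, block_m):
        for j in range(0, n, block_n):
            for p in range(0, k, block_k):
                c[i:i + block_m, j:j + block_n] += (
                    a[i:i + block_m, p:p + block_k]
                    @ b[p:p + block_k, j:j + block_n]
                )
    return c

rng = np.random.default_rng(0)
# Just 2x3 blocks in M/N and 3 blocks in K -- small, but every code
# path of the block loop is taken.
a = rng.standard_normal((32, 24))
b = rng.standard_normal((24, 48))
print(np.allclose(blocked_matmul(a, b), a @ b))  # True
```

A shape only slightly larger than one block already forces multiple block iterations per dimension, which is exactly the coverage the unit test needs.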

@alexbaden alexbaden requested a review from whitneywhtsang June 16, 2025 19:40
Co-authored-by: Whitney Tsang <whitney.tsang@intel.com>
@whitneywhtsang left a comment


IMO we can remove all the comments about the kernel, as they add maintenance cost; one could refer to tutorial 3 or 10 to understand how to write a GEMM in Triton.

@whitneywhtsang (Contributor)

Another thought: we could add a test_op in tutorial 10, similar to how it is done in tutorial 6.

@alexbaden (Contributor, Author)

Let's go with this approach for now; I can move it to tutorial 10 later (or enable the tensor descriptor version of tutorial 10 and then add test_op there, if upstream does not do that first). After all, theoretically block_ptr won't be around much longer...

@whitneywhtsang (Contributor)

> Let's go with this approach for now, I can move it to tutorial 10 (or enable the tensor descriptor version of tutorial 10 and then add test_op there, if upstream does not do that first). After all theoretically block_ptr won't be around much longer...

Sure, I am fine to do this first.

@whitneywhtsang (Contributor)

> IMO we can remove all comments about the kernel, as it adds maintenance cost, one could refer to tutorial 3 or 10 to understand how to write gemm in Triton.

@alexbaden WDYT?

@alexbaden (Contributor, Author)

I think it's OK as is. Long term we should either remove it, since block_ptr will be deprecated, or fold it into tutorial 10 (but probably remove it).
For now, though, this coverage is important: it lets us know the block load lowering has no regressions, and we will still rely on that lowering even if we move completely to tensor descriptors.

@alexbaden (Contributor, Author)

Synced with Whitney offline: going to merge this now, and then either fold it into tutorial 10 (if block ptr sticks around) or, more likely, delete it alongside tutorial 10 when block ptr goes away.

@alexbaden alexbaden merged commit e279ab7 into main Jun 17, 2025
15 checks passed
@alexbaden alexbaden deleted the alex/add_dpas_block_tests branch June 17, 2025 00:53
david-hls pushed a commit to david-hls/intel-xpu-backend-for-triton that referenced this pull request Jun 18, 2025
This avoids a `Vector Type not specified properly` warning from ptxas.