-
Notifications
You must be signed in to change notification settings - Fork 68
Add block ptr test for dot product with transpose #4510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Is it because some configs/shapes are missing from tutorial 10 or is it because autotune would hide failures? |
Both reasons - plus it is more expensive because the overall shape is bigger. Here we make the overall shape smaller which dramatically improves runtime, which is fine because we only care about large enough to have different configurations in a single block vs tutorial 10 where we want to see performance and need bigger shapes to really get good throughput. |
Co-authored-by: Whitney Tsang <whitney.tsang@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we can remove all comments about the kernel, as it adds maintenance cost, one could refer to tutorial 3 or 10 to understand how to write gemm in Triton.
Another thought is we can add |
Let's go with this approach for now, I can move it to tutorial 10 (or enable the tensor descriptor version of tutorial 10 and then add |
Sure, I am fine to do this first. |
@alexbaden WDYT? |
I think it's ok as is - long term we should either remove it, since block_ptr will be deprecated, or fold it into tutorial 10 (but probably remove it). |
Synced with Whitney offline - going to merge this now and then either merge it into tutorial 10 (if block ptr sticks around) or probably delete it alongside tutorial 10 when block ptr goes away. |
This avoids a `Vector Type not specified properly` warning from ptxas.
Adds the end-to-end dot product on block ptr testing to the block load unit test (maybe it should be renamed
test_block_ptr.py
?). Adds additional shapes, A transpose, and B transpose. The cold runtime (no cache) is approximately 1 minute on PVC 1100 in my environment. I picked the block shapes somewhat randomly, trying to balance breadth and runtime.This somewhat duplicates tutorial 10 but allows us to run many more combinations in shorter time. I added this because #4463 is passing CI but has a few bugs that are not being caught by existing unit tests, including tutorials.