Use 2D block loads for post-DPAS chained ops #3000
Conversation
Add matrix is noticeably improved (~5%); the other benchmarks are in line with no regressions: https://benchmarks.glados.intel.com/d/1pXX4hUSz/microbenchmarks?orgId=1&var-tag=ci%7Cci-adv%7Cci-dflt%7Cpr3000&var-bench=All&var-device=Intel%28R%29%20Data%20Center%20GPU%20Max%201550&var-compiler=triton&var-compiler=xetla&var-compiler=onednn&var-backend=All&var-baseline_backend=xetla-ci-XPU%201550&var-target_backend=triton-ci-XPU%201550
This also resolves the performance regression in #2834 for the add matrix case. We get even to slightly higher performance when we use a transposed Transposed
Force-pushed from 5532f2a to a312d98
I have been trying to break it today and have not uncovered any issues. I cleaned up a good chunk of the duplication and have some ideas for how to further abstract away the differences, but I'm going to save that for another PR. This is now ready for review.
Force-pushed from 9f64bf8 to 37b841e
Force-pushed from 37b841e to 5b829bc
Neat!
Force-pushed from 5b829bc to 93ce4db
I modified the addmatrix benchmarks to include int8 to get some idea of correctness and performance there. The result is correct, and I see block loads in the IR. Performance is better:
I will PR the benchmark changes separately - torch does not support int8 mma on the GPU, so we have to fall back to the CPU, which makes the benchmark take about 20x longer. We will have to decide how to handle that.
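For context, a minimal sketch of the kind of CPU fallback this implies for the torch reference. The helper name and dtypes are illustrative, not the actual benchmark code; it assumes the reference is widened to int32, which matches the DPAS accumulator type for int8 inputs.

```python
import torch

def torch_reference(a, b, d):
    # Hypothetical reference helper: torch cannot do int8 matmul on the XPU
    # device, so the int8 reference is computed on the CPU. Widening to int32
    # avoids int8 overflow and mirrors the DPAS accumulator type.
    if a.dtype == torch.int8:
        acc = torch.matmul(a.cpu().to(torch.int32), b.cpu().to(torch.int32))
        return (acc + d.cpu().to(torch.int32)).to(d.device)
    # Non-int8 dtypes can stay on the device.
    return torch.matmul(a, b) + d
```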
        ? cast<DpasEncodingAttr>(encoding)
        : cast<DpasEncodingAttr>(
              getDotEncoding(tensorType).value().getParent());
auto dotOrder = dpasLayout.getThreadOrder();
We need to get the order from the layout encoding instead of the parent layout for #ttg.dot_op. Maybe we can use encoding.getThreadOrder.
Thanks, now that you point it out I do not think it was a good idea to combine the dpasLayout and dotLayout into the same dpasLayout variable. I separated them again, since the dotLayout is only used in the code block after the conditional for the dpasLayout load.
Force-pushed from 93ce4db to 96e27f9
This is a proof of concept for using the 2D block load for tensor pointer loads where the layout is a DPAS layout but the result of the load is not directly used in the DPAS computation. I built the PoC around the gemm_postop_addmatrix_benchmark kernel, which computes AxB + D. The D matrix uses a block pointer load, and the TTGIR applies the DPAS MMA layout to the D matrix load:

But when this code is lowered, we currently use a scalar load. This requires extracting all the scalar values from the DPAS registers, loading each scalar from a tensor of pointers for the D matrix, and computing the scalar add. A representative GEMM kernel for AxB is approximately 1100 instructions in ASM, but with the scalar D matrix addition that increases to 7500 instructions and is quite slow. With this PR, the ASM is down to 1500 instructions and performance is much better. I believe we both increase memory bandwidth by doing larger loads and decrease register pressure by using the 2D block load data directly from registers. (A simplified sketch of the kernel shape is included at the end of this description.) I modified the GEMM with block pointers tutorial to do the matrix addition and compared directly with PyTorch.

main:
w/ this change:
Note, however, that our GEMM performance is not as good as PyTorch/oneDNN:
But now, with the matrix addition, we're even with or sometimes slightly ahead of PyTorch. This demonstrates the benefit of operator fusion in Triton, and if we can improve GEMM performance a bit we should be able to pull ahead of PyTorch for this use case.
The code here contains a lot of duplication because I was not sure what parts of the existing block load code would be relevant, and what could be ignored. I plan to clean up the duplication (and I expect a few test failures that I will have to fix) before this is ready to be moved out of draft.
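For reference, here is the stripped-down sketch of the kernel shape mentioned above: block-pointer loads for A and B feeding the dot, plus a block-pointer load of the D tile that is added after the GEMM. The names and tile sizes are illustrative (this is not the actual benchmark kernel), and it assumes M, N, and K are multiples of the block sizes, so boundary checks are omitted.

```python
import triton
import triton.language as tl

@triton.jit
def gemm_addmatrix_kernel(a_ptr, b_ptr, d_ptr, c_ptr, M, N, K,
                          stride_am, stride_ak, stride_bk, stride_bn,
                          stride_dm, stride_dn, stride_cm, stride_cn,
                          BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                          BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Block pointers for the A and B tiles that feed the DPAS dot op.
    a_block_ptr = tl.make_block_ptr(a_ptr, shape=(M, K),
                                    strides=(stride_am, stride_ak),
                                    offsets=(pid_m * BLOCK_M, 0),
                                    block_shape=(BLOCK_M, BLOCK_K),
                                    order=(1, 0))
    b_block_ptr = tl.make_block_ptr(b_ptr, shape=(K, N),
                                    strides=(stride_bk, stride_bn),
                                    offsets=(0, pid_n * BLOCK_N),
                                    block_shape=(BLOCK_K, BLOCK_N),
                                    order=(1, 0))
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_block_ptr)
        b = tl.load(b_block_ptr)
        acc += tl.dot(a, b)
        a_block_ptr = tl.advance(a_block_ptr, (0, BLOCK_K))
        b_block_ptr = tl.advance(b_block_ptr, (BLOCK_K, 0))

    # The D tile is also loaded through a block pointer. Its result carries the
    # DPAS layout but never feeds a dot op; this post-DPAS chained load is the
    # case the PR lowers to a 2D block load instead of per-element scalar loads.
    d_block_ptr = tl.make_block_ptr(d_ptr, shape=(M, N),
                                    strides=(stride_dm, stride_dn),
                                    offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                                    block_shape=(BLOCK_M, BLOCK_N),
                                    order=(1, 0))
    acc += tl.load(d_block_ptr).to(tl.float32)

    c_block_ptr = tl.make_block_ptr(c_ptr, shape=(M, N),
                                    strides=(stride_cm, stride_cn),
                                    offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                                    block_shape=(BLOCK_M, BLOCK_N),
                                    order=(1, 0))
    tl.store(c_block_ptr, acc.to(c_ptr.dtype.element_ty))
```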
cc #2997