Improve performance for quantized models on Power10 CPU #408
mgiessing started this conversation in Support for Targets (OS / EPs / Hardware)
The quantized version of the sample model (phi-2) seems to have poor performance compared to fp32.

How to reproduce?

Preparation of the models:
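The exact commands aren't reproduced here, but a plausible sketch using the onnxruntime-genai model builder would look like the following; the model id, output paths, and precision/EP flags are assumptions, not the poster's exact invocation:

```
$ python3 -m onnxruntime_genai.models.builder -m microsoft/phi-2 -o ./phi-2-fp32 -p fp32 -e cpu
$ python3 -m onnxruntime_genai.models.builder -m microsoft/phi-2 -o ./phi-2-int4 -p int4 -e cpu
```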
Script to run the inference:
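The script itself isn't shown either; a minimal timing sketch against the onnxruntime-genai 0.2.0 Python API, with a placeholder model path, prompt, and search options, might look like this:

```python
import time
import onnxruntime_genai as og

# Placeholder path: point at ./phi-2-fp32 to compare against the fp32 export.
model = og.Model("./phi-2-int4")
tokenizer = og.Tokenizer(model)

prompt = "Explain what a matrix multiply unit does in one paragraph."
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)
params.input_ids = input_tokens

# Time one end-to-end generation and report rough tokens/sec.
start = time.time()
output_tokens = model.generate(params)[0]
elapsed = time.time() - start

print(tokenizer.decode(output_tokens))
new_tokens = len(output_tokens) - len(input_tokens)
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")
```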
System information

OS: AlmaLinux 9.3
CPU: IBM Power10 (ppc64le)

```
$ pip3 list installed | grep onnxruntime
onnxruntime          1.17.3
onnxruntime-genai    0.2.0rc4
```

Does anyone have an idea why the quantized performance is poor? I'd assume there is some data type conversion adding overhead.
Thank you!
yufenglee replied:
@mgiessing, int4 doesn't have a special kernel for Power10 CPU yet, and I don't have plans to support it yet. Do you want to contribute?

mgiessing replied:
@yufenglee, is that something that would need to be addressed within onnxruntime-genai or via onnxruntime MLAS? I just synced with our optimization & kernel team, and they confirmed we have only implemented Power10 MMA (Matrix-Multiply Assist engine) kernels for fp32 and int8 in onnxruntime/mlas. That leads to another question: are you going to support int8 quantization?