Improve performance for quantized models on Power10 CPU #408
mgiessing started this conversation in Support for Targets (OS / EPs / Hardware)
The quantized version of the sample model (phi-2) seems to have poor performance compared to fp32.

How to reproduce?

Preparation of the models:
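The exact commands aren't reproduced here, but a plausible sketch using the onnxruntime-genai model builder would look like the following; the model id, output paths, and precision/EP flags are assumptions, not the poster's exact invocation:

```
$ python3 -m onnxruntime_genai.models.builder -m microsoft/phi-2 -o ./phi-2-fp32 -p fp32 -e cpu
$ python3 -m onnxruntime_genai.models.builder -m microsoft/phi-2 -o ./phi-2-int4 -p int4 -e cpu
```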
Script to run the inference:
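The script itself isn't shown either; a minimal timing sketch against the onnxruntime-genai 0.2.0 Python API, with a placeholder model path, prompt, and search options, might look like this:

```python
import time
import onnxruntime_genai as og

# Placeholder path: point at ./phi-2-fp32 to compare against the fp32 export.
model = og.Model("./phi-2-int4")
tokenizer = og.Tokenizer(model)

prompt = "Explain what a matrix multiply unit does in one paragraph."
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)
params.input_ids = input_tokens

# Time one end-to-end generation and report rough tokens/sec.
start = time.time()
output_tokens = model.generate(params)[0]
elapsed = time.time() - start

print(tokenizer.decode(output_tokens))
new_tokens = len(output_tokens) - len(input_tokens)
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")
```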
System information

OS: AlmaLinux 9.3
CPU: IBM Power10 (ppc64le)

```
$ pip3 list installed | grep onnxruntime
onnxruntime          1.17.3
onnxruntime-genai    0.2.0rc4
```

Does anyone have an idea why the quantized performance is poor? I'd assume there is some data type conversion adding overhead.
Thank you!
yufenglee replied:
@mgiessing, int4 doesn't have a special kernel for Power10 CPU yet, and I don't have plans to support it yet. Do you want to contribute?

mgiessing replied:
@yufenglee, is that something that would need to be addressed within onnxruntime-genai or via onnxruntime MLAS? I just synced with our optimization & kernel team, and they confirmed we have only implemented Power10 MMA (Matrix-Multiply Assist engine) kernels for fp32 and int8 in onnxruntime/mlas. That leads to another question: are you going to support int8 quantization?