
Improves 2D tiled MatMulNBits by repeating A loads N times for each B load #23071

Open · wants to merge 6 commits into main
Conversation

sushraja-msft

Description

Improves on the previous change, "Implement 2d tiled matmulnbits specialized for prefill", by keeping B in shared memory and reloading just A N times.

This is based on the observation that loading B is more expensive than loading A: for a run with sequence length 16 and shape [3072, 3072, 8192], this matrix multiplication takes 1.9 ms; removing loadA drops it to 1.8 ms, while removing loadB drops it to 1.44 ms.

By sharing one B tile across multiple A tiles, the cost of loading and dequantizing B is reduced N-fold.
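
The kernel itself is WGSL generated by the WebGPU EP, but the reuse scheme is easiest to see in a plain CUDA sketch. The sketch below is illustrative only: `TILE`, `A_REPEAT`, `dequant4`, the two-weights-per-byte packing, and the per-column scales are all simplifying assumptions, not the PR's actual identifiers or layout. Each block dequantizes one B tile into shared memory once per k-step and then walks `A_REPEAT` row tiles of A against it, so the B load/dequantize cost measured above is paid once per `A_REPEAT` output tiles instead of once per tile.

```cuda
// Illustrative CUDA sketch of the B-reuse scheme; the real kernel is WGSL
// generated by the WebGPU EP. TILE, A_REPEAT, dequant4, the 4-bit packing,
// and the per-column scale layout are assumptions for illustration.
#include <cstdint>

#define TILE 16
#define A_REPEAT 8

// Unpack one of the two 4-bit values in `packed` and dequantize it.
__device__ float dequant4(uint8_t packed, int nibble, float scale, float zero) {
    int q = (nibble == 0) ? (packed & 0x0F) : (packed >> 4);
    return (static_cast<float>(q) - zero) * scale;
}

// C[M,N] = A[M,K] * B[K,N], where B holds two 4-bit weights per byte.
// Launch: blockDim = (TILE, TILE), gridDim = (ceil(N/TILE), ceil(M/(TILE*A_REPEAT))).
__global__ void matmul_nbits_tiled(const float* A, const uint8_t* Bq,
                                   const float* scales, float* C,
                                   int M, int N, int K, float zero_point) {
    __shared__ float Btile[TILE][TILE];  // stays resident across A_REPEAT A tiles
    __shared__ float Atile[TILE][TILE];  // reloaded for every A tile

    const int col = blockIdx.x * TILE + threadIdx.x;   // output column
    const int rowBase = blockIdx.y * TILE * A_REPEAT;  // first output row of block

    float acc[A_REPEAT] = {0.0f};

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Pay the expensive load + dequantize of B once per k-step.
        const int bRow = k0 + threadIdx.y;
        Btile[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N)
                ? dequant4(Bq[(bRow * N + col) >> 1], col & 1,
                           scales[col], zero_point)
                : 0.0f;
        __syncthreads();

        // Amortize that cost over A_REPEAT tiles of A sharing the same B tile.
        for (int r = 0; r < A_REPEAT; ++r) {
            const int aRow = rowBase + r * TILE + threadIdx.y;
            Atile[threadIdx.y][threadIdx.x] =
                (aRow < M && k0 + threadIdx.x < K)
                    ? A[aRow * K + (k0 + threadIdx.x)] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc[r] += Atile[threadIdx.y][k] * Btile[k][threadIdx.x];
            __syncthreads();
        }
    }

    if (col < N)
        for (int r = 0; r < A_REPEAT; ++r) {
            const int row = rowBase + r * TILE + threadIdx.y;
            if (row < M) C[row * N + col] = acc[r];
        }
}
```

Relative to reloading B for every A tile, this cuts B traffic and dequantization work by A_REPEAT×; the benchmark below uses A_REPEAT = 8.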

--- Baseline: with prefill optimization from previous change ---

C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500                                                                                                                                                               
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.2135e+07
        avg (tokens/s): 41.2856                                 << 
        p50 (us):       1.21288e+07
        stddev (us):    21282.1
        n:              5 * 501 token(s)
Token generation:
        avg (us):       78945.3
        avg (tokens/s): 12.667
        p50 (us):       78900.7
        stddev (us):    2232.43
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       20.5608
        avg (tokens/s): 48636.3
        p50 (us):       18.7
        stddev (us):    19.0409
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       22163.8
        p50 (ms):       22160.1
        stddev (ms):    31.3122
        n:              5
Peak working set size (bytes): 5478862848
WebGPU device lost (2): Device was destroyed.

--- With A_REPEAT of 8 ---
C:\onnxruntime>c:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500                                                                                                                                                               
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.1233e+07
        avg (tokens/s): 44.6006              <<<
        p50 (us):       1.12267e+07
        stddev (us):    13445.2
        n:              5 * 501 token(s)
Token generation:
        avg (us):       78740.4
        avg (tokens/s): 12.7
        p50 (us):       78763
        stddev (us):    2196.62
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       21.4592
        avg (tokens/s): 46600
        p50 (us):       20.3
        stddev (us):    10.3021
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       21235.9
        p50 (ms):       21226.8
        stddev (ms):    44.8555
        n:              5
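
Net effect: prompt processing improves from 41.3 to 44.6 tokens/s (about 8%), E2E generation drops from 22.16 s to 21.24 s, and token generation throughput is essentially unchanged (12.67 vs 12.7 tokens/s), as expected for a prefill-focused optimization.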

@guschmue

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline


Azure Pipelines successfully started running 2 pipeline(s).

@guschmue

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

@guschmue

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 4 pipeline(s).


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@guschmue added the ep:WebGPU (ort-web webgpu provider) label on Dec 12, 2024