QLLM

KeyWords Quantization, GPTQ,AWQ, HQQ, ONNX, ONNXRuntime, VLLM

Quantize all LLMs in HuggingFace/Transformers with GPTQ/AWQ/HQQ in mixed bits(2-8bit), and export to onnx model

QLLM is a out-of-box quantization toolbox for large language models, It is designed to be a auto-quantization framework which takes layer by layer for any LLMs. It can also be used to export quantized model to onnx with only one args `--export_onnx ./onnx_model`, and inference with onnxruntime. Besides, model quantized by different quantization method (GPTQ/AWQ/HQQ) can be loaded from huggingface/transformers and transfor to each other without extra effort.

We alread supported

GPTQ quantization
AWQ quantization
HQQ quantization

Features:

GPTQ supports all LLM models in huggingface/transformers, it will automatically detect the model type and quantize it.
We support to quantize model by 2-8 bits, and support to quantize model with different quantization bits for different layers.
Auto promoting bits/group-size for better accuracy
Export to ONNX model, Running by OnnxRuntime

Latest News 🔥

[2024/03] ONNX Models export API
[2024/01] Support HQQ algorithm
[2023/12] The first PyPi package released

Installation

Easy to install qllm from PyPi [cu121]

pip install qllm

Install from release package, CUDA-118/121 is supported. [py38, py39, py310] https://github.com/wejoncy/QLLM/releases

Build from Source

Please set ENV EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 for fast build

If you are using CUDA-121

pip install git+https://github.com/wejoncy/QLLM.git --no-build-isolation

OR CUDA-118/117

git clone https://github.com/wejoncy/QLLM.git
cd QLLM
python setup.py install

How to use it

Quantize llama2

#  Quantize and Save compressed model, method can be one of [gptq/awq/hqq]
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --nsamples=64 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --dataset=pileval --nsamples=16 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=hqq --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit

Convert to onnx model

use --export_onnx ./onnx_model to export and save onnx model

python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=gptq  --dataset=pileval --nsamples=16  --save ./Llama-2-7b-chat-hf_awq_q4/ --export_onnx ./Llama-2-7b-chat-hf_awq_q4_onnx/

or you can convert a existing model in HF Hub

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --export_onnx=./onnx
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --export_onnx=./onnx

(NEW) Quantize model with mix bits/groupsize for higher precision (PPL)

#  Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --allow_mix_bits --true-sequential

NOTE:

only support GPTQ
allow_mix_bits option refered from gptq-for-llama, QLLM makes it easier to use and flexible
wjat different with gptq-for-llama is we grow bit by one instead of times 2.
all configurations will be saved/load automaticlly instead of quant-table which used by gptq-for-llama.
if --allow_mix_bits is enabled, The saved model is not compatible with vLLM for now.

Quantize model for vLLM

Due to the zereos diff, we need to set a env variable if you set pack_mode to GPTQ whenver the method is awq or gptq

COMPATIBLE_WITH_AUTOGPTQ=1 python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GPTQ

If you use GEMM pack_mode, then you don't have to set the var

python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GEMM

python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --save ./Llama-2-7b-4bit --pack_mode=GEMM

Conversion among AWQ, GPTQ and MarLin

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_gptq_q4/ --pack_mode=GPTQ

Or you can use --pack_mode=AWQ to convert GPTQ to AWQ.

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_awq_q4/ --pack_mode=GEMM

Or you can use --pack_mode=MARLIN to convert GPTQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Or you can use --pack_mode=MARLIN to convert AWQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Note:

Not all cases are supported, for example,

if you quantized model with different quantization bits for different layers, you can't convert it to AWQ.
if GPTQ model is quantized with --allow_mix_bits option, you can't convert it to AWQ.
if GPTQ model is quantized with --act_order option, you can't convert it to AWQ.

model inference with the saved model

python -m qllm --load ./Llama-2-7b-4bit --eval

model inference with ORT

you may want to use genai to do generation with ORT.

import onnxruntime
from transformers import AutoTokenizer
onnx_path_str = './Llama-2-7b-4bit-onnx'

tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
onnx_model_path = onnx_path_str+'/decoder_merged.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])
mask = np.ones(sample_inputs[0].shape, dtype=np.int64) if sample_inputs[1] is None else sample_inputs[1].cpu().numpy()
num_layers = model.config.num_hidden_layers
inputs = {'input_ids': sample_inputs[0].cpu().numpy(), 
          'attention_mask': mask, 
          'position_ids': np.arrange(0,sample_inputs[0], dtype=np.int64),
          'use_cache_branch': np.array([0], dtype=np.bool_)}
for i in range(num_layers):
    inputs[f'past_key_values.{i}.key'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'past_key_values.{i}.value'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session(None, inputs)

Load quantized model from hugingface/transformers

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
python -m qllm --load TheBloke/Mixtral-8x7B-v0.1-GPTQ  --use_plugin

start a chatbot

you may need to install fschat and accelerate with pip

pip install fschat accelerate

use --use_plugin to enable a chatbot plugin

python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=awq  --dataset=pileval --nsamples=16  --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/

or 
python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=gptq  --dataset=pileval --nsamples=16  --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/

use QLLM with API

from qllm import AutoModelQuantization

quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path='meta-llama/Llama-2-7b-hf', method='gptq', wbits=4, groupsize=128)

OR

from qllm import AutoModelQuantization
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
q_model = quantizer.api_quantize(model_or_model_path=q_model, method='gptq', wbits=4, groupsize=128)

For some users has transformers connect issues.

Please set environment with PROXY_PORT=your http proxy port

PowerShell $env:PROXY_PORT=1080

Bash export PROXY_PORT=1080

windows cmd set PROXY_PORT=1080

Acknowledgements

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
.github/workflows		.github/workflows
assets		assets
csrc		csrc
qllm		qllm
.flake8		.flake8
.gitignore		.gitignore
.style.yapf		.style.yapf
LICENSE.txt		LICENSE.txt
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
qllm_colab.ipynb		qllm_colab.ipynb
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

QLLM

Installation

How to use it

Quantize llama2

Convert to onnx model

(NEW) Quantize model with mix bits/groupsize for higher precision (PPL)

NOTE:

Quantize model for vLLM

Conversion among AWQ, GPTQ and MarLin

Note:

model inference with the saved model

model inference with ORT

Load quantized model from hugingface/transformers

start a chatbot

use QLLM with API

For some users has transformers connect issues.

Acknowledgements

About

Releases 10

Packages

Contributors 4

Languages

License

wejoncy/QLLM

Folders and files

Latest commit

History

Repository files navigation

QLLM

Installation

How to use it

Quantize llama2

Convert to onnx model

(NEW) Quantize model with mix bits/groupsize for higher precision (PPL)

NOTE:

Quantize model for vLLM

Conversion among AWQ, GPTQ and MarLin

Note:

model inference with the saved model

model inference with ORT

Load quantized model from hugingface/transformers

start a chatbot

use QLLM with API

For some users has transformers connect issues.

Acknowledgements

About

Resources

License

Stars

Watchers

Forks

Releases 10

Packages 0

Contributors 4

Languages

Packages