PyTorch Transformer model MobileBERT-uncased for Natural Language Classification and Question Answering
This document describes how to evaluate optimized checkpoints of the MobileBERT-uncased transformer model on Natural Language (NL) Classification and Question Answering tasks.
Please install and set up AIMET (Torch GPU variant) before proceeding further.

NOTE
- All AIMET releases are available here: https://github.com/quic/aimet/releases
- This model has been tested using AIMET version 1.27.0 (i.e. set `release_tag="1.27.0"` in the above instructions).
- This model is compatible with the PyTorch GPU variant of AIMET (i.e. set `AIMET_VARIANT="torch_gpu"` in the above instructions).
Install the additional Python dependencies and add the parent directory of the model zoo package to `PYTHONPATH`:

```bash
pip install datasets==2.4.0
pip install transformers==4.11.3
export PYTHONPATH=$PYTHONPATH:<path to parent of aimet_model_zoo_path>
```
- Original full-precision checkpoints without downstream training were downloaded from Hugging Face (see the loading sketch after this list).
- Full-precision model weight files with downstream training are downloaded automatically by the evaluation script.
- Quantization-optimized model weight files are downloaded automatically by the evaluation script.
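For reference, the original full-precision checkpoint can be loaded directly from Hugging Face. This is a minimal sketch, assuming the public `google/mobilebert-uncased` hub id and a binary classification head; the evaluation script downloads the task-specific and quantization-optimized weights on its own.

```python
# Minimal sketch (not part of the evaluation script): load the original
# full-precision MobileBERT checkpoint from Hugging Face.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
# num_labels depends on the downstream GLUE task (e.g. 2 for RTE, MRPC, SST-2)
model = AutoModelForSequenceClassification.from_pretrained(
    "google/mobilebert-uncased", num_labels=2
)
model.eval()
```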
- For NL Classification tasks, we use the General Language Understanding Evaluation (GLUE) benchmark for evaluation.
- For Question Answering tasks, we use the Stanford Question Answering Dataset (SQuAD) benchmark for evaluation.
- Dataset downloading is handled by the evaluation script (see the sketch after this list).
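As a reference for what that download step involves, here is a minimal sketch using the pinned `datasets` library; the GLUE sub-config names (e.g. "rte") and the "squad" dataset id are the standard Hugging Face hub names, used here as an assumption about what the script loads.

```python
# Minimal sketch: the evaluation script downloads these datasets itself.
from datasets import load_dataset

# GLUE tasks are selected by sub-config name, e.g. "rte", "mrpc", "sst2", ...
glue_rte = load_dataset("glue", "rte")
print(glue_rte["validation"][0])

# SQuAD for the question-answering configurations
squad = load_dataset("squad")
print(squad["validation"][0])
```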
Run the evaluation script as follows:

```bash
python mobilebert_quanteval.py \
    --model_config <MODEL_CONFIGURATION> \
    --per_device_eval_batch_size 4 \
    --output_dir <OUT_DIR>
```

- Example:

```bash
python mobilebert_quanteval.py --model_config mobilebert_w8a8_rte --per_device_eval_batch_size 4 --output_dir ./evaluation_result
```

- Supported values of `model_config` are "mobilebert_w8a8_rte", "mobilebert_w8a8_stsb", "mobilebert_w8a8_mrpc", "mobilebert_w8a8_cola", "mobilebert_w8a8_sst2", "mobilebert_w8a8_qnli", "mobilebert_w8a8_qqp", "mobilebert_w8a8_mnli", "mobilebert_w8a8_squad", "mobilebert_w4a8_rte", "mobilebert_w4a8_stsb", "mobilebert_w4a8_mrpc", "mobilebert_w4a8_cola", "mobilebert_w4a8_sst2", "mobilebert_w4a8_qnli", "mobilebert_w4a8_qqp", "mobilebert_w4a8_mnli", and "mobilebert_w4a8_squad".
The following configuration has been used for the above models for INT4/INT8 quantization:
- Weight quantization: 4/8 bits, symmetric quantization
- Bias parameters are not quantized
- Activation quantization: 8 bits, asymmetric quantization
- Model inputs are quantized
- Different quantization schemes are used for different weight bitwidths and downstream tasks (see the tables below)
- A mask value of -6 was applied in the attention layers
- Quantization-aware training (QAT) was used to obtain the optimized quantized weights; detailed hyperparameters are listed in Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort, "Understanding and Overcoming the Challenges of Efficient Transformer Quantization", EMNLP 2021.
| QAT Configuration | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE |
|---|---|---|---|---|---|---|---|---|
| W8A8 | per-tensor, tf | per-channel, tf | per-channel, tf | per-channel, tf | per-channel, tf | per-tensor, tf_enhanced | per-tensor, tf_enhanced | per-channel, tf |
| W4A8 | per-tensor, tf_enhanced | per-channel, tf_enhanced | per-channel, tf_enhanced | per-channel, tf_enhanced | per-tensor, tf_enhanced | per-tensor, tf_enhanced | per-tensor, tf_enhanced | per-channel, tf_enhanced |
| QAT Configuration | SQuAD |
|---|---|
| W8A8 | per-channel, tf |
| W4A8 | per-channel, range_learning_with_tf_enhanced_init |
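As a rough illustration of how the settings above map onto AIMET's API, the sketch below builds a `QuantizationSimModel` with 8-bit weights and activations and a `tf` quantization scheme; the `tf_enhanced` and `range_learning_with_tf_enhanced_init` entries in the tables correspond to other `QuantScheme` members. This is an assumption-laden sketch, not the pipeline used to produce the optimized checkpoints: the checkpoint id, sequence length, and calibration callback are placeholders, and the per-channel settings and the -6 attention mask are handled by the quantsim configuration and the adapted model definition in the model zoo, which are not shown here.

```python
# Illustrative sketch only: maps the bitwidths and quantization schemes listed
# above onto the AIMET QuantizationSimModel API. Not the code used to produce
# the released checkpoints.
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel
from transformers import AutoModelForSequenceClassification

# Placeholder model: the model zoo uses its own task-specific model definitions.
model = AutoModelForSequenceClassification.from_pretrained(
    "google/mobilebert-uncased", num_labels=2
).eval()

seq_len = 128  # assumed maximum sequence length
dummy_input = (
    torch.zeros(1, seq_len, dtype=torch.long),  # input_ids
    torch.ones(1, seq_len, dtype=torch.long),   # attention_mask
)

sim = QuantizationSimModel(
    model,
    dummy_input=dummy_input,
    quant_scheme=QuantScheme.post_training_tf,  # "tf" entry in the tables above
    default_param_bw=8,   # weight bitwidth: 8 for W8A8, 4 for W4A8
    default_output_bw=8,  # activation bitwidth: 8 bits in both settings
)

# Compute quantization ranges from a (placeholder) calibration pass before
# evaluation or QAT fine-tuning.
sim.compute_encodings(
    forward_pass_callback=lambda m, _: m(*dummy_input),
    forward_pass_callback_args=None,
)
```

Whether the stock Hugging Face model definition traces cleanly through AIMET without the model zoo's adaptations is not guaranteed; the evaluation script performs its own model and quantsim setup.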
Below are the results of the PyTorch transformer model MobileBERT on the GLUE dataset:

| Configuration | CoLA (corr) | SST-2 (acc) | MRPC (f1) | STS-B (corr) | QQP (acc) | MNLI (acc) | QNLI (acc) | RTE (acc) | GLUE |
|---|---|---|---|---|---|---|---|---|---|
| FP32 | 51.48 | 91.60 | 85.86 | 88.22 | 90.66 | 83.54 | 91.18 | 68.60 | 81.27 |
| W8A8 | 52.51 | 91.63 | 90.81 | 88.19 | 90.80 | 83.46 | 91.12 | 68.95 | 82.18 |
| W4A8 | 50.34 | 91.28 | 87.61 | 87.30 | 90.48 | 82.90 | 89.42 | 68.23 | 80.95 |
Below are the results on the SQuAD dataset:

| Configuration | EM | F1 |
|---|---|---|
| FP32 | 82.75 | 90.11 |
| W8A8 | 81.96 | 89.41 |
| W4A8 | 81.88 | 89.33 |
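For reference, EM (exact match) and F1 are the standard SQuAD metrics. With the pinned `datasets` version they can be computed as in the sketch below; the prediction and reference entries are placeholders, and the evaluation script computes these metrics itself.

```python
# Illustrative sketch: computing SQuAD EM/F1 with the pinned datasets library.
# The id, prediction text, and reference answer below are placeholders.
from datasets import load_metric

squad_metric = load_metric("squad")
predictions = [{"id": "0", "prediction_text": "Denver Broncos"}]
references = [
    {"id": "0", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}
]
print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}
```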