
Llama-2-13b-hf-onnx-int4

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository of the INT4 weight-only quantization of the 13B fine-tuned model in ONNX format, powered by Intel® Neural Compressor and Intel® Extension for Transformers.

Note: Use of this model is governed by the Meta license. Please ensure you have accepted that license and been granted access to the FP32 model before downloading models here.

This INT4 model is generated with Intel® Neural Compressor's weight-only quantization method.

| Model Detail | Description |
| --- | --- |
| Model Authors - Company | Intel |
| Date | August 29, 2023 |
| Version | 1 |
| Type | Text Generation |
| Paper or Other Resources | - |
| License | https://ai.meta.com/resources/models-and-libraries/llama-downloads/ |
| Questions or Comments | Community Tab |

| Intended Use | Description |
| --- | --- |
| Primary intended uses | You can use the raw model for text generation inference. |
| Primary intended users | Anyone doing text generation inference. |
| Out-of-scope uses | This model will in most cases need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |

Export to ONNX Model

The FP32 model is exported with meta-llama/Llama-2-13b-hf:

```shell
optimum-cli export onnx --model meta-llama/Llama-2-13b-hf --task text-generation ./llama2_13b
```

Install ONNX Runtime

Install onnxruntime>=1.16.0 to support the MatMulFpQ4 operator.
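For example, with pip:

```shell
pip install "onnxruntime>=1.16.0"
```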

Run Quantization

The weight-only quantization configuration is as below:

| dtype | group_size | scheme | algorithm |
| --- | --- | --- | --- |
| INT4 | 32 | asym | GPTQ |

Build Intel® Neural Compressor from the master branch and run INT4 weight-only quantization. The key code is shown below. For the complete quantization script, please refer to the llama weight-only example.

```python
from neural_compressor import quantization, PostTrainingQuantConfig

config = PostTrainingQuantConfig(
    approach="weight_only",
    calibration_sampling_size=[8],
    op_type_dict={".*": {"weight": {"bits": 4,
                                    "algorithm": ["GPTQ"],
                                    "scheme": ["asym"],
                                    "group_size": 32}}},
)

# `dataloader` is a calibration dataloader; see the complete example for how it is built.
q_model = quantization.fit(
    "/path/to/llama2_13b/decoder_model.onnx",  # FP32 model path
    config,
    calib_dataloader=dataloader)
q_model.save("/path/to/Llama-2-13b-hf-onnx-int4/decoder_model.onnx")  # INT4 model path
```
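The calibration dataloader passed to quantization.fit is an iterable that exposes a batch_size attribute and yields (inputs, label) pairs. A minimal sketch is below; the class name, random token ids, and input names are hypothetical stand-ins for the tokenized calibration text the full llama weight-only example uses, and real inputs must be numpy arrays keyed by the ONNX model's actual input names.

```python
import random

class CalibDataloader:
    """Hypothetical minimal calibration dataloader sketch for quantization.fit."""

    def __init__(self, num_samples=8, seq_len=32, vocab_size=32000, batch_size=1):
        self.batch_size = batch_size  # Neural Compressor reads this attribute
        # Random token ids stand in for tokenized calibration text.
        self.samples = [
            {"input_ids": [[random.randrange(vocab_size) for _ in range(seq_len)]],
             "attention_mask": [[1] * seq_len]}
            for _ in range(num_samples)
        ]

    def __iter__(self):
        for inputs in self.samples:
            yield inputs, 0  # (inputs, label); the label is unused for weight-only PTQ

dataloader = CalibDataloader()
```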

Evaluation

Operator Statistics

The table below shows the operator statistics in the INT4 ONNX model:

| Op Type | Total | INT4 weight | FP32 |
| --- | --- | --- | --- |
| MatMul | 321 | 281 | 40 |

Evaluation of perplexity

Evaluate the model with the evaluation API of Intel® Extension for Transformers v1.2 on the lambada_openai task.

```python
from intel_extension_for_transformers.llm.evaluation.lm_eval import evaluate

model_path = "/path/to/Llama-2-13b-hf-onnx-int4"  # folder containing the INT4 model
tokenizer = "Intel/Llama-2-13b-hf-onnx-int4"
batch_size = 64
tasks = ["lambada_openai"]

results = evaluate(
    model="hf-causal",
    model_args="pretrained=" + model_path + ",tokenizer=" + tokenizer,
    batch_size=batch_size,
    tasks=tasks,
    model_format="onnx"
)
```

| Model | Model Size (GB) | lambada_openai acc | lambada_openai ppl |
| --- | --- | --- | --- |
| FP32 | 49 | 0.7677 | 3.0438 |
| INT4 | 8.5 | 0.7607 | 3.1562 |
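A quick arithmetic check of the trade-off implied by the table above: the INT4 model is roughly 5.8x smaller, at the cost of about 0.7 points of absolute accuracy and roughly a 3.7% relative increase in perplexity.

```python
# Figures taken directly from the results table above.
fp32_gb, int4_gb = 49, 8.5
fp32_acc, int4_acc = 0.7677, 0.7607
fp32_ppl, int4_ppl = 3.0438, 3.1562

compression = fp32_gb / int4_gb                  # ~5.76x smaller
acc_drop = fp32_acc - int4_acc                   # ~0.007 absolute
ppl_increase = (int4_ppl - fp32_ppl) / fp32_ppl  # ~3.7% relative
```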