# Llama-2-13b-hf-onnx-int4
Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the INT4 weight-only quantization of the 13B fine-tuned model in ONNX format, powered by Intel® Neural Compressor and Intel® Extension for Transformers.
Note: Use of this model is governed by the Meta license. Please ensure you have accepted that license and have been granted access to the FP32 model before downloading the models here.
This INT4 model is generated with Intel® Neural Compressor's weight-only quantization method.
Model Detail | Description |
---|---|
Model Authors - Company | Intel |
Date | August 29, 2023 |
Version | 1 |
Type | Text Generation |
Paper or Other Resources | - |
License | https://ai.meta.com/resources/models-and-libraries/llama-downloads/ |
Questions or Comments | Community Tab |
Intended Use | Description |
---|---|
Primary intended uses | You can use the raw model for text generation inference |
Primary intended users | Anyone doing text generation inference |
Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
## Export to ONNX Model
The FP32 model is exported with meta-llama/Llama-2-13b-hf:
```bash
optimum-cli export onnx --model meta-llama/Llama-2-13b-hf --task text-generation ./llama2_13b
```
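To sanity-check the export, the folder can be loaded back with Optimum's ONNX Runtime wrapper. This is a minimal sketch, assuming `optimum[onnxruntime]` is installed; the prompt is only an illustration:
```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Load the exported FP32 ONNX model and the original tokenizer.
model = ORTModelForCausalLM.from_pretrained("./llama2_13b")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

# Illustrative prompt only; any short text works for a smoke test.
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```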
## Install ONNX Runtime
Install `onnxruntime>=1.16.0` to support the `MatMulFpQ4` operator.
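For example (a minimal sketch; adjust to your own environment and package manager):
```bash
pip install "onnxruntime>=1.16.0"
python -c "import onnxruntime; print(onnxruntime.__version__)"
```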
## Run Quantization
The weight-only quantization configuration is as follows:
dtype | group_size | scheme | algorithm |
---|---|---|---|
INT4 | 32 | asym | GPTQ |
Build Intel® Neural Compressor from the master branch and run INT4 weight-only quantization. The key code is provided below; for the complete quantization script, please refer to the llama weight-only example.
```python
from neural_compressor import quantization, PostTrainingQuantConfig

config = PostTrainingQuantConfig(
    approach="weight_only",
    calibration_sampling_size=[8],
    op_type_dict={
        ".*": {
            "weight": {
                "bits": 4,
                "algorithm": ["GPTQ"],
                "scheme": ["asym"],
                "group_size": 32,
            }
        }
    },
)

q_model = quantization.fit(
    "/path/to/llama2_13b/decoder_model.onnx",  # FP32 model path
    config,
    calib_dataloader=dataloader,  # calibration dataloader; defined in the full example script
)
q_model.save("/path/to/Llama-2-13b-hf-onnx-int4/decoder_model.onnx")  # INT4 model path
```
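As a quick sanity check, the saved INT4 graph can be opened with ONNX Runtime to confirm that a session builds. This is a minimal sketch; the path is a placeholder and the external weight files written by `q_model.save` must sit in the same folder:
```python
import onnxruntime as ort

# Placeholder path to the saved INT4 model (its external data files must be alongside it).
int4_path = "/path/to/Llama-2-13b-hf-onnx-int4/decoder_model.onnx"

sess = ort.InferenceSession(int4_path, providers=["CPUExecutionProvider"])
print([inp.name for inp in sess.get_inputs()])
```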
## Evaluation
### Operator Statistics
The table below shows the operator statistics in the INT4 ONNX model:
Op Type | Total | INT4 weight | FP32 |
---|---|---|---|
MatMul | 321 | 281 | 40 |
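A count like this can be reproduced with the `onnx` Python package. This is a minimal sketch; the path is a placeholder:
```python
from collections import Counter

import onnx

# Placeholder path to the INT4 decoder graph.
model = onnx.load("/path/to/Llama-2-13b-hf-onnx-int4/decoder_model.onnx")

# Count every node type in the graph and compare with the MatMul statistics above.
print(Counter(node.op_type for node in model.graph.node))
```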
### Evaluation of perplexity
Evaluate the model with the evaluation API of Intel® Extension for Transformers v1.2 on the lambada_openai task.
```python
from intel_extension_for_transformers.llm.evaluation.lm_eval import evaluate

model_path = "/path/to/Llama-2-13b-hf-onnx-int4"  # folder containing the INT4 model
tokenizer = "Intel/Llama-2-13b-hf-onnx-int4"
batch_size = 64
tasks = ["lambada_openai"]

results = evaluate(
    model="hf-causal",
    model_args="pretrained=" + model_path + ",tokenizer=" + tokenizer,
    batch_size=batch_size,
    tasks=tasks,
    model_format="onnx",
)
```
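The returned dictionary follows the lm-eval-harness layout, so the task metrics can be read out as below. This is a sketch assuming the standard results structure; the exact keys may vary between versions:
```python
# Assumes the lm-eval-harness style results dictionary; keys may vary by version.
lambada = results["results"]["lambada_openai"]
print("acc:", lambada["acc"], "ppl:", lambada["ppl"])
```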
Model | Model Size (GB) | lambada_openai acc | lambada_openai ppl |
---|---|---|---|
FP32 | 49 | 0.7677 | 3.0438 |
INT4 | 8.5 | 0.7607 | 3.1562 |