Tags: gptq, auto-gptq, quantized

stablelm-tuned-alpha-3b-gptq-4bit-128g

This is a quantized model saved with auto-gptq. At the time of writing, auto-gptq cannot load models directly from the Hugging Face Hub, so you will need to clone this repo and load it locally:

git lfs install
git clone https://huggingface.co/ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g

See the excerpt from the auto-gptq tutorial below for usage instructions.


Auto-GPTQ Quick Start

Quick Installation

Starting from v0.0.4, you can install auto-gptq directly from PyPI using pip:

pip install auto-gptq

AutoGPTQ supports using Triton to speed up inference, but Triton currently only supports Linux. To install with Triton support, use:

pip install auto-gptq[triton]

If you want to try the newly supported LLaMA-type models in 🤗 Transformers without updating it to the latest version, use:

pip install auto-gptq[llama]

By default, the CUDA extension will be built at installation time if CUDA and PyTorch are already installed.

To disable building the CUDA extension, you can use the following commands:

For Linux

BUILD_CUDA_EXT=0 pip install auto-gptq

For Windows

set BUILD_CUDA_EXT=0 && pip install auto-gptq

Basic Usage

The full script for the basic usage demonstrated here is examples/quantization/basic_usage.py in the auto-gptq repository.

The two main classes currently used in AutoGPTQ are AutoGPTQForCausalLM and BaseQuantizeConfig.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
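
This repository already contains the quantized weights, so only the loading step below is needed. For reference, here is a condensed sketch of how a checkpoint like this is produced with BaseQuantizeConfig, adapted from the auto-gptq basic-usage example; the base model id and the single calibration sentence are illustrative placeholders:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "stabilityai/stablelm-tuned-alpha-3b"  # assumed base model
quantized_model_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
# a single calibration example for illustration; real quantization should use a larger, representative set
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library based on the GPTQ algorithm.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization, matching the "4bit" in this repo's name
    group_size=128,  # group size, matching the "128g" in this repo's name
)

# load the unquantized model, quantize it with the calibration examples, and save the result
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir, use_safetensors=True)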

Load quantized model and do inference

Instead of .from_pretrained, you should use .from_quantized to load a quantized model.

device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_triton=False, use_safetensors=True)

This will first read and load quantize_config.json from the quantized model directory, then, based on the values of bits and group_size in it, load the quantized weights (here a .safetensors file, since use_safetensors=True is passed) into the first GPU.
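
To inspect those values for this checkpoint, you can read the config file from the cloned repo directly (a minimal sketch, assuming you cloned the repo as shown above; bits and group_size should be 4 and 128 here):

import json

with open("stablelm-tuned-alpha-3b-gptq-4bit-128g/quantize_config.json") as f:
    quantize_config = json.load(f)
# e.g. prints 4 and 128 for this repository
print(quantize_config["bits"], quantize_config["group_size"])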

Then you can initialize 🤗 Transformers' TextGenerationPipeline and run inference:

from transformers import AutoTokenizer, TextGenerationPipeline

# assumes the tokenizer files are included in the quantized model directory
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
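
You can also skip the pipeline and call model.generate directly, as with any 🤗 Transformers causal LM (a minimal sketch; the prompt and max_new_tokens value are arbitrary):

# tokenize a prompt, move it to the model's device, and decode the generated tokens
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0]))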

Conclusion

Congrats! You have learned how to quickly install auto-gptq and integrate it into your workflow. In the next chapter, you will learn advanced loading strategies for pretrained and quantized models, along with some best practices for different situations.