GPTQ Algorithm with auto-gptq Integration

Model Description

The GPTQ algorithm, developed by Frantar et al., compresses the weights of transformer-based language models to low bit-widths (e.g., 4-bit) with minimal performance degradation. The auto-gptq library, which implements the GPTQ algorithm, is integrated into 🤗 Transformers, enabling users to quantize their own models and to load and run models already quantized with GPTQ.

Features

Intended Use

This integration is intended for users who want to compress their transformer-based language models without significant performance loss. It's especially useful for deployment scenarios where model size is a constraint.

Limitations and Considerations

Training Data

The GPTQ algorithm requires calibration data for optimal quantization. Users can either use a supported dataset such as "c4" or "wikitext2", or provide a custom dataset for calibration.

Evaluation Results

Performance after quantization may vary based on the dataset used for calibration and the bit precision chosen for quantization. It's recommended to evaluate the quantized model on relevant tasks to ensure it meets the desired performance criteria.
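A common evaluation is perplexity on held-out text: run the quantized model over a corpus, collect per-token negative log-likelihoods, and exponentiate their mean. The sketch below shows the metric itself, with toy per-token losses standing in for real model outputs.

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (in nats).

    In practice the NLLs come from running the (quantized) model over
    held-out text; a lower value means the model predicts the text better.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy comparison: a quantized model whose per-token losses rose slightly.
full_precision_ppl = perplexity([2.0, 2.1, 1.9, 2.0])
quantized_ppl = perplexity([2.1, 2.2, 2.0, 2.1])
```

Comparing the two numbers on the same held-out corpus gives a quick read on how much quality the chosen bit precision and calibration set cost.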

References

Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323.