llama-65b-4bit

This works with my branch of GPTQ-for-LLaMa: https://github.com/catid/GPTQ-for-LLaMa-65B-2GPU

To test it out on two RTX 4090 GPUs and 64 GB of system RAM (it might work with a large swap file instead, but I haven't tested that):

# Install git-lfs
sudo apt install git git-lfs

# Clone the code
git clone https://github.com/catid/GPTQ-for-LLaMa-65B-2GPU
cd GPTQ-for-LLaMa-65B-2GPU

# Clone the model weights
git lfs install
git clone https://huggingface.co/catid/llama-65b-4bit
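
Note: if git-lfs is not set up correctly, the clone can leave small pointer files in place of the real weights. Below is an optional sanity check, sketched in Python using only the standard library; the path matches the safetensors file used in the test command further down, and the size threshold is only a rough heuristic.

import os

path = "llama-65b-4bit/llama65b-4bit-128g.safetensors"
size_gb = os.path.getsize(path) / 1e9
# A 4-bit 65B checkpoint is tens of gigabytes; a file of a few hundred bytes is an LFS pointer.
print(f"{path}: {size_gb:.1f} GB")
if size_gb < 1:
    raise SystemExit("Weights look like an LFS pointer; re-run 'git lfs pull' inside llama-65b-4bit/")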

# Set up conda environment
conda create -n gptq python=3.10
conda activate gptq

# Install script dependencies
pip install -r requirements.txt

# Work around protobuf error
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
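
If you drive the model from your own Python code instead of this shell session, the same workaround can be applied programmatically before importing anything that pulls in protobuf; a minimal sketch:

import os

# Same workaround as above; must run before any protobuf-using imports.
os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "python")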

# Run a test
python llama_inference.py llama-65b-4bit --load llama-65b-4bit/llama65b-4bit-128g.safetensors --groupsize 128 --wbits 4 --text "I woke up with a dent in my forehead.  " --max_length 128 --min_length 32
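
If the test run fails with a CUDA out-of-memory error, it can help to confirm that PyTorch sees both GPUs and how much VRAM is free on each. Below is a small optional check using the standard PyTorch CUDA API; run it inside the gptq environment after installing the requirements. Nothing in it is specific to this branch.

import torch

if not torch.cuda.is_available():
    raise SystemExit("CUDA is not available; check the driver and PyTorch install")

for i in range(torch.cuda.device_count()):
    free_bytes, total_bytes = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} - "
          f"{free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB")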

license: bsd-3-clause