GPTQ-quantized version of LLongMA 2 13B 16K, as found here: https://huggingface.co/conceptofmind/LLongMA-2-13b-16k

As of this writing, that model is missing a model card; see the 7B version to get a general idea: https://huggingface.co/conceptofmind/LLongMA-2-7b

Quantized with GPTQ-for-LLaMA (commit 59cabb9) using the following command:

CUDA_VISIBLE_DEVICES=0 python llama.py <model-path> c4 --wbits 4 --true-sequential --act-order --groupsize 32 --save_safetensors <out-path>/4bit-32g-tsao.safetensors

You'll probably want to run this with ExLlama; set --compress_pos_emb 4.0 and raise --length/--max_seq_len up to 16384.

Using ExLlama, the full 16K context fits into 24 GB of VRAM with only 1-2 GB to spare.
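
For reference, here is a minimal sketch of loading the quantized weights through ExLlama's Python API, based on the example scripts in the turboderp/exllama repo. The model directory path and prompt are placeholders, and the exact attribute names may differ depending on your ExLlama version:

```python
# Rough sketch: load the 4-bit model with ExLlama's Python API and generate a few tokens.
# Assumes this runs from an exllama checkout (model.py / tokenizer.py / generator.py importable)
# and that the quantized files live in the hypothetical directory below.
import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/models/LLongMA-2-13b-16k-GPTQ"  # placeholder path

config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]
config.max_seq_len = 16384       # extended context length
config.compress_pos_emb = 4.0    # 16384 / 4096 base context = 4x position interpolation

model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Once upon a time,", max_new_tokens=32))
```

The 4.0 value comes from the context extension ratio (16384 extended / 4096 base), which matches the --compress_pos_emb 4.0 recommendation above.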