Model Details

This is a quantized version of the Meta Llama Chat 70B model, quantized to 5.0 bits per weight (bpw) with ExLlamaV2 in the EXL2 format.

Tested on 2x RTX 3090s, delivering the full 4096-token context on Linux with Flash Attention installed, using a GPU split of 21/23.5 GB.
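
For reference, a minimal sketch of loading the model with that split via the exllamav2 Python API is shown below. The model directory path is a placeholder, and the calls reflect the ExLlamaV2 API around version 0.0.5, so newer releases may differ.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Point the config at the downloaded EXL2 model directory (placeholder path).
config = ExLlamaV2Config()
config.model_dir = "/path/to/llama-70b-chat-exl2-5.0bpw"
config.prepare()
config.max_seq_len = 4096  # full context tested on 2x3090s

model = ExLlamaV2(config)
# Split the weights across the two GPUs: ~21 GB on card 0, ~23.5 GB on card 1.
model.load([21, 23.5])

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Hello, my name is", settings, num_tokens=64))
```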

Quantization was performed with ExLlamaV2, version 0.0.5, at a 4096-token sequence length, using a Parquet calibration file that combined WikiText with some long-form fiction.
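
For anyone reproducing the quant, the invocation would look roughly like the sketch below; the input/output paths and calibration filename are placeholders, not the actual ones used, and the flags reflect the convert.py script shipped with ExLlamaV2 0.0.5.

```bash
# -i:  source FP16 model directory (placeholder path)
# -o:  scratch/working directory for the measurement pass
# -cf: output directory for the finished EXL2 model
# -c:  Parquet calibration file (WikiText + long-form fiction)
# -b:  target bits per weight
# -l:  calibration sequence length in tokens
python convert.py \
    -i /path/to/llama-70b-chat-fp16 \
    -o /path/to/working_dir \
    -cf /path/to/llama-70b-chat-exl2-5.0bpw \
    -c calibration.parquet \
    -b 5.0 \
    -l 4096
```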