Llama-2-70b-chat-EXL2-2.500b

A quantized version of Llama-2-70b-chat-hf in EXL2 format, targeting an average of 2.5 bits per weight.

It was created with WizardLM_evol_instruct_70k as the parquet calibration file and the following command:

python convert.py \
  -i ../NousResearch_Llama-2-70b-chat-hf \
  -o ~/working \
  -cf Llama-2-70b-chat-EXL2-2.500b \
  -c Evol-Instruct-Code-80k-v1.parquet \
  -b 2.500
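
For reference, per the exllamav2 convert.py documentation: -i is the directory containing the unquantized source model, -o a working directory for temporary files and job state, -cf the output directory for the finished quantized model, -c the calibration dataset in parquet format, and -b the target average number of bits per weight.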

It is very similar to the quantization created by turboderp. The only difference is that I used WizardLM_evol_instruct_70k as the calibration dataset instead of wikitext, in the hope that this leads to better performance on typical instruct tasks.
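
For example, here is a minimal inference sketch using the exllamav2 Python API, adapted from the example scripts in the exllamav2 repository (the local model path, prompt, and sampling settings are placeholder assumptions):

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "./Llama-2-70b-chat-EXL2-2.500b"  # assumed local download path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # split the 70B model across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

# Llama-2 chat prompt format; generate_simple does not prepend a BOS
# token by default, which is what this quantization expects (see the
# note below).
prompt = "[INST] Explain EXL2 quantization in one paragraph. [/INST]"
output = generator.generate_simple(prompt, settings, 250)
print(output)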

Note

If you get gibberish output, remove the BOS token from the beginning of your prompts.

In text-generation-webui, this can be done by unchecking "Add the bos_token to the beginning of prompts" under "Parameters" > "Generation".

See this issue for details: https://github.com/turboderp/exllamav2/issues/123
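
When driving exllamav2 from Python instead of text-generation-webui, the equivalent fix is to tokenize without a BOS token, e.g. (a sketch; add_bos defaults to False in ExLlamaV2Tokenizer.encode):

input_ids = tokenizer.encode(prompt, add_bos=False)  # do not prepend <s>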