# Llama-2-70b-chat-EXL2-2.500b
A quantized version of Llama-2-70b-chat-hf in EXL2 format.
It was created using WizardLM_evol_instruct_70k as the calibration dataset (supplied as a Parquet file) and the following command:
```
python convert.py \
    -i ../NousResearch_Llama-2-70b-chat-hf \
    -o ~/working \
    -cf Llama-2-70b-chat-EXL2-2.500b \
    -c Evol-Instruct-Code-80k-v1.parquet \
    -b 2.500
```
It is very similar to the quantization created by turboderp. The only difference is that I used WizardLM_evol_instruct_70k as the calibration dataset instead of wikitext, in the hope that this leads to better performance on typical instruct tasks.
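For reference, a minimal sketch of loading and prompting this quant with the exllamav2 Python API might look like the following. It is based on the example scripts bundled with exllamav2; the model path, sampling values, and prompt are placeholders, and the exact API surface may differ between exllamav2 versions:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Point the config at the directory holding the quantized weights
# (placeholder path -- adjust to wherever you downloaded the model).
config = ExLlamaV2Config()
config.model_dir = "./Llama-2-70b-chat-EXL2-2.500b"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # split layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7  # placeholder sampling values
settings.top_p = 0.9

prompt = "[INST] Write a haiku about quantization. [/INST]"
print(generator.generate_simple(prompt, settings, 200))
```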
## Note
If you get gibberish output, remove the BOS token from the beginning of your prompts.
In text-generation-webui, this can be done by unchecking "Add the bos_token to the beginning of prompts" under "Parameters" > "Generation".
See this issue for details: https://github.com/turboderp/exllamav2/issues/123
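If you drive exllamav2 from Python rather than through text-generation-webui, a sketch of the same fix is to encode the prompt yourself and make sure no BOS id leads the sequence. This continues the loading sketch above; both `add_bos` and `bos_token_id` are assumptions about the exllamav2 tokenizer API, so verify them against your installed version:

```python
# Encode without a leading BOS token (add_bos is assumed to exist on
# ExLlamaV2Tokenizer.encode -- check your version). encode() returns a
# tensor of shape (1, n).
ids = tokenizer.encode(prompt, add_bos=False)

# Defensive check: strip a stray BOS id if one slipped in anyway.
if ids[0, 0].item() == tokenizer.bos_token_id:
    ids = ids[:, 1:]
```

In text-generation-webui the checkbox described above is the equivalent control; in a custom generation loop, this keeps the stray BOS out of the input ids before they reach the model.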