not-for-all-audiences

GGMLs of Pygmalion Vicuna 1.1 7B

<!-- header start --> <div style="width: 100%;"> <img src="https://huggingface.co/spaces/shadowsword/misc/resolve/main/huggingface_shadowsword_ggml.png" alt="Shadowsword GGML Reuploads" style="width: 100%; min-width: 400px; display: block; margin: auto;"> </div> <!-- header end -->

A GGML re-upload by Shadowsword of the original model:

https://huggingface.co/TehVenom/Pygmalion-Vicuna-1.1-7b

GGMLv3 files, produced with the make-ggml.py script that TheBloke committed to a Hugging Face repo.

example$ python3 ./make-ggml.py --model /home/inpw/Pygmalion-1.1-7b --outname Pygmalion-Vicuna-1.1-7b --outdir /home/inpw/Pygmalion-Vicuna-1.1-7b --keep_fp16 --quants ...

It was mentioned that Pygmalion models are no longer allowed on Google Colab.

Includes USE_POLICY.md to comply with the license agreements.

Provided GGML Quants

| Quant Method | Use Case |
| --- | --- |
| Q2_K | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
| Q3_K_S | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors. |
| Q3_K_M | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| Q3_K_L | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| Q4_0 | Original quant method, 4-bit. |
| Q4_1 | Original quant method, 4-bit. Higher accuracy than Q4_0 but lower than Q5_0, with faster inference than the Q5 models. |
| Q4_K_S | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors. |
| Q4_K_M | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K. |
| Q5_0 | Original quant method, 5-bit. Higher accuracy, higher resource usage, and slower inference. |
| Q5_1 | Original quant method, 5-bit. Even higher accuracy and resource usage, and slower inference. |
| Q5_K_S | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors. |
| Q5_K_M | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K. |
| Q6_K | New k-quant method. Uses GGML_TYPE_Q6_K (6-bit quantization) for all tensors. |
| fp16 | Unquantized fp16 file compiled from the safetensors weights; can be used as the source for quantization. |

Thanks to TheBloke for the information on quant use cases.
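To give a feel for what the "original" 4-bit methods do, here is a minimal sketch of block quantization: each block of weights shares one scale, and every weight is stored as a small integer. This is a simplified toy, not the actual ggml code; `BLOCK_SIZE`, `quantize_block`, and `dequantize_block` are names made up for the illustration.

```python
# Toy 4-bit block quantization: one shared scale per block of weights.
# Simplified illustration only; NOT the real ggml Q4_0 implementation.
BLOCK_SIZE = 32  # assumed block size for this sketch


def quantize_block(weights):
    """Map floats to integers in [-8, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Avoid dividing by zero for an all-zero block.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the scale and the 4-bit integers."""
    return [scale * q for q in quants]


# Example: quantize one block and measure the worst-case rounding error.
block = [i / (BLOCK_SIZE - 1) - 0.5 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

The error per weight is bounded by half the scale, which is why fewer bits (a coarser grid) means lower accuracy.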

| RAM/VRAM | Parameters | GPU Offload (2K ctx, Q4_0, 6GB RTX 2060) |
| --- | --- | --- |
| 4GB | 3B | |
| 8GB | 7B | 32 Layers |
| 16GB | 13B | 18 Layers |
| 32GB | 30B | 8 Layers |
| 64GB | 65B | |
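As a rough sanity check on the RAM figures above, file size can be estimated from bits per weight. The sketch below assumes the ggmlv3 block layouts for the original quant methods (32 weights per block with fp16 scale values) and a nominal parameter count; `BITS_PER_WEIGHT` and `estimate_size_gb` are illustrative names. It ignores context (KV cache) memory and any tensors kept at a different precision.

```python
# Bits per weight, assuming ggmlv3 block layouts: 32 weights per block,
# stored as packed low-bit values plus fp16 scale/offset (and high bits).
BITS_PER_WEIGHT = {
    "Q4_0": 18 * 8 / 32,  # 16 bytes of 4-bit values + 2-byte scale
    "Q4_1": 20 * 8 / 32,  # as Q4_0 plus a 2-byte offset
    "Q5_0": 22 * 8 / 32,  # as Q4_0 plus 4 bytes holding the fifth bits
    "Q5_1": 24 * 8 / 32,  # as Q5_0 plus a 2-byte offset
    "fp16": 16.0,
}


def estimate_size_gb(n_params, quant):
    """Approximate weight size in GB (1e9 bytes) for a given quant type."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


# Example: a nominal 7e9-parameter model (the true count may differ).
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: {estimate_size_gb(7e9, quant):.2f} GB")
```

Leave headroom on top of these numbers for the context buffers and the rest of the system.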

Original Card:

Pygmalion Vicuna 1.1 7B

The LLaMA-based Pygmalion-7b model:

https://huggingface.co/PygmalionAI/pygmalion-7b

Merged with lmsys's Vicuna v1.1 deltas:

https://huggingface.co/lmsys/vicuna-13b-delta-v1.1

This merge was done using a weighted average merge strategy, and the end result is a model composed of:

Pygmalion-7b [60%] + LLaMA Vicuna v1.1 [40%]
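A weighted average merge of this kind can be sketched as below. This is a toy version under stated assumptions: parameters are plain Python lists keyed by name, and `weighted_merge` is a hypothetical helper. The actual merge script is not part of this card and would operate on real model tensors.

```python
def weighted_merge(state_a, state_b, weight_a=0.6):
    """Blend two same-shaped parameter dicts: weight_a * A + (1 - weight_a) * B."""
    if state_a.keys() != state_b.keys():
        raise ValueError("models must have identical parameter names")
    weight_b = 1.0 - weight_a
    merged = {}
    for name, values_a in state_a.items():
        values_b = state_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for {name}")
        # Element-wise weighted average of the two parameter vectors.
        merged[name] = [weight_a * a + weight_b * b
                        for a, b in zip(values_a, values_b)]
    return merged


# Example with stand-in values: 60% of model A, 40% of model B.
pygmalion_like = {"layer.weight": [1.0, 0.0]}
vicuna_like = {"layer.weight": [0.0, 1.0]}
merged = weighted_merge(pygmalion_like, vicuna_like, weight_a=0.6)
```

Both models need the same architecture and parameter names for an element-wise average like this to make sense.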

This was done by request; the end result is intended to lean heavily towards Pygmalion's chatting and RP tendencies while inheriting some of Vicuna's Assistant / Instruct / Helpful properties.

Due to the influence of Pygmalion, this model will very likely generate content that is considered NSFW.

The specific prompt format is unknown, but try Pygmalion's prompt styles first, then a mix of the two to see which gives the most interesting results.
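For experimenting with the two styles, a small sketch of prompt builders follows. The templates are assumptions based on the formats commonly documented for Pygmalion-7b and Vicuna v1.1, and are not confirmed for this merge; `pygmalion_prompt`, `vicuna_prompt`, and `VICUNA_SYSTEM` are illustrative names.

```python
# Assumed Vicuna v1.1 system line; adjust if the upstream docs differ.
VICUNA_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def pygmalion_prompt(character, persona, history, user_message):
    """Build a Pygmalion-style prompt (assumed persona / <START> layout)."""
    lines = [f"{character}'s Persona: {persona}", "<START>"]
    lines.extend(history)  # earlier turns, e.g. "You: hi" or "Aria: hello"
    lines.append(f"You: {user_message}")
    lines.append(f"{character}:")  # the model continues from here
    return "\n".join(lines)


def vicuna_prompt(user_message, system=VICUNA_SYSTEM):
    """Build a Vicuna v1.1-style prompt (assumed USER / ASSISTANT layout)."""
    return f"{system} USER: {user_message} ASSISTANT:"


# Example: the same message in both styles, for side-by-side comparison.
rp_style = pygmalion_prompt("Aria", "A cheerful guide.", [], "Hello")
assistant_style = vicuna_prompt("Hello")
```

Feeding both variants to the model with the same sampling settings is a simple way to compare which style it responds to best.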