
🦚Merak-7B-v3-Mini-Orca GPTQ🐳

<p align="center"> <img src="https://i.imgur.com/39sQd3h.png" alt="Merak Orca" width="300" height="300"/> </p>

These files are GPTQ model files for Merak-7B-v3-Mini-Orca.

Merak-7B-v3-Mini-Orca is Ichsan2895's Merak-7B-v3, fine-tuned on a Bahasa Indonesia translation of psmathur's orca_mini_v1_dataset.

Prompt format

You can use the Vicuna 1.1 format with Oobabooga's text-generation-webui.

SYSTEM: Anda adalah asisten AI. Anda akan diberi tugas. Anda harus menghasilkan jawaban yang rinci dan panjang.
USER: <prompt> (without the <>)
ASSISTANT:

How to easily download and use this model in text-generation-webui

Please make sure you're using the latest version of text-generation-webui.

It is strongly recommended to use the text-generation-webui one-click installers unless you know how to do a manual install.

  1. Click the Model tab.
  2. Under Download custom model or LoRA, enter asyafiqe/Merak-7B-v3-Mini-Orca-Indo-GPTQ (a script-based alternative to the in-UI download is sketched after this list).
  3. Click Download.
  4. The model will start downloading. Once it has finished, it will say "Done".
  5. In the top left, click the refresh icon next to Model.
  6. In the Model dropdown, choose the model you just downloaded: Merak-7B-v3-Mini-Orca-Indo-GPTQ.
  7. In the Model Loader dropdown, choose ExLlamav2_HF as the model loader.
  8. Click Load.
  9. Click the Default tab.
  10. Copy the prompt format shown above into the input box.
  11. Enter a prompt and click Generate! Click Continue to get a longer response.
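
If you prefer to fetch the model outside the web UI, the sketch below uses huggingface_hub's snapshot_download. It assumes huggingface_hub is installed and that your models directory is text-generation-webui/models; adjust the path to match your install.

from huggingface_hub import snapshot_download

# Download the full GPTQ repo into text-generation-webui's models folder.
# Adjust local_dir to wherever your webui install keeps its models.
snapshot_download(
    repo_id="asyafiqe/Merak-7B-v3-Mini-Orca-Indo-GPTQ",
    local_dir="text-generation-webui/models/Merak-7B-v3-Mini-Orca-Indo-GPTQ",
)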

How to use this GPTQ model from Python code

First make sure you have AutoGPTQ and sentencepiece installed:

GITHUB_ACTIONS=true pip install auto-gptq
pip install sentencepiece
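
Optionally, you can check that the packages import cleanly and that a CUDA device is visible, since the example below loads the model onto cuda:0; a minimal check:

# Optional sanity check: the imports should succeed and CUDA should be available.
import torch
import auto_gptq  # noqa: F401
import sentencepiece  # noqa: F401

print("CUDA available:", torch.cuda.is_available())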

Then try the following example code:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "asyafiqe/Merak-7B-v3-Mini-Orca-Indo-GPTQ"
model_basename = "Merak-7B-v3-Mini-Orca-Indo-GPTQ"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)


prompt = "Buat rencana untuk menghemat listrik di rumah"
system_message = "Anda adalah asisten AI. Anda akan diberi tugas. Anda harus menghasilkan jawaban yang rinci dan panjang.\n"
prompt_template = f'''SYSTEM: {system_message}
USER: {prompt}
ASSISTANT: '''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
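
Note that the decoded output above includes the prompt itself. If you only want the model's reply from the model.generate() call, one option is to slice off the prompt tokens before decoding; a minimal sketch reusing the input_ids and output variables from above:

# output[0] contains the prompt tokens followed by the generated tokens,
# so skip the first input_ids.shape[1] tokens to keep only the reply.
reply = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(reply)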

Compatibility

The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.

ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.

Credits

TheBloke for the README template.