Model Card for Carpincho-30b

This is the Carpincho-30B QLoRA 4-bit checkpoint, an instruction-tuned LLM based on LLaMA-30B. It is trained to answer in colloquial Argentine Spanish.

It was trained on two 3090 GPUs (48 GB of VRAM in total) for 120 hours using the Hugging Face QLoRA code (4-bit quantization).
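As a rough illustration of why 4-bit quantization matters on this hardware, the sketch below estimates the memory taken by the weights alone. It assumes roughly 32.5 billion parameters (the figure the LLaMA paper reports for this model size) and ignores quantization constants, activations, the KV cache, and the adapter itself, so treat the numbers as a lower bound rather than a measurement.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.
# ASSUMPTION: ~32.5e9 parameters, as reported in the LLaMA paper for this size.
PARAMS = 32.5e9
VRAM_GB = 48  # total VRAM of the two GPUs mentioned above


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(PARAMS, 16)
int4_gb = weight_memory_gb(PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.2f} GB, fits in {VRAM_GB} GB: {fp16_gb < VRAM_GB}")
print(f"4-bit weights:  {int4_gb:.2f} GB, fits in {VRAM_GB} GB: {int4_gb < VRAM_GB}")
```

Under these assumptions the 16-bit weights alone exceed the available VRAM, while the 4-bit weights leave headroom for activations and the LoRA parameters.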

Model Details

The model is provided in LoRA format.
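For readers unfamiliar with the format: a LoRA adapter does not store an updated copy of each weight matrix. For an adapted matrix of shape d_out x d_in it stores two small rank-r matrices instead. The sketch below compares the sizes for a single square projection. The hidden size of 6656 is the LLaMA paper's figure for this model size, and the rank of 64 is purely hypothetical, since the rank used for Carpincho is not stated here.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * d_in + d_out * r


HIDDEN = 6656  # ASSUMPTION: hidden size for this LLaMA size, per the LLaMA paper
RANK = 64      # HYPOTHETICAL rank, chosen only for illustration

full = HIDDEN * HIDDEN                            # one full square projection
adapter = lora_param_count(HIDDEN, HIDDEN, RANK)  # its low-rank adapter pair

print(f"full matrix: {full:,} parameters")
print(f"LoRA pair:   {adapter:,} parameters")
print(f"ratio:       {adapter / full:.2%}")
```

This is why the adapter is a small download compared with the base model, which still has to be obtained separately.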

Usage

Here is example inference code. You will need to install the following requirements first:

bitsandbytes==0.39.0
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
accelerate @ git+https://github.com/huggingface/accelerate.git
einops==0.6.1
evaluate==0.4.0
scikit-learn==1.2.2
sentencepiece==0.1.99
wandb==0.15.3
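Before running the inference code, it can help to confirm that the environment matches this list. The following is a minimal sketch using only the standard library. The `check_requirements` helper and the `REQUIREMENTS` dictionary are hypothetical names introduced here. Packages installed from git have no pinned version, so only their presence is checked.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements list above. None means "installed from
# git, no pinned version", so only presence is checked for those packages.
REQUIREMENTS = {
    "bitsandbytes": "0.39.0",
    "transformers": None,
    "peft": None,
    "accelerate": None,
    "einops": "0.6.1",
    "evaluate": "0.4.0",
    "scikit-learn": "1.2.2",
    "sentencepiece": "0.1.99",
    "wandb": "0.15.3",
}


def check_requirements(requirements):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, pinned in requirements.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if pinned is not None and installed != pinned:
            problems.append(f"{name}: found {installed}, expected {pinned}")
    return problems


for problem in check_requirements(REQUIREMENTS):
    print(problem)
```

A version mismatch is not necessarily fatal. The pins only record the environment the example was written against.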

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Local path (or Hub id) of the base LLaMA-30B weights.
model_name = "models/huggyllama_llama-30b/"
# Path to the Carpincho LoRA adapter.
adapters_name = 'carpincho-30b-qlora'

print(f"Starting to load the model {model_name} into memory")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map="sequential"
)

print(f"Loading {adapters_name} into memory")
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = LlamaTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1

stop_token_ids = [0]

print(f"Successfully loaded the model {model_name} into memory")

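def main(tokenizer):
    # Alpaca-style prompt. The closing quotes sit right after the final newline
    # so the prompt does not end with stray indentation spaces.
    prompt = '''Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
%s
### Response:
''' % "Hola, como estas?"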

    batch = tokenizer(prompt, return_tensors="pt")
    batch = {k: v.cuda() for k, v in batch.items()}

    with torch.no_grad():
        generated = model.generate(inputs=batch["input_ids"],
                               do_sample=True, use_cache=True,
                               repetition_penalty=1.1,
                               max_new_tokens=100,
                               temperature=0.9,
                               top_p=0.95,
                               top_k=40,
                               return_dict_in_generate=True,
                               output_attentions=False,
                               output_hidden_states=False,
                               output_scores=False)
    result_text = tokenizer.decode(generated['sequences'].cpu().tolist()[0])
    print(result_text)

main(tokenizer)
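The decoded text printed by the example contains the whole prompt followed by the model's answer, plus the tokenizer's special tokens. A small pair of helpers can build the prompt and cut the answer out of the decoded string. This is a sketch, not part of the original code: `build_prompt` and `extract_response` are hypothetical names, the template mirrors the one in the example above, and cutting at a following "### Instruction:" is a heuristic for cases where the model keeps writing a new turn.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n"
    "{instruction}\n"
    "### Response:\n"
)
RESPONSE_MARKER = "### Response:"
INSTRUCTION_MARKER = "### Instruction:"


def build_prompt(instruction: str) -> str:
    """Fill the instruction into the prompt template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def extract_response(decoded: str) -> str:
    """Return only the model's answer from a decoded generation."""
    if RESPONSE_MARKER in decoded:
        # Keep what follows the first response marker (the one from the prompt).
        decoded = decoded.split(RESPONSE_MARKER, 1)[1]
    # Heuristic: drop anything after a newly started instruction block.
    decoded = decoded.split(INSTRUCTION_MARKER, 1)[0]
    # Remove the LLaMA end-of-sequence token text if it was decoded.
    return decoded.replace("</s>", "").strip()
```

In the example above, this would mean passing `build_prompt("Hola, como estas?")` to the tokenizer and printing `extract_response(result_text)` instead of `result_text`.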

Model Description

Developed by: Alfredo Ortega (@ortegaalfredo)
Model type: 30B LLM QLoRA
Language(s) (NLP): English and colloquial Argentine Spanish
License: Free for non-commercial use, but I'm not the police.
Finetuned from model: https://huggingface.co/huggyllama/llama-30b

Model Sources

Repository: https://huggingface.co/huggyllama/llama-30b
Paper: https://arxiv.org/abs/2302.13971

Uses

This is a generic LLM chatbot that can be used to interact directly with humans.

Bias, Risks, and Limitations

This bot is uncensored and may provide shocking answers. It also carries the biases present in its training material.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Model Card Contact

Contact the creator at @ortegaalfredo on Twitter or GitHub.