# RedPajama-INCITE-Chat-3B-v1-ONNX

The RedPajama-INCITE-Chat-3B-v1 model by Together Computer, exported to ONNX for faster CPU inference and for fine-tuning with ONNX Runtime.
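
For context, an equivalent export can be reproduced with Optimum's Python API. The sketch below is illustrative (the output directory is a placeholder, and the exact export settings of this checkpoint may differ); it loads the original PyTorch checkpoint and converts it on the fly:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# export=True converts the PyTorch checkpoint to ONNX while loading.
model = ORTModelForCausalLM.from_pretrained(
    "togethercomputer/RedPajama-INCITE-Chat-3B-v1", export=True
)
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-3B-v1")

# Save the ONNX graph and tokenizer files locally (placeholder path).
model.save_pretrained("./redpajama-chat-3b-onnx")
tokenizer.save_pretrained("./redpajama-chat-3b-onnx")
```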

## Inference


```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Load the tokenizer and the ONNX model (use_cache=False matches an export
# without past-key-value inputs).
tokenizer = AutoTokenizer.from_pretrained("orangetin/RedPajama-INCITE-Chat-3B-v1-ONNX")
model = ORTModelForCausalLM.from_pretrained(
    "orangetin/RedPajama-INCITE-Chat-3B-v1-ONNX", use_cache=False
)

# Build a prompt in the model's <human>/<bot> chat format and generate a reply.
prompt = "<human>: Who is Alan Turing?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.7,
    top_k=50,
    return_dict_in_generate=True,
)

# Decode only the newly generated tokens, skipping the prompt.
tokens = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(tokens)
print(output_str)
```
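
Because the model continues the `<human>:`/`<bot>:` transcript, sampling can run past the answer and start inventing the next `<human>:` turn. One way to cut it off is transformers' stopping-criteria hook; a minimal sketch (the `StopOnHumanTurn` class is a hypothetical helper, not part of this repo, and it reuses `tokenizer`, `model`, `inputs`, and `input_length` from the snippet above):

```python
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnHumanTurn(StoppingCriteria):
    """Stop once the generated continuation starts a new '<human>:' turn."""

    def __init__(self, tokenizer, prompt_length):
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the continuation and check for the next human turn.
        text = self.tokenizer.decode(input_ids[0, self.prompt_length:])
        return "<human>:" in text

stopping = StoppingCriteriaList([StopOnHumanTurn(tokenizer, input_length)])
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7,
    top_k=50, stopping_criteria=stopping, return_dict_in_generate=True,
)
```

Decoding on every step costs a full pass over the sequence, which is fine for short chat replies; truncating the decoded string at `"<human>:"` after generation achieves the same effect more cheaply.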

## Fine-tuning

For fine-tuning with ONNX Runtime, see the language-modeling example in Optimum: https://github.com/huggingface/optimum/tree/main/examples/onnxruntime/training/language-modeling
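
In outline, that example wraps the standard Trainer API with Optimum's `ORTTrainer`. A minimal sketch under stated assumptions (a pre-tokenized `train_dataset` that you provide, the ONNX Runtime training extras installed, and a placeholder output path; the linked example is the authoritative recipe):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

# Fine-tuning starts from the original PyTorch checkpoint; ONNX Runtime
# accelerates the training backend.
model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-3B-v1")
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-3B-v1")

args = ORTTrainingArguments(
    output_dir="./redpajama-chat-3b-finetuned",  # placeholder path
    per_device_train_batch_size=1,
    num_train_epochs=1,
    optim="adamw_ort_fused",  # ONNX Runtime's fused AdamW
)

trainer = ORTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: a tokenized causal-LM dataset
    tokenizer=tokenizer,
)
trainer.train()
```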