Quokka

Model description

Quokka is our first generative pre-trained transformer (GPT) model for European Portuguese (PT-PT). It is a fine-tuned version of Phoenix, which was released on 04/08/2023. The backbone of Phoenix is BLOOMZ, which was fine-tuned on a dataset of 267k instruction samples and 189k conversation samples.

Intended uses & limitations

You can use the model for text generation in Portuguese or fine-tune it on a downstream task.

How to use

You can use this model directly with a pipeline for text generation:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

checkpoint = "automaise/quokka-7b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# device_map="auto" lets Accelerate place the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# The model is already placed by device_map, so no device argument is passed here
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
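The usage examples in this card wrap each instruction in a <human>...<bot> template before calling the generator. The sketch below shows one way to factor that out and to pull the answer out of the pipeline result. The helper names build_prompt and extract_answer are hypothetical and not part of the released code. It assumes the default text-generation pipeline output (a list of dicts with a generated_text field that contains the prompt followed by the continuation) and that the <bot> tag survives decoding.

```python
from typing import Dict, List

HUMAN_TAG = "<human>"
BOT_TAG = "<bot>"


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the <human>...<bot> template used in this card."""
    return f"{HUMAN_TAG}{instruction}{BOT_TAG}"


def extract_answer(outputs: List[Dict[str, str]]) -> str:
    """Return the text that follows the last <bot> tag of the first generation.

    If the tag is absent, rpartition leaves the whole string in the last
    slot, so the full generated text is returned unchanged (stripped).
    """
    full_text = outputs[0]["generated_text"]
    _, _, answer = full_text.rpartition(BOT_TAG)
    return answer.strip()


# Stand-in for a real pipeline result, for illustration only.
fake_outputs = [{"generated_text": build_prompt("Olá") + " Bom dia!"}]
print(extract_answer(fake_outputs))
```

With a loaded pipeline, the same helpers would be used as extract_answer(generator(build_prompt(prompt), max_new_tokens=512)).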

Here are some examples of what the model can do:

Text summarization

prompt = "Quero que ajas como um sumarizador de texto e me ajudes a criar um sumário conciso do seguinte texto apresentado, realçando apenas os pontos essenciais do texto original: Rabo de Peixe, internacionalmente conhecida como Turn of the Tide, foi criada por Augusto Fraga e Patrícia Cerqueira; e produzida para a Netflix pela Ukbar Filmes. A história é protagonizada por um grupo de amigos numa freguesia da ilha de São Miguel e foi inspirada no surgimento, em 2001, de toneladas de cocaína na costa açoriana após ter sido largada no Oceano Atlântico por traficantes. \"Quando um barco carregado de cocaína naufraga na sua ilha natal, Eduardo vê uma oportunidade arriscada, mas empolgante, de ganhar dinheiro e realizar sonhos impossíveis\", diz a sinopse apresentada pela Netflix."

# do_sample=False requests greedy (deterministic) decoding
generator(f"<human>{prompt}<bot>", max_new_tokens=512, do_sample=False)

>> Turn of the Tide é uma série de televisão portuguesa criada por Augusto Fraga e Patrícia Cerqueira para a Netflix. A história é ambientada em uma freguesia da ilha de São Miguel e segue um grupo de amigos que se envolvem em um mistério envolvendo a cocaína que foi largada no Oceano Atlântico em 2001. A série é inspirada em um evento real que ocorreu na costa açoriana.

Question answering

prompt = "Lê o seguinte texto: Rabo de Peixe, internacionalmente conhecida como Turn of the Tide, foi criada por Augusto Fraga e Patrícia Cerqueira; e produzida para a Netflix pela Ukbar Filmes. A história é protagonizada por um grupo de amigos numa freguesia da ilha de São Miguel e foi inspirada no surgimento, em 2001, de toneladas de cocaína na costa açoriana após ter sido largada no Oceano Atlântico por traficantes. \"Quando um barco carregado de cocaína naufraga na sua ilha natal, Eduardo vê uma oportunidade arriscada, mas empolgante, de ganhar dinheiro e realizar sonhos impossíveis\", diz a sinopse apresentada pela Netflix. \n\n Onde foi filmada a série Rabo de Peixe?"

generator(f"<human>{prompt}<bot>", max_new_tokens=512, do_sample=False)

>> A série Rabo de Peixe foi filmada na ilha de São Miguel, nos Açores.

Synthetic data

prompt = "Gera uma frase semelhante à seguinte frase: Bom dia, em que posso ser útil?"

generator(f"<human>{prompt}<bot>", max_new_tokens=512, do_sample=False)

>> Bom dia, como posso ajudá-lo?

Limitations and bias

Language: the model was fine-tuned on Portuguese data only and might not generalize appropriately to other languages.
Prompt Engineering: the model's performance may vary depending on the prompt. We recommend writing clear and specific instructions.
Bias: the model might produce factually incorrect outputs or perpetuate biases present in its training data. It is essential to be aware of these limitations and to exercise caution when using the model for human-facing interactions. These biases also carry over to any model fine-tuned from this one.

We noticed that the model avoids answering questions of a religious or political nature:

prompt = "Que partido político é que apoias?"

generator(f"<human>{prompt}<bot>", max_new_tokens=512, do_sample=False)

>> Como uma IA, não tenho preferências políticas.

Training data

Quokka was fine-tuned on a dataset collected from different sources:

Initially, we used the Cabrita dataset, which consists of a translation of Alpaca's training data. The Portuguese translation was generated using ChatGPT, so these translations may not be of the highest quality.
Then, we incorporated the Bactrian-X dataset, which involves the translation of 67k English instructions (52k from Alpaca and 15k from Dolly v2) into 51 languages using the Google Translate API. For our purposes, we selected only the Portuguese subset and focused on the samples derived from Dolly v2.

Additionally, we conducted data curation to remove elements such as:

Samples exhibiting a high ratio of prompt length to output length, as these were deemed likely to induce model hallucinations.
Samples that lost meaning during the translation process, particularly those instructing the translation of a given text.

As a result, our final dataset comprises 56k samples.
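The two curation rules described above could be sketched roughly as follows. This is an illustration, not the filtering code we ran: the threshold value, the keyword list, and the function names are all hypothetical, and the instruction/input/output field names assume the Alpaca-style sample schema.

```python
from typing import Dict, List

# Hypothetical threshold: drop samples whose prompt is more than
# 10x longer (in characters) than the expected output.
MAX_PROMPT_TO_OUTPUT_RATIO = 10.0

# Hypothetical keywords used to spot instructions that ask for a translation.
TRANSLATION_KEYWORDS = ("traduz", "tradução", "translate")


def length_ratio(sample: Dict[str, str]) -> float:
    """Ratio of prompt length (instruction + input) to output length."""
    prompt_len = len(sample["instruction"]) + len(sample.get("input", ""))
    output_len = max(len(sample["output"]), 1)  # avoid division by zero
    return prompt_len / output_len


def is_translation_task(sample: Dict[str, str]) -> bool:
    """True if the instruction appears to ask for a translation."""
    instruction = sample["instruction"].lower()
    return any(keyword in instruction for keyword in TRANSLATION_KEYWORDS)


def curate(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep samples that pass both the ratio rule and the translation rule."""
    return [
        s
        for s in samples
        if length_ratio(s) <= MAX_PROMPT_TO_OUTPUT_RATIO
        and not is_translation_task(s)
    ]
```

A keyword match is a crude proxy for "lost meaning during translation"; in practice such samples may also need manual review.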

Training procedure

This model was trained on 1 x NVIDIA A100 40GB for about 4-5 hours using QLoRA. This fine-tuning approach allowed us to significantly reduce memory usage and computation time.
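To give a rough sense of the memory saving, the back-of-the-envelope calculation below compares the storage needed for the frozen base weights at 16 bits and at the 4 bits used by QLoRA. The parameter count of 7 billion is an assumption taken from the checkpoint name, and the estimate ignores quantization constants, adapter weights, optimizer state, and activations.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: roughly 7 billion parameters

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
```

Under these assumptions the base weights shrink from about 14 GB to about 3.5 GB, which leaves more headroom on a single 40 GB GPU.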

Evaluation results

To evaluate the performance of our model, we translated 70 questions, which were originally used to assess the capabilities of the Phoenix model, from English to Portuguese. We then evaluated the answers automatically, using GPT-3.5 as the evaluator and the general prompt as the metric evaluation prompt. This prompt was designed to elicit assessments of answers in terms of helpfulness, relevance, accuracy, and level of detail. Additional prompts are provided for assessing overall performance from different perspectives.

The results against GPT-3.5 and two of the highest-performing open-source models at the moment, Vicuna (13B) and Falcon (40B), are shown below:

Automatic Evaluation in Portuguese:

Comparison	Lose	Tie	Win
Quokka vs. GPT-3.5	63.8%	10.1%	26.1%
Quokka vs. Vicuna-13B	66.2%	8.8%	25.0%
Quokka vs. Falcon-40B	17.4%	1.4%	81.2%

It is important to observe that the automatic evaluation of large language models is still an ongoing area of research and development, and these automatic tests may not always yield fair or comprehensive assessments. Therefore, these results should be taken with caution and not be treated as definitive.
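Percentages like those in the table above can be tallied from per-question verdicts. The sketch below assumes each pairwise comparison has already been reduced to one of three labels ("lose", "tie", "win"); the function name and label strings are hypothetical, and the verdicts in the example are made up for illustration, not taken from our evaluation.

```python
from collections import Counter
from typing import Dict, Iterable

OUTCOMES = ("lose", "tie", "win")


def tally(verdicts: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of each outcome, rounded to one decimal place."""
    counts = Counter(verdicts)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to tally")
    return {outcome: round(100 * counts[outcome] / total, 1) for outcome in OUTCOMES}


# Made-up verdicts for ten questions.
example = ["win"] * 5 + ["tie"] * 1 + ["lose"] * 4
print(tally(example))
```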

Environmental impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.

Hardware Type: 1 x NVIDIA A100 40GB
Hours used: 4-5
Cloud Provider: Google Cloud Platform
Compute Region: europe-west4
Carbon Emitted: 0.71 kg eq. CO2
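
An estimate of this kind can be approximated by multiplying power draw, runtime, and the carbon intensity of the grid. The sketch below assumes a GPU power draw of 250 W and a carbon intensity of 0.57 kg CO2 eq. per kWh for the region. Neither value is stated in this card; both are illustrative assumptions, chosen because at 5 hours they land close to the reported figure.

```python
def estimate_emissions_kg(power_kw: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Emissions = energy used (kWh) x carbon intensity (kg CO2 eq. per kWh)."""
    energy_kwh = power_kw * hours
    return energy_kwh * kg_co2_per_kwh


GPU_POWER_KW = 0.25      # assumption: 250 W power draw for one GPU
CARBON_INTENSITY = 0.57  # assumption: kg CO2 eq. per kWh in the compute region

# The card reports a runtime of 4-5 hours, so both ends of the range are shown.
for hours in (4, 5):
    emitted = estimate_emissions_kg(GPU_POWER_KW, hours, CARBON_INTENSITY)
    print(f"{hours} h -> {emitted:.2f} kg CO2 eq.")
```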
