Casper: a Catalan Automatic Speech Recognition Model

Table of Contents

Model Details

Uses

Direct Use

This model can be used for Catalan speech transcription tasks.

Limitations

Training

Training Data

This model was fine-tuned on the Common Voice and ParlamentParla Catalan datasets.

The Common Voice project seeks to provide a platform where everyone can contribute their own voice to an open-source multilingual data bank. The model developers used the version mozilla-foundation/common_voice_11_0 (ca) for training.

The ParlamentParla speech corpus contains audio segments extracted from recordings of Catalan Parliament plenary sessions held between 2007/07/11 and 2018/07/1.

Training Procedure

| Step | Training Loss | Epoch | Validation Loss | Validation WER |
|--------|---------------|-------|-----------------|----------------|
| 10000  | 0.11  | 0.43 | 0.14 | 6.49% |
| 20000  | 0.09  | 0.86 | 0.13 | 6.28% |
| 30000  | 0.05  | 1.28 | 0.13 | 5.91% |
| 40000  | 0.06  | 1.71 | 0.12 | 5.90% |
| 50000  | 0.03  | 2.14 | 0.13 | 5.70% |
| 60000  | 0.03  | 2.57 | 0.13 | 5.82% |
| 70000  | 0.03  | 3.00 | 0.13 | 5.56% |
| 80000  | 0.01  | 3.43 | 0.14 | 5.64% |
| 90000  | 0.01  | 3.85 | 0.14 | 5.59% |
| 100000 | 0.01  | 4.28 | 0.14 | 5.50% |
| 110000 | 0.01  | 4.71 | 0.14 | 5.42% |
| 120000 | 0.01  | 5.14 | 0.15 | 5.83% |
| 130000 | 0.01  | 5.57 | 0.15 | 5.65% |
| 140000 | 0.01  | 6.00 | 0.15 | 5.54% |
| 150000 | 0.003 | 6.42 | 0.15 | 5.56% |

Evaluation

Evaluation Data

The evaluation dataset was created by the developer Xavier from webinars hosted by the University of Barcelona; it is mostly domain-specific, covering topics in linguistics and language policy.

The distribution of different specifications in the evaluation set is as follows:

| Specification | Category  | #  | %      |
|---------------|-----------|----|--------|
| Register      | Formal    | 88 | 57.14% |
|               | Informal  | 66 | 42.86% |
| Accent        | Central   | 33 | 21.43% |
|               | Balearic  | 44 | 28.57% |
|               | Valencian | 44 | 28.57% |
|               | Western   | 33 | 21.43% |
| Gender        | Male      | 66 | 42.86% |
|               | Female    | 88 | 57.14% |

Evaluation Metrics

The model developers evaluated Casper with two metrics: Word Error Rate (WER) on the transcriptions, and BLEU on downstream machine translation (MT) of those transcriptions from Catalan into Spanish and English.
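As a reference for how WER is computed, here is a minimal pure-Python sketch: the word-level Levenshtein distance (substitutions, insertions, deletions) divided by the reference length. This is an illustration only; production evaluations typically use a library such as jiwer.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion against a 4-word reference → WER 0.5
print(wer("el gat seu aquí", "el got seu"))  # → 0.5
```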

Our fine-tuned Whisper model, Casper, significantly outperforms the zero-shot pre-trained Whisper model across all specifications of the evaluation dataset, and these improvements carry over to better scores on the downstream MT task.

WER

| Specification | Category  | Whisper-small | Fine-tuned Whisper-small |
|---------------|-----------|---------------|--------------------------|
| Register      | Formal    | 31.21%        | 17.71%                   |
|               | Informal  | 53.10%        | 22.10%                   |
| Accent        | Central   | 16.38%        | 14.39%                   |
|               | Balearic  | 29.76%        | 29.68%                   |
|               | Valencian | 77.28%        | 16.15%                   |
|               | Western   | 21.10%        | 17.48%                   |
| Gender        | Male      | 57.49%        | 15.14%                   |
|               | Female    | 24.60%        | 23.39%                   |
| Total         | /         | 40.12%        | 19.50%                   |
BLEU

| Language | Target                       | Correct Transcript | Whisper-small | Fine-tuned Whisper-small |
|----------|------------------------------|--------------------|---------------|--------------------------|
| Spanish  | Human Translation            | 83.5498            | 54.0836       | 63.7367                  |
|          | Machine-assisted Translation | 84.219             | 54.5868       | 63.9436                  |
| English  | Human Translation            | 32.7               | 29.5          | 30.8                     |
|          | Machine-assisted Translation | 33.5               | 30.3          | 31.6                     |
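For readers unfamiliar with BLEU, the sketch below shows a simplified single-reference, sentence-level variant: the geometric mean of modified 1- to 4-gram precisions, scaled by a brevity penalty. The scores above were presumably produced with a standard toolkit (e.g. sacrebleu), which handles tokenization, multiple references, and smoothing; this is only an illustration of the metric's core idea.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified single-reference sentence BLEU with uniform n-gram weights."""
    cand = candidate.split()
    ref = reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped (modified) n-gram precision
        overlap = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

print(bleu("el gat seu a la catifa", "el gat seu a la catifa"))  # → 1.0
```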

How to Get Started With the Model

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

# Load Casper and its processor
processor = WhisperProcessor.from_pretrained("maximilianchen/casper")
model = WhisperForConditionalGeneration.from_pretrained("maximilianchen/casper")

# Load an audio sample; Whisper expects 16 kHz audio, so resample if needed
sa, sr = torchaudio.load(filename)
if sr != 16000:
    sa = torchaudio.functional.resample(sa, orig_freq=sr, new_freq=16000)
    sr = 16000
sa = sa.squeeze(0)

# Convert the audio sample into log-Mel input features
inputs = processor(sa, sampling_rate=sr, return_tensors="pt").input_features

# Generate token ids
with torch.no_grad():
    generated_ids = model.generate(inputs=inputs)

# Decode token ids to text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription)
```
['miraré de destacar molt breument què té d específic i essencial la coordinació aquesta estructura aparentment trivial on diem que coordinem dues categories aparentment iguals què té d especial què té de específic perquè és complicat si té raó o eies per això es parla d equivalència sintàctica i semàntica i llavors el repte és veure exactament què què té de sintàctica què té de semàntica']