Casper: a Catalan Automatic Speech Recognition Model

Table of Contents

Model Details

Uses

Direct Use

This model can be used for Catalan speech transcription tasks.

Limitations

Training

Training Data

This model was fine-tuned on the Common Voice and ParlamentParla Catalan datasets.

The Common Voice project seeks to provide a platform where everyone can contribute their own voice to an open-source multilingual data bank. The model developers used the version mozilla-foundation/common_voice_11_0 (ca) for training.

The ParlamentParla speech corpus contains audio segments extracted from recordings of Catalan Parliament plenary sessions held between 2007/07/11 and 2018/07/1.

Training Procedure

| Step | Training Loss | Epoch | Validation Loss | Validation WER |
|--------|---------------|-------|-----------------|----------------|
| 10000  | 0.11  | 0.43 | 0.14 | 6.49% |
| 20000  | 0.09  | 0.86 | 0.13 | 6.28% |
| 30000  | 0.05  | 1.28 | 0.13 | 5.91% |
| 40000  | 0.06  | 1.71 | 0.12 | 5.90% |
| 50000  | 0.03  | 2.14 | 0.13 | 5.70% |
| 60000  | 0.03  | 2.57 | 0.13 | 5.82% |
| 70000  | 0.03  | 3.00 | 0.13 | 5.56% |
| 80000  | 0.01  | 3.43 | 0.14 | 5.64% |
| 90000  | 0.01  | 3.85 | 0.14 | 5.59% |
| 100000 | 0.01  | 4.28 | 0.14 | 5.50% |
| 110000 | 0.01  | 4.71 | 0.14 | 5.42% |
| 120000 | 0.01  | 5.14 | 0.15 | 5.83% |
| 130000 | 0.01  | 5.57 | 0.15 | 5.65% |
| 140000 | 0.01  | 6.00 | 0.15 | 5.54% |
| 150000 | 0.003 | 6.42 | 0.15 | 5.56% |

Evaluation

Evaluation Data

The evaluation dataset was created by the developer Xavier from webinars hosted by the University of Barcelona; it is mostly domain-specific, covering topics in linguistics and language policy.

The distribution of different specifications in the evaluation set is as follows:

| Specification | Category  | #  | %      |
|---------------|-----------|----|--------|
| Register      | Formal    | 88 | 57.14% |
|               | Informal  | 66 | 42.86% |
| Accent        | Central   | 33 | 21.43% |
|               | Balearic  | 44 | 28.57% |
|               | Valencian | 44 | 28.57% |
|               | Western   | 33 | 21.43% |
| Gender        | Male      | 66 | 42.86% |
|               | Female    | 88 | 57.14% |

Evaluation Metrics

The model developers evaluated Casper with two metrics: Word Error Rate (WER) on the transcriptions, and BLEU on downstream machine translation (MT) of those transcriptions from Catalan into Spanish and English.
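As a reference for how WER is computed, here is a minimal pure-Python sketch: the word-level Levenshtein distance (substitutions, insertions, deletions) divided by the reference length. This is an illustration only; production evaluations typically use a library such as jiwer.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion against a 4-word reference → WER 0.5
print(wer("el gat seu aquí", "el got seu"))  # → 0.5
```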

Our fine-tuned Whisper model, Casper, significantly outperforms the zero-shot pre-trained Whisper model across all specifications of the evaluation dataset, and these improvements carry over to better scores on the downstream MT task.

WER

| Specification | Category  | Whisper-small | Fine-tuned Whisper-small |
|---------------|-----------|---------------|--------------------------|
| Register      | Formal    | 31.21%        | 17.71%                   |
|               | Informal  | 53.10%        | 22.10%                   |
| Accent        | Central   | 16.38%        | 14.39%                   |
|               | Balearic  | 29.76%        | 29.68%                   |
|               | Valencian | 77.28%        | 16.15%                   |
|               | Western   | 21.10%        | 17.48%                   |
| Gender        | Male      | 57.49%        | 15.14%                   |
|               | Female    | 24.60%        | 23.39%                   |
| Total         | /         | 40.12%        | 19.50%                   |
BLEU

| Language | Target                       | Correct Transcript | Whisper-small | Fine-tuned Whisper-small |
|----------|------------------------------|--------------------|---------------|--------------------------|
| Spanish  | Human Translation            | 83.5498            | 54.0836       | 63.7367                  |
|          | Machine-assisted Translation | 84.219             | 54.5868       | 63.9436                  |
| English  | Human Translation            | 32.7               | 29.5          | 30.8                     |
|          | Machine-assisted Translation | 33.5               | 30.3          | 31.6                     |
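For readers unfamiliar with BLEU, the sketch below shows a simplified single-reference, sentence-level variant: the geometric mean of modified 1- to 4-gram precisions, scaled by a brevity penalty. The scores above were presumably produced with a standard toolkit (e.g. sacrebleu), which handles tokenization, multiple references, and smoothing; this is only an illustration of the metric's core idea.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified single-reference sentence BLEU with uniform n-gram weights."""
    cand = candidate.split()
    ref = reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped (modified) n-gram precision
        overlap = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

print(bleu("el gat seu a la catifa", "el gat seu a la catifa"))  # → 1.0
```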

How to Get Started With the Model

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

# Load Casper and its processor
processor = WhisperProcessor.from_pretrained("maximilianchen/casper")
model = WhisperForConditionalGeneration.from_pretrained("maximilianchen/casper")

# Load an audio sample; Whisper expects 16 kHz audio, so resample if needed
sa, sr = torchaudio.load(filename)
if sr != 16000:
    sa = torchaudio.functional.resample(sa, orig_freq=sr, new_freq=16000)
    sr = 16000
sa = sa.squeeze(0)

# Convert the audio sample into log-Mel input features
inputs = processor(sa, sampling_rate=sr, return_tensors="pt").input_features

# Generate token ids
with torch.no_grad():
    generated_ids = model.generate(inputs=inputs)

# Decode token ids to text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription)
```
['miraré de destacar molt breument què té d específic i essencial la coordinació aquesta estructura aparentment trivial on diem que coordinem dues categories aparentment iguals què té d especial què té de específic perquè és complicat si té raó o eies per això es parla d equivalència sintàctica i semàntica i llavors el repte és veure exactament què què té de sintàctica què té de semàntica']