Whisper Small zh-HK - Alvin

This model is a fine-tuned version of openai/whisper-small on the Common Voice 11.0 dataset. Compared with the previous version, this one has a CER about 1% lower.

Training and evaluation data

For training, three datasets were used:

Common Voice 11 Canto Train Set
CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf

Using the Model

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the audio and resample it to the 16 kHz rate Whisper expects
y, sr = librosa.load('audio.mp3', sr=16000)

MODEL_NAME = "alvanlii/whisper-small-cantonese"

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Do not force language/task tokens and do not suppress any tokens
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform to log-mel input features
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)
# Decode the generated token IDs back to text
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)

Alternatively, you can use a Hugging Face pipeline:

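import torch
from transformers import pipeline

MODEL_NAME = "alvanlii/whisper-small-cantonese"
lang = "zh"
# Use the first GPU if one is available, otherwise fall back to CPU (-1)
device = 0 if torch.cuda.is_available() else -1
# Path to the audio file to transcribe (placeholder)
file = "audio.mp3"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)
# Force the decoder to transcribe in the chosen language
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe(file)["text"]
print(text)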
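
The chunk_length_s=30 setting means that long recordings are processed as overlapping windows of at most 30 seconds. The sketch below only illustrates that idea in plain Python; it is not the pipeline's own code. The function name chunk_windows and the 5-second overlap on each side are assumptions made for the example.

```python
def chunk_windows(num_samples, sampling_rate=16000,
                  chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping windows.

    Hypothetical helper: stride_length_s is an assumed overlap kept on
    each side of a window, not a value taken from the model card.
    """
    chunk_len = int(chunk_length_s * sampling_rate)
    stride = int(stride_length_s * sampling_rate)
    # Each window advances by the chunk length minus the overlap on both sides
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must be more than twice stride_length_s")

    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        windows.append((start, end))
        if end >= num_samples:
            break
        start += step
    return windows


# A 70-second clip sampled at 16 kHz
for start, end in chunk_windows(70 * 16000):
    print(start / 16000, end / 16000)
```

Audio shorter than one chunk yields a single window, so short clips are unaffected by the setting.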

Training Hyperparameters

learning_rate: 5e-5
train_batch_size: 25 (on 2 GPUs)
eval_batch_size: 8
gradient_accumulation_steps: 2
total_train_batch_size: 100 (25 per GPU x 2 GPUs x 2 gradient accumulation steps)
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 14000
mixed_precision_training: Native AMP
augmentation: SpecAugment
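
To make the numbers above concrete, the following sketch recomputes the effective batch size and the learning rate that a linear schedule with warmup would give at a chosen step. It assumes the common definition of that schedule (linear ramp from 0 to the peak rate during warmup, then linear decay to 0 at the final step); the function name linear_schedule_lr is hypothetical, and the actual training code is not shown in this card.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=14000):
    """Learning rate at a given step for linear warmup followed by linear decay.

    Assumed schedule shape; defaults are the hyperparameters listed above.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * (step / warmup_steps)
    # Decay: fall linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / (total_steps - warmup_steps))


# Effective batch size: per-GPU batch x number of GPUs x accumulation steps
per_device_batch = 25
num_gpus = 2
grad_accum_steps = 2
effective_batch = per_device_batch * num_gpus * grad_accum_steps
print(effective_batch)

for step in (0, 250, 500, 7250, 14000):
    print(step, linear_schedule_lr(step))
```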

Training Results

Training Loss	Epoch	Step	Validation Loss	Normalized CER (%)
0.4610	0.55	2000	0.3106	13.08
0.3441	1.11	4000	0.2875	11.79
0.3466	1.66	6000	0.2820	11.44
0.2539	2.22	8000	0.2777	10.59
0.2312	2.77	10000	0.2822	10.60
0.1639	3.32	12000	0.2859	10.17
0.1569	3.88	14000	0.2866	10.11
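
For readers unfamiliar with the metric, character error rate (CER) is the character-level edit distance between the reference and the predicted transcript, divided by the reference length. The sketch below is a minimal pure-Python version. The normalization step (removing whitespace only) is an assumption for illustration, since the card does not say which normalization was applied, and the helper names are hypothetical.

```python
def normalize(text):
    """Assumed normalization: drop all whitespace. The card's actual rule is not stated."""
    return "".join(text.split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One missing character out of six in the reference
print(cer("我今日好開心", "我今日開心"))
```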
