

Whisper Small zh-HK - Alvin

This model is a fine-tuned version of openai/whisper-small on the Common Voice 11.0 dataset. Its CER is 1% lower than that of the previous version.

Training and evaluation data

For training, three datasets were used:

Using the Model

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the audio file and resample it to 16 kHz
y, sr = librosa.load('audio.mp3', sr=16000)

MODEL_NAME = "alvanlii/whisper-small-cantonese"

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Clear the forced decoder prompt and the suppressed-token list, and disable the KV cache
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform into model input features
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)
# Decode the generated token IDs into text
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)
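
Alternatively, use the pipeline API:

import torch
from transformers import pipeline

MODEL_NAME = "alvanlii/whisper-small-cantonese"
lang = "zh"
# Assumed setup: use the first GPU if one is available, otherwise the CPU
device = 0 if torch.cuda.is_available() else "cpu"
file = "audio.mp3"  # placeholder path to the audio file to transcribe

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)
# Force the decoder to transcribe in the chosen language
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe(file)["text"]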

Training Hyperparameters

Training Results

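| Training Loss | Epoch | Step  | Validation Loss | Normalized CER |
|---------------|-------|-------|-----------------|----------------|
| 0.4610        | 0.55  | 2000  | 0.3106          | 13.08          |
| 0.3441        | 1.11  | 4000  | 0.2875          | 11.79          |
| 0.3466        | 1.66  | 6000  | 0.2820          | 11.44          |
| 0.2539        | 2.22  | 8000  | 0.2777          | 10.59          |
| 0.2312        | 2.77  | 10000 | 0.2822          | 10.60          |
| 0.1639        | 3.32  | 12000 | 0.2859          | 10.17          |
| 0.1569        | 3.88  | 14000 | 0.2866          | 10.11          |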
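
For readers unfamiliar with the CER column, the sketch below shows one way a character error rate can be computed: the character-level edit distance between reference and hypothesis, divided by the number of reference characters, expressed as a percentage. This card does not say which text normalization or evaluation script produced the numbers above, so the `normalize` helper (which only strips whitespace) and the function names are illustrative assumptions, not the evaluation code used for this model.

```python
def normalize(text: str) -> str:
    # Hypothetical normalization: remove all whitespace only.
    # The normalization actually used for this model is not stated in the card.
    return "".join(text.split())


def edit_distance(ref: str, hyp: str) -> int:
    # Levenshtein distance over characters, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    # Corpus-level CER in percent: total edits / total reference characters.
    refs = [normalize(r) for r in references]
    hyps = [normalize(h) for h in hypotheses]
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars


# One substituted character out of four reference characters
print(cer(["你好世界"], ["你好世间"]))  # 25.0
```

Summing edits and characters over the whole corpus, rather than averaging per-utterance rates, keeps short utterances from dominating the score.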