whisper-event


Whisper Large V2 zh-HK - Alvin

This model is a fine-tuned version of openai/whisper-large-v2 on the Common Voice 11.0 dataset. It was trained with PEFT LoRA on top of a bitsandbytes (BNB) INT8-quantized base model and reaches a normalized character error rate (CER) of 7.77%.

To use the model, run the following code. Inference should fit in less than 4 GB of VRAM at a batch size of 1.

from peft import PeftModel, PeftConfig
from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

peft_model_id = "alvanlii/whisper-largev2-cantonese-peft-lora"
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)

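task = "transcribe"
language = "chinese"  # assumption: this card never defines `language`; "chinese" is a placeholder choice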
tokenizer = WhisperTokenizer.from_pretrained(peft_config.base_model_name_or_path, task=task)
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, task=task)
feature_extractor = processor.feature_extractor
forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)
pipe = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)

audio = "audio.wav"  # placeholder: path to your own audio file (decoding a file path requires ffmpeg)
text = pipe(audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]
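
The "less than 4 GB" figure can be sanity-checked with a back-of-the-envelope count of weight memory. The sketch below assumes roughly 1.55 billion parameters for Whisper large (the figure OpenAI publishes for the large models) and counts weights only. It ignores activations, the decoder cache, the LoRA adapter, CUDA overhead, and the fact that bitsandbytes leaves some layers unquantized, so treat the result as a lower bound rather than a measurement.

```python
# Rough weight-memory estimate for Whisper large at different precisions.
# Assumption: ~1.55B parameters; weights only, no activations or runtime overhead.
PARAMS_WHISPER_LARGE = 1_550_000_000


def weight_gib(num_params: int, bytes_per_param: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
        print(f"{name}: {weight_gib(PARAMS_WHISPER_LARGE, nbytes):.2f} GiB")
```

Under these assumptions the INT8 weights alone take about 1.5 GiB, which leaves headroom under 4 GB for everything the estimate leaves out, whereas fp32 weights would already exceed that budget.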

Training and evaluation data

For training, three datasets were used:

Training Hyperparameters

Training Results

| Training Loss | Epoch | Step  | Validation Loss | Normalized CER |
|---------------|-------|-------|-----------------|----------------|
| 0.8604        | 1.99  | 12000 | 0.2129          | 0.07766        |
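
For readers unfamiliar with the metric, the normalized CER of 0.07766 (7.77%) is a character-level edit distance divided by the number of reference characters, computed after text normalization. The sketch below is a minimal pure-Python illustration, not the evaluation script used for this model. The `normalize` rule (drop punctuation, symbols, and whitespace, then lowercase) and the corpus-level aggregation are assumptions, since this card does not say which normalizer or averaging scheme was used.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalizer: drop punctuation, symbols, and whitespace; lowercase the rest."""
    kept = []
    for ch in unicodedata.normalize("NFKC", text):
        if ch.isspace() or unicodedata.category(ch)[0] in ("P", "S", "Z"):
            continue
        kept.append(ch.lower())
    return "".join(kept)


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_cer(references, hypotheses) -> float:
    """Total character errors divided by total reference characters."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_n, hyp_n = normalize(ref), normalize(hyp)
        errors += edit_distance(ref_n, hyp_n)
        total += len(ref_n)
    return errors / total


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(normalized_cer(["我哋去食飯。"], ["我地去食飯"]))
```

Because Cantonese is written without spaces between words, a character-level metric like this avoids the word segmentation step that a word error rate would require.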