Whisper-base Thai finetuned

1) Environment Setup

# Install PyTorch first; see https://pytorch.org/get-started/locally/ for the command that matches your platform
pip3 install transformers librosa

2) Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa

device = "cuda" # cpu, cuda

model = WhisperForConditionalGeneration.from_pretrained("juierror/whisper-tiny-thai").to(device)
processor = WhisperProcessor.from_pretrained("juierror/whisper-tiny-thai", language="Thai", task="transcribe")

path = "/path/to/audio/file"

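def inference(path: str) -> str:
    """
    Get the transcription of the audio file at the given path.

    Args:
        path (str): path to an audio file (any format librosa can load)

    Returns:
        str: transcription
    """
    # Load the audio resampled to 16 kHz, the rate passed to the processor below
    audio, sr = librosa.load(path, sr=16000)
    input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    # Move the features to the same device as the model before generating
    generated_tokens = model.generate(input_features.to(device))
    transcriptions = processor.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    return transcriptions[0]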
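
To transcribe several files in one go, a small wrapper around a function such as inference can collect the results and keep going when one file fails. The sketch below is only an illustration: the name transcribe_many and its return shape are hypothetical, not part of this model's code, and it takes the transcription function as an argument so it does not depend on the model being loaded.

```python
from typing import Callable, Dict, Iterable, Tuple


def transcribe_many(
    paths: Iterable[str],
    transcribe: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (transcriptions, errors), each keyed by audio path.

    Hypothetical helper: `transcribe` is any function that maps an
    audio path to text, for example the `inference` function above.
    """
    transcriptions: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for audio_path in paths:
        try:
            transcriptions[audio_path] = transcribe(audio_path)
        except Exception as exc:  # e.g. a missing or unreadable file
            errors[audio_path] = str(exc)
    return transcriptions, errors
```

With the model loaded, a call might look like texts, failed = transcribe_many(["a.wav", "b.wav"], inference), where the file names are placeholders.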


3) Evaluation Results

This model has been trained and evaluated on three datasets: Common Voice 13, Gowajee, and Thai Elderly Speech (Smart Home and Health Care).

The Character Error Rate (CER) is computed after removing spaces from both the labels and the predicted text. The Word Error Rate (WER) is computed after tokenizing both the labels and the predicted text with the PyThaiNLP newmm tokenizer.
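
The two metrics described above can be sketched as follows. This is a minimal illustration, not the evaluation script used for the reported numbers. It assumes scores are expressed as percentages, that each pair is scored on its own (the source does not say whether scores are averaged per utterance or pooled over the corpus), and that labels are non-empty. The function names are hypothetical. The newmm tokenizer is imported lazily and needs pythainlp installed; how whitespace tokens are treated is not specified by the source, so the tokenizer's default is used.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(label: str, prediction: str) -> float:
    """CER in percent, after removing spaces from both strings."""
    ref = label.replace(" ", "")
    hyp = prediction.replace(" ", "")
    if not ref:
        raise ValueError("label must contain at least one non-space character")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def newmm_tokenize(text: str) -> List[str]:
    """Tokenize with PyThaiNLP's newmm engine (requires pythainlp)."""
    from pythainlp.tokenize import word_tokenize  # lazy import
    return word_tokenize(text, engine="newmm")


def wer(
    label: str,
    prediction: str,
    tokenize: Callable[[str], List[str]] = newmm_tokenize,
) -> float:
    """WER in percent, after tokenizing both strings with `tokenize`."""
    ref = tokenize(label)
    hyp = tokenize(prediction)
    if not ref:
        raise ValueError("label must contain at least one token")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Passing the tokenizer as an argument keeps the edit-distance logic independent of PyThaiNLP, so a different tokenizer can be swapped in for comparison.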

These are the results.

| Dataset                           | WER   | CER   |
|-----------------------------------|-------|-------|
| Common Voice 13                   | 23.14 | 6.74  |
| Gowajee                           | 24.79 | 11.39 |
| Thai Elderly Speech (Smart Home)  | 13.28 | 4.14  |
| Thai Elderly Speech (Health Care) | 12.99 | 3.41  |