automatic-speech-recognition openslr robust-speech-event km generated_from_trainer hf-asr-leaderboard

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the openslr dataset. It achieves the following results on the evaluation set:

Evaluation results on OpenSLR "test" (self-split 10%) (Running ./eval.py):

Evaluation results with language model on OpenSLR "test" (self-split 10%) (Running ./eval.py):

Installation

Install the following libraries on top of HuggingFace Transformers to enable language-model decoding.

pip install pyctcdecode
pip install https://github.com/kpu/kenlm/archive/master.zip

Usage

Approach 1: Use HuggingFace's pipeline, which handles everything end-to-end, from raw audio input to text output.

from transformers import pipeline

# Load the model
pipe = pipeline(model="vitouphy/wav2vec2-xls-r-300m-khmer")

# Process raw audio
output = pipe("sound_file.wav", chunk_length_s=10, stride_length_s=(4, 2))
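The chunk_length_s and stride_length_s arguments above split long audio into overlapping windows, and only the central part of each window contributes to the final text. The sketch below is a simplified illustration of that idea in seconds, not the library's actual implementation (which works on samples and may round differently); the function name chunk_windows and the 25-second duration are hypothetical.

```python
def chunk_windows(total_s, chunk_s=10.0, stride_left_s=4.0, stride_right_s=2.0):
    """Return (start, end, keep_start, keep_end) tuples, all in seconds.

    Each window spans [start, end]; only [keep_start, keep_end] is kept,
    the rest is overlap context shared with neighbouring windows.
    """
    step = chunk_s - stride_left_s - stride_right_s
    if step <= 0:
        raise ValueError("chunk length must exceed the sum of both strides")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        is_first = start == 0.0
        is_last = start + chunk_s >= total_s
        # The first window has no left context to drop, the last no right context.
        keep_start = start if is_first else start + stride_left_s
        keep_end = end if is_last else end - stride_right_s
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return windows


# Illustrative use with a hypothetical 25-second recording.
for window in chunk_windows(25.0):
    print(window)
```

With these settings each window advances by 10 - 4 - 2 = 4 seconds, and the kept regions tile the recording without gaps.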

Approach 2: A more customizable way to get predictions, calling the processor and model directly.

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC 
import librosa
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")

# Read and process the input
speech_array, sampling_rate = librosa.load("sound_file.wav", sr=16_000)
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
print(predicted_sentences)
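The argmax-then-decode step in Approach 2 amounts to greedy CTC decoding: consecutive repeated IDs are merged and blank tokens are dropped. The sketch below shows that collapsing rule on plain integer IDs. The function name ctc_greedy_collapse, the blank ID of 0, and the toy ID sequence are assumptions for illustration; the real processor also maps IDs back to characters using the model's own vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse a per-frame ID sequence the way greedy CTC decoding does.

    1. Merge runs of identical consecutive IDs into one.
    2. Drop the blank ID, which only separates genuine repeats.
    """
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Toy per-frame predictions (hypothetical): the blank between the two runs
# of 1 keeps them as two separate output tokens.
print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
```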

Intended uses & limitations

The model was trained on only about 4 hours of recordings.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

Training results

| Training Loss | Epoch | Step | Validation Loss | WER    |
|---------------|-------|------|-----------------|--------|
| 5.0795        | 5.47  | 400  | 4.4121          | 1.0    |
| 3.5658        | 10.95 | 800  | 3.5203          | 1.0    |
| 3.3689        | 16.43 | 1200 | 2.8984          | 0.9996 |
| 2.01          | 21.91 | 1600 | 1.0041          | 0.7288 |
| 1.6783        | 27.39 | 2000 | 0.6941          | 0.5989 |
| 1.527         | 32.87 | 2400 | 0.5599          | 0.5282 |
| 1.4278        | 38.35 | 2800 | 0.4827          | 0.4806 |
| 1.3458        | 43.83 | 3200 | 0.4429          | 0.4532 |
| 1.2893        | 49.31 | 3600 | 0.4156          | 0.4330 |
| 1.2441        | 54.79 | 4000 | 0.4020          | 0.4040 |
| 1.188         | 60.27 | 4400 | 0.3777          | 0.3866 |
| 1.1628        | 65.75 | 4800 | 0.3607          | 0.3858 |
| 1.1324        | 71.23 | 5200 | 0.3534          | 0.3604 |
| 1.0969        | 76.71 | 5600 | 0.3428          | 0.3624 |
| 1.0897        | 82.19 | 6000 | 0.3387          | 0.3567 |
| 1.0625        | 87.66 | 6400 | 0.3339          | 0.3499 |
| 1.0601        | 93.15 | 6800 | 0.3288          | 0.3446 |
| 1.0474        | 98.62 | 7200 | 0.3281          | 0.3462 |
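The last column of the table above is the word error rate: the word-level edit distance between reference and prediction, divided by the number of reference words. The sketch below follows that standard definition with whitespace tokenization. It is not necessarily how the training script or ./eval.py computes the metric, and the function name word_error_rate and the example strings are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Under this definition a value of 1.0, as in the first rows of the table, means the number of word errors equals the number of reference words.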

Framework versions