automatic-speech-recognition generated_from_trainer hf-asr-leaderboard mozilla-foundation/common_voice_8_0 robust-speech-event xlsr-fine-tuning-week

Czech wav2vec2-xls-r-300m-cs-250

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the common_voice 8.0 dataset as well as other datasets listed below.

It achieves the following results on the evaluation set:

The eval.py script results using a LM are:

Model description

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Czech using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16kHz.

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "cs", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-250")
model = Wav2Vec2ForCTC.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-250")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the aduio files as arrays
def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])

Evaluation

The model can be evaluated using the attached eval.py script:

python eval.py --model_id comodoro/wav2vec2-xls-r-300m-cs-250 --dataset mozilla-foundation/common-voice_8_0 --split test --config cs

Training and evaluation data

The Common Voice 8.0 train and validation datasets were used for training, as well as the following datasets:

Training hyperparameters

The following hyperparameters were used during training:

Training results

Training Loss Epoch Step Validation Loss Wer Cer
3.4203 0.16 800 3.3148 1.0 1.0
2.8151 0.32 1600 0.8508 0.8938 0.2345
0.9411 0.48 2400 0.3335 0.3723 0.0847
0.7408 0.64 3200 0.2573 0.2840 0.0642
0.6516 0.8 4000 0.2365 0.2581 0.0595
0.6242 0.96 4800 0.2039 0.2433 0.0541
0.5754 1.12 5600 0.1832 0.2156 0.0482
0.5626 1.28 6400 0.1827 0.2091 0.0463
0.5342 1.44 7200 0.1744 0.2033 0.0468
0.4965 1.6 8000 0.1705 0.1963 0.0444
0.5047 1.76 8800 0.1604 0.1889 0.0422
0.4814 1.92 9600 0.1604 0.1827 0.0411
0.4471 2.09 10400 0.1566 0.1822 0.0406
0.4509 2.25 11200 0.1619 0.1853 0.0432
0.4415 2.41 12000 0.1513 0.1764 0.0397
0.4313 2.57 12800 0.1515 0.1739 0.0392
0.4163 2.73 13600 0.1445 0.1695 0.0377
0.4142 2.89 14400 0.1478 0.1699 0.0385
0.4184 3.05 15200 0.1430 0.1669 0.0376
0.3886 3.21 16000 0.1433 0.1644 0.0374
0.3795 3.37 16800 0.1426 0.1648 0.0373
0.3859 3.53 17600 0.1357 0.1604 0.0361
0.3762 3.69 18400 0.1344 0.1558 0.0349
0.384 3.85 19200 0.1379 0.1576 0.0359
0.3762 4.01 20000 0.1344 0.1539 0.0346
0.3559 4.17 20800 0.1339 0.1525 0.0351
0.3683 4.33 21600 0.1315 0.1518 0.0342
0.3572 4.49 22400 0.1307 0.1507 0.0342
0.3494 4.65 23200 0.1294 0.1491 0.0335
0.3476 4.81 24000 0.1287 0.1491 0.0336
0.3475 4.97 24800 0.1271 0.1475 0.0329

Framework versions