Pomak Slavic

wav2vec2-xls-r-slavic-pomak

To train a Pomak ASR model, we fine-tuned a Slavic model (classla/wav2vec2-large-slavic-parlaspeech-hr) on 11h of recorded Pomak speech.

Recordings

Four native Pomak speakers (2 female and 2 male) agreed to read Pomak texts at the ILSP audio-visual studio in Xanthi, Greece, resulting in a total of 14h 49m of recordings.

Speaker Gender Total recorded hours
NK9dIF F 4h 44m 45s
xoVY9q M 4h 36m 12s
9G75fk F 1h 44m 03s
n5WzHj M 3h 44m 04s
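As a sanity check, the per-speaker durations in the table above can be summed directly:

```python
# Per-speaker recording durations from the table above, as (hours, minutes, seconds).
durations = {
    "NK9dIF": (4, 44, 45),
    "xoVY9q": (4, 36, 12),
    "9G75fk": (1, 44, 3),
    "n5WzHj": (3, 44, 4),
}

# Convert everything to seconds, sum, and convert back.
total_seconds = sum(h * 3600 + m * 60 + s for h, m, s in durations.values())
hours, rem = divmod(total_seconds, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{hours}h {minutes}m {seconds}s")  # → 14h 49m 4s
```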

To fine-tune the model, we split the long recordings into segments of at most 25 seconds each. This removed most of the pauses and reduced the total dataset duration to 11h 8m.
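The card does not describe the segmentation tooling, but the grouping step can be sketched as a greedy pass over utterance timestamps, cutting whenever adding the next utterance would push a segment past 25 seconds (function and variable names here are illustrative, not from the card):

```python
MAX_SEGMENT_SEC = 25.0

def split_into_segments(utterances, max_len=MAX_SEGMENT_SEC):
    """Greedily group consecutive (start, end) utterance timestamps into
    segments spanning at most max_len seconds; the pauses that fall
    between two segments are dropped from the dataset.

    Note: a single utterance longer than max_len is kept whole here;
    real tooling would have to split it further.
    """
    segments, current = [], []
    for start, end in utterances:
        # Cut before this utterance if it would overrun the segment.
        if current and end - current[0][0] > max_len:
            segments.append((current[0][0], current[-1][1]))
            current = []
        current.append((start, end))
    if current:
        segments.append((current[0][0], current[-1][1]))
    return segments

# Example: three utterances with pauses between them.
print(split_into_segments([(0, 10), (12, 20), (30, 40)]))
# → [(0, 20), (30, 40)]
```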

Metrics

The test set consists of 10% of the dataset recordings.

Model WER CER
pre-trained 87.31% 31.47%
fine-tuned 9.06% 3.12%
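Both metrics are Levenshtein edit distances, computed over words for WER and over characters for CER. Packages such as jiwer provide them ready-made; a self-contained sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, single-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, and (mis)match via the diagonal.
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[len(hyp)]

def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference word count."""
    words = reference.split()
    return edit_distance(words, hypothesis.split()) / len(words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("the cat sat", "the bat sat"))  # → one of three words wrong
```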

Training hyperparameters

To fine-tune the wav2vec2-large-slavic-parlaspeech-hr model, we used the following hyperparameters:

arg value
per_device_train_batch_size 8
gradient_accumulation_steps 2
num_train_epochs 35
learning_rate 3e-4
warmup_steps 500
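The argument names follow the Hugging Face TrainingArguments convention; collected as a plain dict (any value not in the table above is an assumption), they also make the effective batch size explicit:

```python
# Hyperparameters from the table above (TrainingArguments-style names).
training_args = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 35,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}

# Effective batch size per optimizer step on a single device:
# gradients are accumulated over 2 forward passes of 8 samples each.
effective_batch_size = (
    training_args["per_device_train_batch_size"]
    * training_args["gradient_accumulation_steps"]
)
print(effective_batch_size)  # → 16
```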