Pomak Slavic

wav2vec2-xls-r-slavic-pomak

To train a Pomak ASR model, we fine-tuned a Slavic model (classla/wav2vec2-large-slavic-parlaspeech-hr) on 11h of recorded Pomak speech.

Recordings

Four native Pomak speakers (2 female and 2 male) agreed to read Pomak texts at the ILSP audio-visual studio in Xanthi, Greece, resulting in a total of 14h 49m of recordings.

Speaker Gender Total recorded hours
NK9dIF F 4h 44m 45s
xoVY9q M 4h 36m 12s
9G75fk F 1h 44m 03s
n5WzHj M 3h 44m 04s
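As a sanity check, the per-speaker durations in the table above can be summed directly:

```python
# Per-speaker recording durations from the table above, as (hours, minutes, seconds).
durations = {
    "NK9dIF": (4, 44, 45),
    "xoVY9q": (4, 36, 12),
    "9G75fk": (1, 44, 3),
    "n5WzHj": (3, 44, 4),
}

# Convert everything to seconds, sum, and convert back.
total_seconds = sum(h * 3600 + m * 60 + s for h, m, s in durations.values())
hours, rem = divmod(total_seconds, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{hours}h {minutes}m {seconds}s")  # → 14h 49m 4s
```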

To fine-tune the model, we split the long recordings into segments of at most 25 seconds each. This removed most of the pauses and reduced the total dataset duration to 11h 8m.
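The card does not describe the segmentation tooling, but the grouping step can be sketched as a greedy pass over utterance timestamps, cutting whenever adding the next utterance would push a segment past 25 seconds (function and variable names here are illustrative, not from the card):

```python
MAX_SEGMENT_SEC = 25.0

def split_into_segments(utterances, max_len=MAX_SEGMENT_SEC):
    """Greedily group consecutive (start, end) utterance timestamps into
    segments spanning at most max_len seconds; the pauses that fall
    between two segments are dropped from the dataset.

    Note: a single utterance longer than max_len is kept whole here;
    real tooling would have to split it further.
    """
    segments, current = [], []
    for start, end in utterances:
        # Cut before this utterance if it would overrun the segment.
        if current and end - current[0][0] > max_len:
            segments.append((current[0][0], current[-1][1]))
            current = []
        current.append((start, end))
    if current:
        segments.append((current[0][0], current[-1][1]))
    return segments

# Example: three utterances with pauses between them.
print(split_into_segments([(0, 10), (12, 20), (30, 40)]))
# → [(0, 20), (30, 40)]
```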

Metrics

The test set consists of 10% of the dataset recordings.

Model WER CER
pre-trained 87.31% 31.47%
fine-tuned 9.06% 3.12%
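Both metrics are Levenshtein edit distances, computed over words for WER and over characters for CER. Packages such as jiwer provide them ready-made; a self-contained sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, single-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, and (mis)match via the diagonal.
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[len(hyp)]

def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference word count."""
    words = reference.split()
    return edit_distance(words, hypothesis.split()) / len(words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("the cat sat", "the bat sat"))  # → one of three words wrong
```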

Training hyperparameters

To fine-tune the wav2vec2-large-slavic-parlaspeech-hr model, we used the following hyperparameters:

arg value
per_device_train_batch_size 8
gradient_accumulation_steps 2
num_train_epochs 35
learning_rate 3e-4
warmup_steps 500
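The argument names follow the Hugging Face TrainingArguments convention; collected as a plain dict (any value not in the table above is an assumption), they also make the effective batch size explicit:

```python
# Hyperparameters from the table above (TrainingArguments-style names).
training_args = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 35,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}

# Effective batch size per optimizer step on a single device:
# gradients are accumulated over 2 forward passes of 8 samples each.
effective_batch_size = (
    training_args["per_device_train_batch_size"]
    * training_args["gradient_accumulation_steps"]
)
print(effective_batch_size)  # → 16
```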