automatic-speech-recognition hf-asr-leaderboard openslr_SLR53 robust-speech-event

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the OPENSLR_SLR53 Bengali dataset. It achieves the following results on the evaluation set.

Without a language model:

With a 5-gram language model trained on the indic-text dataset:

Note: 10% of the 218,703 total samples were held out for evaluation, so the evaluation set has 21,871 examples. Training was stopped after 30k steps. Output predictions are available under the files section.
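The split sizes quoted in the note can be checked with a few lines of arithmetic. This is only a sketch: rounding the evaluation size up is an assumption that happens to reproduce the stated 21871 examples, and the constant names are illustrative rather than taken from the training script.

```python
import math

TOTAL_SAMPLES = 218_703   # total sample count stated in the note
EVAL_FRACTION = 0.10      # 10% held out for evaluation

# Assumption: the evaluation size is rounded up to the next whole sample.
eval_size = math.ceil(TOTAL_SAMPLES * EVAL_FRACTION)
train_size = TOTAL_SAMPLES - eval_size

print(f"eval examples:  {eval_size}")
print(f"train examples: {train_size}")
```

Under that rounding assumption, the remaining samples form the training pool before any duration filtering is applied.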

Training hyperparameters

The following hyperparameters were used during training:

Framework versions

Note: The training and evaluation scripts were adapted from existing scripts. Bengali speech data was not available in the Common Voice or Multilingual LibriSpeech datasets, so OpenSLR53 was used instead.

Note 2: A minimum audio duration of 0.1 s was used to filter the training data, which excluded roughly 10-20 samples.
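A duration filter of this kind might look like the sketch below. It is not the actual training script: the sample layout (an `array` of raw audio values plus a `sampling_rate`), the helper names, the 16 kHz toy clips, and the choice to keep clips exactly at the threshold are all assumptions made for illustration.

```python
MIN_DURATION_S = 0.1  # minimum duration from the note; inclusive bound is an assumption


def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Clip length in seconds, given the number of raw audio samples."""
    return num_samples / sampling_rate


def keep_sample(sample: dict) -> bool:
    """Return True if the clip is at least MIN_DURATION_S long."""
    length = duration_seconds(len(sample["array"]), sample["sampling_rate"])
    return length >= MIN_DURATION_S


# Hypothetical toy clips at 16 kHz (silence, only the lengths matter here).
clips = [
    {"id": "a", "array": [0.0] * 16000, "sampling_rate": 16000},  # 1.0 s
    {"id": "b", "array": [0.0] * 800, "sampling_rate": 16000},    # 0.05 s
    {"id": "c", "array": [0.0] * 1600, "sampling_rate": 16000},   # 0.1 s
]

kept_ids = [clip["id"] for clip in clips if keep_sample(clip)]
print(kept_ids)
```

With a Hugging Face `datasets.Dataset`, the same predicate could be passed to `Dataset.filter`, provided the audio column is unpacked to match the layout assumed here.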


@misc{tahsin_mayeesha_2023,
  author    = {{Tahsin Mayeesha}},
  title     = {wav2vec2-bn-300m (Revision e10defc)},
  year      = 2023,
  url       = {},
  doi       = {10.57967/hf/0939},
  publisher = {Hugging Face}
}