wav2vec2-base-sk-17k

This is a monolingual Slovak Wav2Vec 2.0 base model pre-trained on 17 thousand hours of Slovak speech.

This model does not have a tokenizer as it was pretrained on audio alone. In order to use this model for speech recognition, a tokenizer should be created, and the model should be fine-tuned on labeled data.
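As a rough illustration of the tokenizer step, the sketch below builds a character-level vocabulary from transcripts and writes it to a JSON file. The example transcripts, the file name `vocab.json`, and the special tokens `[UNK]`, `[PAD]`, and `|` are assumptions chosen for illustration. They do not describe the setup used in the paper.

```python
import json

def build_char_vocab(transcripts):
    """Map every character seen in the transcripts to an integer id."""
    chars = sorted(set("".join(transcripts).lower()))
    # By convention, the space is replaced by a visible word delimiter.
    chars = ["|" if c == " " else c for c in chars]
    vocab = {c: i for i, c in enumerate(chars)}
    # Special tokens (assumed names): unknown character and CTC blank/padding.
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

# Hypothetical Slovak transcripts standing in for real labeled data.
transcripts = ["dobrý deň", "ako sa máte"]
vocab = build_char_vocab(transcripts)

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)
```

A file like this can then be passed to `Wav2Vec2CTCTokenizer` from `transformers`, with matching `unk_token`, `pad_token`, and `word_delimiter_token` arguments, before fine-tuning the model with a CTC head.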

The model was initialized from the Czech pre-trained model fav-kky/wav2vec2-base-cs-80k-ClTRUS. We found that this cross-language transfer learning approach works better than pre-training from scratch. See our paper for details.

Pretraining data

Almost 18 thousand hours of unlabeled Slovak speech:

unlabeled data from the VoxPopuli dataset (12.2k hours),
recordings from TV shows (4.5k hours),
oral history archives (800 hours),
CommonVoice 13.0 (24 hours).

Usage

Inputs must be 16 kHz mono audio files.
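One way to verify this requirement for WAV files is to read the header with Python's standard `wave` module. The following is a minimal sketch. The helper name `is_16khz_mono` and the file name `example_16k_mono.wav` are hypothetical, and the check only covers uncompressed WAV files.

```python
import wave

def is_16khz_mono(path):
    """Return True if the WAV file at `path` is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16_000 and wav.getnchannels() == 1

# Write one second of 16-bit silence to a hypothetical file, then check it.
with wave.open("example_16k_mono.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16_000)
    wav.writeframes(b"\x00\x00" * 16_000)

print(is_16khz_mono("example_16k_mono.wav"))  # True
```

Files that fail the check need to be resampled and downmixed first, for example with `torchaudio.functional.resample` and a mean over the channel dimension.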

For example, the model can be used to extract per-frame contextual embeddings from audio:

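import torch
import torchaudio
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-sk-17k")
model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-sk-17k")

# torchaudio returns a (channels, samples) tensor and the file's sampling rate.
speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")

# Pass the single channel as a 1-D array; the result has shape (1, samples).
# Passing the file's real sampling rate makes the extractor raise an error
# if the audio is not 16 kHz.
inputs = feature_extractor(
    speech_array[0].numpy(),
    sampling_rate=sampling_rate,
    return_tensors="pt",
)["input_values"]

# Inference only, so gradients are not needed.
with torch.no_grad():
    output = model(inputs)

# Per-frame embeddings of shape (frames, hidden_size).
embeddings = output.last_hidden_state[0].numpy()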
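To make "per-frame" concrete: the convolutional feature encoder downsamples the waveform by a total stride of 320 samples, so the model emits roughly one embedding per 20 ms of 16 kHz audio. The sketch below computes the expected number of frames for a given input length. The kernel sizes and strides are those of the standard Wav2Vec 2.0 base configuration and are assumed here. The model's `config.json` holds the authoritative values.

```python
# Kernel sizes and strides of the standard Wav2Vec 2.0 base feature encoder
# (assumed here; check the model's config.json for the actual values).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_frames(num_samples):
    """Number of output frames for an input of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length

print(num_frames(16_000))  # 49 frames for one second of 16 kHz audio
```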

Speech recognition results

After fine-tuning, the model achieved the following word error rates (WER) on public datasets:

Slovak portion of CommonVoice v13.0: WER = 8.82%
Slovak portion of VoxPopuli: WER = 8.88%

See our paper for details.
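For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. The sketch below is a plain illustration of that definition, not the evaluation script used in the paper. The function name and example sentences are hypothetical, and no text normalization is applied.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("deň" -> "den") and one deletion ("sa") over 5 words.
print(word_error_rate("dobrý deň ako sa máte", "dobrý den ako máte"))  # 0.4
```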

Paper

The preprint of our paper (accepted to TSD 2023) is available at https://arxiv.org/abs/2306.04399.

Citation

If you find this model useful, please cite our paper:

@inproceedings{wav2vec2-base-sk-17k,
  title = {{Transfer Learning of Transformer-based Speech Recognition Models from Czech to Slovak}},
  author = {
    Jan Lehe\v{c}ka and 
    Josef V. Psutka and 
    Josef Psutka
  },
  booktitle = {{Text, Speech, and Dialogue}},
  publisher = {{Springer International Publishing}},
  year = {2023},
  note = {(in press)},
  url = {https://arxiv.org/abs/2306.04399},
}

Related models

fav-kky/wav2vec2-base-cs-80k-ClTRUS
