How to use
See an example of the inference pipeline for Russian TTS (G2P + FastPitch + HifiGAN) in this notebook, or use this bash script.
Input
This model accepts batches of mel spectrograms.
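Spectrograms from different utterances usually have different numbers of frames, so they need to be padded to a common length before they can be stacked into a batch. The following is a minimal sketch in plain Python, not the pipeline's actual collate code. The `(batch, n_mels, frames)` layout, the padding value `PAD_VALUE`, and the helper name `pad_mel_batch` are assumptions made for illustration; check the linked notebook for the layout and padding the model really expects.

```python
PAD_VALUE = 0.0  # assumption: the real pipeline may pad with a different value


def pad_mel_batch(mels, pad_value=PAD_VALUE):
    """Pad spectrograms to a common frame count.

    mels: list of spectrograms, each a list of n_mels rows,
          where every row holds one value per frame.
    Returns (batch, lengths), with batch laid out as
    (batch, n_mels, frames) and lengths holding the original frame counts.
    """
    n_mels = len(mels[0])
    if any(len(mel) != n_mels for mel in mels):
        raise ValueError("all spectrograms must have the same number of mel bins")

    lengths = [len(mel[0]) for mel in mels]
    max_len = max(lengths)

    # Extend every row on the right so all spectrograms share max_len frames.
    batch = [
        [row + [pad_value] * (max_len - len(row)) for row in mel]
        for mel in mels
    ]
    return batch, lengths


# Toy input: two spectrograms with 2 mel bins and 3 and 5 frames.
short = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
long = [[1.0, 1.1, 1.2, 1.3, 1.4], [2.0, 2.1, 2.2, 2.3, 2.4]]
batch, lengths = pad_mel_batch([short, long])
```

Keeping `lengths` alongside the padded batch makes it possible to trim the padded region from each generated waveform afterwards.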
Output
This model outputs audio waveforms at a sampling rate of 22050 Hz.
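When saving the generated audio, the file header must declare the same 22050 Hz rate, otherwise playback will be too fast or too slow. As a hedged sketch using only the Python standard library, the snippet below writes mono float samples as 16-bit PCM WAV. The helper `write_wav` is hypothetical, and the assumption that samples are floats in [-1.0, 1.0] should be checked against the actual model output; the sine tone is only a stand-in for real vocoder output.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # output rate stated in this model card


def write_wav(samples, target, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV.

    target may be a file path or a seekable binary file object.
    """
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)  # must match the model's output rate
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for vocoder output: one second of a 440 Hz tone.
tone = [
    0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE)
]
buffer = io.BytesIO()
write_wav(tone, buffer)
```

Passing a file path instead of the in-memory `buffer` writes the audio to disk.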
Training
The model was trained for several epochs with the NeMo toolkit [1]. The full training script is here.
Datasets
This model was trained on the RUSLAN corpus [2] (single speaker, male voice), sampled at 22050 Hz.
References
- [1] NVIDIA NeMo Toolkit
- [2] Gabdrakhmanov L., Garaev R., Razinkov E. (2019) RUSLAN: Russian Spoken Language Corpus for Speech Synthesis. In: Salah A., Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham