SpeechT5 - Russian translit

This model is a fine-tuned version of microsoft/speecht5_tts on the Common Voice 13 dataset. It achieves the following results on the evaluation set:

Loss: 0.4853

Model description

Input should be Russian text transliterated into Latin characters (use the transliterate package). This is just a test for the hands-on exercise of the HF Audio Course. It is not intended for actual use!

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 8
eval_batch_size: 2
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 400
training_steps: 2000
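
As a rough check on how these values fit together, the sketch below recomputes the total train batch size and the learning rate at a given step. It assumes a single training device (which is what 8 x 8 = 64 implies) and a linear warmup from zero followed by a linear decay to zero at the last step, which is how the linear scheduler type is commonly implemented. The function names are illustrative, not part of any library.

```python
# Values copied from the hyperparameter list above.
LEARNING_RATE = 1e-5
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 8
WARMUP_STEPS = 400
TRAINING_STEPS = 2000


def effective_batch_size(per_device=TRAIN_BATCH_SIZE,
                         accum_steps=GRAD_ACCUM_STEPS,
                         num_devices=1):
    """Examples consumed per optimizer step (num_devices=1 is an assumption)."""
    return per_device * accum_steps * num_devices


def linear_lr(step):
    """Learning rate at an optimizer step under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return LEARNING_RATE * step / max(1, WARMUP_STEPS)
    # Decay: ramp linearly from the peak down to 0 at the final step.
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / max(1, TRAINING_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    for step in (0, 200, 400, 1200, 2000):
        print(step, linear_lr(step))
```

Under these assumptions the learning rate peaks at step 400 and is already at half its peak by step 1200, so the later rows of the training log are produced with a very small step size.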

Training results

Training Loss	Epoch	Step	Validation Loss
1.0359	0.6	50	0.8176
0.8866	1.19	100	0.6899
0.787	1.79	150	0.6478
0.7477	2.38	200	0.6233
0.6734	2.98	250	0.5630
0.6216	3.58	300	0.5429
0.593	4.17	350	0.5304
0.5817	4.77	400	0.5282
0.5734	5.37	450	0.5167
0.5688	5.96	500	0.5209
0.5662	6.56	550	0.5095
0.5609	7.15	600	0.5127
0.554	7.75	650	0.5041
0.5522	8.35	700	0.5038
0.5372	8.94	750	0.4984
0.5432	9.54	800	0.4995
0.5384	10.13	850	0.4971
0.5345	10.73	900	0.4981
0.5358	11.33	950	0.4942
0.5332	11.92	1000	0.4906
0.5334	12.52	1050	0.4897
0.5301	13.11	1100	0.4914
0.5298	13.71	1150	0.4894
0.524	14.31	1200	0.4871
0.5221	14.9	1250	0.4884
0.525	15.5	1300	0.4883
0.5232	16.1	1350	0.4866
0.5261	16.69	1400	0.4858
0.521	17.29	1450	0.4852
0.5225	17.88	1500	0.4849
0.5219	18.48	1550	0.4860
0.5207	19.08	1600	0.4839
0.5192	19.67	1650	0.4851
0.516	20.27	1700	0.4860
0.5186	20.86	1750	0.4811
0.5233	21.46	1800	0.4841
0.5145	22.06	1850	0.4819
0.5159	22.65	1900	0.4822
0.5146	23.25	1950	0.4831
0.5175	23.85	2000	0.4853
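
The reported evaluation loss (0.4853) is the value at the final step, not the lowest value in the table. The sketch below shows one way to locate the best evaluation step from the log. Only the rows from step 1500 onward are copied in, since every earlier row has a higher validation loss. Whether a checkpoint was actually saved at that step is not stated in the card.

```python
# (step, validation loss) pairs copied from the last rows of the table above.
eval_log = [
    (1500, 0.4849),
    (1550, 0.4860),
    (1600, 0.4839),
    (1650, 0.4851),
    (1700, 0.4860),
    (1750, 0.4811),
    (1800, 0.4841),
    (1850, 0.4819),
    (1900, 0.4822),
    (1950, 0.4831),
    (2000, 0.4853),
]

# Pick the row with the smallest validation loss.
best_step, best_loss = min(eval_log, key=lambda row: row[1])

# Compare against the final row, which is the value the card reports.
final_step, final_loss = eval_log[-1]
gap = final_loss - best_loss

if __name__ == "__main__":
    print(f"best:  step {best_step}, loss {best_loss:.4f}")
    print(f"final: step {final_step}, loss {final_loss:.4f}")
    print(f"gap:   {gap:.4f}")
```

The gap between the best and the final value is small compared with the step-to-step variation in the last rows, so the validation loss had effectively plateaued by the end of training.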

Framework versions

Transformers 4.31.0
PyTorch 2.0.1+cu118
Datasets 2.14.4
Tokenizers 0.13.3