TTS text-to-speech

V2.5 Model : Fine-tune of my V2 model on the whole Common Voice dataset (517k samples) for 2.5k steps (batch size 200). Voice cloning has improved a bit but is still not great. However, if you fine-tune this model on your own dataset of a specific personality, you can get pretty good results. A good V3 model would be a fine-tune of around 50k steps on this dataset; I think that could give good results, but I won't try it.
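To put the step counts in perspective, here is a rough sketch of how many passes over the dataset they correspond to. It assumes that one step consumes exactly one batch of 200 samples (no gradient accumulation), that the 517k figure is exact, and that a V3 run would keep the same batch size; none of this is confirmed by the training setup, and the helper name `epochs_seen` is made up for illustration.

```python
def epochs_seen(steps: int, batch_size: int, dataset_size: int) -> float:
    """Number of passes over the dataset, assuming one batch per step."""
    return steps * batch_size / dataset_size


DATASET_SIZE = 517_000  # Common Voice samples quoted for V2.5 (treated as exact)
BATCH_SIZE = 200        # batch size quoted for V2.5; assumed unchanged for V3

v25_epochs = epochs_seen(2_500, BATCH_SIZE, DATASET_SIZE)
v3_epochs = epochs_seen(50_000, BATCH_SIZE, DATASET_SIZE)

print(f"V2.5 (2.5k steps): {v25_epochs:.2f} epochs")
print(f"Hypothetical V3 (50k steps): {v3_epochs:.2f} epochs")
```

Under these assumptions V2.5 saw slightly less than one pass over the data, while a 50k-step run would correspond to roughly 19 passes.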

V2 Model :

Tortoise base model fine-tuned on a custom multispeaker French dataset of 120k samples (SIWIS + Common Voice subset + M-AILABS) for 10k steps on an RTX 3090 (about 21 hours of training), with Text LR Weight set to 1.

Result : The model speaks French much better, without an English accent, but voice cloning hardly works.

V1 Model :

Tortoise base model fine-tuned on a custom multispeaker French dataset of 24k samples (SIWIS + Common Voice subset) for 8850 steps on an RTX 3090 (about 19 hours of training).
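For anyone budgeting a similar run, the step counts and durations quoted for V1 and V2 can be turned into an approximate time per step. This is only a back-of-the-envelope sketch: the durations are approximate, the batch sizes of these two runs are not stated, and `seconds_per_step` is a hypothetical helper.

```python
def seconds_per_step(hours: float, steps: int) -> float:
    """Average wall-clock seconds per training step."""
    return hours * 3600 / steps


# Figures quoted for the two runs on an RTX 3090 (durations are approximate).
v1_step_time = seconds_per_step(hours=19, steps=8_850)
v2_step_time = seconds_per_step(hours=21, steps=10_000)

print(f"V1: {v1_step_time:.2f} s/step")
print(f"V2: {v2_step_time:.2f} s/step")
```

Both runs come out at roughly 7.5 to 7.7 seconds per step, so on comparable hardware and settings 1k steps would take a bit over two hours.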

Inference :

Fine tuning :