faster-whisper fine-tuned model for PL phonetic transcription
This model is the result of fine-tuning the openai/whisper-medium model on a custom Polish (PL) dataset and then converting it to the faster-whisper format.
The training dataset also included 5 English speakers and 4 Japanese speakers, for whom Polish phonetic transcriptions were created manually.
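The conversion step can be reproduced with CTranslate2's Transformers converter. The sketch below is only an illustration, not the exact command used for this model: the fine-tuned checkpoint directory and output directory names are hypothetical, and the quantization setting is an assumption.

from ctranslate2.converters import TransformersConverter

# Hypothetical path to the fine-tuned openai/whisper-medium checkpoint (Transformers format)
finetuned_checkpoint = "./whisper-medium-pl-finetuned"

converter = TransformersConverter(
    finetuned_checkpoint,
    # Tokenizer/preprocessor files that faster-whisper expects next to the converted model
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
# Write the CTranslate2 model directory that faster-whisper loads
converter.convert("./shmisper-medium-PL", quantization="float16")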
About the model:
- I created this because the original Whisper model does not transcribe precisely; for example, disfluencies such as stuttering or repetitions are normalized away.
- This model generates more faithful transcriptions, so it is better suited for automatically creating unsupervised datasets for Text-To-Speech model training (see the dataset-building sketch after the example below).
- I noticed it also normalizes numbers into word form, so no digits are generated in the transcript.
- English audio is transcribed into a Polish phonetic transcription instead of being kept in its original English form or translated into Polish, as the original Whisper model does (however, due to the small amount of such data in training, this is far from perfect).
Example:
from faster_whisper import WhisperModel
import huggingface_hub

# Download the model files from the Hugging Face Hub and load them on the GPU
model_path = huggingface_hub.snapshot_download("shmart/shmisper-medium-PL")
model = WhisperModel(model_path, device="cuda", compute_type="float16")

# Decoding options: force Polish, disable token suppression, and use a strict
# no-speech threshold so segments detected as likely silence are dropped
options = {
    'language': "pl",
    'beam_size': 5,
    'without_timestamps': True,
    'suppress_tokens': [],
    'log_prob_threshold': None,
    'no_speech_threshold': 0.05
}

input_wav_path = './audio.wav'
# transcribe() returns a generator of segments and a TranscriptionInfo object
result, info = model.transcribe(input_wav_path, **options)
text = ' '.join([r.text for r in result])
print(text)
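For the dataset-creation use case mentioned above, here is a minimal sketch of batch transcription. It assumes a hypothetical folder ./tts_corpus of utterance-level wav files and writes a simple "filename|transcript" manifest; the folder layout and manifest format are only illustrative assumptions, not part of this model.

from pathlib import Path

from faster_whisper import WhisperModel
import huggingface_hub

model_path = huggingface_hub.snapshot_download("shmart/shmisper-medium-PL")
model = WhisperModel(model_path, device="cuda", compute_type="float16")

wav_dir = Path('./tts_corpus')            # hypothetical folder with utterance-level wav files
manifest_path = wav_dir / 'metadata.csv'  # hypothetical "filename|transcript" manifest

with open(manifest_path, 'w', encoding='utf-8') as manifest:
    for wav_path in sorted(wav_dir.glob('*.wav')):
        # Same decoding options as in the example above
        segments, info = model.transcribe(
            str(wav_path),
            language="pl",
            beam_size=5,
            without_timestamps=True,
            suppress_tokens=[],
            log_prob_threshold=None,
            no_speech_threshold=0.05,
        )
        text = ' '.join(segment.text.strip() for segment in segments)
        manifest.write(f"{wav_path.name}|{text}\n")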