transcribe whisper

Fine-tuning Whisper-small for Korean speech recognition on sample data (PoC)

Fine-tuning was performed using sample voices recorded from the scripts in this CSV file (https://github.com/hyeonsangjeon/job-transcribe/blob/main/meta_voice_data_3922.csv). The sample voices themselves are not published, so if you want to fine-tune from scratch, please record your own audio or use a public dataset.
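For reference, here is a minimal sketch of pairing the metadata with your own recordings. The local CSV path, the column names (id, text), and the one-WAV-per-row naming scheme are assumptions for illustration, not the repository's documented schema.

import pandas as pd

# Load the transcript metadata. The column names used below ('id', 'text')
# are hypothetical -- inspect the CSV and adjust before relying on them.
meta = pd.read_csv("meta_voice_data_3922.csv")

# Pair each transcript with a locally recorded WAV file (illustrative layout).
records = [
    {"audio": f"data/{row['id']}.wav", "sentence": row["text"]}
    for _, row in meta.iterrows()
]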

Training follows the fine-tuning guide at https://huggingface.co/blog/fine-tune-whisper.

[Note] In the voice recordings used for training, the speaker spoke clearly and slowly, as if reading a textbook.

Training

Base model

OpenAI's whisper-small (https://huggingface.co/openai/whisper-small)

Parameters

We used heuristically chosen parameters without separate hyperparameter tuning. The sampling rate is set to 16,000 Hz.
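For reference, a sketch of the training arguments in the style of the Hugging Face guide linked above. The values shown are that guide's illustrative defaults, not the exact parameters used for this model, and argument names may vary with your transformers version.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ko",   # hypothetical output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)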

Usage

You need the librosa package to load the WAV file and resample it to 16 kHz; the WhisperProcessor then converts the waveform into a log-Mel spectrogram. (pip install librosa)

inference.py

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# prepare your sample data (.wav)
file = "nlp-voice-3922/data/0002d3428f0ddfa5a48eec5cc351daa8.wav"

# Load the waveform and resample to 16 kHz
arr, sampling_rate = librosa.load(file, sr=16000)

# Load whisper model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("daekeun-ml/whisper-small-ko-finetuned-single-speaker-3922samples")

# Convert the waveform to log-Mel spectrogram input features
input_features = processor(arr, return_tensors="pt", sampling_rate=sampling_rate).input_features

# Force the decoder to transcribe in Korean, then generate and decode
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ko", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

# batch_decode returns a list of strings (one per input)
print(transcription[0])
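Alternatively, the high-level pipeline API can handle loading and resampling internally. A minimal sketch; passing language/task through generate_kwargs this way requires a reasonably recent transformers version.

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="daekeun-ml/whisper-small-ko-finetuned-single-speaker-3922samples",
)

# language/task support in generate_kwargs depends on the transformers version
result = pipe(
    "nlp-voice-3922/data/0002d3428f0ddfa5a48eec5cc351daa8.wav",
    generate_kwargs={"language": "korean", "task": "transcribe"},
)
print(result["text"])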