Whisper Tamil Medium

This model is a fine-tuned version of openai/whisper-medium on the Tamil data available from multiple publicly available ASR corpuses. It has been fine-tuned as a part of the Whisper fine-tuning sprint.

NOTE: The code used to train this model is available for re-use in the whisper-finetune repository.

Usage

In order to evaluate this model on an entire dataset, the evaluation codes available in the whisper-finetune repository can be used.

The same repository also provides the scripts for faster inference using whisper-jax.

In order to infer a single audio file using this model, the following code snippet can be used:

>>> import torch
>>> from transformers import pipeline

>>> # path to the audio file to be transcribed
>>> audio = "/path/to/audio.format"
>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> transcribe = pipeline(task="automatic-speech-recognition", model="vasista22/whisper-tamil-medium", chunk_length_s=30, device=device)
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="ta", task="transcribe")

>>> print('Transcription: ', transcribe(audio)["text"])

For faster inference of whisper models, the whisper-jax library can be used. Please follow the necessary installation steps as mentioned here, before using the following code snippet:

>>> import jax.numpy as jnp
>>> from whisper_jax import FlaxWhisperForConditionalGeneration, FlaxWhisperPipline

>>> # path to the audio file to be transcribed
>>> audio = "/path/to/audio.format"

>>> transcribe = FlaxWhisperPipline("vasista22/whisper-tamil-medium", batch_size=16)
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="ta", task="transcribe")

>>> print('Transcription: ', transcribe(audio)["text"])

Training and evaluation data

Training Data:

IISc-MILE Tamil ASR Corpus
ULCA ASR Corpus
Shrutilipi ASR Corpus
Microsoft Speech Corpus (Indian Languages)
Google/Fleurs Train+Dev set
Babel ASR Corpus

Evaluation Data:

Microsoft Speech Corpus (Indian Languages) Test Set
Google/Fleurs Test Set
IISc-MILE Test Set
Babel Test Set

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 24
eval_batch_size: 48
seed: 22
optimizer: adamw_bnb_8bit
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 17500
training_steps: 33892 (Initially set to 84730 steps)
mixed_precision_training: True

Acknowledgement

This work was done at Speech Lab, IIT Madras.

The compute resources for this work were funded by "Bhashini: National Language translation Mission" project of the Ministry of Electronics and Information Technology (MeitY), Government of India.

Whisper Tamil Medium

Usage

Training and evaluation data

Training hyperparameters

Acknowledgement

NSDT 3DConvert

UnrealSynth

DreamTexture.js