audio automatic-speech-recognition speech xlsr-fine-tuning-week

Wav2Vec2-Large-XLSR-53-Spanish-With-LM

This is a model copy of Wav2Vec2-Large-XLSR-53-Spanish that has language model support.

This model card can be seen as a demo for the pyctcdecode integration with Transformers led by this PR. The PR explains in-detail how the integration works.

In a nutshell: This PR adds a new Wav2Vec2WithLMProcessor class as drop-in replacement for Wav2Vec2Processor.

The only change from the existing ASR pipeline will be:

import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torchaudio.functional as F


model_id = "patrickvonplaten/wav2vec2-xlsr-53-es-kenlm"

sample = next(iter(load_dataset("common_voice", "es", split="test", streaming=True)))
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

input_values = processor(resampled_audio, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

-prediction_ids = torch.argmax(logits, dim=-1)
-transcription = processor.batch_decode(prediction_ids)
+transcription = processor.batch_decode(logits.numpy()).text
# => 'bien y qué regalo vas a abrir primero'

Improvement

This model has been compared on 512 speech samples from the Spanish Common Voice Test set and gives a nice 20 % performance boost:

The results can be reproduced by running from this model repository:

Model WER CER
patrickvonplaten/wav2vec2-xlsr-53-es-kenlm 8.44% 2.93%
jonatasgrosman/wav2vec2-large-xlsr-53-spanish 10.20% 3.24%
bash run_ngram_wav2vec2.py 1 512
bash run_ngram_wav2vec2.py 0 512

with run_ngram_wav2vec2.py being https://huggingface.co/patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm/blob/main/run_ngram_wav2vec2.py