Model Card for NoRefER

Referenceless Error Metric for Automatic Speech Recognition
via Contrastive Fine-Tuning of mMiniLMv2 without References

Model Details

How to use

import re
from transformers import AutoTokenizer, AutoModel

def preprocess(text: str) -> str:
    # Lowercase, strip bracketed annotations (e.g. "[noise]"), and remove punctuation.
    text = text.lower()
    text = re.sub(r'[\(\[].*?[\)\]]', '', text)  # drop (...) and [...] spans
    text = re.sub(r'[^\w\s]', '', text)          # drop remaining punctuation
    return text

tokenizer = AutoTokenizer.from_pretrained("aixplain/NoRefER")
model = AutoModel.from_pretrained("aixplain/NoRefER", trust_remote_code=True)

# preprocess
texts = [
    "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.",
    "In Italy, pizzas serves in formal settings, such as at an restaurant, is presented unslicing."
]
preprocessed_texts = [preprocess(text) for text in texts]
# tokenize
tokens = tokenizer(preprocessed_texts, padding=True, return_tensors="pt")
# score
scores = model.score(**tokens)
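
Assuming `model.score` returns one quality score per input hypothesis, with higher scores indicating better-quality transcriptions, the scores can be used to pick the best hypothesis among alternatives. A minimal sketch with placeholder scores (the `select_best` helper and the score values below are illustrative, not part of the model's API):

```python
# Hypothetical helper: select the hypothesis with the highest NoRefER score.
# The scores here are placeholders, not actual model outputs.
def select_best(hypotheses, scores):
    """Return the hypothesis whose referenceless quality score is highest."""
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return hypotheses[best_idx]

hypotheses = [
    "in italy pizza served in formal settings such as at a restaurant is presented unsliced",
    "in italy pizzas serves in formal settings such as at an restaurant is presented unslicing",
]
placeholder_scores = [0.82, 0.31]  # assumed: higher = better quality
best = select_best(hypotheses, placeholder_scores)
```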

Model Description

This work presents a novel multi-language referenceless quality metric for automatic speech recognition (ASR). The metric is based on a language model (LM) fine-tuned with contrastive learning, without any references indicating quality. Instead, the known quality ordering between increasing compression levels of the same ASR model is used for self-supervision. All unique pair combinations are extracted from the outputs of ASR models at multiple compression levels to compile a dataset for model training and validation. The LM is part of a siamese network architecture (with shared weights) that makes pairwise ranking decisions on ASR output quality.

The referenceless metric achieves 77% validation accuracy on this pairwise ranking task and generalizes to quality comparisons between different ASR models. On a blind test dataset consisting of the outputs of top commercial ASR engines, the referenceless metric correlates highly with their word-error-rate (WER) ranks across samples, and it can outperform the best engine's WER by +7% by selecting among alternative hypotheses. Compared against the perplexity metric from various state-of-the-art pre-trained LMs, the referenceless metric obtains superior performance in all experiments.

The referenceless metric allows comparing the performance of different ASR models on a speech dataset that lacks ground-truth references. It also enables building an ensemble of ASR models that can outperform any individual model in the ensemble. Finally, it can be used to prioritize hypotheses for referencing (via post-editing) or human evaluation within the ASR model improvement lifecycle in production, and for A/B testing different versions of an ASR model (such as previous and current) on a referenceless production data stream.
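
The self-supervision step described above — extracting all unique pair combinations from the outputs of one ASR model at increasing compression levels, where lower compression is assumed to mean higher quality — can be sketched as follows. This is an illustrative reconstruction, not the authors' released training code; `compile_pairs` is a hypothetical helper name:

```python
from itertools import combinations

def compile_pairs(outputs_by_level):
    """Build (better, worse) training pairs for one utterance.

    outputs_by_level: hypotheses for the same utterance, ordered from the
    least-compressed to the most-compressed ASR model. The known quality
    ordering (quality assumed to decrease with compression) provides the
    pairwise labels, so no reference transcript is needed.
    """
    pairs = []
    for i, j in combinations(range(len(outputs_by_level)), 2):
        # i < j, so outputs_by_level[i] comes from a less-compressed
        # (assumed better) model than outputs_by_level[j].
        pairs.append((outputs_by_level[i], outputs_by_level[j]))
    return pairs
```

During training, each (better, worse) pair is fed through the siamese LM with shared weights, and a pairwise ranking objective pushes the "better" hypothesis to receive the higher score.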