Wav2Vec2-Large-XLSR-53-Vietnamese

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Vietnamese using the Common Voice, Infore_25h dataset (Password: BroughtToYouByInfoRe)

When using this model, make sure that your speech input is sampled at 16kHz.

Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "vi", split="test[:2%]") 
processor = Wav2Vec2Processor.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese") 
model = Wav2Vec2ForCTC.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the aduio files as arrays
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])

Evaluation

The model can be evaluated as follows on the Vietnamese test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "vi", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese") 
model = Wav2Vec2ForCTC.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese") 
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]' 
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the aduio files as arrays
def speech_file_to_array_fn(batch):
	batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Preprocessing the datasets.
# We need to read the aduio files as arrays
def evaluate(batch):
	inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

	with torch.no_grad():
		logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

	pred_ids = torch.argmax(logits, dim=-1)
	batch["pred_strings"] = processor.batch_decode(pred_ids)
	return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 58.63 %

Training

The Common Voice train, validation, and Infore_25h datasets were used for training

The script used for training can be found here

=======================To here===============================>

Your model in then available under huggingface.co/CuongLD/wav2vec2-large-xlsr-vietnamese for everybody to use 🎉.

How to evaluate my trained checkpoint

Having uploaded your model, you should now evaluate your model in a final step. This should be as simple as copying the evaluation code of your model card into a python script and running it. Make sure to note the final result on the model card both under the YAML tags at the very top and below your evaluation code under "Test Results".

Rules of training and evaluation

In this section, we will quickly go over what data is allowed to be used as training data, what kind of data preprocessing is allowed be used, and how the model should be evaluated.

To make it very simple regarding the first point: All data except the official common voice test data set can be used as training data. For models trained in a language that is not included in Common Voice, the author of the model is responsible to leave a reasonable amount of data for evaluation.

Second, the rules regarding the preprocessing are not that as straight-forward. It is allowed (and recommended) to normalize the data to only have lower-case characters. It is also allowed (and recommended) to remove typographical symbols and punctuation marks. A list of such symbols can e.g. be fonud here - however here we already must be careful. We should not remove a symbol that would change the meaning of the words, e.g. in English, we should not remove the single quotation mark ' since it would change the meaning of the word "it's" to "its" which would then be incorrect. So the golden rule here is to not remove any characters that could change the meaning of a word into another word. This is not always obvious and should be given some consideration. As another example, it is fine to remove the "Hypen-minus" sign "-" since it doesn't change the meaninng of a word to another one. E.g. "fine-tuning" would be changed to "finetuning" which has still the same meaning.

Since those choices are not always obvious when in doubt feel free to ask on Slack or even better post on the forum, as was done, e.g. here.

Tips and tricks

This section summarizes a couple of tips and tricks across various topics. It will continously be updated during the week.

How to combine multiple datasets into one

Check out this post.

How to effectively preprocess the data

How to do efficiently load datasets with limited ram and hard drive space

Check out this post.

How to do hyperparameter tuning

How to preprocess and evaluate character based languages

FAQ

Can a participant fine-tune models for more than one language? Yes! A participant can fine-tune models in as many languages she/he likes
Can a participant use extra data (apart from the common voice data)? Yes! All data except the official common voice test data can be used for training. If a participant wants to train a model on a language that is not part of Common Voice (which is very much encouraged!), the participant should make sure that some test data is held out to make sure the model is not overfitting.
Can we fine-tune for high-resource languages? Yes! While we do not really recommend people to fine-tune models in English since there are already so many fine-tuned speech recognition models in English. However, it is very much appreciated if participants want to fine-tune models in other "high-resource" languages, such as French, Spanish, or German. For such cases, one probably needs to train locally and apply might have to apply tricks such as lazy data loading (check the "Lazy data loading" section for more details).