This is facebook/wav2vec2-large-960h-lv60-self enhanced with a Wikipedia language model.

The dataset used is wikipedia/20200501.en; all articles were included. The text was cleaned of references, external links, and any text inside parentheses. The resulting corpus contains 8,092,546 words.
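The parenthetical cleanup can be sketched with a simple regex pass. This is an illustrative approximation, not the exact script used to prepare this corpus (it handles only non-nested parentheses):

```python
import re

def strip_parentheticals(text: str) -> str:
    """Remove non-nested parenthesized spans, then collapse whitespace."""
    cleaned = re.sub(r"\([^()]*\)", "", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(strip_parentheticals("Paris (the capital of France) is large."))
# → Paris is large.
```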

The language model was built with KenLM. It is a 5-gram model in which all n-grams of order 3 and higher that occur only once (singletons) were pruned, as specified by the `--prune 0 0 1` flag. It was built as:

kenlm/build/bin/lmplz -o 5 -S 120G --vocab_estimate 8092546 --text text.txt --arpa text.arpa --prune 0 0 1

Suggested usage:

from transformers import pipeline

# Load the model together with its Wikipedia language model for decoding
pipe = pipeline("automatic-speech-recognition", model="gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm")

# Transcribe long audio in 30-second chunks with a 6-second left / 3-second right stride
output = pipe("/path/to/audio.wav", chunk_length_s=30, stride_length_s=(6, 3))
print(output)

Note that in the version of transformers current as of the release of this model, using striding in the pipeline chops off the last portion of the audio (here, the final 3 seconds, matching the right stride), so it is not transcribed. As a workaround, append 3 seconds of silence to the end of the audio. This problem has been fixed in the development version of transformers on GitHub.
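The silence-padding workaround can be sketched as follows. This is a minimal illustration using NumPy arrays; the function name and the 16 kHz sampling rate are assumptions for the example, and in practice you would load your actual audio (e.g. with soundfile or librosa) instead of the dummy signal:

```python
import numpy as np

def pad_with_silence(audio: np.ndarray, sampling_rate: int, seconds: float = 3.0) -> np.ndarray:
    """Append `seconds` of silence (zeros) so the striding pipeline
    does not drop the tail of the transcription."""
    silence = np.zeros(int(sampling_rate * seconds), dtype=audio.dtype)
    return np.concatenate([audio, silence])

# Example: a 10-second dummy signal at 16 kHz grows to 13 seconds
audio = np.random.randn(10 * 16000).astype(np.float32)
padded = pad_with_silence(audio, 16000)
print(len(padded) / 16000)  # → 13.0
```

The padded array can then be passed directly to the pipeline in place of the file path.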