automatic-speech-recognition speech wav2vec2.0 audio

The Mandarin-wav2vec2.0 model is pre-trained on 1000 hours of the AISHELL-2 dataset. The pre-training details can be found at (link not preserved). This model is fine-tuned on 178 hours of the AISHELL-1 dataset and is the baseline model in the paper "A context-aware knowledge transferring strategy for CTC-based ASR" (preprint).

Results on AISHELL-1

Model              CER (%) dev   CER (%) test
vanilla w2v2-CTC   4.85          5.13
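
The character error rate (CER) reported above is conventionally the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The exact scoring script behind these numbers is not given here, so the following is only a minimal sketch of the standard definition. The function names `edit_distance` and `cer` are illustrative and do not come from ESPnet or transformers.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return 100.0 * total_edits / total_chars


# Toy example (made-up hypothesis): one substituted character out of twelve.
refs = ["广州市房地产中介协会分析"]
hyps = ["广州市房地产中介协会分系"]
print(f"CER: {cer(refs, hyps):.2f}%")
```

Summing edits and reference lengths over the whole set before dividing gives a corpus-level figure, which is less sensitive to very short utterances than averaging per-utterance rates.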


Note: the model was fine-tuned with the ESPnet toolkit and then converted to a Hugging Face model for ease of use.

import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor


class ExtendedWav2Vec2ForCTC(Wav2Vec2ForCTC):
    """
    In ESPnet there is a LayerNorm layer between the encoder output and the CTC classification head.
    """

    def __init__(self, config):
        super().__init__(config)
        # Wrap the original linear CTC head so that LayerNorm is applied first.
        self.lm_head = torch.nn.Sequential(
            torch.nn.LayerNorm(config.hidden_size),
            self.lm_head,
        )


model = ExtendedWav2Vec2ForCTC.from_pretrained("kehanlu/mandarin-wav2vec2-aishell1")
processor = Wav2Vec2Processor.from_pretrained("kehanlu/mandarin-wav2vec2-aishell1")
model.eval()

# Read the waveform and its sample rate from disk.
audio_input, sample_rate = sf.read("/path/to/data_aishell/wav/dev/S0724/BAC009S0724W0121.wav")
inputs = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
    # Greedy decoding: pick the most likely token at every frame.
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)

print(transcription[0])
# Expected output: 广州市房地产中介协会分析
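
The `torch.argmax` followed by `processor.batch_decode` step above is greedy CTC decoding: take the best token per frame, merge consecutive repeats, then drop the blank token. The sketch below reproduces that collapse rule in plain Python on made-up frame predictions. The `blank_id=0` value and the three-entry `toy_vocab` are assumptions for illustration only; the real blank index and vocabulary come from the model's tokenizer.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then remove blanks (greedy CTC rule)."""
    collapsed = []
    prev = None
    for token_id in frame_ids:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != prev and token_id != blank_id:
            collapsed.append(token_id)
        prev = token_id
    return collapsed


# Hypothetical vocabulary and per-frame argmax ids, not taken from the real model.
toy_vocab = {1: "广", 2: "州", 3: "市"}
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3]

token_ids = ctc_greedy_collapse(frame_ids)
text = "".join(toy_vocab[i] for i in token_ids)
print(text)
```

A blank between two identical tokens keeps them separate, which is how CTC can emit the same character twice in a row.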


The pre-training corpus, AISHELL-2, is supported by the AISHELL Foundation. The resulting model also follows the AISHELL-2 license: it is free to use for academic purposes and must not be used for any commercial purpose without permission from the AISHELL Foundation.

@ARTICLE{aishell2,
   author = {{Du}, J. and {Na}, X. and {Liu}, X. and {Bu}, H.},
   title = "{AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale}",
   journal = {ArXiv},
   eprint = {1808.10583},
   primaryClass = "cs.CL",
   year = 2018,
   month = Aug,
}

If you find this useful, please cite:

@article{lu2022context,
  title={A context-aware knowledge transferring strategy for CTC-based ASR},
  author={Lu, Ke-Han and Chen, Kuan-Yu},
  journal={arXiv preprint arXiv:2210.06244},
  year={2022}
}