Speech quantization

Highlights

This model serves as a speech codec, quantizing English utterances into discrete tokens and reconstructing waveforms from them.

FunCodec model

This model is trained with FunCodec, an open-source toolkit for speech quantization (codec) from the DAMO Academy, Alibaba Group. This repository provides a model pre-trained on the LibriTTS corpus. It can be applied to low-bandwidth speech communication, speech quantization, zero-shot speech synthesis and other academic research topics. Compared with EnCodec and SoundStream, several improved training techniques are employed, resulting in higher codec quality and ViSQOL scores at the same bandwidth.

Model description

This model is a variational autoencoder that uses residual vector quantization (RVQ) to obtain several parallel sequences of discrete latent representations. An overview of the FunCodec architecture is shown below. <p align="center"> <img src="fig/framework.png" alt="FunCodec architecture"/> </p>

In general, FunCodec models consist of five modules: a domain transformation module, an encoder, an RVQ module, a decoder and a domain inversion module.
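To make the RVQ module concrete, below is a minimal sketch of residual vector quantization in plain NumPy. It is illustrative only: the number of quantizers, codebook size, dimensionality, and random codebooks are assumptions chosen for the example, not FunCodec's actual configuration.

```python
# Minimal RVQ sketch (illustrative; sizes and codebooks are arbitrary,
# not FunCodec's actual configuration).
import numpy as np

rng = np.random.default_rng(0)
num_quantizers, codebook_size, dim = 4, 256, 128
codebooks = rng.normal(size=(num_quantizers, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Quantize a vector into one codebook index per stage.

    Each stage quantizes the residual left over from the previous stages.
    """
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # Pick the codeword nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from each stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print("reconstruction error:", np.linalg.norm(x - x_hat))
```

Because each stage encodes only the residual left by the previous ones, dropping trailing stages still yields a (coarser) reconstruction; this is what lets a single RVQ codec operate at several bandwidths.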

More details can be found in the FunCodec paper (https://arxiv.org/abs/2309.07405) and the FunCodec repository (https://github.com/alibaba-damo-academy/FunCodec).

Intended uses & scenarios

Inference with FunCodec

You can extract codecs and reconstruct them back to waveforms with the FunCodec repository.

FunCodec installation

# Install PyTorch with GPU support (version >= 1.12.0):
conda install pytorch==1.12.0
# For other versions, please refer to: https://pytorch.org/get-started/locally

# Download codebase:
git clone https://github.com/alibaba-damo-academy/FunCodec.git

# Install FunCodec codebase:
cd FunCodec
pip install --editable ./
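After installation, a quick sanity check confirms that PyTorch and the toolkit import correctly. Note that the import name funcodec is an assumption based on the repository layout; verify it against the repository's setup.py if the import fails.

```python
# Sanity check after "pip install --editable ./".
# "funcodec" as the import name is an assumption; check setup.py if it fails.
import torch
import funcodec

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```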

Codec extraction

# Enter the example directory 
cd egs/LibriTTS/codec
# Specify the model name
model_name="audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch"
# Download the model
git lfs install
git clone https://huggingface.co/alibaba-damo/${model_name}
mkdir -p exp
mv ${model_name} exp/${model_name}
# Extract codecs for the utterances listed in "input_wav.scp"; the codecs are saved under "outputs/codecs"
bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp input_wav.scp --out_dir outputs/codecs
# input_wav.scp has the following format:
# uttid1 path/to/file1.wav
# uttid2 path/to/file2.wav
# ...
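If you need to generate input_wav.scp from a directory of audio files, a small helper such as the following produces the documented format; the directory name my_wavs is a placeholder.

```python
# Build input_wav.scp in the documented "uttid path/to/file.wav" format.
# "my_wavs" is a placeholder; point it at your own audio directory.
from pathlib import Path

wav_dir = Path("my_wavs")
with open("input_wav.scp", "w") as scp:
    for wav in sorted(wav_dir.glob("*.wav")):
        # Use the file stem as the utterance id.
        scp.write(f"{wav.stem} {wav.resolve()}\n")
```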

Reconstruct waveforms from codecs

# Reconstruct waveforms into "outputs/recon_wavs"
bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp outputs/codecs/codecs.txt --out_dir outputs/recon_wavs 
# codecs.txt is the output of stage 1, which has the following format:
# uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
# uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
# ...
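For downstream use of the tokens (e.g., as targets for zero-shot speech synthesis), the stage-1 output can be parsed line by line. This sketch assumes each payload is a valid Python literal, matching the format shown above.

```python
# Sketch: load the discrete token sequences written by stage 1.
# Assumes each line is "uttid <nested list literal>", per the format above.
import ast

codes = {}
with open("outputs/codecs/codecs.txt") as f:
    for line in f:
        uttid, payload = line.strip().split(maxsplit=1)
        codes[uttid] = ast.literal_eval(payload)

for uttid, groups in codes.items():
    # groups[0] is the list of parallel quantizer streams for this utterance.
    print(uttid, "quantizer streams:", len(groups[0]))
```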

Inference with Hugging Face Transformers

Inference with the Hugging Face Transformers package is under development.

Application scenarios

Running environment

Intended usage scenarios

Evaluation results

Training configuration

Experimental results

Test set: LibriTTS test-clean; metric: ViSQOL score at different token rates.

| Test set | 50 tokens/s | 100 tokens/s | 200 tokens/s | 400 tokens/s |
|:---------|:-----------:|:------------:|:------------:|:------------:|
| LibriTTS test-clean | 3.43 | 3.86 | 4.12 | 4.29 |

Limitations and bias

BibTeX entry and citation info

@misc{du2023funcodec,
      title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
      author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
      year={2023},
      eprint={2309.07405},
      archivePrefix={arXiv},
      primaryClass={cs.Sound}
}