

Work-in-Progress: This model is not yet ready to be used.

Streamable Voice Activity Detection with a resource-efficient CRDNN model trained on LibriParty

This repository provides all the necessary tools to perform real-time voice activity detection with SpeechBrain using a model pre-trained on LibriParty.

Unlike the offline recipe, this model can run on a real-time stream from a microphone.

The system expects input recordings sampled at 16kHz (single channel). If your signal has a different sample rate, resample it (e.g., using torchaudio or sox) before using the interface.
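For example, a minimal resampling sketch with torchaudio (the file names are placeholders):

```python
import torchaudio
import torchaudio.functional as F

# Load a recording (placeholder path) and bring it to 16 kHz mono.
signal, sr = torchaudio.load("my_recording.wav")
if sr != 16000:
    signal = F.resample(signal, orig_freq=sr, new_freq=16000)
if signal.shape[0] > 1:
    # Downmix to a single channel if the file is stereo.
    signal = signal.mean(dim=0, keepdim=True)
torchaudio.save("my_recording_16k.wav", signal, 16000)
```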

For a better experience, we encourage you to learn more about SpeechBrain.

Results

The model performance on the LibriParty test set is:

| Release | Hyperparams file | Test Precision | Test Recall | Test F-Score | Model link | GPUs |
|:-------:|:----------------:|:--------------:|:-----------:|:------------:|:----------:|:----:|
| 2021-09-09 | streamable.yaml | 0.9417 | 0.9007 | 0.9208 | Model | NVIDIA RTX 3090 |

Environment setup

To set up the environment, run:

```
pip install speechbrain
```

```
git clone https://github.com/speechbrain/speechbrain/
cd speechbrain/recipes/LibriParty/streamable_VAD/
pip install -r extra-dependencies.txt
```

Running realtime inference

Note: As of now, PyTorch's StreamReader only supports Apple devices for microphone capture, and so does our script. We will add support for more devices in the future. To run real-time inference, you can download and adapt the inference script.
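For reference, this is roughly how a microphone stream can be opened with torchaudio's StreamReader on macOS (avfoundation). The device index ":0" and the chunk size are assumptions for illustration; the actual logic lives in inference.py:

```python
from torchaudio.io import StreamReader

# Open the macOS microphone via avfoundation; ":0" means "audio device 0"
# (replace 0 with the ID reported by ffmpeg, see below).
reader = StreamReader(src=":0", format="avfoundation")

# Ask for 16 kHz chunks of 1600 samples (100 ms) to feed the VAD.
reader.add_basic_audio_stream(frames_per_chunk=1600, sample_rate=16000)

for (chunk,) in reader.stream():
    # chunk is a (frames, channels) float tensor; the VAD expects mono 16 kHz audio.
    mono = chunk.mean(dim=1)
    # ... run the CRDNN on `mono` and update the display ...
```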

To download the inference script, run:

```
git clone https://github.com/speechbrain/speechbrain/
```

The inference script is located in recipes/LibriParty/streamable_VAD/inference.py.

To run the script, you need the ID of your microphone. You can retrieve it on your system by following the steps below.

To retrieve the ID of your microphone, run:

```
ffmpeg -hide_banner -list_devices true -f avfoundation -i dummy
```

and copy the ID of the microphone. If you don't have ffmpeg installed, you can install it via conda with `conda install ffmpeg`, or by following the instructions on the ffmpeg website.

After retrieving your device ID, you can run the inference script with:

```
cd speechbrain/recipes/LibriParty/streamable_VAD/
python inference.py {MICROPHONE_ID}
```

This will open a window displaying the raw waveform on the top row and the speech presence probability on the bottom row. You can close the demo with CTRL+C. After execution, the script saves two images comparing the offline-processed waveform (offline_processing.png) with the real-time one (streaming.png).
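As a rough illustration of what the demo window looks like (using synthetic placeholder data instead of the live stream):

```python
import torch
import matplotlib.pyplot as plt

# Placeholder data: one second of noise and a made-up per-frame speech probability.
waveform = torch.randn(16000)
speech_prob = torch.sigmoid(torch.linspace(-4, 4, 100))

fig, (ax_wav, ax_prob) = plt.subplots(2, 1)
ax_wav.plot(waveform.numpy())
ax_wav.set_title("Raw waveform")
ax_prob.plot(speech_prob.numpy())
ax_prob.set_ylim(0, 1)
ax_prob.set_title("Speech presence probability")
plt.tight_layout()
plt.savefig("streaming.png")
```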


Pipeline description

This system is composed of a CRDNN that outputs posterior probabilities, with values close to one for speech frames and close to zero for non-speech frames. A threshold is applied on top of the posteriors to detect candidate speech boundaries.

Depending on the active options, these boundaries can be post-processed (e.g., merging close segments, removing short segments, etc.) to further improve the performance. See more details below.
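As a rough sketch of this post-processing step (not the exact SpeechBrain implementation; the function name, frame shift, and thresholds are illustrative assumptions):

```python
import torch

def posteriors_to_segments(probs, threshold=0.5, frame_shift=0.01,
                           close_th=0.25, len_th=0.25):
    """Turn frame-level speech posteriors into (start, end) segments in seconds.

    threshold: posterior value above which a frame counts as speech.
    close_th:  merge segments separated by less than this many seconds.
    len_th:    drop segments shorter than this many seconds.
    """
    speech = probs > threshold

    # Collect raw speech segments from the thresholded frames.
    segments, start = [], None
    for i, is_speech in enumerate(speech.tolist()):
        if is_speech and start is None:
            start = i * frame_shift
        elif not is_speech and start is not None:
            segments.append([start, i * frame_shift])
            start = None
    if start is not None:
        segments.append([start, len(speech) * frame_shift])

    # Merge segments that are closer than close_th seconds.
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < close_th:
            merged[-1][1] = seg[1]
        else:
            merged.append(seg)

    # Remove segments shorter than len_th seconds.
    return [(s, e) for s, e in merged if e - s >= len_th]

# Example: a fake posterior curve with two close speech bursts -> one merged segment.
probs = torch.cat([torch.zeros(50), torch.ones(40), torch.zeros(10), torch.ones(40)])
print(posteriors_to_segments(probs))  # [(0.5, 1.4)]
```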

We encourage you to read our tutorials and learn more about SpeechBrain.

Reproducing the training

Training heavily relies on data augmentation. Make sure you have downloaded all the needed datasets (LibriParty, MUSAN, and CommonLanguage) and, after cloning the speechbrain repo, run:

```
cd speechbrain/recipes/LibriParty/VAD
python train.py hparams/streamable.yaml --data_folder=/localscratch/dataset/ --musan_folder=/localscratch/musan/ --commonlanguage_folder=/localscratch/common_voice_kpd
```

Remember to change the paths to your local ones.

Limitations

The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

Citing SpeechBrain

Please cite SpeechBrain if you use it for your research or business.

```bibtex
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```