Work-in-Progress: This model is not yet ready to be used.
Streamable Voice Activity Detection with a resource-efficient CRDNN model trained on LibriParty
This repository provides all the necessary tools to perform real-time voice activity detection with SpeechBrain, using a model pre-trained on LibriParty.
Unlike the offline recipe, this model can run on a real-time stream from a microphone.
The system expects input recordings sampled at 16 kHz (single channel). If your signal has a different sample rate, resample it (e.g., using torchaudio or sox) before using the interface, as in the sketch below.
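For example, a minimal resampling sketch with torchaudio (the file names below are placeholders for your own recordings) could look like this:

```python
# A minimal resampling sketch using torchaudio; "input.wav" and
# "input_16k.wav" are placeholder paths.
import torchaudio

waveform, sample_rate = torchaudio.load("input.wav")
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)
# Keep only the first channel if the recording is multi-channel.
waveform = waveform[:1, :]
torchaudio.save("input_16k.wav", waveform, 16000)
```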
For a better experience, we encourage you to learn more about SpeechBrain.
Results
The model performance on the LibriParty test set is:
Release | hyperparams file | Test Precision | Test Recall | Test F-Score | Model link | GPUs |
---|---|---|---|---|---|---|
2021-09-09 | streamable.yaml | 0.9417 | 0.9007 | 0.9208 | Model | NVIDIA RTX 3090 |
Environment setup
To set up the environment, run:

```bash
pip install speechbrain
git clone https://github.com/speechbrain/speechbrain/
cd speechbrain/recipes/LibriParty/streamable_VAD/
pip install -r extra-dependencies.txt
```
Running real-time inference
Note: as of now, torchaudio's StreamReader supports microphone capture only on Apple devices, and so does our script. We plan to add support for more devices in the future. To run real-time inference, you can download and adapt the inference script.
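For reference, here is a minimal sketch of microphone capture with torchaudio's StreamReader, the mechanism our script relies on; the device index `:0` is a placeholder, and the actual inference script may differ:

```python
# A minimal microphone-capture sketch with torchaudio's StreamReader on macOS
# (avfoundation backend). The device index ":0" is a placeholder; use the ID
# reported by ffmpeg (see below).
from torchaudio.io import StreamReader

streamer = StreamReader(src=":0", format="avfoundation")
# Request 16 kHz audio in chunks of 1600 frames (0.1 s).
streamer.add_basic_audio_stream(frames_per_chunk=1600, sample_rate=16000)

for (chunk,) in streamer.stream():
    # chunk has shape (frames, channels); feed it to the VAD model here.
    print(chunk.shape)
```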
To download the inference script, run:

```bash
git clone https://github.com/speechbrain/speechbrain/
```

The inference script is located in `recipes/LibriParty/streamable_VAD/inference.py`.
To run the script, you must pass the ID of your microphone, which you can retrieve on your system as follows.
To retrieve the ID of your microphone, run:

```bash
ffmpeg -hide_banner -list_devices true -f avfoundation -i dummy
```

and copy the ID of the microphone. If you don't have ffmpeg installed, you can install it via conda (`conda install ffmpeg`) or by following the instructions on the ffmpeg website.
After retrieving your device ID, you can run the inference script with:

```bash
cd speechbrain/recipes/LibriParty/streamable_VAD/
python inference.py {MICROPHONE_ID}
```
This will open a window displaying the raw waveform in the top row and the speech presence probability in the bottom row. You can close the demo with CTRL+C. After execution, the script saves two images containing the waveform processed offline (offline_processing.png) and in real time (streaming.png) for comparison.
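For comparison with the offline pipeline, an offline CRDNN VAD can be run via SpeechBrain's pretrained VAD interface. A minimal sketch follows; note that it uses the offline vad-crdnn-libriparty model id as an example, since this streamable model is not yet released:

```python
# A sketch of offline VAD with SpeechBrain's pretrained VAD interface.
# The source id refers to the offline model (this streamable model is still
# work-in-progress); "example.wav" is a placeholder 16 kHz mono file.
from speechbrain.pretrained import VAD

vad = VAD.from_hparams(
    source="speechbrain/vad-crdnn-libriparty",
    savedir="pretrained_models/vad-crdnn-libriparty",
)
boundaries = vad.get_speech_segments("example.wav")
vad.save_boundaries(boundaries)  # prints the detected speech segments
```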
Pipeline description
This system is composed of a CRDNN that outputs posterior probabilities, with values close to one for speech frames and close to zero for non-speech segments. A threshold is applied on top of the posteriors to detect candidate speech boundaries.
Depending on the active options, these boundaries can be post-processed (e.g., merging close segments, removing short segments) to further improve performance; a sketch of this logic follows below.
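As an illustration, here is a minimal pure-Python sketch of the thresholding and post-processing steps; the threshold and frame counts are hypothetical values, not the recipe's defaults:

```python
# A minimal sketch of posterior thresholding and boundary post-processing.
# The threshold and frame counts are hypothetical, not the recipe's defaults.
import torch

def posteriors_to_boundaries(posteriors, threshold=0.5, close_th=10, len_th=10):
    """Turn frame-level speech posteriors into [start, end) frame boundaries."""
    is_speech = (posteriors > threshold).tolist()
    boundaries, start = [], None
    for i, active in enumerate(is_speech):
        if active and start is None:
            start = i
        elif not active and start is not None:
            boundaries.append([start, i])
            start = None
    if start is not None:
        boundaries.append([start, len(is_speech)])

    # Merge segments separated by a gap shorter than close_th frames.
    merged = []
    for seg in boundaries:
        if merged and seg[0] - merged[-1][1] < close_th:
            merged[-1][1] = seg[1]
        else:
            merged.append(seg)

    # Remove segments shorter than len_th frames.
    return [seg for seg in merged if seg[1] - seg[0] >= len_th]

# Example with random posteriors for 100 frames.
print(posteriors_to_boundaries(torch.rand(100)))
```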
We also encourage you to read our tutorials and learn more about SpeechBrain.
Reproducing the training
Training heavily relies on data augmentation. Make sure you have downloaded all the datasets needed:
- LibriParty: https://www.dropbox.com/s/ns63xdwmo1agj3r/LibriParty.tar.gz?dl=1
- Musan: https://www.openslr.org/resources/17/musan.tar.gz
- CommonLanguage: https://zenodo.org/record/5036977/files/CommonLanguage.tar.gz?download=1
and, after cloning the speechbrain repo, run:

```bash
cd speechbrain/recipes/LibriParty/VAD
python train.py hparams/streamable.yaml --data_folder=/localscratch/dataset/ --musan_folder=/localscratch/musan/ --commonlanguage_folder=/localscratch/common_voice_kpd
```
Remember to replace the paths with your local ones.
Limitations
The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
Citing SpeechBrain
Please cite SpeechBrain if you use it for your research or business.
```bibtex
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```