Using this open-source pipeline in production?
Make the most of it thanks to our consulting services.

🎹 Speaker diarization 3.0

This pipeline has been trained by Séverin Baroudi with pyannote.audio 3.0.0 using a combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.

It ingests mono audio sampled at 16kHz and outputs speaker diarization as a pyannote.core Annotation instance.

Requirements

  1. Install pyannote.audio 3.0 with pip install pyannote.audio
  2. Accept pyannote/segmentation-3.0 user conditions
  3. Accept pyannote/speaker-diarization-3.0 user conditions
  4. Create an access token at hf.co/settings/tokens
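
To keep the access token out of source code, it can be read from the environment instead of being pasted into the script. The following is a minimal sketch, not part of the official instructions: the helper names read_hf_token and load_pipeline and the default variable name HF_TOKEN are assumptions made for illustration.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in an environment variable.

    The default variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to your Hugging Face access token.")
    return token


def load_pipeline():
    """Instantiate the pipeline with a token taken from the environment."""
    # Imported lazily so read_hf_token() works without pyannote.audio installed.
    from pyannote.audio import Pipeline

    return Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.0",
        use_auth_token=read_hf_token(),
    )
```

Calling load_pipeline() fails early with an explicit message when the variable is unset, rather than failing later during the model download.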

Usage

# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.0",
  use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
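
The RTTM file written above is plain text with one speaker turn per line, so it can be read back without pyannote. The sketch below assumes the usual SPEAKER line layout (file identifier, channel, start time, duration, then the speaker label in the eighth field) and file identifiers without spaces. The two example lines are made up for illustration and are not actual pipeline output.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    start: float    # seconds
    end: float      # seconds
    speaker: str


def parse_rttm(text: str) -> List[Turn]:
    """Extract speaker turns from the content of an RTTM file."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        turns.append(Turn(start, start + duration, fields[7]))
    return turns


# Illustrative content only (not real output).
example = (
    "SPEAKER audio 1 0.500 2.000 <NA> <NA> SPEAKER_00 <NA> <NA>\n"
    "SPEAKER audio 1 2.250 1.250 <NA> <NA> SPEAKER_01 <NA> <NA>\n"
)

for turn in parse_rttm(example):
    print(f"{turn.start:.2f}s - {turn.end:.2f}s  {turn.speaker}")
```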

Processing on GPU

pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

import torch
pipeline.to(torch.device("cuda"))

Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 1.5 minutes to process a one-hour conversation.
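
The arithmetic behind that estimate is a single multiplication: processing time equals audio duration times real-time factor. A small sketch follows. The function name is hypothetical, and the 2.5% figure only applies to the hardware quoted above.

```python
def processing_seconds(audio_seconds, real_time_factor=0.025):
    """Estimate processing time from audio duration and real-time factor."""
    return audio_seconds * real_time_factor


one_hour = 60 * 60  # seconds
estimate = processing_seconds(one_hour)
print(f"about {estimate / 60:.1f} minutes")
```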

Processing from memory

Pre-loading audio files in memory may result in faster processing:

import torchaudio

# load the whole file in memory, then pass the waveform instead of a path
waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

Monitoring progress

Hooks are available to monitor the progress of the pipeline:

from pyannote.audio.pipelines.utils.hook import ProgressHook
with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)

Controlling the number of speakers

In case the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)

Benchmark

This pipeline has been benchmarked on a large collection of datasets.

Processing is fully automatic, and results are reported with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

Benchmark                             DER%    FA%  Miss%  Conf%  Expected output  File-level evaluation
AISHELL-4                             12.3    3.8    4.4    4.1  RTTM             eval
AliMeeting (channel 1)                24.3    4.4   10.0    9.9  RTTM             eval
AMI (headset mix, only_words)         19.0    3.6    9.5    5.9  RTTM             eval
AMI (array1, channel 1, only_words)   22.2    3.8   11.2    7.3  RTTM             eval
AVA-AVD                               49.1   10.8   15.7   22.5  RTTM             eval
DIHARD 3 (Full)                       21.7    6.2    8.1    7.3  RTTM             eval
MSDWild                               24.6    5.8    8.0   10.7  RTTM             eval
REPERE (phase 2)                       7.8    1.8    2.6    3.5  RTTM             eval
VoxConverse (v0.3)                    11.3    4.1    3.4    3.8  RTTM             eval
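
The three error columns are the components of DER: false alarm (FA), missed detection (Miss), and speaker confusion (Conf). Each DER value should therefore equal their sum, up to the rounding of the published figures. The sketch below checks this on the rows of the table. The function name and the 0.15 tolerance are choices made for illustration.

```python
# (benchmark, DER%, FA%, Miss%, Conf%) copied from the table above.
ROWS = [
    ("AISHELL-4", 12.3, 3.8, 4.4, 4.1),
    ("AliMeeting (channel 1)", 24.3, 4.4, 10.0, 9.9),
    ("AMI (headset mix, only_words)", 19.0, 3.6, 9.5, 5.9),
    ("AMI (array1, channel 1, only_words)", 22.2, 3.8, 11.2, 7.3),
    ("AVA-AVD", 49.1, 10.8, 15.7, 22.5),
    ("DIHARD 3 (Full)", 21.7, 6.2, 8.1, 7.3),
    ("MSDWild", 24.6, 5.8, 8.0, 10.7),
    ("REPERE (phase 2)", 7.8, 1.8, 2.6, 3.5),
    ("VoxConverse (v0.3)", 11.3, 4.1, 3.4, 3.8),
]


def der_from_components(false_alarm, missed_detection, confusion):
    """Diarization error rate as the sum of its three components (in %)."""
    return false_alarm + missed_detection + confusion


for name, der, fa, miss, conf in ROWS:
    total = der_from_components(fa, miss, conf)
    # Published values are rounded to one decimal, so allow a small gap.
    assert abs(total - der) < 0.15, name
    print(f"{name}: reported {der:.1f}, components sum to {total:.1f}")
```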

Citations

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}