Implementation of "End-to-end speaker segmentation for overlap-aware resegmentation" with modifications for speaker change detection. Learn more in the presentation.
This code is based on pyannote/pyannote-audio. Some functions are identical to those in pyannote.audio, some are slightly modified, and some are heavily modified. Additionally, there is novel code that performs speaker change detection and connects everything together.
This code can prepare data, train models, and perform inference for two different tasks: speaker change detection and speaker segmentation. However, the outputs from both models/configurations can be processed into speaker change points.
Model Weights (including short_scd_bigdata.ckpt): Available from this Google Drive folder.
Training GIFs (more details in the presentation):
| Speaker Change Detection | Segmentation |
| --- | --- |
Speaker change detection identifies timestamps where the active speaker changes. If someone starts speaking, stops speaking, and starts speaking again (and no one else started speaking in the meantime), no speaker change occurs. If two people are speaking and one of them stops, or another person starts speaking, a speaker change occurs. See slide 6 of the presentation.
Segmentation splits a conversation into turns by identifying when people are speaking. This is not voice activity detection since, if multiple people are talking, the model outputs probabilities indicating multiple active speakers. It is not speaker diarization because speakers are not identified across the entire length of an audio file.
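To make the distinction concrete, here is a small, self-contained example (not code from this repo) of turning a frame-level speaker activity matrix, the kind of output a segmentation model produces after thresholding, into speaker change points under the definition above. The frame rate and array values are made up for illustration.

```python
# Illustrative sketch (not from this repo): deriving speaker change points
# from a thresholded frame-level activity matrix. Frame rate is assumed.
import numpy as np

frame_rate = 100  # frames per second (assumed for illustration)

# activity[t, s] = 1 if speaker s is active in frame t
activity = np.array([
    [1, 0],  # speaker 0 talking
    [1, 0],
    [1, 1],  # speaker 1 joins -> change
    [1, 0],  # speaker 1 stops -> change
    [0, 0],  # speaker 0 pauses -> no change (same speaker may resume)
    [1, 0],  # speaker 0 resumes alone -> no change
])

# A change occurs whenever the set of *active* speakers differs between
# two frames that both contain speech; silence in between is ignored.
change_times = []
prev_active = None
for t, frame in enumerate(activity):
    active = frozenset(np.flatnonzero(frame))
    if not active:
        continue  # skip silence: pausing and resuming is not a change
    if prev_active is not None and active != prev_active:
        change_times.append(t / frame_rate)
    prev_active = active

print(change_times)  # change points in seconds: [0.02, 0.03]
```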
The code is mostly organized according to PyTorch Lightning's structure. Package management is handled by Poetry.
The dataset used is the AMI Meeting Corpus. It was downloaded and repaired using the scripts available in the pyannote/AMI-diarization-setup GitHub repository.
- Clone the repo: `git clone --recurse-submodules https://github.com/HHousen/speaker-change-detection/ && cd speaker-change-detection`
- Install requirements and activate the environment: `poetry install`, then `poetry shell`
- Download the data: `cd AMI-diarization-setup/pyannote && sh download_ami.sh` (more details)
- Train a model: `python train.py`. Set `DO_SCD` in train.py to `True` to do speaker change detection or to `False` to do segmentation.
- Perform inference using process_file.py. Replace `short_scd_bigdata.ckpt` with the path to your model checkpoint and `test_audio_similar.wav` with the path to your audio file. Set `DO_SCD` to the same value used for training. (A sketch of this flow follows the list.)
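As a rough illustration of that last step, the snippet below loads a checkpoint and runs inference using the class names described in the file overview below. The `Inference` constructor and call signature are assumptions, so consult process_file.py for the real invocation.

```python
# Hedged sketch of the process_file.py flow; the Inference signature is an
# assumption -- see process_file.py for the actual call.
from model import SSCDModel
from inference import Inference

DO_SCD = True  # must match the value used when the checkpoint was trained

# Load a trained checkpoint (standard PyTorch Lightning API).
model = SSCDModel.load_from_checkpoint("short_scd_bigdata.ckpt")

# Slide the model over a complete audio file and collect its outputs.
inference = Inference(model)
output = inference("test_audio_similar.wav")
print(output)
```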
- `train.py`: Execute to train a model on the dataset. Loads the data using a `SegmentationAndSCDData` datamodule and instantiates a `SSCDModel` model. Logs to Weights & Biases and trains on a GPU using the PyTorch Lightning `Trainer`.
- `model.py`: Defines the `SSCDModel` model architecture, training loop, optimizers, loss function, etc. (A minimal sketch of the architecture follows this list.)
- `sincnet.py`: An implementation of the SincNet model, which is used in `SSCDModel`, from this GitHub repo.
- `data.py`: Defines the `SegmentationAndSCDData` datamodule, which processes the data into the format accepted by the model. Uses `pyannote.database` to load and do some initial processing of the data.
- `inference.py`: Contains functions necessary to perform inference on a complete audio file. Can be used easily on a file by running `process_file.py`.
- `process_file.py`: Processes an audio file end-to-end using the `Inference` object defined in `inference.py`.
- `process_file.ipynb`: Similar to `process_file.py`, but as a Jupyter notebook to take advantage of `pyannote.core`'s plotting functions.
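For orientation, here is a minimal sketch of the model family `SSCDModel` belongs to, per the Bredin & Laurent paper: a SincNet front-end over the raw waveform feeding a recurrent stack that emits per-frame, per-speaker probabilities. Layer sizes, the speaker count, and all Lightning details are assumptions for illustration, not the repo's actual implementation.

```python
# Minimal sketch of a SincNet -> LSTM segmentation model; hyperparameters
# and training details are assumptions, not the real SSCDModel.
import torch
import torch.nn as nn
import pytorch_lightning as pl


class SketchSSCDModel(pl.LightningModule):
    def __init__(self, sincnet: nn.Module, num_speakers: int = 3):
        super().__init__()
        self.sincnet = sincnet  # learnable band-pass filterbank on raw audio
        self.lstm = nn.LSTM(60, 128, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 128, num_speakers)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) -> features: (batch, channels, frames)
        features = self.sincnet(waveform)
        outputs, _ = self.lstm(features.transpose(1, 2))
        # Per-frame, per-speaker activity probabilities in [0, 1]
        return torch.sigmoid(self.classifier(outputs))

    def training_step(self, batch, batch_idx):
        prediction = self(batch["waveform"])
        return nn.functional.binary_cross_entropy(prediction, batch["target"])

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```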
Note: `database.yml` tells `pyannote.database` where the data is located.
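For reference, a `pyannote.database` configuration follows this general shape. The repo's actual database.yml ships with AMI-diarization-setup, so the paths below are illustrative assumptions.

```yaml
# Illustrative sketch only; see the database.yml from AMI-diarization-setup
# for the real paths and protocol definitions.
Databases:
  # {uri} expands to each file's unique identifier
  AMI: AMI-diarization-setup/pyannote/amicorpus/{uri}/audio/{uri}.wav
Protocols:
  AMI:
    SpeakerDiarization:
      only_words:
        train:
          uri: lists/train.txt          # file identifiers
          annotation: rttms/train.rttm  # who speaks when
          annotated: uems/train.uem     # regions with complete annotation
```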
- Train longer, larger model
- Data augmentation to address overfitting
- Tested transformers, but they performed worse than the LSTM:
  - LSTM ROC AUC ≈ 90 vs. transformer ROC AUC ≈ 80
- SincNet is supposedly better than handcrafted features, but MFCCs, FBANKs, etc. should still be tested.
- More advanced inference techniques (a sketch of the first two follows this list):
  - Remove short gaps in the active speaker output
  - Remove segments that are only active for a short time
  - Use separate activation and deactivation thresholds
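Here is a sketch of the first two ideas (gap removal and short-segment removal), operating on a binarized per-frame activity array; the thresholds, frame rate, and function name are assumptions. Separate activation/deactivation thresholds would instead be applied one step earlier, when binarizing the raw probabilities (hysteresis thresholding).

```python
# Hedged sketch of the post-processing ideas above; thresholds, frame rate,
# and the function name are assumptions, not code from this repo.
import numpy as np

def clean_activity(active: np.ndarray, frame_rate: int = 100,
                   min_gap: float = 0.1, min_segment: float = 0.2) -> np.ndarray:
    """Fill short gaps, then drop segments that are active only briefly."""
    active = active.astype(bool)
    min_gap_frames = int(min_gap * frame_rate)
    min_seg_frames = int(min_segment * frame_rate)

    def runs(x):
        # Runs of equal values as (start, end, value) triples.
        edges = np.flatnonzero(np.diff(x.astype(np.int8))) + 1
        starts = np.concatenate(([0], edges))
        ends = np.concatenate((edges, [len(x)]))
        return [(s, e, bool(x[s])) for s, e in zip(starts, ends)]

    # 1. Fill short interior gaps between active regions.
    for s, e, v in runs(active):
        if not v and 0 < s and e < len(active) and e - s < min_gap_frames:
            active[s:e] = True

    # 2. Remove segments that are active for too short a time.
    for s, e, v in runs(active):
        if v and e - s < min_seg_frames:
            active[s:e] = False

    return active
```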
This idea and code are primarily based on the paper "End-to-end speaker segmentation for overlap-aware resegmentation" by Hervé Bredin and Antoine Laurent.
Also, SincNet is a key component of the model architecture: "Speaker Recognition from Raw Waveform with SincNet" by Mirco Ravanelli and Yoshua Bengio.