Speaker Diarization

Group: Enthusiasm_Overflow
Shivam Kumar 170668 
Yash Mittal 170818 
Prateek Varshney 170494

[Figure: pipeline]

Instructions for setting up Drive

We ran all our experiments on Google Colab. To reproduce our results, download the data folders listed in the table below and upload them to the corresponding locations on your Google Drive:

| Folders to be downloaded | Description | Path at which to upload in your Google Drive |
| --- | --- | --- |
| YashVAD, CNN, TransferLearningBestModels | Folders containing model weights | '/content/drive/MyDrive/' |
| LSTM_keras_50epochs_completedata_nonfreeze_SGD.h5, LSTM_keras_50epochs_completedata_history_nofreeze_SGD | Saved weights for Transfer Learning Variant 3 | '/content/drive/MyDrive/' |
| ATML | Folder containing ami_public_manual_1.6.2 and the code folder | '/content/drive/MyDrive/' |
| amicorpusfinal | Training AMI WAV dataset | '/content/drive/MyDrive/' |
| Hindi | Contains the dataset, model, and Python scripts for the Hindi_English BiLSTM model | '/content/drive/MyDrive/' |
| plots (create an empty folder) | Empty folder named 'plots' for storing generated plots | '/content/drive/MyDrive/' |
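Before running the notebooks, it can help to confirm that the Drive layout matches the table above. The following is a minimal sketch, not part of the repository: the helper names `missing_entries` and `mount_drive_if_on_colab` are hypothetical, and the check assumes every entry sits directly under '/content/drive/MyDrive/' exactly as named in the table.

```python
from pathlib import Path

# Folder and file names copied from the setup table above.
EXPECTED_ENTRIES = [
    "YashVAD",
    "CNN",
    "TransferLearningBestModels",
    "LSTM_keras_50epochs_completedata_nonfreeze_SGD.h5",
    "LSTM_keras_50epochs_completedata_history_nofreeze_SGD",
    "ATML",
    "amicorpusfinal",
    "Hindi",
    "plots",
]


def missing_entries(drive_root, expected=EXPECTED_ENTRIES):
    """Return the expected names that do not exist directly under drive_root."""
    root = Path(drive_root)
    return [name for name in expected if not (root / name).exists()]


def mount_drive_if_on_colab(mount_point="/content/drive"):
    """Mount Google Drive when running on Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


if __name__ == "__main__":
    if mount_drive_if_on_colab():
        print("Missing from Drive:", missing_entries("/content/drive/MyDrive"))
```

An empty list means every folder and file from the table was found at the expected location.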

Description of files present in this GitHub repo

Main Project Codes

Contains the following jupyter notebooks:

| Files | Description |
| --- | --- |
| Resemblyser_spectral.ipynb | Contains the baseline speaker diarization code, which uses a pre-trained instance of the Resemblyzer model (trained on fixed-length segments extracted from a large corpus) as the embedding module for our speaker diarization system. |
| CNN_embedding_submission.ipynb | Uses log-mel spectrum and MFCC feature extractors, a denoiser that removes silent parts and noise from the speech, and a CNN model to generate the embeddings. |
| AMI_LSTM_Submission_BaseLine.ipynb | Uses the log-mel spectrum of the WAV chunks as the input features to the LSTM-based embedding module. |
| DER_Hindi_English.ipynb | Contains code for speaker diarization using a BiLSTM model trained on a custom Hindi-English dataset. |
| vad_comparisons.ipynb | Compares the performance of three VAD methods: WebRTC-VAD, Voice Activity Detector, and an LSTM-based model. |

Transfer Learning Variants

Contains the following jupyter notebooks:

| Files | Description |
| --- | --- |
| Transfer_Learning_Variant1.ipynb | Passes the dataset through the pre-trained Hindi-English-BiLSTM and uses the resulting "refined" features to train a new embedding module from scratch. This is similar to passing the dataset through a sequence of two models placed one after the other. |
| Transfer_Learning_Variant2.ipynb | Combines the above two models into one: freezes the weights of the BiLSTM layers of the Hindi-English-BiLSTM model, replaces the TimeDistributed Dense layers with one LSTM layer followed by simple Dense layers, and retrains the model on MFCC features of the AMI Corpus dataset, so that only the top layers are trained. |
| Transfer_Learning_Variant3.ipynb | Similar to Variant 2, except that it also unfreezes the BiLSTM layers, i.e., it trains the "pre-trained" model (after replacing the Dense layers) end to end on the current dataset and fine-tunes it accordingly. |
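The difference between Variants 2 and 3 comes down to whether the BiLSTM layers stay frozen after the head is replaced. The following Keras sketch illustrates that idea only; it is not the code from the notebooks. The stand-in model `build_pretrained_stub`, the function `adapt_for_ami`, all layer names and sizes, and the softmax speaker head are assumptions made for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_pretrained_stub(n_frames=50, n_mfcc=13):
    """Hypothetical stand-in for the pre-trained Hindi-English BiLSTM model."""
    inputs = keras.Input(shape=(n_frames, n_mfcc))
    x = layers.Bidirectional(
        layers.LSTM(32, return_sequences=True), name="bilstm_1")(inputs)
    x = layers.Bidirectional(
        layers.LSTM(32, return_sequences=True), name="bilstm_2")(x)
    outputs = layers.TimeDistributed(
        layers.Dense(1, activation="sigmoid"), name="td_dense")(x)
    return keras.Model(inputs, outputs)


def adapt_for_ami(pretrained, freeze_bilstm, embedding_dim=64, n_speakers=4):
    """Drop the TimeDistributed Dense head and attach LSTM + Dense layers.

    freeze_bilstm=True  mirrors Variant 2 (only the new top layers train).
    freeze_bilstm=False mirrors Variant 3 (end-to-end fine-tuning).
    """
    # Freeze or unfreeze every BiLSTM layer of the pre-trained model.
    for layer in pretrained.layers:
        if isinstance(layer, layers.Bidirectional):
            layer.trainable = not freeze_bilstm

    # Cut the graph at the last BiLSTM layer, discarding the old head.
    trunk_output = pretrained.get_layer("bilstm_2").output
    x = layers.LSTM(embedding_dim, name="embedding_lstm")(trunk_output)
    x = layers.Dense(embedding_dim, activation="relu", name="dense_1")(x)
    outputs = layers.Dense(
        n_speakers, activation="softmax", name="speaker_head")(x)
    return keras.Model(pretrained.input, outputs)
```

Calling `adapt_for_ami(stub, freeze_bilstm=True)` leaves only the new LSTM and Dense weights trainable, while `freeze_bilstm=False` also exposes the BiLSTM weights to the optimizer. Both models share the same BiLSTM layer objects, so the `trainable` flags set by the most recent call apply to every model built from the same stub.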

Demo

Demos_Part1 Contains the results of a live run of speaker diarization on a YouTube clip.

Demo_Part2 Contains the results of our transfer learning variants, which adapt our model from one dataset to another.

Libraries needed to be imported

We use the following libraries:

  • pydub
  • xmltodict
  • resemblyzer
  • pyannote
  • noisereduce
  • spectralcluster
  • PyTorch
  • pyannote.metrics
  • pyannote.core
  • hdbscan
  • keras
  • tensorflow_addons
  • python_speech_features

Note: To install any of the above libraries:

  1. Use pip install library_name for your local system.
  2. Use !pip install library_name when installing on Colab.
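To see which of these libraries still need installing, a small check like the one below can be run locally or in a Colab cell. It is a sketch under assumptions: the helper `missing_modules` is hypothetical, and the import names are our guess at how each listed library is imported (for example, PyTorch is imported as `torch`, and pyannote is checked through `pyannote.core` and `pyannote.metrics`).

```python
import importlib.util

# Assumed import names for the libraries listed above.
REQUIRED_MODULES = [
    "pydub",
    "xmltodict",
    "resemblyzer",
    "pyannote.core",
    "pyannote.metrics",
    "noisereduce",
    "spectralcluster",
    "torch",
    "hdbscan",
    "keras",
    "tensorflow_addons",
    "python_speech_features",
]


def missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. pyannote) is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Not installed:", missing_modules(REQUIRED_MODULES))
```

Note that the name passed to pip can differ from the import name, so check each reported module against its package page before installing.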