Group: Enthusiasm_Overflow
Shivam Kumar 170668
Yash Mittal 170818
Prateek Varshney 170494
Since we ran all our experiments on Google Colab, to reproduce our code you will need to download the data folders listed below and upload them at the corresponding paths in your Google Drive:
Folders to be downloaded | Description | Path at which to upload in your Google Drive
---|---|---
YashVAD, CNN, TransferLearningBestModels | Folders containing model weights | `/content/drive/MyDrive/`
LSTM_keras_50epochs_completedata_nonfreeze_SGD.h5, LSTM_keras_50epochs_completedata_history_nofreeze_SGD | Saved weights for Transfer Learning Variant 3 | `/content/drive/MyDrive/`
ATML | Folder containing ami_public_manual_1.6.2 and the code folder | `/content/drive/MyDrive/`
amicorpusfinal | Training AMI WAV dataset | `/content/drive/MyDrive/`
Hindi | Contains the dataset, model, and Python scripts for the Hindi-English BiLSTM model | `/content/drive/MyDrive/`
plots | Create an empty folder named `plots` to store generated plots | `/content/drive/MyDrive/`
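After uploading, the notebooks expect these folders under `/content/drive/MyDrive/`. A minimal sketch for mounting Drive in Colab and checking that everything is in place (names taken from the table above):

```python
import os
from google.colab import drive

drive.mount('/content/drive')  # prompts for Google authorization

# Folder/file names as listed in the table above
expected = [
    'YashVAD', 'CNN', 'TransferLearningBestModels',
    'LSTM_keras_50epochs_completedata_nonfreeze_SGD.h5',
    'ATML', 'amicorpusfinal', 'Hindi', 'plots',
]
for name in expected:
    path = os.path.join('/content/drive/MyDrive', name)
    print(('OK       ' if os.path.exists(path) else 'MISSING  ') + path)
```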
The `code` folder contains the following Jupyter notebooks:
Files | Description
---|---
Resemblyser_spectral.ipynb | Contains the baseline Speaker Diarization code, which uses a pre-trained instance of the Resemblyzer model (trained on fixed-length segments extracted from a large corpus) as the Embedding module for our Speaker Diarization system (see the first sketch below).
CNN_embedding_submission.ipynb | Uses Mel-log-spectrum and MFCC feature extractors, as well as a denoiser to remove the silent parts and speech noise, and a CNN model to generate the embeddings (see the feature-extraction sketch below).
AMI_LSTM_Submission_BaseLine.ipynb | Uses the log-mel spectrum of the WAV chunks as the input vectors (features) to the LSTM-based Embedding module.
DER_Hindi_English.ipynb | Contains code for Speaker Diarization using a BiLSTM model trained on a custom Hindi-English dataset.
vad_comparisons.ipynb | Compares the performance of the three VAD methods: WebRTC-VAD, Voice Activity Detector, and an LSTM-based model (see the VAD sketch below).
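For reference, a minimal sketch of the baseline pipeline as we understand it: Resemblyzer's pre-trained `VoiceEncoder` produces sliding-window embeddings, which are then grouped with `spectralcluster`. The file name and cluster bounds are illustrative assumptions, and the `SpectralClusterer` constructor arguments vary between library versions:

```python
from resemblyzer import VoiceEncoder, preprocess_wav
from spectralcluster import SpectralClusterer

wav = preprocess_wav('meeting.wav')   # placeholder path; resamples/normalizes audio
encoder = VoiceEncoder()              # pre-trained speaker-embedding model

# Continuous embeddings over sliding partial utterances
_, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True)

# Cluster the embeddings; min/max cluster counts are assumptions
clusterer = SpectralClusterer(min_clusters=2, max_clusters=6)
labels = clusterer.predict(cont_embeds)   # one speaker label per window
```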
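The feature-extraction step used by the CNN and LSTM notebooks can be sketched as below, assuming the `noisereduce` spectral-gating denoiser and `python_speech_features` for log-mel filterbanks and MFCCs. The file name and frame parameters are illustrative; the `reduce_noise` keyword arguments are those of recent `noisereduce` releases:

```python
import numpy as np
import noisereduce as nr
from scipy.io import wavfile
from python_speech_features import logfbank, mfcc

rate, audio = wavfile.read('chunk.wav')        # placeholder WAV chunk
audio = audio.astype(np.float32)
denoised = nr.reduce_noise(y=audio, sr=rate)   # spectral-gating denoiser

log_mel = logfbank(denoised, samplerate=rate, nfilt=40)  # (frames, 40) log-mel features
mfccs = mfcc(denoised, samplerate=rate, numcep=13)       # (frames, 13) MFCCs
```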
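Of the three VAD methods compared, WebRTC-VAD is the simplest to illustrate. A hedged sketch using the `webrtcvad` package, assuming 16-bit mono PCM at 16 kHz (WebRTC-VAD only accepts 10/20/30 ms frames at 8/16/32/48 kHz):

```python
import webrtcvad
from scipy.io import wavfile

vad = webrtcvad.Vad(2)                   # aggressiveness from 0 (lenient) to 3 (strict)
rate, audio = wavfile.read('chunk.wav')  # placeholder file; int16 mono samples
frame_len = int(rate * 0.03)             # 30 ms frames

for start in range(0, len(audio) - frame_len, frame_len):
    frame = audio[start:start + frame_len].tobytes()
    if vad.is_speech(frame, rate):
        print(f'speech at {start / rate:.2f}s')
```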
The `code` folder also contains the following Jupyter notebooks for the transfer-learning experiments:
Files | Description
---|---
Transfer_Learning_Variant1.ipynb | Passes the dataset through the pre-trained Hindi-English BiLSTM and uses the resulting "refined" features to train a new Embedding module from scratch. This is equivalent to passing the dataset through a sequence of two models aligned one after the other.
Transfer_Learning_Variant2.ipynb | Combines the above two models into one: freezes the weights of the BiLSTM layers of the Hindi-English BiLSTM model, replaces the TimeDistributed Dense layers with one LSTM + simple Dense layers, and retrains the model on MFCC features of the AMI-Corpus dataset, so that only the top layers are trained (see the Keras sketch below).
Transfer_Learning_Variant3.ipynb | Similar to Variant 2, except that the BiLSTM layers are also unfrozen, i.e., the "pre-trained" model (after replacing the Dense layers) is trained end to end on the current dataset and fine-tuned accordingly.
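A hedged Keras sketch of the Variant 2 recipe (Variant 3 differs only in keeping all layers trainable). The model path, layer index, head sizes, and loss are assumptions for illustration, not the actual saved architecture; SGD matches the saved-weights file name above:

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.models.load_model('hindi_english_bilstm.h5')  # placeholder path

# Keep everything up to (but excluding) the TimeDistributed Dense head;
# the -2 index assumes that head is the model's last layer.
trunk = keras.Model(base.input, base.layers[-2].output)
for layer in trunk.layers:
    layer.trainable = False   # Variant 3 would leave these trainable

x = layers.LSTM(128)(trunk.output)              # new LSTM layer
out = layers.Dense(64, activation='relu')(x)    # new simple Dense layer
model = keras.Model(trunk.input, out)
model.compile(optimizer='sgd', loss='mse')      # loss is an illustrative placeholder
```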
- Demos_Part1: contains the results of a live speaker-diarization run on a YouTube clip.
- Demo_Part2: contains the results of our variation of applying transfer learning to adapt our model from one dataset to another.
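Diarization quality in these demos (and in `DER_Hindi_English.ipynb`) is measured by Diarization Error Rate. A minimal scoring sketch with `pyannote.metrics`, using made-up segments and labels:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()                 # ground-truth speaker turns
reference[Segment(0.0, 10.0)] = 'spk1'
reference[Segment(10.0, 20.0)] = 'spk2'

hypothesis = Annotation()                # system output
hypothesis[Segment(0.0, 11.0)] = 'A'
hypothesis[Segment(11.0, 20.0)] = 'B'

metric = DiarizationErrorRate()
print(f'DER = {metric(reference, hypothesis):.3f}')
```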
We use the following libraries:
- pydub
- xmltodict
- resemblyzer
- pyannote
- noisereduce
- spectralcluster
- PyTorch
- pyannote.metrics
- pyannote.core
- hdbscan
- keras
- tensorflow_addons
- python_speech_features
- Use `pip install library_name` on your local system.
- Use `!pip install library_name` when installing on Colab.
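For convenience, everything can be installed in one command; the package names below are the usual PyPI names (PyTorch installs as `torch`; `pyannote` is covered by `pyannote.core` and `pyannote.metrics`), which may drift between releases:

```
pip install pydub xmltodict resemblyzer noisereduce spectralcluster torch \
    pyannote.metrics pyannote.core hdbscan keras tensorflow_addons \
    python_speech_features
```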