A novel multi-task self-supervised learning approach, capable of learning both augmentation-invariant and augmentation-equivariant features in a parameter-efficient manner.
The paper can be found either in the InterSpeech23 proceedings or on arXiv.
- 23/8/23: Presented work in poster format at InterSpeech23. Poster released in this repo
- 16/8/23: Paper formally released by InterSpeech23, see here
- 8/8/23: Pre-trained weights for MT-SLVR models released
- 1/6/23: Blog post with additional details and diagrams released: here
- 29/5/23: Paper and code made public
- 17/5/23: MT-SLVR accepted to InterSpeech23, to be presented in August
If you find this work useful or related to your own, please consider citing it:
@inproceedings{heggan23_interspeech,
author={Calum Heggan and Tim Hospedales and Sam Budgett and Mehrdad Yaghoobi},
title={{MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={4399--4403},
doi={10.21437/Interspeech.2023-1064}
}
Simply put, the MT-SLVR algorithm applies multi-task learning across contrastive and predictive self-supervised learning techniques. The features learnt by each of these algorithms are expected to conflict heavily (i.e. one tries to learn augmentation invariance while the other tries to learn augmentation equivariance). To allow both to co-exist and be readily available for downstream tasks, we fit adapters throughout the neural network, giving each task (contrastive/predictive) some task-specific parameters of its own to learn with.
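To make the idea above concrete, here is a toy, dependency-free sketch of a single layer with shared weights plus one parallel adapter per task, and a weighted sum of the two task losses. It is an illustration only and not the code in this repo: the class name `ParallelAdapterLayer`, the helper functions, and the exact way `pred_weight` scales the predictive loss are all assumptions made for the example.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, standing in for learnable parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ParallelAdapterLayer:
    """Hypothetical layer: shared weights plus one small adapter per task."""

    def __init__(self, dim, tasks=("contrastive", "predictive"), seed=0):
        rng = random.Random(seed)
        self.shared = rand_matrix(dim, dim, rng)  # used by every task
        self.adapters = {t: rand_matrix(dim, dim, rng) for t in tasks}  # task-specific

    def forward(self, x, task):
        # Parallel adapter: the task-specific branch is added to the shared branch.
        shared_out = matvec(self.shared, x)
        adapter_out = matvec(self.adapters[task], x)
        return [s + a for s, a in zip(shared_out, adapter_out)]


def multi_task_loss(contrastive_loss, predictive_loss, pred_weight=1.0):
    """Assumed combination: contrastive loss plus a weighted predictive loss."""
    return contrastive_loss + pred_weight * predictive_loss


layer = ParallelAdapterLayer(dim=4)
x = [1.0, 0.5, -0.5, 2.0]
inv_features = layer.forward(x, "contrastive")  # invariance-oriented branch
eqv_features = layer.forward(x, "predictive")   # equivariance-oriented branch
```

Because the shared weights are reused and only the adapters differ, both feature sets come from the same backbone at a small parameter cost, which is the property the paragraph above describes.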
This repo contains a few distinct parts which can be used to both reproduce the results from our work and train new models for varying purposes. Within this repo we include the following sub-codebases:
- Contrastive Only Methods: Our baseline SimSiam and SimCLR only approaches
- Predictive Methods: Our baseline transformation prediction only approach
- Multi-Task Method (MT-SLVR): Our novel multi-task approach
We note that although each of the three major codebases has unique parts, there is a significant amount of overlapping code, e.g. dataset and augmentation classes. We left the overall codebase like this, instead of refactoring and removing repeated scripts, so that each section can be used independently, which makes the repo easier to use straight away.
We exclude our evaluation framework from this specific repo (due to its additional complexity and potential usefulness as a standalone codebase) and instead host it here. This evaluation repo is still under construction with respect to documentation.
We use miniconda for our experimental setup. For the purposes of reproduction we include the environment file. This can be set up using the following command:
conda env create --file torch_gpu_env.txt
There are likely some redundant packages in this environment; we will attempt to trim it down for future releases.
For pre-training we use the balanced version of AudioSet. The decision to use this set was based on its ease of use and manageable size. Unfortunately this set is not easily available to download. That said, the set can be reproduced using a YouTube scraping script. Details and references for this process can be found here.
We also release the weights of the models used in the original work. The script, along with details on how to use it, can be found here.
Additional details on how to run MT-SLVR can be found in its sub-codebase, but the main command is of the format:
python NEW_RUN.py --cont_framework simclrv1 --pred_framework trans --pred_weight 1.0 --adapter parallel --num_splits 2 --batch_size 100 --lr 0.00005 --p 1.0 --data_name AS_BAL --dims 2 --in_channels 3 --model_fc_out 1000 --gpu 0
Hyperparameter descriptions can be found in NEW_RUN.py.
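When launching several runs, it can help to build the command programmatically. The sketch below reproduces the command shown above from a dictionary of flags and lets individual values be overridden. The helper name `build_mt_slvr_command` is hypothetical and not part of this repo; the flag names and default values are copied from the command above, and values are kept as strings so they are passed through unchanged.

```python
import shlex

# Flag values copied from the example command; kept as strings so that
# e.g. "0.00005" is not reformatted as "5e-05".
DEFAULT_ARGS = {
    "cont_framework": "simclrv1",
    "pred_framework": "trans",
    "pred_weight": "1.0",
    "adapter": "parallel",
    "num_splits": "2",
    "batch_size": "100",
    "lr": "0.00005",
    "p": "1.0",
    "data_name": "AS_BAL",
    "dims": "2",
    "in_channels": "3",
    "model_fc_out": "1000",
    "gpu": "0",
}


def build_mt_slvr_command(overrides=None):
    """Return the argument list for NEW_RUN.py (hypothetical helper)."""
    args = dict(DEFAULT_ARGS)
    args.update(overrides or {})
    command = ["python", "NEW_RUN.py"]
    for flag, value in args.items():
        command += [f"--{flag}", str(value)]
    return command


default_command = shlex.join(build_mt_slvr_command())
second_gpu_command = build_mt_slvr_command({"gpu": "1"})
print(default_command)
```

The returned list can be passed directly to `subprocess.run(...)` from inside the MT-SLVR sub-codebase directory.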
Details on running baselines can be found in their respective sub-codebases.