
ARCH: Audio Representations benCHmark


This repository contains the code for the ARCH benchmark, which evaluates audio representations on a wide range of datasets and tasks. The benchmark is designed to be easy to use and to allow the comparison of different audio representations.

The main features of ARCH are:

  • Plug and play: the benchmark is designed to be easy to use. It provides a unified interface to load the datasets and to evaluate audio representations.
  • Extensibility: the benchmark is designed to be easy to extend. New datasets and tasks, as well as new models whose audio representations should be evaluated, can be added.
  • Standardization: the benchmark aims to standardize the evaluation of audio representations. The plethora of ARL (audio representation learning) models and datasets makes them difficult to compare; the benchmark provides a standard way to evaluate audio representations.

The main components and their interactions are illustrated in the following figure:


[Figure: overview of the main components of ARCH and their interactions]

Installation

ARCH can be installed by just cloning the repository and installing it with pip:

git clone https://github.com/MorenoLaQuatra/ARCH.git
cd ARCH
pip install -e .

Reproducing the results provided in the first release

The benchmark can be used by importing the arch_eval module. The file evaluate_hf_models.py contains an example of how to use the benchmark and exposes the following parameters to configure it (a configuration sketch is shown after the list):

  • model: the name of the model to evaluate. It can be any model from the HuggingFace model hub or a local model exposing the same interface.
  • device: the device to use for the evaluation. It can be cpu or cuda.
  • max_epochs: the maximum number of epochs to train the linear classifier.
  • verbose: if True, it prints the results of the evaluation and other information on the standard output.
  • tsv_logging_file: the file where to save the results of the evaluation in TSV format.
  • n_iters: the number of times to repeat the evaluation; it can be used to compute the average and standard deviation over multiple runs.
  • data_config_file: the file containing the configuration of the datasets to use for the evaluation (you can find it at configs/data_config.json).
  • enabled_datasets: the list of datasets to use for the evaluation. It can be any of the following: esc50, us8k, fsd50k, vivae, fma_small, magna_tag_a_tune, irmas, medleydb, ravdess, audio_mnist, slurp, emovo.
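
As a quick orientation, the sketch below shows one plausible way these parameters could be set. The values are illustrative only, and the assumption that they are plain Python variables (rather than, for example, command-line arguments) is ours; refer to evaluate_hf_models.py for the actual mechanism.

# Illustrative configuration only: the names mirror the parameters listed above,
# but evaluate_hf_models.py may expose them differently (e.g. as CLI arguments).
model = "facebook/wav2vec2-base"              # any HF hub model or local path
device = "cuda"                               # or "cpu"
max_epochs = 200                              # training budget for the linear classifier
verbose = True
tsv_logging_file = "results.tsv"              # hypothetical output path
n_iters = 1                                   # >1 to average over repeated runs
data_config_file = "configs/data_config.json"
enabled_datasets = ["esc50", "us8k"]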

Datasets and tasks

The benchmark includes multiple datasets and, at the moment, only classification tasks. The following table contains the list of the datasets and tasks currently supported by the benchmark.

| Dataset | Task | Type | Reference | Version |
| --- | --- | --- | --- | --- |
| ESC-50 | Single-label classification | Sound events | ESC: Dataset for Environmental Sound Classification | Version 1 |
| US8K | Single-label classification | Sound events | A Dataset and Taxonomy for Urban Sound Research | Version 1 |
| FSD50K | Single-label classification | Sound events | FSD50K: An Open Dataset of Human-Labeled Sound Events | Version 1 |
| VIVAE | Single-label classification | Sound events | The Variably Intense Vocalizations of Affect and Emotion (VIVAE) corpus prompts new perspective on nonspeech perception | Version 1 |
| FMA-small | Single-label classification | Music | FMA: A Dataset For Music Analysis | Version 1 |
| MagnaTagATune | Multi-label classification | Music | Evaluation of algorithms using games: the case of music annotation | Version 1 |
| IRMAS | Multi-label classification | Music | A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals | Version 1 |
| Medley-solos-DB | Single-label classification | Music | Deep convolutional networks on the pitch spiral for musical instrument recognition | Version 1 |
| RAVDESS | Single-label classification | Speech | The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English | Version 1 |
| AudioMNIST | Single-label classification | Speech | Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals | Version 1 |
| SLURP | Single-label classification | Speech | SLURP: A Spoken Language Understanding Resource Package | Version 1 |
| EMOVO | Single-label classification | Speech | EMOVO: A Dataset for Emotion Recognition in Spontaneous Speech | Version 1 |

Version 1: 2022-02-26 - The first released version of the benchmark. The table above indicates which datasets are included in each version of the benchmark.

The instructions to download the datasets are available in the data_download/README.md file.

Detailed information and results for the first version of the benchmark are available in the [dedicated 🤗 space](https://huggingface.co/spaces/ALM/ARCH). The results include both the numbers reported in the paper and the specific versions of the models evaluated.

Models

The models evaluated so far are summarized in the following table: for each model, we report its name, number of parameters, and GFLOPs. The results are reported in the [dedicated 🤗 space](https://huggingface.co/spaces/ALM/ARCH).

| Model | # Params | GFLOPs |
| --- | --- | --- |
| facebook/wav2vec2-base | ~90M | ~70 |
| microsoft/wavlm-base | ~90M | ~70 |
| microsoft/wavlm-base-plus | ~90M | ~70 |
| facebook/hubert-base-ls960 | ~90M | ~70 |
| facebook/data2vec-audio-base | ~90M | ~70 |
| ALM/wav2vec2-base-audioset (new) | ~90M | ~70 |
| ALM/hubert-base-audioset (new) | ~90M | ~70 |
| facebook/wav2vec2-large-robust | ~300M | ~190 |
| facebook/wav2vec2-xls-r-300m | ~300M | ~190 |
| microsoft/wavlm-large | ~300M | ~190 |
| facebook/hubert-large-ll60k | ~300M | ~190 |
| facebook/data2vec-audio-large | ~300M | ~190 |
| ALM/wav2vec2-large-audioset (new) | ~300M | ~190 |
| ALM/hubert-large-audioset (new) | ~300M | ~190 |
| facebook/wav2vec2-xls-r-1b | ~1B | ~530 |
| facebook/hubert-xlarge-ll60k | ~1B | ~530 |
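
As a sanity check, the parameter counts above can be reproduced directly from the 🤗 Transformers checkpoints. The snippet below is a minimal sketch that only verifies the order of magnitude; the exact count depends on the checkpoint variant.

from transformers import AutoModel

# Rough parameter count for one of the checkpoints listed above.
model = AutoModel.from_pretrained("facebook/wav2vec2-base")
n_params = sum(p.numel() for p in model.parameters())
print(f"facebook/wav2vec2-base: ~{n_params / 1e6:.0f}M parameters")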

Usage

The framework is designed to evaluate the performance of a model on the desired dataset. If the model follows one of the available architectures (see the Models section), you can import it and use it directly to evaluate performance on the desired dataset. The following is an example of how to use the framework to evaluate a Wav2Vec2-style model on the ESC-50 dataset.

import json

import torch
from configs.w2v2_wrapper import Wav2Vec2ModelWrapper

from arch_eval import Model, ClassificationModel, ClassificationDataset
from arch_eval import ESC50
from transformers import AutoModel, AutoFeatureExtractor


device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME_OR_PATH = "facebook/wav2vec2-base"
MAX_EPOCHS = 200
# load the dataset information - update this file according to the downloaded dataset(s)
with open("configs/data_config.json") as f:
    datasets_info = json.load(f)
dataset_name = "esc50"

# load the pre-trained backbone and its feature extractor, then move the model to the target device
audio_model = AutoModel.from_pretrained(MODEL_NAME_OR_PATH)
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME_OR_PATH)
audio_model = audio_model.to(device)
# create the model wrapper
model = Wav2Vec2ModelWrapper(
    audio_model, 
    feature_extractor, 
    device, 
    max_length=datasets_info[dataset_name]["max_length_seconds"]*16_000
)
# evaluator for the ESC-50 dataset
evaluator = ESC50(datasets_info[dataset_name]["path"], verbose=True)
res_dataset = evaluator.evaluate(
    model, 
    mode="linear", 
    device=device, 
    batch_size=8, 
    max_num_epochs=MAX_EPOCHS
)

# print the metrics returned by the evaluator
for metric, value in res_dataset.items():
    print(f"{metric}: {value}")

In the example above, the model is evaluated on the ESC-50 dataset, and the benchmark uses the configuration file configs/data_config.json to retrieve the path to the dataset and the maximum length of the audio files in seconds. The configuration file is a JSON file that contains this information for each dataset. The following is an example of the configuration entry for the ESC-50 dataset.

{
    "esc50": {
        "path": "PATH_TO_AUDIO_DATASETS/esc50/",
        "max_length_seconds": 5,
        "is_multilabel": false
    }
}
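
If a model does not match one of the provided wrappers (such as Wav2Vec2ModelWrapper above), a small custom wrapper is needed. The sketch below is purely illustrative: the class name, the get_embeddings method, and its signature are assumptions made for this example, not the interface actually required by arch_eval, so check the Model class and the wrappers in configs/ for the real contract.

import torch

class MyCustomModelWrapper:
    """Hypothetical wrapper: adapts a generic 🤗 Transformers backbone to an
    embedding-extraction interface similar in spirit to Wav2Vec2ModelWrapper.
    The methods required by arch_eval may differ from the ones sketched here."""

    def __init__(self, backbone, feature_extractor, device, max_length):
        self.backbone = backbone
        self.feature_extractor = feature_extractor
        self.device = device
        self.max_length = max_length

    def get_embeddings(self, audio):
        # audio: 1-D waveform sampled at 16 kHz; returns a clip-level embedding
        inputs = self.feature_extractor(
            audio,
            sampling_rate=16_000,
            return_tensors="pt",
            truncation=True,
            max_length=self.max_length,
        ).to(self.device)
        with torch.no_grad():
            hidden_states = self.backbone(**inputs).last_hidden_state
        # mean-pool over the time dimension to obtain a fixed-size representation
        return hidden_states.mean(dim=1).squeeze(0)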

Contributing

We welcome contributions to the benchmark. If you want to add a dataset or a model, please follow the instructions in the CONTRIBUTING.md file. If you want to add new features, fix bugs, improve the documentation, or just add new results, please open an issue or a pull request.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The bow icons in the logo were created by Slidicon - Flaticon.

Authors

Moreno La Quatra

Alkis Koudounas

Lorenzo Vaiani

Acknowledgments

This work would not have been possible without the support of the authors of the datasets and the models used in the benchmark. We would like to thank them for their work and for making their datasets and models publicly available.

References

The table above contains the references of the datasets used in the benchmark; if you use them in your work, please cite them accordingly.

The specific models evaluated for each version of the benchmark are reported in the results page; if you use them in your work, please cite them accordingly.

If you use the benchmark in your work, please cite the following paper:

Version 1:

@INPROCEEDINGS{ARCH,
  author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Cagliero, Luca and Garza, Paolo and Siniscalchi, Sabato Marco},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)}, 
  title={Benchmarking Representations for Speech, Music, and Acoustic Events}, 
  year={2024},
  pages={505-509},
  keywords={Representation learning; Systematics; Conferences; Benchmark testing; Signal processing; Acoustics; Data models; Audio Representation Learning; Benchmark; Pre-trained Models; Self-Supervised Learning},
  doi={10.1109/ICASSPW62465.2024.10625960}
}