This repository contains the code for the ARCH benchmark. It is designed to evaluate audio representations on a wide range of datasets and tasks, and to make it easy to compare different audio representations.
The main features of ARCH are:
- Plug and play: the benchmark is designed to be easy to use. It provides a unified interface to load the datasets and to evaluate audio representations.
- Extensibility: the benchmark is designed to be easy to extend. New datasets and tasks can be added, as well as new models whose audio representations can be evaluated.
- Standardization: the benchmark aims to standardize the evaluation of audio representations. The plethora of audio representation learning (ARL) models and datasets makes them difficult to compare; the benchmark provides a standard way to evaluate audio representations.
The main components and their interactions are illustrated in the following figure:
ARCH can be installed by just cloning the repository and installing it with pip:
```bash
git clone https://github.com/MorenoLaQuatra/ARCH.git
cd ARCH
pip install -e .
```
The benchmark can be used by importing the `arch` module. The file `evaluate_hf_models.py` contains an example of how to use the benchmark. It contains the following parameters that can be used to configure the benchmark (an illustrative configuration sketch follows the list):
- `model`: the name of the model to evaluate. It can be any model from the HuggingFace model hub or a local model exposing the same interface.
- `device`: the device to use for the evaluation. It can be `cpu` or `cuda`.
- `max_epochs`: the maximum number of epochs to train the linear classifier.
- `verbose`: if `True`, it prints the results of the evaluation and other information on the standard output.
- `tsv_logging_file`: the file where the results of the evaluation are saved in TSV format.
- `n_iters`: the number of times to repeat the evaluation. It can be used to compute the average of multiple runs and their standard deviation.
- `data_config_file`: the file containing the configuration of the datasets to use for the evaluation (you can find it at `configs/data_config.json`).
- `enabled_datasets`: the list of datasets to use for the evaluation. It can be any of the following: `esc50`, `us8k`, `fsd50k`, `vivae`, `fma_small`, `magna_tag_a_tune`, `irmas`, `medleydb`, `ravdess`, `audio_mnist`, `slurp`, `emovo`.
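The snippet below is a minimal sketch of how these parameters could be set in a custom evaluation script. The variable names mirror the parameter list above, but the values (model name, file paths, dataset selection) are illustrative examples and are not the defaults shipped with `evaluate_hf_models.py`.

```python
# Illustrative configuration only: values are examples,
# not the defaults of evaluate_hf_models.py.
model = "facebook/wav2vec2-base"                 # any HF hub model or a local checkpoint
device = "cuda"                                  # or "cpu"
max_epochs = 200                                 # max epochs for the linear classifier
verbose = True                                   # print progress and results to stdout
tsv_logging_file = "results/arch_results.tsv"    # where to log results in TSV format
n_iters = 1                                      # repeat the evaluation to average runs
data_config_file = "configs/data_config.json"    # dataset paths and settings
enabled_datasets = ["esc50", "us8k", "ravdess"]  # subset of the supported datasets
```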
The benchmark includes multiple datasets and, at the moment, only classification tasks. The following table contains the list of the datasets and tasks currently supported by the benchmark.
Version 1: 2022-02-26 - the first released version of the benchmark. The table above indicates which datasets are included in each version of the benchmark.
The instructions to download the datasets are available on the data_download/README.md file.
Detailed information and results of the first version of the benchmark are available on this space. The results include both the numbers reported in the paper and the specific versions of the models evaluated.
The models evaluated so far are summarized in the following table. We report the name of the model, the number of parameters, and the number of GFLOPs. The results are reported on the [dedicated 🤗 space](https://huggingface.co/spaces/ALM/ARCH) page.
| Model | # Params | GFLOPs |
|---|---|---|
| facebook/wav2vec2-base | ~90M | ~70 |
| microsoft/wavlm-base | ~90M | ~70 |
| microsoft/wavlm-base-plus | ~90M | ~70 |
| facebook/hubert-base-ls960 | ~90M | ~70 |
| facebook/data2vec-audio-base | ~90M | ~70 |
| ALM/wav2vec2-base-audioset (new) | ~90M | ~70 |
| ALM/hubert-base-audioset (new) | ~90M | ~70 |
| facebook/wav2vec2-large-robust | ~300M | ~190 |
| facebook/wav2vec2-xls-r-300m | ~300M | ~190 |
| microsoft/wavlm-large | ~300M | ~190 |
| facebook/hubert-large-ll60k | ~300M | ~190 |
| facebook/data2vec-audio-large | ~300M | ~190 |
| ALM/wav2vec2-large-audioset (new) | ~300M | ~190 |
| ALM/hubert-large-audioset (new) | ~300M | ~190 |
| facebook/wav2vec2-xls-r-1b | ~1B | ~530 |
| facebook/hubert-xlarge-ll60k | ~1B | ~530 |
The framework is designed to evaluate the performance of a model on the desired dataset. If the model follows one of the available architectures (see the models section), you can simply import it and evaluate it on the dataset of interest. The following example shows how to evaluate a Wav2Vec2-style model on the ESC-50 dataset.
```python
import json

import torch
from transformers import AutoModel, AutoFeatureExtractor

from configs.w2v2_wrapper import Wav2Vec2ModelWrapper
from arch_eval import Model, ClassificationModel, ClassificationDataset
from arch_eval import ESC50

device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME_OR_PATH = "facebook/wav2vec2-base"
MAX_EPOCHS = 200

# load the dataset information - update this file according to the downloaded dataset(s)
with open("configs/data_config.json") as f:
    datasets_info = json.load(f)
dataset_name = "esc50"

# load the pre-trained model and its feature extractor
audio_model = AutoModel.from_pretrained(MODEL_NAME_OR_PATH)
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME_OR_PATH)
audio_model = audio_model.to(device)

# create the model wrapper
model = Wav2Vec2ModelWrapper(
    audio_model,
    feature_extractor,
    device,
    max_length=datasets_info[dataset_name]["max_length_seconds"] * 16_000,
)

# evaluator for the ESC-50 dataset
evaluator = ESC50(datasets_info[dataset_name]["path"], verbose=True)
res_dataset = evaluator.evaluate(
    model,
    mode="linear",
    device=device,
    batch_size=8,
    max_num_epochs=MAX_EPOCHS,
)

for metric, value in res_dataset.items():
    print(f"{metric}: {value}")
```
In the example above, the model is evaluated on the ESC-50 dataset, and the benchmark uses the configuration file `configs/data_config.json` to retrieve the path to the dataset and the maximum length of the audio files in seconds.
The configuration file is a JSON file that contains the information for each dataset. The following is an example of the configuration file for the ESC-50 dataset.
```json
{
    "esc50": {
        "path": "PATH_TO_AUDIO_DATASETS/esc50/",
        "max_length_seconds": 5,
        "is_multilabel": false
    }
}
```
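For reference, the snippet below sketches how the fields of such a config entry are consumed in the example above: `path` is passed to the dataset evaluator, and `max_length_seconds` is converted to a number of samples for the model wrapper. The 16 kHz sampling rate is the one used in the example; adjust it if your model expects a different rate.

```python
import json

# Minimal sketch of reading a data_config.json entry, following the same
# conventions as the ESC-50 example above (16 kHz sampling rate assumed).
with open("configs/data_config.json") as f:
    datasets_info = json.load(f)

entry = datasets_info["esc50"]
dataset_path = entry["path"]                       # passed to the dataset evaluator
max_length = entry["max_length_seconds"] * 16_000  # max length in samples for the wrapper
is_multilabel = entry["is_multilabel"]             # whether the task is multi-label

print(dataset_path, max_length, is_multilabel)
```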
We welcome contributions to the benchmark. If you want to add a dataset or a model, please follow the instructions in the CONTRIBUTING.md file. If you want to add new features, fix bugs, improve the documentation, or just add new results, please open an issue or a pull request.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Bow icons in the logo created by Slidicon - Flaticon
This work would not have been possible without the support of the authors of the datasets and the models used in the benchmark. We would like to thank them for their work and for making their datasets and models publicly available.
The table above contains the links and references for the datasets used in the benchmark; if you use them in your work, please cite them accordingly.
The specific models evaluated in each version of the benchmark are reported on the results page; if you use them in your work, please cite them accordingly.
If you use the benchmark in your work, please cite the following paper:
Version 1:
```bibtex
@INPROCEEDINGS{ARCH,
  author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Cagliero, Luca and Garza, Paolo and Siniscalchi, Sabato Marco},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
  title={Benchmarking Representations for Speech, Music, and Acoustic Events},
  year={2024},
  pages={505-509},
  keywords={Representation learning; Systematics; Conferences; Benchmark testing; Signal processing; Acoustics; Data models; Audio Representation Learning; Benchmark; Pre-trained Models; Self-Supervised Learning},
  doi={10.1109/ICASSPW62465.2024.10625960}
}
```