Commit 611d974 (initial commit, 0 parents): 35 changed files with 4,069 additions and 0 deletions.
**.gitattributes**

```
*.pt filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
```
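These attributes mean that model snapshots (`*.pt`) and compressed archives (`*.gz`) are stored via Git LFS, so a plain checkout contains only pointer stubs. A typical way to fetch the real files uses the standard Git LFS client; the clone URL below is assumed from the repository name, not stated in this commit:

```
# one-time setup of the Git LFS hooks
git lfs install

# clone the repository; LFS-tracked files are downloaded on checkout,
# or can be pulled explicitly afterwards
git clone https://github.com/AndreyGuzhov/AudioCLIP.git
cd AudioCLIP
git lfs pull
```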
**README.md**
# AudioCLIP
## Extending [CLIP](https://github.com/openai/CLIP) to Image, Text and Audio

This repository contains an implementation of the models described in the paper [arXiv:2106.13043](https://arxiv.org/abs/2106.13043).
This work is based on our previous works:
* [ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio (2021)](https://github.com/AndreyGuzhov/ESResNeXt-fbsp).
* [ESResNet: Environmental Sound Classification Based on Visual Domain Models (2020)](https://github.com/AndreyGuzhov/ESResNet).
### Abstract

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains.
Today, we observe a trend towards fusing domain-specific tasks and approaches, which provides the community with new outstanding models.

In this work, we present an extension of the CLIP model that handles audio in addition to text and images.
Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset.
Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion.

AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets.
Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively).

Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results.
For the sake of reproducibility, our code is published.
### How to Run the Model

The required Python version is >= 3.7.

#### AudioCLIP

##### On the [ESC-50](https://github.com/karolpiczak/ESC-50) dataset

```
python main.py --config protocols/audioclip-esc50.json --Dataset.args.root /path/to/ESC50
```

##### On the [UrbanSound8K](https://urbansounddataset.weebly.com/) dataset

```
python main.py --config protocols/audioclip-us8k.json --Dataset.args.root /path/to/UrbanSound8K
```
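A note on the flags: judging by its dotted shape, `--Dataset.args.root` overrides the `Dataset.args.root` entry of the chosen protocol JSON from the command line. This reading is an inference from the flag name; `main.py` itself is not among the files shown here.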
### Cite Us

```
@misc{guzhov2021audioclip,
  title={AudioCLIP: Extending CLIP to Image, Text and Audio},
  author={Andrey Guzhov and Federico Raue and Jörn Hees and Andreas Dengel},
  year={2021},
  eprint={2106.13043},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```
(4 Git LFS files not shown.)
**assets/README.md**

This folder contains snapshots of the pre-trained models.
(1 Git LFS file not shown.)
**ignite_trainer/README.md**

# Training Wrapper

Utility code to run training and evaluation of the model.
**ignite_trainer/__init__.py**

```
# os and sys are aliased with leading underscores to keep them
# out of the package's public namespace.
import os as _os
import sys as _sys

from ignite_trainer.version import __version__
from ._trainer import main, run
from ._utils import load_class
from ._interfaces import AbstractNet, AbstractTransform

__all__ = [
    '__version__',
    'main', 'run',
    'load_class',
    'AbstractNet', 'AbstractTransform'
]

# Make modules in the current working directory importable by name.
_sys.path.extend([_os.getcwd()])
```
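For illustration only, a hypothetical use of the package surface; `main`, `run`, and `load_class` are merely re-exported here, and their actual signatures live in `_trainer.py` and `_utils.py`, which are not shown in this commit, so the `load_class` call below is an assumption rather than a documented interface:

```
import ignite_trainer

print(ignite_trainer.__version__)

# Hypothetical: resolve a class from a fully-qualified dotted name.
# The real signature of load_class is defined in _utils.py (not shown).
NetClass = ignite_trainer.load_class('my_package.my_module.MyNet')
```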
**ignite_trainer/_interfaces.py**

```
import abc
import torch

from typing import Tuple
from typing import Union
from typing import Callable
from typing import Optional


# A network may return either a single tensor or a pair of tensors.
TensorPair = Tuple[torch.Tensor, torch.Tensor]
TensorOrTwo = Union[torch.Tensor, TensorPair]


class AbstractNet(abc.ABC, torch.nn.Module):

    @abc.abstractmethod
    def forward(self, x: torch.Tensor, y: Optional[torch.Tensor] = None) -> TensorOrTwo:
        pass

    @abc.abstractmethod
    def loss_fn(self, y_pred: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        pass

    @property
    @abc.abstractmethod
    def loss_fn_name(self) -> str:
        pass


class AbstractTransform(abc.ABC, Callable[[torch.Tensor], torch.Tensor]):

    @abc.abstractmethod
    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        pass

    def __repr__(self):
        return self.__class__.__name__ + '()'
```
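As a minimal sketch (not part of this commit), a concrete model must implement all three abstract members of `AbstractNet`, while a transform only needs `__call__`. The class names and the cross-entropy choice below are illustrative assumptions:

```
import torch

from typing import Optional

from ignite_trainer import AbstractNet, AbstractTransform


class LinearNet(AbstractNet):
    """Example only: a single linear layer trained with cross-entropy."""

    def __init__(self, in_features: int = 128, num_classes: int = 10):
        super().__init__()
        self.fc = torch.nn.Linear(in_features, num_classes)

    def forward(self, x: torch.Tensor, y: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Returning a single tensor is one of the two shapes TensorOrTwo allows.
        return self.fc(x)

    def loss_fn(self, y_pred: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.cross_entropy(y_pred, y)

    @property
    def loss_fn_name(self) -> str:
        return 'Cross-Entropy'


class Identity(AbstractTransform):
    """Example transform that returns its input unchanged."""

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        return x
```

Because `AbstractNet` inherits from `torch.nn.Module`, `LinearNet` behaves like any other PyTorch module with respect to parameters, devices, and training loops.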