Commit: create repo
AndreyGuzhov committed Jun 28, 2021
0 parents commit 611d974
Showing 35 changed files with 4,069 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
*.pt filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
47 changes: 47 additions & 0 deletions README.md
@@ -0,0 +1,47 @@
# AudioCLIP
## Extending [CLIP](https://github.com/openai/CLIP) to Image, Text and Audio

This repository contains the implementation of the models described in the paper [arXiv:2106.13043](https://arxiv.org/abs/2106.13043).
This work is based on our previous works:
* [ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio (2021)](https://github.com/AndreyGuzhov/ESResNeXt-fbsp).
* [ESResNet: Environmental Sound Classification Based on Visual Domain Models (2020)](https://github.com/AndreyGuzhov/ESResNet).

### Abstract

In the past, the rapidly evolving field of sound classification benefited greatly from the application of methods from other domains.
Today, we observe a trend toward fusing domain-specific tasks and approaches, which provides the community with new outstanding models.

In this work, we present an extension of the CLIP model that handles audio in addition to text and images.
Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset.
Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion.
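
To make the zero-shot mechanism concrete, here is a minimal sketch of classification by audio-text embedding similarity (plain PyTorch; the function and tensor names are illustrative, not this repository's API):

```python
import torch


def zero_shot_classify(audio_emb: torch.Tensor, text_embs: torch.Tensor) -> int:
    """Pick the class whose text-prompt embedding is closest to the audio clip.

    audio_emb: (d,) embedding of one audio clip.
    text_embs: (num_classes, d) embeddings of the class-name prompts.
    """
    # L2-normalize so that dot products equal cosine similarities.
    audio_emb = audio_emb / audio_emb.norm()
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    # The most similar prompt wins; no training on the target dataset is needed.
    return int((text_embs @ audio_emb).argmax())
```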

AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets.
Furthermore, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively).

Finally, we assess the cross-modal querying performance of the proposed model, as well as the influence of full and partial training on the results.
For the sake of reproducibility, our code is published.

### How to Run the Model

The required Python version is >= 3.7.

#### AudioCLIP

##### On the [ESC-50](https://github.com/karolpiczak/ESC-50) dataset
```
python main.py --config protocols/audioclip-esc50.json --Dataset.args.root /path/to/ESC50
```

##### On the [UrbanSound8K](https://urbansounddataset.weebly.com/) dataset
```
python main.py --config protocols/audioclip-us8k.json --Dataset.args.root /path/to/UrbanSound8K
```
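
The pretrained snapshots in `assets/` (tracked via Git LFS) can be inspected with plain PyTorch before wiring them into the training code. A hedged sketch; whether each `.pt` file holds a raw `state_dict` or a full checkpoint object is an assumption to verify:

```python
import torch

# Path taken from this commit's assets/ folder; the loading logic is an assumption.
ckpt = torch.load('assets/AudioCLIP-Full-Training.pt', map_location='cpu')

# Peek at the structure to see what the snapshot contains.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```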

### Cite Us

```
@misc{guzhov2021audioclip,
  title={AudioCLIP: Extending CLIP to Image, Text and Audio},
  author={Andrey Guzhov and Federico Raue and Jörn Hees and Andreas Dengel},
  year={2021},
  eprint={2106.13043},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```
3 changes: 3 additions & 0 deletions assets/AudioCLIP-Full-Training.pt
Git LFS file not shown
3 changes: 3 additions & 0 deletions assets/AudioCLIP-Partial-Training.pt
Git LFS file not shown
3 changes: 3 additions & 0 deletions assets/CLIP.pt
Git LFS file not shown
3 changes: 3 additions & 0 deletions assets/ESRNXFBSP.pt
Git LFS file not shown
1 change: 1 addition & 0 deletions assets/README.md
@@ -0,0 +1 @@
This folder contains snapshots of the pre-trained models.
3 changes: 3 additions & 0 deletions assets/bpe_simple_vocab_16e6.txt.gz
Git LFS file not shown
3 changes: 3 additions & 0 deletions ignite_trainer/README.md
@@ -0,0 +1,3 @@
# Training Wrapper

Utility code to run training and evaluation of the model.
16 changes: 16 additions & 0 deletions ignite_trainer/__init__.py
@@ -0,0 +1,16 @@
import os as _os
import sys as _sys

from ignite_trainer.version import __version__
from ._trainer import main, run
from ._utils import load_class
from ._interfaces import AbstractNet, AbstractTransform

__all__ = [
    '__version__',
    'main', 'run',
    'load_class',
    'AbstractNet', 'AbstractTransform'
]

# Make modules in the current working directory importable,
# e.g. when classes are loaded dynamically via load_class.
_sys.path.extend([_os.getcwd()])
37 changes: 37 additions & 0 deletions ignite_trainer/_interfaces.py
@@ -0,0 +1,37 @@
import abc
import torch

from typing import Tuple
from typing import Union
from typing import Callable
from typing import Optional


TensorPair = Tuple[torch.Tensor, torch.Tensor]
TensorOrTwo = Union[torch.Tensor, TensorPair]


class AbstractNet(abc.ABC, torch.nn.Module):

    @abc.abstractmethod
    def forward(self, x: torch.Tensor, y: Optional[torch.Tensor] = None) -> TensorOrTwo:
        pass

    @abc.abstractmethod
    def loss_fn(self, y_pred: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        pass

    @property
    @abc.abstractmethod
    def loss_fn_name(self) -> str:
        pass


class AbstractTransform(abc.ABC, Callable[[torch.Tensor], torch.Tensor]):

    @abc.abstractmethod
    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        pass

    def __repr__(self):
        return self.__class__.__name__ + '()'
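
For orientation, a toy concrete implementation of these interfaces might look like the following (a sketch, not part of this commit; `LinearNet` is a made-up name, and how the trainer consumes the optional second return value of `forward` is an assumption):

```python
import torch

from typing import Optional, Tuple, Union

from ignite_trainer import AbstractNet


class LinearNet(AbstractNet):
    """Toy net: a single linear layer trained with cross-entropy."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.fc = torch.nn.Linear(in_features, num_classes)

    def forward(self, x: torch.Tensor, y: Optional[torch.Tensor] = None
                ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        y_pred = self.fc(x)
        # Mirror the interface: pass targets through alongside predictions.
        return (y_pred, y) if y is not None else y_pred

    def loss_fn(self, y_pred: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.cross_entropy(y_pred, y)

    @property
    def loss_fn_name(self) -> str:
        return 'Cross-Entropy'
```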