Commit: create repo
AndreyGuzhov committed Jun 28, 2021
0 parents commit 611d974
Showing 35 changed files with 4,069 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
*.pt filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
47 changes: 47 additions & 0 deletions README.md
@@ -0,0 +1,47 @@
# AudioCLIP
## Extending [CLIP](https://github.com/openai/CLIP) to Image, Text and Audio

This repository contains the implementation of the models described in the paper [arXiv:2106.13043](https://arxiv.org/abs/2106.13043).
This work is based on our previous works:
* [ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio (2021)](https://github.com/AndreyGuzhov/ESResNeXt-fbsp).
* [ESResNet: Environmental Sound Classification Based on Visual Domain Models (2020)](https://github.com/AndreyGuzhov/ESResNet).

### Abstract

In the past, the rapidly evolving field of sound classification benefited greatly from the application of methods from other domains.
Today, we observe a trend toward fusing domain-specific tasks and approaches, which provides the community with new outstanding models.

In this work, we present an extension of the CLIP model that handles audio in addition to text and images.
Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset.
Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion.
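
To make the zero-shot mechanism concrete, here is a minimal sketch of classification by audio-text embedding similarity (plain PyTorch; the function and tensor names are illustrative, not this repository's API):

```python
import torch


def zero_shot_classify(audio_emb: torch.Tensor, text_embs: torch.Tensor) -> int:
    """Pick the class whose text-prompt embedding is closest to the audio clip.

    audio_emb: (d,) embedding of one audio clip.
    text_embs: (num_classes, d) embeddings of the class-name prompts.
    """
    # L2-normalize so that dot products equal cosine similarities.
    audio_emb = audio_emb / audio_emb.norm()
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    # The most similar prompt wins; no training on the target dataset is needed.
    return int((text_embs @ audio_emb).argmax())
```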

AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets.
Furthermore, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively).

Finally, we assess the cross-modal querying performance of the proposed model, as well as the influence of full and partial training on the results.
For the sake of reproducibility, our code is published.

### How to Run the Model

The required Python version is >= 3.7.

#### AudioCLIP

##### On the [ESC-50](https://github.com/karolpiczak/ESC-50) dataset
```
python main.py --config protocols/audioclip-esc50.json --Dataset.args.root /path/to/ESC50
```

##### On the [UrbanSound8K](https://urbansounddataset.weebly.com/) dataset
```
python main.py --config protocols/audioclip-us8k.json --Dataset.args.root /path/to/UrbanSound8K
```
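
The pretrained snapshots in `assets/` (tracked via Git LFS) can be inspected with plain PyTorch before wiring them into the training code. A hedged sketch; whether each `.pt` file holds a raw `state_dict` or a full checkpoint object is an assumption to verify:

```python
import torch

# Path taken from this commit's assets/ folder; the loading logic is an assumption.
ckpt = torch.load('assets/AudioCLIP-Full-Training.pt', map_location='cpu')

# Peek at the structure to see what the snapshot contains.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```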

### Cite Us

```
@misc{guzhov2021audioclip,
  title={AudioCLIP: Extending CLIP to Image, Text and Audio},
  author={Andrey Guzhov and Federico Raue and Jörn Hees and Andreas Dengel},
  year={2021},
  eprint={2106.13043},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```
3 changes: 3 additions & 0 deletions assets/AudioCLIP-Full-Training.pt
Git LFS file not shown
3 changes: 3 additions & 0 deletions assets/AudioCLIP-Partial-Training.pt
Git LFS file not shown
3 changes: 3 additions & 0 deletions assets/CLIP.pt
Git LFS file not shown
3 changes: 3 additions & 0 deletions assets/ESRNXFBSP.pt
Git LFS file not shown
1 change: 1 addition & 0 deletions assets/README.md
@@ -0,0 +1 @@
This folder contains snapshots of the pre-trained models.
3 changes: 3 additions & 0 deletions assets/bpe_simple_vocab_16e6.txt.gz
Git LFS file not shown
3 changes: 3 additions & 0 deletions ignite_trainer/README.md
@@ -0,0 +1,3 @@
# Training Wrapper

Utility code to run training and evaluation of the model.
16 changes: 16 additions & 0 deletions ignite_trainer/__init__.py
@@ -0,0 +1,16 @@
import os as _os
import sys as _sys

from ignite_trainer.version import __version__
from ._trainer import main, run
from ._utils import load_class
from ._interfaces import AbstractNet, AbstractTransform

__all__ = [
    '__version__',
    'main', 'run',
    'load_class',
    'AbstractNet', 'AbstractTransform'
]

# Make modules in the current working directory importable,
# e.g. when classes are loaded dynamically via load_class.
_sys.path.extend([_os.getcwd()])
37 changes: 37 additions & 0 deletions ignite_trainer/_interfaces.py
@@ -0,0 +1,37 @@
import abc
import torch

from typing import Tuple
from typing import Union
from typing import Callable
from typing import Optional


TensorPair = Tuple[torch.Tensor, torch.Tensor]
TensorOrTwo = Union[torch.Tensor, TensorPair]


class AbstractNet(abc.ABC, torch.nn.Module):

    @abc.abstractmethod
    def forward(self, x: torch.Tensor, y: Optional[torch.Tensor] = None) -> TensorOrTwo:
        pass

    @abc.abstractmethod
    def loss_fn(self, y_pred: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        pass

    @property
    @abc.abstractmethod
    def loss_fn_name(self) -> str:
        pass


class AbstractTransform(abc.ABC, Callable[[torch.Tensor], torch.Tensor]):

    @abc.abstractmethod
    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        pass

    def __repr__(self):
        return self.__class__.__name__ + '()'
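
For orientation, a toy concrete implementation of these interfaces might look like the following (a sketch, not part of this commit; `LinearNet` is a made-up name, and how the trainer consumes the optional second return value of `forward` is an assumption):

```python
import torch

from typing import Optional, Tuple, Union

from ignite_trainer import AbstractNet


class LinearNet(AbstractNet):
    """Toy net: a single linear layer trained with cross-entropy."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.fc = torch.nn.Linear(in_features, num_classes)

    def forward(self, x: torch.Tensor, y: Optional[torch.Tensor] = None
                ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        y_pred = self.fc(x)
        # Mirror the interface: pass targets through alongside predictions.
        return (y_pred, y) if y is not None else y_pred

    def loss_fn(self, y_pred: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.cross_entropy(y_pred, y)

    @property
    def loss_fn_name(self) -> str:
        return 'Cross-Entropy'
```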