Mixture-of-Modality-Experts for Unified Image Aesthetic Assessment with Multi-Level Adaptation
Image aesthetic assessment (IAA) is challenging because human judgement of aesthetics is a holistic integration of multi-level information, including color, composition, and semantics. Most existing methods try to learn such information merely from images, i.e., the Vision-only IAA (VIAA) task. Recently, a number of Multi-modal IAA (MIAA) methods have been proposed to additionally exploit text comments and capture more comprehensive information. However, these MIAA methods are inapplicable, or show limited performance, when no text comments are available. To address this challenge, we propose a unified IAA framework, termed AesFormer, built on mixtures of vision-language Transformers. Specifically, AesFormer first learns aligned image-text representations through contrastive learning and uses a vision-language head for MIAA prediction. Afterward, we propose a multi-level adaptation method that adapts the learned MIAA model to the case without text comments, using a separate vision head for VIAA prediction. Extensive experiments are conducted on the AVA, Photo.net, and JAS datasets. The results show that AesFormer significantly outperforms previous methods on both MIAA and VIAA tasks across all datasets. Remarkably, all three main metrics, i.e., classification accuracy, PLCC, and SRCC, surpass 90% for the first time on the AVA dataset. Our code and models are available at: https://github.com/AiArt-HDU/aesformer.
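The abstract describes a two-branch design: contrastively aligned image-text encoders with a vision-language head for MIAA, plus a vision-only head adapted from the MIAA branch for VIAA. The sketch below illustrates that flow in plain PyTorch; the stand-in encoders, dimensions, and loss wiring are illustrative assumptions, not the released implementation (see the repository for the actual model).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AesFormerSketch(nn.Module):
    """Toy stand-in for the two-branch design described in the abstract."""

    def __init__(self, dim=256, vocab_size=10000):
        super().__init__()
        # Stand-in encoders; the paper uses vision-language Transformers.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify-style stem
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.text_encoder = nn.EmbeddingBag(vocab_size, dim)  # mean-pooled tokens
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature
        # Vision-language head for MIAA (image + text comments available).
        self.miaa_head = nn.Linear(2 * dim, 1)
        # Vision-only head for VIAA (no text comments at inference).
        self.viaa_head = nn.Linear(dim, 1)

    def forward(self, images, token_ids=None):
        v = F.normalize(self.image_encoder(images), dim=-1)
        if token_ids is None:
            return self.viaa_head(v)                       # VIAA prediction
        t = F.normalize(self.text_encoder(token_ids), dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()        # image-text alignment
        score = self.miaa_head(torch.cat([v, t], dim=-1))  # MIAA prediction
        return score, logits


model = AesFormerSketch()
images = torch.randn(4, 3, 224, 224)
tokens = torch.randint(0, 10000, (4, 32))
miaa_score, align_logits = model(images, tokens)

# Symmetric InfoNCE-style contrastive loss over the batch.
targets = torch.arange(4)
loss_align = (F.cross_entropy(align_logits, targets)
              + F.cross_entropy(align_logits.t(), targets)) / 2
# One plausible reading of the adaptation step: push the vision-only
# head toward the frozen MIAA prediction when text is absent.
loss_adapt = F.mse_loss(model(images), miaa_score.detach())
```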
- Linux or macOS
- Python 3.8
- PyTorch 1.8
- CPU or NVIDIA GPU + CUDA cuDNN
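A quick way to confirm the environment matches the list above (a hedged check, not part of the released code):

```python
import torch

print(torch.__version__)                     # expect 1.8.x
print(torch.cuda.is_available())             # True if an NVIDIA GPU is usable
print(torch.backends.cudnn.is_available())   # True if cuDNN is present
```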
- AesFormer-T: [Baidu Cloud] pwd: 6guc
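Once the checkpoint is downloaded, a typical PyTorch loading pattern looks like the sketch below; the file name and state-dict layout are assumptions, since they depend on how the archive is packaged.

```python
import torch

# Hypothetical file name; use the actual file from the Baidu Cloud archive.
ckpt = torch.load("aesformer_t.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # unwrap if saved as a training checkpoint
# model.load_state_dict(state_dict)        # `model` = the AesFormer-T definition from this repo
print(sorted(state_dict.keys())[:5])       # inspect parameter names
```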