Mixture-of-Modality-Experts for Unified Image Aesthetic Assessment with Multi-Level Adaptation
Image aesthetic assessment (IAA) is challenging because human judgement of aesthetics is a holistic integration of multi-level information, including color, composition, and semantics. Most existing methods try to learn such information merely from images, i.e., the Vision-only IAA (VIAA) task. Recently, a number of Multi-modal IAA (MIAA) methods have been proposed to additionally exploit text comments and capture more comprehensive information. However, these MIAA methods are inapplicable, or show limited performance, when no text comments are available. To address this challenge, we propose a unified IAA framework, termed AesFormer, built on mixtures of vision-language Transformers. Specifically, AesFormer first learns aligned image-text representations through contrastive learning and uses a vision-language head for MIAA prediction. Afterward, we propose a multi-level adaptation method that adapts the learned MIAA model to the case without text comments, using a separate vision head for VIAA prediction. Extensive experiments are conducted on the AVA, Photo.net, and JAS datasets. The results show that AesFormer significantly outperforms previous methods on both MIAA and VIAA tasks across all datasets. Remarkably, all three main metrics, i.e., classification accuracy, PLCC, and SRCC, surpass 90% for the first time on the AVA dataset. Our code and models are available at: https://github.com/AiArt-HDU/aesformer.
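The abstract describes a two-branch design: contrastively aligned image-text encoders with a vision-language head for MIAA, plus a vision-only head adapted from the MIAA branch for VIAA. The sketch below illustrates that flow in plain PyTorch; the stand-in encoders, dimensions, and loss wiring are illustrative assumptions, not the released implementation (see the repository for the actual model).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AesFormerSketch(nn.Module):
    """Toy stand-in for the two-branch design described in the abstract."""

    def __init__(self, dim=256, vocab_size=10000):
        super().__init__()
        # Stand-in encoders; the paper uses vision-language Transformers.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify-style stem
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.text_encoder = nn.EmbeddingBag(vocab_size, dim)  # mean-pooled tokens
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature
        # Vision-language head for MIAA (image + text comments available).
        self.miaa_head = nn.Linear(2 * dim, 1)
        # Vision-only head for VIAA (no text comments at inference).
        self.viaa_head = nn.Linear(dim, 1)

    def forward(self, images, token_ids=None):
        v = F.normalize(self.image_encoder(images), dim=-1)
        if token_ids is None:
            return self.viaa_head(v)                       # VIAA prediction
        t = F.normalize(self.text_encoder(token_ids), dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()        # image-text alignment
        score = self.miaa_head(torch.cat([v, t], dim=-1))  # MIAA prediction
        return score, logits


model = AesFormerSketch()
images = torch.randn(4, 3, 224, 224)
tokens = torch.randint(0, 10000, (4, 32))
miaa_score, align_logits = model(images, tokens)

# Symmetric InfoNCE-style contrastive loss over the batch.
targets = torch.arange(4)
loss_align = (F.cross_entropy(align_logits, targets)
              + F.cross_entropy(align_logits.t(), targets)) / 2
# One plausible reading of the adaptation step: push the vision-only
# head toward the frozen MIAA prediction when text is absent.
loss_adapt = F.mse_loss(model(images), miaa_score.detach())
```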
- Linux or macOS
- Python 3.8
- PyTorch 1.8
- CPU or NVIDIA GPU + CUDA cuDNN
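A quick way to confirm the environment matches the list above (a hedged check, not part of the released code):

```python
import torch

print(torch.__version__)                     # expect 1.8.x
print(torch.cuda.is_available())             # True if an NVIDIA GPU is usable
print(torch.backends.cudnn.is_available())   # True if cuDNN is present
```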
- AesFormer-T: [Baidu Cloud] pwd: 6guc
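Once the checkpoint is downloaded, a typical PyTorch loading pattern looks like the sketch below; the file name and state-dict layout are assumptions, since they depend on how the archive is packaged.

```python
import torch

# Hypothetical file name; use the actual file from the Baidu Cloud archive.
ckpt = torch.load("aesformer_t.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # unwrap if saved as a training checkpoint
# model.load_state_dict(state_dict)        # `model` = the AesFormer-T definition from this repo
print(sorted(state_dict.keys())[:5])       # inspect parameter names
```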