Official PyTorch implementation and benchmark dataset for IGARSS 2024 ORAL paper. [arXiv]
In recent years, earth observation (EO) through remote sensing (RS) has witnessed an enormous growth in data volume, creating a challenge in managing and extracting relevant information. Remote sensing image retrieval (RSIR), which aims to search and retrieve images from RS image archives, has emerged as a key solution. However, RSIR methods encounter a major limitation: the reliance on a query of single modality. This constraint often restricts users from fully expressing their specific requirements.
To tackle this constraint, we introduce a new task, remote sensing composed image retrieval (RSCIR). RSCIR, which integrates both image and text in the search query, is designed to retrieve images that are not only visually similar to the query image but also relevant to the details of the accompanying query text. Our RSCIR approach, called WeiCom, is expressive, flexible and training-free. It builds on a vision-language model and uses a weighting parameter λ to steer the results toward the query image (λ → 0) or the query text (λ → 1).
In this work, we recognize, present and qualitatively evaluate the capabilities and challenges of RSCIR. We demonstrate how users can now pair a query image with a query text specifying modifications related to color, context, density, existence, quantity, shape, size or texture of one or more classes.
Quantitatively, we focus on color, context, density, existence, quantity, and shape modifications, establishing a new benchmark dataset, called PatternCom, and an evaluation protocol.
In summary, we make the following contributions:
- We introduce remote sensing composed image retrieval (RSCIR), accompanied by PatternCom, a new benchmark dataset.
- We introduce WeiCom, a training-free method utilizing a modality control parameter for more image- or text-oriented results according to the needs of each search.
- We evaluate the performance of WeiCom both qualitatively and quantitatively, setting the state of the art on RSCIR.
For our experiments, you need to download CLIP and RemoteCLIP, both with a ViT-L/14 image encoder. After downloading, place them inside the models/ folder.
The code folder structure should then look like this:
rscir/
|-- .github/
|-- models/
|   |-- CLIP-ViT-L-14.bin
|   |-- RemoteCLIP-ViT-L-14.pt
|-- .gitignore
|-- LICENSE
|-- README.md
|-- evaluate.py
|-- extract_features.py
|-- utils.py
Create this environment for our experiments:
conda create -n rscir python=3.9 -y
conda activate rscir
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install open_clip_torch
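Before extracting features, it can help to confirm that the environment actually contains the packages installed above. The following is a minimal sketch rather than part of this repository: `missing_packages` is a hypothetical helper, and the only fact it relies on is that the pip package `open_clip_torch` is imported under the name `open_clip`.

```python
import importlib.util
import sys

# Import names of the packages installed by the commands above.
# The pip package `open_clip_torch` is imported as `open_clip`.
REQUIRED = ("torch", "torchvision", "open_clip")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it inside the activated `rscir` environment; any name it reports as missing points to an installation step that did not complete.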
PatternCom is based on PatternNet, a large-scale, high-resolution remote sensing dataset that comprises 38 classes, with each class containing 800 images of 256×256 pixels.
Download PatternNet from here and unzip it into the PatternNet/ folder. Download patternnet.csv from here and place it in the same folder. Finally, download PatternCom from here and place it in the same folder too.
The PatternNet/ folder structure should look like this:
PatternNet/
|-- images/
|-- PatternCom/
|   |-- color.csv
|   |-- context.csv
|   |-- density.csv
|   |-- existence.csv
|   |-- quantity.csv
|   |-- shape.csv
|-- patternnet.csv
|-- patternnet_description.pdf
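A misplaced file is an easy mistake to make at this point, so the layout above can be checked programmatically. This is a minimal sketch, not a script shipped with the repository: `missing_entries` is a hypothetical helper, and the list of paths is simply copied from the listing above (the PDF description is left out, on the assumption that nothing in the later steps depends on it).

```python
from pathlib import Path

# Relative paths copied from the folder listing above.
EXPECTED = [
    "images",
    "PatternCom/color.csv",
    "PatternCom/context.csv",
    "PatternCom/density.csv",
    "PatternCom/existence.csv",
    "PatternCom/quantity.csv",
    "PatternCom/shape.csv",
    "patternnet.csv",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_entries("/path/to/PatternNet/")` returns an empty list when every listed file and folder is in place.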
To extract CLIP or RemoteCLIP features from the PatternNet dataset, run:
python extract_features.py --model_name clip --dataset_path /path/to/PatternNet/
Replace clip with remoteclip for RemoteCLIP features.
Note that this will save the features as pickle files inside the PatternNet/features/ folder. The new folder structure should then look like this:
PatternNet/
|-- features/
|   |-- patternnet_clip.pkl
|   |-- patternnet_remoteclip.pkl
|-- images/
|-- PatternCom/
|-- patternnet.csv
|-- patternnet_description.pdf
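To look inside one of the saved feature files, something like the sketch below may be useful. The internal layout of the pickle is not described here, so the code makes no assumption about it: `load_features` and `summarize` are hypothetical helpers that only report types, lengths and shapes of whatever was stored.

```python
import pickle
from pathlib import Path


def summarize(obj, depth=0, max_depth=2):
    """Return a short, structure-agnostic description of a loaded object."""
    kind = type(obj).__name__
    shape = getattr(obj, "shape", None)  # arrays and tensors expose .shape
    if shape is not None:
        return f"{kind} with shape {tuple(shape)}"
    if isinstance(obj, dict) and depth < max_depth:
        inner = ", ".join(
            f"{key!r}: {summarize(value, depth + 1, max_depth)}"
            for key, value in obj.items()
        )
        return f"dict {{{inner}}}"
    if isinstance(obj, (list, tuple)):
        return f"{kind} of length {len(obj)}"
    return kind


def load_features(path):
    """Load a pickle file. Only unpickle files you trust: pickle can run arbitrary code."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

For example, `print(summarize(load_features("/path/to/PatternNet/features/patternnet_clip.pkl")))` gives a one-line overview. If the file holds PyTorch tensors, it has to be loaded in an environment where PyTorch is installed.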
To evaluate extracted features on PatternCom RSCIR using baselines, run:
python evaluate.py --model_name clip --dataset_path /path/to/PatternNet/ --methods "Image only" "Text only" "Average Similarities"
Replace clip with remoteclip for RemoteCLIP features.
To evaluate extracted features on PatternCom RSCIR using WeiCom, run:
python evaluate.py --model_name clip --dataset_path /path/to/PatternNet/ --methods "Weighted Similarities Norm" --lambdas 0.5
Replace clip with remoteclip for RemoteCLIP features.
Following our ablation, you can use the optimal --lambdas 0.3 for CLIP and --lambdas 0.6 for RemoteCLIP.
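To give an intuition for what the λ value controls, here is a minimal NumPy sketch of a weighted fusion of image-to-image and text-to-image similarities. It is an illustration under stated assumptions, not the code in evaluate.py: the function names are hypothetical, and the min-max scaling is only a placeholder, since the normalization behind "Weighted Similarities Norm" is not described in this README. The convex combination follows the convention stated earlier, where λ → 0 gives image-oriented and λ → 1 gives text-oriented results.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def min_max(scores):
    """Rescale scores to [0, 1]. Placeholder normalization, assumed for illustration."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)


def weighted_ranking(image_query, text_query, database, lam=0.5):
    """Rank database items by (1 - lam) * image_sim + lam * text_sim."""
    db = l2_normalize(database)
    sim_image = min_max(db @ l2_normalize(image_query))
    sim_text = min_max(db @ l2_normalize(text_query))
    fused = (1.0 - lam) * sim_image + lam * sim_text
    # Indices of database items, best match first.
    return np.argsort(-fused, kind="stable"), fused
```

With `lam=0` the ranking depends only on the query image, with `lam=1` only on the query text, and intermediate values trade one off against the other.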
Running the code as described above, you should get the following results. In these tables, for each attribute value of an attribute (e.g. "rectangular" of Shape), the mAP is averaged over all the remaining values of that attribute (e.g. "oval" of Shape). Avg represents the average mAP over all combinations.
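For readers unfamiliar with the metric, the sketch below shows the standard definition of average precision (AP) for one ranked list and its mean over several queries. It is a generic illustration, not the evaluation code of this repository: the function names are hypothetical, each list is assumed to rank the whole database, and the result is scaled by 100 on the assumption that the tables report percentages.

```python
def average_precision(ranked_relevance):
    """AP of one ranked list, given 1/0 relevance flags in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of AP over all queries, scaled to a percentage."""
    aps = [average_precision(flags) for flags in ranked_lists]
    return 100.0 * sum(aps) / len(aps) if aps else 0.0
```

As a worked example, a ranking with relevance flags `[1, 0, 1, 0]` has AP = (1/1 + 2/3) / 2 ≈ 0.833, because the two relevant items are found at ranks 1 and 3.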
Method | Color | Context | Density | Existence | Quantity | Shape | Avg |
---|---|---|---|---|---|---|---|
Text | 13.47 | 4.83 | 3.58 | 2.00 | 3.31 | 6.22 | 5.57 |
Image | 14.66 | 8.32 | 13.49 | 16.47 | 7.84 | 15.76 | 12.74 |
Text & Image | 23.13 | 11.02 | 15.87 | 16.93 | 10.13 | 21.38 | 16.41 |
| | 46.08 | 17.45 | 16.49 | 8.36 | 18.15 | 23.97 | 21.75 |
| | 46.74 | 20.97 | 22.07 | 13.22 | 20.96 | 26.22 | 25.03 |
Method | Color | Context | Density | Existence | Quantity | Shape | Avg |
---|---|---|---|---|---|---|---|
Text | 10.75 | 8.87 | 22.16 | 6.98 | 8.25 | 24.12 | 13.52 |
Image | 14.40 | 6.62 | 15.11 | 13.10 | 6.99 | 15.18 | 11.90 |
Text & Image | 23.67 | 10.01 | 18.45 | 13.98 | 7.97 | 19.63 | 15.62 |
| | 43.68 | 31.45 | 39.94 | 14.92 | 20.51 | 29.78 | 30.05 |
| | 41.04 | 31.59 | 41.56 | 14.56 | 20.79 | 31.24 | 30.13 |
NTUA thanks NVIDIA for the support with the donation of GPU hardware.
This repository is released under the Apache 2.0 license as found in the LICENSE file.
If you find this repository useful, please consider giving it a star 🌟 and citing:
@inproceedings{psomas2024composed,
title={Composed Image Retrieval for Remote Sensing},
author={Psomas, B. and Kakogeorgiou, I. and Efthymiadis, N. and Tolias, G. and Chum, O. and Avrithis, Y. and Karantzalos, K.},
booktitle={IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium},
year={2024}
}