Add multimodal dataset based on COCO text-image pairs #559

Open
wants to merge 2 commits into
base: main
@fabiocarrara fabiocarrara commented Dec 20, 2024

This PR adds two ANN datasets derived from the COCO text-image pairs dataset:

  1. COCO Text-to-Image Multimodal Dataset (coco-t2i-512-angular):
    • Text is used as queries, and images comprise the search set.
    • This dataset is challenging because of the distribution shift between the queries and the search set: the 100 nearest neighbors of the queries have a cosine similarity of 0.30 ± 0.02.
  2. COCO Image-to-Image Intra-modal Dataset (coco-i2i-512-angular):
    • Images are used as both queries and search set.
    • This dataset does not exhibit the distribution shift and can serve as a reference, sharing the same datapoints as the t2i dataset.
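
One way to obtain a statistic like the one quoted for the t2i dataset is sketched below. This is only an illustration under assumptions: it reads the figure as mean ± standard deviation over the top-100 cosine similarities of all queries, uses random vectors in place of the CLIP features, and `topk_cosine_stats` is a hypothetical helper, not part of this PR.

```python
import numpy as np


def topk_cosine_stats(queries, corpus, k=100):
    """Mean and std of the cosine similarity between each query and its k nearest corpus vectors."""
    # Normalize rows so that the dot product equals the cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (n_queries, n_corpus)
    # Keep the k largest similarities per query (order within the top-k is irrelevant here).
    topk = np.partition(sims, -k, axis=1)[:, -k:]
    return float(topk.mean()), float(topk.std())


# Synthetic stand-in data (assumption): 512-dimensional random vectors.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((2000, 512)).astype(np.float32)
queries = rng.standard_normal((50, 512)).astype(np.float32)

mean_sim, std_sim = topk_cosine_stats(queries, corpus, k=100)
print(f"top-100 cosine similarity: {mean_sim:.3f} +- {std_sim:.3f}")
```

Running the same helper on the t2i and i2i query sets against the shared search set would make the gap between the two modalities directly comparable.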

Extraction Process
Feature vectors are the CLS output tokens of the visual or textual encoder of OpenAI's CLIP with the ViT-B/16 architecture (512 dimensions). Thanks to @lorebianchi98 and @mesnico for performing the extraction and preparation.

Split definition
Based on Karpathy's split of COCO 2014:

  • The search sets include vectors extracted from the images of the training set (113,287) and of the validation set (5,000), for a total of 118,287 vectors.
  • Queries:
    • Visual (i2i): 5,000 vectors from test set images.
    • Textual (t2i): 5,000 vectors from the first caption (out of the five available) of test set images.
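
The split above can be sketched with boolean masks. This is a minimal illustration: the `splits` array and the caption id table are hypothetical stand-ins for the real COCO metadata, and only the sizes are taken from the description.

```python
import numpy as np

# Split sizes as stated in the description (Karpathy's split of COCO 2014).
N_TRAIN, N_VAL, N_TEST = 113_287, 5_000, 5_000

# Hypothetical per-image split labels; real labels would come from the split file.
splits = np.array(["train"] * N_TRAIN + ["val"] * N_VAL + ["test"] * N_TEST)

# Search set: images from the training and validation sets.
search_mask = np.isin(splits, ["train", "val"])
# Queries: images from the test set.
query_mask = splits == "test"

n_search = int(search_mask.sum())
n_queries = int(query_mask.sum())

# Hypothetical caption ids, five per test image; t2i keeps only the first caption.
caption_ids = np.arange(n_queries * 5).reshape(n_queries, 5)
first_caption_ids = caption_ids[:, 0]

print(n_search, n_queries, first_caption_ids.shape)
```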

@maumueller: Lucia (@vadicamo) told me you were looking for a multimodal dataset for the SISAP indexing challenge. You can check whether these are a good fit, if one is still needed. Let me know if you'd like more details.

@fabiocarrara fabiocarrara marked this pull request as ready for review January 7, 2025 16:11

fabiocarrara commented Jan 8, 2025

Some results on an Intel Core™ i9-9900K with 64 GB of RAM. I've used the same x and y limits in both plots to make them easier to compare. Note the shift towards lower recall in t2i.

[Plot: coco-t2i-512-angular]

[Plot: coco-i2i-512-angular]

maumueller (Collaborator) commented

Thanks for the contribution, @fabiocarrara! I particularly like that the order of algorithms is changing between the different datasets.

What worries me about the current setup is that we are trusting your HDF5 computation. Would it be possible to change the pipeline so that only the vectors are downloaded, and ann-benchmarks performs the train/test split and computes the ground truth? For example, even the openblas baseline doesn't seem to achieve recall 1 on the image-to-image version.
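
For reference, computing the ground truth from raw vectors amounts to an exact brute-force search. The sketch below assumes angular distance defined as 1 minus cosine similarity and an index-based recall; `exact_neighbors` and `recall_at_k` are hypothetical helpers, not ann-benchmarks functions, and the metric ann-benchmarks itself uses may be defined differently.

```python
import numpy as np


def exact_neighbors(train, test, k=100):
    """Exact k nearest neighbors of each test vector under angular (cosine) distance."""
    train_n = train / np.linalg.norm(train, axis=1, keepdims=True)
    test_n = test / np.linalg.norm(test, axis=1, keepdims=True)
    dist = 1.0 - test_n @ train_n.T  # shape: (n_test, n_train)
    # Stable sort so that ties are broken by index, deterministically.
    idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)


def recall_at_k(found, truth):
    """Fraction of true neighbor indices that also appear in the returned lists."""
    hits = sum(len(set(f) & set(t)) for f, t in zip(found, truth))
    return hits / truth.size


# Synthetic stand-in data (assumption): random vectors instead of CLIP features.
rng = np.random.default_rng(0)
train = rng.standard_normal((3000, 512)).astype(np.float32)
test = rng.standard_normal((20, 512)).astype(np.float32)

truth_idx, truth_dist = exact_neighbors(train, test, k=10)
# An exact method evaluated against its own ground truth should reach recall 1.
baseline_idx, _ = exact_neighbors(train, test, k=10)
recall = recall_at_k(baseline_idx, truth_idx)
print(recall)
```

If an exact baseline falls short of recall 1 against such a ground truth, the two were likely computed from different data, in different precision, or with a different distance definition.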

fabiocarrara (Author) commented

Thanks for the tip, @maumueller. I've changed the dataset function to download the vectors and perform the split and ground-truth computation (with train_test_split() and write_output()).
I ignored the COCO splits, put all the features back together, and randomly picked 10,000 test features.
However, I picked the same queries for the two modalities to keep them aligned.
Below are the updated results. Bruteforce still cannot reach 100% recall. I have not investigated further yet; could it be due to the float16 datatype?

[Plot: coco-t2i-512-angular]

[Plot: coco-i2i-512-angular]
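
One way to probe the float16 question is to compare the neighbors found on full-precision vectors with those found after rounding the same vectors to float16. The sketch below is an assumption-laden illustration on random data with hypothetical helper names; it shows how the hypothesis could be tested and does not establish the cause of the recall gap seen here.

```python
import numpy as np


def topk_indices(train, test, k):
    """Indices of the k most cosine-similar train vectors for each test vector."""
    train_n = train / np.linalg.norm(train, axis=1, keepdims=True)
    test_n = test / np.linalg.norm(test, axis=1, keepdims=True)
    sims = test_n @ train_n.T
    return np.argsort(-sims, axis=1, kind="stable")[:, :k]


def index_recall(found, truth):
    """Fraction of reference neighbor indices recovered in the found lists."""
    hits = sum(len(set(f) & set(t)) for f, t in zip(found, truth))
    return hits / truth.size


# Synthetic stand-in data (assumption): random float32 vectors.
rng = np.random.default_rng(0)
train32 = rng.standard_normal((5000, 512)).astype(np.float32)
test32 = rng.standard_normal((100, 512)).astype(np.float32)

# Reference neighbors from the full-precision vectors.
truth = topk_indices(train32, test32, k=10)

# Round to float16 (as if stored that way), then upcast so the arithmetic itself runs in float32.
train16 = train32.astype(np.float16).astype(np.float32)
test16 = test32.astype(np.float16).astype(np.float32)
found = topk_indices(train16, test16, k=10)

recall = index_recall(found, truth)
print(f"recall of float16-rounded search vs. float32 reference: {recall:.4f}")
```

A recall below 1 in such a comparison would indicate that rounding alone can reorder near-tied neighbors; a recall of exactly 1 would point towards another cause.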
