Add OpenAI ArXiv Dataset #299

magdalendobson · 2024-08-07T15:19:31Z

This PR adds a dataset of about 2 million (2,321,096) 1536-dimensional OpenAI ada-002 embeddings of ArXiv paper abstracts. The original ArXiv dataset was released by Cornell University on Kaggle under a CC0 license. We also provide a set of 20,000 queries, likewise embedded from ArXiv abstracts, along with ground truth for both the first 100,000 vectors and the full 2,321,096 vectors.
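
For readers unfamiliar with what "ground truth" means for a prefix versus the full set, here is a minimal brute-force sketch. It is not the code used to produce the files in this PR. It assumes inner-product similarity (an assumption, not stated in the description), and it uses small random toy vectors in place of the real 1536-dimensional embeddings. The names `inner_product`, `brute_force_top_k`, `PREFIX`, and `K` are hypothetical.

```python
import heapq
import random


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(base, query, k):
    """Return ids of the k base vectors most similar to query, best first.

    Similarity is assumed to be inner product (larger is better).
    """
    return heapq.nlargest(
        k, range(len(base)), key=lambda i: inner_product(base[i], query)
    )


# Toy stand-ins: the real dataset has 2,321,096 base vectors of dimension 1536
# and 20,000 queries; these sizes are shrunk for illustration only.
random.seed(0)
DIM, N_BASE, N_QUERIES, PREFIX, K = 8, 200, 3, 50, 10
base = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_BASE)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_QUERIES)]

# Ground truth against a prefix of the base set and against the full base set.
gt_prefix = [brute_force_top_k(base[:PREFIX], q, K) for q in queries]
gt_full = [brute_force_top_k(base, q, K) for q in queries]
```

Ground truth for a prefix is computed only over that slice, so every id in `gt_prefix` is below `PREFIX`, while `gt_full` may reference any base vector.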

magdalendobson · 2024-08-09T15:09:24Z

Added a comment in datasets.py describing the dataset; marking as ready for review.

harsha-simhadri · 2024-08-08T02:40:34Z

benchmark/datasets.py

@@ -619,6 +619,44 @@ def get_dataset(self):
    def distance(self):
        return "ip"

+class OpenAIArXivDataset(DatasetCompetitionFormat):


Could you please add the description in comments here?

magdalendobson added 2 commits August 7, 2024 08:06

added openai arxiv dataset

a764cfe

added comment describing dataset

99d447e

magdalendobson marked this pull request as ready for review August 9, 2024 15:09

harsha-simhadri approved these changes Aug 9, 2024


harsha-simhadri merged commit df1e53a into main Aug 9, 2024
0 of 76 checks passed
