
Commit

added descriptions to new datasets
magdalendobson committed Jul 31, 2024
1 parent fe29759 commit b35329d
Showing 1 changed file with 11 additions and 0 deletions.
11 changes: 11 additions & 0 deletions benchmark/datasets.py
@@ -498,6 +498,11 @@ def __init__(self, nb_M=1000):
    def distance(self):
        return "euclidean"

'''
The base vectors of Wikipedia-Cohere consist of 35 million Cohere embeddings of the title and text of English Wikipedia articles.
The query set consists of 5000 Cohere embeddings of the title and text of Simple English Wikipedia articles.
See https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings?row=2 for more details.
'''
class WikipediaDataset(BillionScaleDatasetCompetitionFormat):
    def __init__(self, nb=35000000):
        self.nb = nb
@@ -537,6 +542,12 @@ def get_dataset(self):
    def distance(self):
        return "ip"

'''
The MSMarco Web Search dataset has 100,924,960 base vectors consisting of embeddings of web documents
from the ClueWeb22 document dataset, while its 9,374 queries correspond to web queries collected from
the Microsoft Bing search engine.
See https://github.com/microsoft/MS-MARCO-Web-Search for more details.
'''
class MSMarcoWebSearchDataset(BillionScaleDatasetCompetitionFormat):
    def __init__(self, nb=101070374):
        self.nb = nb
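
The distance() methods visible in this diff return either "euclidean" or "ip" (inner product). As a rough illustration of why the declared metric matters, the sketch below runs a brute-force nearest-neighbor search under both metrics on toy vectors. The helper names (inner_product, squared_euclidean, nearest) and the data are hypothetical and are not part of benchmark/datasets.py; only the two metric strings come from the diff.

```python
# Illustrative sketch only: these helpers are hypothetical and do not
# exist in benchmark/datasets.py. Only the metric names "ip" and
# "euclidean" are taken from the diff.

def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Squared L2 distance; ranks neighbors the same way as L2 distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, base, metric):
    """Return the index of the best base vector under the given metric."""
    indices = range(len(base))
    if metric == "ip":
        # Larger inner product means a better match.
        return max(indices, key=lambda i: inner_product(query, base[i]))
    if metric == "euclidean":
        # Smaller distance means a better match.
        return min(indices, key=lambda i: squared_euclidean(query, base[i]))
    raise ValueError(f"unknown metric: {metric}")


# Toy data, chosen so that the two metrics pick different neighbors.
base = [[1.0, 0.0], [3.0, 3.0], [0.9, 0.1]]
query = [1.0, 0.0]

ip_best = nearest(query, base, "ip")
l2_best = nearest(query, base, "euclidean")
```

On this toy data the inner-product search selects the long vector at index 1, while the Euclidean search selects the identical vector at index 0, so ground truth computed under one metric would not be valid for the other.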