How to build an index for a dataset in BEIR? #7
Comments
Hi @BluesPizza, I think your command (1) looks correct. You should be able to use the flat index from command (1) directly for search. Did any warnings or errors appear in the log? What scores did you get? Besides, we have prebuilt Contriever indexes for all BEIR datasets.
@MXueguang I noticed that the scifact index id file provided by Pyserini has 5183 lines, which matches the number of lines in the dataset's corpus.json file, so I encoded that file. However, Pyserini requires each line of the file to be in the format {id:, contents:} in order to build a faiss index, and I'm not sure what should go in the "contents" field. Could you please demonstrate how to process the scifact content to build a faiss index?
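@MXueguang First of all, thank you for your answer. For the scifact dataset, I did not get any notable errors in the first (build) step, and the second step only produced an informational message about the number of nodes being too small, so I don't think the problem has much to do with the commands. However, retrieval performance with my self-built index is far below the numbers reported in the paper, and I suspect this is related to how I processed the scifact dataset. This is where I downloaded the dataset: https://scifact.s3-us-west-2.amazonaws.com/release/latest/data.tar.gz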
Hello, I changed the commands to use '--pooling mean'. test.jsonl:
I downloaded scifact and used the corpus.json in the archive to build a faiss index on it with Contriever (from Hugging Face). But no matter what type of index I choose, its performance is very poor. In fact, I don't even know which part of corpus.json should be used as "contents", so I only used "abstract" as the "contents" and "doc_id" as the "id".
Below are the commands I wrote following the guidelines from Pyserini:
Could you tell me how to handle several datasets in BEIR and use Pyserini to build indexes for them? (It would be best if there are instructions or processed .jsonl file samples.)
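For reference, here is a minimal sketch of the conversion step discussed in this thread, turning each corpus record into the {id, contents} format. It is not a confirmed recipe from the maintainers. It assumes each input line is a JSON object with "doc_id", "title" and "abstract" fields, that "abstract" may be a list of sentences, and that "contents" should be the title followed by the abstract. The title-plus-abstract choice, the function names and the file paths are all assumptions made for illustration; using "abstract" alone, as described above, drops the title, which may or may not explain the gap in scores.

```python
import json


def to_pyserini_record(doc: dict) -> dict:
    """Map one corpus record to the {"id", "contents"} shape.

    Assumption: the record has "doc_id", "title" and "abstract" fields,
    and "abstract" is either a string or a list of sentences.
    """
    abstract = doc.get("abstract", "")
    if isinstance(abstract, list):
        # Join sentence lists into a single passage.
        abstract = " ".join(abstract)
    title = doc.get("title", "")
    # Assumption: title and abstract are concatenated with a single space.
    contents = f"{title} {abstract}".strip()
    return {"id": str(doc["doc_id"]), "contents": contents}


def convert_corpus(src_path: str, dst_path: str) -> int:
    """Rewrite a JSON-lines corpus file; returns the number of records written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = to_pyserini_record(json.loads(line))
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever the archive was extracted.
    count = convert_corpus("data/corpus.jsonl", "collection/scifact.jsonl")
    print(f"wrote {count} records")
```

A quick sanity check after conversion is to confirm that the output has the same number of lines as the input (5183 for scifact, per the comment above) and that no "contents" field is empty.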