
How to build an index for BEIR datasets? #7

Open
BluesPizza opened this issue Apr 21, 2024 · 4 comments

Comments

@BluesPizza

I downloaded scifact and used corpus.json from the zip file to build a faiss index on it with Contriever (from Hugging Face). But no matter what type of index I choose, its performance is very poor. In fact, I don't even know which part of corpus.json should be used as "contents", so I only used "abstract" as "contents" and "doc_id" as "id".
Below are the commands I wrote following the Pyserini guidelines:

  1. python -m pyserini.encode input --corpus /home/scifact.jsonl --fields text --delimiter "\n" --shard-id 0 --shard-num 1 output --embeddings /home/encoding --to-faiss encoder --encoder /home/facebook/contriever --fields text --batch 32 --fp16
  2. python -m pyserini.index.faiss --input /home/encoding --output /home/index --hnsw (I tried all types of index Pyserini supports)

Could you tell me how to handle the various BEIR datasets and use Pyserini to build indexes for them? (It would be best if there were instructions or samples of processed .jsonl files.)
@MXueguang
Contributor

Hi @BluesPizza, I think your command (1) looks correct. You should be able to search directly with the flat index from command (1). Did any warnings/errors appear in the log? What scores did you get?

Besides, we have prebuilt Contriever indexes for all BEIR datasets.
e.g. https://rgw.cs.uwaterloo.ca/pyserini/indexes/faiss/faiss-flat.beir-v1.0.0-scifact.contriever.20230124.tar.gz
Please see details in https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py

@BluesPizza
Author

@MXueguang
Thank you for your response. For the scifact dataset, I didn't encounter any specific errors during the first step of index construction, and during the second step the only warning I received was about an insufficient node count.
Therefore, I believe it is not an issue with the commands. However, retrieval performance with the index I built myself is much lower than the numbers reported in the paper, so I suspect there is an issue with my processing of the scifact dataset. Here is the dataset download link: https://scifact.s3-us-west-2.amazonaws.com/release/latest/data.tar.gz

I noticed that the scifact index id file provided by Pyserini has 5183 lines, which matches the number of lines in the corpus.json file of the dataset, so I encoded that file. However, Pyserini requires each line of the file to be in the format {id:, contents:} to build a faiss index, and I'm not sure what content should go in the "contents" field.

Could you please demonstrate how to process the scifact content to build a faiss index?
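In the original scifact release (the S3 tarball linked above, as opposed to the BEIR repackaging), each corpus.jsonl record has a `doc_id`, a `title`, and an `abstract` given as a list of sentences. A minimal sketch of mapping one such record to Pyserini's {id, contents} shape, under the assumption that those field names match your copy of the data:

```python
import json

def scifact_doc_to_pyserini(doc):
    """Map one record from the original scifact corpus.jsonl to Pyserini's
    {id, contents} format.

    Assumes the record has "doc_id" (int), "title" (str), and "abstract"
    (list of sentence strings), per the scifact release; verify against
    your copy of the data.
    """
    # Join the abstract sentences and prepend the title so the encoder
    # sees the full document text.
    contents = doc["title"] + " " + " ".join(doc["abstract"])
    return {"id": str(doc["doc_id"]), "contents": contents}
```

Applying this line by line and writing each result with `json.dumps` produces a file that `pyserini.encode` can consume.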

@BluesPizza
Author

Hello, I changed the command to add '--pooling mean':
!python -m pyserini.encode input --corpus /home/test.jsonl --fields title text --delimiter "\n" --shard-id 0 --shard-num 1 output --embeddings /home/encoding --to-faiss encoder --encoder /home/facebook/contriever --fields text --batch 32 --fp16 --pooling mean

test.jsonl: (screenshot of the file contents, not reproduced here)
I got lower results: nDCG@10=61.81 and R@100=90.93. How can I get a better result?
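For context on why `--pooling mean` matters here: Contriever produces its sentence embedding by averaging the last-layer token embeddings over non-padding positions, rather than taking the [CLS] vector. A standalone numpy sketch of that operation (not the actual model code; array shapes are illustrative):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions.

    token_embeddings: (seq_len, dim) array of last-layer hidden states.
    attention_mask:   (seq_len,) array of 1s for real tokens, 0s for padding.
    Mirrors Contriever-style mean pooling as a standalone sketch.
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # padding contributes zero
    count = mask.sum()                              # number of real tokens
    return summed / count
```

If the encoder defaults to [CLS] pooling instead, the resulting vectors will not match what Contriever was trained to produce, which is one common cause of degraded scores.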
