
How to build an index for BEIR datasets? #7

Open
BluesPizza opened this issue Apr 21, 2024 · 4 comments

Comments

@BluesPizza

I downloaded scifact and used corpus.json from the zip file to build a faiss index on it with Contriever (from Hugging Face). But no matter what type of index I choose, its performance is very poor. In fact, I don't even know which part of corpus.json should be used as "contents", so I only used "abstract" as "contents" and "doc_id" as "id".
Below are the commands I wrote following the Pyserini guidelines:

  1. python -m pyserini.encode input --corpus /home/scifact.jsonl --fields text --delimiter "\n" --shard-id 0 --shard-num 1 output --embeddings /home/encoding --to-faiss encoder --encoder /home/facebook/contriever --fields text --batch 32 --fp16
  2. python -m pyserini.index.faiss --input /home/encoding --output /home/index --hnsw (I tried all types of index Pyserini supports)

Could you tell me how to handle the various BEIR datasets and use Pyserini to build indexes for them? (It would be best if there were instructions or samples of processed .jsonl files.)
@MXueguang
Contributor

Hi @BluesPizza, I think your command (1) looks correct. You should be able to search directly with the flat index from command (1). Did any warnings/errors appear in the log? What scores did you get?

Besides, we have prebuilt Contriever indexes for all BEIR datasets.
e.g. https://rgw.cs.uwaterloo.ca/pyserini/indexes/faiss/faiss-flat.beir-v1.0.0-scifact.contriever.20230124.tar.gz
Please see details in https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py

@BluesPizza
Author

@MXueguang
Thank you for your response. For the scifact dataset, I didn't encounter any specific errors during the first step of index construction, and during the second step the only warning I received was about an insufficient node count.
Therefore, I believe it is not an issue with the commands. However, retrieval performance with the index I built myself is much lower than the numbers reported in the paper, so I suspect there is an issue with my processing of the scifact dataset. Here is the dataset download link: https://scifact.s3-us-west-2.amazonaws.com/release/latest/data.tar.gz

I noticed that the scifact index id file provided by Pyserini has 5183 lines, which matches the number of lines in the corpus.json file of the dataset, so I encoded that file. However, Pyserini requires each line of the file to be in the format {id:, contents:} to build a faiss index, and I'm not sure what content should go in the "contents" field.

Could you please demonstrate how to process the scifact content to build a faiss index?
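In the original scifact release (the S3 tarball linked above, as opposed to the BEIR repackaging), each corpus.jsonl record has a `doc_id`, a `title`, and an `abstract` given as a list of sentences. A minimal sketch of mapping one such record to Pyserini's {id, contents} shape, under the assumption that those field names match your copy of the data:

```python
import json

def scifact_doc_to_pyserini(doc):
    """Map one record from the original scifact corpus.jsonl to Pyserini's
    {id, contents} format.

    Assumes the record has "doc_id" (int), "title" (str), and "abstract"
    (list of sentence strings), per the scifact release; verify against
    your copy of the data.
    """
    # Join the abstract sentences and prepend the title so the encoder
    # sees the full document text.
    contents = doc["title"] + " " + " ".join(doc["abstract"])
    return {"id": str(doc["doc_id"]), "contents": contents}
```

Applying this line by line and writing each result with `json.dumps` produces a file that `pyserini.encode` can consume.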

@BluesPizza
Author

Hello, I changed the command to add '--pooling mean':
!python -m pyserini.encode input --corpus /home/test.jsonl --fields title text --delimiter "\n" --shard-id 0 --shard-num 1 output --embeddings /home/encoding --to-faiss encoder --encoder /home/facebook/contriever --fields text --batch 32 --fp16 --pooling mean

test.jsonl: (screenshot of the file contents, not reproduced here)
I got lower results: nDCG@10=61.81 and R@100=90.93. How can I get a better result?
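For context on why `--pooling mean` matters here: Contriever produces its sentence embedding by averaging the last-layer token embeddings over non-padding positions, rather than taking the [CLS] vector. A standalone numpy sketch of that operation (not the actual model code; array shapes are illustrative):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions.

    token_embeddings: (seq_len, dim) array of last-layer hidden states.
    attention_mask:   (seq_len,) array of 1s for real tokens, 0s for padding.
    Mirrors Contriever-style mean pooling as a standalone sketch.
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # padding contributes zero
    count = mask.sum()                              # number of real tokens
    return summed / count
```

If the encoder defaults to [CLS] pooling instead, the resulting vectors will not match what Contriever was trained to produce, which is one common cause of degraded scores.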
