Dense Index of MARCO V2 and KILT #974

paulowoicho · 2022-01-29T18:13:48Z

I have been trying to build a dense index from the combined MARCO V2 and KILT collections (about 17M documents) on an AWS EC2 instance running Ubuntu with 8 vCPUs and 32GB memory. Each time I run:

# Dense Index
python3 -m pyserini.encode input --corpus converted_collection \
                            --fields url title text \
                            output --embeddings indexes/dense \
                            --to-faiss \
                            encoder --encoder castorini/ance-msmarco-doc-maxp \
                            --fields url title text \
                            --batch 16

The process gets killed after about 3 minutes, probably because it uses too much memory.

When I try sharding, using:

# Dense Index
python3 -m pyserini.encode input --corpus converted_collection \
                            --fields url title text \
                            --shard-id 0 \
                            --shard-num 20 \
                            output --embeddings indexes/dense_0 \
                            --to-faiss \
                            encoder --encoder castorini/ance-msmarco-doc-maxp \
                            --fields url title text \
                            --batch 16

I still experience the same outcome.

However, I successfully created a dense index from a subset of the collection (~200K documents) using the command above. Is there anything I'm missing in terms of optimizing the process, or do I need to try again on a "beefier" machine?

Thanks!

MXueguang · 2022-01-29T19:06:46Z

Collaborator

Hi @paulowoicho,
The OOM here is probably caused by loading the entire corpus into memory at the beginning, before sharding is applied.
I would suggest splitting the raw collection in advance, i.e. use the Linux CLI tool split to break converted_collection into multiple smaller files.
Then run the encoding on each file with:

python3 -m pyserini.encode input --corpus converted_collection/split00.jsonl \
                            --fields url title text \
                            output --embeddings indexes/dense_0/split00 \
                            --to-faiss \
                            encoder --encoder castorini/ance-msmarco-doc-maxp \
                            --fields url title text \
                            --batch 16
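
The split-then-encode suggestion above could be scripted roughly as follows. This is a minimal sketch, not part of Pyserini: the helper names split_jsonl and encode_command and the 250,000-line chunk size are illustrative assumptions, and it assumes the collection is a single JSONL file with one document per line. The encode flags are copied from the command in this thread.

```python
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_file=250_000):
    """Stream src line by line into out_dir/split00.jsonl, split01.jsonl, ...

    Only one line is held in memory at a time.
    lines_per_file is an arbitrary illustrative default.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    writer = None
    with open(src, encoding="utf-8") as reader:
        for n, line in enumerate(reader):
            if n % lines_per_file == 0:
                # Start a new split file every lines_per_file lines.
                if writer is not None:
                    writer.close()
                path = out_dir / f"split{len(paths):02d}.jsonl"
                writer = open(path, "w", encoding="utf-8")
                paths.append(path)
            writer.write(line)
    if writer is not None:
        writer.close()
    return paths


def encode_command(split_path, index_root="indexes/dense_0"):
    """Build the pyserini.encode argument list from the thread for one split."""
    split_path = Path(split_path)
    return [
        "python3", "-m", "pyserini.encode",
        "input", "--corpus", str(split_path),
        "--fields", "url", "title", "text",
        "output", "--embeddings", str(Path(index_root) / split_path.stem),
        "--to-faiss",
        "encoder", "--encoder", "castorini/ance-msmarco-doc-maxp",
        "--fields", "url", "title", "text",
        "--batch", "16",
    ]
```

Each argument list can then be passed to subprocess.run(cmd, check=True), one split at a time, so that only one split is being encoded at any moment.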

3 replies

paulowoicho Jan 30, 2022
Author

Thanks!

paulowoicho Feb 11, 2022
Author

Hi @MXueguang ,

I have a follow-up question. I was able to run the encoding the way you suggested, and now have 69 sub-indexes of the collection. When I tried combining them with:

python3 -m pyserini.index.merge_faiss_indexes --prefix indexes/dense_ --shard-num 69

I get a MemoryError: std::bad_alloc, presumably because the machine runs out of memory. Do you have any suggestions for working around this?

Thanks
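
For context on the merge failure above, a back-of-the-envelope estimate suggests why a single merged index may not fit in 32GB. The sketch below assumes 768-dimensional float32 embeddings (the usual size for RoBERTa-base encoders such as ANCE), one vector per document, and an uncompressed flat index. None of these details are stated in the thread, and flat_index_gib is a hypothetical helper.

```python
def flat_index_gib(num_vectors, dim=768, bytes_per_value=4):
    """Approximate size in GiB of an uncompressed flat vector index.

    dim=768 and float32 storage are assumptions, not values from the thread.
    """
    return num_vectors * dim * bytes_per_value / 2**30


# About 17M documents, as stated in the original question.
print(f"{flat_index_gib(17_000_000):.1f} GiB")  # prints "48.6 GiB"
```

Under these assumptions the vectors alone need well over 32GB, before counting any overhead from the merge itself.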

MXueguang Feb 11, 2022
Collaborator

In this case, I would suggest running retrieval against the index shards separately and then merging the results.
That is, instead of merging the 69 indexes into one, you can search each index and get 69 sets of retrieval results.
Then merge the results into one run with the following code:

import argparse
import pandas as pd

# example usage: python merge_shards.py -f *_run.mrtydi.english.test.txt -o full_run.mrtydi.english.test.txt
def main():
    parser = argparse.ArgumentParser(description='Concatenate the results of multiple shards.')
    parser.add_argument('-f', '--files', nargs='+', help='<Required> Files to search in', required=True)
    parser.add_argument('-o', '--output', help='<Required> File to write to.', required=True)
    # type=int so that a value passed on the command line is not left as a string
    parser.add_argument('-k', type=int, help='<Optional> How many results to return for each query [k = 100 by default].', required=False, default=100)
    args = parser.parse_args()
    files = args.files
    output_file = args.output
    k = args.k

    # Load each shard's run file (tab-separated: qid, docid, score).
    shard_dfs = []
    for f in files:
        df = pd.read_csv(f, delimiter='\t', names=["qid", "docid", "score"])
        shard_dfs.append(df)

    # Order hits by query, then by descending score, and keep the top k per query across all shards.
    full_df = pd.concat(shard_dfs, axis=0)
    sorted_df = full_df.sort_values(['qid', 'score'], ascending=[True, False]).groupby('qid').head(k)
    sorted_df.to_csv(output_file, sep='\t', index=False, header=False)
    print(f"Output has been written to {output_file}")

if __name__ == '__main__':
    main()
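
If loading all 69 run files into pandas at once is itself a memory concern, the same top-k merge can be done with the standard library only, keeping at most k hits per query in memory. This is an alternative sketch, not the script above: merge_runs and write_run are hypothetical names, and it assumes the same tab-separated qid, docid, score format, document IDs that are unique across shards, and scores that are comparable across shards (same encoder and similarity function).

```python
import csv
import heapq
from collections import defaultdict


def merge_runs(files, k=100):
    """Return {qid: [(score, docid), ...]} with the k best hits per query."""
    best = defaultdict(list)  # qid -> min-heap of (score, docid)
    for path in files:
        with open(path, newline="", encoding="utf-8") as f:
            for qid, docid, score in csv.reader(f, delimiter="\t"):
                item = (float(score), docid)
                heap = best[qid]
                if len(heap) < k:
                    heapq.heappush(heap, item)
                elif item > heap[0]:
                    # Replace the current worst hit with a better one.
                    heapq.heapreplace(heap, item)
    # Highest score first within each query.
    return {qid: sorted(heap, reverse=True) for qid, heap in best.items()}


def write_run(merged, output_file):
    """Write the merged hits as tab-separated qid, docid, score lines."""
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for qid in sorted(merged):
            for score, docid in merged[qid]:
                writer.writerow([qid, docid, score])
```

The min-heap keeps the lowest-scoring retained hit at the front, so each new hit needs only one comparison to decide whether it belongs in the top k.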
