Skip to content

Latest commit

 

History

History
416 lines (348 loc) · 39.1 KB

experiments-msmarco-passage.md

File metadata and controls

416 lines (348 loc) · 39.1 KB

Pyserini: BM25 Baseline for MS MARCO Passage Ranking

This guide contains instructions for running a BM25 baseline on the MS MARCO passage ranking task, which is nearly identical to a similar guide in Anserini, except that everything is in Python here (no Java). Note that there is a separate guide for the MS MARCO document ranking task. This exercise will require a machine with >8 GB RAM and >15 GB free disk space.

If you're a Waterloo student traversing the onboarding path (which starts here), make sure you've already done the BM25 Baselines for MS MARCO Passage Ranking in Anserini. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.

Learning outcomes for this guide, building on previous steps in the onboarding path:

  • Be able to use Pyserini to build a Lucene inverted index on the MS MARCO passage collection.
  • Be able to use Pyserini to perform a batch retrieval run on the MS MARCO passage collection with the dev queries.
  • Be able to evaluate the retrieved results above.
  • Be able to generate the retrieved results above interactively by directly manipulating Pyserini Python classes.

In short, you'll do everything you did with Anserini (in Java) on the MS MARCO passage ranking test collection, but now with Pyserini (in Python).

What's Pyserini? Well, it's the repo that you're in right now. Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. The toolkit provides Python bindings for our group's Anserini IR toolkit, which is built on Lucene (in Java). Pyserini provides entrée into the broader deep learning ecosystem, which is heavily Python-centric.

Data Prep

The guide requires the development installation. So get your Python environment set up.

Once you've done that: congratulations, you've passed the most difficult part! Everything else below mirrors what you did in Anserini (in Java), so it should be easy.

We're going to use collections/msmarco-passage/ as the working directory. First, we need to download and extract the MS MARCO passage dataset:

mkdir collections/msmarco-passage

wget https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

To confirm, collectionandqueries.tar.gz should have MD5 checksum of 31644046b18952c1386cd4564ba2ae69.

Next, we need to convert the MS MARCO tsv collection into Pyserini's jsonl files (which have one json object per line):

python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl

The above script should generate 9 jsonl files in collections/msmarco-passage/collection_jsonl, each with 1M lines (except for the last one, which should have 841,823 lines).

Indexing

We can now index these documents as a JsonCollection using Pyserini:

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input collections/msmarco-passage/collection_jsonl \
  --index indexes/lucene-index-msmarco-passage \
  --generator DefaultLuceneDocumentGenerator \
  --threads 9 \
  --storePositions --storeDocvectors --storeRaw

The command-line invocation should look familiar: it essentially mirrors the command with Anserini (in Java). If you can't make sense of what's going on here, back up and make sure you've first done the BM25 Baselines for MS MARCO Passage Ranking in Anserini.

Upon completion, you should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.

Retrieval

The 6980 queries in the development set are already stored in the repo. Let's take a peek:

$ head tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
1048585	what is paula deen's brother
2	 Androgen receptor define
524332	treating tension headaches without medication
1048642	what is paranoid sc
524447	treatment of varicose veins in legs
786674	what is prime rate in canada
1048876	who plays young dr mallard on ncis
1048917	what is operating system misconfiguration
786786	what is priority pass
524699	tricare service number

$ wc tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
    6980   48335  290193 tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt

Each line contains a tab-delimited (query id, query) pair. Conveniently, Pyserini already knows how to load and iterate through these pairs. We can now perform retrieval using these queries:

python -m pyserini.search.lucene \
  --index indexes/lucene-index-msmarco-passage \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage.bm25tuned.txt \
  --output-format msmarco \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68 \
  --threads 4 --batch-size 16

Here, we set the BM25 parameters to k1=0.82, b=0.68 (tuned by grid search). The option --output-format msmarco says to generate output in the MS MARCO output format. The option --hits specifies the number of documents to return per query. Thus, the output file should have approximately 6980 × 1000 = 6.9M lines.

Once again, if you can't make sense of what's going on here, back up and make sure you've first done the BM25 Baselines for MS MARCO Passage Ranking in Anserini.

Retrieval speed will vary by hardware: On a reasonably modern CPU with an SSD, we might get around 13 qps (queries per second), and so the entire run should finish in under ten minutes (using a single thread). We can perform multi-threaded retrieval by using the --threads and --batch-size arguments. For example, setting --threads 16 --batch-size 64 on a CPU with sufficient cores, the entire run will finish in a couple of minutes.

Evaluation

After the run finishes, we can evaluate the results using the official MS MARCO evaluation script, which has been incorporated into Pyserini:

$ python -m pyserini.eval.msmarco_passage_eval \
   tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
   runs/run.msmarco-passage.bm25tuned.txt

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################

We can also use the official TREC evaluation tool, trec_eval, to compute metrics other than MRR@10.

The tool needs a different run format, so it's easier to just run retrieval again:

python -m pyserini.search.lucene \
  --index indexes/lucene-index-msmarco-passage \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage.bm25tuned.trec \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68 \
  --threads 4 --batch-size 16

The only difference here is that we've removed --output-format msmarco.

Then, convert qrels files to the TREC format:

python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
  --input collections/msmarco-passage/qrels.dev.small.tsv \
  --output collections/msmarco-passage/qrels.dev.small.trec

Finally, run the trec_eval tool, which has been incorporated into Pyserini:

$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap \
   collections/msmarco-passage/qrels.dev.small.trec \
   runs/run.msmarco-passage.bm25tuned.trec

map                   	all	0.1957
recall_1000           	all	0.8573

If you want to examine the MRR@10 for qid 1048585:

$ python -m pyserini.eval.trec_eval -q -c -M 10 -m recip_rank \
    collections/msmarco-passage/qrels.dev.small.trec \
    runs/run.msmarco-passage.bm25tuned.trec | grep 1048585

recip_rank            	1048585	1.0000

Once again, if you can't make sense of what's going on here, back up and make sure you've first done the BM25 Baselines for MS MARCO Passage Ranking in Anserini.

Otherwise, congratulations! You've done everything that you did in Anserini (in Java), but now in Pyserini (in Python).

Interactive Retrieval

There's one final thing we should go over. Because we're in Python now, we get the benefit of having an interactive shell. Thus, we can run Pyserini interactively.

Try the following:

from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('indexes/lucene-index-msmarco-passage')
searcher.set_bm25(0.82, 0.68)
hits = searcher.search('what is paula deen\'s brother')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')

The LuceneSearcher class provides search capabilities for BM25. In the code snippet above, we're issuing the query about Paula Deen's brother (from above). Note that we're explicitly setting the BM25 parameters, which are not the default parameters. We get back a list of results (hits), which we then iterate through and print out:

 1 7187158 18.811600
 2 7187157 18.333401
 3 7187163 17.878799
 4 7546327 16.962099
 5 7187160 16.564699
 6 8227279 16.432501
 7 7617404 16.239901
 8 7187156 16.024900
 9 2298838 15.701500
10 7187155 15.513300

You can confirm that the output is the same as pyserini.search.lucene from above.

$ grep 1048585 runs/run.msmarco-passage.bm25tuned.trec | head -10
1048585 Q0 7187158 1 18.811600 Anserini
1048585 Q0 7187157 2 18.333401 Anserini
1048585 Q0 7187163 3 17.878799 Anserini
1048585 Q0 7546327 4 16.962099 Anserini
1048585 Q0 7187160 5 16.564699 Anserini
1048585 Q0 8227279 6 16.432501 Anserini
1048585 Q0 7617404 7 16.239901 Anserini
1048585 Q0 7187156 8 16.024900 Anserini
1048585 Q0 2298838 9 15.701500 Anserini
1048585 Q0 7187155 10 15.513300 Anserini

To pull up the actual contents of a hit:

hits[0].lucene_document.get('raw')

And you should get:

'{\n  "id" : "7187158",\n  "contents" : "Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba\'sâ\x80¦ Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba\'sâ\x80¦"\n}'

Everything make sense? If so, now you're truly done with this guide and are ready to move on and learn about the relationship between sparse and dense retrieval!

Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Pyserini, and use its 7-hexadecimal prefix for the link anchor text.

Reproduction Log*