This page contains instructions for reproducing the "Elasticsearch optimized multi_match best_fields" entry (2020/11/25) on the the MS MARCO Document Ranking Leaderboard using Pyserini. Details behind this run are described in this blog post; the official leaderboard submission corresponds to the run denoted "multi_match best_fields tuned (all-in-one): all params" in the blog post.
This run makes sure to preserve the distinction between document fields when preparing and indexing documents. For ranking, we use a disjunction max query to combine score contributions across fields; the weights for the disjunction max query are taken from the blog post reference above.
To match the leaderboard results, this run makes use of a custom stopwords file
elastic-msmarco-stopwords.txt
. The file contains the default English stopwords
from Lucene, plus some additional words targeted at question-style queries.
We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO document dataset:
mkdir collections/msmarco-doc
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz -P collections/msmarco-doc
# Alternative mirror:
# wget https://www.dropbox.com/s/zly8cbyvt18l3u0/msmarco-docs.tsv.gz -P collections/msmarco-doc
gunzip collections/msmarco-doc/msmarco-docs.tsv.gz
To confirm, msmarco-docs.tsv.gz
should have an MD5 checksum of 103b19e21ad324d8a5f1ab562425c0b4
.
First we need to convert the file to JSON lines format. Each document will correspond to a JSON object with distinct fields for title, URL, and body:
python tools/scripts/msmarco/convert_doc_collection_to_jsonl.py \
--collection-path collections/msmarco-doc/msmarco-docs.tsv \
--output-folder collections/msmarco-doc-json
We then build the index with the following command:
python -m pyserini.index \
--input collections/msmarco-doc-json/ \
--collection JsonCollection \
--generator DefaultLuceneDocumentGenerator \
--index indexes/msmarco-doc/lucene-index-msmarco \
--threads 4 \
--fields title url \
--storeRaw \
--stopwords docs/elastic-msmarco-stopwords.txt
On a modern desktop with an SSD, indexing takes around 15 minutes. There should be a total of 3,201,821 documents indexed.
After indexing finishes, we can do a retrieval run. A few minor details to pay attention to: the official metric is MRR@100, so we want to only return the top 100 hits, and the submission files to the leaderboard have a slightly different format.
python -m pyserini.search.lucene \
--topics msmarco-doc-dev \
--index indexes/msmarco-doc/lucene-index-msmarco/ \
--output runs/run.msmarco-doc.leaderboard-dev.elastic.txt \
--output-format msmarco \
--hits 100 \
--bm25 --k1 1.2 --b 0.75 \
--fields contents=10.0 title=8.63280262513067 url=0.0 \
--dismax --dismax.tiebreaker 0.3936135232328522 \
--stopwords docs/elastic-msmarco-stopwords.txt
After the run completes, we can evaluate the results:
$ python -m pyserini.eval.msmarco_doc_eval \
--judgments msmarco-doc-dev \
--run runs/run.msmarco-doc.leaderboard-dev.elastic.txt
#####################
MRR @100: 0.3071421845448626
QueriesRanked: 5193
#####################