This page contains instructions for running BM25 baselines on the MS MARCO passage ranking task. Note that there is a separate MS MARCO document ranking task. This exercise will require a machine with >8 GB RAM and >15 GB free disk space .
If you're a Waterloo student traversing the onboarding path, start here. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.
Learning outcomes for this guide, building on previous steps in the onboarding path:
- Be able to use Anserini to build a Lucene inverted index on the MS MARCO passage collection.
- Be able to use Anserini to perform a batch retrieval run on the MS MARCO passage collection with the dev queries.
- Be able to evaluate the retrieved results above.
- Understand the MRR metric.
What's Anserini? Well, it's the repo that you're in right now. Anserini is a toolkit (in Java) for reproducible information retrieval research built on the Luence search library. The Lucene search library provides components of the popular Elasticsearch platform.
Think of it this way: Lucene provides a "kit of parts". Elasticsearch provides "assembly of parts" targeted to production search applications, with a REST-centric API. Anserini provides an alternative way of composing the same core components together, targeted at information retrieval researchers. By building on Lucene, Anserini aims to bridge the gap between academic information retrieval research and the practice of building real-world search applications. That is, most things done with Anserini can be "translated" into Elasticsearch quite easily.
In this guide, we're just going through the mechanical steps of data prep. To better understand what you're actually doing, go through the start here guide. The guide contains the same exact instructions, but provide more detailed explanations.
We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO passage dataset:
mkdir collections/msmarco-passage
wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage
# Alternative mirror:
# wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/collectionandqueries.tar.gz -P collections/msmarco-passage
tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage
To confirm, collectionandqueries.tar.gz
should have MD5 checksum of 31644046b18952c1386cd4564ba2ae69
.
Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl
The above script should generate 9 jsonl files in collections/msmarco-passage/collection_jsonl
, each with 1M lines (except for the last one, which should have 841,823 lines).
We need to do a bit of data munging on the queries as well. There are queries in the dev set that don't have relevance judgments. Let's discard them:
python tools/scripts/msmarco/filter_queries.py \
--qrels collections/msmarco-passage/qrels.dev.small.tsv \
--queries collections/msmarco-passage/queries.dev.tsv \
--output collections/msmarco-passage/queries.dev.small.tsv
The output queries file collections/msmarco-passage/queries.dev.small.tsv
should contain 6980 lines.
In building a retrieval system, there are generally two phases:
- In the indexing phase, an indexer takes the document collection (i.e., corpus) and builds an index, which is a data structure that supports efficient retrieval.
- In the retrieval (or search) phase, the retrieval system returns a ranked list given a query q, with the aid of the index constructed in the previous phase.
(There's also a training phase when we start to discuss models that learn from data, but we're not there yet.)
Given a (static) document collection, indexing only needs to be performed once, and hence there are fewer constraints on latency, throughput, and other aspects of performance (just needs to be "reasonable"). On the other hand, retrieval needs to be fast, i.e., low latency, high throughput, etc.
With the data prep above, we can now index the MS MARCO passage collection in collections/msmarco-passage/collection_jsonl
.
If you haven't built Anserini already, build it now using the instructions in anserini#-getting-started.
We index these docs as a JsonCollection
(a specification of how documents are encoded) using Anserini:
target/appassembler/bin/IndexCollection \
-collection JsonCollection \
-input collections/msmarco-passage/collection_jsonl \
-index indexes/msmarco-passage/lucene-index-msmarco \
-generator DefaultLuceneDocumentGenerator \
-threads 9 -storePositions -storeDocvectors -storeRaw
In this case, Lucene creates what is known as an inverted index.
Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.
In the above step, we've built the inverted index. Now we can now perform a retrieval run using queries we've prepared:
target/appassembler/bin/SearchCollection \
-index indexes/msmarco-passage/lucene-index-msmarco \
-topics collections/msmarco-passage/queries.dev.small.tsv \
-topicreader TsvInt \
-output runs/run.msmarco-passage.dev.small.tsv -format msmarco \
-parallelism 4 \
-bm25 -bm25.k1 0.82 -bm25.b 0.68 -hits 1000
This is the retrieval (or search) phase. We're performing retrieval in batch, on a set of queries.
Retrieval here uses a "bag-of-words" model known as BM25. A "bag of words" model just means that documents are scored based on the matching of query terms (i.e., words) that appear in the documents, without regard to the structure of the document, the order of the words, etc. BM25 is perhaps the most popular bag-of-words retrieval model; it's the default in the popular Elasticsearch platform. We'll discuss retrieval models in much more detail later.
The above command uses BM25 with tuned parameters k1=0.82
, b=0.68
.
The option -hits
specifies the number of documents per query to be retrieved.
Thus, the output file should have approximately 6980 × 1000 = 6.9M lines.
Retrieval speed will vary by machine:
On a reasonably modern desktop with an SSD, with four threads (as specified above), the run takes a couple of minutes.
Adjust the parallelism by changing the -parallelism
argument.
Congratulations, you've performed your first retrieval run!
Recap of what you've done: You've fed the retrieval system a bunch of queries and the retrieval run is the output. For each query, the retrieval system produced a ranked list of results (i.e., a list of hits). The retrieval run contains the ranked lists for all queries you fed to it.
Let's take a look:
$ head runs/run.msmarco-passage.dev.small.tsv
1048585 7187158 1
1048585 7187157 2
1048585 7187163 3
1048585 7546327 4
1048585 7187160 5
1048585 8227279 6
1048585 7617404 7
1048585 7187156 8
1048585 2298838 9
1048585 7187155 10
The first column is the qid
(corresponding to the query).
From above, we can see that qid
1048585 is the query "what is paula deen's brother".
The second column is the docid
of the retrieved result (i.e., the hit), and the third column is the rank position.
That is, in a search interface, docid
7187158 would be shown in the top position, docid
7187157 would be shown in the second position, etc.
You can grep through the collection to see what that actual passage is:
$ grep 7187158 collections/msmarco-passage/collection.tsv
7187158 Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba'sâ�¦ Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba'sâ�
In this case, the document (hit) seems relevant. That is, it contains information that addresses the information need. So here, the retrieval system "did well". Remember that this document was indeed marked relevant in the qrels, as we saw in the start here guide.
As an additional sanity check, run the following:
$ cut -f 1 runs/run.msmarco-passage.dev.small.tsv | uniq | wc
6980 6980 51039
This tells us that there are 6980 unique tokens in the first column of the run file.
Since the first column indicates the qid
, it means that the file contains ranked lists for 6980 queries, which checks out.
Finally, we can evaluate the retrieved documents using this the official MS MARCO evaluation script:
python tools/scripts/msmarco/msmarco_passage_eval.py \
collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.dev.small.tsv
And the output should be like this:
#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
(Yea, the number of digits of precision is a bit... excessive)
Remember from the start here guide that with relevance judgments (qrels), we can automatically evaluate the retrieval system output (i.e., the run).
The final ingredient is a metric, i.e., how to quantify the "quality" of a ranked list.
Here, we're using a metric called MRR, or mean reciprocal rank.
The idea is quite simple:
We look at where the relevant docid
appears.
If it appears at rank 1, the system gets a score of one.
If it appears at rank 2, the system gets a score of 1/2.
If it appears at rank 3, the system gets a score of 1/3.
And so on.
MRR@10 means that we only go down to rank 10.
If the relevant docid
doesn't appear in the top 10, then the system gets a score of zero.
That's the score of a query. We take the average of the scores across all queries (6980 in this case), and we arrive at the score for the entire run.
You can find this run on the MS MARCO Passage Ranking Leaderboard as the entry named "BM25 (Lucene8, tuned)", dated 2019/06/26. So you've just reproduced (part of) a leaderboard submission!
We can also use the official TREC evaluation tool, trec_eval
, to compute other metrics than MRR@10.
For that we first need to convert runs and qrels files to the TREC format:
python tools/scripts/msmarco/convert_msmarco_to_trec_run.py \
--input runs/run.msmarco-passage.dev.small.tsv \
--output runs/run.msmarco-passage.dev.small.trec
python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
--input collections/msmarco-passage/qrels.dev.small.tsv \
--output collections/msmarco-passage/qrels.dev.small.trec
And run the trec_eval
tool:
tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
collections/msmarco-passage/qrels.dev.small.trec \
runs/run.msmarco-passage.dev.small.trec
The output should be:
map all 0.1957
recall_1000 all 0.8573
In many retrieval applications, average precision and recall@1000 are the two metrics we care about the most.
You can use trec_eval
to compute the MRR@10 also, which gives results identical to above (just fewer digits of precision):
tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank \
collections/msmarco-passage/qrels.dev.small.trec \
runs/run.msmarco-passage.dev.small.trec
It's a different command-line incantation of trec_eval
to compute MRR@10.
And if you add -q
, the tool will spit out the MRR@10 per query (for all 6908 queries, in addition to the final average).
tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \
collections/msmarco-passage/qrels.dev.small.trec \
runs/run.msmarco-passage.dev.small.trec
We can find the MRR@10 for qid
1048585 above:
$ tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \
collections/msmarco-passage/qrels.dev.small.trec \
runs/run.msmarco-passage.dev.small.trec | grep 1048585
recip_rank 1048585 1.0000
This is consistent with the example we worked through above. At this point, make sure that the connections between a query, the relevance judgments for a query, the ranked list, and the metric (MRR@10) are clear in your mind. Work through a few more examples (take another query, look at its qrels and ranked list, and compute its MRR@10 by hand) to convince yourself that you understand what's going on.
The tl;dr is that there are different formats for run files and lots of different metrics you can compute.
trec_eval
is a standard tool used by information retrieval researchers (which has many command-line options that you'll slowly learn over time).
Researchers have been trying to answer the question "how do we know if a search result is good and how do we measure it" for over half a century... and the question still has not been fully resolved.
In short, it's complicated.
At this time, look back through the learning outcomes again and make sure you're good. As a next step in the onboarding path, you basically do the same thing again in Python with Pyserini (as opposed to Java with Anserini here).
Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd
, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-hexadecimal prefix for the link anchor text.
In the description of your pull request, please provide some details on your setup (e.g., operating system, environment and configuration, etc.).
In addition, also provide some indication of success (e.g., everything worked) or document issues you encountered.
If you think this guide can be improved in any way (e.g., you caught a typo or think a clarification is warranted), feel free to include it in the pull request.
This section is not part of the onboarding path, so feel free to skip.
Note that this figure differs slightly from the value reported in Document Expansion by Query Prediction, which uses the Anserini (system-wide) default of k1=0.9
, b=0.4
.
Tuning was accomplished with tools/scripts/msmarco/tune_bm25.py
, using the queries found here; the basic approach is grid search of parameter values in tenth increments.
There are five different sets of 10k samples (using the shuf
command).
We tuned on each individual set and then averaged parameter values across all five sets (this has the effect of regularization).
In separate trials, we optimized for:
- recall@1000, since Anserini output serves as input to downstream rerankers (e.g., based on BERT), and we want to maximize the number of relevant documents the rerankers have to work with;
- MRR@10, for the case where Anserini output is directly presented to users (i.e., no downstream reranking).
It turns out that optimizing for MRR@10 and MAP yields the same settings.
Here's the comparison between the Anserini default and optimized parameters:
Setting | MRR@10 | MAP | Recall@1000 |
---|---|---|---|
Default (k1=0.9 , b=0.4 ) |
0.1840 | 0.1926 | 0.8526 |
Optimized for recall@1000 (k1=0.82 , b=0.68 ) |
0.1874 | 0.1957 | 0.8573 |
Optimized for MRR@10/MAP (k1=0.60 , b=0.62 ) |
0.1892 | 0.1972 | 0.8555 |
As mentioned above, the BM25 run with k1=0.82
, b=0.68
corresponds to the entry "BM25 (Lucene8, tuned)" dated 2019/06/26 on the MS MARCO Passage Ranking Leaderboard.
The BM25 run with default parameters k1=0.9
, b=0.4
roughly corresponds to the entry "BM25 (Anserini)" dated 2019/04/10 (but Anserini was using Lucene 7.6 at the time).
Reproduction Log*
- Results reproduced by @ronakice on 2019-08-12 (commit
5b29d16
) - Results reproduced by @MathBunny on 2019-08-12 (commit
5b29d16
) - Results reproduced by @JMMackenzie on 2020-01-08 (commit
f63cd22
) - Results reproduced by @edwinzhng on 2020-01-08 (commit
5cc923d
) - Results reproduced by @LuKuuu on 2020-01-15 (commit
f21137b
) - Results reproduced by @kevinxyc1 on 2020-01-18 (commit
f21137b
) - Results reproduced by @nikhilro on 2020-01-21 (commit
631589e
) - Results reproduced by @yuki617 on 2020-03-29 (commit
074723c
) - Results reproduced by @weipang142857 on 2020-04-20 (commit
074723c
) - Results reproduced by @HangCui0510 on 2020-04-23 (commit
0ae567d
) - Results reproduced by @x65han on 2020-04-25 (commit
f5496b9
) - Results reproduced by @y276lin on 2020-04-26 (commit
8f48f8e
) - Results reproduced by @stephaniewhoo on 2020-04-26 (commit
8f48f8e
) - Results reproduced by @eiston on 2020-05-04 (commit
dd84a5a
) - Results reproduced by @rohilg on 2020-05-09 (commit
20ee950
) - Results reproduced by @wongalvis14 on 2020-05-09 (commit
ebac5d6
) - Results reproduced by @YimingDou on 2020-05-14 (commit
3b0a642
) - Results reproduced by @richard3983 on 2020-05-14 (commit
a65646f
) - Results reproduced by @MXueguang on 2020-05-20 (commit
3b2751e
) - Results reproduced by @shaneding on 2020-05-23 (commit
b6e0367
) - Results reproduced by @adamyy on 2020-05-28 (commit
94893f1
) - Results reproduced by @kelvin-jiang on 2020-05-28 (commit
d55531a
) - Results reproduced by @TianchengY on 2020-05-28 (commit
2947a16
) - Results reproduced by @stariqmi on 2020-05-28 (commit
4914305
) - Results reproduced by @justinborromeo on 2020-06-10 (commit
7954eab
) - Results reproduced by @yxzhu16 on 2020-07-03 (commit
68ace26
) - Results reproduced by @LizzyZhang-tutu on 2020-07-13 (commit
8c98d5b
) - Results reproduced by @estella98 on 2020-07-29 (commit
99092a8
) - Results reproduced by @tangsaidi on 2020-08-19 (commit
aba846
) - Results reproduced by @qguo96 on 2020-09-07 (commit
e16b3c1
) - Results reproduced by @yuxuan-ji on 2020-09-08 (commit
0f9a8ec
) - Results reproduced by @wiltan-uw on 2020-09-09 (commit
93d913f
) - Results reproduced by @JeffreyCA on 2020-09-13 (commit
bc2628b
) - Results reproduced by @jhuang265 on 2020-10-15 (commit
66711b9
) - Results reproduced by @rayyang29 on 2020-10-27 (commit
ad8cc5a
) - Results reproduced by @Dahlia-Chehata on 2020-11-11 (commit
22c0ad3
) - Results reproduced by @rakeeb123 on 2020-12-07 (commit
f50dcce
) - Results reproduced by @jrzhang12 on 2021-01-02 (commit
be4e44d
) - Results reproduced by @HEC2018 on 2021-01-04 (commit
4de21ec
) - Results reproduced by @KaiSun314 on 2021-01-08 (commit
113f1c7
) - Results reproduced by @yemiliey on 2021-01-18 (commit
179c242
) - Results reproduced by @larryli1999 on 2021-01-22 (commit
179c242
) - Results reproduced by @ArthurChen189 on 2021-04-08 (commit
45a5a21
) - Results reproduced by @printfCalvin on 2021-04-11 (commit
d808d4a
) - Results reproduced by @saileshnankani on 2021-04-26 (commit
5781c87
) - Results reproduced by @andrewyguo on 2021-04-29 (commit
71f3ca6
) - Results reproduced by @mayankanand007 on 2021-05-04 (commit
906ca50
) - Results reproduced by @Albert-Ma on 2021-05-07 (commit
5bcbccd
) - Results reproduced by @rootofallevii on 2021-05-14 (commit
626da95
) - Results reproduced by @jpark621 on 2021-06-01 (commit
2591e06
) - Results reproduced by @nimasadri11 on 2021-06-27 (commit
6f9352f
) - Results reproduced by @mzzchy on 2021-07-05 (commit
589928b
) - Results reproduced by @d1shs0ap on 2021-07-16 (commit
43ad899
) - Results reproduced by @apokali on 2021-08-19 (commit
ad4caeb
) - Results reproduced by @leungjch on 2021-09-12 (commit
f79fb67
) - Results reproduced by @AlexWang000 on 2021-10-10 (commit
fc2ddb0
) - Results reproduced by @ToluClassics on 2021-10-20 (commit
fcc2aff
) - Results reproduced by @manveertamber on 2021-12-05 (commit
aee51ad
) - Results reproduced by @lingwei-gu on 2021-12-15 (commit
30605f5
) - Results reproduced by @tyao-t on 2021-12-18 (commit
6500560
) - Results reproduced by @kevin-wangg on 2022-01-04 (commit
c3e14dc
) - Results reproduced by @vivianliu0 on 2022-01-06 (commit
c3e14dc
) - Results reproduced by @mikhail-tsir on 2022-01-07 (commit
806ac89
) - Results reproduced by @AceZhan on 2022-01-13 (commit
7ff99e0
) - Results reproduced by @jh8liang on 2022-02-06 (commit
5cdf9ec
) - Results reproduced by @mayankanand007 on 2022-02-22 (commit
6a70804
) - Results reproduced by @jasper-xian on 2022-03-27 (commit
2e8e9fd
) - Results reproduced by @jx3yang on 2022-04-25 (commit
b429218
) - Results reproduced by @AreelKhan on 2022-04-27 (commit
7adee1d
) - Results reproduced by @alvind1 on 2022-05-05 (commit
9b2dd5f5
) - Results reproduced by @Pie31415 on 2022-06-22 (commit
6aef2eb
) - Results reproduced by @aivan6842 on 2022-07-11 (commit
8010d5c
) - Results reproduced by @AileenLin on 2022-09-11 (commit
760dab0
) - Results reproduced by @Jasonwu-0803 on 2022-09-27 (commit
b5ecc5a
) - Results reproduced by @limelody on 2022-09-27 (commit
252b5e2
) - Results reproduced by @minconszhang on 2022-11-25 (commit
6556550
) - Results reproduced by @jingliu on 2022-12-08 (commit
6872c87
) - Results reproduced by @farazkh80 on 2022-12-18 (commit
4527a5d
) - Results reproduced by @Cath on 2023-01-14 (commit
732cba4
) - Results reproduced by @dlrudwo1269 on 2023-03-06 (commit
4b7662c7
) - Results reproduced by @aryamancodes on 2023-04-11 (commit
ed89401
) - Results reproduced by @tailaiwang on 2023-04-29 (commit
44dc9db
) - Results reproduced by @Jocn2020 on 2023-04-30 (commit
30269d6
) - Results reproduced by @billcui57 on 2023-05-11 (commit
a913b52
) - Results reproduced by @zoehahaha on 2023-05-12 (commit
b429218
) - Results reproduced by @Singularity-tian on 2023-05-12 (commit
d82b6f7
) - Results reproduced by @Richard5678 on 2023-06-11 (commit
2d484d3
) - Results reproduced by @pratyushpal on 2023-07-14 (commit
17d5fc7
) - Results reproduced by @sahel-sh on 2023-07-22 (commit
4b8f051
) - Results reproduced by @Mofetoluwa on 2023-08-03 (commit
7314128
) - Results reproduced by @yilinjz on 2023-08-24 (commit
d4cb6fd
) - Results reproduced by @Andrwyl on 2023-08-26 (commit
b64a412
) - Results reproduced by @UShivani3 on 2023-08-29 (commit
24ab292
) - Results reproduced by @ErrenYeager on 2023-09-02 (commit
4ae518b
) - Results reproduced by @lucedes27 on 2023-09-03 (commit
211e74f
) - Results reproduced by @mchlp on 2023-09-03 (commit
211e74f
) - Results reproduced by @Edward-J-Xu on 2023-09-04 (commit
6030b1
) - Results reproduced by @Sumon-Kanti-Dey on 2023-09-04 (commit
a9de13a
) - Results reproduced by @MojTabaa4 on 2023-09-13 (commit
adf480b
) - Results reproduced by @Kshama on 2023-09-17 (commit
f27984b
) - Results reproduced by @MelvinMo on 2023-09-23 (commit
21efc3f
) - Results reproduced by @dourian on 2023-09-25 (commit
24ab292
) - Results reproduced by @ksunisth on 2023-09-27 (commit
7c95d91
) - Results reproduced by @maizerrr on 2023-10-01 (commit
2abdd37
) - Results reproduced by @shayanbali on 2023-10-12 (commit
8194b8e
) - Results reproduced by @gituserbs on 2023-10-14 (commit
8194b8e
) - Results reproduced by @oscarbelda86 on 2023-10-31 (commit
4c06d8a
) - Results reproduced by @shakibaam on 2023-11-03 (commit
2f3e7d5
) - Results reproduced by @gitHubAndyLee2020 on 2023-11-03 (commit
2f3e7d5
) - Results reproduced by @Melissa1412 on 2023-11-04 (commit
cf459b3
)