This guide describes how to reproduce the uniCOIL experiments in the following paper:
> Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.
And further detailed in:
> Xueguang Ma, Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. Document Expansions and Learned Sparse Lexical Representations for MS MARCO V1 and V2. Proceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), July 2022.
Here, we start with versions of the MS MARCO V1 corpora that have already been processed with uniCOIL, i.e., we have applied model inference on every document and stored the output sparse vectors.
## Passage Ranking

To reproduce these runs directly from our pre-built indexes, see our two-click reproduction matrix for MS MARCO V1 passage. The passage ranking experiments here correspond to row (3b) for pre-encoded queries and a corresponding condition for on-the-fly query inference.
We're going to use the Pyserini repository's root directory as the working directory. First, we need to download and unpack the corpus:
```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil.tar -P collections/
tar xvf collections/msmarco-passage-unicoil.tar -C collections/
```
To confirm, `msmarco-passage-unicoil.tar` is 3.4 GB and has MD5 checksum `78eef752c78c8691f7d61600ceed306f`.
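If you'd rather verify the checksum programmatically than with `md5sum`, here's a minimal Python sketch (the path and expected checksum are the ones above):

```python
import hashlib

# Compute the MD5 of the downloaded tarball in chunks to keep memory use low.
md5 = hashlib.md5()
with open('collections/msmarco-passage-unicoil.tar', 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        md5.update(chunk)

assert md5.hexdigest() == '78eef752c78c8691f7d61600ceed306f', 'checksum mismatch!'
print('checksum ok')
```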
We can now index these docs:
```bash
python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco-passage-unicoil/ \
  --index indexes/lucene-index.msmarco-passage-unicoil/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized
```
The important indexing options to note here are `--impact --pretokenized`: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (which it otherwise does by default), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.
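For intuition, each document in the corpus is a JSON record whose `vector` field holds the pre-computed, quantized uniCOIL impact weights. The record below is a made-up illustration of the `JsonVectorCollection` layout (the id, tokens, and weights are invented), not an actual document from the corpus:

```python
import json

# Illustrative JsonVectorCollection record: the indexer stores the "vector"
# weights directly as impact scores, which is why no further tokenization
# or BM25-style length normalization should be applied at indexing time.
record = {
    'id': '7187158',            # hypothetical passage id
    'contents': '',             # raw text is optional for impact search
    'vector': {'car': 54, 'rental': 87, 'hire': 23, 'cost': 41}
}
print(json.dumps(record))
```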
Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 15 minutes.
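As a quick sanity check, we can inspect the finished index with Pyserini's `IndexReader`; a sketch, noting that the exact keys returned by `stats()` may vary across Pyserini versions:

```python
from pyserini.index.lucene import IndexReader

# Open the index we just built and confirm the expected document count.
reader = IndexReader('indexes/lucene-index.msmarco-passage-unicoil/')
stats = reader.stats()
print(stats)
assert stats['documents'] == 8841823
```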
We can now run retrieval, using the `castorini/unicoil-msmarco-passage` model available on the Hugging Face model hub to encode the queries:
```bash
python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-passage-unicoil/ \
  --topics msmarco-passage-dev-subset \
  --encoder castorini/unicoil-msmarco-passage \
  --output runs/run.msmarco-passage.unicoil.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 \
  --impact
```
Here, we are using the transformer model to encode the queries on the fly, on the CPU. The important option here is `--impact`, which specifies impact scoring. With these impact scores, query evaluation is already slower than bag-of-words BM25; on top of that, we're adding neural inference on the CPU. A complete run typically takes around 30 minutes.
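The programmatic equivalent uses Pyserini's `LuceneImpactSearcher`, which wraps both on-the-fly query encoding and impact-based scoring. A minimal sketch (the example query is one of the MS MARCO dev queries):

```python
from pyserini.search.lucene import LuceneImpactSearcher

# Point the searcher at the impact index and name the query encoder;
# the encoder runs on the CPU by default.
searcher = LuceneImpactSearcher('indexes/lucene-index.msmarco-passage-unicoil/',
                                'castorini/unicoil-msmarco-passage')

hits = searcher.search("what is paula deen's brother", k=10)
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:7} {hit.score:.5f}')
```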
The output is in MS MARCO output format, so we can directly evaluate:
```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil.tsv

#####################
MRR @10: 0.3508734138354477
QueriesRanked: 6980
#####################
```
There might be small differences in score due to non-determinism in neural inference; see these notes for details. The above score was obtained on Linux.
Alternatively, we can use pre-tokenized queries with pre-computed weights, which are already included in Pyserini. We can run retrieval as follows:
```bash
python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-passage-unicoil/ \
  --topics msmarco-passage-dev-subset-unicoil \
  --output runs/run.msmarco-passage.unicoil.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 \
  --impact
```
Here, we also specify `--impact` for impact scoring. Since we're not applying neural inference over the queries, retrieval is faster, typically taking less than 10 minutes.
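To see what the pre-encoded queries look like, we can load the topics through Pyserini's `get_topics`; a sketch, assuming the topic key mirrors the `--topics` name above (the stored representation is the query already expanded into weighted uniCOIL tokens):

```python
from pyserini.search import get_topics

# Load the pre-encoded dev queries; no neural inference happens here,
# since the uniCOIL expansion and term weighting were computed offline.
topics = get_topics('msmarco-passage-dev-subset-unicoil')
qid = next(iter(topics))
print(qid, topics[qid])
```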
The output is in MS MARCO output format, so we can directly evaluate:
```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil.tsv

#####################
MRR @10: 0.35155222404147896
QueriesRanked: 6980
#####################
```
Note that in this case, the results should be deterministic.
## Document Ranking

To reproduce these runs directly from our pre-built indexes, see our two-click reproduction matrix for MS MARCO V1 doc. The document ranking experiments here correspond to row (3b) for pre-encoded queries and a corresponding condition for on-the-fly query inference (although see the note below for more details).
We're going to use the Pyserini repository's root directory as the working directory. First, we need to download and unpack the corpus:
```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
```
To confirm, `msmarco-doc-segmented-unicoil.tar` is 19 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
We can now index these docs:
```bash
python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco-doc-segmented-unicoil/ \
  --index indexes/lucene-index.msmarco-doc-segmented-unicoil/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized
```
The important indexing options to note here are `--impact --pretokenized`: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (which it otherwise does by default), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around an hour.
We can now run retrieval:
```bash
python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-doc-segmented-unicoil \
  --topics msmarco-doc-dev \
  --encoder castorini/unicoil-msmarco-passage \
  --output runs/run.msmarco-doc-segmented-unicoil.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 --max-passage --max-passage-hits 100 \
  --impact
```
Here, we are using the transformer model to encode the queries on the fly, on the CPU. The important option here is `--impact`, which specifies impact scoring. With these impact scores, query evaluation is already slower than bag-of-words BM25; on top of that, we're adding neural inference on the CPU. A complete run can take around 40 minutes.
The output is in MS MARCO output format, so we can directly evaluate:
```bash
$ python -m pyserini.eval.msmarco_doc_eval --judgments msmarco-doc-dev \
    --run runs/run.msmarco-doc-segmented-unicoil.tsv

#####################
MRR @100: 0.3530641289682811
QueriesRanked: 5193
#####################
```
There might be small differences in score due to non-determinism in neural inference; see these notes for details. The above score was obtained on Linux.
Alternatively, we can use pre-tokenized queries with pre-computed weights, which are already included in Pyserini. We can run retrieval as follows:
```bash
python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-doc-segmented-unicoil \
  --topics msmarco-doc-dev-unicoil \
  --output runs/run.msmarco-doc-segmented-unicoil.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 --max-passage --max-passage-hits 100 \
  --impact
```
Here, we also specify `--impact` for impact scoring. Since we're not applying neural inference over the queries, retrieval is faster, typically taking less than 10 minutes.
The output is in MS MARCO output format, so we can directly evaluate:
```bash
$ python -m pyserini.eval.msmarco_doc_eval --judgments msmarco-doc-dev \
    --run runs/run.msmarco-doc-segmented-unicoil.tsv

#####################
MRR @100: 0.352997702662614
QueriesRanked: 5193
#####################
```
Note that in this case, the results should be deterministic.
A final detail: with MaxP and the need to generate runs to different depths, we can set `--hits` and `--max-passage-hits` differently. Due to tie-breaking effects, different settings yield slightly different results; see the Anserini experiments for additional details.
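For intuition, the MaxP aggregation behind `--max-passage` boils down to the following simplified sketch; it ignores Pyserini's exact tie-breaking rules, which is precisely where the small score differences come from:

```python
from collections import defaultdict

def maxp(segment_hits, k=100):
    # segment_hits: list of (segment_id, score) pairs, where segment ids in
    # the segmented corpus look like 'D1555982#3' (docid, '#', segment index).
    best = defaultdict(float)
    for seg_id, score in segment_hits:
        docid = seg_id.split('#')[0]          # collapse segments into their document
        best[docid] = max(best[docid], score)  # MaxP: keep the best segment score
    # Keep the top-k documents; ties here may be broken differently than in
    # Pyserini itself, hence the small deltas noted above.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]
```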
Because of slightly different parameter settings, the results here do not exactly match the results in the two-click reproduction matrix for MS MARCO V1 doc.
## Reproduction Log*
- Results reproduced by @ArthurChen189 on 2021-07-13 (commit `228d5c9`)
- Results reproduced by @lintool on 2021-07-14 (commit `ed88e4c`)
- Results reproduced by @lintool on 2021-09-17 (commit `79eb5cf`)
- Results reproduced by @mayankanand007 on 2021-09-18 (commit `331dfe7`)
- Results reproduced by @apokali on 2021-09-23 (commit `82f8422`)
- Results reproduced by @yuki617 on 2022-02-08 (commit `e03e068`)
- Results reproduced by @lintool on 2022-06-01 (commit `b7bcf51`)