castorini · lintool · Dec 7, 2024 · Nov 27, 2024 · Nov 28, 2024 · Nov 28, 2024
diff --git a/README.md b/README.md
diff --git a/bin/run.sh b/bin/run.sh
@@ -1,3 +1,3 @@
 #!/bin/sh
 
-java -cp `ls target/*-fatjar.jar` -Xms512M -Xmx64G --add-modules jdk.incubator.vector $@
+java -cp `ls target/*-fatjar.jar` -Xms512M -Xmx192G --add-modules jdk.incubator.vector $@
diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.flat-int8.cached.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.flat-int8.cached.md
@@ -20,7 +20,7 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
 python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.flat-int8.cached
 ```
 
-We make available a version of the MS MARCO Passage Corpus that has already been encoded with cosDPR-distil.
+We make available a version of the MS MARCO Passage Corpus that has already been encoded by the BGE-base-en-v1.5 model.
 
 From any machine, the following command will download the corpus and perform the complete regression, end to end:
 

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.flat-int8.onnx.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.flat-int8.onnx.md
@@ -20,7 +20,7 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
 python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.flat-int8.onnx
 ```
 
-We make available a version of the MS MARCO Passage Corpus that has already been encoded with cosDPR-distil.
+We make available a version of the MS MARCO Passage Corpus that has already been encoded by the BGE-base-en-v1.5 model.
 
 From any machine, the following command will download the corpus and perform the complete regression, end to end:
 

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.flat.cached.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.flat.cached.md
@@ -20,7 +20,7 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
 python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.flat.cached
 ```
 
-We make available a version of the MS MARCO Passage Corpus that has already been encoded with cosDPR-distil.
+We make available a version of the MS MARCO Passage Corpus that has already been encoded by the BGE-base-en-v1.5 model.
 
 From any machine, the following command will download the corpus and perform the complete regression, end to end:
 

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.flat.onnx.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.flat.onnx.md
@@ -20,7 +20,7 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
 python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.flat.onnx
 ```
 
-We make available a version of the MS MARCO Passage Corpus that has already been encoded with cosDPR-distil.
+We make available a version of the MS MARCO Passage Corpus that has already been encoded by the BGE-base-en-v1.5 model.
 
 From any machine, the following command will download the corpus and perform the complete regression, end to end:
 

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw-int8.cached.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw-int8.cached.md
@@ -20,7 +20,7 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
 python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.hnsw-int8.cached
 ```
 
-We make available a version of the MS MARCO Passage Corpus that has already been encoded with cosDPR-distil.
+We make available a version of the MS MARCO Passage Corpus that has already been encoded by the BGE-base-en-v1.5 model.
 
 From any machine, the following command will download the corpus and perform the complete regression, end to end:
 
@@ -65,9 +65,6 @@ bin/run.sh io.anserini.index.IndexHnswDenseVectors \
 The path `/path/to/msmarco-passage-bge-base-en-v1.5/` should point to the corpus downloaded above.
 Upon completion, we should have an index with 8,841,823 documents.
 
-Furthermore, we are using Lucene's [Automatic Byte Quantization](https://www.elastic.co/search-labs/blog/articles/scalar-quantization-in-lucene) feature, which increase the on-disk footprint of the indexes since we're storing both the int8 quantized vectors and the float32 vectors, but only the int8 quantized vectors need to be loaded into memory.
-See [issue #2292](https://github.com/castorini/anserini/issues/2292) for some experiments reporting the performance impact.
-
 ## Retrieval
 
 Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
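
Reviewer note on the quantization paragraph removed in the hunk above: a rough back-of-the-envelope sketch of the raw vector sizes it describes. It assumes 768-dimensional embeddings (the output width of BGE-base-en-v1.5, which this diff does not state) and counts only the raw vector payload, ignoring the HNSW graph and any other index overhead, so real index sizes will differ.

```python
# Rough size estimate for the encoded MS MARCO passage vectors.
NUM_PASSAGES = 8_841_823  # document count reported in the indexing section
DIMS = 768                # assumption: BGE-base-en-v1.5 embedding width
GIB = 1024 ** 3


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int) -> int:
    """Bytes needed to store the raw vectors, with no index overhead."""
    return num_vectors * dims * bytes_per_value


float32_bytes = raw_vector_bytes(NUM_PASSAGES, DIMS, 4)  # original vectors
int8_bytes = raw_vector_bytes(NUM_PASSAGES, DIMS, 1)     # quantized copies

# Per the removed note, both copies are kept on disk,
# but only the int8 copy needs to be loaded into memory.
on_disk_bytes = float32_bytes + int8_bytes

print(f"float32 vectors: {float32_bytes / GIB:.1f} GiB")
print(f"int8 vectors:    {int8_bytes / GIB:.1f} GiB")
print(f"both on disk:    {on_disk_bytes / GIB:.1f} GiB")
```

Under these assumptions the float32 vectors come to roughly 25.3 GiB and the int8 copies to roughly 6.3 GiB, which is the trade-off the removed paragraph describes: a larger on-disk footprint in exchange for a smaller in-memory one.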

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw-int8.onnx.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw-int8.onnx.md
@@ -20,7 +20,7 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
 python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.hnsw-int8.onnx
 ```
 
-We make available a version of the MS MARCO Passage Corpus that has already been encoded with cosDPR-distil.
+We make available a version of the MS MARCO Passage Corpus that has already been encoded by the BGE-base-en-v1.5 model.
 
 From any machine, the following command will download the corpus and perform the complete regression, end to end:
 
@@ -65,9 +65,6 @@ bin/run.sh io.anserini.index.IndexHnswDenseVectors \
 The path `/path/to/msmarco-passage-bge-base-en-v1.5/` should point to the corpus downloaded above.
 Upon completion, we should have an index with 8,841,823 documents.
 
-Furthermore, we are using Lucene's [Automatic Byte Quantization](https://www.elastic.co/search-labs/blog/articles/scalar-quantization-in-lucene) feature, which increase the on-disk footprint of the indexes since we're storing both the int8 quantized vectors and the float32 vectors, but only the int8 quantized vectors need to be loaded into memory.
-See [issue #2292](https://github.com/castorini/anserini/issues/2292) for some experiments reporting the performance impact.
-
 ## Retrieval
 
 Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw.cached.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw.cached.md
@@ -20,7 +20,7 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
 python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.hnsw.cached
 ```
 
-We make available a version of the MS MARCO Passage Corpus that has already been encoded with cosDPR-distil.
+We make available a version of the MS MARCO Passage Corpus that has already been encoded by the BGE-base-en-v1.5 model.
 
 From any machine, the following command will download the corpus and perform the complete regression, end to end:
 
@@ -65,7 +65,6 @@ bin/run.sh io.anserini.index.IndexHnswDenseVectors \
 The path `/path/to/msmarco-passage-bge-base-en-v1.5/` should point to the corpus downloaded above.
 Upon completion, we should have an index with 8,841,823 documents.
 
-
 ## Retrieval
 
 Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.

diff --git a/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw.onnx.md b/docs/regressions/regressions-dl19-passage.bge-base-en-v1.5.hnsw.onnx.md
@@ -20,7 +20,7 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
 python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.hnsw.onnx
 ```
 
-We make available a version of the MS MARCO Passage Corpus that has already been encoded with cosDPR-distil.
+We make available a version of the MS MARCO Passage Corpus that has already been encoded by the BGE-base-en-v1.5 model.
 
 From any machine, the following command will download the corpus and perform the complete regression, end to end:
 
@@ -65,7 +65,6 @@ bin/run.sh io.anserini.index.IndexHnswDenseVectors \
 The path `/path/to/msmarco-passage-bge-base-en-v1.5/` should point to the corpus downloaded above.
 Upon completion, we should have an index with 8,841,823 documents.
 
-
 ## Retrieval
 
 Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.

diff --git a/...gressions/regressions-dl19-passage.bge-base-en-v1.5.parquet.flat-int8.cached.md b/...gressions/regressions-dl19-passage.bge-base-en-v1.5.parquet.flat-int8.cached.md
@@ -0,0 +1,117 @@
+# Anserini Regressions: TREC 2019 Deep Learning Track (Passage)
+
+**Model**: [BGE-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) with quantized flat indexes (using cached queries)
+
+This page describes regression experiments, integrated into Anserini's regression testing framework, using the [BGE-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) model on the [TREC 2019 Deep Learning Track passage ranking task](https://trec.nist.gov/data/deep2019.html), as described in the following paper:
+
+> Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. [C-Pack: Packaged Resources To Advance General Chinese Embedding.](https://arxiv.org/abs/2309.07597) _arXiv:2309.07597_, 2023.
+
+In these experiments, we are using cached queries (i.e., cached results of query encoding).
+
+Note that the NIST relevance judgments provide far more relevant passages per topic, unlike the "sparse" judgments provided by Microsoft (these are sometimes called "dense" judgments to emphasize this contrast).
+For additional instructions on working with the MS MARCO passage collection, refer to [this page](experiments-msmarco-passage.md).
+
+The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/dl19-passage.bge-base-en-v1.5.parquet.flat-int8.cached.yaml).
+Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/dl19-passage.bge-base-en-v1.5.parquet.flat-int8.cached.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead and then run `bin/build.sh` to rebuild the documentation.
+
+From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:
+
+```bash
+python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.parquet.flat-int8.cached
+```
+
+We make available a version of the MS MARCO Passage Corpus that has already been encoded by the BGE-base-en-v1.5 model.
+
+From any machine, the following command will download the corpus and perform the complete regression, end to end:
+
+```bash
+python src/main/python/run_regression.py --download --index --verify --search --regression dl19-passage.bge-base-en-v1.5.parquet.flat-int8.cached
+```
+
+The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.
+
+## Corpus Download
+
+Download the corpus and unpack into `collections/`:
+
+```bash
+wget https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-bge-base-en-v1.5.parquet.tar -P collections/
+tar xvf collections/msmarco-passage-bge-base-en-v1.5.parquet.tar -C collections/
+```
+
+To confirm, `msmarco-passage-bge-base-en-v1.5.parquet.tar` is 39 GB and has MD5 checksum `b235e19ec492c18a18057b30b8b23fd4`.
+With the corpus downloaded, the following command will perform the remaining steps below:
+
+```bash
+python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.bge-base-en-v1.5.parquet.flat-int8.cached \
+  --corpus-path collections/msmarco-passage-bge-base-en-v1.5.parquet
+```
+
+## Indexing
+
+Sample indexing command, building quantized flat indexes:
+
+```bash
+bin/run.sh io.anserini.index.IndexFlatDenseVectors \
+  -threads 16 \
+  -collection ParquetDenseVectorCollection \
+  -input /path/to/msmarco-passage-bge-base-en-v1.5.parquet \
+  -generator ParquetDenseVectorDocumentGenerator \
+  -index indexes/lucene-flat-int8.msmarco-v1-passage.bge-base-en-v1.5/ \
+  -quantize.int8 \
+  >& logs/log.msmarco-passage-bge-base-en-v1.5.parquet &
+```
+
+The path `/path/to/msmarco-passage-bge-base-en-v1.5.parquet/` should point to the corpus downloaded above.
+Upon completion, we should have an index with 8,841,823 documents.
+
+## Retrieval
+
+Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
+The regression experiments here evaluate on the 43 topics for which NIST has provided judgments as part of the TREC 2019 Deep Learning Track.
+The original data can be found [here](https://trec.nist.gov/data/deep2019.html).
+
+After indexing has completed, you should be able to perform retrieval as follows:
+
+```bash
+bin/run.sh io.anserini.search.SearchFlatDenseVectors \
+  -index indexes/lucene-flat-int8.msmarco-v1-passage.bge-base-en-v1.5/ \
+  -topics tools/topics-and-qrels/topics.dl19-passage.bge-base-en-v1.5.jsonl.gz \
+  -topicReader JsonIntVector \
+  -output runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-int8-cached.topics.dl19-passage.bge-base-en-v1.5.jsonl.txt \
+  -hits 1000 -threads 16 &
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```bash
+bin/trec_eval -m map -c -l 2 tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-int8-cached.topics.dl19-passage.bge-base-en-v1.5.jsonl.txt
+bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-int8-cached.topics.dl19-passage.bge-base-en-v1.5.jsonl.txt
+bin/trec_eval -m recall.100 -c -l 2 tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-int8-cached.topics.dl19-passage.bge-base-en-v1.5.jsonl.txt
+bin/trec_eval -m recall.1000 -c -l 2 tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-int8-cached.topics.dl19-passage.bge-base-en-v1.5.jsonl.txt
+```
+
+## Effectiveness
+
+With the above commands, you should be able to reproduce the following results:
+
+| **AP@1000**                                                                                                  | **BGE-base-en-v1.5**|
+|:-------------------------------------------------------------------------------------------------------------|-----------|
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2019.html)                                                   | 0.4435    |
+| **nDCG@10**                                                                                                  | **BGE-base-en-v1.5**|
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2019.html)                                                   | 0.7065    |
+| **R@100**                                                                                                    | **BGE-base-en-v1.5**|
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2019.html)                                                   | 0.6171    |
+| **R@1000**                                                                                                   | **BGE-base-en-v1.5**|
+| [DL19 (Passage)](https://trec.nist.gov/data/deep2019.html)                                                   | 0.8472    |
+
+The above figures are from running brute-force search with cached queries on non-quantized indexes.
+With cached queries on quantized indexes, results may differ slightly.
+
+❗ Retrieval metrics here are computed to depth 1000 hits per query (as opposed to 100 hits per query for document ranking).
+For computing nDCG, remember that we keep qrels of _all_ relevance grades, whereas for other metrics (e.g., AP), relevance grade 1 is considered not relevant (i.e., use the `-l 2` option in `trec_eval`).
+The experimental results reported here are directly comparable to the results reported in the [track overview paper](https://arxiv.org/abs/2003.07820).
+
+## Reproduction Log[*](reproducibility.md)
+
+To add to this reproduction log, modify [this template](../../src/main/resources/docgen/templates/dl19-passage.bge-base-en-v1.5.parquet.flat-int8.cached.template) and run `bin/build.sh` to rebuild the documentation.
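
Reviewer note on the Corpus Download section of the new page: the text gives an MD5 checksum for the tarball but no command to check it. Below is a minimal sketch of one way to verify the download, using only Python's standard library. The function names `md5_of` and `verify` are hypothetical helpers for illustration and are not part of Anserini; the path assumes the tarball was saved into `collections/` as in the `wget` command.

```python
import hashlib
import os

# Checksum quoted in the Corpus Download section of the new page.
EXPECTED_MD5 = "b235e19ec492c18a18057b30b8b23fd4"
# Assumed location, matching the `wget ... -P collections/` command.
TARBALL = "collections/msmarco-passage-bge-base-en-v1.5.parquet.tar"


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so the 39 GB tarball never sits in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_MD5) -> bool:
    """True if the file's MD5 digest matches the expected checksum."""
    return md5_of(path) == expected


if __name__ == "__main__":
    if os.path.exists(TARBALL):
        print("checksum OK" if verify(TARBALL) else "checksum MISMATCH")
    else:
        print(f"{TARBALL} not found; download the corpus first")
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for a tarball this large.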